<a href="https://colab.research.google.com/github/arman76/hello-world/blob/master/NLP981_Phase1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP981 Final Project - Phase #1

*   Instructor: Javad PourMostafa
*   Teaching Assistant: Parsa Abbasi
*   University of Guilan, 1st semester of 2019
*   GitHub repository : *https://github.com/JoyeBright/NLP*

In [0]:
!pip install stopwords_guilannlp

In [0]:
import pandas as pd
from itertools import chain
from collections import Counter
import re
import math
import stopwords_guilannlp
import numpy as np

STOP_WORDS = stopwords_guilannlp.stopwords_output('Persian', 'list')


This is the first phase of your final project for the *NLP981* course. The main idea behind this phase is to showcase the development side of *NLP*.

You must write your code inside this Python notebook. I highly recommend using the *Google Colab* environment.

If you have any questions, feel free to ask.
You can use the [*Quera*](https://quera.ir/course/4385/) platform for your general questions.



## Introduction

You will build a category predictor in this phase of the project.

The predictor takes a text as input and predicts a category for it.

For this purpose, you need to :

1.   Load the dataset
2.   Preprocess the text data
3.   Implement a word representation method to represent each text as a numeric vector
4.   Implement a classification model and train it on the training set
5.   Predict a category for each validation sample using the implemented model
6.   Evaluate your work using a confusion matrix and some common metrics

**Important Note:** You can use any library you want in sections 1 and 2, but everything in sections 3-6 must be implemented from scratch.



## 1) Dataset

The dataset you will use in this phase is called *Divar*, which was released by the *CafeBazaar* research team.

It contains more than 900,000 posts from the *Divar* ads platform. We split this dataset into training, validation, and testing sets.

The testing set is not accessible to you; we will use it to evaluate your work on the presentation day.

You can download the dataset files (training and validation sets) directly from the following link :

> *https://drive.google.com/open?id=1oj-fqpymjDr8QsOK-zQliiqXbVqakrFo*


### 1.1) Import

In [0]:
# Import the training and validation sets here
# Only the first-level category (cat1) and the title are kept; every other column is dropped.
UNUSED_COLUMNS = ['Unnamed: 0', 'Unnamed: 0.1', 'archive_by_user',
                  'brand', 'cat2', 'cat3', 'city', 'created_at', 'id',
                  'image_count', 'mileage', 'platform', 'price', 'type',
                  'desc', 'year']

train_set = pd.read_csv('./drive/My Drive/trainset.csv').drop(columns=UNUSED_COLUMNS)
valid_set = pd.read_csv('./drive/My Drive/validationset.csv').drop(columns=UNUSED_COLUMNS)


In [50]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 1.2) Analyzing

Display the top 10 rows of the train set.

In [74]:
print(train_set.head(n=10))

                 cat1                                     title
0            personal                                   سایز 40
1  electronic-devices                      گوشی سامسونگ a3 2016
2            personal                     ساعت زنانه اسپریت اصل
3            personal  فروش یک عدد دوچرخه مارک پرادو بسیار سالم
4     leisure-hobbies                  14 اسکناس کلکسیونی code1
5     leisure-hobbies                       فوتبال دستی حرفه ای
6        for-the-home                                     لوستر
7            vehicles                           رینگ اهنی ال 90
8            vehicles                        نیسان سرانزا مدل۸۳
9            vehicles                                پراید سفید


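How many rows are stored in the training and validation sets?

In [0]:
train_len = len(train_set)
valid_len = len(valid_set)
print('train rows:', train_len, '| validation rows:', valid_len)
# From here on, work with a subset: the first 3,500 training and 500 validation rows.
train_set = train_set[:3500]
valid_set = valid_set[:500]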

How many posts are in each category (First level categories)? (Based on training set)

In [76]:
cats_counts = Counter(train_set.cat1)
print(cats_counts)

Counter({'for-the-home': 1057, 'vehicles': 767, 'electronic-devices': 606, 'personal': 504, 'leisure-hobbies': 343, 'businesses': 223})


## 2) Preprocessing

There are two kinds of text data in the dataset: *Title* and *Description*.
You can use one or both of them as text inputs for your classification model. Choose the combination that gives you the best evaluation score.

You need to apply some preprocessing procedures to your text data first. We want at least **4** preprocessing steps from you. These can include removing stop words, removing punctuation, removing or replacing digits, stemming, lemmatizing, normalization, and so on.

You can use the [*Stopwords Guilan NLP*](https://github.com/JoyeBright/stopwords_guilannlp) library to access a collection of Persian stop words.
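
As one illustration of the "removing or replacing digits" step, the following is a minimal sketch that maps Persian and Arabic-Indic digits to ASCII and then replaces every digit run with a placeholder. The function name `normalize_digits` and the `NUM` placeholder are assumptions made for this example, not part of the assignment.

```python
import re

# Persian digits are U+06F0..U+06F9 and Arabic-Indic digits are U+0660..U+0669.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
ARABIC_DIGITS = "".join(chr(0x0660 + i) for i in range(10))
DIGIT_TABLE = str.maketrans(PERSIAN_DIGITS + ARABIC_DIGITS, "0123456789" * 2)


def normalize_digits(text, placeholder="NUM"):
    """Map all digit variants to ASCII, then replace each digit run with a placeholder."""
    text = text.translate(DIGIT_TABLE)
    text = re.sub(r"[0-9]+", f" {placeholder} ", text)
    # Collapse the extra whitespace introduced by the substitution.
    return re.sub(r"\s+", " ", text).strip()


example = normalize_digits("a3 2016")
```

Whether to keep a placeholder or drop digits entirely is a design choice worth testing against your validation score.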

In [0]:
def preprocessing(text: str):
    text = text.replace('\u200c', ' ')
    text = text.replace('\t', ' ')
    text = text.replace('$NUM', ' ')
    text = re.sub(r'[/(){}\[\]|@,;]', ' ', text)
    text = re.sub(re.compile('[✔️۰۱۲۳⚡۴۵\-۶۷=۸❌۹؛0%-9*.⤵٢٠١٥❤?&٩$#+!٦٨"،٧٤؟●٣]'), ' ', text)
    text = re.sub(r'\s+', ' ', text)
    cleared_text = ' '.join([i for i in text.split() if i not in STOP_WORDS])

    return cleared_text


In [0]:
train_set['title'] = [preprocessing(t) for t in train_set.title]
valid_set['title'] = [preprocessing(t) for t in valid_set.title]

## 3) Word Representation

As you know, classification models can't deal with strings directly, and you have to represent your texts in a numerical form.

### 3.1) Tf-idf

You have to implement the tf-idf vectorization method from scratch in this step. 

Furthermore, you must implement a function that takes a text input and returns its tf-idf vectorized representation.

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

*tf* (term-frequency) is the count of occurrences of the word `t` in a specific text `d`.

*idf* (inverse document-frequency) is a term that is inversely proportional to the number of texts containing the given word. It can be calculated this way:
$$\text{idf}(t) = \log\frac{1 + n_d}{1 + n_{d(t)}} + 1$$
where $n_d$ is the total number of texts and $n_{d(t)}$ is the number of texts that contain the word `t`.
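
To make the formula concrete, here is a minimal sketch on a hypothetical three-text corpus. The names `toy_docs`, `smoothed_idf`, and `tf_idf_vector` are illustrative only, the natural logarithm is assumed, and tf is taken as the raw count from the definition above.

```python
import math
from collections import Counter

# Hypothetical toy corpus; each string stands for one already-preprocessed text.
toy_docs = ["red car", "red sofa", "blue car car"]


def smoothed_idf(term, docs):
    """idf(t) = log((1 + n_d) / (1 + n_d(t))) + 1, using the natural logarithm."""
    n_d = len(docs)
    n_dt = sum(1 for d in docs if term in d.split())
    return math.log((1 + n_d) / (1 + n_dt)) + 1


def tf_idf_vector(text, docs):
    """Map each word of `text` to (raw count in text) * idf."""
    counts = Counter(text.split())
    return {word: count * smoothed_idf(word, docs) for word, count in counts.items()}


vec = tf_idf_vector("blue car car", toy_docs)
```

Here `car` appears in 2 of the 3 texts, so its idf is log(4/3) + 1 ≈ 1.288, while `blue` appears in only one text and gets the larger idf log(4/2) + 1 ≈ 1.693. Rarer words therefore weigh more per occurrence.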

In [0]:
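def doc_dict(BofW):
    # Count how often each vocabulary word occurs in the text.
    # Words outside the fitted vocabulary are ignored.
    d = dict.fromkeys(unique_Words, 0)
    for word in BofW:
        if word in d:
            d[word] += 1
    return d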


def tf_computation(document, bag_of_words):
    tf_doc = {}
    bow_count = len(bag_of_words)
    # print(bow_count)
    for w, count in document.items():
        tf_doc[w] = float(count / bow_count)
    return tf_doc


def idf_computation(docs):
    n = len(docs)
    idf_dict = dict.fromkeys(docs[0].keys(), 0)
    for document in docs:
        for w, val in document.items():
            if val > 0:
                idf_dict[w] += 1
    for w, val in idf_dict.items():
        idf_dict[w] = math.log((n + 1)/float(val + 1)) + 1
    return idf_dict

def tf_idf_computation(tf, idfs):
    tf_idf = {}
    for w, val in tf.items():
        tf_idf[w] = val * idfs[w]
    return tf_idf

def tf_idf(text):
    text_BoW = text.split(' ')

    doc = doc_dict(text_BoW)
    tf_doc = tf_computation(doc, text_BoW)
    vector = tf_idf_computation(tf_doc, idf_s)
    return vector


In [0]:
unique_Words = dict()
idf_s = dict()
n = 0
def fit_transform(X):
    global unique_Words
    global idf_s
    global n
    BoW = []
    unique_Words = dict()
    for row in X:
        BoW.append(row.split(' '))
        unique_Words = set(unique_Words).union(set(BoW[-1]))

    unique_Words = sorted(unique_Words)
    print('length of unique words: ' + str(len(unique_Words)))

    Docs = [doc_dict(x) for x in BoW]
    n = len(Docs)
    idf_s = idf_computation(Docs)

    return pd.DataFrame([tf_idf(x) for x in X])

def transform(X):
    global n
    global idf_s

    # Words unseen during fitting get the idf of a word with document frequency 0.
    for row in X:
        for word in row.split(' '):
            if word not in idf_s:
                idf_s[word] = math.log((n + 1)/float(0 + 1)) + 1

    return pd.DataFrame([tf_idf(x) for x in X])

def labels(x):
    if x == 'vehicles' or x == 0:
        return 0
    elif x == 'for-the-home' or x == 1:
        return 1
    elif x == 'electronic-devices' or x == 2:
        return 2
    elif x == 'leisure-hobbies' or x == 3:
        return 3
    elif x == 'businesses' or x == 4:
        return 4
    else:
        return 5

In [173]:
tf_idf_docs = fit_transform(train_set.title)
tf_idf2 = transform(valid_set.title)

length of unique words: 3038


## 4) Classification

![alt text](https://cdn.lynda.com/course/578082/578082-637075371482276339-16x9.jpg)

### 4.1) Logistic Regression

The Logistic Regression classifier must be implemented from scratch here.

You can fit the training data into the classifier after implementing logistic regression.
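
For reference, a sketch of the standard binary cross-entropy objective that each one-vs-rest logistic regression classifier minimizes. The notation ($\theta$ for the weights, $x^{(i)}$ and $y^{(i)} \in \{0, 1\}$ for the $i$-th sample and its binary label, $m$ for the number of samples, $\alpha$ for the learning rate) is assumed here and is not fixed by the assignment.

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^\top x}}$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$

$$\nabla_\theta J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

$$\theta \leftarrow \theta - \alpha\,\nabla_\theta J(\theta)$$

For a multi-class problem, one such classifier is trained per class, and the class whose classifier outputs the highest $h_\theta(x)$ is predicted.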

In [0]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


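def cost(theta, x, y):
    h = sigmoid(x @ theta)
    m = len(y)
    # Binary cross-entropy averaged over the m samples.
    cost = 1 / m * np.sum(
        -y * np.log(h) - (1 - y) * np.log(1 - h)
    )
    # Note: this is the negative gradient of the cost (the descent direction),
    # which is why `fit` adds it to theta instead of subtracting it.
    grad = 1 / m * ((y - h) @ x)
    return cost, grad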


def fit(x, y, max_iter=5000, alpha=0.1):
    x = np.array(x)
    x = np.insert(x, 0, 1, axis=1)  # prepend a bias column of ones
    y = np.asarray(y)
    thetas = []
    classes = np.unique(y)
    costs = np.zeros(max_iter)  # reused per class; holds the last class's costs on return

    # One-vs-rest: train one binary classifier per class.
    for c in classes:
        binary_y = np.where(y == c, 1, 0)

        theta = np.zeros(x.shape[1])
        for epoch in range(max_iter):
            costs[epoch], grad = cost(theta, x, binary_y)
            theta += alpha * grad

        thetas.append(theta)
    return thetas, classes, costs


def predict(classes, thetas, x):
    x = np.array(x)
    x = np.insert(x, 0, 1, axis=1)
    preds = [np.argmax(
        [sigmoid(xi @ theta) for theta in thetas]
    ) for xi in x]
    return [classes[p] for p in preds]


In [182]:
y = [labels(x) for x in train_set.cat1]
valid_y = [labels(x) for x in valid_set.cat1]
print(y[:20])
thetas, classes, costs = fit(tf_idf_docs, y)


[5, 2, 5, 5, 3, 3, 1, 0, 0, 0, 3, 2, 4, 1, 2, 3, 5, 1, 2, 3]


## 5) Prediction

Now you can predict a category for each of the validation data using the implemented classifier.

In [0]:
def ilabels(x):
    if x == 'vehicles' or x == 0:
        return 'vehicles'
    elif x == 'for-the-home' or x == 1:
        return 'for-the-home'
    elif x == 'electronic-devices' or x == 2:
        return 'electronic-devices'
    elif x == 'leisure-hobbies' or x == 3:
        return 'leisure-hobbies'
    elif x == 'businesses' or x == 4:
        return 'businesses'
    else:
        return 'personal'

In [186]:
y_p_train = predict(classes, thetas, tf_idf_docs)
print('valid set: ')
yp = predict(classes, thetas, tf_idf2)
print(yp)
a = ['هندزفری', 'مبل', 'کفش', 'گوشی الجی']
ps = predict(classes, thetas, transform(a))
for i in range(len(a)):
    print(a[i], '\t\t:', ilabels(ps[i]))

valid set: 
[0, 0, 0, 1, 5, 4, 5, 2, 1, 2, 0, 0, 1, 1, 5, 0, 1, 1, 0, 1, 3, 1, 3, 1, 1, 3, 4, 3, 0, 1, 5, 0, 0, 2, 5, 2, 0, 1, 0, 1, 3, 1, 2, 2, 4, 5, 5, 4, 5, 3, 0, 1, 1, 1, 5, 0, 1, 1, 0, 1, 1, 1, 2, 0, 1, 5, 5, 2, 2, 0, 2, 1, 1, 5, 1, 0, 5, 2, 1, 1, 1, 2, 5, 5, 1, 5, 4, 3, 0, 1, 1, 3, 0, 1, 1, 2, 1, 0, 1, 5, 0, 0, 2, 5, 5, 1, 0, 4, 1, 2, 2, 1, 5, 1, 0, 0, 1, 0, 1, 2, 1, 1, 2, 5, 1, 2, 2, 4, 2, 3, 1, 5, 1, 1, 2, 2, 1, 2, 5, 0, 1, 1, 1, 3, 1, 1, 2, 4, 0, 1, 3, 2, 5, 5, 2, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 3, 2, 1, 1, 1, 0, 1, 1, 1, 1, 5, 0, 2, 2, 1, 1, 1, 5, 1, 0, 1, 1, 0, 1, 5, 1, 5, 1, 1, 1, 2, 4, 3, 1, 0, 3, 5, 1, 1, 0, 0, 1, 5, 5, 1, 1, 1, 0, 0, 1, 1, 2, 0, 1, 1, 1, 0, 1, 0, 0, 1, 5, 0, 2, 0, 1, 0, 1, 5, 2, 0, 5, 2, 1, 1, 3, 2, 1, 1, 0, 2, 1, 1, 1, 0, 0, 5, 4, 1, 1, 5, 1, 3, 1, 1, 1, 0, 2, 2, 1, 0, 1, 2, 5, 2, 0, 5, 2, 1, 5, 0, 2, 1, 5, 1, 3, 5, 5, 2, 1, 2, 1, 5, 1, 5, 3, 1, 0, 3, 0, 0, 1, 1, 0, 0, 0, 1, 1, 2, 1, 1, 2, 1, 5, 1, 0, 0, 1, 5, 1, 2, 4, 0, 3, 0, 5, 1, 0, 0, 1, 0, 2, 0, 3, 

## 6) Evaluation

It's time to evaluate your model using predicted categories for validation data.

You need to create a confusion matrix based on your predictions and the real labels. Then you can use this confusion matrix to calculate the other evaluation metrics.

As this is a multi-class problem, the metric formulas differ slightly from the binary case. Read [this article](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2) for more information.
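
As a rough illustration of the multi-class case, the sketch below computes per-class precision and recall and their macro averages from a hypothetical 3-class confusion matrix. It assumes rows index predicted classes and columns index true classes, the same orientation the `confusion_matrix` function in section 6.1 produces. The names `toy_cm`, `per_class_metrics`, and `macro_average` are made up for this example.

```python
# Hypothetical confusion matrix: rows = predicted class, columns = true class.
toy_cm = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
]


def per_class_metrics(cm):
    """Return (precisions, recalls), each a list with one entry per class."""
    k = len(cm)
    precisions, recalls = [], []
    for c in range(k):
        tp = cm[c][c]
        predicted_as_c = sum(cm[c])                   # TP + FP (row sum)
        actually_c = sum(cm[r][c] for r in range(k))  # TP + FN (column sum)
        precisions.append(tp / predicted_as_c if predicted_as_c else 0.0)
        recalls.append(tp / actually_c if actually_c else 0.0)
    return precisions, recalls


def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


toy_precisions, toy_recalls = per_class_metrics(toy_cm)
toy_accuracy = sum(toy_cm[c][c] for c in range(len(toy_cm))) / sum(map(sum, toy_cm))
```

Macro averaging treats every class equally regardless of its size, so a weak minority class pulls the average down more than it would under accuracy.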

### 6.1) Confusion matrix

In [187]:
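def confusion_matrix(y_true, y_pred):
    # Rows index predicted classes and columns index true classes.
    # Sorting fixes the class order so row/column i always refers to the same class.
    classes = sorted(set(y_pred + y_true))
    m = []
    for a in classes:
        m.append([])
        for b in classes:
            m[-1].append(sum([1 for i in range(len(y_true)) if y_true[i] == b and y_pred[i] == a]))
    return np.matrix(m)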

cm_valid = confusion_matrix(valid_y, yp)
cm = confusion_matrix(y, y_p_train)
print(cm)
print('valid set:')
print(cm_valid)

[[ 745    5    7    2    1    4]
 [  10 1034   10   11   56   19]
 [   5    4  587    3    6    3]
 [   1    1    0  324    1    0]
 [   1    5    2    1  151    2]
 [   5    8    0    2    8  476]]


### 6.2) Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

In [194]:
accuracy = lambda cm: sum([cm[i, i] for i in range(len(cm))]) / cm.sum()
print('train set:', accuracy(cm))
print('valid set:', accuracy(cm_valid))

train set: 0.9477142857142857
valid set: 0.816


### 6.3) Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

In [196]:
print('train:')
precisions = lambda cm: [cm[i, i] / int(cm.sum(1)[i]) for i in range(len(cm))]
print(precisions(cm))
precision = lambda cm: sum(precisions(cm)) / len(precisions(cm))
print(precision(cm))

print('valid:')
print(precision(cm_valid))


train:
[0.975130890052356, 0.9070175438596492, 0.9654605263157895, 0.9908256880733946, 0.9320987654320988, 0.9539078156312625]
0.954073538227425
valid:
0.8189172914568089


### 6.4) Recall

$$\text{Recall} = \frac{TP}{TP + FN}$$

In [198]:
print('train:')
recalls = lambda cm: [cm[i, i] / int(cm.sum(0)[0,i]) for i in range(len(cm))]
recall = lambda cm: sum(recalls(cm)) / len(recalls(cm))
print(recalls(cm))
print(recall(cm))

print('valid:')
print(recall(cm_valid))

train:
[0.9713168187744459, 0.978240302743614, 0.9686468646864687, 0.9446064139941691, 0.6771300448430493, 0.9444444444444444]
0.9140641482476987
valid:
0.7299482236968752


### 6.5) F1 score

$$\text{F1 score} = 2\times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

In [202]:
print('train:')
f1_scores = lambda cm: [2 * (recalls(cm)[i] * precisions(cm)[i]) / (recalls(cm)[i] + precisions(cm)[i]) for i in range(len(cm))]
print(f1_scores(cm))
f1_score = lambda cm: sum(f1_scores(cm)) / len(f1_scores(cm))  # or :   2 * (recall * precision) / (recall + precision)
print(f1_score(cm))

print('valid:')
print(f1_score(cm_valid))

train:
[0.9732201175702155, 0.9412835685025033, 0.9670510708401977, 0.9671641791044776, 0.7844155844155845, 0.9491525423728814]
0.93038117713431
valid:
0.7615377325462703


## 7) K-Fold Cross Validation *(Optional)*

Evaluate your model using the K-Fold Cross Validation approach. This step is optional and earns a few extra points.
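
One possible way to generate the folds is sketched below, using only the standard library. The function name `k_fold_indices` and its `seed` parameter are assumptions for illustration; this is a starting point under those assumptions, not the required solution.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is in the validation part exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(k_fold_indices(10, k=3))
```

For each pair, you would fit the vectorizer and classifier on the training indices only, evaluate on the validation indices, and then average the metric over the `k` folds.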

In [0]:
# Your implementation