![University of Tehran](./img/UT.png)
#   <font color='red'><center>AI CA 3<center></font> 
## <center>Dr. Fadaei<center>
### <center>Daniyal Maroufi<center>
### <center>810098039<center>


## Aim

This assignment aims to use Naive Bayes networks to build a classifier model to predict the category of an article from Digikala using its excerpt.


In [119]:
import math
from collections import defaultdict
import pandas as pd
import matplotlib.pyplot as plt
import re
from __future__ import unicode_literals
from hazm import *


# Load Data

First, we read the training and test data from csv files.

In [83]:
train_df=pd.read_csv('./Data/train.csv')
test_df=pd.read_csv('./Data/test.csv')
train_df

Unnamed: 0,content,label
0,فیلم‌های در حال اکران؛ موزیکال شاد خاله قورباغ...,هنر و سینما
1,پنج فیلمسازی که کوئنتین تارانتینو را عاشق سینم...,هنر و سینما
2,جانی آیو از اپل رفت جانی آیو دیگر نیازی به معر...,علم و تکنولوژی
3,احتمال عدم پشتیبانی iOS ۱۳ از آیفون ۵ اس، SE و...,علم و تکنولوژی
4,دزدان مغازه نماینده ژاپن در اسکار ۲۰۱۹ شد فیلم...,هنر و سینما
...,...,...
5195,امپراطوری اپ (فصل اول/بخش دوم) فصل اول – بخش د...,سلامت و زیبایی
5196,عدم ارتباطات اثربخش و تعارض در محیط کار وجود س...,سلامت و زیبایی
5197,اپل در سال ۲۰۲۰ چهار آیفون معرفی خواهد کرد! طب...,علم و تکنولوژی
5198,مارتینز: بلژیک باید مقابل فرانسه بدون ترس بازی...,سلامت و زیبایی


In [84]:
train_df.isnull().sum()

content    1
label      0
dtype: int64

In [85]:
test_df.isnull().sum()

content    0
label      0
dtype: int64

Because only one sample is null, we simply remove it from the training data.

In [86]:
train_df=train_df.dropna(how='any',axis=0) 

# Phase 1 - Data Preprocessing



In Natural Language Processing, it is agreed that using the root of the words is better for classification accuracy. The prefixes and postfixes of the words are not that necessary to be in sequence, and sometimes they may even inversely affect the accuracy because all forms of a word have the same meaning. For this purpose, there are two methods, Stemming and Lemmatization. Stemming removes the most common prefixes and postfixes of a word to find the root, while, Lemmatization uses an entire dictionary and finds the actual root of the words.

In [87]:
def clean_data(df):
    normalizer = Normalizer()
    lemmatizer = Lemmatizer()
    df['content']=df['content'].apply(lambda x: normalizer.normalize(x))
    df['content']=df['content'].apply(lambda x: word_tokenize(re.sub(r'[^\w\s]', '', x)))
    stp_words=set(stopwords_list())
    df['content']=df['content'].apply(lambda x: [lemmatizer.lemmatize(a) for a in x if a not in stp_words])
    return df

In [None]:
train_df=clean_data(train_df)
test_df=clean_data(test_df)


In [89]:
train_df

Unnamed: 0,content,label
0,"[فیلم, اکران, موزیکال, شاد, خاله, قورباغه, بزر...",هنر و سینما
1,"[فیلمسازی, کوئنتین, تارانتینو, عاشق, سینما, کم...",هنر و سینما
2,"[جان, آیو, اپل, جان, آیو, نیاز, معرف, تقریبا, ...",علم و تکنولوژی
3,"[احتمال, پشتیبان, iOS, ۱۳, آیفون, ۵, اس, SE, آ...",علم و تکنولوژی
4,"[دزد, مغازه, نماینده, ژاپن, اسکار, ۲۰۱۹, فیلم,...",هنر و سینما
...,...,...
5195,"[امپراطوری, اپ, فصل, اولبخش, فصل, دوماپ, گنجین...",سلامت و زیبایی
5196,"[ارتباطات, اثربخش, تعارض, محیط, کار, سازمان, و...",سلامت و زیبایی
5197,"[اپل, سال, ۲۰۲۰, آیفون, معرف, گزارش, JPMorgan,...",علم و تکنولوژی
5198,"[مارتینز, بلژیک, مقابل, فرانسه, ترس, بازی, سرم...",سلامت و زیبایی


# Phase 2 - Problem Procedure

In this assignment, we use the Bag of Words strategy. In this strategy, the position of the words in the sentence is not considered, and only the existence of the words is important. This assumption is not the best one as the order of the words in the sentence is essential too, but on our data, it is good enough to get good accuracy.



The basic formula of the Naive Bayes is shown as bellow:

![Naive Bayes](./img/NaiveBayes.jpg)

where *evidence, likelihood, prior, posterior probabilities, and predictor prior probability* in our problem are:

- The **evidence**(x) is the text input to the model, and the query is the category of the text
- The **posterior probability** is the probability of category(c) concerning given evidence(x). 
- The **likelihood** is the reverse of Posterior probability, the probability of the evidence(x) in category(c).
- The **prior probability** is category(c) probability among all categories.
- The **predictor prior probability** is the probability of the evidence(x) in a general text.

As predictor prior probability is the same for all classes, we only have to compare the nominator, and we can ignore the denominator.

## Mapping Categories to Numbers

To convert categorical columns to numerical, we simply use map() pandas method.

In [93]:
cats=defaultdict(None)
for i, cat in enumerate(train_df['label'].unique()):
    cats[cat]=i
    print(i,' --> ',cat)


0  -->  هنر و سینما
1  -->  علم و تکنولوژی
2  -->  سلامت و زیبایی
3  -->  بازی ویدیویی


In [None]:
train_df['label']=train_df['label'].map(cats)
test_df['label']=test_df['label'].map(cats)


## Dividing Train Data to Classes

In [98]:
train_df_classes=[]
for i in range(len(cats)):
    train_df_classes.append(train_df.loc[train_df['label'].isin([0])])


## Calculating the Likelihood of the Words

In [106]:
def calc_liklihood(class_df):
    words_prob=defaultdict(lambda: 1)
    for _,row in class_df.iterrows():
        for j in range(len(row['content'])):
            word=row['content'][j]
            words_prob[word]+=1
    num_all_words_class=sum(words_prob.values())
    for word in words_prob:
        words_prob[word]/=num_all_words_class
    return words_prob


In [120]:
class_words_prob=[]
for i in range(len(cats)):
    class_words_prob.append(calc_liklihood(train_df_classes[i]))


## Naive Bayes Classifier 1

In this classifier, we ignore the unseen words in the test data.

In [141]:
def calc_class_prob_1(test_sample, words_prob):
    prob=math.log(1/len(cats))
    test_words_prob=[]
    for word in test_sample:
        if word in words_prob:
            test_words_prob.append(math.log(words_prob[word]))
    prob = prob + sum(test_words_prob)
    return prob


In [142]:
def classifier_1_predict(test_sample, class_words_prob):
    category_chance=[]
    for i in range(len(cats)):
        class_prob = calc_class_prob_1(test_sample,class_words_prob[i])
        category_chance.append(class_prob)
    return category_chance.index(max(category_chance))


In [143]:
def classifier_1_evaluate(class_words_prob):
    predictions = []
    correct_labels=0
    for _,row in test_df.iterrows():
        prediction=classifier_1_predict(row['content'],class_words_prob)
        predictions.append(prediction)
        if prediction==row['label']:
            correct_labels+=1
    accuracy=correct_labels/test_df.shape[0]
    return predictions, accuracy


In [145]:
_, acc=classifier_1_evaluate(class_words_prob)
print('The accuracy of the model is: ',acc)

The accuracy of the model is:  0.2082294264339152


## Unigram, Bigram, and n-gram

In the previous classifier, we used unigrams, in which we consider the single words independently. But different words may have different meanings together implicitly. Hence, using bigrams and even n-grams helps us better understand the meaning of the expressions in the sentence. For example:
- Why is this so hard to use?
- This glass is hard enough to not be broken.

The word "hard" is in two different sentences with different meanings.

## Additive Smoothing

For example, the word "screen" may be in both the technology and gaming categories. But suppose it is present in a gaming article in the training data and not in a technology one. As a result, the classifier assumes that this word belongs to a gaming class that is not valid in practice. So, as the word "screen" is not present in the technology class, its probability is zero, in this case, minus infinity because of the logarithm. Therefore, other words that may have a good probability in the technology class be eliminated, and the sentence's probability would be zero for that particular class.
- (Class Gaming in train data) The Game plays well on all screen resolutions.
- (Class Technology in test data) The brand new mobile phone has a great touch screen.

If a word did not exist in our training data, it would get the probability of log(0), which is minus infinity, and that class's likelihood of being the one would be almost zero. Therefore by applying the additive smoothing, we ensure that the probability never will be minus infinity. So, for the above particular sentence, the likelihood of classifying the second sentence as technology would be considerable.

In statistics, additive smoothing, or Lidstone smoothing, is a technique used to smooth categorical data.
Given an observation x = (x1, …, xd) from a multinomial distribution with N trials and parameter vector θ = (θ1, …, θd), a "smoothed" version of the data gives the estimator:

![Additive Smoothing](./img/AdditiveSmoothing.png)

where the pseudo count α > 0 is the smoothing parameter (α = 0 corresponds to no smoothing), additive smoothing is a shrinkage estimator, as the resulting estimate will be between the empirical estimate xi / N and the uniform probability 1/d. Using Laplace's rule of succession, some authors have argued that α should be 1 (in which case the term add-one smoothing is also used), though, in practice, a smaller value is typically chosen ([Source](https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf)).
