# Naive Bayes Classifier
## FALL-99 A#3
### Bahar Emami Afshar
### STD number: 810197662
#### Abstract: In this project, we implement a predictive model based on the Naive Bayes algorithm to classify comments on a website into two groups: recommended and not recommended.

In [219]:
from __future__ import unicode_literals
from hazm import Lemmatizer, Normalizer, Stemmer, stopwords_list, word_tokenize
import pandas as pd
from collections import Counter


# 1. Reading Data

In [220]:
train_df = pd.read_csv("./CA3_dataset/comment_train.csv")
test_df = pd.read_csv("./CA3_dataset/comment_test.csv")


# 2. Preprocessing

### To preprocess the datasets, we apply the following steps:
    1- Combining the "title" and "comment" columns into a new column called "all".
    2- Normalizing the text, which cleans up spacing, including half-spaces.
    3- Tokenizing each row of the dataset into a list of words.
    4- Lemmatizing each word in the dataset.
    5- Removing stopwords, using the hazm stopword list.
    
### Question No.1:
### Lemmatization VS Stemming 
*Lemmatization* replaces each word with its root, so that different forms derived from the same root (nouns, verbs, adjectives, and so on) are treated as the same token; for example, it changes "می‌روم" to "رفت#رو". This increases the accuracy of our model, so we use it.

*Stemming* serves the same purpose as lemmatization, with the difference that it works by stripping prefixes and suffixes from a word to obtain its root. However, the hazm Stemmer was not accurate enough: in some cases it detects prefixes and suffixes wrongly, for example changing "عالی" to "عال". The stemming step is therefore commented out so that it does not decrease accuracy.
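
To make the contrast concrete, here is a toy sketch in plain Python. It does not use hazm and does not reproduce its algorithms: `toy_stem`, `toy_lemmatize`, the suffix list, and the lemma table are all hypothetical, chosen only to show how blind suffix stripping can damage a word that a lookup-based approach leaves intact.

```python
# Toy illustration only; none of this is hazm's actual logic.
SUFFIXES = ["ی"]   # hypothetical suffix list
LEMMA_TABLE = {}   # hypothetical lookup table, left empty for this example

def toy_stem(word):
    """Strip the first matching suffix, whether or not it is really a suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

word = "عالی"
print(toy_stem(word))       # the final letter is stripped although it belongs to the word
print(toy_lemmatize(word))  # unchanged, because the table has no entry for it
```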

In [221]:
def normalize(df):
    # Clean up characters and spacing with hazm's Normalizer.
    normalizer = Normalizer()
    df["all"] = df["all"].apply(normalizer.normalize)
    return df

def stem(df):
    # Not called: see the commented-out step in pre_process.
    stemmer = Stemmer()
    df["all"] = df["all"].apply(lambda words: [stemmer.stem(word) for word in words])
    return df

def lemmatize(df):
    lemmatizer = Lemmatizer()
    df["all"] = df["all"].apply(lambda words: [lemmatizer.lemmatize(word) for word in words])
    return df

def tokenize(df):
    df["all"] = df["all"].apply(word_tokenize)
    return df

def remove_stopwords(df):
    # A set makes the membership test fast.
    stopwords = set(stopwords_list())
    df["all"] = df["all"].apply(lambda words: [word for word in words if word not in stopwords])
    return df

def pre_process(df):
    # Full pipeline: merge columns, normalize, tokenize, lemmatize, drop stopwords.
    df = df.copy(deep=True)
    df["all"] = df["title"] + " " + df["comment"]
    df = normalize(df)
    df = tokenize(df)
    # df = stem(df)
    df = lemmatize(df)
    df = remove_stopwords(df)
    return df

def tokenize_df(df):
    # Baseline without preprocessing: merge columns and tokenize only.
    df = df.copy(deep=True)
    df["all"] = df["title"] + " " + df["comment"]
    df = tokenize(df)
    return df
    


### Question No.2:
    1- Posterior: the probability that a comment is recommended (or not_recommended) given an observed word as evidence. It is calculated as follows:

$P(recommended|word) = \dfrac{P(word|recommended) * P(recommended)}{P(word)}$

    2- Prior: the probability that a comment is recommended (or not_recommended) before any evidence is observed. It is calculated as follows:
    
$P(recommended) = \dfrac{number\_of\_recommended\_comments}{all\_comments}$

    3- Likelihood: the probability that a word appears in a recommended (or not_recommended) comment. It is calculated as follows:
    
$P(word|recommended) = \dfrac{count\_of\_word\_in\_all\_recommended\_comments}{count\_of\_all\_words\_in\_recommended\_comments}$

    4- Evidence: the words observed in a comment, and the number of times each is repeated, are our evidence.
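
The four quantities can be checked by hand on a tiny corpus. The sketch below uses a made-up three-comment training set; the data and the helper names `prior`, `likelihood`, and `posterior` are hypothetical and are not part of the notebook's implementation. Here $P(word)$ is obtained by summing $P(word|class) \cdot P(class)$ over both classes.

```python
from collections import Counter

# Hypothetical toy training data: (tokens, label)
train_rows = [
    (["good", "quality"], "recommended"),
    (["good", "price"], "recommended"),
    (["bad", "quality"], "not_recommended"),
]

def prior(label):
    """Fraction of training comments that carry this label."""
    return sum(1 for _, y in train_rows if y == label) / len(train_rows)

def likelihood(word, label):
    """Share of this word among all words of the comments with this label."""
    counts = Counter(w for tokens, y in train_rows if y == label for w in tokens)
    return counts[word] / sum(counts.values())

def posterior(label, word):
    """Bayes' rule, with P(word) computed by summing over both labels."""
    labels = {y for _, y in train_rows}
    evidence = sum(likelihood(word, y) * prior(y) for y in labels)
    return likelihood(word, label) * prior(label) / evidence

print(prior("recommended"))                 # 2 of the 3 comments
print(likelihood("good", "recommended"))    # 2 of the 4 words
print(posterior("recommended", "quality"))  # "quality" is equally likely under both labels
```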
    
    
### 3. Naive Bayes (train and test)
After preprocessing the data, we start training.

To train the model, we count how often each word appears in the recommended (or not_recommended) comments of the training set, and store the counts in a dictionary.

For each row of the test dataset, we compute a score for the row being recommended and a score for it being not recommended, and compare the two; whichever is greater gives the predicted label.

The score of each row is computed with Bayes' rule: it is the class prior multiplied by the frequencies, in that class, of the words observed in the row.

We use the likelihoods and the evidence to compute the posterior probability, and because $P(word)$ is the same for both recommended and not_recommended, we skip the division by it.
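
A common variant of this scoring sums logarithms instead of multiplying, which avoids very large or very small products on long comments. The sketch below is an illustration under assumptions, not the notebook's implementation: `log_score` and the word counts are hypothetical, the likelihoods are normalized by the total word count of the class, and every token is assumed to have been seen in training. As in the text, the division by $P(word)$ is skipped because it would shift both scores equally.

```python
import math

def log_score(tokens, word_counts, prior):
    """Unnormalized log posterior: log prior plus the summed log likelihoods."""
    total = sum(word_counts.values())
    score = math.log(prior)
    for word in tokens:
        # Assumes the word was seen in training; an unseen word raises KeyError.
        score += math.log(word_counts[word] / total)
    return score

# Hypothetical per-class word counts
recom_counts = {"good": 3, "bad": 1}
not_recom_counts = {"good": 1, "bad": 3}

tokens = ["good", "good", "bad"]
s_recom = log_score(tokens, recom_counts, prior=0.5)
s_not_recom = log_score(tokens, not_recom_counts, prior=0.5)
label = "recommended" if s_recom > s_not_recom else "not_recommended"
print(label)
```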

In [222]:
def calc_word_frequency(row, frequencies):
    # Add the word counts of one tokenized comment to the running totals.
    for word, count in Counter(row).items():
        frequencies[word] = frequencies.get(word, 0) + count
    return frequencies

def frequency_dict(df):
    # Count every word over all comments in df["all"].
    frequencies = {}
    for row in df["all"]:
        frequencies = calc_word_frequency(row, frequencies)
    return frequencies

def train(df, recommend):
    # Word counts restricted to the comments labeled `recommend`.
    df = df[df["recommend"] == recommend]
    return frequency_dict(df)


### Question No.3:
In this model, the probability of each word is defined as its frequency (the number of times the word is repeated in the training comments of a class). So if the model observes a word that never occurs in the recommended comments of the training set, it assumes the word has zero probability of being recommended; the whole product becomes zero, and the comment is classified as not_recommended no matter what the other words are.
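
The effect is easy to reproduce on made-up numbers. In this sketch, `product_score` and the counts are hypothetical; it multiplies a prior by raw word frequencies and treats an unseen word as frequency zero.

```python
def product_score(tokens, counts, prior):
    """Prior times the raw frequency of each token; unseen tokens count as 0."""
    score = prior
    for word in tokens:
        score *= counts.get(word, 0)
    return score

recom_counts = {"good": 5, "fast": 2}  # hypothetical training counts

seen_only = product_score(["good", "fast"], recom_counts, prior=0.5)
with_unseen = product_score(["good", "fast", "unseen_word"], recom_counts, prior=0.5)

print(seen_only)    # positive score
print(with_unseen)  # a single unseen word wipes the score out
```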

### Question No.4:
### Additive Smoothing

To solve the problem described in Question No.3, we use additive smoothing. To implement it, we increase the frequency of every word by 1. Then, if the model observes a word that does not occur in the recommended comments of the training set, it takes the word's frequency to be 1, so multiplying by it no longer turns the product of the other words' frequencies into zero.
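
For comparison, the textbook form of additive (Laplace) smoothing adds a constant to every count and renormalizes, so that the smoothed likelihoods of a class still sum to 1 over the vocabulary. The sketch below shows that normalized form; it is not the notebook's code, which adds the constant to raw counts without renormalizing, and `smoothed_likelihood`, the counts, and the vocabulary are hypothetical.

```python
def smoothed_likelihood(word, counts, vocabulary, alpha=1):
    """(count + alpha) / (total + alpha * |V|): never zero when alpha > 0."""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * len(vocabulary))

# Hypothetical counts for one class and the vocabulary of both classes
recom_counts = {"good": 5, "fast": 2}
vocabulary = {"good", "fast", "bad", "slow"}

p_unseen = smoothed_likelihood("slow", recom_counts, vocabulary)
p_total = sum(smoothed_likelihood(w, recom_counts, vocabulary) for w in vocabulary)

print(p_unseen)  # small but nonzero
print(p_total)   # the smoothed likelihoods still form a distribution
```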

In [223]:
def Naive_Bayes_test(train_df,test_df,smoothing = 0):
    recom_freqs = train(train_df,"recommended")
    not_recom_freqs = train(train_df,"not_recommended")
    
    # Priors come from the training set; test labels are reserved for evaluation.
    prior_recom = sum(train_df["recommend"] == "recommended") / len(train_df)
    prior_not_recom = sum(train_df["recommend"] == "not_recommended") / len(train_df)
    
    result = []
    for i in range(test_df.shape[0]):
        p_recom = prior_recom
        p_not_recom = prior_not_recom
        for word in test_df["all"].iloc[i]:
            # A word unseen in a class is registered with count `smoothing` and
            # skipped for this occurrence; otherwise its smoothed count is multiplied in.
            if word not in recom_freqs:
                recom_freqs[word] = smoothing
            else:
                p_recom *= (recom_freqs[word] + smoothing)

            if word not in not_recom_freqs:
                not_recom_freqs[word] = smoothing
            else:
                p_not_recom *= (not_recom_freqs[word] + smoothing)
                
        if p_recom > p_not_recom:
            result.append("recommended")
        else:
            result.append("not_recommended")
        
    test_df["predict"] = result
    return test_df

### 4. Evaluation

### Question No.5:

#### Precision:
If our model labels only a few comments as recommended and labels them correctly, its precision is 100%, even though the model is not that good: it has wrongly labeled many recommended comments as not_recommended.

#### Recall:
If our model labels every comment as recommended, its recall is 100%; however, it has wrongly labeled many not_recommended comments as recommended, so it is not as good as 100% suggests.

For these reasons, neither precision nor recall alone is a good measure of our model.
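
Both degenerate cases can be reproduced on a made-up set of ten labels. The helper `precision_recall` and the label lists are hypothetical and serve only as an illustration; they do not come from the dataset.

```python
def precision_recall(y_true, y_pred, positive="recommended"):
    """Precision and recall for the positive class, from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = ["recommended"] * 5 + ["not_recommended"] * 5

# Predicts "recommended" once, and correctly: perfect precision, poor recall.
cautious = ["recommended"] + ["not_recommended"] * 9
# Predicts "recommended" for everything: perfect recall, poor precision.
greedy = ["recommended"] * 10

print(precision_recall(y_true, cautious))
print(precision_recall(y_true, greedy))
```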

### Question No.6:

#### F1

The F1 score is the harmonic mean of precision and recall. Its highest possible value is 1, indicating perfect precision and recall, and its lowest possible value is 0, which occurs if either precision or recall is zero. Because it combines recall and precision in a single number, it is a better score for measuring our model's predictions.
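
A short numeric check shows why the harmonic mean is used: it stays low whenever one of the two values is low, while the arithmetic mean does not. The function `f1` below is a standalone sketch with a hypothetical guard for the case where both inputs are zero.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 1.0, 0.2
print(f1(precision, recall))         # pulled toward the weaker value
print((precision + recall) / 2)      # the arithmetic mean hides the weak recall
```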

In [224]:

def Accuracy(df):
    return sum((df["recommend"] == df["predict"]))/ len(df)

def Precision(df):
    return sum((df["recommend"] == df["predict"]) & (df["predict"] == "recommended"))/ len(df[df["predict"] == "recommended"])

def Recall(df):
    return sum((df["recommend"] == df["predict"]) & (df["predict"] == "recommended"))/ len(df[df["recommend"] == "recommended"])

def F1(df):
    # Harmonic mean of precision and recall, each computed once.
    precision, recall = Precision(df), Recall(df)
    return 2 * (precision * recall) / (precision + recall)

def print_result(df):
    print("Accuracy: " + str(Accuracy(df)*100) +"%")
    print("Precision: " + str(Precision(df)*100) +"%")
    print("Recall: " + str(Recall(df)*100) +"%")
    print("F1: " + str(F1(df)*100) +"%")

### Question No.7:
The results are as follows:

In [225]:
print("Preprocessing and Additive Smoothing")
preprocess_train_df = pre_process(train_df)
preprocess_test_df = pre_process(test_df)
print_result(Naive_Bayes_test(preprocess_train_df,preprocess_test_df,smoothing = 1))

print("\nAdditive Smoothing")
train_df_token = tokenize_df(train_df)
test_df_token = tokenize_df(test_df)
print_result(Naive_Bayes_test(train_df_token,test_df_token,smoothing = 1))

print("\nPreprocessing")
preprocess_train_df = pre_process(train_df)
preprocess_test_df = pre_process(test_df)
print_result(Naive_Bayes_test(preprocess_train_df,preprocess_test_df,smoothing = 0))

print("\nUsing None")
train_df_token = tokenize_df(train_df)
test_df_token = tokenize_df(test_df)
print_result(Naive_Bayes_test(train_df_token,test_df_token,smoothing = 0))


Preprocessing and Additive Smoothing
Accuracy: 91.875%
Precision: 88.50574712643679%
Recall: 96.25%
F1: 92.21556886227546%

Additive Smoothing
Accuracy: 88.125%
Precision: 83.51648351648352%
Recall: 95.0%
F1: 88.8888888888889%

Preprocessing
Accuracy: 89.625%
Precision: 87.82816229116945%
Recall: 92.0%
F1: 89.86568986568987%

Using None
Accuracy: 87.75%
Precision: 83.70535714285714%
Recall: 93.75%
F1: 88.44339622641509%


### Question No.8:
According to the results in the previous part, using additive smoothing together with preprocessing gives the highest accuracy (91.875%). That is because in this case the model computes the result from every word, and a word that is absent from the training set can no longer drive the probability of the whole comment to zero.

As can be seen, additive smoothing and preprocessing each increase the accuracy on their own as well (88.125% and 89.625%, versus 87.75% with neither), so we conclude that our preprocessing suits the data.



### Question No.9:
Some of the comments that we misclassified are printed below.

These comments contain positive adjectives with negative verbs, or negative adjectives with positive verbs, and we labeled them as recommended because of our preprocessing.
Our preprocessing converts each verb to its root, which in some cases causes negative and positive verbs to be treated the same; the model then sees the positive adjectives and classifies the comment as recommended.
Another problem arises when the preprocessor does distinguish negative verbs from positive ones: because we ignore the position of words, the model is confused by positive verbs, which push it toward recommended, and negative adjectives, which push it the opposite way. All of these effects can cause the model to misclassify some comments.
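
The loss of word position can be seen directly in the word counts. In the sketch below, two made-up English sentences (hypothetical, used only for readability) have opposite meanings but produce identical counts, so a model built on word frequencies cannot tell them apart.

```python
from collections import Counter

# Hypothetical sentences with the same words in a different order
praise = "not expensive and good".split()
complaint = "expensive and not good".split()

# The token sequences differ, but the bags of words are identical.
print(praise != complaint)
print(Counter(praise) == Counter(complaint))
```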

Comment number 2 is wrongly labeled as recommended in the dataset; our model's prediction for it, not_recommended, is in fact the correct one.


In [231]:
df = Naive_Bayes_test(preprocess_train_df,preprocess_test_df,smoothing = 1)
wrong_pred = preprocess_test_df[df["predict"] != df["recommend"]].tail()
for i in range(wrong_pred.shape[0]):

    print("recommend: ",wrong_pred["recommend"].iloc[i])
    print("title: ",wrong_pred["title"].iloc[i])
    print("comment: ",wrong_pred["comment"].iloc[i])
    print()

recommend:  not_recommended
title:  پاور بانک
comment:  باسلام خدمت دوستان  من تعجب میکنم از چیه این تعریف میکنن
گوشی من سامسونگ اس ۶ هستش ۲۵۵۰ 
حالا ۳.۵ بار شارژ میکنه کنار 
بحثم اینجاست 
۲ساعت نیم میکشه شارژ کامل که واقعا خوب نیست فاجعه هستش 
و مورد دیگه اداپتور من فست هستش تازه با فست قشنگ ۷الی۸ ساعت میکشه شارژ بشه 
چیه این خوبه اخه تعریف میکنید 
نه شکل ظاهر مناسب  نه ابعاد خوب 
دیر شارژ میشه 
شارژ کند انجام میده 
تنها مزیت این گارانتی هستش 
تموم شد رفت

recommend:  recommended
title:  پیشنهاد نمیدم
comment:  من دوسه ماهی هست این کفشدازردیجی گرفتم متاسفانه کیفیت چسب کفی خوب نیست و از جلو بلند شده و اینکه بنداش کیفیت لازم رو نداره و پا داخلش بو میگیره قالباشم دقیق نیست بنظرم ارزش این پول نداره ...

recommend:  not_recommended
title:  اندرمعایب آچارلوله گیر14اینچ ایران پتک
comment:  این آچار لوله گیر خیلی سنگینه،برای کارمداوم وکسانی که دست وبازوی ضعیفی دارند اصلا مناسب نیست.اگرقبل ازخریدبه دست می گرفتم،ازخریدمنصرف می شدم..

recommend:  not_recommended
title:  کلنیل
comment: