# CA3
## Mohammad Ali Zare
### 810197626

In this assignment we must infer whether a Digikala review is positive or negative using Naive Bayes. The model's information is gathered from a training dataset and then applied to a test dataset.

Naive Bayes is built on Bayes' rule and an assumption of conditional independence. It is called naive because it assumes the features are independent of each other given the class. For example, a spam filter built this way ignores the order of the words in a message and treats them as independent (the bag-of-words model). This model uses bag of words in the same way as a spam filter.

Although simple, Naive Bayes can still give acceptable results, so it is mostly used as a fast, easy, experimental model.

In [79]:
from __future__ import unicode_literals
import hazm
import numpy as np
import pandas as pd
import time
import collections
import math

------------
### Loaded Data

The comment_train.csv file contains our train data and comment_test.csv contains the test data.

In [80]:
train_data = pd.read_csv('comment_train.csv')
test_data = pd.read_csv('comment_test.csv')

---------
### Stop Words

A few conjunctions and prepositions, punctuation marks, diacritics and whitespace characters are chosen as stop words. Hazm's stop word list was also tried, and the results didn't change much.


In [81]:
stop_list = ['و', 'یا', 'را', '!', '؟', '?', '.', ',', '،', '\r', '\n', '\t', 'به', 'از', 'ُ', 'ً', 'ٍ']
# stop_list = hazm.stopwords_list()

--------
### Title and Comment

The title of each review was appended to its comment with a weight of 2, meaning it was repeated twice in the comment so that the title words have a larger effect on the result.

In [82]:
train_data['comment'] = train_data['comment'] + ' ' + 2*(train_data['title']+ ' ')
test_data['comment'] = test_data['comment'] + ' ' + 2*(test_data['title']+ ' ')
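
As a small illustration of what this weighting does to a single row, the sketch below applies the same expression to plain strings. The comment and title texts are made up for the example:

```python
from collections import Counter

comment = "battery lasts long"   # hypothetical comment text
title = "great battery"          # hypothetical title text

# Same expression as in the cell above, applied to plain strings.
weighted = comment + ' ' + 2 * (title + ' ')

# 'battery' is counted once from the comment and twice from the title.
counts = Counter(weighted.split())
print(weighted)
print(counts["battery"], counts["great"])
```

Every word of the title therefore contributes twice to the class word counts and to the score of a review.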

------------
## Pre-Process

Different approaches were tried for the pre-process, including removing stop words and using Hazm's stemmer, lemmatizer and normalizer. Hazm's stemmer didn't work very well, as it removed the ending letters of many words where it shouldn't have. The final pre-process removes stop words, then normalizes each word, and then lemmatizes it.

### Lemmatization vs Stemming

Both lemmatization and stemming try to find the root of a word, so that different variations of a word or verb become the same word. This way we get a more accurate histogram and frequency for those words.

Stemming mostly relies on cutting off the suffixes and prefixes of a word to return its root, for example changing **می‌رفتم** ("I was going") to **رفت** ("went"). But sometimes it fails and cuts off parts that belong to the root, e.g. changing **پایان** ("end") to **پا** ("foot"). An English example of stemming is changing **clearly** to **clear**.

Lemmatization, on the other hand, tries to find the root of the word in context and takes its meaning into account; it may use a database of words and their meanings to do its job. For example, it can change **می‌روم** ("I go") to **رفت**, whereas stemming may fail and change it to **می‌رو**. An English example is changing **worse** to **bad**.
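
The difference can be sketched with a toy example in plain Python. This is not how Hazm works internally; `toy_stem`, `toy_lemmatize` and the `TOY_LEMMAS` table are hypothetical names used only to contrast a suffix rule with a dictionary lookup:

```python
def toy_stem(word, suffixes=("ly", "ing", "s")):
    """Strip the first matching suffix; no dictionary, no context."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
TOY_LEMMAS = {"worse": "bad", "went": "go"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

print(toy_stem("clearly"))      # the suffix rule works here
print(toy_stem("family"))       # over-stemming: 'ly' belongs to the root
print(toy_stem("worse"))        # no suffix rule can reach 'bad'
print(toy_lemmatize("worse"))   # the dictionary lookup can
```

The over-stemmed `family` case mirrors the **پایان** to **پا** failure described above.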



In [83]:
stemmer = hazm.Stemmer()
lem = hazm.Lemmatizer()
normalizer = hazm.Normalizer()

In [84]:
def filter_stop_words(x):
    return x not in stop_list

In [85]:
def lem_n_norm(x):
    splitted = lem.lemmatize(normalizer.normalize(x)).split('#')
    return splitted[0]

In [86]:
def pre_process(words):
    result = list(map(lem_n_norm, filter(filter_stop_words, words)))
    return list(filter(None, result))

----------
## Naive Bayes Classifier

As explained in the introduction it uses Bayes rule and assumption of conditional independence as its base.

### Model

This model has two classes (**recommended** and **not_recommended**). The features of each class are the words used in it, i.e. the words appearing in the reviews of that class. If a word is repeated many times in recommended reviews, a comment containing that word gets a higher chance of being classified as recommended.

#### Bag of Words

The bag of words model is used, so we treat all the words the same regardless of their position in a sentence. To make the bag of words, we combine all the recommended reviews, tokenize them, and put the tokens in a single list of words. We do the same for the not_recommended reviews. We also treat the probability of a word as independent of the other words given its class.
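
A minimal sketch of this idea, using whitespace splitting on made-up English sentences instead of `hazm.word_tokenize`. The sentences and the `bag_of_words` helper are illustrative assumptions, not part of the assignment code:

```python
from collections import Counter

def bag_of_words(reviews):
    """Flatten a list of review strings into one list of tokens."""
    words = []
    for review in reviews:
        words += review.split()  # stand-in for hazm.word_tokenize
    return words

# Two hypothetical reviews with the same words in a different order.
bag_a = bag_of_words(["good battery good price"])
bag_b = bag_of_words(["price good battery good"])

# The word counts are identical, so the model cannot tell the two apart.
same_counts = Counter(bag_a) == Counter(bag_b)
print(same_counts)
```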

### Process

First we count the occurrences of each word in the word lists for recommended and not_recommended.


In [87]:
def get_freqs(rec_list, not_rec_list):
    rec_freq = dict(collections.Counter(rec_list))
    not_rec_freq = dict(collections.Counter(not_rec_list))
    return rec_freq, not_rec_freq, len(rec_list), len(not_rec_list)


#### Prior Probability

The initial probability of a review being recommended or not_recommended before seeing any evidence. It's calculated by dividing the number of reviews in each class by the total number of reviews:

$P(recommended) = \dfrac{recommended\_review\_count}{total\_review\_count}$

It's 0.5 for both classes:

In [88]:
(train_data['recommend'] == 'recommended').sum() / len(train_data)

0.5

#### Likelihood
The probability of a word appearing in a review given that the review is labeled recommended or not_recommended. It is easy to calculate from the word counts:

$P(word\ |\ recommended) = \dfrac{frequency\_in\_recommended\_words(word)}{total\ recommended\ words\ count}$
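
As a toy illustration of this formula, the sketch below computes a likelihood from a small made-up word list. The list contents and the `likelihood` helper are hypothetical; in the notebook the counts come from `get_freqs`:

```python
from collections import Counter

# Hypothetical recommended word list (rec_words in the notebook).
toy_rec_list = ["good", "good", "fast", "cheap"]
toy_rec_freq = Counter(toy_rec_list)

def likelihood(word, freq, total):
    """P(word | class) = count of word in the class / total words in the class."""
    return freq[word] / total

p_good = likelihood("good", toy_rec_freq, len(toy_rec_list))
p_slow = likelihood("slow", toy_rec_freq, len(toy_rec_list))  # unseen word
print(p_good, p_slow)
```

The unseen word gets a likelihood of exactly 0, which is the case additive smoothing is meant to handle.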

#### Evidence

The appearance of each word in a review is a piece of evidence.

#### Posterior

The probability of a review being recommended given that we have seen a word. It isn't easy to find directly, but using Bayes' rule we can calculate it from the other quantities:

$ P(recommended\ |\ word) = \dfrac{P(recommended)*P(word\ |\ recommended)}{P(word)} $

P(word) is the evidence here.
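
A worked numeric example of the rule, with made-up likelihood values. Only the 0.5 priors come from the training data; the evidence is obtained from the law of total probability over the two classes:

```python
p_rec = 0.5                 # prior, as computed from the training data
p_not_rec = 0.5
p_word_given_rec = 0.02     # hypothetical likelihoods
p_word_given_not_rec = 0.005

# Evidence: total probability of seeing the word under either class.
p_word = p_rec * p_word_given_rec + p_not_rec * p_word_given_not_rec

# Bayes' rule for each class.
p_rec_given_word = p_rec * p_word_given_rec / p_word
p_not_rec_given_word = p_not_rec * p_word_given_not_rec / p_word
print(p_rec_given_word, p_not_rec_given_word)
```

Both posteriors share the same denominator, so comparing them only requires the numerators.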

### Labeling in this problem

We can use these equations to get the posterior probabilities of each review being recommended or not_recommended, and label the review with whichever class has the higher value. Note that we don't need to calculate the evidence, as it is equal for both classes.

We assumed the words are conditionally independent given the class (recommended/not_recommended), so we just multiply their probabilities.

As an example, for the sentence **word1 word2**:

$P(recommended\ |\ sentence) \propto P(recommended)*P(word1 | recommended)*P(word2 | recommended)$

We calculate this for not_recommended too and then compare the two values.
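
The comparison can be sketched as follows. The likelihood tables are made-up numbers and `unnormalized_posterior` is a hypothetical helper, not the notebook's `label` function:

```python
def unnormalized_posterior(words, prior, likelihoods):
    """Prior times the product of per-word likelihoods (evidence omitted)."""
    score = prior
    for word in words:
        score *= likelihoods[word]
    return score

# Hypothetical per-class likelihoods for the two words.
rec_likelihoods = {"word1": 0.02, "word2": 0.01}
not_rec_likelihoods = {"word1": 0.005, "word2": 0.03}

sentence = ["word1", "word2"]
rec_value = unnormalized_posterior(sentence, 0.5, rec_likelihoods)
not_rec_value = unnormalized_posterior(sentence, 0.5, not_rec_likelihoods)

predicted = 'recommended' if rec_value > not_rec_value else 'not_recommended'
print(rec_value, not_rec_value, predicted)
```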


-------
### Additive Smoothing

Sometimes a word appears in the not_recommended reviews of the training data but not in the recommended ones. In that case the posterior probability of recommended is zero. For example, if we didn't see **word2** in the recommended reviews of the training data, this would happen:

$P(word2\ |\ recommended) = 0$

$P(recommended\ |\ sentence) \propto P(recommended)*P(word1 | recommended)*P(word2 | recommended) = 0$

So the review would be classified as not_recommended, no matter what the other words are.

To solve this problem we use additive smoothing to eliminate these zero probabilities.

We add Alpha to the count of every word in every class, so a word missing from a class gets a count of Alpha there. In the example, the count of **word2** would be Alpha in the recommended word list and Alpha + prev_count in the not_recommended word list.

Alpha = 1 was used for this problem.




In [89]:
def get_smoothed_freqs(rec_list, not_rec_list):
    rec_freq = dict(collections.Counter(rec_list))
    not_rec_freq = dict(collections.Counter(not_rec_list))
    rec_word_count = len(rec_list)
    not_rec_word_count = len(not_rec_list)

    for word in rec_list:
        rec_freq[word] += 1
        if word not in not_rec_freq:
            not_rec_word_count += 1
            not_rec_freq[word] = 1

    for word in not_rec_list:
        not_rec_freq[word] += 1
        if word not in rec_freq:
            rec_word_count += 1
            rec_freq[word] = 1
    
    return rec_freq, not_rec_freq, rec_word_count, not_rec_word_count
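
Note that the loops in the cell above run once per word occurrence, so an existing count is incremented once per occurrence rather than once per distinct word. For comparison, the textbook form of additive smoothing adds Alpha once per vocabulary word and adds Alpha times the vocabulary size to the denominator, so each class's probabilities still sum to 1. The sketch below shows that variant on toy data. It is not the function used to produce the results in this notebook, and `laplace_smoothed_freqs` and its `alpha` parameter are hypothetical names:

```python
from collections import Counter

def laplace_smoothed_freqs(rec_list, not_rec_list, alpha=1):
    """Add `alpha` once to every vocabulary word's count in both classes."""
    rec_counts = Counter(rec_list)
    not_rec_counts = Counter(not_rec_list)
    vocab = set(rec_counts) | set(not_rec_counts)

    # Counter returns 0 for missing words, so unseen words end up with alpha.
    s_rec_freq = {w: rec_counts[w] + alpha for w in vocab}
    s_not_rec_freq = {w: not_rec_counts[w] + alpha for w in vocab}

    # Denominators grow by alpha per vocabulary word.
    s_rec_total = len(rec_list) + alpha * len(vocab)
    s_not_rec_total = len(not_rec_list) + alpha * len(vocab)
    return s_rec_freq, s_not_rec_freq, s_rec_total, s_not_rec_total

s_rec_freq, s_not_rec_freq, s_rec_total, s_not_rec_total = laplace_smoothed_freqs(
    ["good", "good", "fast"], ["bad", "slow", "good"]
)
print(s_rec_freq["bad"] / s_rec_total)  # unseen in recommended, but not zero
```

The return value has the same four-element shape as `get_smoothed_freqs`, so it could be passed to `label` as `info`. Whether it would change the reported scores has not been checked here.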

--------------
### Label function

The product of many small probabilities underflows to 0 in Python's floating-point arithmetic, so we take the logarithm of each probability and sum the logs instead of multiplying the probabilities.
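
A quick demonstration of the underflow, using 100 made-up likelihoods of 1e-5 each (the values are illustrative only):

```python
import math

probs = [1e-5] * 100   # hypothetical per-word likelihoods

# Multiplying directly: the true value 1e-500 is below the float range.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the score finite and comparable between classes.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

Because the logarithm is monotonic, comparing the summed logs gives the same label as comparing the products would, had they not underflowed.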

In [90]:
def label(comment, info, pre_proc=None):
    """Classify one comment as 'recommended' or 'not_recommended'."""
    words = hazm.word_tokenize(comment)

    if pre_proc:
        words = pre_proc(words)

    rec_freq, not_rec_freq, rec_word_count, not_rec_word_count = info

    # log prior; both classes have a prior of 0.5 in the training data
    rec_score = math.log(0.5)
    not_rec_score = math.log(0.5)

    for word in words:
        # ignore words that were never seen in training for either class
        if word not in rec_freq and word not in not_rec_freq:
            continue

        # only reachable without smoothing: a word unseen in one class
        # gives that class probability 0, i.e. a log score of -inf
        if word not in rec_freq:
            rec_score = float('-inf')
            break

        if word not in not_rec_freq:
            not_rec_score = float('-inf')
            break

        # sum log-likelihoods instead of multiplying probabilities
        rec_score += math.log(rec_freq[word] / rec_word_count)
        not_rec_score += math.log(not_rec_freq[word] / not_rec_word_count)

    if rec_score > not_rec_score:
        return 'recommended'
    else:
        return 'not_recommended'

--------
### Creating Bag of Words

All the comments are added to a single list for each class.

In [91]:
rec_words = []
not_rec_words = []
for i, row in train_data.iterrows():
    if row['recommend'] == 'recommended':
        rec_words += hazm.word_tokenize(row['comment'])
    else:
        not_rec_words += hazm.word_tokenize(row['comment'])

--------------

Here the label function is called with and without pre-processing and smoothing:

In [92]:
nothing_info = get_freqs(rec_words, not_rec_words)
smoothed_info = get_smoothed_freqs(rec_words, not_rec_words)
pre_info = get_freqs(pre_process(rec_words), pre_process(not_rec_words))
pre_smoothed_info = get_smoothed_freqs(pre_process(rec_words), pre_process(not_rec_words))

test_data['pre_smooth'] = test_data['comment'].apply(label, args=[pre_smoothed_info, pre_process])
test_data['smooth'] = test_data['comment'].apply(label, args=[smoothed_info])
test_data['pre'] = test_data['comment'].apply(label, args=[pre_info, pre_process])
test_data['nothing'] = test_data['comment'].apply(label, args=[nothing_info])

In [93]:
def print_results(label):
    correct_recs_detected = ((test_data['recommend'] == test_data[label]) & (test_data[label] == 'recommended')).sum()
    all_recs_detected = (test_data[label] == 'recommended').sum()
    total_recs = (test_data['recommend'] == 'recommended').sum()

    accuracy = (test_data['recommend'] == test_data[label]).sum() / len(test_data) * 100
    precision = correct_recs_detected / all_recs_detected * 100
    recall = correct_recs_detected / total_recs * 100
    f1 = 2 * (precision * recall) / (precision + recall)

    print('-------------------------')
    print(label + ':\n')
    print(f'{"Accuracy":>10}: \t {accuracy :.2f} %\n')
    print(f'{"Precision":>10}: \t {precision :.2f} %\n')
    print(f'{"Recall":>10}: \t {recall :.2f} %\n')
    print(f'{"F1":>10}: \t {f1 :.2f} %\n')



-----------
### Evaluation

#### Precision

If we only use precision, a model that labels just one review as recommended, and gets it right, scores 100%, so precision can't be used alone. In general, a model that labels only a few comments as recommended can get a high precision even though it is not very good.

#### Recall

If we label many comments as recommended, including many correct and many wrong ones, we get a high recall even though the model is still not good. For example, labeling all the reviews as recommended gives 100% recall.

#### F1

To combat the downsides of recall and precision, we combine the two values into the F1 score. F1 is the **harmonic mean** of recall and precision:

$F1 = \dfrac{2 * precision * recall}{precision + recall}$

The harmonic mean is pulled toward the smaller of the two values, so a model needs both good precision and good recall to get a high F1. This value is a better representation of the correctness of our model.
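
A small numeric check of the harmonic mean. The first case is a made-up degenerate model (one review labeled recommended, correctly, out of an assumed 100 truly recommended reviews). The second uses the precision and recall printed for the `pre_smooth` run in this notebook. `f1_score` is a hypothetical helper that mirrors the formula in `print_results`:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * (precision * recall) / (precision + recall)

# Degenerate model: perfect precision, almost no recall.
f1_degenerate = f1_score(100.0, 1.0)

# Values reported for the pre_smooth run.
f1_pre_smooth = f1_score(92.63, 94.25)

print(f'{f1_degenerate:.2f}')
print(f'{f1_pre_smooth:.2f}')
```

The degenerate model's F1 stays close to its poor recall instead of being lifted by its perfect precision.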



In [169]:
print_results('pre_smooth')

-------------------------
pre_smooth:

  Accuracy: 	 93.38 %

 Precision: 	 92.63 %

    Recall: 	 94.25 %

        F1: 	 93.43 %



In [95]:
print_results('smooth')

-------------------------
smooth:

  Accuracy: 	 94.88 %

 Precision: 	 94.32 %

    Recall: 	 95.50 %

        F1: 	 94.91 %



In [96]:
print_results('pre')

-------------------------
pre:

  Accuracy: 	 90.25 %

 Precision: 	 90.25 %

    Recall: 	 90.25 %

        F1: 	 90.25 %



In [97]:
print_results('nothing')

-------------------------
nothing:

  Accuracy: 	 90.00 %

 Precision: 	 89.60 %

    Recall: 	 90.50 %

        F1: 	 90.05 %



### Results

Additive smoothing improves the model noticeably: accuracy rises from 90.00% to 94.88% without pre-processing, and from 90.25% to 93.38% with it. The reason is that smoothing prevents the model from deciding a label solely because a word was missing from one class's training data, and lets it take the other words into account.

Our pre-process, however, is not very effective: it adds only 0.25 points of accuracy without smoothing (90.00% to 90.25%) and lowers accuracy with smoothing (94.88% to 93.38%). A possible reason is that it takes some context away, for example negative and positive verbs become the same word, or some words lose their meaning.

--------
### When our model makes mistakes

In the example below, both unique words in the comment have a higher score under recommended, even though the review is not_recommended, so our model labels it as recommended.

The reason can be the context in which the words are used. Our Naive Bayes model ignores the context and the sentence structure completely, but in reality a word can take a positive or negative meaning depending on the context and the verb used. For example, **ایراد** ("flaw") can be used as **ایراد ندارد** ("has no flaw") and **ایراد دارد** ("has a flaw"). These two sentences have opposite meanings, but our model treats the word **ایراد** the same in both. The negative and positive verbs also become the same verb in the pre-processing.
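
This loss of context can be seen directly on the two phrases. The sketch splits on whitespace as a stand-in for `hazm.word_tokenize`, which is an assumption made to keep the example dependency-free:

```python
positive_phrase = "ایراد ندارد".split()   # "has no flaw"
negative_phrase = "ایراد دارد".split()    # "has a flaw"

# The token the two phrases share is scored identically in both.
shared_tokens = set(positive_phrase) & set(negative_phrase)
print(shared_tokens)
```

Only the verb distinguishes the two phrases, and that distinction is what lemmatization may remove.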

Another reason can be the small stop word set. We haven't removed all neutral words, so they still shift the class scores even though they are not reliable for labeling.

In [127]:
rec_freq, not_rec_freq, rec_word_count, not_rec_word_count = pre_smoothed_info
wrongs = test_data[test_data['recommend'] != test_data['pre_smooth']].reset_index()

In [163]:
print('real label: ', wrongs.iloc[5]['recommend'])
print('our label: ', wrongs.iloc[5]['pre_smooth'])
print('comment:\n', wrongs.iloc[5]['comment'])
print('')
print('دستگاه score in recommended words     ', rec_freq['دستگاه'] / rec_word_count)
print('دستگاه score in not_recommended words ', not_rec_freq['دستگاه'] / not_rec_word_count)
print('ایراد score in recommended words     ', rec_freq['ایراد'] / rec_word_count)
print('ایراد score in not_recommended words ', not_rec_freq['ایراد'] / not_rec_word_count)

real label:  not_recommended
our label:  recommended
comment:
 ایراد دستگاه ایراد دستگاه ایراد دستگاه 

دستگاه score in recommended words      0.005051801007152709
دستگاه score in not_recommended words  0.0030817436815646303
ایراد score in recommended words      0.0010745100554896238
ایراد score in not_recommended words  0.0004992769093037669


In [171]:
for i, row in wrongs.tail().iterrows(): 
    print('real label: ', row['recommend'])
    print('our label: ', row['pre_smooth'])
    print('comment:\n',row['comment'])
    print('\n------------------\n')

real label:  not_recommended
our label:  recommended
comment:
 باسلام خدمت دوستان  من تعجب میکنم از چیه این تعریف میکنن
گوشی من سامسونگ اس ۶ هستش ۲۵۵۰ 
حالا ۳.۵ بار شارژ میکنه کنار 
بحثم اینجاست 
۲ساعت نیم میکشه شارژ کامل که واقعا خوب نیست فاجعه هستش 
و مورد دیگه اداپتور من فست هستش تازه با فست قشنگ ۷الی۸ ساعت میکشه شارژ بشه 
چیه این خوبه اخه تعریف میکنید 
نه شکل ظاهر مناسب  نه ابعاد خوب 
دیر شارژ میشه 
شارژ کند انجام میده 
تنها مزیت این گارانتی هستش 
تموم شد رفت پاور بانک پاور بانک 

------------------

real label:  recommended
our label:  not_recommended
comment:
 من دوسه ماهی هست این کفشدازردیجی گرفتم متاسفانه کیفیت چسب کفی خوب نیست و از جلو بلند شده و اینکه بنداش کیفیت لازم رو نداره و پا داخلش بو میگیره قالباشم دقیق نیست بنظرم ارزش این پول نداره ... پیشنهاد نمیدم پیشنهاد نمیدم 

------------------

real label:  not_recommended
our label:  recommended
comment:
 این آچار لوله گیر خیلی سنگینه،برای کارمداوم وکسانی که دست وبازوی ضعیفی دارند اصلا مناسب نیست.اگرقبل ازخریدبه دست می گرفتم،ا