Naive Bayes algorithms are a set of supervised statistical classification machine learning algorithms based on the **Bayes probability theorem**.
<br>

Bayes theorem states that:
<br>

P(A|B) = P(B|A) * P(A)/P(B)
<br>

An important assumption made by Bayes theorem is that **the value of a particular feature is independent from the value of any other feature for a given the class**.
<br>

I have done a sentiment analysis on the [Reddit dataset](https://www.kaggle.com/cosmos98/twitter-and-reddit-sentimental-analysis-dataset), which is a multiclass classification problem.

### About the dataset used:

It consists of two columns: `category` and `clean_comment`.

Category column:

*  1 : Indicating a Postive Sentiment
*  0 : Indicating it is a Neutral Sentiment
* -1 : Indicating a Negative Sentiment

Comment column:

These comments were made on political leaders as well as people's opinion towards the next prime minister of the Nation(in context with General Elections Held In India 2019).

### Basic workflow of building the classifier: 

(From each comment, build a vocabulary that consists of word, it's sentiment and it's frequency.
For any given string, it's sentiment is decided by multiplying the probability of each word occuring)

* Step 1: Calculate the prior probability for each class labels
* Step 2: Find likelihood probability with each attribute for each class
* Step 3: Put these value in Bayes Formula and calculate posterior probability
* Step 4: See which class has a higher probability, given the input belongs to the higher probability class.

# Table of contents 


1. Loading the dataset
<br>

2. Analysis 
<br>

3. Modelling
<br>

4. Evaluative results
<br>

5. Future scope

## Part 1: Loading the dataset

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split

import re

import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# import functions from helper.py
from helper import clean_text,find_freq,word_sentiment

The helper function(helper.py) consists of the following functions:

1. `clean_text` : The very first step in predict the sentiment of a given string(here, comments) is to tokenize it, i.e, represent it in a list of individual words present in the string. Also, words that don't add much meaning to the string can be removed(stop words) and the remaining words can be stemmed to generate the root words. To achieve this, in-built functions from **NLTK** have been used.


2. `find_freq` : We need to find frequency of each word occuring in the given dataset, this frequency is later used in the formula to compare it's occurences in different classes.


3. `word_sentiment` : This function gives the positive, negative and neutral words in the vocabulary that we built.

In [2]:
# Load stop words and function for stemming the data

stop_words = stopwords.words('english')
stemmer=PorterStemmer()

In [3]:
# Load the data

data=pd.read_csv("Reddit_Data.csv")

## Part 2: Analysis

In [4]:
data.head()

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37249 entries, 0 to 37248
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   clean_comment  37149 non-null  object
 1   category       37249 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 582.1+ KB


There are 100 categories, which don't have corresponding comment, we drop those rows.

In [6]:
data=data.dropna()

In [7]:
data.head()

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1


In [8]:
data.shape

(37149, 2)

In [9]:
data['category'].value_counts()

 1    15830
 0    13042
-1     8277
Name: category, dtype: int64

In [10]:
# positive sentiment

data.loc[data['category']==1, 'clean_comment'].iloc[0]

' family mormon have never tried explain them they still stare puzzled from time time like some kind strange creature nonetheless they have come admire for the patience calmness equanimity acceptance and compassion have developed all the things buddhism teaches '

In [11]:
# neutral sentiment

data.loc[data['category']==0, 'clean_comment'].iloc[0]

'what you have learned yours and only yours what you want teach different focus the goal not the wrapping paper buddhism can passed others without word about the buddha '

In [12]:
# negative sentiment

data.loc[data['category']==-1, 'clean_comment'].iloc[0]

'seriously don say thing first all they won get its too complex explain normal people anyway and they are dogmatic then doesn matter what you say see mechante post and for any reason you decide later life move from buddhism and that doesn suit you identity though you still get keep all the wisdom then your family will treat you like you went through weird hippy phase for while there didncha and you never hear the end pro tip don put one these your wall jpg '

An observation that can be made about the text is that it doesn't contain **capital letters** and punctuations or **user handles**. 

Unlike most of the models, naive bayes doesn't require encoding of y(here,category) variable, since it deals with discrete distribution.

## Part 3: Modelling

In [13]:
# splitting the data into train and validation set

X_train,X_val,y_train,y_val=train_test_split(data['clean_comment'],data['category'],test_size=0.2,
                                             stratify=data['category'],shuffle=True)

In [14]:
X_train.head()

5197                 could you tell bit about the crawler 
36685     city president the muslim rashtriya manch riy...
31018    bjp invest the max marketing than any other co...
24588    prediction 2019 aap wins least more state elec...
26571     know which neighbors went all maga ten years ...
Name: clean_comment, dtype: object

The training data consists strings. 
<br>
Given a string, the vocabulary returns a pair of tuple (word,string):frequency.
For example, consider a list of strings 
<br>`["it's  beautiful day", "life is beautiful"]`.

<br>The vocabulary would return
the following key valie pairs:
`{("beautiful",1):2 ("life",1):1 ("day",1):1}`

In [15]:
def vocabulary(voc,comments,ys):
    '''
    Input:
        voc: existing vocabulary 
        comments: the dataset containing comments
        ys: the label corresponding to each comment
    Output:
        voc: updated vocabulary(a dictionary containing (word,label) as the key 
                                and it's corresponding frequency as value)
        
        It is of the format : {(word,label):frequency}
    '''
    for y,comm in zip(ys,comments):
        # clean and tokenize each comm by passing through the helper function
        for word in clean_text(comm, stop_words, stemmer):
            pair=(word,y)
            
            # if pair already exists in vocabulary, increase frequency by 1
            if pair in voc:
                voc[pair]+=1
                
            #if the word is occuring for the first time, set the frequency to 1
            else:
                voc[pair]=1
    return voc

In [16]:
# get the vocabulary by passing training dataset

vocab=vocabulary({}, X_train, y_train)

In [17]:
vocab

{('could', 0): 85,
 ('tell', 0): 61,
 ('bit', 0): 19,
 ('crawler', 0): 1,
 ('citi', -1): 65,
 ('presid', -1): 98,
 ('muslim', -1): 414,
 ('rashtriya', -1): 3,
 ('manch', -1): 6,
 ('riyaz', -1): 1,
 ('khan', -1): 42,
 ('alleg', -1): 38,
 ('rss', -1): 112,
 ('bjp', -1): 1198,
 ('ignor', -1): 84,
 ('demand', -1): 43,
 ('claim', -1): 157,
 ('000', -1): 39,
 ('member', -1): 89,
 ('join', -1): 40,
 ('congress', -1): 546,
 ('along', -1): 41,
 ('functionari', -1): 1,
 ('invest', -1): 33,
 ('max', -1): 7,
 ('market', -1): 46,
 ('compani', -1): 78,
 ('come', -1): 420,
 ('predict', 1): 95,
 ('2019', 1): 192,
 ('aap', 1): 444,
 ('win', 1): 688,
 ('least', 1): 251,
 ('state', 1): 949,
 ('elect', 1): 960,
 ('seat', 1): 251,
 ('know', 0): 195,
 ('neighbor', 0): 5,
 ('went', 0): 28,
 ('maga', 0): 10,
 ('ten', 0): 24,
 ('year', 0): 154,
 ('amnesia', 0): 2,
 ('kick', 0): 9,
 ('gonna', 0): 52,
 ('make', 0): 207,
 ('big', 0): 64,
 ('poster', 0): 11,
 ('shame', 0): 22,
 ('vote', 0): 229,
 ('trump', 0): 88,

How does naive bayes predict classes?

* The objective is to predict the probability of a string( which essentially contains words).
* To do this, we calculate the probability of each string belonging to different classes, select the maximum and assign labels. This is given in (1).

$$P(pos|(w1,w2....wn)) = P(pos)*\prod_{i=1}^{n}P(w_{i}|pos)\tag{1} $$
$$P(neutr|(w1,w2....wn)) = P(neutr)*\prod_{i=1}^{n}P(w_{i}|neutr) $$
$$P(neg|(w1,w2....wn)) = P(neg)*\prod_{i=1}^{n}P(w_{i}|neg) $$



$$parameters(pos) =P(W|pos) = \frac{freq_{pos} + \alpha}{N_{pos} + \alpha*V }\tag{2} $$

$$parameters(neutr)= P(W|neutr) = \frac{freq_{neutr} + \alpha}{N_{neutr} + \alpha*V} $$

$$parameters(neg)=P(W|neg) = \frac{freq_{neg} + \alpha}{N_{neg} + \alpha*V} $$


where, 
1. `P(pos|(w1,w2....wn))`: denotes the probability of given string being positive
2. `P(pos)`: probability of occurence of positive class
3. `P(w_i|pos)`(parameters_pos): denotes the probability of word i being positive, which is calculated by (2) 
4. `N_pos`: total number of positive words for all comments in the dataset
5.  `freq_pos`: frequencies of that specific word in the positive
6. `alpha`: Laplacian smoothing parameter **(to avoid division by 0)**
7. `V`: length of vocabulary

In [18]:
# call the helper function to get a list of positive, neutral and negative words

pos_words= word_sentiment(vocab)[0]
neutr_words= word_sentiment(vocab)[1]
neg_words= word_sentiment(vocab)[2]

parameters_pos = {word:0 for word in pos_words}
parameters_neg = {word:0 for word in neg_words}
parameters_neutr = {word:0 for word in neutr_words}

In [19]:
len(parameters_pos)

11161

In [20]:
len(parameters_neutr)

15602

In [21]:
len(parameters_neg)

8588

In [22]:
def naive_bayes(vocab,X_train,y_train,alpha=1):  
    '''
    Input:
        vocab: vocabulary containing word, label and its frequency
        X_train: dataset containing comments
        y_train: label corresponding to each comment
        alpha: Laplacian smoothing (deafult=1)
    Output:
        parameters_pos,parameters_neg,parameters_neutr: calculations corresponding to each sentiment
                                                        (see formula)
    '''
    
    num_pos=num_neg=num_neutr=0
    # store the total number of each sentiment
    
    for key in vocab.keys():
        # get N_pos, N_neutr, N_neg
        if key[1]==1:
            num_pos+=vocab.get(key,0)
        elif key[1]==0:
            num_neutr+=vocab.get(key,0)
        else:
            num_neg+=vocab.get(key,0)
    
    for word in vocab_words:
        #get frequency of word belonging to the three classes
        pos_freq=find_freq(vocab,word,1)
        neutr_freq=find_freq(vocab,word,0)
        neg_freq=find_freq(vocab,word,-1)
        
        # calculate parameters(see formula)
        prob_pos=(pos_freq+alpha)/(num_pos+vocab_len*alpha)
        prob_neutr=(neutr_freq+alpha)/(num_neutr+vocab_len*alpha)
        prob_neg=(neg_freq+alpha)/(num_neg+vocab_len*alpha)
        
        parameters_pos[word]=prob_pos
        parameters_neg[word]=prob_neg
        parameters_neutr[word]=prob_neutr
        
    return parameters_pos,parameters_neg,parameters_neutr

In [55]:
parameters_pos,parameters_neg,parameters_neutr=naive_bayes(vocab,X_train,y_train,1)

In [56]:
len(parameters_neutr)

35351

In [57]:
parameters_pos

{'flac': 9.756835272910878e-06,
 'bahujanahitāya': 4.878417636455439e-06,
 'bahujanasukhāya': 4.878417636455439e-06,
 'lokānukampāya': 4.878417636455439e-06,
 'gautama': 4.878417636455439e-06,
 'vir': 7.317626454683159e-06,
 'sanghvi': 1.2196044091138599e-05,
 '611': 4.878417636455439e-06,
 'indefens': 7.317626454683159e-06,
 'inflam': 4.878417636455439e-06,
 'interfaith': 7.317626454683159e-06,
 'joy': 5.854101163746527e-05,
 'unquest': 4.878417636455439e-06,
 'hitwomen': 7.317626454683159e-06,
 'madraasa': 4.878417636455439e-06,
 'flyer': 9.756835272910878e-06,
 'homepag': 1.7074461727594038e-05,
 'dharmo': 4.878417636455439e-06,
 'rakshati': 4.878417636455439e-06,
 'rakshitaha': 4.878417636455439e-06,
 'geronimo': 4.878417636455439e-06,
 'percol': 7.317626454683159e-06,
 'prefac': 7.317626454683159e-06,
 'enlightenedcentr': 4.878417636455439e-06,
 'behest': 7.317626454683159e-06,
 'milch': 4.878417636455439e-06,
 'flatter': 7.317626454683159e-06,
 'edmund': 4.878417636455439e-06,
 '

In [58]:
# count the total number of each sentiment in the corpus 

pos_corpus=list(y_train).count(1)
neutr_corpus=list(y_train).count(0)
neg_corpus=list(y_train).count(-1)
total=pos_corpus+neg_corpus+neutr_corpus

The prior probability represents the underlying probability in the target population that a comment is positive, neutral or negative. In other words, if we had no specific information and blindly picked a comment out of the population set, what is the probability that it will belong to one specific class? That is the **"prior"**.

$$P(C_{pos}) = \frac{C_{pos}}{C}{\tag3}$$

Where,
1. `C_pos`: number of positive comments(pos_corpus)
2. `C`: total number of comments

In [59]:
# prior ratio calculation

prior_pos=pos_corpus/total
prior_neg=neg_corpus/total
prior_neutr=neutr_corpus/total

In [60]:
prior_pos

0.42612470136949426

In [61]:
prior_neutr

0.35105488071604024

In [62]:
prior_neg

0.2228204179144655

In [63]:
def predict_NB(comment):
    '''
    Input:
        comment: a string containing text
    Output:
        prob_pos,prob_neg,prob_neutr: probability of the input belonging to the three sentiments
    '''
    list_word=clean_text(comment,stop_words,stemmer)
    #prob_pos=prob_neg=prob_neutr=0
    prob_pos=(prior_pos)
    prob_neg=(prior_neg)
    prob_neutr=(prior_neutr)
    for word in list_word:
        #since essentially all the dic contain the same keys
        if word in  list(parameters_pos.keys()):
            prob_pos*=(parameters_pos[word])
            prob_neg*=(parameters_neg[word])
            prob_neutr*=(parameters_neutr[word])
    return prob_pos,prob_neutr,prob_neg

## Part 4: Evaluative results

### Testing on random strings

In [64]:
predict_NB('i am an amazing person') # the probability is highest for positive

(1.9664534576641236e-07, 7.576923559969156e-09, 1.950994940213455e-08)

In [65]:
predict_NB('feeling sad today') # the probability is highest for negative

(7.521018462641277e-11, 2.9158457955643073e-12, 8.46339140109764e-11)

In [66]:
predict_NB("it's a beautiful day") # the probability is highest for positive

(1.422067177746586e-07, 3.810110133013061e-09, 1.8095582068077885e-08)

In [71]:
# a wrong classification
predict_NB("terror attacks on the city") # the probability is highest for positive 

(7.543489378369466e-11, 2.984525932073635e-12, 3.036430438832738e-11)

In [72]:
def test_NB(X_val,y_val):
    '''
    Input:
        X_val: dataset containing comments to be used for testing the model
        y_val: dataset containing labels to be used for testing the model
    Output:
        accuracy: accuracy of the model
    '''
    accuracy=0 #initialise accuracy to zero
    y_pred=[]  #prediction
    
    for comment in X_val:
        # get probabilities corresponding to each class
        pos,neg,neutr=predict_NB(comment)
        
        if(pos>neg) and (pos>neutr): # given comment is positive
            y_pred_i=1
        elif(neg>pos) and (neg>neutr): # given comment is negative
            y_pred_i=-1
        elif(neutr>neg) and (neutr>pos): # given comment is neutral
            y_pred_i=0
        y_pred.append(y_pred_i)
    
    # error is the mean of the differences between y_actual(y_val) and y_pred
    error=np.abs(np.mean(y_val-y_pred))
    accuracy=1-error
    return accuracy 

In [73]:
print("Naive Bayes accuracy = %0.4f" %(test_NB(X_val, y_val)))

Naive Bayes accuracy = 0.5849


The accuracy that we get is around 60%, which is good, since it's a basic model and it doesn't take into account the semantic relationship between words.

## Part 5: Future scope

* Different smoothing methods can be tried( we have done Lalplacian, for which alpha=1, Lidstone smoothing, alpha<1)
* Resampling of model due to skewed distribution of different classes

Resources used for theory and building the model: 
* https://towardsdatascience.com/introduction-to-na%C3%AFve-bayes-classifier-fa59e3e24aaf 
* https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html
* https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn