
# Sentiment Prediction using Naive Bayes Classifier (NBC)
This notebook shows how a Naive Bayes classifier (NBC) works and how it can be used to classify text by sentiment.

We will also see how well it copes with missing data, that is, words the classifier did not see during training.

## Settings
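`training_ratio` is the fraction of the data used for training; the remainder is divided evenly between the development and test sets.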

In [1]:
training_ratio = 80 / 100

## Importing the Data
We used the [kaggle dataset](https://storage.googleapis.com/kagglesdsdata/datasets/22169/30047/sentiment%20labelled%20sentences/imdb_labelled.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210425%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210425T202010Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=6133706ef10bc2dcd0b58f8398b4d73ab9e9d788de1718b07334df91f6007e1e4ca0b78e3176f95b8250e0c4535ce1633528f4fabffeb7e4124af3ee3f895ac34c03044fca9b23b23c4ddb8fa90d84dfc14869ff4806f03783cafad53b19445b3c3052983fdf1ca4384257eac1bc0a4270d238a1ea89d1289866c7a0ea7ad7c97a76f2e142c148019e39cc5a1295f92650747ac5ea5946b026f7ad6d5d262d4c4a370aee6bc1f5d5b445bb6d93692debe678a79e5e1c1fe3d3e68ea4f2fad3115795d3361e0626e98156fbc7f5967beb7cf0f00e07351d23a00d8677ebb75e3e13b1bfa07762266efabf6f6f9d53206be31b7623cf3614f60f8cf5011cf23def) to get the ground truth of sample IMDB reviews.

In [2]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import display
import math

data = pd.read_csv(
    r"http://storage.googleapis.com/kagglesdsdata/datasets/22169/30047/sentiment%20labelled%20sentences/imdb_labelled.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210425%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210425T202010Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=6133706ef10bc2dcd0b58f8398b4d73ab9e9d788de1718b07334df91f6007e1e4ca0b78e3176f95b8250e0c4535ce1633528f4fabffeb7e4124af3ee3f895ac34c03044fca9b23b23c4ddb8fa90d84dfc14869ff4806f03783cafad53b19445b3c3052983fdf1ca4384257eac1bc0a4270d238a1ea89d1289866c7a0ea7ad7c97a76f2e142c148019e39cc5a1295f92650747ac5ea5946b026f7ad6d5d262d4c4a370aee6bc1f5d5b445bb6d93692debe678a79e5e1c1fe3d3e68ea4f2fad3115795d3361e0626e98156fbc7f5967beb7cf0f00e07351d23a00d8677ebb75e3e13b1bfa07762266efabf6f6f9d53206be31b7623cf3614f60f8cf5011cf23def",
    delimiter="\t", header=None, names=["IMDB Review", "Sentiment"])
data = data.sample(frac=1).reset_index(drop=True)

### Split Data
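We split the shuffled data into training, development, and test sets: the first `training_ratio` share of the rows is used for training, and the rest is split evenly between development and test.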

In [3]:
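# .copy() makes train an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
train = data[:math.floor(data.shape[0] * training_ratio)].copy()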

In [4]:
validation = data[math.floor(data.shape[0] * training_ratio):].sample(frac=1).reset_index(drop=True)
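# Copy each half so that dev and test are independent DataFrames rather than slices of validation
dev, test = (part.copy() for part in np.array_split(validation, 2))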
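
The split above can be sketched without pandas as follows. This is a minimal illustration rather than the notebook's code: `split_dataset` and its `seed` parameter are hypothetical names, and the 748-row total is an assumption chosen to be consistent with the 598 training rows reported in the output of the next cell. A fixed seed makes the shuffle repeatable, which `sample(frac=1)` does not guarantee unless a `random_state` is passed.

```python
import math
import random


def split_dataset(rows, training_ratio=0.8, seed=0):
    """Shuffle rows reproducibly, then split them into train, dev and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the same split comes back every run
    cut = math.floor(len(shuffled) * training_ratio)
    train, held_out = shuffled[:cut], shuffled[cut:]
    # The first half receives the extra row when the count is odd, as np.array_split does.
    middle = math.ceil(len(held_out) / 2)
    return train, held_out[:middle], held_out[middle:]


# 748 rows is an assumed dataset size, used only for illustration.
train_rows, dev_rows, test_rows = split_dataset(range(748))
print(len(train_rows), len(dev_rows), len(test_rows))  # 598 75 75
```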

In [5]:
train, dev, test

(                                           IMDB Review  Sentiment
 0    The script is bad, very bad  it contains both...          0
 1    Both films are terrible, but to the credit of ...          0
 2    To be honest with you, this is unbelievable no...          0
 3           Both do good jobs and are quite amusing.            1
 4    I highly doubt that anyone could ever like thi...          0
 ..                                                 ...        ...
 593                              Very disappointing.            0
 594  I like Armand Assante & my cable company's sum...          1
 595                            It is a true classic.            1
 596                               Brilliance indeed.            1
 597                 I loved it, it was really scary.            1
 
 [598 rows x 2 columns],
                                           IMDB Review  Sentiment
 0                              I really liked that.            1
 1   The plot is nonsense that doesn'

## Generating the Vocabulary List

In [6]:
def split_words(review):
    # Lowercase, drop punctuation and possessive 's, turn '-' and '/' into spaces, then split on whitespace.
    text = review.lower()
    for token in (',', '"', '(', ')', '\'s', '.', '!'):
        text = text.replace(token, '')
    for token in ('-', '/'):
        text = text.replace(token, ' ')
    return text.split()


def get_word_count(review_data_frame: pd.DataFrame, column_name: str):
    vocab = review_data_frame["IMDB Review"].apply(lambda review: pd.value_counts(
        split_words(review))).sum(axis=0).to_frame()
    vocab.columns = [column_name]
    vocab.reset_index(inplace=True)
    vocab = vocab.rename(columns={'index': 'Word'})
    return vocab
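
The counting step can also be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the notebook's implementation: `count_words` is a hypothetical helper, the two example reviews are made up, and `split_words` is repeated here only so that the block runs on its own.

```python
from collections import Counter


def split_words(review):
    # Same cleaning rules as the notebook's split_words, repeated to keep this block self-contained.
    text = review.lower()
    for token in (',', '"', '(', ')', "'s", '.', '!'):
        text = text.replace(token, '')
    for token in ('-', '/'):
        text = text.replace(token, ' ')
    return text.split()


def count_words(reviews):
    """Return a Counter mapping each word to its total number of occurrences."""
    counts = Counter()
    for review in reviews:
        counts.update(split_words(review))
    return counts


# Made-up reviews, for illustration only.
counts = count_words(["Very, very bad.", "Not bad - quite good!"])
print(counts["very"], counts["bad"], counts["good"])  # 2 2 1
```

A `Counter` returns 0 for words it has never seen, which avoids the NaN entries that a left merge produces for words missing from one class.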

In [7]:
def generate_naive_bayes_parameters(data_frame: pd.DataFrame, smoothening: bool):
    naive_bayes_parameters = get_word_count(data_frame, "Word Frequency")
    if smoothening:
        naive_bayes_parameters["Word Frequency"] += 1

    total_words = naive_bayes_parameters["Word Frequency"].sum(axis=0)
    if smoothening:
        total_words += 2

    total_sentiments = data_frame.count(axis=0)['Sentiment']
    if smoothening:
        total_sentiments += 2

    naive_bayes_parameters['P(Word)'] = naive_bayes_parameters["Word Frequency"].div(total_words)

    positive_sentiments = data_frame[data_frame['Sentiment'] == 1]
    positive_vocabulary = get_word_count(positive_sentiments, "Positive Sentiment Word Frequency")
    naive_bayes_parameters = naive_bayes_parameters.merge(positive_vocabulary, how='left', on='Word')
    if smoothening:
        naive_bayes_parameters["Positive Sentiment Word Frequency"] += 1
        naive_bayes_parameters["Positive Sentiment Word Frequency"] = naive_bayes_parameters[
            "Positive Sentiment Word Frequency"].fillna(
            value=1)

    # Despite its name, this is the number of positive reviews, not the number of words in them.
    total_positive_words = positive_sentiments.count(axis=0)['Sentiment']
    if smoothening:
        total_positive_words += 2

    probability_of_positive_sentiments = total_positive_words / total_sentiments
    naive_bayes_parameters['P(Sentiment = Positive)'] = probability_of_positive_sentiments

    naive_bayes_parameters['P(Word | Sentiment = Positive)'] = naive_bayes_parameters[
        'Positive Sentiment Word Frequency'].div(
        total_positive_words)

    negative_sentiments = data_frame[data_frame['Sentiment'] == 0]
    negative_vocabulary = get_word_count(negative_sentiments, "Negative Sentiment Word Frequency")
    naive_bayes_parameters = naive_bayes_parameters.merge(negative_vocabulary, how='left', on='Word')
    if smoothening:
        naive_bayes_parameters["Negative Sentiment Word Frequency"] += 1
        naive_bayes_parameters["Negative Sentiment Word Frequency"] = naive_bayes_parameters[
            "Negative Sentiment Word Frequency"].fillna(
            value=1)

    # Despite its name, this is the number of negative reviews, not the number of words in them.
    total_negative_words = negative_sentiments.count(axis=0)['Sentiment']
    if smoothening:
        total_negative_words += 2

    probability_of_negative_sentiments = total_negative_words / total_sentiments
    naive_bayes_parameters['P(Sentiment = Negative)'] = probability_of_negative_sentiments

    naive_bayes_parameters['P(Word | Sentiment = Negative)'] = naive_bayes_parameters[
        'Negative Sentiment Word Frequency'].div(
        total_negative_words)

    return naive_bayes_parameters
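
For comparison, here is a minimal sketch of the textbook add-one (Laplace) estimate, P(w | c) = (count(w, c) + alpha) / (N_c + alpha * |V|), where N_c is the number of word occurrences in class c and |V| is the vocabulary size. It is not the function above: that function divides word counts by the number of reviews in the class, whereas this variant divides by word occurrences plus the vocabulary size, so the probabilities within each class sum to 1. The name `laplace_likelihoods` and the toy counts are assumptions made for illustration.

```python
from collections import Counter


def laplace_likelihoods(class_counts, vocabulary, alpha=1.0):
    """Estimate P(word | class) with add-alpha smoothing over a fixed vocabulary."""
    total = sum(class_counts[word] for word in vocabulary)  # word occurrences in this class
    denominator = total + alpha * len(vocabulary)
    return {word: (class_counts[word] + alpha) / denominator for word in vocabulary}


# Toy counts, for illustration only.
positive_counts = Counter({"good": 3, "great": 2})
negative_counts = Counter({"bad": 4, "good": 1})
vocab = sorted(set(positive_counts) | set(negative_counts))

p_positive = laplace_likelihoods(positive_counts, vocab)
p_negative = laplace_likelihoods(negative_counts, vocab)
print(p_positive["bad"])  # 0.125: unseen in positive reviews, yet non-zero
print(sum(p_positive.values()), sum(p_negative.values()))
```

With this form the class priors can stay unsmoothed, for example the number of positive reviews divided by the number of all reviews, so the two priors still sum to 1.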

In [8]:
vocabulary = generate_naive_bayes_parameters(train, False)
vocabulary

| | Word | Word Frequency | P(Word) | Positive Sentiment Word Frequency | P(Sentiment = Positive) | P(Word \| Sentiment = Positive) | Negative Sentiment Word Frequency | P(Sentiment = Negative) | P(Word \| Sentiment = Negative) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bad | 65.0 | 0.005635 | 7.0 | 0.503344 | 0.023256 | 58.0 | 0.496656 | 0.195286 |
| 1 | and | 336.0 | 0.029129 | 188.0 | 0.503344 | 0.624585 | 148.0 | 0.496656 | 0.498316 |
| 2 | in | 150.0 | 0.013004 | 81.0 | 0.503344 | 0.269103 | 69.0 | 0.496656 | 0.232323 |
| 3 | very | 50.0 | 0.004335 | 21.0 | 0.503344 | 0.069767 | 29.0 | 0.496656 | 0.097643 |
| 4 | normally | 1.0 | 0.000087 | NaN | 0.503344 | NaN | 1.0 | 0.496656 | 0.003367 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2682 | twice | 1.0 | 0.000087 | 1.0 | 0.503344 | 0.003322 | NaN | 0.496656 | NaN |
| 2683 | cable | 1.0 | 0.000087 | 1.0 | 0.503344 | 0.003322 | NaN | 0.496656 | NaN |
| 2684 | sounded | 1.0 | 0.000087 | 1.0 | 0.503344 | 0.003322 | NaN | 0.496656 | NaN |
| 2685 | company | 1.0 | 0.000087 | 1.0 | 0.503344 | 0.003322 | NaN | 0.496656 | NaN |


In [9]:
def get_probabilities(review: str, sentiment: bool, smoothening: bool):
    # Returns P(Sentiment) multiplied by P(Word | Sentiment) for every word of the review.
    prob = 1
    column_name = 'P(Word | Sentiment = Positive)' if sentiment else 'P(Word | Sentiment = Negative)'
    prior = vocabulary['P(Sentiment = Positive)'][0] if sentiment else vocabulary['P(Sentiment = Negative)'][0]
    # Fallback used when the first word of the review is not in the vocabulary.
    individual_prob = 0 if not smoothening else 1 / prior
    for word in split_words(review):
        # Look the word up in the 'Word' column only, not in every cell of the table.
        if word in vocabulary['Word'].values:
            individual_prob = vocabulary[vocabulary['Word'] == word].iloc[0][column_name]
        # Note: a later out-of-vocabulary word reuses individual_prob from the previous iteration.
        prob *= 0 if math.isnan(individual_prob) else individual_prob
    return prob * prior
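
Multiplying many small probabilities can underflow to 0.0 in floating point, and once both class scores are 0.0 they can no longer be compared. A common remedy is to add logarithms instead of multiplying. The sketch below shows the idea under stated assumptions: `product_score` and `log_score` are hypothetical helpers, the likelihoods are made up, and every likelihood must be strictly positive (for example after smoothing) for the logarithm to be defined.

```python
import math


def product_score(words, prior, likelihoods):
    """Multiply the prior by each word likelihood; prone to underflow on long inputs."""
    score = prior
    for word in words:
        score *= likelihoods[word]
    return score


def log_score(words, prior, likelihoods):
    """Sum of logs: same ranking as the product, without underflow."""
    return math.log(prior) + sum(math.log(likelihoods[word]) for word in words)


# Made-up values: one word repeated 200 times, twice as likely under the positive class.
review = ["word"] * 200
positive = {"word": 2e-3}
negative = {"word": 1e-3}

print(product_score(review, 0.5, positive), product_score(review, 0.5, negative))  # 0.0 0.0
print(log_score(review, 0.5, positive) > log_score(review, 0.5, negative))  # True
```

Both products collapse to 0.0, while the log scores stay finite and still rank the positive class higher.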

In [10]:
def predict_calculate_accuracy(data_frame: pd.DataFrame):
    # Score every review under both sentiments; the smoothening flag is passed as False in both calls.
    data_frame["P(Sentiment = Positive | Sentence)"] = data_frame["IMDB Review"].apply(
        lambda review: get_probabilities(review, True, False))
    data_frame["P(Sentiment = Negative | Sentence)"] = data_frame["IMDB Review"].apply(
        lambda review: get_probabilities(review, False, False))
    # Predict positive only when its score is strictly larger; ties (including 0.0 vs 0.0) count as negative.
    data_frame["Predicted sentiment"] = data_frame["P(Sentiment = Positive | Sentence)"] > data_frame[
        "P(Sentiment = Negative | Sentence)"]
    # The label below reads "Train Accuracy" for whichever split is passed in.
    print("Train Accuracy: ",
          data_frame.loc[data_frame["Predicted sentiment"] == data_frame["Sentiment"]].count(axis=0)[
              'Sentiment'] * 100 /
          data_frame.count(axis=0)['Sentiment'])
    print("Wrong Predictions:")
    display(data_frame.loc[data_frame["Predicted sentiment"] != data_frame["Sentiment"]].reset_index(drop=True))
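
The decision rule and the accuracy figure can be sketched in plain Python. This is an illustration only: `accuracy` is a hypothetical helper and the scores and labels are made up. It shows that with a strict `>` comparison, a review whose two scores tie at 0.0 is always predicted negative.

```python
def accuracy(positive_scores, negative_scores, labels):
    """Return (accuracy in percent, predictions) using a strict 'positive > negative' rule."""
    predictions = [p > n for p, n in zip(positive_scores, negative_scores)]
    correct = sum(int(prediction) == label for prediction, label in zip(predictions, labels))
    return 100 * correct / len(labels), predictions


# Made-up scores: the second review ties at 0.0 and is therefore predicted negative.
percent, predictions = accuracy([0.2, 0.0, 1e-9], [0.1, 0.0, 2e-9], [1, 1, 0])
print(predictions)  # [True, False, False]
```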

In [11]:
predict_calculate_accuracy(train)


Train Accuracy:  98.16053511705685
Wrong Predictions:



| | IMDB Review | Sentiment | P(Sentiment = Positive \| Sentence) | P(Sentiment = Negative \| Sentence) | Predicted sentiment |
|---|---|---|---|---|---|
| 0 | If you have not seen this movie, I definitely ... | 1 | 4.182146e-13 | 1.26003e-12 | False |
| 1 | Predictable, but not a bad watch. | 1 | 5.660326e-09 | 3.2285e-07 | False |
| 2 | The aerial scenes were well-done. | 1 | 1.351386e-09 | 1.559133e-09 | False |
| 3 | This if the first movie I've given a 10 to in ... | 1 | 8.040586e-14 | 9.073169e-14 | False |
| 4 | Go rent it. | 1 | 2.724283e-05 | 2.934654e-05 | False |
| 5 | The directing and the cinematography aren't qu... | 0 | 4.14528e-11 | 1.764361e-11 | True |
| 6 | The result is a film that just don't look righ... | 0 | 2.719094e-11 | 1.411362e-11 | True |
| 7 | I don't think you will be disappointed. | 1 | 1.584102e-10 | 1.897022e-10 | False |
| 8 | With great sound effects, and impressive spec... | 1 | 0.0 | 0.0 | False |
| 9 | But this movie really got to me. | 1 | 2.561789e-08 | 4.17011e-08 | False |


In [12]:
predict_calculate_accuracy(dev)

Train Accuracy:  58.666666666666664
Wrong Predictions:


| | IMDB Review | Sentiment | P(Sentiment = Positive \| Sentence) | P(Sentiment = Negative \| Sentence) | Predicted sentiment |
|---|---|---|---|---|---|
| 0 | There were too many close ups. | 0 | 9.388699e-12 | 0.0 | True |
| 1 | Not too screamy not to masculine but just righ... | 1 | 3.018647e-12 | 1.045193e-10 | False |
| 2 | add betty white and jean smart and you have a ... | 1 | 0.0 | 0.0 | False |
| 3 | This is high adventure at its best. | 1 | 0.0 | 3.396281e-11 | False |
| 4 | but the movie makes a lot of serious mistakes. | 0 | 2.087786e-11 | 0.0 | True |
| 5 | Simply beautiful. | 1 | 5.555617e-05 | 5.63044e-05 | False |
| 6 | Highly entertaining at all angles. | 1 | 0.0 | 0.0 | False |
| 7 | I think it was Robert Ryans best film, because... | 1 | 0.0 | 0.0 | False |
| 8 | See it with your kids if you have a chance--it... | 1 | 0.0 | 0.0 | False |
| 9 | About ten minutes into this film I started hav... | 0 | 1.585137e-19 | 0.0 | True |


In [13]:
predict_calculate_accuracy(test)

Train Accuracy:  60.0
Wrong Predictions:


| | IMDB Review | Sentiment | P(Sentiment = Positive \| Sentence) | P(Sentiment = Negative \| Sentence) | Predicted sentiment |
|---|---|---|---|---|---|
| 0 | Hayao Miyazaki's latest and eighth film for St... | 1 | 0.0 | 0.0 | False |
| 1 | I have to say that this film was excellently p... | 1 | 0.0 | 0.0 | False |
| 2 | This movie is excellent!Angel is beautiful and... | 1 | 0.0 | 0.0 | False |
| 3 | But the duet between the astronaut and his doc... | 1 | 0.0 | 0.0 | False |
| 4 | It really created a unique feeling though. | 1 | 1.975542e-09 | 2.262988e-09 | False |
| 5 | Enough can not be said of the remarkable anima... | 1 | 0.0 | 0.0 | False |
| 6 | Personally, I think it shows that people shoul... | 1 | 0.0 | 0.0 | False |
| 7 | I saw this movie and I thought this is a stupi... | 0 | 2.155594e-10 | 9.670659e-11 | True |
| 8 | Not much dialogue, not much music, the whole f... | 1 | 8.277893e-21 | 1.887405e-18 | False |
| 9 | If there was ever a movie that needed word-of-... | 1 | 0.0 | 6.947341e-16 | False |


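## Smoothing
The parameters are rebuilt with add-one smoothing, so a word that never appears with a given sentiment gets a count of 1 instead of a missing (NaN) probability.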

In [14]:
vocabulary = generate_naive_bayes_parameters(train, True)
vocabulary

| | Word | Word Frequency | P(Word) | Positive Sentiment Word Frequency | P(Sentiment = Positive) | P(Word \| Sentiment = Positive) | Negative Sentiment Word Frequency | P(Sentiment = Negative) | P(Word \| Sentiment = Negative) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | bad | 66.0 | 0.004640 | 8.0 | 0.505 | 0.026403 | 59.0 | 0.498333 | 0.197324 |
| 1 | and | 337.0 | 0.023692 | 189.0 | 0.505 | 0.623762 | 149.0 | 0.498333 | 0.498328 |
| 2 | in | 151.0 | 0.010616 | 82.0 | 0.505 | 0.270627 | 70.0 | 0.498333 | 0.234114 |
| 3 | very | 51.0 | 0.003585 | 22.0 | 0.505 | 0.072607 | 30.0 | 0.498333 | 0.100334 |
| 4 | normally | 2.0 | 0.000141 | 1.0 | 0.505 | 0.003300 | 2.0 | 0.498333 | 0.006689 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2682 | twice | 2.0 | 0.000141 | 2.0 | 0.505 | 0.006601 | 1.0 | 0.498333 | 0.003344 |
| 2683 | cable | 2.0 | 0.000141 | 2.0 | 0.505 | 0.006601 | 1.0 | 0.498333 | 0.003344 |
| 2684 | sounded | 2.0 | 0.000141 | 2.0 | 0.505 | 0.006601 | 1.0 | 0.498333 | 0.003344 |
| 2685 | company | 2.0 | 0.000141 | 2.0 | 0.505 | 0.006601 | 1.0 | 0.498333 | 0.003344 |


In [15]:
predict_calculate_accuracy(train)


Train Accuracy:  96.82274247491638
Wrong Predictions:



| | IMDB Review | Sentiment | P(Sentiment = Positive \| Sentence) | P(Sentiment = Negative \| Sentence) | Predicted sentiment |
|---|---|---|---|---|---|
| 0 | If you have not seen this movie, I definitely ... | 1 | 8.444458e-13 | 2.960232e-12 | False |
| 1 | Predictable, but not a bad watch. | 1 | 1.106212e-08 | 4.495441e-07 | False |
| 2 | This movie is so awesome! | 1 | 1.251944e-05 | 1.594856e-05 | False |
| 3 | I struggle to find anything bad to say about i... | 1 | 2.307448e-13 | 1.921423e-12 | False |
| 4 | The film's dialogue is natural, real to life. | 1 | 2.268493e-10 | 1.09825e-09 | False |
| 5 | Lifetime does not air it enough, so if anyone ... | 1 | 4.100805e-34 | 4.422195e-34 | False |
| 6 | The aerial scenes were well-done. | 1 | 4.465546e-09 | 4.736441e-09 | False |
| 7 | It's the one movie that never ceases to intere... | 1 | 9.882571e-33 | 1.040557e-32 | False |
| 8 | This if the first movie I've given a 10 to in ... | 1 | 2.725678e-13 | 3.480304e-13 | False |
| 9 | Go rent it. | 1 | 4.727205e-05 | 4.847075e-05 | False |


In [16]:
predict_calculate_accuracy(dev)

Train Accuracy:  68.0
Wrong Predictions:


| | IMDB Review | Sentiment | P(Sentiment = Positive \| Sentence) | P(Sentiment = Negative \| Sentence) | Predicted sentiment |
|---|---|---|---|---|---|
| 0 | There were too many close ups. | 0 | 4.440173e-11 | 1.506422e-11 | True |
| 1 | Not too screamy not to masculine but just righ... | 1 | 6.381754e-12 | 2.013268e-10 | False |
| 2 | add betty white and jean smart and you have a ... | 1 | 0.0 | 0.0 | False |
| 3 | This is high adventure at its best. | 1 | 9.736793e-11 | 1.889741e-10 | False |
| 4 | but the movie makes a lot of serious mistakes. | 0 | 8.276208e-11 | 9.786082e-12 | True |
| 5 | Simply beautiful. | 1 | 9.90099e-05 | 0.0001003344 | False |
| 6 | Highly entertaining at all angles. | 1 | 1.03691e-09 | 1.321232e-09 | False |
| 7 | I think it was Robert Ryans best film, because... | 1 | 1.369013e-58 | 3.192815e-56 | False |
| 8 | See it with your kids if you have a chance--it... | 1 | 3.991769e-43 | 3.778282e-42 | False |
| 9 | Kathy Bates is wonderful in her characters sub... | 1 | 0.0 | 0.0 | False |


In [17]:
predict_calculate_accuracy(test)

Train Accuracy:  66.66666666666667
Wrong Predictions:


| | IMDB Review | Sentiment | P(Sentiment = Positive \| Sentence) | P(Sentiment = Negative \| Sentence) | Predicted sentiment |
|---|---|---|---|---|---|
| 0 | Hayao Miyazaki's latest and eighth film for St... | 1 | 0.0 | 0.0 | False |
| 1 | This movie is excellent!Angel is beautiful and... | 1 | 8.744971e-70 | 2.888945e-66 | False |
| 2 | Enough can not be said of the remarkable anima... | 1 | 3.052101e-13 | 2.604072e-12 | False |
| 3 | Personally, I think it shows that people shoul... | 1 | 0.0 | 0.0 | False |
| 4 | I saw this movie and I thought this is a stupi... | 0 | 4.284981e-10 | 2.659799e-10 | True |
| 5 | A cheap and cheerless heist movie with poor ch... | 0 | 9.42484e-64 | 1.956442e-66 | True |
| 6 | Not much dialogue, not much music, the whole f... | 1 | 5.981347e-20 | 6.483328e-18 | False |
| 7 | If there was ever a movie that needed word-of-... | 1 | 2.376859e-17 | 4.342272e-15 | False |
| 8 | The movie is full of wonderful dancing (hence ... | 1 | 9.583241e-11 | 2.682054e-10 | False |
| 9 | Feelings, thoughts...Gabriel's discomfort duri... | 1 | 0.0 | 0.0 | False |
