<a href="https://colab.research.google.com/github/VighneshS/sentiment_prediction/blob/master/sentiment_prediction.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://vighnesh-studies.blogspot.com/2021/04/sentiment-prediction-using-naive-bayes.html" target="_blank">BLOG</a>

# Sentiment Prediction using Naive Bayes Classifier (NBC)
This notebook walks through how a Naive Bayes Classifier (NBC) works and how it can be used to classify text by sentiment.

We will also see how smoothing helps it cope with words that are missing from the training data.

## Settings
Training percentage, number of cross-validation folds, and how many of the most useful words to list.

In [240]:
training_ratio = 80 / 100
k = 5
most_useful_limit = 20

## Importing the Data
We used the [kaggle dataset](https://storage.googleapis.com/kagglesdsdata/datasets/22169/30047/sentiment%20labelled%20sentences/imdb_labelled.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210425%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210425T202010Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=6133706ef10bc2dcd0b58f8398b4d73ab9e9d788de1718b07334df91f6007e1e4ca0b78e3176f95b8250e0c4535ce1633528f4fabffeb7e4124af3ee3f895ac34c03044fca9b23b23c4ddb8fa90d84dfc14869ff4806f03783cafad53b19445b3c3052983fdf1ca4384257eac1bc0a4270d238a1ea89d1289866c7a0ea7ad7c97a76f2e142c148019e39cc5a1295f92650747ac5ea5946b026f7ad6d5d262d4c4a370aee6bc1f5d5b445bb6d93692debe678a79e5e1c1fe3d3e68ea4f2fad3115795d3361e0626e98156fbc7f5967beb7cf0f00e07351d23a00d8677ebb75e3e13b1bfa07762266efabf6f6f9d53206be31b7623cf3614f60f8cf5011cf23def) to get the ground truth of sample IMDB reviews.

In [241]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import display
import math
from sklearn.model_selection import KFold

In [242]:
data = pd.read_csv(
    r"http://storage.googleapis.com/kagglesdsdata/datasets/22169/30047/sentiment%20labelled%20sentences/imdb_labelled.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210425%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210425T202010Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=6133706ef10bc2dcd0b58f8398b4d73ab9e9d788de1718b07334df91f6007e1e4ca0b78e3176f95b8250e0c4535ce1633528f4fabffeb7e4124af3ee3f895ac34c03044fca9b23b23c4ddb8fa90d84dfc14869ff4806f03783cafad53b19445b3c3052983fdf1ca4384257eac1bc0a4270d238a1ea89d1289866c7a0ea7ad7c97a76f2e142c148019e39cc5a1295f92650747ac5ea5946b026f7ad6d5d262d4c4a370aee6bc1f5d5b445bb6d93692debe678a79e5e1c1fe3d3e68ea4f2fad3115795d3361e0626e98156fbc7f5967beb7cf0f00e07351d23a00d8677ebb75e3e13b1bfa07762266efabf6f6f9d53206be31b7623cf3614f60f8cf5011cf23def",
    delimiter="\t", header=None, names=["IMDB Review", "Sentiment"])
data = data.sample(frac=1).reset_index(drop=True)

### Split Data
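We keep the first 80% (`training_ratio`) of the shuffled data for training. The remaining 20% is shuffled again and split in half into development (dev) and test sets.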

In [243]:
train = data[:math.floor(data.shape[0] * training_ratio)]

In [244]:
validation = data[math.floor(data.shape[0] * training_ratio):].sample(frac=1).reset_index(drop=True)
dev, test = np.array_split(validation, 2)

In [245]:
display(train, dev, test)

Unnamed: 0,IMDB Review,Sentiment
0,It has northern humour and positive about the ...,1
1,"But it is entertaining, nonetheless.",1
2,The result is a film that just don't look righ...,0
3,Highly unrecommended.,0
4,"Technically, the film is well made with impres...",1
...,...,...
593,I am a fan of his ... This movie sucked really...,0
594,Nothing at all to recommend.,0
595,It is a true classic.,1
596,It's the one movie that never ceases to intere...,1


Unnamed: 0,IMDB Review,Sentiment
0,"This is a bad film, with bad writing, and good...",0
1,Lot of holes in the plot: there's nothing abou...,0
2,Very disappointed and wondered how it could be...,0
3,If you want a real scare rent this one!,1
4,"A Lassie movie which should have been ""put to ...",0
...,...,...
70,"To be honest with you, this is unbelievable no...",0
71,The film looks cheap and bland.,0
72,It's a shame to see good actors like Thomerson...,0
73,I could not stand to even watch it for very lo...,0


Unnamed: 0,IMDB Review,Sentiment
75,"A very good film indeed, about great and uncon...",1
76,This is truly an art movie--it actually has a ...,1
77,The interplay between Martin and Emilio contai...,1
78,His losing his marbles so early in the proceed...,0
79,"In fact, this stinker smells like a direct-to-...",0
...,...,...
145,there is no real plot.,0
146,It was so BORING!,0
147,It is zillion times away from reality.,0
148,The music in the film is really nice too.,1


## Generation of Vocabulary list

In [246]:
def split_words(review):
    # Lowercase, drop punctuation and the possessive 's, and treat '-' and '/' as word breaks.
    cleaned = review.lower()
    for token in (',', '"', '(', ')', '\'s', '.', '!'):
        cleaned = cleaned.replace(token, '')
    for token in ('-', '/'):
        cleaned = cleaned.replace(token, ' ')
    return cleaned.split()


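def get_word_count(review_data_frame: pd.DataFrame, column_name: str):
    # One row per review, one column per word; a cell is NaN when the word is absent.
    # count(axis=0) therefore gives the number of reviews that contain each word.
    vocab = review_data_frame["IMDB Review"].apply(
        lambda review: pd.Series(split_words(review)).value_counts()).count(axis=0).to_frame()
    vocab.columns = [column_name]
    vocab.reset_index(inplace=True)
    vocab = vocab.rename(columns={'index': 'Word'})
    return vocab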
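
The counting step can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: `tokenize` is a simplified, hypothetical stand-in for `split_words`, and the three reviews are made up. It shows that each word is counted once per review that contains it, not once per occurrence.

```python
from collections import Counter


def tokenize(review):
    # Simplified stand-in for split_words (assumption): drop a few punctuation marks.
    for ch in ',."!()':
        review = review.replace(ch, '')
    return review.lower().split()


def document_frequency(reviews):
    """Count how many reviews contain each word (not total occurrences)."""
    counts = Counter()
    for review in reviews:
        # set() makes a repeated word count only once per review.
        counts.update(set(tokenize(review)))
    return counts


# Made-up reviews for illustration only.
toy_reviews = ["Good, good film.", "Bad film!", "A good plot."]
doc_freq = document_frequency(toy_reviews)
print(doc_freq["good"], doc_freq["film"], doc_freq["bad"])
```

Here "good" appears three times in total but in only two reviews, so its count is 2.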

## Get Naive Bayes Parameters
Here we have a function to generate the Naive Bayes parameters:

1. Word Frequency
2. P(Word)
3. Positive Sentiment Word Frequency
4. P(Sentiment = Positive)
5. P(Word | Sentiment = Positive)
6. Negative Sentiment Word Frequency
7. P(Sentiment = Negative)
8. P(Word | Sentiment = Negative)

These are used to compute:

**P(Sentiment | Sentence (Collection of words)) = P(Sentence | Sentiment) * P(Sentiment) / P(Sentence)**

P(Sentence) can be ignored: it is the same for both sentiments, so it cancels out when the two posteriors are compared.
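
To see why the denominator drops out of the comparison, the inequality can be written out. This is a short derivation in which S stands for the sentence and + / - for the positive and negative sentiment (notation introduced here for brevity); it relies only on P(S) > 0.

```latex
\begin{aligned}
P(+ \mid S) > P(- \mid S)
&\iff \frac{P(S \mid +)\,P(+)}{P(S)} > \frac{P(S \mid -)\,P(-)}{P(S)} \\
&\iff P(S \mid +)\,P(+) > P(S \mid -)\,P(-)
\end{aligned}
```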

In [247]:
def generate_naive_bayes_parameters(data_frame: pd.DataFrame, smoothening: bool):
    naive_bayes_parameters = get_word_count(data_frame, "Word Frequency")
    if smoothening:
        naive_bayes_parameters["Word Frequency"] += 1

    total_words = naive_bayes_parameters["Word Frequency"].sum(axis=0)
    if smoothening:
        total_words += 2

    total_sentiments = data_frame.count(axis=0)['Sentiment']
    if smoothening:
        total_sentiments += 2

    naive_bayes_parameters['P(Word)'] = naive_bayes_parameters["Word Frequency"].div(total_words)

    positive_sentiments = data_frame[data_frame['Sentiment'] == 1]
    positive_vocabulary = get_word_count(positive_sentiments, "Positive Sentiment Word Frequency")
    naive_bayes_parameters = naive_bayes_parameters.merge(positive_vocabulary, how='left', on='Word')
    if smoothening:
        naive_bayes_parameters["Positive Sentiment Word Frequency"] += 1
        naive_bayes_parameters["Positive Sentiment Word Frequency"] = naive_bayes_parameters[
            "Positive Sentiment Word Frequency"].fillna(
            value=1)

    total_positive_words = positive_sentiments.count(axis=0)['Sentiment']
    if smoothening:
        total_positive_words += 2

    probability_of_positive_sentiments = total_positive_words / total_sentiments
    naive_bayes_parameters['P(Sentiment = Positive)'] = probability_of_positive_sentiments

    naive_bayes_parameters['P(Word | Sentiment = Positive)'] = naive_bayes_parameters[
        'Positive Sentiment Word Frequency'].div(
        total_positive_words)

    negative_sentiments = data_frame[data_frame['Sentiment'] == 0]
    negative_vocabulary = get_word_count(negative_sentiments, "Negative Sentiment Word Frequency")
    naive_bayes_parameters = naive_bayes_parameters.merge(negative_vocabulary, how='left', on='Word')
    if smoothening:
        naive_bayes_parameters["Negative Sentiment Word Frequency"] += 1
        naive_bayes_parameters["Negative Sentiment Word Frequency"] = naive_bayes_parameters[
            "Negative Sentiment Word Frequency"].fillna(
            value=1)

    total_negative_words = negative_sentiments.count(axis=0)['Sentiment']
    if smoothening:
        total_negative_words += 2

    probability_of_negative_sentiments = total_negative_words / total_sentiments
    naive_bayes_parameters['P(Sentiment = Negative)'] = probability_of_negative_sentiments

    naive_bayes_parameters['P(Word | Sentiment = Negative)'] = naive_bayes_parameters[
        'Negative Sentiment Word Frequency'].div(
        total_negative_words)

    return naive_bayes_parameters

## To Get the Probabilities

We use this formula to get the probabilities:

**P(Sentiment | Sentence (Collection of words)) = P(Sentence | Sentiment) * P(Sentiment) / P(Sentence)**

The function below computes only the numerator. The denominator is treated as 1 because it cancels out during
the comparison.

A sentence is a collection of words, so P(Sentence | Sentiment) can be written as:

**P(Sentence | Sentiment) = P(Word_1,Word_2,...,Word_n | Sentiment)**

Under the naive assumption that the words are conditionally independent given the sentiment, this factorizes as:

**P(Word_1,Word_2,...,Word_n | Sentiment) = P(Word_1 | Sentiment).P(Word_2 | Sentiment). ... .P(Word_n | Sentiment)**
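
A product of many small probabilities can become extremely small for long reviews. A common alternative, shown here only as a sketch and not as the approach used in this notebook, is to add log-probabilities instead of multiplying probabilities; the ranking of the two sentiments is unchanged because the logarithm is monotonic. The function name `log_score` and the toy parameter values are assumptions made for illustration.

```python
import math


def log_score(words, word_given_class, prior):
    """Return log(P(Sentiment) * product of P(Word_i | Sentiment))."""
    total = math.log(prior)
    for word in words:
        p = word_given_class.get(word, 0.0)
        if p == 0.0:
            # A single zero factor makes the whole product zero.
            return float("-inf")
        total += math.log(p)
    return total


# Hypothetical toy parameters, not values from the notebook.
p_word_pos = {"good": 0.30, "film": 0.20, "bad": 0.02}
p_word_neg = {"good": 0.05, "film": 0.12, "bad": 0.25}

review = ["good", "film"]
pos = log_score(review, p_word_pos, prior=0.5)
neg = log_score(review, p_word_neg, prior=0.5)
print("positive" if pos > neg else "negative")
```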

In [248]:
def get_probabilities(review: str, naive_bayes_parameters: pd.DataFrame, sentiment: bool, smoothening: bool):
    prob = 1
    column_name = 'P(Word | Sentiment = Positive)' if sentiment else 'P(Word | Sentiment = Negative)'
    individual_prob = 0 if not smoothening else 1 / (
        naive_bayes_parameters['P(Sentiment = Positive)'][0] if sentiment else naive_bayes_parameters[
            'P(Sentiment = Negative)'][0])
    for word in split_words(review):
        if word in naive_bayes_parameters.values:
            individual_prob = naive_bayes_parameters[naive_bayes_parameters['Word'] == word].iloc[0][column_name]
        prob *= 0 if math.isnan(individual_prob) else individual_prob
    return prob * (naive_bayes_parameters['P(Sentiment = Positive)'][0] if sentiment else naive_bayes_parameters[
        'P(Sentiment = Negative)'][0])

In [249]:
def predict_calculate_accuracy(data_frame: pd.DataFrame, naive_bayes_parameters: pd.DataFrame):
    data_frame["P(Sentiment = Positive | Sentence)"] = data_frame["IMDB Review"].apply(
        lambda review: get_probabilities(review, naive_bayes_parameters, True, False))
    data_frame["P(Sentiment = Negative | Sentence)"] = data_frame["IMDB Review"].apply(
        lambda review: get_probabilities(review, naive_bayes_parameters, False, False))
    data_frame["Predicted sentiment"] = data_frame["P(Sentiment = Positive | Sentence)"] > data_frame[
        "P(Sentiment = Negative | Sentence)"]
    accuracy = data_frame.loc[data_frame["Predicted sentiment"] == data_frame["Sentiment"]].count(axis=0)[
                   'Sentiment'] * 100 / data_frame.count(axis=0)['Sentiment']
    print("Accuracy: ", accuracy)
    # print("Wrong Predictions:")
    # display(data_frame.loc[data_frame["Predicted sentiment"] != data_frame["Sentiment"]].reset_index(drop=True))
    return accuracy


## Calculating Accuracy

To estimate accuracy we first split the training set into k folds. In each round, k-1 folds are used to train
the model and the remaining fold is used to test it.

We then predict the sentiment of the held-out fold using the Naive Bayes parameters learned from the training folds.

Accuracy is (number of correct predictions) / (total number of reviews in the held-out fold), expressed as a percentage.

The parameters from the fold with the best accuracy are kept and used for further validation on the dev and test
datasets which we separated in the beginning.
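
The fold logic can be sketched in plain Python. This is a hedged illustration: `k_fold_indices` is a hypothetical, unshuffled stand-in for scikit-learn's `KFold`, and `accuracy_percent` is a hypothetical helper that mirrors the accuracy definition above.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    # The first n_rows % k folds get one extra row so that every row is used.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def accuracy_percent(predicted, actual):
    """Correct predictions divided by total predictions, as a percentage."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct * 100 / len(actual)


folds = list(k_fold_indices(n_rows=10, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)

print(accuracy_percent([1, 0, 1, 1], [1, 0, 0, 1]))
```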

In [250]:
def five_fold_cross_validation(data_frame: pd.DataFrame, smoothening: bool):
    kf = KFold(n_splits=k, shuffle=True)
    accuracies = []
    max_accuracy_naive_bayes_parameters = pd.DataFrame()
    # KFold yields positional indices, so select rows with iloc on the frame that was passed in.
    for index, (train_training, train_testing) in enumerate(kf.split(data_frame)):
        print(f"---------------------------Fold {index + 1}---------------------------------")
        fold_train = data_frame.iloc[train_training]
        # Copy the held-out fold so the prediction columns are not written to a slice of data_frame.
        fold_test = data_frame.iloc[train_testing].copy()
        display(fold_train)
        trained_parameters = generate_naive_bayes_parameters(fold_train, smoothening)
        accuracy = predict_calculate_accuracy(fold_test, trained_parameters)
        accuracies.append(accuracy)
        # Keep the parameters from the most accurate fold seen so far.
        max_accuracy_naive_bayes_parameters = trained_parameters if max(
            accuracies) == accuracy else max_accuracy_naive_bayes_parameters
        display(trained_parameters)
    return max_accuracy_naive_bayes_parameters


vocabulary = five_fold_cross_validation(train, False)
vocabulary

---------------------------Fold 1---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,It has northern humour and positive about the ...,1
1,"But it is entertaining, nonetheless.",1
2,The result is a film that just don't look righ...,0
3,Highly unrecommended.,0
4,"Technically, the film is well made with impres...",1
...,...,...
593,I am a fan of his ... This movie sucked really...,0
594,Nothing at all to recommend.,0
595,It is a true classic.,1
596,It's the one movie that never ceases to intere...,1


Accuracy:  56.666666666666664


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,146,0.019120,78.0,0.531381,0.307087,68.0,0.468619,0.303571
1,represents,1,0.000131,1.0,0.531381,0.003937,,0.468619,
2,the,250,0.032740,143.0,0.531381,0.562992,107.0,0.468619,0.477679
3,humour,2,0.000262,2.0,0.531381,0.007874,,0.468619,
4,northern,1,0.000131,1.0,0.531381,0.003937,,0.468619,
...,...,...,...,...,...,...,...,...,...
2400,ceases,1,0.000131,1.0,0.531381,0.003937,,0.468619,
2401,alert,1,0.000131,1.0,0.531381,0.003937,,0.468619,
2402,therapy,1,0.000131,1.0,0.531381,0.003937,,0.468619,
2403,subtitles,1,0.000131,1.0,0.531381,0.003937,,0.468619,


---------------------------Fold 2---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,It has northern humour and positive about the ...,1
1,"But it is entertaining, nonetheless.",1
3,Highly unrecommended.,0
4,"Technically, the film is well made with impres...",1
5,There is no plot here to keep you going in the...,0
...,...,...
593,I am a fan of his ... This movie sucked really...,0
594,Nothing at all to recommend.,0
595,It is a true classic.,1
596,It's the one movie that never ceases to intere...,1


Accuracy:  70.83333333333333


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,134,0.017503,70.0,0.539749,0.271318,64.0,0.460251,0.290909
1,represents,1,0.000131,1.0,0.539749,0.003876,,0.460251,
2,the,241,0.031479,136.0,0.539749,0.527132,105.0,0.460251,0.477273
3,humour,2,0.000261,2.0,0.539749,0.007752,,0.460251,
4,northern,1,0.000131,1.0,0.539749,0.003876,,0.460251,
...,...,...,...,...,...,...,...,...,...
2388,interest,1,0.000131,1.0,0.539749,0.003876,,0.460251,
2389,alert,1,0.000131,1.0,0.539749,0.003876,,0.460251,
2390,therapy,1,0.000131,1.0,0.539749,0.003876,,0.460251,
2391,subtitles,1,0.000131,1.0,0.539749,0.003876,,0.460251,


---------------------------Fold 3---------------------------------


Unnamed: 0,IMDB Review,Sentiment
2,The result is a film that just don't look righ...,0
3,Highly unrecommended.,0
4,"Technically, the film is well made with impres...",1
5,There is no plot here to keep you going in the...,0
6,"Oh yeah, and the storyline was pathetic too.",0
...,...,...
586,This movie has a cutting edge to it.,1
589,1/10 - and only because there is no setting fo...,0
590,I'll even say it again  this is torture.,0
596,It's the one movie that never ceases to intere...,1


Accuracy:  60.0


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,don't,14,0.001910,3.0,0.539749,0.011628,11.0,0.460251,0.050000
1,look,13,0.001773,7.0,0.539749,0.027132,6.0,0.460251,0.027273
2,just,33,0.004501,12.0,0.539749,0.046512,21.0,0.460251,0.095455
3,the,239,0.032601,135.0,0.539749,0.523256,104.0,0.460251,0.472727
4,film,68,0.009276,46.0,0.539749,0.178295,22.0,0.460251,0.100000
...,...,...,...,...,...,...,...,...,...
2272,alert,1,0.000136,1.0,0.539749,0.003876,,0.460251,
2273,afraid,1,0.000136,1.0,0.539749,0.003876,,0.460251,
2274,therapy,1,0.000136,1.0,0.539749,0.003876,,0.460251,
2275,subtitles,1,0.000136,1.0,0.539749,0.003876,,0.460251,


---------------------------Fold 4---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,It has northern humour and positive about the ...,1
1,"But it is entertaining, nonetheless.",1
2,The result is a film that just don't look righ...,0
3,Highly unrecommended.,0
5,There is no plot here to keep you going in the...,0
...,...,...
591,Judith Light is one of my favorite actresses a...,1
592,But it picked up speed and got right to the po...,1
593,I am a fan of his ... This movie sucked really...,0
594,Nothing at all to recommend.,0


Accuracy:  55.46218487394958


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,141,0.018340,74.0,0.509395,0.303279,67.0,0.490605,0.285106
1,represents,1,0.000130,1.0,0.509395,0.004098,,0.490605,
2,the,247,0.032128,134.0,0.509395,0.549180,113.0,0.490605,0.480851
3,humour,2,0.000260,2.0,0.509395,0.008197,,0.490605,
4,northern,1,0.000130,1.0,0.509395,0.004098,,0.490605,
...,...,...,...,...,...,...,...,...,...
2368,setting,1,0.000130,,0.509395,,1.0,0.490605,0.004255
2369,favorite,1,0.000130,1.0,0.509395,0.004098,,0.490605,
2370,judith,1,0.000130,1.0,0.509395,0.004098,,0.490605,
2371,picked,1,0.000130,1.0,0.509395,0.004098,,0.490605,


---------------------------Fold 5---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,It has northern humour and positive about the ...,1
1,"But it is entertaining, nonetheless.",1
2,The result is a film that just don't look righ...,0
4,"Technically, the film is well made with impres...",1
5,There is no plot here to keep you going in the...,0
...,...,...
593,I am a fan of his ... This movie sucked really...,0
594,Nothing at all to recommend.,0
595,It is a true classic.,1
596,It's the one movie that never ceases to intere...,1


Accuracy:  65.54621848739495


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,131,0.017894,75.0,0.555324,0.281955,56.0,0.444676,0.262911
1,represents,1,0.000137,1.0,0.555324,0.003759,,0.444676,
2,the,247,0.033739,140.0,0.555324,0.526316,107.0,0.444676,0.502347
3,humour,2,0.000273,2.0,0.555324,0.007519,,0.444676,
4,northern,1,0.000137,1.0,0.555324,0.003759,,0.444676,
...,...,...,...,...,...,...,...,...,...
2273,ceases,1,0.000137,1.0,0.555324,0.003759,,0.444676,
2274,alert,1,0.000137,1.0,0.555324,0.003759,,0.444676,
2275,therapy,1,0.000137,1.0,0.555324,0.003759,,0.444676,
2276,subtitles,1,0.000137,1.0,0.555324,0.003759,,0.444676,


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,134,0.017503,70.0,0.539749,0.271318,64.0,0.460251,0.290909
1,represents,1,0.000131,1.0,0.539749,0.003876,,0.460251,
2,the,241,0.031479,136.0,0.539749,0.527132,105.0,0.460251,0.477273
3,humour,2,0.000261,2.0,0.539749,0.007752,,0.460251,
4,northern,1,0.000131,1.0,0.539749,0.003876,,0.460251,
...,...,...,...,...,...,...,...,...,...
2388,interest,1,0.000131,1.0,0.539749,0.003876,,0.460251,
2389,alert,1,0.000131,1.0,0.539749,0.003876,,0.460251,
2390,therapy,1,0.000131,1.0,0.539749,0.003876,,0.460251,
2391,subtitles,1,0.000131,1.0,0.539749,0.003876,,0.460251,


In [251]:
predict_calculate_accuracy(train, vocabulary)

Accuracy:  93.1438127090301



93.1438127090301

In [252]:
predict_calculate_accuracy(dev, vocabulary)

Accuracy:  72.0


72.0

In [253]:
predict_calculate_accuracy(test, vocabulary)

Accuracy:  68.0


68.0

## Most Useful words before Smoothing

These are found by sorting the vocabulary in descending order of **P(Word | Sentiment)**, separately for the positive
and the negative sentiment.

In [254]:
print("Most Useful Positive sentiment words:")
vocabulary.sort_values("P(Word | Sentiment = Positive)", ascending=False)[:most_useful_limit]

Most Useful Positive sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
2,the,241,0.031479,136.0,0.539749,0.527132,105.0,0.460251,0.477273
6,and,171,0.022335,117.0,0.539749,0.453488,54.0,0.460251,0.245455
17,a,172,0.022466,106.0,0.539749,0.410853,66.0,0.460251,0.3
75,of,157,0.020507,93.0,0.539749,0.360465,64.0,0.460251,0.290909
10,is,153,0.019984,82.0,0.539749,0.317829,71.0,0.460251,0.322727
69,this,142,0.018548,79.0,0.539749,0.306202,63.0,0.460251,0.286364
0,it,134,0.017503,70.0,0.539749,0.271318,64.0,0.460251,0.290909
48,to,120,0.015674,68.0,0.539749,0.263566,52.0,0.460251,0.236364
148,i,114,0.01489,63.0,0.539749,0.244186,51.0,0.460251,0.231818
37,film,79,0.010319,52.0,0.539749,0.20155,27.0,0.460251,0.122727


In [255]:
print("Most Useful Negative sentiment words:")
vocabulary.sort_values("P(Word | Sentiment = Negative)", ascending=False)[:most_useful_limit]

Most Useful Negative sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
2,the,241,0.031479,136.0,0.539749,0.527132,105.0,0.460251,0.477273
10,is,153,0.019984,82.0,0.539749,0.317829,71.0,0.460251,0.322727
17,a,172,0.022466,106.0,0.539749,0.410853,66.0,0.460251,0.3
0,it,134,0.017503,70.0,0.539749,0.271318,64.0,0.460251,0.290909
75,of,157,0.020507,93.0,0.539749,0.360465,64.0,0.460251,0.290909
69,this,142,0.018548,79.0,0.539749,0.306202,63.0,0.460251,0.286364
6,and,171,0.022335,117.0,0.539749,0.453488,54.0,0.460251,0.245455
48,to,120,0.015674,68.0,0.539749,0.263566,52.0,0.460251,0.236364
148,i,114,0.01489,63.0,0.539749,0.244186,51.0,0.460251,0.231818
47,in,94,0.012278,52.0,0.539749,0.20155,42.0,0.460251,0.190909


## Smoothening

Smoothening compensates for unknown words. Not every word can appear in the training vocabulary, and without
smoothening a word that was never seen with a sentiment contributes a probability of zero.

Smoothening uses the +1 (add-one) method and is done in the `generate_naive_bayes_parameters` function.

It adds +1 to the following counts:
1. Word Frequency
2. Positive Sentiment Word Frequency
3. Negative Sentiment Word Frequency

It also adds +2 to the following totals. These terms are in the denominator, and the +2 compensates for the +1 in
the numerator so that the probability of even the most frequent words stays below 1:
1. Total words
2. Total Positive sentiments
3. Total Negative sentiments
4. Total sentiments
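
The effect of the +1 and +2 terms on a single conditional probability can be sketched as follows. The function name `smoothed_probability` and the counts are hypothetical; only the +1 in the numerator and the +2 in the denominator come from the description above.

```python
def smoothed_probability(reviews_with_word, reviews_in_class):
    """Add-one estimate of P(Word | Sentiment) from review counts."""
    return (reviews_with_word + 1) / (reviews_in_class + 2)


# Hypothetical counts: 250 reviews in the class, and a word that appears
# in none, one, or all of them.
for seen in (0, 1, 250):
    print(seen, smoothed_probability(seen, 250))
```

A word never seen with the sentiment gets a small non-zero probability, and a word seen in every review still gets a probability below 1.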


In [256]:
vocabulary = five_fold_cross_validation(train, True)
vocabulary

---------------------------Fold 1---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,It has northern humour and positive about the ...,1,2.471503e-17,0.000000e+00,True
1,"But it is entertaining, nonetheless.",1,2.276588e-07,0.000000e+00,True
2,The result is a film that just don't look righ...,0,1.782963e-12,0.000000e+00,True
3,Highly unrecommended.,0,0.000000e+00,1.901864e-05,False
4,"Technically, the film is well made with impres...",1,4.910824e-57,0.000000e+00,True
...,...,...,...,...,...
593,I am a fan of his ... This movie sucked really...,0,0.000000e+00,2.194147e-13,False
594,Nothing at all to recommend.,0,4.093666e-08,4.658780e-07,False
595,It is a true classic.,1,2.298270e-06,0.000000e+00,True
596,It's the one movie that never ceases to intere...,1,2.023924e-36,0.000000e+00,True


Accuracy:  74.16666666666667


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,132,0.013408,72.0,0.522917,0.286853,61.0,0.48125,0.264069
1,represents,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2,the,241,0.024479,136.0,0.522917,0.541833,106.0,0.48125,0.458874
3,humour,3,0.000305,3.0,0.522917,0.011952,1.0,0.48125,0.004329
4,northern,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
...,...,...,...,...,...,...,...,...,...
2347,ceases,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2348,alert,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2349,therapy,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2350,subtitles,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329


---------------------------Fold 2---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,It has northern humour and positive about the ...,1,2.471503e-17,0.000000e+00,True
1,"But it is entertaining, nonetheless.",1,2.276588e-07,0.000000e+00,True
2,The result is a film that just don't look righ...,0,1.782963e-12,0.000000e+00,True
3,Highly unrecommended.,0,0.000000e+00,1.901864e-05,False
4,"Technically, the film is well made with impres...",1,4.910824e-57,0.000000e+00,True
...,...,...,...,...,...
592,But it picked up speed and got right to the po...,1,8.691701e-16,0.000000e+00,True
593,I am a fan of his ... This movie sucked really...,0,0.000000e+00,2.194147e-13,False
594,Nothing at all to recommend.,0,4.093666e-08,4.658780e-07,False
595,It is a true classic.,1,2.298270e-06,0.000000e+00,True


Accuracy:  65.83333333333333


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,137,0.013475,78.0,0.54375,0.298851,60.0,0.460417,0.271493
1,represents,2,0.000197,2.0,0.54375,0.007663,1.0,0.460417,0.004525
2,the,251,0.024688,143.0,0.54375,0.547893,109.0,0.460417,0.493213
3,humour,3,0.000295,3.0,0.54375,0.011494,1.0,0.460417,0.004525
4,northern,2,0.000197,2.0,0.54375,0.007663,1.0,0.460417,0.004525
...,...,...,...,...,...,...,...,...,...
2428,keeps,2,0.000197,2.0,0.54375,0.007663,1.0,0.460417,0.004525
2429,decipher,2,0.000197,2.0,0.54375,0.007663,1.0,0.460417,0.004525
2430,ceases,2,0.000197,2.0,0.54375,0.007663,1.0,0.460417,0.004525
2431,interest,2,0.000197,2.0,0.54375,0.007663,1.0,0.460417,0.004525


---------------------------Fold 3---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,It has northern humour and positive about the ...,1,2.471503e-17,0.000000e+00,True
1,"But it is entertaining, nonetheless.",1,2.276588e-07,0.000000e+00,True
4,"Technically, the film is well made with impres...",1,4.910824e-57,0.000000e+00,True
6,"Oh yeah, and the storyline was pathetic too.",0,0.000000e+00,5.478339e-12,False
7,Watching washing machine twirling around would...,0,0.000000e+00,3.107479e-25,False
...,...,...,...,...,...
591,Judith Light is one of my favorite actresses a...,1,1.864440e-25,0.000000e+00,True
592,But it picked up speed and got right to the po...,1,8.691701e-16,0.000000e+00,True
593,I am a fan of his ... This movie sucked really...,0,0.000000e+00,2.194147e-13,False
595,It is a true classic.,1,2.298270e-06,0.000000e+00,True


Accuracy:  70.0


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,135,0.013397,68.0,0.539583,0.262548,68.0,0.464583,0.304933
1,represents,2,0.000198,2.0,0.539583,0.007722,1.0,0.464583,0.004484
2,the,242,0.024015,136.0,0.539583,0.525097,107.0,0.464583,0.479821
3,humour,3,0.000298,3.0,0.539583,0.011583,1.0,0.464583,0.004484
4,northern,2,0.000198,2.0,0.539583,0.007722,1.0,0.464583,0.004484
...,...,...,...,...,...,...,...,...,...
2362,judith,2,0.000198,2.0,0.539583,0.007722,1.0,0.464583,0.004484
2363,picked,2,0.000198,2.0,0.539583,0.007722,1.0,0.464583,0.004484
2364,therapy,2,0.000198,2.0,0.539583,0.007722,1.0,0.464583,0.004484
2365,subtitles,2,0.000198,2.0,0.539583,0.007722,1.0,0.464583,0.004484


---------------------------Fold 4---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,It has northern humour and positive about the ...,1,2.471503e-17,0.000000e+00,True
1,"But it is entertaining, nonetheless.",1,2.276588e-07,0.000000e+00,True
2,The result is a film that just don't look righ...,0,1.782963e-12,0.000000e+00,True
3,Highly unrecommended.,0,0.000000e+00,1.901864e-05,False
4,"Technically, the film is well made with impres...",1,4.910824e-57,0.000000e+00,True
...,...,...,...,...,...
592,But it picked up speed and got right to the po...,1,8.691701e-16,0.000000e+00,True
593,I am a fan of his ... This movie sucked really...,0,0.000000e+00,2.194147e-13,False
594,Nothing at all to recommend.,0,4.093666e-08,4.658780e-07,False
596,It's the one movie that never ceases to intere...,1,2.023924e-36,0.000000e+00,True


Accuracy:  65.54621848739495


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,140,0.014675,76.0,0.553015,0.285714,65.0,0.451143,0.299539
1,represents,2,0.000210,2.0,0.553015,0.007519,1.0,0.451143,0.004608
2,the,254,0.026625,148.0,0.553015,0.556391,107.0,0.451143,0.493088
3,humour,2,0.000210,2.0,0.553015,0.007519,1.0,0.451143,0.004608
4,northern,2,0.000210,2.0,0.553015,0.007519,1.0,0.451143,0.004608
...,...,...,...,...,...,...,...,...,...
2258,alert,2,0.000210,2.0,0.553015,0.007519,1.0,0.451143,0.004608
2259,afraid,2,0.000210,2.0,0.553015,0.007519,1.0,0.451143,0.004608
2260,therapy,2,0.000210,2.0,0.553015,0.007519,1.0,0.451143,0.004608
2261,subtitles,2,0.000210,2.0,0.553015,0.007519,1.0,0.451143,0.004608


---------------------------Fold 5---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
2,The result is a film that just don't look righ...,0,1.782963e-12,0.000000e+00,True
3,Highly unrecommended.,0,0.000000e+00,1.901864e-05,False
5,There is no plot here to keep you going in the...,0,0.000000e+00,1.557680e-17,False
6,"Oh yeah, and the storyline was pathetic too.",0,0.000000e+00,5.478339e-12,False
9,This is an extraordinary film.,1,4.071717e-05,1.305057e-05,True
...,...,...,...,...,...
591,Judith Light is one of my favorite actresses a...,1,1.864440e-25,0.000000e+00,True
594,Nothing at all to recommend.,0,4.093666e-08,4.658780e-07,False
595,It is a true classic.,1,2.298270e-06,0.000000e+00,True
596,It's the one movie that never ceases to intere...,1,2.023924e-36,0.000000e+00,True


Accuracy:  67.22689075630252


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,don't,15,0.001540,4.0,0.525988,0.015810,12.0,0.47817,0.052174
1,look,15,0.001540,8.0,0.525988,0.031621,8.0,0.47817,0.034783
2,just,33,0.003388,12.0,0.525988,0.047431,22.0,0.47817,0.095652
3,the,241,0.024741,130.0,0.525988,0.513834,112.0,0.47817,0.486957
4,film,79,0.008110,52.0,0.525988,0.205534,28.0,0.47817,0.121739
...,...,...,...,...,...,...,...,...,...
2308,ceases,2,0.000205,2.0,0.525988,0.007905,1.0,0.47817,0.004348
2309,alert,2,0.000205,2.0,0.525988,0.007905,1.0,0.47817,0.004348
2310,therapy,2,0.000205,2.0,0.525988,0.007905,1.0,0.47817,0.004348
2311,subtitles,2,0.000205,2.0,0.525988,0.007905,1.0,0.47817,0.004348


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,it,132,0.013408,72.0,0.522917,0.286853,61.0,0.48125,0.264069
1,represents,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2,the,241,0.024479,136.0,0.522917,0.541833,106.0,0.48125,0.458874
3,humour,3,0.000305,3.0,0.522917,0.011952,1.0,0.48125,0.004329
4,northern,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
...,...,...,...,...,...,...,...,...,...
2347,ceases,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2348,alert,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2349,therapy,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2350,subtitles,2,0.000203,2.0,0.522917,0.007968,1.0,0.48125,0.004329


In [257]:
predict_calculate_accuracy(train, vocabulary)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame["P(Sentiment = Positive | Sentence)"] = data_frame["IMDB Review"].apply(


Accuracy:  93.1438127090301


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame["P(Sentiment = Negative | Sentence)"] = data_frame["IMDB Review"].apply(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame["Predicted sentiment"] = data_frame["P(Sentiment = Positive | Sentence)"] > data_frame[


93.1438127090301
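
The `SettingWithCopyWarning` messages above typically appear when columns are assigned on a DataFrame that was itself produced by slicing another one. One common way to avoid them is to take an explicit copy before adding columns. The sketch below assumes the split is made with a boolean mask; the `reviews` frame and the `add_prediction` helper are hypothetical stand-ins, not the notebook's own code.

```python
import pandas as pd

# Hypothetical stand-in for the review data used in this notebook.
reviews = pd.DataFrame({
    "IMDB Review": ["It is a true classic.", "Highly unrecommended."],
    "Sentiment": [1, 0],
})

# .copy() turns the slice into an independent frame, so later column
# assignments no longer refer back to the original DataFrame.
train_split = reviews[reviews["Sentiment"] == 1].copy()


def add_prediction(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with a placeholder prediction column added."""
    result = data_frame.copy()
    # Placeholder rule for illustration only: predict positive for every row.
    result["Predicted sentiment"] = True
    return result


scored = add_prediction(train_split)
```

Working on a copy also keeps the original `reviews` frame unchanged, which makes it safe to call the function on `train`, `dev`, and `test` in turn.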

In [258]:
predict_calculate_accuracy(dev, vocabulary)

Accuracy:  73.33333333333333


73.33333333333333

In [259]:
predict_calculate_accuracy(test, vocabulary)

Accuracy:  69.33333333333333


69.33333333333333

## Most Useful words after Smoothing

In [260]:
print("Most Useful Positive sentiment words:")
vocabulary.sort_values("P(Word | Sentiment = Positive)", ascending=False)[:most_useful_limit]

Most Useful Positive sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
2,the,241,0.024479,136.0,0.522917,0.541833,106.0,0.48125,0.458874
6,and,171,0.017369,114.0,0.522917,0.454183,58.0,0.48125,0.251082
19,a,178,0.01808,105.0,0.522917,0.418327,74.0,0.48125,0.320346
74,of,152,0.015439,90.0,0.522917,0.358566,63.0,0.48125,0.272727
10,is,156,0.015846,77.0,0.522917,0.306773,80.0,0.48125,0.34632
68,this,141,0.014322,75.0,0.522917,0.298805,67.0,0.48125,0.290043
0,it,132,0.013408,72.0,0.522917,0.286853,61.0,0.48125,0.264069
53,to,126,0.012798,68.0,0.522917,0.270916,59.0,0.48125,0.255411
108,i,111,0.011275,61.0,0.522917,0.243028,51.0,0.48125,0.220779
52,in,90,0.009142,53.0,0.522917,0.211155,38.0,0.48125,0.164502


In [261]:
print("Most Useful Negative sentiment words:")
vocabulary.sort_values("P(Word | Sentiment = Negative)", ascending=False)[:most_useful_limit]

Most Useful Negative sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
2,the,241,0.024479,136.0,0.522917,0.541833,106.0,0.48125,0.458874
10,is,156,0.015846,77.0,0.522917,0.306773,80.0,0.48125,0.34632
19,a,178,0.01808,105.0,0.522917,0.418327,74.0,0.48125,0.320346
68,this,141,0.014322,75.0,0.522917,0.298805,67.0,0.48125,0.290043
74,of,152,0.015439,90.0,0.522917,0.358566,63.0,0.48125,0.272727
0,it,132,0.013408,72.0,0.522917,0.286853,61.0,0.48125,0.264069
53,to,126,0.012798,68.0,0.522917,0.270916,59.0,0.48125,0.255411
6,and,171,0.017369,114.0,0.522917,0.454183,58.0,0.48125,0.251082
108,i,111,0.011275,61.0,0.522917,0.243028,51.0,0.48125,0.220779
20,that,81,0.008228,41.0,0.522917,0.163347,41.0,0.48125,0.177489


## Inference

From the results above, we can see the effect of smoothing: accuracy increased by 15%.
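
For reference, the textbook form of additive (Laplace) smoothing can be sketched as below. This is a generic illustration, not necessarily the exact formula used in this notebook; `smoothed_likelihood`, `alpha`, and the toy counts are all hypothetical.

```python
def smoothed_likelihood(word_count: int, class_total: int,
                        vocabulary_size: int, alpha: float = 1.0) -> float:
    """Estimate P(word | class) with additive smoothing.

    word_count: occurrences of the word in the class
    class_total: total word occurrences in the class
    vocabulary_size: number of distinct words in the vocabulary
    alpha: pseudo-count added to every word (1.0 gives Laplace smoothing)
    """
    return (word_count + alpha) / (class_total + alpha * vocabulary_size)


# Toy numbers (assumed): a word that never occurs in the negative class.
unsmoothed = smoothed_likelihood(0, 1000, 2000, alpha=0.0)
smoothed = smoothed_likelihood(0, 1000, 2000, alpha=1.0)
```

Without smoothing, a single unseen word makes the product of likelihoods for that class exactly zero; with `alpha = 1.0` the same word contributes a small but non-zero probability.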

Also, from the most useful words we can see two things:
1. The words ranked as most useful are simply the most common words.
2. These common words, and some others, have a high probability under both the positive and the negative sentiment, so they do little to separate the two classes.

This shows us that these words need to be removed from the vocabulary.
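
One way to make this concrete is to rank words by how differently they behave in the two classes rather than by raw likelihood. The sketch below scores each word by the log ratio of its two conditional probabilities. This is an assumed alternative criterion, not the ranking used above, and `log_ratio` is a hypothetical helper. The likelihood values are copied from the vocabulary tables above.

```python
import math

# (P(Word | Sentiment = Positive), P(Word | Sentiment = Negative))
# copied from the vocabulary tables above.
likelihoods = {
    "the": (0.541833, 0.458874),
    "is": (0.306773, 0.34632),
    "and": (0.454183, 0.251082),
    "humour": (0.011952, 0.004329),
}


def log_ratio(p_positive: float, p_negative: float) -> float:
    """Positive values favour the positive class, negative values the negative class."""
    return math.log(p_positive / p_negative)


# Sort by absolute log ratio so the most class-specific words come first.
ranked = sorted(
    likelihoods,
    key=lambda word: abs(log_ratio(*likelihoods[word])),
    reverse=True,
)
print(ranked)
```

Under this score `the` and `is` fall to the bottom. Rare words such as `humour` score highly from very few occurrences, so a minimum frequency threshold would likely be needed as well.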

As future work, we can remove stop words using the stop-word list from Python's NLTK library.
We can also remove frequent domain words such as `movie` and `film`, which appear in both positive and
negative reviews. This is to be expected, since the dataset consists of movie reviews.
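
A minimal sketch of the filtering described above. To stay self-contained it uses a small hand-written stop-word set; in practice the set could come from NLTK's `stopwords.words("english")` once the `stopwords` corpus has been downloaded. The names `STOP_WORDS`, `DOMAIN_WORDS`, and `filter_tokens` are hypothetical.

```python
# Small illustrative stop-word set (assumption); NLTK's English list is much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "is", "this", "it", "to", "i", "in", "that"}
# Frequent domain words that appear in both positive and negative reviews.
DOMAIN_WORDS = {"movie", "film"}
EXCLUDED = STOP_WORDS | DOMAIN_WORDS


def filter_tokens(sentence: str) -> list[str]:
    """Lowercase, split on whitespace, strip punctuation, and drop excluded words."""
    tokens = (token.strip(".,!?\"'()") for token in sentence.lower().split())
    return [token for token in tokens if token and token not in EXCLUDED]


print(filter_tokens("This is an extraordinary film."))
```

Applying the same filter while building the vocabulary and while scoring a sentence keeps training and prediction consistent.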