<a href="https://colab.research.google.com/github/VighneshS/sentiment_prediction/blob/master/sentiment_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Prediction using Naive Bayes Classifier (NBC)
This notebook explains how a Naive Bayes Classifier (NBC) works and how it can be used to classify text by sentiment.

We will also see how well it copes with missing data, such as words that never appear in the training set.

## Settings
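Training ratio, number of cross-validation folds (`k`), and the number of top-ranked words to display (`most_useful_limit`).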

In [174]:
training_ratio = 80 / 100
k = 5
most_useful_limit = 20

## Importing the Data
We used the [kaggle dataset](https://storage.googleapis.com/kagglesdsdata/datasets/22169/30047/sentiment%20labelled%20sentences/imdb_labelled.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210425%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210425T202010Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=6133706ef10bc2dcd0b58f8398b4d73ab9e9d788de1718b07334df91f6007e1e4ca0b78e3176f95b8250e0c4535ce1633528f4fabffeb7e4124af3ee3f895ac34c03044fca9b23b23c4ddb8fa90d84dfc14869ff4806f03783cafad53b19445b3c3052983fdf1ca4384257eac1bc0a4270d238a1ea89d1289866c7a0ea7ad7c97a76f2e142c148019e39cc5a1295f92650747ac5ea5946b026f7ad6d5d262d4c4a370aee6bc1f5d5b445bb6d93692debe678a79e5e1c1fe3d3e68ea4f2fad3115795d3361e0626e98156fbc7f5967beb7cf0f00e07351d23a00d8677ebb75e3e13b1bfa07762266efabf6f6f9d53206be31b7623cf3614f60f8cf5011cf23def) to get the ground truth of sample IMDB reviews.

In [175]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import display
import math
from sklearn.model_selection import KFold

In [176]:
data = pd.read_csv(
    r"http://storage.googleapis.com/kagglesdsdata/datasets/22169/30047/sentiment%20labelled%20sentences/imdb_labelled.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210425%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210425T202010Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=6133706ef10bc2dcd0b58f8398b4d73ab9e9d788de1718b07334df91f6007e1e4ca0b78e3176f95b8250e0c4535ce1633528f4fabffeb7e4124af3ee3f895ac34c03044fca9b23b23c4ddb8fa90d84dfc14869ff4806f03783cafad53b19445b3c3052983fdf1ca4384257eac1bc0a4270d238a1ea89d1289866c7a0ea7ad7c97a76f2e142c148019e39cc5a1295f92650747ac5ea5946b026f7ad6d5d262d4c4a370aee6bc1f5d5b445bb6d93692debe678a79e5e1c1fe3d3e68ea4f2fad3115795d3361e0626e98156fbc7f5967beb7cf0f00e07351d23a00d8677ebb75e3e13b1bfa07762266efabf6f6f9d53206be31b7623cf3614f60f8cf5011cf23def",
    delimiter="\t", header=None, names=["IMDB Review", "Sentiment"])
data = data.sample(frac=1).reset_index(drop=True)

### Split Data
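We split the data into train, development, and test sets: 80% of the shuffled rows go to training, and the remaining 20% is divided evenly between development and test.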

In [177]:
train = data[:math.floor(data.shape[0] * training_ratio)]

In [178]:
validation = data[math.floor(data.shape[0] * training_ratio):].sample(frac=1).reset_index(drop=True)
dev, test = np.array_split(validation, 2)
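
To make the proportions concrete, here is a minimal sketch of the same 80/10/10 split applied to a plain list of row indices. The `split_indices` helper and the row count of 10 are illustrative assumptions, not part of the notebook.

```python
import math


def split_indices(n_rows, training_ratio=0.8):
    """Return (train, dev, test) index lists.

    The first `training_ratio` share of rows goes to train, and the
    remainder is halved into dev and test.
    """
    cut = math.floor(n_rows * training_ratio)
    train_idx = list(range(cut))
    rest = list(range(cut, n_rows))
    # Like np.array_split, the first half takes the extra row when the count is odd
    half = math.ceil(len(rest) / 2)
    return train_idx, rest[:half], rest[half:]


train_idx, dev_idx, test_idx = split_indices(10)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Because the rows are shuffled before splitting, the exact membership of each set changes from run to run unless a random seed is fixed.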

In [179]:
display(train, dev, test)

Unnamed: 0,IMDB Review,Sentiment
0,"In conclusion, I will not bother with this mov...",0
1,"Mark my words, this is one of those cult films...",1
2,& That movie was bad.,0
3,Generally; it just lacked imagination.,0
4,THERE IS NO PLOT OR STORYLINE!!,0
...,...,...
593,This is a stunning movie.,1
594,"Utterly without merit on any level, this is ak...",0
595,"With great sound effects, and impressive spec...",1
596,"This is a bad film, with bad writing, and good...",0


Unnamed: 0,IMDB Review,Sentiment
0,"While you don't yet hear Mickey speak, there a...",1
1,One of the most disappointing aspects is the l...,0
2,Lot of holes in the plot: there's nothing abou...,0
3,It is indescribably the most annoying and idio...,0
4,It's a feel-good film and that's how I felt wh...,1
...,...,...
70,"Each track commands sentiment, actually contri...",1
71,With the originality and freshness of the firs...,0
72,"Bad characters, bad story and bad acting.",0
73,"Overall, a delight!",1


Unnamed: 0,IMDB Review,Sentiment
75,This movie does an excellent job of revealing ...,1
76,The story unfolds in 18th century Jutland and ...,1
77,"It's a mediocre, miserable, hollow, laughable ...",0
78,Both Rickman and Stowe play their roles to the...,1
79,It failed to convey the broad sweep of landsca...,0
...,...,...
145,"The rest of the movie lacks art, charm, meanin...",0
146,At a time when it seems that film animation ha...,1
147,I saw this short film on HBO the other day and...,1
148,The characters are fleshed out surprisingly we...,1


## Generating the Vocabulary List

In [180]:
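def split_words(review):
    # Lowercase, drop punctuation and possessive 's, turn '-' and '/' into spaces, then split on whitespace
    cleaned = review.lower()
    for token in (',', '"', '(', ')', '\'s', '.', '!'):
        cleaned = cleaned.replace(token, '')
    for token in ('-', '/'):
        cleaned = cleaned.replace(token, ' ')
    return cleaned.split()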


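def get_word_count(review_data_frame: pd.DataFrame, column_name: str):
    # Each review becomes one row of per-word counts (one column per word, NaN where absent).
    # count(axis=0) then counts the non-NaN rows, i.e. the number of reviews containing each word.
    vocab = review_data_frame["IMDB Review"].apply(
        lambda review: pd.Series(split_words(review)).value_counts()).count(axis=0).to_frame()
    vocab.columns = [column_name]
    vocab.reset_index(inplace=True)
    vocab = vocab.rename(columns={'index': 'Word'})
    return vocab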
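
As a rough illustration of what these two helpers compute, the sketch below tokenizes two made-up reviews and counts, for each word, how many reviews contain it. That is what `.count(axis=0)` yields above, rather than the total number of occurrences. The `tokenize` and `document_frequency` names and the sample sentences are assumptions for this example, and it uses only the standard library instead of pandas.

```python
from collections import Counter


def tokenize(review):
    """Lowercase, strip basic punctuation and possessive 's, and split on whitespace."""
    cleaned = review.lower()
    for token in (',', '"', '(', ')', "'s", '.', '!'):
        cleaned = cleaned.replace(token, '')
    for token in ('-', '/'):
        cleaned = cleaned.replace(token, ' ')
    return cleaned.split()


def document_frequency(reviews):
    """Count how many reviews contain each word (a review counts a word once)."""
    counts = Counter()
    for review in reviews:
        counts.update(set(tokenize(review)))
    return counts


reviews = ["A feel-good film!", "The film's plot was bad, bad, bad."]
df = document_frequency(reviews)
print(tokenize(reviews[1]))
print(df["film"], df["bad"])
```

Here "bad" appears three times in the second review but is counted once, because only presence in a review matters.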

In [181]:
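def generate_naive_bayes_parameters(data_frame: pd.DataFrame, smoothening: bool):
    # Builds one row per vocabulary word with the class priors and P(Word | Sentiment) for both classes.
    # All "frequency" columns hold the number of reviews containing the word (see get_word_count).
    naive_bayes_parameters = get_word_count(data_frame, "Word Frequency")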
    if smoothening:
        naive_bayes_parameters["Word Frequency"] += 1

    total_words = naive_bayes_parameters["Word Frequency"].sum(axis=0)
    if smoothening:
        total_words += 2

    total_sentiments = data_frame.count(axis=0)['Sentiment']
    if smoothening:
        total_sentiments += 2

    naive_bayes_parameters['P(Word)'] = naive_bayes_parameters["Word Frequency"].div(total_words)

    positive_sentiments = data_frame[data_frame['Sentiment'] == 1]
    positive_vocabulary = get_word_count(positive_sentiments, "Positive Sentiment Word Frequency")
    naive_bayes_parameters = naive_bayes_parameters.merge(positive_vocabulary, how='left', on='Word')
    if smoothening:
        naive_bayes_parameters["Positive Sentiment Word Frequency"] += 1
        naive_bayes_parameters["Positive Sentiment Word Frequency"] = naive_bayes_parameters[
            "Positive Sentiment Word Frequency"].fillna(
            value=1)

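    # Despite the name, this is the number of positive reviews (the denominator of P(Word | Sentiment = Positive))
    total_positive_words = positive_sentiments.count(axis=0)['Sentiment']
    if smoothening:
        total_positive_words += 2

    # With smoothing, each class count and the overall total all gain 2, so the two priors sum to slightly more than 1
    probability_of_positive_sentiments = total_positive_words / total_sentiments
    naive_bayes_parameters['P(Sentiment = Positive)'] = probability_of_positive_sentiments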

    naive_bayes_parameters['P(Word | Sentiment = Positive)'] = naive_bayes_parameters[
        'Positive Sentiment Word Frequency'].div(
        total_positive_words)

    negative_sentiments = data_frame[data_frame['Sentiment'] == 0]
    negative_vocabulary = get_word_count(negative_sentiments, "Negative Sentiment Word Frequency")
    naive_bayes_parameters = naive_bayes_parameters.merge(negative_vocabulary, how='left', on='Word')
    if smoothening:
        naive_bayes_parameters["Negative Sentiment Word Frequency"] += 1
        naive_bayes_parameters["Negative Sentiment Word Frequency"] = naive_bayes_parameters[
            "Negative Sentiment Word Frequency"].fillna(
            value=1)

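    # Despite the name, this is the number of negative reviews (the denominator of P(Word | Sentiment = Negative))
    total_negative_words = negative_sentiments.count(axis=0)['Sentiment']
    if smoothening:
        total_negative_words += 2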

    probability_of_negative_sentiments = total_negative_words / total_sentiments
    naive_bayes_parameters['P(Sentiment = Negative)'] = probability_of_negative_sentiments

    naive_bayes_parameters['P(Word | Sentiment = Negative)'] = naive_bayes_parameters[
        'Negative Sentiment Word Frequency'].div(
        total_negative_words)

    return naive_bayes_parameters
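
The estimates built above can be summarized in a few lines. This is a minimal sketch, assuming add-one (Laplace) smoothing for a presence/absence model, where each conditional probability is (reviews of the class containing the word + 1) / (reviews of the class + 2). The `estimate_parameters` function and the toy data are hypothetical, and the sketch leaves the class priors unsmoothed, which differs from the notebook's function.

```python
from collections import Counter


def estimate_parameters(docs, labels, smoothing=True):
    """Estimate P(class) and P(word present | class) from tokenized documents.

    docs: list of token lists; labels: list of 0/1 class labels.
    """
    alpha = 1 if smoothing else 0
    vocab = {word for doc in docs for word in doc}
    priors, likelihoods = {}, {}
    for c in (0, 1):
        class_docs = [set(doc) for doc, y in zip(docs, labels) if y == c]
        n_c = len(class_docs)
        priors[c] = n_c / len(docs)
        doc_counts = Counter(word for doc in class_docs for word in doc)
        # Two outcomes per word (present or absent), hence the 2 * alpha in the denominator
        likelihoods[c] = {
            word: (doc_counts[word] + alpha) / (n_c + 2 * alpha) for word in vocab
        }
    return priors, likelihoods


docs = [["great", "film"], ["bad", "film"], ["bad", "plot"]]
labels = [1, 0, 0]
priors, likelihoods = estimate_parameters(docs, labels)
print(priors[0], likelihoods[1]["bad"], likelihoods[0]["bad"])
```

Without smoothing, a word never seen in a class gets probability 0 for that class, which zeroes out the whole product for any review containing it.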

In [182]:
def get_probabilities(review: str, naive_bayes_parameters: pd.DataFrame, sentiment: bool, smoothening: bool):
    # Returns P(class) * product of P(word | class) over the review's words (an unnormalized posterior)
    prior_column = 'P(Sentiment = Positive)' if sentiment else 'P(Sentiment = Negative)'
    column_name = 'P(Word | Sentiment = Positive)' if sentiment else 'P(Word | Sentiment = Negative)'
    prior = naive_bayes_parameters[prior_column][0]
    known_words = set(naive_bayes_parameters['Word'])
    prob = 1
    # Starting value, used for out-of-vocabulary words that come before any known word
    individual_prob = 1 / prior if smoothening else 0
    for word in split_words(review):
        if word in known_words:
            individual_prob = naive_bayes_parameters[naive_bayes_parameters['Word'] == word].iloc[0][column_name]
        # An out-of-vocabulary word reuses the most recent value of individual_prob; NaN counts as 0
        prob *= 0 if math.isnan(individual_prob) else individual_prob
    return prob * prior
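
A product of many probabilities below 1 shrinks very quickly (the per-sentence values shown later in this notebook reach magnitudes like 1e-56). A common alternative, sketched here as an assumption and not as the notebook's method, is to sum logarithms instead of multiplying. The `log_score` helper, its `unseen_prob` fallback, and the toy probabilities are hypothetical.

```python
import math


def log_score(words, word_probs, prior, unseen_prob=1e-6):
    """Return log P(class) + sum of log P(word | class).

    Words missing from `word_probs` fall back to `unseen_prob`, an arbitrary
    illustrative constant, so that log(0) is never taken.
    """
    total = math.log(prior)
    for word in words:
        total += math.log(word_probs.get(word, unseen_prob))
    return total


positive = {"stunning": 0.02, "movie": 0.18}
negative = {"stunning": 0.004, "movie": 0.18}
words = ["this", "is", "a", "stunning", "movie"]

pos = log_score(words, positive, prior=0.5)
neg = log_score(words, negative, prior=0.5)
print("positive" if pos > neg else "negative")
```

Comparing log scores gives the same ranking as comparing the products, since the logarithm is monotonic.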

In [183]:
def predict_calculate_accuracy(data_frame: pd.DataFrame, naive_bayes_parameters: pd.DataFrame):
    # Scores every review under both classes; the score columns are added to data_frame in place
    data_frame["P(Sentiment = Positive | Sentence)"] = data_frame["IMDB Review"].apply(
        lambda review: get_probabilities(review, naive_bayes_parameters, True, False))
    data_frame["P(Sentiment = Negative | Sentence)"] = data_frame["IMDB Review"].apply(
        lambda review: get_probabilities(review, naive_bayes_parameters, False, False))
    # Predict positive (True) only when the positive score is strictly larger
    data_frame["Predicted sentiment"] = data_frame["P(Sentiment = Positive | Sentence)"] > data_frame[
        "P(Sentiment = Negative | Sentence)"]
    accuracy = data_frame.loc[data_frame["Predicted sentiment"] == data_frame["Sentiment"]].count(axis=0)[
                   'Sentiment'] * 100 / data_frame.count(axis=0)['Sentiment']
    print("Accuracy: ", accuracy)
    # print("Wrong Predictions:")
    # display(data_frame.loc[data_frame["Predicted sentiment"] != data_frame["Sentiment"]].reset_index(drop=True))
    return accuracy


In [184]:
def five_fold_cross_validation(data_frame: pd.DataFrame, smoothening: bool):
    # Runs k-fold cross-validation and returns the parameters from the best-scoring fold
    kf = KFold(n_splits=k, shuffle=True)
    accuracies = []
    max_accuracy_naive_bayes_parameters = pd.DataFrame()
    # KFold yields positional indices; .loc works here because the frame has a default 0..n-1 index
    for index, (train_training, train_testing) in enumerate(kf.split(data_frame)):
        print(f"---------------------------Fold {index + 1}---------------------------------")
        display(data_frame.loc[train_training])
        trained_parameters = generate_naive_bayes_parameters(data_frame.loc[train_training], smoothening)
        accuracy = predict_calculate_accuracy(data_frame.loc[train_testing], trained_parameters)
        accuracies.append(accuracy)
        # Keep the parameters of the fold with the highest validation accuracy so far
        if accuracy == max(accuracies):
            max_accuracy_naive_bayes_parameters = trained_parameters
        display(trained_parameters)
    return max_accuracy_naive_bayes_parameters


vocabulary = five_fold_cross_validation(train, False)
vocabulary

---------------------------Fold 1---------------------------------


Unnamed: 0,IMDB Review,Sentiment
2,& That movie was bad.,0
3,Generally; it just lacked imagination.,0
4,THERE IS NO PLOT OR STORYLINE!!,0
5,I do not know if this was Emilio Estevez's dir...,1
6,Perabo has a nice energy level and is obviousl...,1
...,...,...
591,"Totally different, with loads of understatemen...",1
592,Saw the movie today and thought it was a good ...,1
593,This is a stunning movie.,1
594,"Utterly without merit on any level, this is ak...",0


Accuracy:  70.0


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,movie,88,0.013225,45.0,0.51046,0.184426,43.0,0.48954,0.183761
1,bad,30,0.004509,1.0,0.51046,0.004098,29.0,0.48954,0.123932
2,&,7,0.001052,3.0,0.51046,0.012295,4.0,0.48954,0.017094
3,that,73,0.010971,33.0,0.51046,0.135246,40.0,0.48954,0.170940
4,was,90,0.013526,44.0,0.51046,0.180328,46.0,0.48954,0.196581
...,...,...,...,...,...,...,...,...,...
2035,merit,1,0.000150,,0.51046,,1.0,0.48954,0.004274
2036,microsoft,1,0.000150,,0.51046,,1.0,0.48954,0.004274
2037,space,1,0.000150,,0.51046,,1.0,0.48954,0.004274
2038,cg,1,0.000150,,0.51046,,1.0,0.48954,0.004274


---------------------------Fold 2---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,"In conclusion, I will not bother with this mov...",0
1,"Mark my words, this is one of those cult films...",1
2,& That movie was bad.,0
4,THERE IS NO PLOT OR STORYLINE!!,0
5,I do not know if this was Emilio Estevez's dir...,1
...,...,...
592,Saw the movie today and thought it was a good ...,1
593,This is a stunning movie.,1
595,"With great sound effects, and impressive spec...",1
596,"This is a bad film, with bad writing, and good...",0


Accuracy:  64.16666666666667


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,87,0.011887,48.0,0.51046,0.196721,39.0,0.48954,0.166667
1,movie,85,0.011614,42.0,0.51046,0.172131,43.0,0.48954,0.183761
2,because,11,0.001503,5.0,0.51046,0.020492,6.0,0.48954,0.025641
3,not,36,0.004919,10.0,0.51046,0.040984,26.0,0.48954,0.111111
4,angeles,1,0.000137,,0.51046,,1.0,0.48954,0.004274
...,...,...,...,...,...,...,...,...,...
2269,storytellinga,1,0.000137,,0.51046,,1.0,0.48954,0.004274
2270,microsoft,1,0.000137,,0.51046,,1.0,0.48954,0.004274
2271,cg,1,0.000137,,0.51046,,1.0,0.48954,0.004274
2272,slideshow,1,0.000137,,0.51046,,1.0,0.48954,0.004274


---------------------------Fold 3---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,"In conclusion, I will not bother with this mov...",0
1,"Mark my words, this is one of those cult films...",1
2,& That movie was bad.,0
3,Generally; it just lacked imagination.,0
4,THERE IS NO PLOT OR STORYLINE!!,0
...,...,...
592,Saw the movie today and thought it was a good ...,1
593,This is a stunning movie.,1
594,"Utterly without merit on any level, this is ak...",0
595,"With great sound effects, and impressive spec...",1


Accuracy:  64.16666666666667


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,90,0.012225,46.0,0.504184,0.190871,44.0,0.495816,0.185654
1,movie,82,0.011138,40.0,0.504184,0.165975,42.0,0.495816,0.177215
2,because,15,0.002037,7.0,0.504184,0.029046,8.0,0.495816,0.033755
3,not,34,0.004618,8.0,0.504184,0.033195,26.0,0.495816,0.109705
4,angeles,1,0.000136,,0.504184,,1.0,0.495816,0.004219
...,...,...,...,...,...,...,...,...,...
2315,haggis,1,0.000136,,0.504184,,1.0,0.495816,0.004219
2316,handle,1,0.000136,,0.504184,,1.0,0.495816,0.004219
2317,strokes,1,0.000136,,0.504184,,1.0,0.495816,0.004219
2318,ugly,1,0.000136,,0.504184,,1.0,0.495816,0.004219


---------------------------Fold 4---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,"In conclusion, I will not bother with this mov...",0
1,"Mark my words, this is one of those cult films...",1
2,& That movie was bad.,0
3,Generally; it just lacked imagination.,0
5,I do not know if this was Emilio Estevez's dir...,1
...,...,...
591,"Totally different, with loads of understatemen...",1
594,"Utterly without merit on any level, this is ak...",0
595,"With great sound effects, and impressive spec...",1
596,"This is a bad film, with bad writing, and good...",0


Accuracy:  61.34453781512605


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,94,0.012760,46.0,0.496868,0.193277,48.0,0.503132,0.199170
1,movie,93,0.012624,47.0,0.496868,0.197479,46.0,0.503132,0.190871
2,because,15,0.002036,6.0,0.496868,0.025210,9.0,0.503132,0.037344
3,not,37,0.005022,11.0,0.496868,0.046218,26.0,0.503132,0.107884
4,angeles,1,0.000136,,0.496868,,1.0,0.503132,0.004149
...,...,...,...,...,...,...,...,...,...
2301,opening,1,0.000136,,0.496868,,1.0,0.503132,0.004149
2302,microsoft,1,0.000136,,0.496868,,1.0,0.503132,0.004149
2303,cg,1,0.000136,,0.496868,,1.0,0.503132,0.004149
2304,slideshow,1,0.000136,,0.496868,,1.0,0.503132,0.004149


---------------------------Fold 5---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,"In conclusion, I will not bother with this mov...",0
1,"Mark my words, this is one of those cult films...",1
3,Generally; it just lacked imagination.,0
4,THERE IS NO PLOT OR STORYLINE!!,0
5,I do not know if this was Emilio Estevez's dir...,1
...,...,...
593,This is a stunning movie.,1
594,"Utterly without merit on any level, this is ak...",0
595,"With great sound effects, and impressive spec...",1
596,"This is a bad film, with bad writing, and good...",0


Accuracy:  65.54621848739495


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,95,0.013075,49.0,0.494781,0.206751,46.0,0.505219,0.190083
1,movie,88,0.012111,46.0,0.494781,0.194093,42.0,0.505219,0.173554
2,because,14,0.001927,7.0,0.494781,0.029536,7.0,0.505219,0.028926
3,not,35,0.004817,10.0,0.494781,0.042194,25.0,0.505219,0.103306
4,angeles,1,0.000138,,0.494781,,1.0,0.505219,0.004132
...,...,...,...,...,...,...,...,...,...
2205,ugly,1,0.000138,,0.494781,,1.0,0.505219,0.004132
2206,storytellinga,1,0.000138,,0.494781,,1.0,0.505219,0.004132
2207,microsoft,1,0.000138,,0.494781,,1.0,0.505219,0.004132
2208,cg,1,0.000138,,0.494781,,1.0,0.505219,0.004132


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,movie,88,0.013225,45.0,0.51046,0.184426,43.0,0.48954,0.183761
1,bad,30,0.004509,1.0,0.51046,0.004098,29.0,0.48954,0.123932
2,&,7,0.001052,3.0,0.51046,0.012295,4.0,0.48954,0.017094
3,that,73,0.010971,33.0,0.51046,0.135246,40.0,0.48954,0.170940
4,was,90,0.013526,44.0,0.51046,0.180328,46.0,0.48954,0.196581
...,...,...,...,...,...,...,...,...,...
2035,merit,1,0.000150,,0.51046,,1.0,0.48954,0.004274
2036,microsoft,1,0.000150,,0.51046,,1.0,0.48954,0.004274
2037,space,1,0.000150,,0.51046,,1.0,0.48954,0.004274
2038,cg,1,0.000150,,0.51046,,1.0,0.48954,0.004274
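
The cross-validation loop above relies on scikit-learn's `KFold` to partition the training rows. As a sketch of the idea in plain Python, the hypothetical `k_fold_indices` generator below shuffles the row positions and hands out k nearly equal test folds. The row count of 598 and the seed are assumptions for illustration.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k shuffled folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # The first n_rows % k folds get one extra row, so sizes differ by at most 1
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


fold_sizes = [len(test_idx) for _, test_idx in k_fold_indices(598, 5)]
print(fold_sizes)  # [120, 120, 120, 119, 119]
```

Every row lands in exactly one test fold, so each review is used for validation once and for training in the other k - 1 folds.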


In [185]:
predict_calculate_accuracy(train, vocabulary)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame["P(Sentiment = Positive | Sentence)"] = data_frame["IMDB Review"].apply(


Accuracy:  93.64548494983278


93.64548494983278

In [186]:
predict_calculate_accuracy(dev, vocabulary)

Accuracy:  60.0


60.0

In [187]:
predict_calculate_accuracy(test, vocabulary)

Accuracy:  61.333333333333336


61.333333333333336

## Most Useful Words before Smoothing

In [188]:
print("Most Useful Positive sentiment words:")
vocabulary.sort_values("P(Word | Sentiment = Positive)", ascending=False)[:most_useful_limit]

Most Useful Positive sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
16,the,229,0.034415,127.0,0.51046,0.520492,102.0,0.48954,0.435897
32,a,169,0.025398,102.0,0.51046,0.418033,67.0,0.48954,0.286325
31,and,167,0.025098,102.0,0.51046,0.418033,65.0,0.48954,0.277778
29,of,140,0.02104,78.0,0.51046,0.319672,62.0,0.48954,0.264957
27,this,141,0.02119,72.0,0.51046,0.295082,69.0,0.48954,0.294872
8,it,135,0.020289,70.0,0.51046,0.286885,65.0,0.48954,0.277778
10,is,147,0.022092,68.0,0.51046,0.278689,79.0,0.48954,0.337607
46,i,116,0.017433,63.0,0.51046,0.258197,53.0,0.48954,0.226496
65,to,113,0.016982,56.0,0.51046,0.229508,57.0,0.48954,0.24359
86,film,80,0.012023,53.0,0.51046,0.217213,27.0,0.48954,0.115385


In [189]:
print("Most Useful Negative sentiment words:")
vocabulary.sort_values("P(Word | Sentiment = Negative)", ascending=False)[:most_useful_limit]

Most Useful Negative sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
16,the,229,0.034415,127.0,0.51046,0.520492,102.0,0.48954,0.435897
10,is,147,0.022092,68.0,0.51046,0.278689,79.0,0.48954,0.337607
27,this,141,0.02119,72.0,0.51046,0.295082,69.0,0.48954,0.294872
32,a,169,0.025398,102.0,0.51046,0.418033,67.0,0.48954,0.286325
31,and,167,0.025098,102.0,0.51046,0.418033,65.0,0.48954,0.277778
8,it,135,0.020289,70.0,0.51046,0.286885,65.0,0.48954,0.277778
29,of,140,0.02104,78.0,0.51046,0.319672,62.0,0.48954,0.264957
65,to,113,0.016982,56.0,0.51046,0.229508,57.0,0.48954,0.24359
46,i,116,0.017433,63.0,0.51046,0.258197,53.0,0.48954,0.226496
4,was,90,0.013526,44.0,0.51046,0.180328,46.0,0.48954,0.196581
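
Sorting by P(Word | Sentiment) alone mostly surfaces common words such as "the" and "a", which are frequent in both classes. One possible alternative, offered here as a sketch and not as the notebook's method, is to rank words by the log ratio of the two conditional probabilities. The probabilities below are copied from the parameter tables in this notebook; the `log_ratio` helper is hypothetical.

```python
import math

# (P(word | positive), P(word | negative)) copied from the parameter tables
word_probs = {
    "the": (0.520492, 0.435897),
    "film": (0.217213, 0.115385),
    "movie": (0.184426, 0.183761),
    "bad": (0.004098, 0.123932),
}


def log_ratio(p_pos, p_neg):
    """Positive values favor the positive class, negative values the negative class."""
    return math.log(p_pos / p_neg)


ranked = sorted(word_probs, key=lambda w: log_ratio(*word_probs[w]), reverse=True)
for word in ranked:
    print(f"{word:6s} {log_ratio(*word_probs[word]):+.3f}")
```

On these four words, "bad" stands out as strongly negative while "the" and "movie" stay close to zero, which matches the intuition that they carry little sentiment.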


## Smoothing

In [190]:
vocabulary = five_fold_cross_validation(train, True)
vocabulary

---------------------------Fold 1---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"In conclusion, I will not bother with this mov...",0,0.000000e+00,0.000000e+00,False
2,& That movie was bad.,0,1.156946e-07,6.404024e-06,False
3,Generally; it just lacked imagination.,0,0.000000e+00,2.358443e-09,False
4,THERE IS NO PLOT OR STORYLINE!!,0,0.000000e+00,2.443803e-08,False
5,I do not know if this was Emilio Estevez's dir...,1,1.130731e-56,0.000000e+00,True
...,...,...,...,...,...
592,Saw the movie today and thought it was a good ...,1,2.819991e-19,0.000000e+00,True
594,"Utterly without merit on any level, this is ak...",0,0.000000e+00,1.795988e-18,False
595,"With great sound effects, and impressive spec...",1,0.000000e+00,0.000000e+00,False
596,"This is a bad film, with bad writing, and good...",0,0.000000e+00,0.000000e+00,False


Accuracy:  70.0


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,94,0.009920,55.0,0.514583,0.222672,40.0,0.489583,0.170213
1,movie,92,0.009709,49.0,0.514583,0.198381,44.0,0.489583,0.187234
2,because,14,0.001477,7.0,0.514583,0.028340,8.0,0.489583,0.034043
3,not,35,0.003694,11.0,0.514583,0.044534,25.0,0.489583,0.106383
4,angeles,2,0.000211,1.0,0.514583,0.004049,2.0,0.489583,0.008511
...,...,...,...,...,...,...,...,...,...
2244,storytellinga,2,0.000211,1.0,0.514583,0.004049,2.0,0.489583,0.008511
2245,microsoft,2,0.000211,1.0,0.514583,0.004049,2.0,0.489583,0.008511
2246,cg,2,0.000211,1.0,0.514583,0.004049,2.0,0.489583,0.008511
2247,slideshow,2,0.000211,1.0,0.514583,0.004049,2.0,0.489583,0.008511


---------------------------Fold 2---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
1,"Mark my words, this is one of those cult films...",1,0.000000e+00,0.000000e+00,False
3,Generally; it just lacked imagination.,0,0.000000e+00,2.358443e-09,False
6,Perabo has a nice energy level and is obviousl...,1,4.678746e-24,0.000000e+00,True
7,Every element of this story was so over the to...,0,0.000000e+00,0.000000e+00,False
8,This is such a fun and funny movie.,1,1.490107e-08,0.000000e+00,True
...,...,...,...,...,...
587,"It's an empty, hollow shell of a movie.",0,1.068904e-07,1.311096e-08,True
588,The film's sole bright spot was Jonah Hill (wh...,1,1.194431e-46,0.000000e+00,True
590,The only thing really worth watching was the s...,1,4.707281e-20,0.000000e+00,True
593,This is a stunning movie.,1,1.326377e-05,0.000000e+00,True


Accuracy:  63.333333333333336


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,discovering,2,0.000220,2.0,0.502083,0.008299,1.0,0.502083,0.004149
1,years,8,0.000881,7.0,0.502083,0.029046,2.0,0.502083,0.008299
2,cult,5,0.000551,3.0,0.502083,0.012448,3.0,0.502083,0.012448
3,like,21,0.002313,10.0,0.502083,0.041494,12.0,0.502083,0.049793
4,and,163,0.017952,99.0,0.502083,0.410788,65.0,0.502083,0.269710
...,...,...,...,...,...,...,...,...,...
2138,weight,2,0.000220,2.0,0.502083,0.008299,1.0,0.502083,0.004149
2139,stunning,2,0.000220,2.0,0.502083,0.008299,1.0,0.502083,0.004149
2140,torture,2,0.000220,1.0,0.502083,0.004149,2.0,0.502083,0.008299
2141,akin,2,0.000220,1.0,0.502083,0.004149,2.0,0.502083,0.008299


---------------------------Fold 3---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"In conclusion, I will not bother with this mov...",0,0.000000e+00,0.000000e+00,False
1,"Mark my words, this is one of those cult films...",1,0.000000e+00,0.000000e+00,False
2,& That movie was bad.,0,1.156946e-07,6.404024e-06,False
3,Generally; it just lacked imagination.,0,0.000000e+00,2.358443e-09,False
4,THERE IS NO PLOT OR STORYLINE!!,0,0.000000e+00,2.443803e-08,False
...,...,...,...,...,...
592,Saw the movie today and thought it was a good ...,1,2.819991e-19,0.000000e+00,True
593,This is a stunning movie.,1,1.326377e-05,0.000000e+00,True
595,"With great sound effects, and impressive spec...",1,0.000000e+00,0.000000e+00,False
596,"This is a bad film, with bad writing, and good...",0,0.000000e+00,0.000000e+00,False


Accuracy:  67.5


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,96,0.010022,48.0,0.4875,0.205128,49.0,0.516667,0.197581
1,movie,90,0.009396,46.0,0.4875,0.196581,45.0,0.516667,0.181452
2,because,18,0.001879,9.0,0.4875,0.038462,10.0,0.516667,0.040323
3,not,37,0.003863,10.0,0.4875,0.042735,28.0,0.516667,0.112903
4,angeles,2,0.000209,1.0,0.4875,0.004274,2.0,0.516667,0.008065
...,...,...,...,...,...,...,...,...,...
2242,ugly,2,0.000209,1.0,0.4875,0.004274,2.0,0.516667,0.008065
2243,storytellinga,2,0.000209,1.0,0.4875,0.004274,2.0,0.516667,0.008065
2244,microsoft,2,0.000209,1.0,0.4875,0.004274,2.0,0.516667,0.008065
2245,cg,2,0.000209,1.0,0.4875,0.004274,2.0,0.516667,0.008065


---------------------------Fold 4---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"In conclusion, I will not bother with this mov...",0,0.000000e+00,0.000000e+00,False
1,"Mark my words, this is one of those cult films...",1,0.000000e+00,0.000000e+00,False
2,& That movie was bad.,0,1.156946e-07,6.404024e-06,False
4,THERE IS NO PLOT OR STORYLINE!!,0,0.000000e+00,2.443803e-08,False
5,I do not know if this was Emilio Estevez's dir...,1,1.130731e-56,0.000000e+00,True
...,...,...,...,...,...
593,This is a stunning movie.,1,1.326377e-05,0.000000e+00,True
594,"Utterly without merit on any level, this is ak...",0,0.000000e+00,1.795988e-18,False
595,"With great sound effects, and impressive spec...",1,0.000000e+00,0.000000e+00,False
596,"This is a bad film, with bad writing, and good...",0,0.000000e+00,0.000000e+00,False


Accuracy:  75.63025210084034


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,100,0.010732,55.0,0.523909,0.218254,46.0,0.480249,0.199134
1,movie,89,0.009551,45.0,0.523909,0.178571,45.0,0.480249,0.194805
2,because,12,0.001288,7.0,0.523909,0.027778,6.0,0.480249,0.025974
3,not,35,0.003756,9.0,0.523909,0.035714,27.0,0.480249,0.116883
4,angeles,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658
...,...,...,...,...,...,...,...,...,...
2226,ugly,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658
2227,storytellinga,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658
2228,microsoft,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658
2229,cg,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658


---------------------------Fold 5---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"In conclusion, I will not bother with this mov...",0,0.000000e+00,0.000000e+00,False
1,"Mark my words, this is one of those cult films...",1,0.000000e+00,0.000000e+00,False
2,& That movie was bad.,0,1.156946e-07,6.404024e-06,False
3,Generally; it just lacked imagination.,0,0.000000e+00,2.358443e-09,False
4,THERE IS NO PLOT OR STORYLINE!!,0,0.000000e+00,2.443803e-08,False
...,...,...,...,...,...
593,This is a stunning movie.,1,1.326377e-05,0.000000e+00,True
594,"Utterly without merit on any level, this is ak...",0,0.000000e+00,1.795988e-18,False
595,"With great sound effects, and impressive spec...",1,0.000000e+00,0.000000e+00,False
596,"This is a bad film, with bad writing, and good...",0,0.000000e+00,0.000000e+00,False


Accuracy:  68.90756302521008


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,86,0.008879,41.0,0.49896,0.170833,46.0,0.505198,0.189300
1,movie,83,0.008569,43.0,0.49896,0.179167,41.0,0.505198,0.168724
2,because,16,0.001652,8.0,0.49896,0.033333,9.0,0.505198,0.037037
3,not,41,0.004233,13.0,0.49896,0.054167,29.0,0.505198,0.119342
4,angeles,2,0.000206,1.0,0.49896,0.004167,2.0,0.505198,0.008230
...,...,...,...,...,...,...,...,...,...
2286,microsoft,2,0.000206,1.0,0.49896,0.004167,2.0,0.505198,0.008230
2287,sake,2,0.000206,1.0,0.49896,0.004167,2.0,0.505198,0.008230
2288,cg,2,0.000206,1.0,0.49896,0.004167,2.0,0.505198,0.008230
2289,slideshow,2,0.000206,1.0,0.49896,0.004167,2.0,0.505198,0.008230


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,in,100,0.010732,55.0,0.523909,0.218254,46.0,0.480249,0.199134
1,movie,89,0.009551,45.0,0.523909,0.178571,45.0,0.480249,0.194805
2,because,12,0.001288,7.0,0.523909,0.027778,6.0,0.480249,0.025974
3,not,35,0.003756,9.0,0.523909,0.035714,27.0,0.480249,0.116883
4,angeles,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658
...,...,...,...,...,...,...,...,...,...
2226,ugly,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658
2227,storytellinga,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658
2228,microsoft,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658
2229,cg,2,0.000215,1.0,0.523909,0.003968,2.0,0.480249,0.008658


In [191]:
predict_calculate_accuracy(train, vocabulary)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame["P(Sentiment = Positive | Sentence)"] = data_frame["IMDB Review"].apply(


Accuracy:  93.31103678929766


93.31103678929766
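The message "A value is trying to be set on a copy of a slice from a DataFrame" is pandas' `SettingWithCopyWarning`. It typically appears when new columns are assigned on a frame that was obtained by slicing another one. Assuming `train` was produced by slicing the full dataset (the split is not shown in this section), taking an explicit `.copy()` at split time is one way to avoid it. The names `reviews` and `train_part` below are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-in for the review data; the rows come from the Fold 5 table above.
reviews = pd.DataFrame({
    "IMDB Review": [
        "& That movie was bad.",
        "This is a stunning movie.",
        "THERE IS NO PLOT OR STORYLINE!!",
    ],
    "Sentiment": [0, 1, 0],
})

# An explicit copy makes the split an independent DataFrame, so adding
# columns to it is unambiguous and leaves `reviews` untouched.
train_part = reviews.iloc[:2].copy()
train_part["Predicted sentiment"] = [False, True]

print(train_part.columns.tolist())
print(reviews.columns.tolist())
```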

In [192]:
predict_calculate_accuracy(dev, vocabulary)

Accuracy:  68.0


68.0

In [193]:
predict_calculate_accuracy(test, vocabulary)

Accuracy:  78.66666666666667


78.66666666666667
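The three calls report roughly 93.3% on the training split, 68.0% on dev and 78.7% on test, so the model fits the data it was built from noticeably better than held-out reviews. The definition of `predict_calculate_accuracy` is not part of this section; assuming it reports the percentage of rows whose `Predicted sentiment` matches `Sentiment`, the calculation could be sketched as follows (`accuracy_percent` is a hypothetical helper).

```python
def accuracy_percent(labels, predictions):
    """Percentage of rows where the boolean prediction matches the 0/1 label."""
    matches = sum(bool(label) == bool(pred) for label, pred in zip(labels, predictions))
    return 100 * matches / len(labels)


# First five visible rows of the Fold 5 table: labels and predicted sentiment.
labels = [0, 1, 0, 0, 0]
predictions = [False, False, False, False, False]
print(accuracy_percent(labels, predictions))
```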

## Most Useful Words after Smoothing

In [194]:
print("Most Useful Positive sentiment words:")
vocabulary.sort_values("P(Word | Sentiment = Positive)", ascending=False)[:most_useful_limit]

Most Useful Positive sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
45,the,239,0.025649,136.0,0.523909,0.539683,104.0,0.480249,0.450216
22,and,172,0.018459,110.0,0.523909,0.436508,63.0,0.480249,0.272727
6,a,174,0.018674,104.0,0.523909,0.412698,71.0,0.480249,0.307359
28,of,137,0.014703,83.0,0.523909,0.329365,55.0,0.480249,0.238095
5,is,155,0.016634,77.0,0.523909,0.305556,79.0,0.480249,0.341991
11,this,139,0.014917,72.0,0.523909,0.285714,68.0,0.480249,0.294372
88,it,135,0.014488,71.0,0.523909,0.281746,65.0,0.480249,0.281385
14,i,114,0.012234,65.0,0.523909,0.257937,50.0,0.480249,0.21645
97,to,109,0.011698,59.0,0.523909,0.234127,51.0,0.480249,0.220779
0,in,100,0.010732,55.0,0.523909,0.218254,46.0,0.480249,0.199134


In [195]:
print("Most Useful Negative sentiment words:")
vocabulary.sort_values("P(Word | Sentiment = Negative)", ascending=False)[:most_useful_limit]

Most Useful Negative sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
45,the,239,0.025649,136.0,0.523909,0.539683,104.0,0.480249,0.450216
5,is,155,0.016634,77.0,0.523909,0.305556,79.0,0.480249,0.341991
6,a,174,0.018674,104.0,0.523909,0.412698,71.0,0.480249,0.307359
11,this,139,0.014917,72.0,0.523909,0.285714,68.0,0.480249,0.294372
88,it,135,0.014488,71.0,0.523909,0.281746,65.0,0.480249,0.281385
22,and,172,0.018459,110.0,0.523909,0.436508,63.0,0.480249,0.272727
28,of,137,0.014703,83.0,0.523909,0.329365,55.0,0.480249,0.238095
97,to,109,0.011698,59.0,0.523909,0.234127,51.0,0.480249,0.220779
14,i,114,0.012234,65.0,0.523909,0.257937,50.0,0.480249,0.21645
0,in,100,0.010732,55.0,0.523909,0.218254,46.0,0.480249,0.199134
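Both lists are headed by the same function words ("the", "a", "is", "and") because sorting by `P(Word | Sentiment)` rewards words that are frequent in every review. One alternative, offered here as a sketch rather than as the notebook's method, is to rank words by the log ratio of the two conditional probabilities, which highlights words that separate the classes. The rows are copied from the smoothed vocabulary tables above; the `log_ratio` column name is an assumption.

```python
import numpy as np
import pandas as pd

# A few rows copied from the smoothed vocabulary tables above.
vocab_sample = pd.DataFrame({
    "Word": ["the", "and", "in", "movie", "not"],
    "P(Word | Sentiment = Positive)": [0.539683, 0.436508, 0.218254, 0.178571, 0.035714],
    "P(Word | Sentiment = Negative)": [0.450216, 0.272727, 0.199134, 0.194805, 0.116883],
})

# log_ratio > 0 leans positive, < 0 leans negative, near 0 is uninformative.
vocab_sample["log_ratio"] = np.log(
    vocab_sample["P(Word | Sentiment = Positive)"]
    / vocab_sample["P(Word | Sentiment = Negative)"]
)

ranked = vocab_sample.sort_values("log_ratio", ascending=False)
print(ranked[["Word", "log_ratio"]])
```

In this sample, "and" ranks as the most positive word and "not" as the most negative, while "in" and "movie" sit close to zero.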
