<a href="https://colab.research.google.com/github/VighneshS/sentiment_prediction/blob/master/sentiment_prediction.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://vighnesh-studies.blogspot.com/2021/04/sentiment-prediction-using-naive-bayes.html" target="_blank">BLOG</a>

# Sentiment Prediction using Naive Bayes Classifier (NBC)
This notebook shows how a Naive Bayes Classifier (NBC) works and how it can be used to classify text by sentiment.

We will also see how smoothing helps it cope with missing data, that is, words it did not see during training.

## Settings
Training ratio, number of cross-validation folds, and the number of most useful words to display.

In [None]:
TRAINING_RATIO = 80 / 100
K_FOLDS = 5
MOST_USEFUL_LIMIT = 20

MOST_COMMON_WORDS_IN_DATA_SET = ["movie", "film", "one"]

## Constants

In [None]:
REVIEW_COL = "IMDB Review"
WORD_FREQ_COL = "Word Frequency"
SENTIMENT_COL = "Sentiment"
POS_SENTIMENT_WORD_FREQ_COL = "Positive Sentiment Word Frequency"
NEG_SENTIMENT_WORD_FREQ_COL = "Negative Sentiment Word Frequency"
P_SENTIMENT_POSITIVE_COL = "P(Sentiment = Positive)"
P_SENTIMENT_NEGATIVE_COL = "P(Sentiment = Negative)"
P_WORD_COL = "P(Word)"
WORD_COL = "Word"
P_WORD_GIVEN_SENTIMENT_POSITIVE_COL = "P(Word | Sentiment = Positive)"
P_WORD_GIVEN_SENTIMENT_NEGATIVE_COL = "P(Word | Sentiment = Negative)"
P_SENTIMENT_POSITIVE_GIVEN_SENTENCE_COL = "P(Sentiment = Positive | Sentence)"
P_SENTIMENT_NEGATIVE_GIVEN_SENTENCE_COL = "P(Sentiment = Negative | Sentence)"
PREDICTED_SENTIMENT_COL = "Predicted sentiment"

## Importing the Data
We used the [kaggle dataset](https://storage.googleapis.com/kagglesdsdata/datasets/22169/30047/sentiment%20labelled%20sentences/imdb_labelled.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210425%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210425T202010Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=6133706ef10bc2dcd0b58f8398b4d73ab9e9d788de1718b07334df91f6007e1e4ca0b78e3176f95b8250e0c4535ce1633528f4fabffeb7e4124af3ee3f895ac34c03044fca9b23b23c4ddb8fa90d84dfc14869ff4806f03783cafad53b19445b3c3052983fdf1ca4384257eac1bc0a4270d238a1ea89d1289866c7a0ea7ad7c97a76f2e142c148019e39cc5a1295f92650747ac5ea5946b026f7ad6d5d262d4c4a370aee6bc1f5d5b445bb6d93692debe678a79e5e1c1fe3d3e68ea4f2fad3115795d3361e0626e98156fbc7f5967beb7cf0f00e07351d23a00d8677ebb75e3e13b1bfa07762266efabf6f6f9d53206be31b7623cf3614f60f8cf5011cf23def) to get the ground truth of sample IMDB reviews.

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from IPython.display import display
import math
from sklearn.model_selection import KFold
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stop_words = stopwords.words('english')

In [None]:
data = pd.read_csv(
    r"http://storage.googleapis.com/kagglesdsdata/datasets/22169/30047/sentiment%20labelled%20sentences/imdb_labelled.txt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210425%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210425T202010Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=6133706ef10bc2dcd0b58f8398b4d73ab9e9d788de1718b07334df91f6007e1e4ca0b78e3176f95b8250e0c4535ce1633528f4fabffeb7e4124af3ee3f895ac34c03044fca9b23b23c4ddb8fa90d84dfc14869ff4806f03783cafad53b19445b3c3052983fdf1ca4384257eac1bc0a4270d238a1ea89d1289866c7a0ea7ad7c97a76f2e142c148019e39cc5a1295f92650747ac5ea5946b026f7ad6d5d262d4c4a370aee6bc1f5d5b445bb6d93692debe678a79e5e1c1fe3d3e68ea4f2fad3115795d3361e0626e98156fbc7f5967beb7cf0f00e07351d23a00d8677ebb75e3e13b1bfa07762266efabf6f6f9d53206be31b7623cf3614f60f8cf5011cf23def",
    delimiter="\t", header=None, names=[REVIEW_COL, SENTIMENT_COL])
data = data.sample(frac=1).reset_index(drop=True)

### Split Data
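We split the data into train, development and test sets: the first 80% of the shuffled rows (TRAINING_RATIO) form the training set, and the remaining 20% is split evenly into development and test sets.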

In [None]:
train = data[:math.floor(data.shape[0] * TRAINING_RATIO)]

In [None]:
validation = data[math.floor(data.shape[0] * TRAINING_RATIO):].sample(frac=1).reset_index(drop=True)
dev, test = np.array_split(validation, 2)
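
The split above can be checked on plain row positions. The following is a minimal sketch, assuming a hypothetical dataset of 1000 rows; the names `rows`, `train_rows`, `dev_rows` and `test_rows` are illustrative and not part of the notebook.

```python
import math
import random

TRAINING_RATIO = 80 / 100

# Hypothetical stand-in for the shuffled review rows
rows = list(range(1000))
random.Random(0).shuffle(rows)

# The first 80% of the rows become the training set
cut = math.floor(len(rows) * TRAINING_RATIO)
train_rows = rows[:cut]
held_out = rows[cut:]

# Split the held-out rows into two halves; like np.array_split,
# the first half gets the extra row when the count is odd
half = math.ceil(len(held_out) / 2)
dev_rows, test_rows = held_out[:half], held_out[half:]

print(len(train_rows), len(dev_rows), len(test_rows))
```

The three sets do not overlap, so the dev and test accuracies are measured on reviews the model never saw during training.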

In [None]:
display(train, dev, test)

Unnamed: 0,IMDB Review,Sentiment
0,"Technically, the film is well made with impres...",1
1,This movie is so awesome!,1
2,The scenes are often funny and occasionally to...,1
3,I do not know if this was Emilio Estevez's dir...,1
4,There was a few pathetic attempts to give the ...,0
...,...,...
593,"Not frightening in the least, and barely compr...",0
594,The only place good for this film is in the ga...,0
595,"This convention never worked well in the past,...",0
596,Hayao Miyazaki's latest and eighth film for St...,1


Unnamed: 0,IMDB Review,Sentiment
0,There still are good actors around!,1
1,Damian is so talented and versatile in so many...,1
2,The movie showed a lot of Florida at it's best...,1
3,I never walked out of a movie faster.,0
4,"People who like European films and ""art movies...",1
...,...,...
70,"It felt like a very gripping, intelligent stag...",1
71,"It's a sad movie, but very good.",1
72,But other than that the movie seemed to drag a...,0
73,Lewis Black's considerable talent is wasted he...,0


Unnamed: 0,IMDB Review,Sentiment
75,I didn't realize how wonderful the short reall...,1
76,The movie in movie situations in the beginning...,1
77,"Aside from it's terrible lead, this film has l...",0
78,"The movie was so boring, that I sometimes foun...",0
79,Generally; it just lacked imagination.,0
...,...,...
145,Everything from acting to cinematography was s...,1
146,I rather enjoyed it.,1
147,Meredith M was better than all right.,1
148,You can find better movies at youtube.,0


## Generation of Vocabulary list

In [None]:
def split_words(review: str, remove_stop_words: bool):
    # Lowercase, strip punctuation and the possessive 's, and treat "/" as a word separator
    cleaned = review.lower()
    for token in (',', '"', '(', ')', "'s", '.', '!'):
        cleaned = cleaned.replace(token, '')
    reviews_array = cleaned.replace('/', ' ').split()
    return [w for w in reviews_array if w not in stop_words] if remove_stop_words else reviews_array


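def get_word_count(review_data_frame: pd.DataFrame, column_name: str, remove_stop_words: bool):
    # Build one row per review and one column per word, then count the non-empty cells in each column.
    # The result is the number of reviews that contain each word, not the total number of occurrences.
    vocab = review_data_frame[REVIEW_COL].apply(lambda review: pd.value_counts(
        split_words(review, remove_stop_words))).count(axis=0).to_frame()
    vocab.columns = [column_name]
    vocab.reset_index(inplace=True)
    vocab = vocab.rename(columns={'index': WORD_COL})
    return vocab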
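
To make the counting concrete, here is a small standard-library sketch of the same idea: each review contributes at most 1 to a word's count. The mini-corpus and the simplified `tokenize` helper are hypothetical and only approximate what `split_words` does.

```python
from collections import Counter

# Hypothetical mini-corpus; the notebook uses IMDB reviews instead
reviews = ["Good movie, good cast!", "Bad movie.", "Good fun"]


def tokenize(review):
    # Simplified stand-in for split_words: lowercase and strip a few punctuation marks
    for mark in ',.!':
        review = review.replace(mark, '')
    return review.lower().split()


document_frequency = Counter()
for review in reviews:
    # set() makes each review contribute at most 1 per word
    document_frequency.update(set(tokenize(review)))

print(document_frequency["good"], document_frequency["movie"], document_frequency["bad"])
```

Note that "good" appears twice in the first review but is still counted once for that review.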


## Get Naive Bayes Parameters
Here we have a function to generate the Naive Bayes parameters:

1. Word Frequency
2. P(Word)
3. Positive Sentiment Word Frequency
4. P(Sentiment = Positive)
5. P(Word | Sentiment = Positive)
6. Negative Sentiment Word Frequency
7. P(Sentiment = Negative)
8. P(Word | Sentiment = Negative)

These are used to compute:

**P(Sentiment | Sentence (Collection of words)) = P(Sentence | Sentiment) * P(Sentiment) / P(Sentence)**

P(Sentence) does not depend on the sentiment, so it is the same for both classes and can be dropped when we only compare the positive and negative scores.
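
A short numeric check illustrates why the denominator can be left out. The scores below are made-up values, not results from the dataset.

```python
# Hypothetical unnormalised scores: P(Sentence | Sentiment) * P(Sentiment)
score_positive = 0.012 * 0.52
score_negative = 0.003 * 0.48

# P(Sentence) is the sum of the numerators over both sentiments
evidence = score_positive + score_negative
posterior_positive = score_positive / evidence
posterior_negative = score_negative / evidence

# Dividing both scores by the same positive number keeps their order,
# so the predicted sentiment is the same with or without the denominator
same_decision = (score_positive > score_negative) == (posterior_positive > posterior_negative)
print(same_decision)
```

The normalised values are only needed when actual probabilities are wanted, not for choosing between the two sentiments.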


In [None]:
def generate_naive_bayes_parameters(data_frame: pd.DataFrame, smoothening: bool, remove_stop_words: bool):
    naive_bayes_parameters = get_word_count(data_frame, WORD_FREQ_COL, remove_stop_words)
    if smoothening:
        naive_bayes_parameters[WORD_FREQ_COL] += 1

    total_words = naive_bayes_parameters[WORD_FREQ_COL].sum(axis=0)
    if smoothening:
        total_words += 2

    total_sentiments = data_frame.count(axis=0)[SENTIMENT_COL]
    if smoothening:
        total_sentiments += 2

    naive_bayes_parameters[P_WORD_COL] = naive_bayes_parameters[WORD_FREQ_COL].div(total_words)

    positive_sentiments = data_frame[data_frame[SENTIMENT_COL] == 1]
    positive_vocabulary = get_word_count(positive_sentiments, POS_SENTIMENT_WORD_FREQ_COL, remove_stop_words)
    naive_bayes_parameters = naive_bayes_parameters.merge(positive_vocabulary, how='left', on=WORD_COL)
    if smoothening:
        naive_bayes_parameters[POS_SENTIMENT_WORD_FREQ_COL] += 1
        naive_bayes_parameters[POS_SENTIMENT_WORD_FREQ_COL] = naive_bayes_parameters[
            POS_SENTIMENT_WORD_FREQ_COL].fillna(
            value=1)

    total_positive_words = positive_sentiments.count(axis=0)[SENTIMENT_COL]
    if smoothening:
        total_positive_words += 2

    probability_of_positive_sentiments = total_positive_words / total_sentiments
    naive_bayes_parameters[P_SENTIMENT_POSITIVE_COL] = probability_of_positive_sentiments

    naive_bayes_parameters[P_WORD_GIVEN_SENTIMENT_POSITIVE_COL] = naive_bayes_parameters[
        POS_SENTIMENT_WORD_FREQ_COL].div(
        total_positive_words)

    negative_sentiments = data_frame[data_frame[SENTIMENT_COL] == 0]
    negative_vocabulary = get_word_count(negative_sentiments, NEG_SENTIMENT_WORD_FREQ_COL, remove_stop_words)
    naive_bayes_parameters = naive_bayes_parameters.merge(negative_vocabulary, how='left', on=WORD_COL)
    if smoothening:
        naive_bayes_parameters[NEG_SENTIMENT_WORD_FREQ_COL] += 1
        naive_bayes_parameters[NEG_SENTIMENT_WORD_FREQ_COL] = naive_bayes_parameters[
            NEG_SENTIMENT_WORD_FREQ_COL].fillna(
            value=1)

    total_negative_words = negative_sentiments.count(axis=0)[SENTIMENT_COL]
    if smoothening:
        total_negative_words += 2

    probability_of_negative_sentiments = total_negative_words / total_sentiments
    naive_bayes_parameters[P_SENTIMENT_NEGATIVE_COL] = probability_of_negative_sentiments

    naive_bayes_parameters[P_WORD_GIVEN_SENTIMENT_NEGATIVE_COL] = naive_bayes_parameters[
        NEG_SENTIMENT_WORD_FREQ_COL].div(
        total_negative_words)

    return naive_bayes_parameters


## To Get the Probabilities

We use this formula to get the probabilities:

**P(Sentiment | Sentence (Collection of words)) = P(Sentence | Sentiment) * P(Sentiment) / P(Sentence)**

The function below calculates only the numerator. The denominator P(Sentence) is the same for both sentiments, so it can be ignored when the two scores are compared.

To calculate P(Sentence | Sentiment), we treat the sentence as a collection of words:

**P(Sentence | Sentiment) = P(Word_1,Word_2,...,Word_n | Sentiment)**

Naive Bayes assumes that the words are conditionally independent given the sentiment, so this becomes:

**P(Word_1,Word_2,...,Word_n | Sentiment) = P(Word_1 | Sentiment) * P(Word_2 | Sentiment) * ... * P(Word_n | Sentiment)**
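
The product of per-word probabilities can be sketched with made-up numbers as follows. The second half shows a common alternative that the notebook does not use: summing logarithms instead of multiplying, which avoids floating-point underflow on long reviews. All values and names here are illustrative assumptions.

```python
import math

# Hypothetical per-word likelihoods P(Word | Sentiment) and prior P(Sentiment)
word_likelihoods = [0.3, 0.05, 0.008, 0.2]
prior = 0.52

# Direct product, as in the formula above
direct_score = prior * math.prod(word_likelihoods)

# Same quantity expressed as a sum of logarithms
log_score = math.log(prior) + sum(math.log(p) for p in word_likelihoods)
print(direct_score, math.exp(log_score))

# With many words the plain product underflows to 0.0,
# while the log score stays finite and can still be compared
long_review = [0.001] * 200
underflowed = math.prod(long_review)
long_log_score = sum(math.log(p) for p in long_review)
print(underflowed, long_log_score)
```

Because the logarithm is monotonic, comparing log scores gives the same prediction as comparing the products whenever both products are non-zero.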


In [None]:
def get_probabilities(review: str, naive_bayes_parameters: pd.DataFrame, sentiment: bool, smoothening: bool,
                      remove_stop_words: bool):
    column_name = P_WORD_GIVEN_SENTIMENT_POSITIVE_COL if sentiment else P_WORD_GIVEN_SENTIMENT_NEGATIVE_COL
    prior = naive_bayes_parameters[P_SENTIMENT_POSITIVE_COL if sentiment else P_SENTIMENT_NEGATIVE_COL].iloc[0]
    # Look up P(Word | Sentiment) by word instead of scanning every value of the frame for each word
    word_probabilities = naive_bayes_parameters.set_index(WORD_COL)[column_name]
    # Words missing from the vocabulary get 0 without smoothening. With smoothening they get the smallest
    # smoothed probability in the table, which normally corresponds to a smoothed count of 1.
    unseen_prob = word_probabilities.min() if smoothening else 0
    prob = 1
    for word in split_words(review, remove_stop_words):
        # Resolved per word, so an unknown word never reuses the previous word's probability
        individual_prob = word_probabilities.get(word, unseen_prob)
        prob *= 0 if math.isnan(individual_prob) else individual_prob
    return prob * prior

In [None]:
def predict_calculate_accuracy(data_frame: pd.DataFrame, naive_bayes_parameters: pd.DataFrame, smoothening: bool,
                               remove_stop_words: bool):
    # Work on a copy so the caller's frame is left untouched; this also avoids pandas' SettingWithCopyWarning
    data_frame = data_frame.copy()
    data_frame[P_SENTIMENT_POSITIVE_GIVEN_SENTENCE_COL] = data_frame[REVIEW_COL].apply(
        lambda review: get_probabilities(review, naive_bayes_parameters, True, smoothening, remove_stop_words))
    data_frame[P_SENTIMENT_NEGATIVE_GIVEN_SENTENCE_COL] = data_frame[REVIEW_COL].apply(
        lambda review: get_probabilities(review, naive_bayes_parameters, False, smoothening, remove_stop_words))
    data_frame[PREDICTED_SENTIMENT_COL] = data_frame[P_SENTIMENT_POSITIVE_GIVEN_SENTENCE_COL] > data_frame[
        P_SENTIMENT_NEGATIVE_GIVEN_SENTENCE_COL]
    # Percentage of rows where the predicted sentiment matches the labelled sentiment
    accuracy = (data_frame[PREDICTED_SENTIMENT_COL] == data_frame[SENTIMENT_COL]).mean() * 100
    print("Accuracy: ", accuracy)
    # print("Wrong Predictions:")
    # display(data_frame.loc[data_frame[PREDICTED_SENTIMENT_COL] != data_frame[SENTIMENT_COL]].reset_index(drop=True))
    return accuracy


## Calculating Accuracy

To calculate accuracy we first divide the training dataset into k folds. In each round, k-1 folds are used to train the model and the remaining fold is used to test it.

We then predict the sentiment of the held-out fold using the Naive Bayes parameters obtained from training.

Accuracy is (number of correct predictions) / (total number of predictions), expressed as a percentage.

The parameters from the fold with the best accuracy are kept and used for further validation on the dev and test datasets which we separated in the beginning.
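
The fold structure and the accuracy formula can be sketched in plain Python. This is only a conceptual illustration of what a k-fold split does; the notebook itself relies on scikit-learn's KFold, and the helpers `k_fold_indices` and `accuracy` are hypothetical names introduced here.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    # Hypothetical helper: shuffle row positions and deal them into k folds
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    folds = [positions[i::k] for i in range(k)]
    for held_out in range(k):
        # One fold is held out for testing, the other k-1 folds are used for training
        test_idx = folds[held_out]
        train_idx = [p for i, fold in enumerate(folds) if i != held_out for p in fold]
        yield train_idx, test_idx


def accuracy(predicted, actual):
    # Percentage of rows whose prediction matches the label
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct * 100 / len(actual)


for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(fold_number, len(train_idx), len(test_idx))

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Every row lands in exactly one test fold, so each review is used for testing once and for training k-1 times.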


In [None]:
def five_fold_cross_validation(data_frame: pd.DataFrame, smoothening: bool, remove_stop_words: bool):
    kf = KFold(n_splits=K_FOLDS, shuffle=True)
    accuracies = []
    max_accuracy_naive_bayes_parameters = pd.DataFrame()
    # kf.split yields positional indices, so rows are selected from data_frame (not the global train) with iloc
    for index, (train_training, train_testing) in enumerate(kf.split(data_frame)):
        print(f"---------------------------Fold {index + 1}---------------------------------")
        fold_train = data_frame.iloc[train_training]
        display(fold_train)
        trained_parameters = generate_naive_bayes_parameters(fold_train, smoothening, remove_stop_words)
        accuracy = predict_calculate_accuracy(data_frame.iloc[train_testing], trained_parameters, smoothening,
                                              remove_stop_words)
        accuracies.append(accuracy)
        # Keep the parameters of the fold with the highest accuracy so far
        if accuracy == max(accuracies):
            max_accuracy_naive_bayes_parameters = trained_parameters
        display(trained_parameters)
    return max_accuracy_naive_bayes_parameters


vocabulary = five_fold_cross_validation(train, False, False)
vocabulary

---------------------------Fold 1---------------------------------


Unnamed: 0,IMDB Review,Sentiment
1,This movie is so awesome!,1
2,The scenes are often funny and occasionally to...,1
3,I do not know if this was Emilio Estevez's dir...,1
4,There was a few pathetic attempts to give the ...,0
5,"In fact, this stinker smells like a direct-to-...",0
...,...,...
590,"The acting is fantastic, the stories are seaml...",1
591,20th Century Fox's ROAD HOUSE 1948) is not onl...,0
592,"Just consider the excellent story, solid actin...",1
593,"Not frightening in the least, and barely compr...",0


Accuracy:  60.0


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,147,0.018507,74.0,0.520921,0.297189,73.0,0.479079,0.318777
1,this,140,0.017626,73.0,0.520921,0.293173,67.0,0.479079,0.292576
2,movie,83,0.010449,45.0,0.520921,0.180723,38.0,0.479079,0.165939
3,so,33,0.004155,14.0,0.520921,0.056225,19.0,0.479079,0.082969
4,awesome,3,0.000378,2.0,0.520921,0.008032,1.0,0.479079,0.004367
...,...,...,...,...,...,...,...,...,...
2560,bonuses,1,0.000126,1.0,0.520921,0.004016,,0.479079,
2561,comprehensible,1,0.000126,,0.520921,,1.0,0.479079,0.004367
2562,past,1,0.000126,,0.520921,,1.0,0.479079,0.004367
2563,convention,1,0.000126,,0.520921,,1.0,0.479079,0.004367


---------------------------Fold 2---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,"Technically, the film is well made with impres...",1
1,This movie is so awesome!,1
2,The scenes are often funny and occasionally to...,1
3,I do not know if this was Emilio Estevez's dir...,1
4,There was a few pathetic attempts to give the ...,0
...,...,...
593,"Not frightening in the least, and barely compr...",0
594,The only place good for this film is in the ga...,0
595,"This convention never worked well in the past,...",0
596,Hayao Miyazaki's latest and eighth film for St...,1


Accuracy:  63.333333333333336


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,158,0.022092,79.0,0.518828,0.318548,79.0,0.481172,0.343478
1,a,167,0.023350,101.0,0.518828,0.407258,66.0,0.481172,0.286957
2,from,20,0.002796,12.0,0.518828,0.048387,8.0,0.481172,0.034783
3,coming,2,0.000280,2.0,0.518828,0.008065,,0.481172,
4,vocal,1,0.000140,1.0,0.518828,0.004032,,0.481172,
...,...,...,...,...,...,...,...,...,...
2265,overly,1,0.000140,1.0,0.518828,0.004032,,0.481172,
2266,film-maker,1,0.000140,1.0,0.518828,0.004032,,0.481172,
2267,fall,1,0.000140,1.0,0.518828,0.004032,,0.481172,
2268,worthy,1,0.000140,1.0,0.518828,0.004032,,0.481172,


---------------------------Fold 3---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,"Technically, the film is well made with impres...",1
2,The scenes are often funny and occasionally to...,1
3,I do not know if this was Emilio Estevez's dir...,1
4,There was a few pathetic attempts to give the ...,0
5,"In fact, this stinker smells like a direct-to-...",0
...,...,...
592,"Just consider the excellent story, solid actin...",1
594,The only place good for this film is in the ga...,0
595,"This convention never worked well in the past,...",0
596,Hayao Miyazaki's latest and eighth film for St...,1


Accuracy:  59.166666666666664


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,155,0.019103,73.0,0.491632,0.310638,82.0,0.508368,0.337449
1,a,176,0.021691,101.0,0.491632,0.429787,75.0,0.508368,0.308642
2,from,24,0.002958,13.0,0.491632,0.055319,11.0,0.508368,0.045267
3,coming,2,0.000246,2.0,0.491632,0.008511,,0.508368,
4,vocal,1,0.000123,1.0,0.491632,0.004255,,0.508368,
...,...,...,...,...,...,...,...,...,...
2563,overly,1,0.000123,1.0,0.491632,0.004255,,0.508368,
2564,film-maker,1,0.000123,1.0,0.491632,0.004255,,0.508368,
2565,worthy,1,0.000123,1.0,0.491632,0.004255,,0.508368,
2566,trap,1,0.000123,1.0,0.491632,0.004255,,0.508368,


---------------------------Fold 4---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,"Technically, the film is well made with impres...",1
1,This movie is so awesome!,1
2,The scenes are often funny and occasionally to...,1
5,"In fact, this stinker smells like a direct-to-...",0
6,Lame would be the best way to describe it.,0
...,...,...
592,"Just consider the excellent story, solid actin...",1
593,"Not frightening in the least, and barely compr...",0
594,The only place good for this film is in the ga...,0
596,Hayao Miyazaki's latest and eighth film for St...,1


Accuracy:  57.142857142857146


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,157,0.019778,79.0,0.503132,0.327801,78.0,0.496868,0.327731
1,a,168,0.021164,97.0,0.503132,0.402490,71.0,0.496868,0.298319
2,from,22,0.002771,12.0,0.503132,0.049793,10.0,0.496868,0.042017
3,coming,2,0.000252,2.0,0.503132,0.008299,,0.496868,
4,vocal,1,0.000126,1.0,0.503132,0.004149,,0.496868,
...,...,...,...,...,...,...,...,...,...
2498,overly,1,0.000126,1.0,0.503132,0.004149,,0.496868,
2499,film-maker,1,0.000126,1.0,0.503132,0.004149,,0.496868,
2500,worthy,1,0.000126,1.0,0.503132,0.004149,,0.496868,
2501,trap,1,0.000126,1.0,0.503132,0.004149,,0.496868,


---------------------------Fold 5---------------------------------


Unnamed: 0,IMDB Review,Sentiment
0,"Technically, the film is well made with impres...",1
1,This movie is so awesome!,1
3,I do not know if this was Emilio Estevez's dir...,1
4,There was a few pathetic attempts to give the ...,0
5,"In fact, this stinker smells like a direct-to-...",0
...,...,...
593,"Not frightening in the least, and barely compr...",0
594,The only place good for this film is in the ga...,0
595,"This convention never worked well in the past,...",0
596,Hayao Miyazaki's latest and eighth film for St...,1


Accuracy:  65.54621848739495


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,143,0.019662,67.0,0.515658,0.271255,76.0,0.484342,0.327586
1,a,169,0.023237,103.0,0.515658,0.417004,66.0,0.484342,0.284483
2,from,24,0.003300,14.0,0.515658,0.056680,10.0,0.484342,0.043103
3,coming,1,0.000137,1.0,0.515658,0.004049,,0.484342,
4,vocal,1,0.000137,1.0,0.515658,0.004049,,0.484342,
...,...,...,...,...,...,...,...,...,...
2305,rare,1,0.000137,1.0,0.515658,0.004049,,0.484342,
2306,overly,1,0.000137,1.0,0.515658,0.004049,,0.484342,
2307,film-maker,1,0.000137,1.0,0.515658,0.004049,,0.484342,
2308,worthy,1,0.000137,1.0,0.515658,0.004049,,0.484342,


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,143,0.019662,67.0,0.515658,0.271255,76.0,0.484342,0.327586
1,a,169,0.023237,103.0,0.515658,0.417004,66.0,0.484342,0.284483
2,from,24,0.003300,14.0,0.515658,0.056680,10.0,0.484342,0.043103
3,coming,1,0.000137,1.0,0.515658,0.004049,,0.484342,
4,vocal,1,0.000137,1.0,0.515658,0.004049,,0.484342,
...,...,...,...,...,...,...,...,...,...
2305,rare,1,0.000137,1.0,0.515658,0.004049,,0.484342,
2306,overly,1,0.000137,1.0,0.515658,0.004049,,0.484342,
2307,film-maker,1,0.000137,1.0,0.515658,0.004049,,0.484342,
2308,worthy,1,0.000137,1.0,0.515658,0.004049,,0.484342,


In [None]:
predict_calculate_accuracy(train, vocabulary, False, False)



Accuracy:  92.3076923076923




92.3076923076923

In [None]:
predict_calculate_accuracy(dev, vocabulary, False, False)

Accuracy:  60.0


60.0

In [None]:
predict_calculate_accuracy(test, vocabulary, False, False)

Accuracy:  62.666666666666664


62.666666666666664

## Most Useful words before Smoothing

In [None]:
print("Most Useful Positive sentiment words:")
vocabulary.sort_values(P_WORD_GIVEN_SENTIMENT_POSITIVE_COL, ascending=False)[:MOST_USEFUL_LIMIT]

Most Useful Positive sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
19,the,230,0.031624,128.0,0.515658,0.518219,102.0,0.484342,0.439655
29,and,180,0.024749,114.0,0.515658,0.461538,66.0,0.484342,0.284483
1,a,169,0.023237,103.0,0.515658,0.417004,66.0,0.484342,0.284483
39,of,152,0.020899,87.0,0.515658,0.352227,65.0,0.484342,0.280172
31,it,131,0.018012,73.0,0.515658,0.295547,58.0,0.484342,0.25
0,is,143,0.019662,67.0,0.515658,0.271255,76.0,0.484342,0.327586
32,this,136,0.018699,67.0,0.515658,0.271255,69.0,0.484342,0.297414
44,i,112,0.015399,61.0,0.515658,0.246964,51.0,0.484342,0.219828
74,to,111,0.015262,59.0,0.515658,0.238866,52.0,0.484342,0.224138
80,in,96,0.0132,54.0,0.515658,0.218623,42.0,0.484342,0.181034


In [None]:
print("Most Useful Negative sentiment words:")
vocabulary.sort_values(P_WORD_GIVEN_SENTIMENT_NEGATIVE_COL, ascending=False)[:MOST_USEFUL_LIMIT]

Most Useful Negative sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
19,the,230,0.031624,128.0,0.515658,0.518219,102.0,0.484342,0.439655
0,is,143,0.019662,67.0,0.515658,0.271255,76.0,0.484342,0.327586
32,this,136,0.018699,67.0,0.515658,0.271255,69.0,0.484342,0.297414
1,a,169,0.023237,103.0,0.515658,0.417004,66.0,0.484342,0.284483
29,and,180,0.024749,114.0,0.515658,0.461538,66.0,0.484342,0.284483
39,of,152,0.020899,87.0,0.515658,0.352227,65.0,0.484342,0.280172
31,it,131,0.018012,73.0,0.515658,0.295547,58.0,0.484342,0.25
74,to,111,0.015262,59.0,0.515658,0.238866,52.0,0.484342,0.224138
44,i,112,0.015399,61.0,0.515658,0.246964,51.0,0.484342,0.219828
59,was,86,0.011825,43.0,0.515658,0.174089,43.0,0.484342,0.185345


## Smoothening

Smoothening compensates for unknown words. Not every word can appear in the training vocabulary, and without smoothening a word that was never seen with a given sentiment gets a probability of 0, which wipes out the whole product.

Smoothening uses the +1 method and is done in the generate_naive_bayes_parameters function.

It adds +1 to the following:
1. Word Frequency
2. Positive Sentiment Word Frequency
3. Negative Sentiment Word Frequency

It also adds +2 to the following totals. These terms are in the denominator, so they have to compensate for the +1 in the numerator and keep the probability of even the most frequent words below 1:
1. Total words
2. Total Positive sentiments
3. Total Negative sentiments
4. Total sentiments
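
The effect of the +1 and +2 adjustments on a single word probability can be sketched as follows. The helper `word_likelihood` and the counts are hypothetical; they only mirror the adjustment described above for P(Word | Sentiment).

```python
def word_likelihood(reviews_with_word, reviews_in_class, smoothening):
    # Mirrors the scheme described above: +1 on the count, +2 on the class total
    if smoothening:
        return (reviews_with_word + 1) / (reviews_in_class + 2)
    return reviews_with_word / reviews_in_class


# Hypothetical counts: 240 negative reviews, none of which contain the word
unsmoothed_unseen = word_likelihood(0, 240, smoothening=False)
smoothed_unseen = word_likelihood(0, 240, smoothening=True)

# A word present in every review of the class stays below 1 after smoothening
smoothed_everywhere = word_likelihood(240, 240, smoothening=True)

print(unsmoothed_unseen, smoothed_unseen, smoothed_everywhere)
```

Without smoothening the unseen word contributes a factor of 0 and cancels the whole product; with smoothening it contributes a small but non-zero factor.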

In [None]:
vocabulary = five_fold_cross_validation(train, True, False)
vocabulary

---------------------------Fold 1---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"Technically, the film is well made with impres...",1,4.087259e-56,0.000000e+00,True
1,This movie is so awesome!,1,2.890468e-06,0.000000e+00,True
2,The scenes are often funny and occasionally to...,1,1.323531e-27,0.000000e+00,True
3,I do not know if this was Emilio Estevez's dir...,1,2.832706e-56,0.000000e+00,True
4,There was a few pathetic attempts to give the ...,0,0.000000e+00,1.372547e-30,False
...,...,...,...,...,...
593,"Not frightening in the least, and barely compr...",0,0.000000e+00,6.004679e-12,False
594,The only place good for this film is in the ga...,0,0.000000e+00,4.070636e-12,False
595,"This convention never worked well in the past,...",0,0.000000e+00,9.498045e-21,False
596,Hayao Miyazaki's latest and eighth film for St...,1,5.209350e-50,0.000000e+00,True


Accuracy:  72.5


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,157,0.015695,81.0,0.522917,0.322709,77.0,0.48125,0.333333
1,a,172,0.017195,105.0,0.522917,0.418327,68.0,0.48125,0.294372
2,from,22,0.002199,13.0,0.522917,0.051793,10.0,0.48125,0.043290
3,coming,3,0.000300,3.0,0.522917,0.011952,1.0,0.48125,0.004329
4,vocal,2,0.000200,2.0,0.522917,0.007968,1.0,0.48125,0.004329
...,...,...,...,...,...,...,...,...,...
2421,rare,2,0.000200,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2422,overly,2,0.000200,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2423,film-maker,2,0.000200,2.0,0.522917,0.007968,1.0,0.48125,0.004329
2424,worthy,2,0.000200,2.0,0.522917,0.007968,1.0,0.48125,0.004329


---------------------------Fold 2---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"Technically, the film is well made with impres...",1,4.087259e-56,0.000000e+00,True
1,This movie is so awesome!,1,2.890468e-06,0.000000e+00,True
3,I do not know if this was Emilio Estevez's dir...,1,2.832706e-56,0.000000e+00,True
5,"In fact, this stinker smells like a direct-to-...",0,0.000000e+00,1.284596e-15,False
6,Lame would be the best way to describe it.,0,0.000000e+00,1.287882e-12,False
...,...,...,...,...,...
591,20th Century Fox's ROAD HOUSE 1948) is not onl...,0,0.000000e+00,0.000000e+00,False
592,"Just consider the excellent story, solid actin...",1,2.078212e-18,0.000000e+00,True
593,"Not frightening in the least, and barely compr...",0,0.000000e+00,6.004679e-12,False
594,The only place good for this film is in the ga...,0,0.000000e+00,4.070636e-12,False


Accuracy:  80.0


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,157,0.014646,78.0,0.514583,0.315789,80.0,0.489583,0.340426
1,a,177,0.016511,107.0,0.514583,0.433198,71.0,0.489583,0.302128
2,from,23,0.002146,13.0,0.514583,0.052632,11.0,0.489583,0.046809
3,coming,3,0.000280,3.0,0.514583,0.012146,1.0,0.489583,0.004255
4,vocal,2,0.000187,2.0,0.514583,0.008097,1.0,0.489583,0.004255
...,...,...,...,...,...,...,...,...,...
2578,added,2,0.000187,2.0,0.514583,0.008097,1.0,0.489583,0.004255
2579,bonuses,2,0.000187,2.0,0.514583,0.008097,1.0,0.489583,0.004255
2580,comprehensible,2,0.000187,1.0,0.514583,0.004049,2.0,0.489583,0.008511
2581,convention,2,0.000187,1.0,0.514583,0.004049,2.0,0.489583,0.008511


---------------------------Fold 3---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"Technically, the film is well made with impres...",1,4.087259e-56,0.000000e+00,True
1,This movie is so awesome!,1,2.890468e-06,0.000000e+00,True
2,The scenes are often funny and occasionally to...,1,1.323531e-27,0.000000e+00,True
4,There was a few pathetic attempts to give the ...,0,0.000000e+00,1.372547e-30,False
6,Lame would be the best way to describe it.,0,0.000000e+00,1.287882e-12,False
...,...,...,...,...,...
592,"Just consider the excellent story, solid actin...",1,2.078212e-18,0.000000e+00,True
594,The only place good for this film is in the ga...,0,0.000000e+00,4.070636e-12,False
595,"This convention never worked well in the past,...",0,0.000000e+00,9.498045e-21,False
596,Hayao Miyazaki's latest and eighth film for St...,1,5.209350e-50,0.000000e+00,True


Accuracy:  78.33333333333333


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,148,0.015286,70.0,0.51875,0.281124,79.0,0.485417,0.339056
1,a,168,0.017352,101.0,0.51875,0.405622,68.0,0.485417,0.291845
2,from,25,0.002582,15.0,0.51875,0.060241,11.0,0.485417,0.047210
3,coming,3,0.000310,3.0,0.51875,0.012048,1.0,0.485417,0.004292
4,vocal,2,0.000207,2.0,0.51875,0.008032,1.0,0.485417,0.004292
...,...,...,...,...,...,...,...,...,...
2311,overly,2,0.000207,2.0,0.51875,0.008032,1.0,0.485417,0.004292
2312,film-maker,2,0.000207,2.0,0.51875,0.008032,1.0,0.485417,0.004292
2313,worthy,2,0.000207,2.0,0.51875,0.008032,1.0,0.485417,0.004292
2314,trap,2,0.000207,2.0,0.51875,0.008032,1.0,0.485417,0.004292


---------------------------Fold 4---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
1,This movie is so awesome!,1,2.890468e-06,0.000000e+00,True
2,The scenes are often funny and occasionally to...,1,1.323531e-27,0.000000e+00,True
3,I do not know if this was Emilio Estevez's dir...,1,2.832706e-56,0.000000e+00,True
4,There was a few pathetic attempts to give the ...,0,0.000000e+00,1.372547e-30,False
5,"In fact, this stinker smells like a direct-to-...",0,0.000000e+00,1.284596e-15,False
...,...,...,...,...,...
592,"Just consider the excellent story, solid actin...",1,2.078212e-18,0.000000e+00,True
593,"Not frightening in the least, and barely compr...",0,0.000000e+00,6.004679e-12,False
595,"This convention never worked well in the past,...",0,0.000000e+00,9.498045e-21,False
596,Hayao Miyazaki's latest and eighth film for St...,1,5.209350e-50,0.000000e+00,True


Accuracy:  69.74789915966386


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,143,0.014447,71.0,0.513514,0.287449,73.0,0.490644,0.309322
1,this,129,0.013033,69.0,0.513514,0.279352,61.0,0.490644,0.258475
2,movie,78,0.007880,41.0,0.513514,0.165992,38.0,0.490644,0.161017
3,so,34,0.003435,15.0,0.513514,0.060729,20.0,0.490644,0.084746
4,awesome,4,0.000404,3.0,0.513514,0.012146,2.0,0.490644,0.008475
...,...,...,...,...,...,...,...,...,...
2369,overly,2,0.000202,2.0,0.513514,0.008097,1.0,0.490644,0.004237
2370,film-maker,2,0.000202,2.0,0.513514,0.008097,1.0,0.490644,0.004237
2371,fall,2,0.000202,2.0,0.513514,0.008097,1.0,0.490644,0.004237
2372,worthy,2,0.000202,2.0,0.513514,0.008097,1.0,0.490644,0.004237


---------------------------Fold 5---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"Technically, the film is well made with impres...",1,4.087259e-56,0.000000e+00,True
2,The scenes are often funny and occasionally to...,1,1.323531e-27,0.000000e+00,True
3,I do not know if this was Emilio Estevez's dir...,1,2.832706e-56,0.000000e+00,True
4,There was a few pathetic attempts to give the ...,0,0.000000e+00,1.372547e-30,False
5,"In fact, this stinker smells like a direct-to-...",0,0.000000e+00,1.284596e-15,False
...,...,...,...,...,...
592,"Just consider the excellent story, solid actin...",1,2.078212e-18,0.000000e+00,True
593,"Not frightening in the least, and barely compr...",0,0.000000e+00,6.004679e-12,False
594,The only place good for this film is in the ga...,0,0.000000e+00,4.070636e-12,False
596,Hayao Miyazaki's latest and eighth film for St...,1,5.209350e-50,0.000000e+00,True


Accuracy:  75.63025210084034


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,160,0.015465,77.0,0.490644,0.326271,84.0,0.513514,0.340081
1,a,166,0.016045,94.0,0.490644,0.398305,73.0,0.513514,0.295547
2,from,28,0.002706,16.0,0.490644,0.067797,13.0,0.513514,0.052632
3,coming,3,0.000290,3.0,0.490644,0.012712,1.0,0.513514,0.004049
4,vocal,2,0.000193,2.0,0.490644,0.008475,1.0,0.513514,0.004049
...,...,...,...,...,...,...,...,...,...
2515,overly,2,0.000193,2.0,0.490644,0.008475,1.0,0.513514,0.004049
2516,film-maker,2,0.000193,2.0,0.490644,0.008475,1.0,0.513514,0.004049
2517,worthy,2,0.000193,2.0,0.490644,0.008475,1.0,0.513514,0.004049
2518,trap,2,0.000193,2.0,0.490644,0.008475,1.0,0.513514,0.004049


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,is,157,0.014646,78.0,0.514583,0.315789,80.0,0.489583,0.340426
1,a,177,0.016511,107.0,0.514583,0.433198,71.0,0.489583,0.302128
2,from,23,0.002146,13.0,0.514583,0.052632,11.0,0.489583,0.046809
3,coming,3,0.000280,3.0,0.514583,0.012146,1.0,0.489583,0.004255
4,vocal,2,0.000187,2.0,0.514583,0.008097,1.0,0.489583,0.004255
...,...,...,...,...,...,...,...,...,...
2578,added,2,0.000187,2.0,0.514583,0.008097,1.0,0.489583,0.004255
2579,bonuses,2,0.000187,2.0,0.514583,0.008097,1.0,0.489583,0.004255
2580,comprehensible,2,0.000187,1.0,0.514583,0.004049,2.0,0.489583,0.008511
2581,convention,2,0.000187,1.0,0.514583,0.004049,2.0,0.489583,0.008511


In [None]:
predict_calculate_accuracy(train, vocabulary, True, False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Accuracy:  93.47826086956522




93.47826086956522

In [None]:
predict_calculate_accuracy(dev, vocabulary, True, False)

Accuracy:  70.66666666666667


70.66666666666667

In [None]:
predict_calculate_accuracy(test, vocabulary, True, False)

Accuracy:  61.333333333333336


61.333333333333336

## Most Useful words after Smoothing

In [None]:
print("Most Useful Positive sentiment words:")
vocabulary.sort_values(P_WORD_GIVEN_SENTIMENT_POSITIVE_COL, ascending=False)[:MOST_USEFUL_LIMIT]

Most Useful Positive sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
19,the,245,0.022854,129.0,0.514583,0.522267,117.0,0.489583,0.497872
29,and,192,0.01791,120.0,0.514583,0.48583,73.0,0.489583,0.310638
1,a,177,0.016511,107.0,0.514583,0.433198,71.0,0.489583,0.302128
39,of,159,0.014832,89.0,0.514583,0.360324,71.0,0.489583,0.302128
31,it,143,0.01334,82.0,0.514583,0.331984,62.0,0.489583,0.26383
0,is,157,0.014646,78.0,0.514583,0.315789,80.0,0.489583,0.340426
32,this,146,0.013619,72.0,0.514583,0.291498,75.0,0.489583,0.319149
72,to,120,0.011194,61.0,0.514583,0.246964,60.0,0.489583,0.255319
44,i,116,0.010821,61.0,0.514583,0.246964,56.0,0.489583,0.238298
68,in,102,0.009515,55.0,0.514583,0.222672,48.0,0.489583,0.204255


In [None]:
print("Most Useful Negative sentiment words:")
vocabulary.sort_values(P_WORD_GIVEN_SENTIMENT_NEGATIVE_COL, ascending=False)[:MOST_USEFUL_LIMIT]

Most Useful Negative sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
19,the,245,0.022854,129.0,0.514583,0.522267,117.0,0.489583,0.497872
0,is,157,0.014646,78.0,0.514583,0.315789,80.0,0.489583,0.340426
32,this,146,0.013619,72.0,0.514583,0.291498,75.0,0.489583,0.319149
29,and,192,0.01791,120.0,0.514583,0.48583,73.0,0.489583,0.310638
39,of,159,0.014832,89.0,0.514583,0.360324,71.0,0.489583,0.302128
1,a,177,0.016511,107.0,0.514583,0.433198,71.0,0.489583,0.302128
31,it,143,0.01334,82.0,0.514583,0.331984,62.0,0.489583,0.26383
72,to,120,0.011194,61.0,0.514583,0.246964,60.0,0.489583,0.255319
44,i,116,0.010821,61.0,0.514583,0.246964,56.0,0.489583,0.238298
68,in,102,0.009515,55.0,0.514583,0.222672,48.0,0.489583,0.204255


## Inference

From the results above we can see the effect of applying smoothing at prediction time: accuracy increases by 15%.
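To make the effect of smoothing concrete, here is a minimal sketch of additive (add-one) smoothing. The notebook's own smoothing code is not shown in this section, so the formula, the helper `word_likelihood` and all counts below are assumptions chosen for illustration only. The point is that a single unseen word drives the unsmoothed product to zero, while the smoothed product stays positive.

```python
from math import prod

def word_likelihood(count, class_total, vocab_size, alpha=0.0):
    """P(word | class) with optional additive smoothing (alpha=0 means none)."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical counts for the positive class (not taken from the dataset).
positive_counts = {"awesome": 3, "stinker": 0}
CLASS_TOTAL = 100   # assumed number of word occurrences in the positive class
VOCAB_SIZE = 50     # assumed vocabulary size

sentence = ["awesome", "stinker"]

# Without smoothing, the unseen word "stinker" contributes a factor of 0.
unsmoothed = prod(
    word_likelihood(positive_counts[w], CLASS_TOTAL, VOCAB_SIZE) for w in sentence
)

# With add-one smoothing every word keeps a small non-zero probability.
smoothed = prod(
    word_likelihood(positive_counts[w], CLASS_TOTAL, VOCAB_SIZE, alpha=1.0)
    for w in sentence
)

print(unsmoothed, smoothed)
```

When both class scores collapse to zero the classifier has nothing to compare, which is one plausible reason smoothing helps on unseen reviews.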

The most useful words also show two things.
1. The words ranked as most useful are simply the most common words.
2. These common words, such as "the", "is" and "a", have a high probability under both the positive and the negative sentiment.

Such words do not help to separate the two classes, so they need to be removed.
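One way to see why the common words are not actually informative is to compare the two conditional probabilities directly. The sketch below ranks a few words by the absolute log-ratio of P(Word | Positive) to P(Word | Negative). The probability values are copied from the vocabulary tables in this notebook. The log-ratio ranking is an assumption added for illustration and is not part of the notebook's method.

```python
from math import log

# (P(Word | Sentiment = Positive), P(Word | Sentiment = Negative))
# values copied from the vocabulary tables in this notebook.
likelihoods = {
    "the": (0.522267, 0.497872),
    "and": (0.485830, 0.310638),
    "is": (0.315789, 0.340426),
    "coming": (0.012146, 0.004255),
}

def log_ratio(word):
    """Positive values favour the positive class, negative values the negative class."""
    p_pos, p_neg = likelihoods[word]
    return log(p_pos / p_neg)

# Rank by how strongly a word favours either class, not by how frequent it is.
ranked = sorted(likelihoods, key=lambda w: abs(log_ratio(w)), reverse=True)

for word in ranked:
    print(f"{word:>8} {log_ratio(word):+.3f}")
```

Under this ranking "the" falls to the bottom even though it has the highest probability in both classes. Ratios for rare words such as "coming" rest on very few observations, so they should be read with care.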

As future work, we can remove stop words using the stop word list from Python's NLTK library.
We can also remove the more frequent dataset-specific words such as "movie" and "film". They appear in both positive and negative reviews,
which is expected because this is a movie review dataset.


## Removing Stop Words and positive and negative words [Bonus Experiment]

As described in the inference, we use the stop word list from the NLTK library and add to it the words that are most common in both positive and negative reviews (`MOST_COMMON_WORDS_IN_DATA_SET`).

In [None]:
# Add the dataset-specific common words to the stop word list.
stop_words.extend(MOST_COMMON_WORDS_IN_DATA_SET)
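The effect of the extended stop word list on a single review can be sketched as follows. The short `stop_words` list is a stand-in so the example runs without downloads; in the notebook the list would come from NLTK (`nltk.corpus.stopwords.words("english")`). The `tokenize` helper is hypothetical and may differ from the notebook's own preprocessing.

```python
import re

# Stand-in stop word list (assumption); NLTK's English list is much longer.
stop_words = ["the", "is", "a", "this", "so", "and", "of", "it", "to", "in"]
MOST_COMMON_WORDS_IN_DATA_SET = ["movie", "film", "one"]
stop_words.extend(MOST_COMMON_WORDS_IN_DATA_SET)

def tokenize(review, stop_words):
    """Lower-case the review, split it into words and drop the stop words."""
    blocked = set(stop_words)  # set membership is faster than list lookup
    words = re.findall(r"[a-z'-]+", review.lower())
    return [w for w in words if w not in blocked]

tokens = tokenize("This movie is so awesome!", stop_words)
print(tokens)
```

Only the sentiment-bearing word survives the filter, so the per-class probabilities are no longer dominated by words shared by both classes.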

In [None]:
vocabulary = five_fold_cross_validation(train, True, True)
vocabulary

---------------------------Fold 1---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"Technically, the film is well made with impres...",1,3.511948e-51,2.918517e-56,True
1,This movie is so awesome!,1,6.506814e-06,7.033247e-06,False
2,The scenes are often funny and occasionally to...,1,1.464325e-25,1.136684e-27,True
4,There was a few pathetic attempts to give the ...,0,8.804615e-32,1.138327e-28,False
5,"In fact, this stinker smells like a direct-to-...",0,3.058464e-15,5.496015e-14,False
...,...,...,...,...,...
592,"Just consider the excellent story, solid actin...",1,3.287308e-19,1.876172e-20,True
593,"Not frightening in the least, and barely compr...",0,4.933309e-12,4.350350e-11,False
594,The only place good for this film is in the ga...,0,8.028592e-12,3.797749e-11,False
595,"This convention never worked well in the past,...",0,5.135544e-20,1.664900e-18,False


Accuracy:  80.0


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,recurring,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
1,well,17,0.002586,10.0,0.502083,0.041494,8.0,0.502083,0.033195
2,like,27,0.004107,15.0,0.502083,0.062241,13.0,0.502083,0.053942
3,female,3,0.000456,2.0,0.502083,0.008299,2.0,0.502083,0.008299
4,technically,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
...,...,...,...,...,...,...,...,...,...
2268,imaginative,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
2269,latest,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
2270,eighth,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
2271,gake,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149


---------------------------Fold 2---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
1,This movie is so awesome!,1,6.506814e-06,7.033247e-06,False
2,The scenes are often funny and occasionally to...,1,1.464325e-25,1.136684e-27,True
3,I do not know if this was Emilio Estevez's dir...,1,1.763068e-52,2.335998e-57,True
6,Lame would be the best way to describe it.,0,5.589063e-13,1.801551e-11,False
7,the movie is littered with overt racial slurs ...,0,3.615243e-37,3.502888e-34,False
...,...,...,...,...,...
593,"Not frightening in the least, and barely compr...",0,4.933309e-12,4.350350e-11,False
594,The only place good for this film is in the ga...,0,8.028592e-12,3.797749e-11,False
595,"This convention never worked well in the past,...",0,5.135544e-20,1.664900e-18,False
596,Hayao Miyazaki's latest and eighth film for St...,1,1.781313e-35,1.370390e-32,False


Accuracy:  63.333333333333336


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,awesome,4,0.000569,3.0,0.50625,0.012346,2.0,0.497917,0.008368
1,touching,4,0.000569,4.0,0.50625,0.016461,1.0,0.497917,0.004184
2,characters,21,0.002988,12.0,0.50625,0.049383,10.0,0.497917,0.041841
3,funny,13,0.001850,10.0,0.50625,0.041152,4.0,0.497917,0.016736
4,lives,2,0.000285,2.0,0.50625,0.008230,1.0,0.497917,0.004184
...,...,...,...,...,...,...,...,...,...
2405,trap,2,0.000285,2.0,0.50625,0.008230,1.0,0.497917,0.004184
2406,rare,2,0.000285,2.0,0.50625,0.008230,1.0,0.497917,0.004184
2407,worthy,2,0.000285,2.0,0.50625,0.008230,1.0,0.497917,0.004184
2408,indulgent,2,0.000285,2.0,0.50625,0.008230,1.0,0.497917,0.004184


---------------------------Fold 3---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"Technically, the film is well made with impres...",1,3.511948e-51,2.918517e-56,True
1,This movie is so awesome!,1,6.506814e-06,7.033247e-06,False
2,The scenes are often funny and occasionally to...,1,1.464325e-25,1.136684e-27,True
3,I do not know if this was Emilio Estevez's dir...,1,1.763068e-52,2.335998e-57,True
4,There was a few pathetic attempts to give the ...,0,8.804615e-32,1.138327e-28,False
...,...,...,...,...,...
591,20th Century Fox's ROAD HOUSE 1948) is not onl...,0,1.693351e-36,1.756697e-32,False
592,"Just consider the excellent story, solid actin...",1,3.287308e-19,1.876172e-20,True
594,The only place good for this film is in the ga...,0,8.028592e-12,3.797749e-11,False
595,"This convention never worked well in the past,...",0,5.135544e-20,1.664900e-18,False


Accuracy:  72.5


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,recurring,2,0.000287,2.0,0.516667,0.008065,1.0,0.4875,0.004274
1,well,14,0.002006,9.0,0.516667,0.036290,6.0,0.4875,0.025641
2,like,22,0.003152,11.0,0.516667,0.044355,12.0,0.4875,0.051282
3,female,3,0.000430,2.0,0.516667,0.008065,2.0,0.4875,0.008547
4,technically,2,0.000287,2.0,0.516667,0.008065,1.0,0.4875,0.004274
...,...,...,...,...,...,...,...,...,...
2395,trap,2,0.000287,2.0,0.516667,0.008065,1.0,0.4875,0.004274
2396,rare,2,0.000287,2.0,0.516667,0.008065,1.0,0.4875,0.004274
2397,worthy,2,0.000287,2.0,0.516667,0.008065,1.0,0.4875,0.004274
2398,indulgent,2,0.000287,2.0,0.516667,0.008065,1.0,0.4875,0.004274


---------------------------Fold 4---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"Technically, the film is well made with impres...",1,3.511948e-51,2.918517e-56,True
1,This movie is so awesome!,1,6.506814e-06,7.033247e-06,False
3,I do not know if this was Emilio Estevez's dir...,1,1.763068e-52,2.335998e-57,True
4,There was a few pathetic attempts to give the ...,0,8.804615e-32,1.138327e-28,False
5,"In fact, this stinker smells like a direct-to-...",0,3.058464e-15,5.496015e-14,False
...,...,...,...,...,...
591,20th Century Fox's ROAD HOUSE 1948) is not onl...,0,1.693351e-36,1.756697e-32,False
592,"Just consider the excellent story, solid actin...",1,3.287308e-19,1.876172e-20,True
593,"Not frightening in the least, and barely compr...",0,4.933309e-12,4.350350e-11,False
596,Hayao Miyazaki's latest and eighth film for St...,1,1.781313e-35,1.370390e-32,False


Accuracy:  78.99159663865547


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,recurring,2,0.000336,2.0,0.534304,0.007782,1.0,0.469854,0.004425
1,well,20,0.003360,14.0,0.534304,0.054475,7.0,0.469854,0.030973
2,like,20,0.003360,13.0,0.534304,0.050584,8.0,0.469854,0.035398
3,female,3,0.000504,2.0,0.534304,0.007782,2.0,0.469854,0.008850
4,technically,2,0.000336,2.0,0.534304,0.007782,1.0,0.469854,0.004425
...,...,...,...,...,...,...,...,...,...
2086,trap,2,0.000336,2.0,0.534304,0.007782,1.0,0.469854,0.004425
2087,rare,2,0.000336,2.0,0.534304,0.007782,1.0,0.469854,0.004425
2088,worthy,2,0.000336,2.0,0.534304,0.007782,1.0,0.469854,0.004425
2089,indulgent,2,0.000336,2.0,0.534304,0.007782,1.0,0.469854,0.004425


---------------------------Fold 5---------------------------------


Unnamed: 0,IMDB Review,Sentiment,P(Sentiment = Positive | Sentence),P(Sentiment = Negative | Sentence),Predicted sentiment
0,"Technically, the film is well made with impres...",1,3.511948e-51,2.918517e-56,True
2,The scenes are often funny and occasionally to...,1,1.464325e-25,1.136684e-27,True
3,I do not know if this was Emilio Estevez's dir...,1,1.763068e-52,2.335998e-57,True
4,There was a few pathetic attempts to give the ...,0,8.804615e-32,1.138327e-28,False
5,"In fact, this stinker smells like a direct-to-...",0,3.058464e-15,5.496015e-14,False
...,...,...,...,...,...
593,"Not frightening in the least, and barely compr...",0,4.933309e-12,4.350350e-11,False
594,The only place good for this film is in the ga...,0,8.028592e-12,3.797749e-11,False
595,"This convention never worked well in the past,...",0,5.135544e-20,1.664900e-18,False
596,Hayao Miyazaki's latest and eighth film for St...,1,1.781313e-35,1.370390e-32,False


Accuracy:  68.90756302521008


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,recurring,2,0.000288,2.0,0.50104,0.008299,1.0,0.503119,0.004132
1,well,20,0.002884,13.0,0.50104,0.053942,8.0,0.503119,0.033058
2,like,20,0.002884,10.0,0.50104,0.041494,11.0,0.503119,0.045455
3,female,2,0.000288,2.0,0.50104,0.008299,1.0,0.503119,0.004132
4,technically,2,0.000288,2.0,0.50104,0.008299,1.0,0.503119,0.004132
...,...,...,...,...,...,...,...,...,...
2367,trap,2,0.000288,2.0,0.50104,0.008299,1.0,0.503119,0.004132
2368,rare,2,0.000288,2.0,0.50104,0.008299,1.0,0.503119,0.004132
2369,worthy,2,0.000288,2.0,0.50104,0.008299,1.0,0.503119,0.004132
2370,indulgent,2,0.000288,2.0,0.50104,0.008299,1.0,0.503119,0.004132


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
0,recurring,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
1,well,17,0.002586,10.0,0.502083,0.041494,8.0,0.502083,0.033195
2,like,27,0.004107,15.0,0.502083,0.062241,13.0,0.502083,0.053942
3,female,3,0.000456,2.0,0.502083,0.008299,2.0,0.502083,0.008299
4,technically,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
...,...,...,...,...,...,...,...,...,...
2268,imaginative,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
2269,latest,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
2270,eighth,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149
2271,gake,2,0.000304,2.0,0.502083,0.008299,1.0,0.502083,0.004149


In [None]:
predict_calculate_accuracy(train, vocabulary, True, True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Accuracy:  94.14715719063545




94.14715719063545

In [None]:
predict_calculate_accuracy(dev, vocabulary, True, True)

Accuracy:  69.33333333333333


69.33333333333333

In [None]:
predict_calculate_accuracy(test, vocabulary, True, True)

Accuracy:  73.33333333333333


73.33333333333333

In [None]:
print("Most Useful Positive sentiment words:")
vocabulary.sort_values(P_WORD_GIVEN_SENTIMENT_POSITIVE_COL, ascending=False)[:MOST_USEFUL_LIMIT]

Most Useful Positive sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
14,good,31,0.004716,21.0,0.502083,0.087137,11.0,0.502083,0.045643
2,like,27,0.004107,15.0,0.502083,0.062241,13.0,0.502083,0.053942
301,see,25,0.003803,15.0,0.502083,0.062241,11.0,0.502083,0.045643
143,great,17,0.002586,13.0,0.502083,0.053942,5.0,0.502083,0.020747
99,best,14,0.00213,13.0,0.502083,0.053942,2.0,0.502083,0.008299
253,wonderful,14,0.00213,13.0,0.502083,0.053942,2.0,0.502083,0.008299
33,really,22,0.003347,12.0,0.502083,0.049793,11.0,0.502083,0.045643
25,characters,21,0.003194,12.0,0.502083,0.049793,10.0,0.502083,0.041494
26,funny,13,0.001977,11.0,0.502083,0.045643,3.0,0.502083,0.012448
174,time,24,0.003651,11.0,0.502083,0.045643,14.0,0.502083,0.058091


In [None]:
print("Most Useful Negative sentiment words:")
vocabulary.sort_values(P_WORD_GIVEN_SENTIMENT_NEGATIVE_COL, ascending=False)[:MOST_USEFUL_LIMIT]

Most Useful Negative sentiment words:


Unnamed: 0,Word,Word Frequency,P(Word),Positive Sentiment Word Frequency,P(Sentiment = Positive),P(Word | Sentiment = Positive),Negative Sentiment Word Frequency,P(Sentiment = Negative),P(Word | Sentiment = Negative)
110,bad,33,0.00502,4.0,0.502083,0.016598,30.0,0.502083,0.124481
392,even,21,0.003194,5.0,0.502083,0.020747,17.0,0.502083,0.070539
38,plot,17,0.002586,4.0,0.502083,0.016598,14.0,0.502083,0.058091
174,time,24,0.003651,11.0,0.502083,0.045643,14.0,0.502083,0.058091
17,acting,24,0.003651,11.0,0.502083,0.045643,14.0,0.502083,0.058091
2,like,27,0.004107,15.0,0.502083,0.062241,13.0,0.502083,0.053942
620,would,14,0.00213,2.0,0.502083,0.008299,13.0,0.502083,0.053942
220,little,16,0.002434,5.0,0.502083,0.020747,12.0,0.502083,0.049793
325,could,17,0.002586,6.0,0.502083,0.024896,12.0,0.502083,0.049793
33,really,22,0.003347,12.0,0.502083,0.049793,11.0,0.502083,0.045643
