# Naïve Bayes with smoothing

### Authors: Anjali Singh, Bhagyashri Nivdunge, Hrishika Shetty

# Introduction

This problem set has two aims:
- Learn to understand Naive Bayes
- Learn to handle text, a form of data that does not come as a table of numbers. 

You will implement your own Naive Bayes classifier and use it to categorize Rotten Tomatoes reviews as rotten or fresh. Finally, you will find the optimal smoothing parameter.


# Rotten Tomatoes


Our first task is to load, clean and explore the Rotten Tomatoes movie reviews data. Please familiarize yourself a little bit with the webpage. Briefly, approved critics can write reviews for movies, and evaluate the movie as "fresh" or "rotten". The webpage normally shows a short "quote" from each critic, and whether it evaluates the movie as fresh or rotten. You will work on these quotes below.
The central variables for our purpose in rotten-tomatoes.csv are the following:

- fresh: evaluation, 'fresh' or 'rotten'
- quote: short version of the review

There are more variables, such as links to IMDB.

# 1 Explore and clean the data

First, let's load data and take a closer look at it.

In [131]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
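from sklearn.feature_extraction import text as stop_words  # provides ENGLISH_STOP_WORDS; the separate stop_words module is gone from newer scikit-learn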
import os
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)

In [132]:
#path = os.getcwd()
#print(path)
data= pd.read_csv('rotten-tomatoes.csv.bz2')

In [133]:
data.sample(4)

Unnamed: 0,critic,fresh,imdb,link,publication,quote,review_date,rtid,title
3260,Rick Groen,fresh,117628,http://www.globeandmail.com/servlet/ArticleNews/movie/MOVIEREVIEWS/19960823/TAONEE,Globe and Mail,Edward Burns returns to the themes of family and love that he examined in The Brothers McMullen.,2002-04-12 00:00:00,13832,She's the One
1264,Kevin Thomas,rotten,110516,http://articles.latimes.com/1994-08-31/entertainment/ca-33047_1_milk-money,Los Angeles Times,Only the least demanding audiences can be expected to buy into Milk Money.,2013-05-22 00:00:00,11257,Milk Money
1524,Peter Rainer,fresh,114924,"http://www.calendarlive.com/movies/reviews/cl-movie960406-234,0,636778.story",Los Angeles Times,"Bullock is a genuinely engaging performer, which at least gives the treacle some minty freshness.",2001-02-13 00:00:00,10076,While You Were Sleeping
3996,Ty Burr,fresh,58946,http://www.boston.com/ae/movies/articles/2004/02/27/65_classic_battle_of_algiers_still_electrifies_and_challenges,Boston Globe,"The chafing, mutually uncomprehending collision of Western occupiers and Muslim occupied has never been captured with such dispassionate, thrilling clarity.",2004-02-27 00:00:00,133039722,La Battaglia di Algeri (The Battle of Algiers)


2. Print out all variable names.

In [134]:
data.columns

Index(['critic', 'fresh', 'imdb', 'link', 'publication', 'quote',
       'review_date', 'rtid', 'title'],
      dtype='object')

3. Create a summary table (or a bullet list) that prints the most important summary statistics for the most interesting variables. It should include at least:
    - number of missing values for fresh and quote
    - all different values for fresh/rotten evaluations
    - counts or percentages of these values
    - number of zero-length or whitespace-only quotes
    - minimum, maximum and average length of quotes (either in words or in characters). (Can you do this as a one-liner?)
    - how many reviews appear in the data multiple times.

Feel free to add more figures you consider relevant.
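One possible shape for such a summary is sketched below on a tiny made-up data frame. The frame `toy` and its values are illustrative assumptions, not the real data. The length statistics come from a single `agg` call, which is one way to get the one-liner.

```python
import pandas as pd

# Hypothetical toy data with the same two central columns as the real file
toy = pd.DataFrame({
    "fresh": ["fresh", "rotten", "fresh", "none", "fresh"],
    "quote": ["A delight.", "Dull and long.", "A delight.", "   ", None],
})

n_missing = toy[["fresh", "quote"]].isna().sum()       # missing values per column
label_counts = toy["fresh"].value_counts()             # distinct labels and their counts
n_blank = toy["quote"].str.strip().eq("").sum()        # empty or whitespace-only quotes
length_stats = toy["quote"].str.len().agg(["min", "max", "mean"])  # lengths in characters
n_duplicates = toy.duplicated().sum()                  # rows that repeat an earlier row

print(n_missing, label_counts, n_blank, length_stats, n_duplicates, sep="\n\n")
```

Printing counts rather than the filtered rows keeps the summary short even when many rows match.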

In [135]:
# rows with a missing evaluation or a missing quote
print(" \033[1m- The number of missings for fresh: \033[0m", data[data.fresh.isna()], sep="\n", end="\n\n")
print(" \033[1m- The number of missings for quote: \033[0m", data[data.quote.isna()], sep="\n", end="\n\n")
# distinct evaluation values, as counts and as percentages
print(" \033[1m- All different values for fresh/rotten evaluations: \033[0m", data.fresh.value_counts(), sep="\n", end="\n\n")
print(" \033[1m- Percentage values for fresh/rotten evaluations: \033[0m", data.fresh.value_counts(normalize=True)*100, sep="\n", end="\n\n")
# rows whose quote is empty or whitespace only
print(" \033[1m- Number of zero length or only white-space quotes: \033[0m", 
      data[(data.quote == "") | (data.quote.str.isspace())], sep="\n", end="\n\n")
# quote length in characters
data['quote_length'] = data['quote'].str.len()
print(" \033[1m- Minimum-maximum-average length of quotes (either in words, or in characters):\n\033[0m", 
      "Mean = ",data.quote_length.mean(), "\n", 
      "Minimum = ",data.quote_length.min(), "\n", 
      "Maximum = ",data.quote_length.max(), "\n", end="\n\n")
# rows that repeat an earlier row in full
print("\033[1m - Reviews in data multiple times:\n\033[0m", data[data.duplicated()][['quote']].count(), end="\n")

 - The number of missings for fresh: 
Empty DataFrame
Columns: [critic, fresh, imdb, link, publication, quote, review_date, rtid, title]
Index: []

 - The number of missings for quote: 
Empty DataFrame
Columns: [critic, fresh, imdb, link, publication, quote, review_date, rtid, title]
Index: []

 - All different values for fresh/rotten evaluations: 
fresh     8389
rotten    5030
none      23  
Name: fresh, dtype: int64

 - Percentage values for fresh/rotten evaluations: 
fresh     62.408868
rotten    37.420027
none      0.171105 
Name: fresh, dtype: float64

 - Number of zero length or only white-space quotes: 
Empty DataFrame
Columns: [critic, fresh, imdb, link, publication, quote, review_date, rtid, title]
Index: []

 - Minimum-maximum-average length of quotes (either in words, or in characters):
 Mean =  121.23128998660914 
 Minimum =  4 
 Maximum =  256 


 - Reviews in data multiple times:
 quote    596
dtype: int64


Now that you have an overview of the data, clean it by removing all the inconsistencies the table reveals. We have to ensure that the central variables, quote and fresh, are not missing, that quote is not an empty string (or one containing only whitespace), and that all rows are unique.
I recommend doing this as a standalone function so you can reuse it for another similar dataset (such as test data).

In [136]:
def clean_data(dataframe):
    #eliminate duplicate rows (returns a new frame, the input is left untouched)
    data_new = dataframe.drop_duplicates(keep = 'first')
    #drop rows whose evaluation is the placeholder value 'none'
    data_new = data_new[data_new.fresh != 'none']
    #drop rows where column value for fresh or quote is missing or null
    data_new = data_new.dropna(subset = ['fresh','quote'])
    #drop rows whose quote is empty or contains only whitespace
    data_new = data_new[data_new.quote.str.strip() != '']
    return data_new

In [137]:
data_new = clean_data(data)


# 2 Naive Bayes

Now that you are familiar with the data, it's time to get serious and implement the Naive Bayes classifier from scratch. But first things first.

2. Convert your data (quotes) into bag-of-words. Your code should look something like this:

In [138]:
#scikit-learn's English stop words plus frequent, uninformative review words
stop_list = list(stop_words.ENGLISH_STOP_WORDS)
other_words = ['movie','movies','film','films','story','like','way','just','make','makes','director','comedy','picture',
              'characters','good','action','time','really','funny','character','performances',
              'does','doesn','isn','work','don','best','new','little','feel','plot','script','thriller']

stop_list.extend(other_words)
frozen_stop = frozenset(stop_list)

#binary=True records only whether a word is present in a quote, not how often
vectorizer = CountVectorizer(binary=True,stop_words=frozen_stop)
X = vectorizer.fit_transform(data_new.quote.values)
words = vectorizer.get_feature_names()

Y = data_new.fresh

3. Split your work data and target (i.e. the variable fresh) into training and validation chunks (80/20 or so).

In [139]:
#splitting into training and validation data
from sklearn.model_selection import train_test_split
train_x,val_x,train_y,val_y = train_test_split(X, Y, test_size = 0.2)

4. Compute the unconditional (log) probability that the tomato is fresh/rotten, log Pr(F), and log Pr(R). These probabilities are based on the values of fresh alone, not on the words the quotes contain.

In [140]:
#calculating the log prior probabilities that the tomato is fresh / rotten
prF = np.log(len(train_y[train_y=='fresh'])/len(train_y))
prR = np.log(len(train_y[train_y=='rotten'])/len(train_y))

5. For each word w, compute log Pr(w|F) and log Pr(w|R), the (log) probability that the word is present in a fresh/rotten review. These probabilities can easily be calculated from counts of how many times these words are present for each class.
Hint: these computations are based on your BOW-s X. Look at ways to sum along columns in this matrix.
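As a small illustration of the hint, the sketch below computes per-class word frequencies by summing the columns of a sparse bag-of-words matrix directly, without first converting it to a dense array. The matrix `X_toy` and the labels `y_toy` are made up for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical binary bag-of-words: 4 quotes (rows) x 3 words (columns)
X_toy = csr_matrix(np.array([[1, 0, 1],
                             [1, 1, 0],
                             [0, 1, 0],
                             [1, 0, 0]]))
y_toy = np.array(["fresh", "fresh", "rotten", "fresh"])

# Row positions of the fresh quotes
fresh_rows = np.flatnonzero(y_toy == "fresh")

# Summing along axis 0 counts, for each word, the fresh quotes that contain it
word_counts_fresh = np.asarray(X_toy[fresh_rows].sum(axis=0)).ravel()
pr_w_given_fresh = word_counts_fresh / len(fresh_rows)

print(word_counts_fresh)   # counts per word among fresh quotes
print(pr_w_given_fresh)    # share of fresh quotes containing each word
```

Taking the logarithm of `pr_w_given_fresh` then gives log Pr(w|F); words with a zero count produce negative infinity, which is the case smoothing addresses later.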

In [141]:
#fresh count and rotten count
fresh_count = train_x[np.where(train_y == 'fresh')].shape[0]
rotten_count = train_x[np.where(train_y == 'rotten')].shape[0]

#log probability that the word is present in fresh/rotten review
prWF = np.log(train_x[np.where(train_y == 'fresh')].toarray().sum(axis=0) / fresh_count)
prWR = np.log(train_x[np.where(train_y == 'rotten')].toarray().sum(axis=0)/ rotten_count)

  



Now we are done with the estimator. Your fitted model is completely described by these four probability vectors: log Pr(F), log Pr(R), log Pr(w|F), log Pr(w|R). Let's now turn to prediction, and pull out your validation data (not the test data!).

6. For both destination classes, F and R, compute the log-likelihood that the quote belongs to this class. The log-likelihood is what is given inside the brackets in equation (1) on slide 28, and in the equations in Schutt, "Doing Data Science", page 102. In the lecture notes it is explained before the email classification example (and in the example too). On the slides we have the log-likelihood essentially as (although we do not write it out):
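The expression itself is missing from this copy. A reconstruction consistent with the code in this section, which sums only over the words present in the quote, would be the following. Treat it as an assumption about what the slides show, not as a quotation:

$$
\ell(F) = \log \Pr(F) + \sum_{w \in \text{quote}} \log \Pr(w \mid F),
\qquad
\ell(R) = \log \Pr(R) + \sum_{w \in \text{quote}} \log \Pr(w \mid R)
$$

Under this form, a quote is classified as fresh when $\ell(F) > \ell(R)$ and as rotten otherwise.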

In [142]:
import re
arr_val_x = val_x.toarray()

In [143]:
#empty lists for saving the log-likelihoods
log_likehood_F = []
log_likehood_R = []

#log-likelihood of each validation quote
for word_index in np.arange(arr_val_x.shape[0]):
    val_text = arr_val_x[word_index]
    prTextF = 0.0
    prTextR = 0.0
    for index, word in enumerate(val_text,start=0):
        if word == 0:
            continue
        if np.isfinite(prWF[index]):
            prTextF += prWF[index]
        if np.isfinite(prWR[index]):
            prTextR += prWR[index]
    prTextF += prF
    prTextR += prR
    log_likehood_F.append(prTextF)
    log_likehood_R.append(prTextR)

7. Print the resulting confusion matrix and accuracy (feel free to use existing libraries).

In [144]:
#predict the class with the larger log-likelihood for each validation quote
#(the training counts fresh_count and rotten_count must not be changed here)
predicted_target = []
for index in np.arange(len(log_likehood_F)):
    if log_likehood_F[index] > log_likehood_R[index]:
        predicted_target.append('fresh')
    else:
        predicted_target.append('rotten')
predicted_target = np.array(predicted_target)

print(type(predicted_target))
print(type(val_y))
print(confusion_matrix(val_y, predicted_target))
print(accuracy_score(val_y, predicted_target))

<class 'numpy.ndarray'>
<class 'pandas.core.series.Series'>
[[651 904]
 [430 580]]
0.47992202729044836
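As a quick sanity check on these numbers, the sketch below recomputes the accuracy from the confusion matrix printed above and compares it with the share of the larger true class, which is what a classifier that always predicts that class would score. It assumes scikit-learn's convention that rows are true labels and columns are predictions.

```python
import numpy as np

# Confusion matrix printed above; label order is ('fresh', 'rotten')
cm = np.array([[651, 904],
               [430, 580]])

accuracy = np.trace(cm) / cm.sum()                    # correct predictions / all predictions
majority_baseline = cm.sum(axis=1).max() / cm.sum()   # always predict the larger true class

print(accuracy, majority_baseline)
```

The accuracy falls below the majority baseline, which suggests that the unsmoothed model is systematically misled and is not merely noisy.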


# 3 Interpretation

Now it is time to look at your fitted model a little bit closer. NB model probabilities are rather easy to understand and interpret. The task here is to find the words that best predict a fresh review and those that best predict a rotten one. We only want to look at words that are reasonably frequent, say those occurring more than 30 times in the data.


1. Extract from your conditional probability vectors log Pr(w|F) and log Pr(w|R) the probabilities that correspond to frequent words only.


In [145]:
#number of fresh and rotten reviews in the training data
n_fresh = (train_y == 'fresh').sum()
n_rotten = (train_y == 'rotten').sum()

#log probability value when word frequency is 30 times and review is fresh
fresh_baseline = np.log(30/n_fresh)
#log probability value when word frequency is 30 times and review is rotten
rotten_baseline = np.log(30/n_rotten)

#vector of log probabilities of the frequent words when review is fresh
freq_fresh = prWF[prWF > fresh_baseline]

#vector of log probabilities of the frequent words when review is rotten
freq_rotten = prWR[prWR > rotten_baseline]

2. Find 10 best words to predict F and 10 best words to predict R. Hint: imagine we have a review that contains just a single word. Which word will give the highest weight to the probability the review is fresh? Which one to the likelihood it is rotten?
Comment your results.


In [146]:
#get indexes of top 10 values
index_fresh = np.argsort(prWF)[-10:]
top_10_fresh = [words[i] for i in index_fresh]
print(top_10_fresh)

index_rotten = np.argsort(prWR)[-10:]
top_10_rotten = [words[i] for i in index_rotten]
print(top_10_rotten)

['love', 'screen', 'entertainment', 'world', 'american', 'performance', 'life', 'entertaining', 'fun', 'great']
['thing', 'effects', 'old', 'hard', 'real', 'end', 'audience', 'better', 'long', 'bad']
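Note that sorting `prWF` on its own ranks words by how often they occur in fresh reviews, not by how strongly they separate the two classes. Following the single-word hint, one alternative is to rank by log Pr(w|F) − log Pr(w|R) among sufficiently frequent words. The sketch below uses made-up counts. All names and numbers are hypothetical, and the 0.5 added to each count is an assumed smoothing value that avoids log 0.

```python
import numpy as np

words_toy = np.array(["great", "bad", "life", "dull", "the", "zany"])
count_fresh = np.array([60, 5, 40, 2, 300, 3])      # fresh reviews containing the word
count_rotten = np.array([10, 50, 20, 30, 180, 0])   # rotten reviews containing the word
n_fresh, n_rotten = 500, 300                        # number of reviews per class

# Keep only words that occur more than 30 times overall
frequent = (count_fresh + count_rotten) > 30

# Positive values favour fresh, negative values favour rotten
log_ratio = (np.log((count_fresh + 0.5) / n_fresh)
             - np.log((count_rotten + 0.5) / n_rotten))

order = np.argsort(log_ratio[frequent])             # ascending: most rotten first
best_rotten = words_toy[frequent][order[:2]]
best_fresh = words_toy[frequent][order[::-1][:2]]
print(best_fresh, best_rotten)
```

In this toy example "the" is by far the most common word in both classes but has a log ratio near zero, so it does not rank as a predictor of fresh, and the rare word "zany" is excluded by the frequency filter despite its large ratio.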


3. Print out a few misclassified quotes. Can you understand why these are misclassified?

In [147]:
# print the validation rows where the prediction differs from the true label
mismatch = (predicted_target != val_y.values)
data_new.loc[val_y.index[mismatch], ['quote','fresh']]

Unnamed: 0,quote,fresh
11224,"The movie understands that the story really is about the killer's point of view. It is not the story of a crime, not a docudrama, not a sociological essay.",fresh
12701,"This 1940 film is one of Ernst Lubitsch's finest and most enduring works, a romantic comedy of dazzling range.",fresh
746,"A delicacy, a passionate and compassionate study of erotica.",fresh
12985,"Hopefully it's three strikes and you're out for The Punisher, as this latest attempt at creating another comic book franchise will prove once and for all that maybe this character just doesn't deserve the big screen treatment.",rotten
12536,"Its willingness to take risks, and its insights into the frailties and confusions of teenage friendships, lift the film right out of the rut.",fresh
6430,"Such Greenaway films as ""The Cook, The Thief, His Wife & Her Lover,"" ""Prospero's Books"" and now this make one wonder if they're really as deep as they pretend to be. Perhaps, as his actors, this emperor has no clothes.",rotten
898,"Brings the popular TV series to the screen with a barrage of spectacular special effects, a slew of fantastic monsters, a ferociously funny villain -- and, most important, a refreshing lack of pretentiousness.",fresh
8319,Mr. Bolt has reduced the vast upheaval of the Russian Revolution to the banalities of a doomed romance.,rotten
7427,Burning is too good for such a wretched fiasco; only a surgical nuclear strike could suitably destroy what has to be one of the most enervating comedies ever made.,rotten
10868,Angelopoulos' meditation on the meaning of one man's life is genuinely hypnotic in its way of transcending ordinary narrative.,fresh


If a word in the validation data was never observed for a class in the training data, the model assigns it a probability of zero for that class (a log-probability of negative infinity). The code above skips such terms, which removes the penalty for exactly the class in which the word never occurred, so the quote is pushed toward the wrong class. This may be one reason why these quotes are misclassified.

This is often known as the "zero frequency" problem. To solve it, we can use smoothing.
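To make the effect concrete, consider a one-word quote whose word occurs only in fresh training reviews. Skipping the infinite term, as the validation loop above does, leaves the rotten score with no penalty at all. A minimal sketch with made-up counts:

```python
import numpy as np

n_fresh, n_rotten = 600, 400          # hypothetical class sizes
count_fresh, count_rotten = 12, 0     # the word was never seen in a rotten review

log_prior_F = np.log(n_fresh / (n_fresh + n_rotten))
log_prior_R = np.log(n_rotten / (n_fresh + n_rotten))

with np.errstate(divide="ignore"):    # log(0) = -inf is expected here
    log_w_F = np.log(count_fresh / n_fresh)
    log_w_R = np.log(count_rotten / n_rotten)

# Scores when non-finite terms are skipped instead of added
score_F = log_prior_F + (log_w_F if np.isfinite(log_w_F) else 0.0)
score_R = log_prior_R + (log_w_R if np.isfinite(log_w_R) else 0.0)

print("fresh" if score_F > score_R else "rotten")
```

The fresh score is lowered by log Pr(w|F) while the rotten score keeps only its prior, so the sketch prints "rotten" even though the word was seen exclusively in fresh reviews.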

Another reason is figurative or mixed language: a quote that combines positive and negative words can mislead the classifier. For example, "It's easy to admire what the Coens are trying to do in Fargo, but more difficult to actually like the film." uses "easy" and "admire" alongside "difficult".
In other cases, the critic is talking about the actors involved and not the movie, which the classifier cannot pick up. For example, "Boring as it may have become for Meryl Streep to show up for awards ceremonies, count on her to be doing it again."

# 4 NB with Smoothing

So, now you have your brand-new NB algorithm up and running. As a next step, we add smoothing to it. As our task is to find the best smoothing parameter below, your first task is to mold what you did above into two functions: one for fitting and another one for predicting.

1. Create two functions: one for fitting NB model, and another to predict outcome based on the fitted model.
As mentioned above, the model is fully described with 4 probabilities, so your fitting function may return such a list as the model; and the prediction function may take it as an input.

In [148]:
def fitting_nb_model(train_x, train_y):
    """Fit the Naive Bayes model on the training data and return
    the list [log Pr(F), log Pr(R), log Pr(w|F), log Pr(w|R)]"""
    #calculating log probability that the review is fresh
    prF = np.log(len(train_y[train_y=='fresh'])/len(train_y))
    #calculating log probability that the review is rotten
    prR = np.log(len(train_y[train_y=='rotten'])/len(train_y))
    
    # get the number of fresh and rotten reviews on the training set
    fresh_count = train_x[np.where(train_y == 'fresh')].shape[0]
    rotten_count = train_x[np.where(train_y == 'rotten')].shape[0]
    
    # get conditional log probabilties 
    # condition log probability of word when review is fresh
    prWF = np.log(train_x[np.where(train_y == 'fresh')].toarray().sum(axis=0) / fresh_count)
    # condition log probability of word when review is rotten
    prWR = np.log(train_x[np.where(train_y == 'rotten')].toarray().sum(axis=0) / rotten_count)
    
    model_list = [prF, prR, prWF, prWR]
    return model_list

In [149]:
def predict_outcome(model_list,data):
    """ This function predicts 'fresh' or 'rotten' for each row of the bag-of-words
    matrix `data`, based on the log probabilities in `model_list`"""
    prF, prR, prWF, prWR = model_list
    arr_val_x = data.toarray()

    predicted_target = []
    for val_text in arr_val_x:
        # column indexes of the words present in this quote
        present = np.flatnonzero(val_text)
        # log-likelihood: class prior plus the log probabilities of the present words
        prTextF = prF + prWF[present].sum()
        prTextR = prR + prWR[present].sum()
        predicted_target.append('fresh' if prTextF > prTextR else 'rotten')

    return np.array(predicted_target)

2. Add smoothing to the model. See Schutt p 103 and 109. Smoothing amounts to assuming that we have seen every possible word α >= 0 times already, for both classes. (If you wish, you can also assume you have seen the words α times for F and β times for R). Note that α does not have to be an integer, and typically the best α < 1.
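One common way to write this down (additive, or Lidstone, smoothing) adds α to the count and, because each word is either present or absent in a review, 2α to the class size, so that the smoothed estimate stays a valid probability. The helper `smoothed_log_prob` below is a hypothetical name used only for illustration. The cell that follows adds α to the numerator only, which differs very little when the classes are large.

```python
import numpy as np

def smoothed_log_prob(word_counts, n_class, alpha):
    """log Pr(w | class) with additive smoothing for present/absent word features."""
    word_counts = np.asarray(word_counts, dtype=float)
    # alpha pseudo-observations for "present" and alpha for "absent"
    return np.log((word_counts + alpha) / (n_class + 2 * alpha))

counts = [0, 3, 10]   # made-up counts of reviews containing each word, out of 10
smoothed = smoothed_log_prob(counts, n_class=10, alpha=0.5)
print(smoothed)
```

With any α > 0, a word that was never seen in a class gets a small but non-zero probability, and a word seen in every review of a class gets a probability just below one.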

In [150]:
def fitting_nb_model(train_x, train_y,smoothing):
    """Fit the Naive Bayes model on the training data, adding `smoothing` to every
    word count, and return the list [log Pr(F), log Pr(R), log Pr(w|F), log Pr(w|R)]"""
    #calculating log probability that the review is fresh
    prF = np.log(len(train_y[train_y=='fresh'])/len(train_y))
    #calculating log probability that the review is rotten
    prR = np.log(len(train_y[train_y=='rotten'])/len(train_y))
    
    # get the number of fresh and rotten reviews on the training set
    fresh_count = train_x[np.where(train_y == 'fresh')].shape[0]
    rotten_count = train_x[np.where(train_y == 'rotten')].shape[0]
    
    # get conditional log probabilties 
    # condition log probability of word when review is fresh
    prWF = np.log(np.add(train_x[np.where(train_y == 'fresh')].toarray().sum(axis=0),smoothing) / fresh_count)
    # condition log probability of word when review is rotten
    prWR = np.log(np.add(train_x[np.where(train_y == 'rotten')].toarray().sum(axis=0),smoothing) / rotten_count)
    model_list = [prF, prR, prWF, prWR]
    return model_list

In [151]:
train_x,val_x,train_y,val_y = train_test_split(X, Y, test_size = 0.5)
model_list = fitting_nb_model(train_x, train_y,0.1)
predicted_target = predict_outcome(model_list,val_x)
print(confusion_matrix(val_y, predicted_target))
print(accuracy_score(val_y, predicted_target))

[[3183  840]
 [ 948 1441]]
0.7211478477854024


3. Cross-validate the accuracy (on the validation data) on a number of α values and find the α that gives you the best result. You can use your own CV algorithm you created for PS4, or an existing library.
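For reference, the sketch below shows one way to organize such a search with NumPy alone: shuffle the rows, split them into k folds, and average the validation accuracy for each α. It runs on made-up data with labels coded as 1 (fresh) and 0 (rotten). The function names are hypothetical, and the smoothing denominator (class size plus 2α) is an assumption that differs slightly from the fitting function in this notebook.

```python
import numpy as np

def fit_nb(X, y, alpha):
    """Return (log prior, smoothed log Pr(w|class)) for labels 0 and 1."""
    model = {}
    for label in (0, 1):
        rows = X[y == label]
        model[label] = (np.log(len(rows) / len(X)),
                        np.log((rows.sum(axis=0) + alpha) / (len(rows) + 2 * alpha)))
    return model

def predict_nb(model, X):
    """Score each row by prior + sum of log Pr(w|class) over the words present."""
    scores = np.column_stack([model[label][0] + X @ model[label][1] for label in (0, 1)])
    return scores.argmax(axis=1)

def cv_accuracy(X, y, alpha, k=5, seed=0):
    """Mean validation accuracy over k shuffled folds."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    accs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit_nb(X[train], y[train], alpha)
        accs.append(np.mean(predict_nb(model, X[val]) == y[val]))
    return float(np.mean(accs))

# Made-up data: 200 quotes, 30 words; word 0 is more common in class 1
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = (rng.random((200, 30)) < 0.2).astype(int)
X[:, 0] = (rng.random(200) < np.where(y == 1, 0.8, 0.2)).astype(int)

results = {alpha: cv_accuracy(X, y, alpha) for alpha in (0.1, 0.5, 1.0)}
best_alpha = max(results, key=results.get)
print(results, best_alpha)
```

Shuffling before splitting matters when the rows are ordered in some way, and fixing the seed means every α is evaluated on the same folds, which makes the accuracies directly comparable.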

In [152]:
from sklearn.model_selection import KFold
from statistics import mean 
kf = KFold(n_splits=5)

smoothing = np.arange(0.1,1,0.1)

for alpha in smoothing:
    accuracy = list()
    #5-fold cross-validation on the val_x/val_y half from the split above
    for train_index, test_index in kf.split(val_x):
        x_tr, x_val = val_x[train_index], val_x[test_index]
        y_tr, y_val = val_y.iloc[train_index], val_y.iloc[test_index]
        model_list = fitting_nb_model(x_tr, y_tr,alpha)
        predictions = predict_outcome(model_list,x_val)
        accuracy.append(accuracy_score(y_val, predictions))
    print("For smoothing parameter =",alpha)
    print("Accuracy score:",mean(accuracy))
    print()

For smoothing parameter = 0.1
Accuracy score: 0.7138203532817852

For smoothing parameter = 0.2
Accuracy score: 0.7147552963692982

For smoothing parameter = 0.30000000000000004
Accuracy score: 0.7117916641841044

For smoothing parameter = 0.4
Accuracy score: 0.7107001068819059

For smoothing parameter = 0.5
Accuracy score: 0.7080493383414215

For smoothing parameter = 0.6
Accuracy score: 0.7025907006662184

For smoothing parameter = 0.7000000000000001
Accuracy score: 0.6979120941922634

For smoothing parameter = 0.8
Accuracy score: 0.6943245586409582

For smoothing parameter = 0.9
Accuracy score: 0.6904256185835898



The best cross-validated accuracy (about 0.715) is obtained at alpha = 0.2, marginally above the result for alpha = 0.1 (about 0.714). Beyond 0.2 the accuracy decreases steadily as alpha increases, down to about 0.690 at alpha = 0.9.