<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Sentiment Analysis and Naive Bayes

_Authors: Kiefer Katovich (SF)_

---

In the sentiment analysis lesson, we used a predefined dictionary of positive and negative valences for words. This lab inverts that process: you'll use the binary rotten versus fresh label to find which words are most likely to appear in positive or negative reviews.

### Naive Bayes

A common, practical way to do this is with the Naive Bayes algorithm. Naive Bayes classifiers are covered in more depth in another lecture; for this lab, you'll just use the scikit-learn implementation.

Given a feature, $x_i$, and target, $y_i$, Naive Bayes classifiers estimate $P(x_i \;|\; y_i)$: the probability of observing a feature/predictor _given_ the class of the target (for example, given that the target is one).

We'll use this to figure out which words are more likely to appear when the target is one ("fresh") versus zero ("rotten").
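
As a rough illustration of what "the probability of a feature given the target" means here, the sketch below fits `BernoulliNB` on a tiny made-up matrix and reproduces one row of its estimates by hand. The data, the variable names, and the two hypothetical words ("great" and "boring") are assumptions for illustration only; they are not part of the lab data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary word-presence matrix.
# Column 0 stands for the word "great", column 1 for the word "boring".
X_toy = np.array([
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 1],
])
y_toy = np.array([1, 1, 1, 0, 0])  # 1 = fresh, 0 = rotten

toy_model = BernoulliNB(alpha=1.0).fit(X_toy, y_toy)

# The model stores log P(word present | class); exponentiate to get probabilities.
toy_probs = np.exp(toy_model.feature_log_prob_)

# Reproduce the "fresh" row by hand with Laplace smoothing:
# (count of fresh reviews containing the word + alpha) / (number of fresh reviews + 2 * alpha)
fresh_mask = y_toy == 1
manual_fresh = (X_toy[fresh_mask].sum(axis=0) + 1.0) / (fresh_mask.sum() + 2.0)

print(toy_model.classes_)  # row order of feature_log_prob_
print(toy_probs)
print(manual_fresh)
```

With the default smoothing (`alpha=1.0`), a word that never appears in a class still gets a small nonzero probability, which is why the estimates are not simple raw frequencies.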

---

### 1) Load the packages and movie data.

Perform any necessary cleaning.

In [1]:
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes that predictors are binary encoded.
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
rt = pd.read_csv('./datasets/rt_critics.csv')

In [3]:
rt.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,fresh,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story
2,David Ansen,fresh,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story
3,Leonard Klady,fresh,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story
4,Jonathan Rosenbaum,fresh,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story


In [4]:
import string
# Strip punctuation and lowercase each quote (Python 3 form of str.translate).
rt['qt'] = rt['quote'].map(lambda x: x.translate(str.maketrans('', '', string.punctuation)).lower())

In [5]:
rt['quote_len'] = rt['quote'].map(len)

In [6]:
rt = rt[(rt['fresh'] != 'none') & (rt['quote_len']>10)]

---

### 2) Create a predictor matrix of words in the quotes with `CountVectorizer`.

It's up to you to select an n-gram range. **Make sure that `binary=True`**.

In [24]:
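# Pipeline: vectorize the quotes, then fit a Bernoulli Naive Bayes classifier.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score
pclass = Pipeline([
    # binary=True records word presence/absence, which is what BernoulliNB expects.
    ('vect', CountVectorizer(strip_accents='unicode', binary=True, stop_words='english', ngram_range=(1,3), min_df=2)),
    # Alternative steps to experiment with (each needs its own import):
    # ('tfidf', TfidfTransformer()),
    # ('cls', MultinomialNB()),
    # ('logit', LogisticRegression()),
    ('bnb', BernoulliNB())
])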

In [25]:
from sklearn.model_selection import train_test_split
X = rt['qt']
y = rt['fresh']
# stratify=y keeps the fresh/rotten proportions the same in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

In [26]:
# sklearn.cross_validation has been removed; cross_val_score lives in sklearn.model_selection.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pclass, X_train, y_train, cv=5)
scores

array([0.73684211, 0.7676182 , 0.74130241, 0.73806336, 0.75457385])

In [27]:
# Fit the pipeline on the training data, then predict on the test set.
class_fit = pclass.fit(X_train, y_train)
class_pred = class_fit.predict(X_test)

In [36]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# Rows are actual labels and columns are predicted labels.
# confusion_matrix sorts the labels, so the order is ['fresh', 'rotten'].
conmat = confusion_matrix(y_test, class_pred)
confusion = pd.DataFrame(conmat, index=['fresh','rotten'], columns=['fresh','rotten'])
print(confusion)
print(classification_report(y_test, class_pred))
metrics.f1_score(y_test, class_pred, average='macro')

        fresh  rotten
fresh    1488     230
rotten    469     615
             precision    recall  f1-score   support

      fresh       0.76      0.87      0.81      1718
     rotten       0.73      0.57      0.64      1084

avg / total       0.75      0.75      0.74      2802



0.7237159996191322

---

### 3) Split the data into training and testing sets.

You should keep 25 percent of the data in the testing set.

In [78]:
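from sklearn.model_selection import train_test_split
# Predictors: the cleaned quote text. Target: the fresh/rotten label (named 'y').
X2 = rt['qt']
y2 = rt['fresh'].rename('y')
# This solution holds out 20 percent (test_size=0.2); set test_size=0.25 to follow the prompt exactly.
X_train, X_test, y_train, y_test = train_test_split(X2, y2, stratify=y2, test_size=0.2, random_state=42)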

---

### 4) Build a `BernoulliNB` model predicting fresh versus rotten from the word occurrences.

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to the baseline.

In [81]:
y_train.value_counts()/len(y_train)

fresh     0.612955
rotten    0.387045
Name: y, dtype: float64
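
The baseline is the accuracy you would get by always predicting the majority class. One way to compute it under the same cross-validation scheme as the model is scikit-learn's `DummyClassifier`. The sketch below uses made-up labels with a 60/40 split purely for illustration; `toy_X`, `toy_y`, and `baseline` are hypothetical names.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels: 6 fresh and 4 rotten.
toy_y = np.array(['fresh'] * 6 + ['rotten'] * 4)
# The dummy model ignores the features, so a placeholder column is enough.
toy_X = np.zeros((len(toy_y), 1))

# Always predict the most frequent class seen in the training fold.
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, toy_X, toy_y, cv=2)

print(baseline_scores.mean())
```

A fitted classifier is only useful if its cross-validated accuracy is clearly above this number.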

---

### 5) Pull out the probability of words given "fresh."

The `.feature_log_prob_` attribute of the Naive Bayes model contains the log probabilities of a feature appearing given a target class.

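The rows correspond to the classes of the target, in the order given by the model's `.classes_` attribute, and the columns correspond to the features. If the target is encoded as zero and one, the first row is the zero, "rotten" class and the second row is the one, "fresh" class. If the target is left as the strings "fresh" and "rotten", as in the solution below, scikit-learn sorts the labels alphabetically, so the first row is "fresh".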
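
Because the row order of `.feature_log_prob_` follows the fitted model's `.classes_` attribute, it is safer to look the row up by label than to assume an index. Here is a minimal sketch with made-up data and string labels; all names and values are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features with string labels rather than 0/1.
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
y_toy = np.array(['rotten', 'rotten', 'fresh', 'fresh'])

toy_model = BernoulliNB().fit(X_toy, y_toy)

# classes_ holds the sorted labels and defines the row order of feature_log_prob_.
print(toy_model.classes_)

# Look up each row by label instead of hard-coding 0 or 1.
class_order = list(toy_model.classes_)
fresh_row = class_order.index('fresh')
rotten_row = class_order.index('rotten')

# Convert log probabilities to probabilities for each class.
fresh_probs = np.exp(toy_model.feature_log_prob_[fresh_row])
rotten_probs = np.exp(toy_model.feature_log_prob_[rotten_row])
print(fresh_probs, rotten_probs)
```

With string labels, "fresh" sorts before "rotten", so the row indices are the reverse of what a 0/1 encoding with fresh as one would give.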

#### 5.A) Pull out the log probabilities and convert them to probabilities for fresh and rotten.

In [80]:
# Vectorize the training text into a DataFrame of binary word indicators.
# binary=True because BernoulliNB expects presence/absence features.
cvt      =  CountVectorizer(strip_accents='unicode', binary=True, stop_words='english', ngram_range=(1,3), min_df=2)
X_train2 = pd.DataFrame(cvt.fit_transform(X_train).toarray(),
             columns=cvt.get_feature_names_out())

In [81]:
#model fit
bnb = BernoulliNB()
model = bnb.fit(X_train2, y_train)

In [82]:
#alternative model cross_val direct
cross_val_score(bnb, X_train2, y_train, cv=5)

array([0.73550401, 0.76895629, 0.74308653, 0.73493976, 0.73717091])

In [83]:
# Vectorize the test set with the vocabulary learned on the training set (transform only, no refit).
X_test2 = pd.DataFrame(cvt.transform(X_test).toarray(),
             columns=cvt.get_feature_names_out())

In [84]:
# Test accuracy, to compare against the baseline of about 0.61.
model.score(X_test2, y_test)

0.7505353319057816

In [85]:
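# Row 0 is the "fresh" class: y was not encoded, so the string labels are sorted alphabetically.
coef_0 = pd.DataFrame(list(zip(X_train2.columns, model.feature_log_prob_[0])))
# Convert log probabilities to probabilities, expressed as percentages.
coef_0['exp_0'] = np.exp(coef_0[1])*100
coef_0.sort_values('exp_0', ascending=False).head(10)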

Unnamed: 0,0,1,exp_0
5245,film,-1.939528,14.377183
9606,movie,-2.116197,12.048894
8451,like,-3.16533,4.220023
13848,story,-3.175728,4.176368
1305,best,-3.186236,4.132712
6257,good,-3.222082,3.987194
5447,films,-3.263056,3.827125
9804,movies,-3.274529,3.783469
2647,comedy,-3.358747,3.477881
14738,time,-3.41026,3.30326


In [86]:
# Row 1 is the "rotten" class.
coef_1 = pd.DataFrame(list(zip(X_train2.columns, model.feature_log_prob_[1])))
coef_1['exp_1'] = np.exp(coef_1[1])*100
coef_1.sort_values('exp_1', ascending=False).head(10)

Unnamed: 0,0,1,exp_1
9606,movie,-2.007442,13.43318
5245,film,-2.318846,9.83871
8451,like,-2.709203,6.658986
13848,story,-3.149883,4.285714
7935,just,-3.251666,3.870968
2647,comedy,-3.313035,3.640553
8563,little,-3.405816,3.317972
2227,characters,-3.405816,3.317972
9804,movies,-3.419803,3.271889
6257,good,-3.441156,3.202765


#### 5.B) Make a DataFrame with the probabilities and features.

In [87]:
df_prob = pd.concat([coef_0[[0,'exp_0']], coef_1['exp_1']], axis=1)
df_prob.head(5)

Unnamed: 0,0,exp_0,exp_1
0,007,0.087311,0.046083
1,10,0.145518,0.391705
2,10 minutes,0.043655,0.138249
3,10 years,0.043655,0.069124
4,100,0.101863,0.069124


#### 5.C) Create a column that is the difference between the probability of a word appearing in fresh reviews and in rotten reviews.

In [88]:
# exp_0 is the "fresh" class and exp_1 is "rotten" (y was not encoded), so positive values favor fresh.
df_prob['diff'] = df_prob['exp_0'] - df_prob['exp_1']

#### 5.D) Look at the most likely words for fresh and rotten reviews.

In [89]:
#likely fresh words
df_prob.sort_values('diff', ascending=False).head(10)

Unnamed: 0,0,exp_0,exp_1,diff
5245,film,14.377183,9.83871,4.538473
1305,best,4.132712,1.543779,2.588934
6403,great,2.721187,0.875576,1.845611
10783,performance,2.182771,0.506912,1.675858
4600,entertaining,2.182771,0.62212,1.560651
5447,films,3.827125,2.373272,1.453853
5882,fun,2.473807,1.24424,1.229567
4618,entertainment,1.77532,0.552995,1.222325
593,american,2.022701,0.806452,1.216249
16204,years,1.848079,0.668203,1.179876


In [90]:
#likely rotten words
df_prob.sort_values('diff', ascending=True).head(10)

Unnamed: 0,0,exp_0,exp_1,diff
8451,like,4.220023,6.658986,-2.438963
1069,bad,0.785797,2.557604,-1.771806
8563,little,1.818976,3.317972,-1.498997
4023,doesnt,1.600698,3.064516,-1.463818
9606,movie,12.048894,13.43318,-1.384286
11774,really,0.698487,2.050691,-1.352205
7674,isnt,1.251455,2.511521,-1.260066
11031,plot,1.266007,2.465438,-1.199431
14543,theres,1.77532,2.949309,-1.173989
7935,just,2.706636,3.870968,-1.164332


---

### 6) Examine how your model performs on the testing set.

In [91]:
# Test accuracy, to compare against the baseline of about 0.61.
model.score(X_test2, y_test)

0.7505353319057816

---

### 7) Look at the top 10 movies and reviews likely to be fresh and the top 10 likely to be rotten.

You can fit the model on the full set of data for this.

> **Note:** While it's good at classifying, Naive Bayes is known to be somewhat bad at providing accurate predicted probabilities (beyond getting it on the correct side of 50 percent). It's a good classifier but a bad estimator. 

In [95]:
# Columns of predict_proba follow model.classes_, so column 0 is P(fresh) here.
y_pred = model.predict_proba(X_test2)

In [98]:
y_pred[:,0]

array([0.97158038, 0.90979675, 0.00333308, ..., 0.79236212, 0.01529337,
       0.88559509])

In [110]:
test = pd.DataFrame(X_test).reset_index()
test['fresh_prob'] = y_pred[:,0]

In [118]:
#top 10 likely fresh
test.sort_values('fresh_prob', ascending=False).head(10)['qt'].values

array(['rudy vallee turns in his best performance as a gentle puny millionaire named hackensacker in this brilliant simultaneously tender and scalding 1942 screwball comedy by preston sturges',
       'the city of lost children is a stunningly surreal fantasy a fable of longing and danger of heroic deeds and bravery set in a brilliantly realized world of its own it is one of the most audacious original films of the year',
       'martin scorseses intimate epic about money sex and brute force is a grandly conceived study of what happens to goodfellas from the mean streets when they outstrip their wildest dreams and achieve the pinnacle of wealth and power',
       'the most exciting debut in years it is unified by the extraordinary decor  colour supplement chic meets pop art surrealism  which creates a world of totally fantastic reality situated foursquare in contemporary paris',
       'james and the giant peach the latest animated film from disney is a technological marvel arch and in

In [117]:
#top 10 likely rotten
test.sort_values('fresh_prob', ascending=True).head(10)['qt'].values

array(['the problem with deuce bigalow male gigolo is the low level of its low humor',
       'georgia rule is a bad idea dreadfully executed  on golden pond with fellatio jokes and whimsical incest melodrama and fonda playing her dad who more and more she eerily resembles',
       'contrived obvious and overstated crash is basically just one white mans righteous attempt to make other white people feel as if theyve confronted the problem of racism headon',
       'while the makers of robin hood prince of thieves may have set out to bury the poor old duffer of sherwood forest in a welter of trendy banter they have ended up burying themselves as well',
       'bland interminable chase scenes take up so much of the story  the hackneyed plot doesnt need much exposition  that the sheer repetitiveness begins to amaze you',
       'perhaps at 90 or so minutes it would have been the hitchcockian thriller that it isnt at the beginning but turns into at two hours and 20 minutes theres too much o

---

### 8) Find the movies with at least 10 reviews that are most likely to be fresh or rotten.

In [13]:
# A:
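
One possible approach, sketched on made-up data: attach a predicted fresh probability to each review, average it per title, and keep only titles with enough reviews. The DataFrame, its values, and the threshold variable below are all hypothetical; in the lab you would build the `fresh_prob` column from `predict_proba` and use a threshold of 10.

```python
import pandas as pd

# Hypothetical per-review predictions; in the lab these would come from predict_proba.
toy = pd.DataFrame({
    'title': ['A'] * 3 + ['B'] * 2 + ['C'] * 3,
    'fresh_prob': [0.9, 0.8, 0.7, 0.99, 0.98, 0.2, 0.1, 0.3],
})

# Minimum number of reviews per movie; the lab asks for 10, 3 is used for this toy data.
min_reviews = 3

# Average predicted probability and review count per title.
summary = toy.groupby('title')['fresh_prob'].agg(['mean', 'count'])

# Keep titles with enough reviews, most likely fresh first.
summary = summary[summary['count'] >= min_reviews].sort_values('mean', ascending=False)
print(summary)
```

The top of `summary` lists the movies most likely to be fresh and the bottom lists those most likely to be rotten. Averaging is only one way to aggregate; the median would be less sensitive to a few extreme predictions.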