<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Sentiment Analysis and Naive Bayes

_Authors: Kiefer Katovich (SF)_

---

In the sentiment analysis lesson we used a predefined dictionary of positive and negative valences for words. This lab inverts the process: you'll find which words are most likely to appear in positive or negative reviews by using the rotten vs. fresh binary label.

### Naive Bayes

A practical and common way to do this is with the Naive Bayes algorithm. Naive Bayes classifiers are covered in more depth in another lecture – for this lab you'll just be leveraging the sklearn implementation.

Given a feature $x_i$ and target $y_i$, Naive Bayes classifiers estimate $P(x_i \;|\; y_i)$ for each class of the target. In other words, the probability of a feature/predictor _given_ the class, for example given that the target is 1.

We'll use this to figure out which words are more likely to appear when the target is 1 ("fresh") vs when the target is 0 ("rotten").
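
As a rough illustration of what "probability of a word given the class" means, the sketch below estimates it by counting on a tiny made-up dataset. The data, the `word_probs` helper, and the two example words are hypothetical. The add-one smoothing is an assumption chosen to match the `alpha=1.0` default that appears in the model output later in this lab.

```python
import numpy as np

# Hypothetical toy data: rows are reviews, columns are the words ["great", "bad"].
# 1 means the word appears in the review, 0 means it does not.
X_toy = np.array([
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 1],
    [1, 1],
])
y_toy = np.array([1, 1, 1, 0, 0, 0])  # 1 = fresh, 0 = rotten


def word_probs(X, y, label, alpha=1.0):
    """Smoothed estimate of P(word appears | class == label)."""
    rows = X[y == label]
    # Add alpha to the count of appearances and 2 * alpha to the class size
    # (one pseudo-count each for "appears" and "does not appear").
    return (rows.sum(axis=0) + alpha) / (rows.shape[0] + 2 * alpha)


p_given_fresh = word_probs(X_toy, y_toy, label=1)
p_given_rotten = word_probs(X_toy, y_toy, label=0)

print("P(word | fresh): ", p_given_fresh)
print("P(word | rotten):", p_given_rotten)
```

In this toy data "great" appears in every fresh review and in one rotten review, so its estimated probability is higher given fresh than given rotten.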

---

### 1. Load packages and movie data

Do any cleaning you deem necessary.

In [2]:
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes predictors are binary encoded.
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.feature_extraction.text import CountVectorizer

In [3]:
rt = pd.read_csv('./datasets/rt_critics.csv')

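# Keep only reviews labeled fresh or rotten; copy so the assignment below
# modifies this DataFrame rather than a view of the original.
rt = rt[rt.fresh.isin(['fresh','rotten'])].copy()
# Encode the target: 1 for fresh, 0 for rotten.
rt['fresh'] = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0)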

In [4]:
rt.head(2)

| | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|---|---|---|---|---|---|---|---|
| 0 | Derek Adams | 1 | 114709.0 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559.0 | Toy story |
| 1 | Richard Corliss | 1 | 114709.0 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559.0 | Toy story |


---

### 2. Create a predictor matrix of words from the quotes with CountVectorizer

It is up to you what ngram range you want to select. **Make sure that `binary=True`**
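
To see what `binary=True` changes, here is a small sketch on two made-up quotes (the quotes and variable names are hypothetical). With the default `binary=False` the matrix holds word counts; with `binary=True` any nonzero count becomes 1, which is the presence/absence encoding `BernoulliNB` expects.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_quotes = ["good good movie", "bad movie"]  # hypothetical quotes

# Columns are in alphabetical order of the vocabulary: bad, good, movie.
counts = CountVectorizer(binary=False).fit_transform(toy_quotes).toarray()
flags = CountVectorizer(binary=True).fit_transform(toy_quotes).toarray()

print(counts)  # "good" is counted twice in the first quote
print(flags)   # every nonzero count is recorded as 1
```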

In [5]:
cv = CountVectorizer(ngram_range=(1,2), max_features=2500, binary=True, stop_words='english')
words = cv.fit_transform(rt.quote)

In [6]:
words.shape

(14049, 2500)

In [7]:
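# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed.
words = pd.DataFrame(words.toarray(), columns=cv.get_feature_names_out())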

In [8]:
words.head()

Unnamed: 0,10,100,13,1961,1998,20,2001,30,40,50s,...,year,year old,years,years ago,yes,york,young,younger,youth,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
print(words.shape)

(14049, 2500)


---

### 3. Split data into training and testing splits

You should keep 25% of the data in the test set.

In [10]:
Xtrain, Xtest, ytrain, ytest = train_test_split(words.values, rt.fresh.values, test_size=0.25)

In [11]:
print(Xtrain.shape, Xtest.shape)

(10536, 2500) (3513, 2500)
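
The split above is random, so the exact rows in each set change from run to run. If you want a reproducible split that also keeps the fresh/rotten ratio the same in both sets, one option is sketched below on made-up data. The `random_state` value and the demo arrays are arbitrary assumptions, not part of the lab's solution.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 60% labeled 1 and 40% labeled 0.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 60 + [0] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```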


---

### 4. Build a `BernoulliNB` model predicting fresh vs. rotten from the word appearances

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to baseline.

In [12]:
nb = BernoulliNB()

In [13]:
nb.fit(Xtrain, ytrain)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [14]:
nb_scores = cross_val_score(BernoulliNB(), Xtrain, ytrain, cv=5)
print(nb_scores)
print(np.mean(nb_scores))
# Baseline: the proportion of "fresh" reviews in the training set.
print(np.mean(ytrain))

[ 0.74240987  0.7329222   0.73896535  0.73659231  0.72649573]
0.735477091947
0.615413819286
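
The baseline here is the accuracy you would get by always guessing the majority class. As a sanity check, the same number can be obtained with scikit-learn's `DummyClassifier`. The sketch below uses made-up data (the arrays and seed are assumptions) to show that the cross-validated dummy score matches the majority-class proportion.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data: the features are random noise, 60% of labels are 1.
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 2, size=(100, 5))
y_demo = np.array([1] * 60 + [0] * 40)

# A classifier that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)

print(baseline_scores.mean())
print(max(y_demo.mean(), 1 - y_demo.mean()))
```

A model is only useful if its cross-validated accuracy is clearly above this number.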


---

### 5. Pull out the probability of words given "fresh"

The `.feature_log_prob_` attribute of the Naive Bayes model contains the log probabilities of a feature appearing given a target class.

The rows correspond to the target classes and the columns correspond to the features. The first row is the 0 ("rotten") class, and the second is the 1 ("fresh") class.
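
If you want to confirm the row order rather than rely on memory, the fitted model's `classes_` attribute lists the classes in the same order as the rows of `feature_log_prob_`. A minimal sketch on made-up data (the toy arrays and the name `nb_toy` are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: two binary word features, four reviews.
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])
y_toy = np.array([1, 1, 0, 0])  # 1 = fresh, 0 = rotten

nb_toy = BernoulliNB().fit(X_toy, y_toy)

# classes_ gives the row order of feature_log_prob_.
print(nb_toy.classes_)

# Exponentiate the log probabilities to get P(word appears | class).
probs = np.exp(nb_toy.feature_log_prob_)
print(probs)
```

Each row of `probs` is one class and each column is one feature, so `probs[1]` holds the "fresh" probabilities here.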

#### 5.1 Pull out the log probabilities and convert them to probabilities (for fresh and for rotten).

In [15]:
feat_lp = nb.feature_log_prob_

In [16]:
fresh_p = np.exp(feat_lp[1])

In [17]:
rotten_p = np.exp(feat_lp[0])

#### 5.2 Make a dataframe with the probabilities and features

In [18]:
feat_probs = pd.DataFrame({'fresh_p':fresh_p, 'rotten_p':rotten_p, 'feature':words.columns.values})

#### 5.3 Create a column that is the difference between fresh probability of appearance and rotten

In [20]:
feat_probs['fresh_diff'] = feat_probs.fresh_p - feat_probs.rotten_p
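
A raw difference tends to favor words that are common in both classes, since small relative changes in a frequent word still produce a large absolute gap. One possible complement, sketched here as an optional extra rather than part of the lab's solution, is the log ratio of the two probabilities. The `log_ratio` column name is an assumption; the four rows are copied from the outputs shown in this lab.

```python
import numpy as np
import pandas as pd

# A few rows copied from the feature probability outputs in this lab.
demo = pd.DataFrame({
    "feature": ["film", "great", "movie", "jokes"],
    "fresh_p": [0.154178, 0.030065, 0.131360, 0.003238],
    "rotten_p": [0.111495, 0.009373, 0.144549, 0.013074],
})

# Absolute difference, as in the lab.
demo["fresh_diff"] = demo.fresh_p - demo.rotten_p
# Relative difference: positive means more typical of fresh reviews.
demo["log_ratio"] = np.log(demo.fresh_p) - np.log(demo.rotten_p)

print(demo.sort_values("log_ratio", ascending=False))
```

On these rows the difference ranks "film" first and "movie" last, while the log ratio ranks "great" first and "jokes" last.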

#### 5.4 Look at the most likely words for fresh and rotten reviews

In [21]:
feat_probs.sort_values('fresh_diff', ascending=False, inplace=True)
feat_probs.head(20)

| | feature | fresh_p | rotten_p | fresh_diff |
|---|---|---|---|---|
| 825 | film | 0.154178 | 0.111495 | 0.042683 |
| 193 | best | 0.041936 | 0.019734 | 0.022203 |
| 965 | great | 0.030065 | 0.009373 | 0.020691 |
| 693 | entertaining | 0.02436 | 0.00666 | 0.0177 |
| 1584 | performance | 0.023127 | 0.006413 | 0.016713 |
| 87 | american | 0.022047 | 0.0074 | 0.014647 |
| 694 | entertainment | 0.017422 | 0.004193 | 0.013229 |
| 900 | fun | 0.025131 | 0.01332 | 0.011811 |
| 837 | films | 0.023127 | 0.012087 | 0.01104 |
| 2248 | time | 0.039161 | 0.028367 | 0.010794 |


In [22]:
feat_probs.sort_values('fresh_diff', ascending=True, inplace=True)
feat_probs.head(20)

| | feature | fresh_p | rotten_p | fresh_diff |
|---|---|---|---|---|
| 1269 | like | 0.044403 | 0.065614 | -0.021211 |
| 157 | bad | 0.007092 | 0.024914 | -0.017821 |
| 606 | doesn | 0.015109 | 0.030587 | -0.015478 |
| 1754 | really | 0.006938 | 0.020967 | -0.014029 |
| 1629 | plot | 0.012488 | 0.0259 | -0.013412 |
| 1445 | movie | 0.13136 | 0.144549 | -0.013189 |
| 1164 | isn | 0.011718 | 0.024667 | -0.012949 |
| 1280 | little | 0.018347 | 0.030587 | -0.01224 |
| 1894 | script | 0.010484 | 0.020967 | -0.010483 |
| 1182 | jokes | 0.003238 | 0.013074 | -0.009836 |


---

### 6. Examine how your model performs on the test set

In [23]:
print(nb.score(Xtest, ytest))
# Baseline: the proportion of "fresh" reviews in the test set.
print(np.mean(ytest))

0.743524053516
0.606034728153
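
Accuracy alone does not show which kind of mistake the model makes. If you want to see whether errors are mostly fresh reviews called rotten or the reverse, a confusion matrix is one option. The sketch below uses made-up labels and predictions (both lists are hypothetical); on the real data you would pass `ytest` and the model's predictions on `Xtest`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = fresh, 0 = rotten).
y_true_demo = [1, 0, 1, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1]

# Rows are the true classes (0 then 1), columns are the predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(acc)
```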


---

### 7. Look at the 10 reviews (and their movies) most likely to be fresh and the 10 most likely to be rotten

You can fit the model on the full set of data for this.

> **Note:** Naive Bayes, while good at classifying, is known to give poorly calibrated predicted probabilities (beyond landing on the correct side of 50%). It is a good classifier but a bad estimator.
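
One way to see where the overconfidence comes from: Naive Bayes treats every feature as independent evidence, so correlated features get counted more than once. The sketch below is a contrived illustration on made-up data (all names and values are assumptions). It fits the model on a single feature, then on five identical copies of that feature, and compares the predicted probability of class 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical data: one binary feature that is loosely related to the target.
x = np.array([1, 1, 1, 0, 0, 0, 0, 1]).reshape(-1, 1)
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Model with the feature seen once.
single = BernoulliNB().fit(x, y_toy)
p_single = single.predict_proba([[1]])[0, 1]

# Model with five identical copies of the same feature (no new information).
x5 = np.repeat(x, 5, axis=1)
repeated = BernoulliNB().fit(x5, y_toy)
p_repeated = repeated.predict_proba([[1] * 5])[0, 1]

print(p_single)    # about 0.67
print(p_repeated)  # about 0.97
```

The predicted class is the same in both cases, but the probability moves much closer to 1 even though the copies add no information. Overlapping unigrams and bigrams in a text model can have a similar effect.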

In [24]:
X = words.values
y = rt.fresh

In [25]:
nbfull = BernoulliNB().fit(X,y)

In [26]:
pp = pd.DataFrame({
        'prob_fresh':nbfull.predict_proba(X)[:,1],
        'movie':rt.title,
        'quote':rt.quote
    })

In [27]:
pp.sort_values('prob_fresh', ascending=False, inplace=True)
for movie, quote in zip(pp.movie[0:10], pp.quote[0:10]):
    print(f'{movie} \t{quote}')
    print('--------------------------------------------------\n')

Kundun 	Stunning, odd, glorious, calm and sensationally absorbing, director Martin Scorsese's Kundun is a remarkable piece of work with vital colors and a wrenching message.
--------------------------------------------------

The Wild Bunch 	The Wild Bunch is Peckinpah's most complex inquiry into the metamorphosis of man into myth. Not incidentally, it is also a raucous, violent, powerful feat of American film making.
--------------------------------------------------

Witness 	Powerful, assured, full of beautiful imagery and thankfully devoid of easy moralising, it also offers a performance of surprising skill and sensitivity from Ford.
--------------------------------------------------

The English Patient 	This is one of the year's most unabashed and powerful love stories, using flawless performances, intelligent dialogue, crisp camera work, and loaded glances to attain a level of eroticism and emotional connection that many similar films miss.
--------------------------------------

In [28]:
pp.sort_values('prob_fresh', ascending=True, inplace=True)
for movie, quote in zip(pp.movie[0:10], pp.quote[0:10]):
    print(f'{movie} \t{quote}')
    print('--------------------------------------------------\n')

Pokémon: The First Movie 	With intentionally stilted animation, uninspired music and lame jokes, Pokemon is basically an ultralong version of the phenomenon's own boring TV 'toon.
--------------------------------------------------

Joe's Apartment 	There's not enough story here for something half that length, so we're subjected to numerous pointless and irritating song-and-dance numbers designed to nudge the lame plot towards its conclusion.
--------------------------------------------------

Kazaam 	As fairy tale, buddy comedy, family drama, thriller or rap revue, Kazaam is simply uninspired and unconvincing, and Mr. O'Neal, who can carry a basketball team, lacks the charisma to rescue this misguided effort.
--------------------------------------------------

Gung Ho 	A disappointment, a movie in which the Japanese are mostly used for the mechanical requirements of the plot, and the Americans are constructed from durable but boring stereotypes.
----------------------------------------

---

### 8. Find the movies most likely to be fresh and rotten among those with at least 10 reviews

In [33]:
# subset to movies with at least 10 reviews:
movie_counts = pp.movie.value_counts().reset_index()
movie_counts.columns = ['movie','counts']
movie_counts.head()

| | movie | counts |
|---|---|---|
| 0 | The Hurricane | 20 |
| 1 | Fever Pitch | 20 |
| 2 | The Truman Show | 20 |
| 3 | The Green Mile | 20 |
| 4 | The Sixth Sense | 20 |


In [34]:
# Average the predicted probability of "fresh" across each movie's reviews.
pp_movies = pp[['movie','prob_fresh']].groupby('movie').mean().reset_index()
pp_movies = pp_movies[pp_movies.movie.isin(movie_counts[movie_counts.counts >= 10].movie)]
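
The mean and the review count can also be computed in a single `groupby`, which avoids building a separate counts table. This is an alternative sketch on made-up data, not the lab's solution. The column names `mean_prob` and `n_reviews` and the threshold of 2 are assumptions for the demo (the lab uses 10).

```python
import pandas as pd

# Hypothetical per-review probabilities for three movies.
demo_pp = pd.DataFrame({
    "movie": ["A", "A", "A", "B", "B", "C"],
    "prob_fresh": [0.9, 0.8, 0.7, 0.2, 0.4, 0.99],
})
min_reviews = 2  # the lab uses 10

# Named aggregation: one row per movie with its mean probability and review count.
summary = (
    demo_pp.groupby("movie")["prob_fresh"]
    .agg(mean_prob="mean", n_reviews="count")
    .reset_index()
)
# Drop movies with too few reviews to give a stable average.
summary = summary[summary.n_reviews >= min_reviews]

print(summary.sort_values("mean_prob", ascending=False))
```

Movie "C" has the highest single-review probability but is dropped because it has only one review.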

In [35]:
pp_movies.sort_values('prob_fresh', ascending=False, inplace=True)
pp_movies.head(20)

| | movie | prob_fresh |
|---|---|---|
| 1417 | The Iron Giant | 0.979857 |
| 862 | Midnight Run | 0.938485 |
| 209 | Boogie Nights | 0.933018 |
| 830 | Manhattan | 0.923313 |
| 1447 | The Little Mermaid | 0.91311 |
| 1058 | Raging Bull | 0.909063 |
| 652 | Il conformista | 0.899021 |
| 1055 | Quiz Show | 0.897883 |
| 1615 | Toy story | 0.894362 |
| 298 | Cookie's Fortune | 0.889969 |


In [36]:
pp_movies.sort_values('prob_fresh', ascending=True, inplace=True)
pp_movies.head(20)

| | movie | prob_fresh |
|---|---|---|
| 152 | Basic Instinct 2 | 0.172296 |
| 1205 | Spy Hard | 0.178794 |
| 1661 | Vegas Vacation | 0.178997 |
| 1669 | Virus | 0.198833 |
| 288 | Color of Night | 0.210623 |
| 139 | Bad Boys | 0.211745 |
| 9 | 3 Strikes | 0.216798 |
| 391 | Dracula: Dead and Loving It | 0.220098 |
| 627 | House Arrest | 0.223951 |
| 1296 | The Bachelor | 0.224735 |
