# Sentiment analysis and Naive Bayes

Now that you've seen sentiment analysis using a predefined dictionary of positive and negative valence for different words, this lab will have you do sentiment analysis on the movie review dataset "in reverse": you'll find which words are most likely to appear in positively or negatively valenced reviews.

---

### Naive Bayes

For this lab we're going to use a classifier common in NLP and sentiment analysis called Naive Bayes. We'll be covering Naive Bayes classifiers in more depth in a later lecture – for this lab you'll just be implementing it using sklearn.

Essentially, Naive Bayes inverts the problem you are used to in supervised learning. Given a feature $x_i$ and target $y$, it estimates $P(x_i | y)$: the probability of a feature/predictor _given_ the class of the target (for example, given that the target is 1). It then combines these per-feature probabilities with Bayes' rule to predict the target.

We'll use this to figure out which words are more likely to appear when the target is 1 ("fresh") vs when the target is 0 ("rotten").
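To make the idea concrete, here is a minimal sketch of the Bernoulli version computed by hand with numpy. The vocabulary, reviews and labels are invented for illustration, and the smoothing constant `alpha` is an assumption chosen to mirror the usual add-one (Laplace) smoothing. It is not meant as a replacement for the sklearn model used in the rest of the lab.

```python
import numpy as np

# Toy binary word-presence matrix: rows are reviews, columns are words.
# The vocabulary and labels are invented for illustration.
vocab = ["great", "boring", "plot"]
X = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = "fresh", 0 = "rotten"

alpha = 1.0  # add-one smoothing so no probability is exactly 0 or 1


def word_probs(X, y, label, alpha=1.0):
    """Estimate P(word present | class == label) for every word."""
    rows = X[y == label]
    return (rows.sum(axis=0) + alpha) / (rows.shape[0] + 2 * alpha)


p_fresh = word_probs(X, y, 1, alpha)
p_rotten = word_probs(X, y, 0, alpha)


def posterior_fresh(x):
    """Combine the per-word probabilities with Bayes' rule: P(fresh | x)."""
    like_fresh = np.prod(np.where(x == 1, p_fresh, 1 - p_fresh))
    like_rotten = np.prod(np.where(x == 1, p_rotten, 1 - p_rotten))
    prior_fresh = np.mean(y == 1)
    numerator = like_fresh * prior_fresh
    return numerator / (numerator + like_rotten * (1 - prior_fresh))


for word, pf, pr in zip(vocab, p_fresh, p_rotten):
    print(word, round(pf, 2), round(pr, 2))

# A review containing only the word "great".
print(posterior_fresh(np.array([1, 0, 0])))
```

The "naive" part is the product over words inside `posterior_fresh`: each word is treated as independent of the others once the class is known.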

---

### Load packages and movie data

Do any cleaning you deem necessary.

In [5]:
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes predictors are binary encoded.
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.feature_extraction.text import CountVectorizer

In [22]:
rt = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/rottentomatoes_critics/rt_critics.csv')

rt = rt[rt.fresh.isin(['fresh','rotten'])]
rt.fresh = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0)

In [98]:
rt.head(2)

| | critic | fresh | imdb | publication | quote | review_date | rtid | title |
|---|---|---|---|---|---|---|---|---|
| 0 | Derek Adams | 1 | 114709.0 | Time Out | So ingenious in concept, design and execution ... | 2009-10-04 | 9559.0 | Toy story |
| 1 | Richard Corliss | 1 | 114709.0 | TIME Magazine | The year's most inventive comedy. | 2008-08-31 | 9559.0 | Toy story |


---

### Create a predictor matrix of words from the quotes with CountVectorizer

It is up to you what ngram range you want to select. **Make sure that `binary=True`**
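As a rough illustration of what `binary=True` changes, the sketch below vectorizes two invented quotes with and without it. With `binary=True` a repeated word is recorded as 1 (present) rather than as its count, which matches the presence/absence encoding that `BernoulliNB` assumes.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented quotes; "funny" appears twice in the first one.
quotes = ["funny, funny and smart", "not funny"]

counts = CountVectorizer(binary=False).fit(quotes)
binary = CountVectorizer(binary=True).fit(quotes)

# Column order of the matrices, recovered from the fitted vocabulary.
vocab = sorted(counts.vocabulary_, key=counts.vocabulary_.get)

count_matrix = counts.transform(quotes).toarray()
binary_matrix = binary.transform(quotes).toarray()

print(vocab)
print(count_matrix)   # "funny" is counted twice in the first row
print(binary_matrix)  # "funny" is only marked as present
```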

In [77]:
cv = CountVectorizer(ngram_range=(1,2), max_features=2500, binary=True, stop_words='english')
words = cv.fit_transform(rt.quote)

In [78]:
words.shape

(14049, 2500)

In [79]:
words = pd.DataFrame(words.toarray(), columns=cv.get_feature_names_out())

In [80]:
words.head()

Unnamed: 0,10,100,13,1961,1998,20,2001,30,40,50s,...,year,year old,years,years ago,yes,york,young,younger,youth,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [81]:
print(words.shape)

(14049, 2500)


---

### Split data into training and testing splits

You should keep 25% of the data in the test set.

In [82]:
Xtrain, Xtest, ytrain, ytest = train_test_split(words.values, rt.fresh.values, test_size=0.25)

In [83]:
print(Xtrain.shape, Xtest.shape)

(10536, 2500) (3513, 2500)


---

### Build a `BernoulliNB` model predicting fresh vs. rotten from the word appearances

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to baseline.
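For reference, "baseline" here can be read as the accuracy you would get by always predicting the most common class. The sketch below shows that calculation on invented labels; the helper name `baseline_accuracy` is hypothetical. Because the target is coded 0/1, the mean of the labels is the fraction of "fresh" reviews, so when "fresh" is the majority class the mean of the training labels is itself the baseline.

```python
import numpy as np


def baseline_accuracy(y):
    """Accuracy obtained by always predicting the most common class in y."""
    y = np.asarray(y)
    p_positive = y.mean()  # fraction of 1s ("fresh")
    return max(p_positive, 1 - p_positive)


# Invented labels: 6 "fresh" (1) and 4 "rotten" (0).
y_example = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
print(baseline_accuracy(y_example))
```

A cross-validated accuracy is only informative to the extent that it beats this number.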

In [84]:
nb = BernoulliNB()

In [85]:
nb.fit(Xtrain, ytrain)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [94]:
nb_scores = cross_val_score(BernoulliNB(), Xtrain, ytrain, cv=5)
print(nb_scores)
print(np.mean(nb_scores))
print(np.mean(ytrain))

[ 0.7471537   0.71963947  0.72615093  0.74228761  0.75071225]
0.73718879156
0.613800303721


---

### Pull out the probability of words given "fresh"

The `.feature_log_prob_` attribute of the Naive Bayes model contains the log probabilities of a feature appearing given a target class.

The rows correspond to the class of the target, and the columns correspond to the features. The first row is the 0 "rotten" class, and the second is the 1 "fresh" class.
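The layout can be checked on a tiny invented dataset, as in the sketch below. It relies on the fitted model's `classes_` attribute to confirm which row belongs to which class instead of assuming the order, and on `np.exp` to undo the log.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Invented binary features for 6 reviews and 2 words.
X = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = "fresh", 0 = "rotten"

model = BernoulliNB().fit(X, y)

# classes_ gives the row order of feature_log_prob_.
print(model.classes_)
print(model.feature_log_prob_.shape)  # (n_classes, n_features)

# Exponentiate to move from log probabilities to probabilities.
probs = np.exp(model.feature_log_prob_)
rotten_row = probs[list(model.classes_).index(0)]
fresh_row = probs[list(model.classes_).index(1)]
print(rotten_row)
print(fresh_row)
```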

#### 1. Pull out the log probabilities and convert them to probabilities (for fresh and for rotten).

In [87]:
feat_lp = nb.feature_log_prob_

In [88]:
fresh_p = np.exp(feat_lp[1])

In [89]:
rotten_p = np.exp(feat_lp[0])

#### 2. Make a dataframe with the probabilities and features

In [90]:
feat_probs = pd.DataFrame({'fresh_p':fresh_p, 'rotten_p':rotten_p, 'feature':words.columns.values})

#### 3. Create a column that is the difference between fresh probability of appearance and rotten

In [91]:
feat_probs['fresh_diff'] = feat_probs.fresh_p - feat_probs.rotten_p

#### 4. Look at the most likely words for fresh and rotten reviews

In [92]:
feat_probs.sort_values('fresh_diff', ascending=False, inplace=True)
feat_probs.head(20)

| | feature | fresh_p | rotten_p | fresh_diff |
|---|---|---|---|---|
| 823 | film | 0.155047 | 0.112012 | 0.043035 |
| 192 | best | 0.041738 | 0.01916 | 0.022578 |
| 965 | great | 0.028598 | 0.00958 | 0.019018 |
| 691 | entertaining | 0.023188 | 0.006141 | 0.017047 |
| 87 | american | 0.021333 | 0.006141 | 0.015192 |
| 1590 | performance | 0.021178 | 0.006632 | 0.014546 |
| 898 | fun | 0.024888 | 0.011791 | 0.013097 |
| 902 | funny | 0.033699 | 0.020879 | 0.01282 |
| 948 | good | 0.046375 | 0.033898 | 0.012477 |
| 835 | films | 0.023342 | 0.011299 | 0.012043 |


In [93]:
feat_probs.sort_values('fresh_diff', ascending=True, inplace=True)
feat_probs.head(20)

| | feature | fresh_p | rotten_p | fresh_diff |
|---|---|---|---|---|
| 1269 | like | 0.041738 | 0.071481 | -0.029744 |
| 1446 | movie | 0.12985 | 0.14763 | -0.01778 |
| 156 | bad | 0.008038 | 0.025301 | -0.017263 |
| 1761 | really | 0.006183 | 0.021371 | -0.015187 |
| 605 | doesn | 0.015149 | 0.029968 | -0.014819 |
| 1166 | isn | 0.010666 | 0.024564 | -0.013898 |
| 1635 | plot | 0.012058 | 0.025301 | -0.013243 |
| 1280 | little | 0.017313 | 0.029722 | -0.012409 |
| 1192 | just | 0.027516 | 0.039548 | -0.012032 |
| 1900 | script | 0.010357 | 0.021616 | -0.011259 |


---

### Examine how your model performs on the test set

In [95]:
print(nb.score(Xtest, ytest))
print(np.mean(ytest))

0.724736692286
0.610873896954


---

### Look at the top 10 movies and reviews likely to be fresh and top 10 likely to be rotten

You can fit the model on the full set of data for this.

Note: Naive Bayes, while good at classifying, is known to be poor at giving accurate predicted probabilities (beyond putting them on the correct side of 50%). It is a good classifier but a bad estimator.
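One way to see why the probabilities drift is to feed the model features that are strongly correlated, which violates the independence assumption. The sketch below uses invented data: a single word that appears in 35 of 50 fresh reviews and 15 of 50 rotten ones, and then the same word copied 10 times as a stand-in for correlated features. The copies add no new information, yet the predicted probability of "fresh" is pushed much closer to 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Invented data: 50 fresh and 50 rotten reviews, one word that appears in
# 35 of the fresh reviews and 15 of the rotten ones.
y = np.array([1] * 50 + [0] * 50)
word = np.array([1] * 35 + [0] * 15 + [1] * 15 + [0] * 35).reshape(-1, 1)

# The same word copied 10 times, mimicking strongly correlated features.
word_x10 = np.repeat(word, 10, axis=1)

single = BernoulliNB().fit(word, y)
copied = BernoulliNB().fit(word_x10, y)

# P(fresh | word present) under each model.
p_single = single.predict_proba(np.ones((1, 1)))[0, 1]
p_copied = copied.predict_proba(np.ones((1, 10)))[0, 1]
print(p_single, p_copied)
```

Both models put the review on the same side of 50%, so the classification is unchanged; only the size of the probability is inflated.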

In [96]:
X = words.values
y = rt.fresh

In [97]:
nbfull = BernoulliNB().fit(X,y)

In [99]:
pp = pd.DataFrame({
        'prob_fresh':nbfull.predict_proba(X)[:,1],
        'movie':rt.title,
        'quote':rt.quote
    })

In [100]:
pp.sort_values('prob_fresh', ascending=False, inplace=True)
for movie, quote in zip(pp.movie[0:10], pp.quote[0:10]):
    print(movie, '\t', quote)
    print('--------------------------------------------------\n')

Kundun 	Stunning, odd, glorious, calm and sensationally absorbing, director Martin Scorsese's Kundun is a remarkable piece of work with vital colors and a wrenching message.
--------------------------------------------------

City Hall 	Its chief pleasure is the acting of the big cast, notably Pacino. At 55, he has a haggard, life-wrestling beauty and a street eloquence that has more innocence than De Niro and more sincerity than Nicholson.
--------------------------------------------------

The Wild Bunch 	The Wild Bunch is Peckinpah's most complex inquiry into the metamorphosis of man into myth. Not incidentally, it is also a raucous, violent, powerful feat of American film making.
--------------------------------------------------

Witness 	Powerful, assured, full of beautiful imagery and thankfully devoid of easy moralising, it also offers a performance of surprising skill and sensitivity from Ford.
--------------------------------------------------

The English Patient 	This is on

In [101]:
pp.sort_values('prob_fresh', ascending=True, inplace=True)
for movie, quote in zip(pp.movie[0:10], pp.quote[0:10]):
    print(movie, '\t', quote)
    print('--------------------------------------------------\n')

Pokémon: The First Movie 	With intentionally stilted animation, uninspired music and lame jokes, Pokemon is basically an ultralong version of the phenomenon's own boring TV 'toon.
--------------------------------------------------

Joe's Apartment 	There's not enough story here for something half that length, so we're subjected to numerous pointless and irritating song-and-dance numbers designed to nudge the lame plot towards its conclusion.
--------------------------------------------------

Kazaam 	As fairy tale, buddy comedy, family drama, thriller or rap revue, Kazaam is simply uninspired and unconvincing, and Mr. O'Neal, who can carry a basketball team, lacks the charisma to rescue this misguided effort.
--------------------------------------------------

Gung Ho 	A disappointment, a movie in which the Japanese are mostly used for the mechanical requirements of the plot, and the Americans are constructed from durable but boring stereotypes.
----------------------------------------

In [112]:
# subset to movies with at least 10 reviews:
movie_counts = pp.movie.value_counts().reset_index()
movie_counts.columns = ['movie','counts']
movie_counts.head()

| | movie | counts |
|---|---|---|
| 0 | The Hurricane | 20 |
| 1 | War of the Worlds | 20 |
| 2 | My Fellow Americans | 20 |
| 3 | Shakespeare in Love | 20 |
| 4 | The Sixth Sense | 20 |


In [114]:
pp_movies = pp[['movie','prob_fresh']].groupby('movie').mean().reset_index()
pp_movies = pp_movies[pp_movies.movie.isin(movie_counts[movie_counts.counts >= 10].movie)]

In [117]:
pp_movies.sort_values('prob_fresh', ascending=False, inplace=True)
pp_movies.head(20)

| | movie | prob_fresh |
|---|---|---|
| 1417 | The Iron Giant | 0.979671 |
| 862 | Midnight Run | 0.938409 |
| 209 | Boogie Nights | 0.932932 |
| 830 | Manhattan | 0.923222 |
| 1447 | The Little Mermaid | 0.915153 |
| 652 | Il conformista | 0.910188 |
| 1058 | Raging Bull | 0.909178 |
| 1055 | Quiz Show | 0.897758 |
| 1615 | Toy story | 0.894194 |
| 298 | Cookie's Fortune | 0.889868 |


In [118]:
pp_movies.sort_values('prob_fresh', ascending=True, inplace=True)
pp_movies.head(20)

| | movie | prob_fresh |
|---|---|---|
| 152 | Basic Instinct 2 | 0.174275 |
| 1205 | Spy Hard | 0.178672 |
| 1661 | Vegas Vacation | 0.17886 |
| 1669 | Virus | 0.198663 |
| 1632 | Twisted | 0.207847 |
| 139 | Bad Boys | 0.211646 |
| 288 | Color of Night | 0.212499 |
| 9 | 3 Strikes | 0.21675 |
| 391 | Dracula: Dead and Loving It | 0.219904 |
| 627 | House Arrest | 0.223762 |
