<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Sentiment Analysis and Naive Bayes

_Instructor: Aymeric Flaisler_

---

In the sentiment analysis lesson we used a predefined dictionary of positive and negative valences for words. This lab inverts the process: you'll find which words are most likely to appear in positive or negative reviews by using the rotten vs. fresh binary label.

### Naive Bayes

A practical and common way to do this is with the Naive Bayes algorithm. For this lab you'll be using the sklearn implementation.

Given a feature $x_i$ and target $y_i$, Naive Bayes classifiers estimate $P(x_i \;|\; y_i)$: the probability of observing a feature/predictor _given_ the value of the target, for example given that the target is 1.

We'll use this to figure out which words are more likely to appear when the target is 1 ("fresh") vs when the target is 0 ("rotten").
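
As a rough illustration of the quantity described above, the sketch below estimates $P(\text{word} \;|\; \text{class})$ as the share of reviews in each class that contain the word. The tiny matrix, labels, and word list are made up for illustration and are not taken from the lab's dataset.

```python
import numpy as np

# Hypothetical toy data: rows are reviews, columns are binary word indicators.
vocab = ["great", "boring", "film"]
X = np.array([
    [1, 0, 1],  # fresh
    [1, 0, 0],  # fresh
    [0, 1, 1],  # rotten
    [0, 1, 0],  # rotten
    [1, 0, 1],  # fresh
])
y = np.array([1, 1, 0, 0, 1])  # 1 = fresh, 0 = rotten

# P(word | class): fraction of reviews in that class that contain the word.
p_word_given_fresh = X[y == 1].mean(axis=0)
p_word_given_rotten = X[y == 0].mean(axis=0)

for word, pf, pr in zip(vocab, p_word_given_fresh, p_word_given_rotten):
    print(f"{word:>7}: P(word|fresh)={pf:.2f}  P(word|rotten)={pr:.2f}")
```

In practice these raw frequencies are smoothed so that a word never seen in one class does not get a probability of exactly 0.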

---

### 1. Load packages and movie data

Do any cleaning you deem necessary.

In [43]:
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes predictors are binary encoded.
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.feature_extraction.text import CountVectorizer

In [44]:
rt = pd.read_csv('./datasets/rt_critics.csv')

In [45]:
rt = rt[rt.fresh.isin(['fresh','rotten'])]
rt.fresh = rt.fresh.map(lambda x: 1 if x == 'fresh' else 0)

In [46]:
rt.head(2)

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,1,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,1,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story


---

### 2. We need to create a predictor matrix of words from the quotes with CountVectorizer

It is up to you what ngram range you want to select. **Make sure that `binary=True`**
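
To see what `binary=True` changes, here is a small sketch on a made-up two-review corpus. With `binary=False` a repeated word is counted each time it appears; with `binary=True` the matrix only records presence or absence, which is the encoding `BernoulliNB` expects.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus; "great" appears twice in the first review.
docs = ["great great film", "boring film"]

count_cv = CountVectorizer(binary=False).fit(docs)
binary_cv = CountVectorizer(binary=True).fit(docs)

# vocabulary_ maps each token to its column index in the output matrix.
raw_value = count_cv.transform(docs).toarray()[0, count_cv.vocabulary_["great"]]
binary_value = binary_cv.transform(docs).toarray()[0, binary_cv.vocabulary_["great"]]

print(raw_value, binary_value)
```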

In [47]:
cv = CountVectorizer(ngram_range=(1,2), max_features=2500, binary=True, stop_words='english')
words = cv.fit_transform(rt.quote)

In [48]:
words.shape

(14049, 2500)

In [49]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words = pd.DataFrame(words.toarray(), columns=cv.get_feature_names_out())

In [50]:
words.head()

Unnamed: 0,10,100,13,1961,1998,20,2001,30,40,50s,...,year,year old,years,years ago,yes,york,young,younger,youth,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
print(words.shape)

(14049, 2500)


---

### 3. Split the data into training and testing sets

You should keep 25% of the data in the test set.
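
A minimal sketch of a 25% hold-out on stand-in data follows. The `random_state` and `stratify` arguments are optional additions, not something the lab requires: the first makes the split repeatable, and the second keeps the fresh/rotten ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 60% positive labels.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 60 + [0] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # hold out 25% of the rows
    random_state=42,  # assumed seed; any fixed integer works
    stratify=y,       # keep the class ratio the same in both splits
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```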

In [52]:
Xtrain, Xtest, ytrain, ytest = train_test_split(words.values, rt.fresh.values, test_size=0.25)
print(Xtrain.shape, Xtest.shape)

(10536, 2500) (3513, 2500)


---

### 4. Build a `BernoulliNB` model predicting fresh vs. rotten from the word appearances

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to baseline.

In [53]:
# A:
from sklearn.naive_bayes import BernoulliNB


In [54]:
clf = BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
clf.fit(Xtrain, ytrain)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [55]:
# baseline: share of fresh reviews in the training labels
np.mean(ytrain)

0.6139901290812453

In [56]:
cross_val_score(clf,Xtrain, ytrain ,cv=5).mean()

0.7387071331162396
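
`np.mean(ytrain)` is the share of fresh reviews, which doubles as the baseline accuracy only because fresh is the majority class. A slightly more general sketch, on made-up labels, takes the share of whichever class is larger and checks it against scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels, roughly 61% "fresh" like the lab data.
y = np.array([1] * 61 + [0] * 39)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Baseline accuracy is the share of the larger class.
p_fresh = y.mean()
baseline = max(p_fresh, 1 - p_fresh)

# A model that always predicts the most frequent class should score about the same.
dummy = DummyClassifier(strategy="most_frequent")
dummy_cv = cross_val_score(dummy, X, y, cv=5).mean()

print(baseline, dummy_cv)
```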


---

### 5. Pull out the probability of words given "fresh"

The `.feature_log_prob_` attribute of the fitted Naive Bayes model contains the log probabilities of a feature appearing given a target class.

The rows correspond to the class of the target, and the columns correspond to the features. The first row is the 0 "rotten" class, and the second is the 1 "fresh" class.
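
The sketch below, on made-up data, checks the row order against `classes_` and compares the exponentiated values with a hand-computed smoothed estimate. The smoothing formula in the comment reflects our understanding of how `BernoulliNB` applies `alpha`; treat it as an assumption rather than a specification.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: 5 reviews, 3 binary word features.
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0], [1, 0, 1]])
y = np.array([1, 1, 0, 0, 1])

nb = BernoulliNB(alpha=1.0).fit(X, y)

print(nb.classes_)                 # row order of feature_log_prob_
print(nb.feature_log_prob_.shape)  # (n_classes, n_features)

# Assumed smoothing: (reviews in class containing word + alpha)
#                    / (reviews in class + 2 * alpha)
alpha = 1.0
manual = np.vstack([
    (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
    for c in nb.classes_
])
probs = np.exp(nb.feature_log_prob_)

print(np.allclose(manual, probs))
```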

#### 5.1 Pull out the log probabilities and convert them to probabilities (for fresh and for rotten).

In [57]:
df = clf.feature_log_prob_

In [58]:
# A:
df_n = pd.DataFrame(np.exp(df)).T
df_n.columns = ['rotten','fresh']

#### 5.2 Make a dataframe with the probabilities and features

In [59]:
# A:
# Attach the vocabulary as a 'words' column; row i of df_n corresponds to column i of words.
words_j = pd.DataFrame(np.array(words.columns), columns=['words'])
df_n = df_n.join(words_j)

In [41]:
df_n.head()

Unnamed: 0,rotten,fresh,words
0,0.004179,0.002318,10
1,0.001229,0.001391,100
2,0.001229,0.001391,13
3,0.000246,0.001391,1961
4,0.000983,0.001082,1998


#### 5.3 Create a column that is the difference between fresh probability of appearance and rotten

In [60]:
# A:
df_n.isna().sum()

rotten    0
fresh     0
words     0
dtype: int64

In [63]:
df_n['difference']= df_n.loc[:,'fresh'] - df_n.loc[:,'rotten']
df_n.head()

Unnamed: 0,rotten,fresh,words,difference
0,0.003686,0.002163,10,-0.001523
1,0.001475,0.001236,100,-0.000238
2,0.000983,0.001391,13,0.000408
3,0.000246,0.001391,1961,0.001145
4,0.001229,0.000927,1998,-0.000302


#### 5.4 Look at the most likely words for fresh and rotten reviews

In [64]:
# A:
df_n.sort_values(by=['difference'],ascending=False).head(15)



Unnamed: 0,rotten,fresh,words,difference
825,0.119194,0.157781,film,0.038587
193,0.017449,0.042343,best,0.024894
965,0.010568,0.028125,great,0.017558
693,0.005407,0.022253,entertaining,0.016846
1584,0.006636,0.022717,performance,0.016081
900,0.012042,0.025344,fun,0.013302
87,0.008356,0.020862,american,0.012506
694,0.005407,0.017153,entertainment,0.011747
2248,0.026296,0.037861,time,0.011565
837,0.01278,0.023489,films,0.01071


---

### 6. Examine how your model performs on the test set

In [65]:
# A:
y_predict=clf.predict(Xtest)

In [68]:
# baseline: share of fresh reviews in the test labels
np.mean(ytest)

0.6103045829775121

In [66]:
cross_val_score(clf, Xtest,ytest,cv=5).mean()

0.6920034983553265
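
Note that `cross_val_score` refits a fresh copy of the estimator on folds of whatever data it receives, so the score above comes from models trained on parts of the test set, not from the model fit on the training data. A sketch of scoring an already-fitted model on held-out rows, using synthetic data in place of the review matrix:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Hypothetical synthetic data: feature 0 is informative, the rest are noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = rng.integers(0, 2, size=(400, 5))
X[:, 0] = y  # make the first feature match the label exactly

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

nb = BernoulliNB().fit(X_train, y_train)  # fit on training data only

# Score the fitted model on the held-out rows without refitting.
y_pred = nb.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(test_accuracy, nb.score(X_test, y_test))
```

In the lab's terms, this corresponds to comparing `y_predict` with `ytest`, or calling `clf.score(Xtest, ytest)`.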

---

### 7. Look at the top 10 movies and reviews likely to be fresh and top 10 likely to be rotten

You can fit the model on the full set of data for this.

> **Note:** Naive Bayes, while good at classifying, is known to be somewhat bad at giving accurate predicted probabilities (beyond getting it on the correct side of 50%). It is a good classifier but a bad estimator. 
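
One reason for this is the independence assumption: correlated features are counted as separate pieces of evidence, which pushes the predicted probabilities toward 0 or 1. The sketch below shows the effect on made-up data by repeating a single mildly informative word ten times.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical data: one word present in 70% of fresh and 30% of rotten reviews.
y = np.array([1] * 50 + [0] * 50)
x = np.array([1] * 35 + [0] * 15 + [1] * 15 + [0] * 35).reshape(-1, 1)

# The same word repeated 10 times, mimicking strongly correlated features
# (for example overlapping unigrams and bigrams).
x_dup = np.repeat(x, 10, axis=1)

# Predicted P(fresh) for a review that contains the word.
p_single = BernoulliNB().fit(x, y).predict_proba([[1]])[0, 1]
p_dup = BernoulliNB().fit(x_dup, y).predict_proba([[1] * 10])[0, 1]

print(round(p_single, 3), round(p_dup, 4))
```

The predicted class is the same in both cases, but the duplicated version is far more confident even though it carries no additional information.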

In [69]:
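# A:
# Despite the names, train_data holds the full word matrix (features)
# and test_data holds the full set of fresh/rotten labels (target).
train_data = words
test_data = rt.fresh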

In [73]:
model = BernoulliNB().fit(train_data,test_data)

In [74]:
prob_movies = model.predict_proba(train_data)

In [75]:
df= pd.DataFrame({'movie_title':rt.title, 'quote':rt.quote,
                  'prob_fresh':model.predict_proba(train_data)[:,1]})

In [77]:
df.head(2)

Unnamed: 0,movie_title,quote,prob_fresh
0,Toy story,"So ingenious in concept, design and execution ...",0.68813
1,Toy story,The year's most inventive comedy.,0.687331


In [79]:
df_rotten = df.copy(deep=True)
df_rotten.sort_values(by='prob_fresh', ascending=True, inplace=True)

In [80]:
# the 10 reviews the model considers least likely to be fresh
df_rotten.iloc[:10,[1,2]]

Unnamed: 0,quote,prob_fresh
12567,"With intentionally stilted animation, uninspir...",1.2e-05
3546,There's not enough story here for something ha...,1.3e-05
3521,"As fairy tale, buddy comedy, family drama, thr...",4.1e-05
9906,"A disappointment, a movie in which the Japanes...",5e-05
2112,Imagine the dumbest half-hour sitcom you've ev...,6.2e-05
13795,Unless you have a craving to watch a sluggish ...,6.4e-05
1580,"Despite some delicious moments, this sluggish,...",6.8e-05
898,Basic Instinct 2 has a stylish look and a few ...,6.8e-05
5680,A compendium of the worst cliches of Japanese ...,0.000123
6837,"Pointless, plodding plotting; asinine action; ...",0.000137


---

### 8. Find the movies most likely to be fresh and rotten among those with at least 10 reviews.

In [82]:
# A:

# top 10 fresh movies with at least 10 reviews.
fresh_reviewd=pd.DataFrame(df.movie_title.value_counts().reset_index())
fresh_reviewd.columns=['movie_title','no.reviews']
fresh_reviewd=fresh_reviewd.join(df.prob_fresh)
fresh_reviewd=fresh_reviewd.sort_values(by=['prob_fresh'],ascending=False)
fresh_reviewd[fresh_reviewd['no.reviews']>=10].head(10)

Unnamed: 0,movie_title,no.reviews,prob_fresh
261,Supernova,15,0.999977
399,Sweet and Lowdown,12,0.99995
381,The Love Letter,13,0.999917
305,Sense and Sensibility,14,0.999856
312,A League of Their Own,14,0.999762
49,Heat,20,0.999703
228,Ever After,16,0.999443
532,City Lights,10,0.999318
367,Midnight Run,13,0.999311
55,Gojira,20,0.999126
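
Joining `df.prob_fresh` onto the per-movie counts by index appears to pair each movie with the probability of a single, unrelated review rather than with its own reviews. One alternative, sketched here on made-up data with a threshold of 2 instead of 10, is to group the per-review probabilities by title and average them. The column names `n_reviews` and `mean_prob_fresh` are illustrative choices.

```python
import pandas as pd

# Hypothetical per-review predictions, shaped like the df built in section 7.
reviews = pd.DataFrame({
    "movie_title": ["A", "A", "A", "B", "B", "C"],
    "prob_fresh": [0.9, 0.8, 0.7, 0.2, 0.4, 0.99],
})

# One row per movie: number of reviews and average predicted P(fresh).
per_movie = (
    reviews.groupby("movie_title")["prob_fresh"]
    .agg(n_reviews="count", mean_prob_fresh="mean")
    .reset_index()
)

min_reviews = 2  # the lab asks for 10; lowered for this toy example
eligible = per_movie[per_movie["n_reviews"] >= min_reviews]

most_fresh = eligible.sort_values("mean_prob_fresh", ascending=False)
most_rotten = eligible.sort_values("mean_prob_fresh", ascending=True)

print(most_fresh)
```

Averaging is only one way to summarize a movie's reviews; the median or the share of reviews predicted fresh would work with the same grouping.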
