
# Sentiment Analysis and Naive Bayes

_Authors: Kiefer Katovich (SF)_

---

In the sentiment analysis lesson, we used a predefined dictionary of positive and negative valences for words. This lab inverts that process: you'll use the binary rotten-versus-fresh label to find which words are most likely to appear in positive or negative reviews.

### Naive Bayes

A common, practical way to do this is with the Naive Bayes algorithm. Naive Bayes classifiers are covered in more depth in another lecture; for this lab, you'll just use the scikit-learn implementation.

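Given a feature, $x_i$, and target, $y_i$, Naive Bayes classifiers estimate $P(x_i \;|\; y_i)$ for each class. In other words, they estimate the probability of a feature/predictor appearing _given_ the value of the target (for example, given that the target is one). To classify, they combine these per-feature probabilities with Bayes' rule, treating the features as independent within each class.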

We'll use this to figure out which words are more likely to appear when the target is one ("fresh") versus zero ("rotten").
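
To make $P(x_i \;|\; y_i)$ concrete, here is a minimal sketch on made-up data (the matrix and labels are invented for illustration, not taken from the lab's dataset). It checks that the values scikit-learn's `BernoulliNB` stores match a hand-computed, Laplace-smoothed frequency, assuming the default `alpha=1.0`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy data (invented): 6 documents, 2 binary word features.
X = np.array([
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 1],
    [0, 0],
])
y = np.array([1, 1, 1, 0, 0, 0])

model = BernoulliNB(alpha=1.0).fit(X, y)

# Hand-computed smoothed estimate for each class:
# (documents in class containing the word + alpha) / (documents in class + 2 * alpha)
manual = np.vstack([
    (X[y == c].sum(axis=0) + 1.0) / ((y == c).sum() + 2.0)
    for c in model.classes_
])

# The model stores log probabilities; exponentiate to compare.
fitted = np.exp(model.feature_log_prob_)
print(fitted)
```

Each row of `fitted` is one class and each column is one word, so the first column shows how much more often the first word appears when the target is one than when it is zero.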

---

### 1) Load the packages and movie data.

Perform any necessary cleaning.

In [1]:
import pandas as pd
import numpy as np

# We are using the BernoulliNB version of Naive Bayes, which assumes that predictors are binary encoded.
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
rt = pd.read_csv('./datasets/rt_critics.csv')

In [3]:
# A:

---

### 2) Create a predictor matrix of words in the quotes with `CountVectorizer`.

It's up to you to select an n-gram range. **Make sure that `binary=True`**.

In [4]:
# A:

---

### 3) Split the data into training and testing sets.

You should keep 25 percent of the data in the testing set.
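
A minimal sketch of a 75/25 split on placeholder arrays follows. The data is randomly generated, and `stratify=y` and `random_state=42` are optional choices assumed here to keep class proportions stable and the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 5 binary features, 40 zeros and 60 ones.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 5))
y = np.array([0] * 40 + [1] * 60)

# test_size=0.25 holds out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```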

In [5]:
# A:

---

### 4) Build a `BernoulliNB` model predicting fresh versus rotten from the word occurrences.

The model should only be built (and cross-validated) on the training data.

Cross-validate the score and compare it to the baseline.
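
One way to read "compare it to the baseline" is to compare mean cross-validated accuracy against the accuracy of always guessing the majority class. The sketch below assumes that interpretation and uses synthetic stand-in data, with one feature deliberately made predictive, and an arbitrary `cv=5`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for a training split (not the lab's data).
rng = np.random.RandomState(0)
y_train = rng.randint(0, 2, size=200)
X_train = rng.randint(0, 2, size=(200, 6))
X_train[:, 0] = y_train  # make the first feature informative; the rest are noise

# Baseline: accuracy from always predicting the majority class.
baseline = max(y_train.mean(), 1 - y_train.mean())

# Default scoring for a classifier is accuracy.
scores = cross_val_score(BernoulliNB(), X_train, y_train, cv=5)
print("baseline:", round(baseline, 3))
print("cv mean accuracy:", round(scores.mean(), 3))
```

A model is only worth keeping if its cross-validated score clearly beats the baseline.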

In [6]:
# A:

---

### 5) Pull out the probability of words given "fresh."

The `.feature_log_prob_` attribute of the Naive Bayes model contains the log probabilities of a feature appearing given a target class.

The rows correspond to the class of the target and the columns correspond to the features. The first row is the zero, "rotten" class and the second row is the one, "fresh" class.
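
The row and column layout can be seen on a tiny invented example. The words, matrix, and column names below are hypothetical and chosen only to show the shape of the result; the same pattern would apply to a model fit on the real predictor matrix.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Invented binary word-occurrence matrix; the words are hypothetical.
words = ["brilliant", "boring", "film"]
X = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [0, 1, 0],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = fresh, 0 = rotten

model = BernoulliNB().fit(X, y)

# Row order follows model.classes_, which is [0, 1] here.
probs = np.exp(model.feature_log_prob_)

word_probs = pd.DataFrame({
    "word": words,
    "p_rotten": probs[0],
    "p_fresh": probs[1],
})
word_probs["fresh_minus_rotten"] = word_probs["p_fresh"] - word_probs["p_rotten"]

print(word_probs.sort_values("fresh_minus_rotten", ascending=False))
```

Sorting by the difference puts words that favor "fresh" at the top and words that favor "rotten" at the bottom, while words that appear equally often in both classes sit near zero.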

#### 5.A) Pull out the log probabilities and convert them to probabilities for fresh and rotten.

In [7]:
# A:

#### 5.B) Make a DataFrame with the probabilities and features.

In [8]:
# A:

#### 5.C) Create a column that is the difference between each word's probability of appearing in fresh reviews and in rotten reviews.

In [9]:
# A:

#### 5.D) Look at the most likely words for fresh and rotten reviews.

In [10]:
# A:

---

### 6) Examine how your model performs on the testing set.

In [11]:
# A:

---

### 7) Look at the 10 movies and reviews most likely to be fresh and the 10 most likely to be rotten.

You can fit the model on the full set of data for this.

> **Note:** While it's good at classifying, Naive Bayes is known to be poor at providing accurate predicted probabilities (beyond landing on the correct side of 50 percent). It's a good classifier but a bad estimator.
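
Because the probabilities are better for ranking than for interpretation, a reasonable approach is to sort reviews by predicted probability of "fresh". The sketch below uses invented quotes and hypothetical column names (`quote`, `fresh`, `p_fresh`); the lab's actual column names may differ.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Invented reviews and labels for illustration only.
reviews = pd.DataFrame({
    "quote": [
        "a brilliant moving film",
        "brilliant and funny",
        "funny moving story",
        "a boring tedious film",
        "tedious and dull",
        "dull boring story",
    ],
    "fresh": [1, 1, 1, 0, 0, 0],
})

X = CountVectorizer(binary=True).fit_transform(reviews["quote"])
model = BernoulliNB().fit(X, reviews["fresh"])

# predict_proba columns follow model.classes_, so look up the column for class 1.
fresh_col = list(model.classes_).index(1)
reviews["p_fresh"] = model.predict_proba(X)[:, fresh_col]

ranked = reviews.sort_values("p_fresh", ascending=False)
print(ranked.head(3))  # most likely fresh
print(ranked.tail(3))  # most likely rotten
```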

In [12]:
# A:

---

### 8) Find the movies with at least 10 reviews that are most likely to be fresh or rotten.
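
One possible approach is to average the per-review predicted probability for each movie and keep only movies with enough reviews. The helper `rank_titles` and the column names `title` and `p_fresh` are hypothetical, and the demo values are invented.

```python
import pandas as pd

def rank_titles(df, min_reviews=10):
    """Average predicted fresh probability per title, keeping only
    titles with at least `min_reviews` rows."""
    summary = (
        df.groupby("title")["p_fresh"]
        .agg(mean_p_fresh="mean", n_reviews="count")
    )
    summary = summary[summary["n_reviews"] >= min_reviews]
    return summary.sort_values("mean_p_fresh", ascending=False)

# Invented per-review predictions; titles and values are hypothetical.
demo = pd.DataFrame({
    "title": ["Movie A"] * 12 + ["Movie B"] * 10 + ["Movie C"] * 3,
    "p_fresh": [0.9] * 12 + [0.2] * 10 + [0.99] * 3,
})

result = rank_titles(demo)
print(result)
```

The review-count filter keeps a movie with only a handful of reviews, such as the hypothetical "Movie C", from topping the list on thin evidence.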

In [13]:
# A: