# Naive Bayes for Sentiment Analysis

## Introduction

In this mission, we'll be working with a CSV file containing movie reviews. Each row contains the text of the review, as well as a number indicating whether the tone of the review is positive (`1`) or negative (`-1`).

We want to predict whether a review is negative or positive, based on the text alone. To do this, we'll train an algorithm using the reviews and classifications in `train.csv`, and then make predictions on the reviews in `test.csv`. We'll be able to calculate our error using the actual classifications in `test.csv` to see how good our predictions were.

We'll use [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) for our classification algorithm. A Naive Bayes classifier works by figuring out how likely data attributes are to be associated with a certain class.

This classifier is based on [Bayes' theorem](http://en.wikipedia.org/wiki/Bayes%27_theorem), which is:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

This equation states that "the probability of A given that B is true equals the probability of B given that A is true times the probability of A being true, divided by the probability of B being true."

Let's do a quick exercise to understand this rule better.

In [2]:
# Let's say this is your running history for the past week
# For each day, it records whether or not you ran, and whether or not you were tired
days = [["ran", "was tired"], ["ran", "was not tired"], ["didn't run", "was tired"], ["ran", "was tired"], ["didn't run", "was not tired"], ["ran", "was not tired"], ["ran", "was tired"]]

# Let's say we want to use Bayes' theorem to calculate the probability that you were tired, given that you ran
# This is P(A)
prob_tired = len([d for d in days if d[1] == "was tired"]) / len(days)
# This is P(B)
prob_ran = len([d for d in days if d[0] == "ran"]) / len(days)
# This is P(B|A)
prob_ran_given_tired = len([d for d in days if d[0] == "ran" and d[1] == "was tired"]) / len([d for d in days if d[1] == "was tired"])

# Now we can calculate P(A|B)
prob_tired_given_ran = (prob_ran_given_tired * prob_tired) / prob_ran

print("Probability of being tired given that you ran: {0}".format(prob_tired_given_ran))

Probability of being tired given that you ran: 0.6
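
As a sanity check, the same conditional probability can be read straight from the data by counting only the days you ran. This is a minimal sketch that reuses the `days` list from the exercise; the names `ran_days` and `direct_estimate` are introduced here for illustration only.

```python
# Same running history as in the exercise above
days = [["ran", "was tired"], ["ran", "was not tired"], ["didn't run", "was tired"], ["ran", "was tired"], ["didn't run", "was not tired"], ["ran", "was not tired"], ["ran", "was tired"]]

# Restrict attention to the days where B ("ran") is true
ran_days = [d for d in days if d[0] == "ran"]

# Fraction of those days where A ("was tired") is also true: P(A|B) by direct counting
direct_estimate = len([d for d in ran_days if d[1] == "was tired"]) / len(ran_days)

print("Direct count estimate: {0}".format(direct_estimate))
```

Three of the five running days were also tired days, so the direct count is 3/5, which agrees with the value obtained through Bayes' theorem.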


## Overview of Naive Bayes

Let's try a slightly different example. Say we still have one classification: whether or not you were tired. This time we have two features: whether or not you ran, and whether or not you woke up early. Bayes' theorem still holds here, but applying it directly would require estimating the probability of every combination of feature values for each class, which takes more and more data as the number of features grows.

This is where Naive Bayes can help. Naive Bayes extends Bayes' theorem to handle this case by assuming that the features are independent of each other given the class, so each feature's probability can be estimated separately.

The formula looks like this:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}
$$

This is saying "the probability that classification $y$ is correct given the features $x_1$, $x_2$, and so on equals the probability of $y$ times the product of the probabilities of each feature $x_i$ given $y$, divided by the probability of the $x$ features".

To find the "right" classification, we compute $P(y \mid x_1, \dots, x_n)$ for each possible classification $y$ and pick the one with the highest probability.
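
One consequence worth spelling out: the denominator $P(x_1, \dots, x_n)$ does not depend on $y$, so it is the same for every class and cannot change which class comes out on top. The decision rule can therefore be written using the numerator alone, where $\hat{y}$ (a symbol introduced here) denotes the predicted class:

$$
\hat{y} = \underset{y}{\arg\max} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$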

In [4]:
# Here's our data, but with "woke up early" or "didn't wake up early" added
days = [["ran", "was tired", "woke up early"], ["ran", "was not tired", "didn't wake up early"], ["didn't run", "was tired", "woke up early"], ["ran", "was tired", "didn't wake up early"], ["didn't run", "was tired", "woke up early"], ["ran", "was not tired", "didn't wake up early"], ["ran", "was tired", "woke up early"]]

# We're trying to predict whether or not you were tired on this day
new_day = ["ran", "didn't wake up early"]

def calc_y_probability(y_label, days):
    return len([d for d in days if d[1] == y_label]) / len(days)

def calc_ran_probability_given_y(ran_label, y_label, days):
    # P(ran_label | y_label): count only within the days that have this class
    days_with_y = [d for d in days if d[1] == y_label]
    return len([d for d in days_with_y if d[0] == ran_label]) / len(days_with_y)

def calc_woke_early_probability_given_y(woke_label, y_label, days):
    # P(woke_label | y_label): count only within the days that have this class
    days_with_y = [d for d in days if d[1] == y_label]
    return len([d for d in days_with_y if d[2] == woke_label]) / len(days_with_y)

# P(x_1, x_2): the fraction of days whose features match the new day
denominator = len([d for d in days if d[0] == new_day[0] and d[2] == new_day[1]]) / len(days)
# Plug all the values into our formula
# Multiply the class (y) probability by the probability of each x-value occurring given that class
prob_tired = (calc_y_probability("was tired", days) * calc_ran_probability_given_y(new_day[0], "was tired", days) * calc_woke_early_probability_given_y(new_day[1], "was tired", days)) / denominator

prob_not_tired = (calc_y_probability("was not tired", days) * calc_ran_probability_given_y(new_day[0], "was not tired", days) * calc_woke_early_probability_given_y(new_day[1], "was not tired", days)) / denominator

# Make a classification decision based on the probabilities
classification = "was tired"
if prob_not_tired > prob_tired:
    classification = "was not tired"
print("Final classification for new day: {0}.\nTired probability: {1:.3f}.\nNot tired probability: {2:.3f}.".format(classification, prob_tired, prob_not_tired))

Final classification for new day: was not tired.
Tired probability: 0.200.
Not tired probability: 0.667.
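
The per-feature helpers above can be folded into a single function that handles any number of features. The following is a minimal sketch, not a definitive implementation: `score_classes` and its parameters `label_col` and `feature_cols` are hypothetical names introduced here. It estimates each $P(x_i \mid y)$ by counting within the days that belong to class $y$, and it returns only the numerator $P(y) \prod_i P(x_i \mid y)$, since the denominator is shared by all classes.

```python
from collections import Counter

def score_classes(rows, label_col, feature_cols, new_features):
    """Return P(y) * product of P(x_i | y) for each class y, estimated by counting."""
    class_counts = Counter(row[label_col] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        # Prior P(y): fraction of all rows that carry this class label
        score = n_label / len(rows)
        for col, value in zip(feature_cols, new_features):
            # Likelihood P(x_i | y): counted only among the rows of this class
            matches = sum(1 for row in rows if row[label_col] == label and row[col] == value)
            score *= matches / n_label
        scores[label] = score
    return scores

days = [["ran", "was tired", "woke up early"], ["ran", "was not tired", "didn't wake up early"], ["didn't run", "was tired", "woke up early"], ["ran", "was tired", "didn't wake up early"], ["didn't run", "was tired", "woke up early"], ["ran", "was not tired", "didn't wake up early"], ["ran", "was tired", "woke up early"]]
new_day = ["ran", "didn't wake up early"]

# Column 1 holds the class; columns 0 and 2 hold the two features
scores = score_classes(days, label_col=1, feature_cols=[0, 2], new_features=new_day)

# The predicted class is the one with the largest score
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Adding another feature then only requires another column in `days` and another entry in `feature_cols`, rather than a new helper function.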
