In [1]:
from __future__ import division
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Naive Bayes

_Authors: Dan Wilhelm (LA) and Alex Combs (NYC)_

---

<a id="learning-objectives"></a>
### Learning Objectives
*After this lesson, you will be able to:*
- Describe Naive Bayes
- Choose a Naive Bayes implementation based on your use case
- Implement a Naive Bayes model through scikit-learn



### Lesson Guide

- [Discriminative Models vs Generative Models](#discriminative-models-vs-generative-models)
    - [Discriminative vs Generative Example](#discriminative-vs-generative-example)
    - [Making a Generative Model (joint probability distribution)](#making-a-generative-model-joint-probability-distribution)
    - [The Joint Probability Generative Model is not Practical](#the-joint-probability-generative-model-is-not-practical)
    - [Making a Discriminative Model (logistic regression)](#making-a-discriminative-model-logistic-regression)


- [A Better Generative Model?](#a-better-generative-model)
    - [Bayes' Theorem](#bayes-theorem)
    - [Conditional probability](#conditional-probability)


- [A really simple spam example](#a-really-simple-spam-example)
    - [Problem: Multiple features require joint probabilities](#problem-multiple-features-require-joint-probabilities)
    - [Solution: The Naive Bayes Independence Assumption](#solution-the-naive-bayes-independence-assumption)
    - [Spam application: Multiple Features](#spam-application-multiple-features)


- [Production Issues](#production-issues)
- [Summary](#summary)


- [Implementation in Scikit-learn](#implementation-in-scikit-learn)
- [Guided practice: Scikit-learn implementation](#guided-practice-scikit-learn-implementation)
- [Conclusion](#conclusion)


<a id="discriminative-models-vs-generative-models"></a>
## Discriminative Models vs Generative Models
---

Logistic Regression is a **discriminative model**. 

+ Its equation $P\;(\;class\;|\;features\;)$ **discriminates** between two classes. In other words, it describes the boundary between the classes.
+ From this equation, we cannot generate "typical" members of either class -- we only know their boundary.

Naive Bayes is a **generative model**. Using it, we can **generate** typical members of each class, since we know what a "typical" member of each class looks like.

+ Note that we are still estimating $P\;(\;class\;|\;features\;)$ for each class!
+ By computing how "typical" each example is to each class, we can choose the most likely class.

<a id="discriminative-vs-generative-example"></a>
### Discriminative vs Generative Example

Let's make a simple generative model from scratch. Suppose we are attempting to infer whether someone is a GA data science student or not. To do this, we take members of the general population and evaluate three binary features:

- **G**: At GA.
- **C**: Has Computer.
- **S**: Has Stats Book.

Our model of a "person" is now the presence or absence of each of these three features.

#### Data

We sample ten people in each class. We'll wildly assume that these GA/non-GA samples are representative of the population for each combination of features.

- **GA Student** - GCS, GC, GCS, GCS, GC, C, C, GC, GCS, GCS
- **Not GA Student** - none, none, C, GC, C, none, C, C, CS, none

We grabbed ten examples per class. Note this is typically the data we have access to in most data science problems -- examples of each class of interest. However, to be representative of the overall population, should we have gotten ten examples from each of the eight categories instead? Which technique would be more accurate? Think about that, while we take this toy data and tally it.

<a id="making-a-generative-model-joint-probability-distribution"></a>
### Making a Generative Model (joint probability distribution)

Summarizing this data per class  (where $\neg$ indicates negation):

|                   | GCS | GC | CS | S | C | none
| ---               | --- | -- | -- | - | - | -
| **Student**       | 5   | 3  | 0  | 0 | 2 | 0 
| **$\neg$Student** | 0   | 1  | 1  | 0 | 4 | 4


Directly from these, we'll make a generative model of a student:

- $P(\;Student\;|\;GCS\;) = 1$.
- $P(\;Student\;|\;GC\;) = 3/4$.
- $P(\;Student\;|\;CS\;) = 0$.
- $P(\;Student\;|\;C\;) = 1/3$.
- $P(\;Student\;|\;none\;) = 0$.

Similarly for a non-student:

- $P(\;\neg Student\;|\;GCS\;) = 0$.
- $P(\;\neg Student\;|\;GC\;) = 1/4$.
- $P(\;\neg Student\;|\;CS\;) = 1$.
- $P(\;\neg Student\;|\;C\;) = 2/3$.
- $P(\;\neg Student\;|\;none\;) = 1$.


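You may recognize that we computed each of these from **joint probabilities**: for example, $P(\;Student\;|\;GCS\;) = P(\;Student \cap GCS\;)\;/\;P(\;GCS\;)$, with one such calculation for every combination of features.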

Note we must store $2^3$ parameters in total -- the presence or absence of three features.
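
The tally and the conditional probabilities above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the variable names (`students`, `non_students`, `p_student_given`) are illustrative and not part of the original lesson.

```python
from collections import Counter
from fractions import Fraction

# Toy samples from the Data section; each string lists the features present.
students = ["GCS", "GC", "GCS", "GCS", "GC", "C", "C", "GC", "GCS", "GCS"]
non_students = ["none", "none", "C", "GC", "C", "none", "C", "C", "CS", "none"]

student_counts = Counter(students)
non_student_counts = Counter(non_students)

# P(Student | combination) = student count / total count for that combination.
p_student_given = {}
for combo in sorted(set(students) | set(non_students)):
    total = student_counts[combo] + non_student_counts[combo]
    p_student_given[combo] = Fraction(student_counts[combo], total)

for combo, p in p_student_given.items():
    print(combo, p)
```

Note that the combination with only $S$ never appears in the sample, so this approach has no estimate for it at all.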

#### Using the Generative Model

+ Suppose we see someone with $GCS$. We would then guess with confidence 1 he or she is a GA Student. If someone is $GC$, we would guess GA Student with confidence $3/4$.

+ Suppose we want to generate a sequence of "typical" GA Students. Easy -- with probability $\frac{1}{1 + 3/4 + 1/3} = \frac{12}{25}$ generate a $GCS$ person. With probability $\frac{3/4}{1 + 3/4 + 1/3} = \frac{9}{25}$ generate a $GC$ person!

<a id="the-joint-probability-generative-model-is-not-practical"></a>
### The Joint Probability Generative Model is not Practical

We saw earlier how to make a generative model strictly from joint probabilities. However, this method has major problems.

+ The number of parameters stored in the model increases exponentially. For example, if each feature is binary and we have 100 features, we would need $2^{100} \approx 10^{30}$ parameters for every joint probability!

+ Hence, we would need _enormous_ amounts of data to ensure we have sufficient training examples to evaluate each joint probability robustly (again, just to emphasize -- $2^{100}$ joint probabilities for only $100$ binary features).

Although each probability is easy to calculate, the joint probability model is simply not practical.
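
To get a feel for how quickly the table grows, here is a small sketch that counts the feature combinations a full joint table must cover. The helper name `joint_table_size` is hypothetical.

```python
def joint_table_size(n_features):
    """Number of distinct combinations of n binary features."""
    return 2 ** n_features

for n in (3, 10, 100):
    # Python integers have arbitrary precision, so 2**100 is computed exactly.
    print("{} features -> {:.3e} combinations".format(n, float(joint_table_size(n))))
```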

<a id="making-a-discriminative-model-logistic-regression"></a>
### Making a Discriminative Model (logistic regression)

For comparison, let's make a discriminative model using the same data. 

|                   | GCS | GC | CS | S | C | none
| ---               | --- | -- | -- | - | - | -
| **Student**       | 5   | 3  | 0  | 0 | 2 | 0 
| **$\neg$Student** | 0   | 1  | 1  | 0 | 4 | 4

Let's just eyeball a hyperplane that separates the classes. For simplicity, suppose the separating hyperplane (and corresponding link function) is given by the following formula, where $\sigma$ is the logistic function and $X$ is our feature vector. Each feature is encoded as $1$ if true and $-1$ if false (e.g. $G = 1$ if the person is at GA and $G = -1$ otherwise):

$$P(\;Student\;|\;X\;) = \sigma(5G - 2C + S)$$

You can see this model is approximately correct. Recall that $5G - 2C + S > 0$ indicates a probability over 0.5. This model makes sense -- being seen at GA is a strong indicator of being a GA student! Having a computer gets a small negative weight, since so many non-students also have computers.

#### Using the Discriminative Model

+ We can predict the probability that $GC$ is a student by letting $G = 1, C = 1, S = -1$. So $5G - 2C + S = 2 > 0$, so likely a student.

+ This model is more compact since we store fewer parameters -- $4$ (3 plus a bias) instead of $2^3$. (It even scales well -- linearly instead of exponentially!)

+ However, we cannot generate "typical" students with any accuracy. (Actually, in this particular case we can attempt to; however, with any substantial number of features typical members would be far too ambiguous.)
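
The eyeballed model can be evaluated for every feature combination with a short sketch. The function names `sigmoid` and `p_student` are illustrative; the weights are the ones from the formula above.

```python
import math
from itertools import product

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def p_student(g, c, s):
    """Eyeballed model from the text; each feature is +1 if present, -1 if absent."""
    return sigmoid(5 * g - 2 * c + s)

for g, c, s in product((1, -1), repeat=3):
    # Build a label such as "GC" from the features that are present.
    label = "".join(name for name, v in zip("GCS", (g, c, s)) if v == 1) or "none"
    print("{:>4}: {:.3f}".format(label, p_student(g, c, s)))
```

Every combination containing $G$ lands above 0.5 and every combination without it lands below, which reflects how heavily this model leans on a single feature.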

<a id="a-better-generative-model"></a>
## A Better Generative Model?
---

There must be some reasonable simplifications we can make to build a practical generative model. There are -- and they're nearly as simple to calculate!

Let's understand why we are doing this first. Recall that we are looking for this: $P\;(\;class\;|\;features\;)$. Here, 'class' refers to a category such as 'setosa', 'versicolor', 'Student', etc.

+ Computing this directly implies that we have lots of data for every feature combination. We typically don't have that! Typically, our training data only ensures we have sufficient examples for each **class**. For example, 10 examples of GA students and 10 examples of non-GA students.

+ Because our training data has lots of examples of each class, it would be great if we can flip that conditional probability! Bayes' Theorem to the rescue.

<a id="bayes-theorem"></a>
### Bayes' Theorem

### $$P\left(\;class\;|\;features\;\right) = \frac{P\left(\;features\;|\;class\;\right)P\left(\;class\;\right)}{P(\;features\;)} $$
 
Luckily for us, it's easy to compute these probabilities directly from a data table! 

+ **$P(\;class\;)$**. For example: $P(\;student\;) = P(\;\neg student\;) = 1/2$.
+ **$P(\;features\;)$**. For example, we had 5 $GCS$ combinations in 20 examples, so $P(\;GCS\;) = 5/20 = 1/4$.
+ **$P(\;features\;|\;class\;)$**. For example, for $GCS$ we had 5 students of 10. $P(\;GCS\;|\;Student\;) = 1/2$.

|                   | GCS | GC | CS | S | C | none
| ---               | --- | -- | -- | - | - | -
| **Student**       | 5   | 3  | 0  | 0 | 2 | 0 
| **$\neg$Student** | 0   | 1  | 1  | 0 | 4 | 4


Then what is $P(\;Student\;|\;GCS\;)$? Easy, it's just $\frac{1/2 * 1/2}{1/4} = 1$. That's exactly the value we computed earlier, directly from the table!

+ Does this work out for all of our earlier joint probabilities? See if you can understand why.
+ Hint: Recall that Bayes' Theorem is just a single conditional probability. The numerator is just the chain rule: $P(\;A \cap B\;) = P(\;A\;|\;B\;)\;P(\;B\;)$.

|                    | GCS                       | GC                       | ... |                         |
| ---                | ---                       | ---                      | --- | ---                     |
| **Student**        | Student $\cap$ GCS        | Student $\cap$ GC        | ... | P(Student) = 1/2        |
| **$\neg$ Student** | $\neg$ Student $\cap$ GCS | $\neg$ Student $\cap$ GC | ... | P($\neg$ Student) = 1/2 |
|                    | P(GCS) = 5/20             | P(GC) = 4/20             | ... |                         |
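
To check that Bayes' Theorem reproduces the direct calculation for every observed combination, here is a minimal sketch on the toy data. The variable names are illustrative, and exact fractions are used so the comparison is not affected by rounding.

```python
from collections import Counter
from fractions import Fraction

students = ["GCS", "GC", "GCS", "GCS", "GC", "C", "C", "GC", "GCS", "GCS"]
non_students = ["none", "none", "C", "GC", "C", "none", "C", "C", "CS", "none"]
n_total = len(students) + len(non_students)

student_counts = Counter(students)
all_counts = Counter(students + non_students)

p_class = Fraction(len(students), n_total)  # P(Student)

posterior = {}
for combo in sorted(all_counts):
    p_features = Fraction(all_counts[combo], n_total)                     # P(features)
    p_features_given_class = Fraction(student_counts[combo], len(students))  # P(features | Student)
    via_bayes = p_features_given_class * p_class / p_features
    direct = Fraction(student_counts[combo], all_counts[combo])           # straight from the table
    assert via_bayes == direct
    posterior[combo] = via_bayes
    print(combo, via_bayes)
```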

<a id="conditional-probability"></a>
### Conditional probability

Let's review and see how Bayes' Theorem is just a big conditional probability.

In general, for events $A$ and $B$ the **conditional probability** is:

### $$ P(\;A\;|\;B\;) = \frac{P(\;A \cap B\;)}{P(\;B\;)} $$

Hence (since $\cap$ is commutative, i.e. $P(A \cap B) = P(B \cap A)$):

### $$ P(A \cap B) = P(A\;|\;B) \; P(B) = P(B\;|\;A) \; P(A) $$

This is often referred to as the "chain rule" of probability!

#### Bayes' theorem

From the above, just substitute the second equation into the first:

### $$P\left(\;A\;|\;B\;\right) = \frac{P(\;A \cap B\;)}{P(\;B\;)} = \frac{P\left(\;B\;|\;A\;\right)P\left(\;A\;\right)}{P(\;B\;)}$$


### $$P\left(\;class\;|\;features\;\right) = \frac{P\left(\;features\;|\;class\;\right)P\left(\;class\;\right)}{P(\;features\;)} $$

As we saw earlier, it is very easy to compute $P(\;features\;|\;class\;)$ from the training data!

<a id="a-really-simple-spam-example"></a>
## A really simple spam example
---

Here is another example. We are trying to predict spam emails.  For now, we have one feature: whether the email mentions 'guarantee'.

$G$ = Guarantee, $S$ = Is Spam.

 $$P\left(\;S\;|\;G\;\right) = \frac{P\left(\;G\;|\;S\;\right)P\left(\;S\;\right)}{P(\;G\;)} = \frac{P\left(\;G\;|\;S\;\right)P\left(\;S\;\right)}{P(\;G\;|\;S)P(\;S\;) + P(\;G\;|\;\neg{S})P(\;\neg{S}\;)}$$

We saw earlier how it is possible to compute $P\;(\;G\;)$ directly. In many books, you will see it computed in this alternative fashion which is again based on our classes rather than based on every combination of features. 

> The denominator looks complicated, but it actually isn't. Since every email is either spam ($S$) or not spam ($\neg{S}$), then:

> $$P(G) = P(G \cap S) + P(G \cap \neg{S})$$

> Now, just expand each term using the "chain rule" and you get the denominator!

Again, note we started with a term $P\;(\;features\;)$ -- in general, using this would require the calculation of every combination of features (i.e. we would also need to compute $P\;(\;GS\;)$, $P\;(\;GCS\;)$, etc). However, by expanding $P\;(\;G\;)$ via the chain rule, we got an expression that depends only on the individual classes that we have data about.
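
A worked numeric example may help here. The three input probabilities below are hypothetical values chosen only for illustration; they are not estimated from any dataset in this lesson.

```python
from fractions import Fraction

# Hypothetical values (assumptions for illustration only).
p_spam = Fraction(1, 5)            # P(S)
p_g_given_spam = Fraction(2, 5)    # P(G | S)
p_g_given_ham = Fraction(1, 20)    # P(G | not S)

# Denominator: P(G) = P(G | S) P(S) + P(G | not S) P(not S)
p_g = p_g_given_spam * p_spam + p_g_given_ham * (1 - p_spam)

# Bayes' Theorem: P(S | G) = P(G | S) P(S) / P(G)
p_spam_given_g = p_g_given_spam * p_spam / p_g
print(p_g, p_spam_given_g)
```

Every quantity on the right-hand side is a per-class statistic, which is exactly the kind of number our training data lets us estimate.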

<a id="problem-multiple-features-require-joint-probabilities"></a>
### Problem: Multiple features require joint probabilities

In this spam example, we only had one feature $G$. But in all likelihood, we'll use more than one feature. Really, we want to see some feature vector $X_1, X_2, ..., X_n$:

$$P\left(\;S\;|\;X_1, ..., X_n\;\right) = \frac{P\left(\;X_1,  ..., X_n\;|\;S\;\right)P\left(\;S\;\right)}{P(\;X_1,  ..., X_n\;|\;S)P(\;S\;) + P(\;X_1, ..., X_n\;|\;\neg{S})P(\;\neg{S}\;)}$$

For example, what is the likelihood that something is spam given that the email mentions Guarantee, Oil, Prince, and Nigeria ... but not Meeting, Colleague, and Dad?

With a lot of features, calculating the joint probabilities gets really complicated really quickly. We would definitely need lots of data to ensure we have enough feature combinations. If you reason this out, you quickly may realize we run into the same joint probability problem as before, requiring exponentially many joint probabilities!

No matter how diligent we are, we may never collect a single training example that contains the precise combination of feature words we need. Hence, we would be unable to classify a new email containing a particular combination of words.

<a id="solution-the-naive-bayes-independence-assumption"></a>
### Solution: The Naive Bayes Independence Assumption

We are stuck again, since conditional joint probabilities are required for multiple features. This means exponentially many probabilities to compute, and exponentially more data collection.

To get around this, let's make an assumption: **All $X_i$ are conditionally independent given $S$** (where $S$ indicates "is spam"). Despite the fancy words, this just means that given $S$, then no two $X_i$ depend on each other. For example, the words Nigeria and Prince each independently indicate that an email is spam. So, it is not the complex interaction of words that determines spam; each feature independently can indicate whether an email is spam.

> Of course, this assumption is rarely (if ever) true! Often it requires precise reading to tell whether an email was written by a native speaker, for example. In this case, it is often not the particular words used but how they are used in context.

Recall that if events $A$ and $B$ are independent, then the probability $P\;(\;AB\;) = P\;(\;A\;)\;P\;(\;B\;)$. Similarly, if $A$ and $B$ are conditionally independent given $S$, then $P\;(\;AB\;|\;S\;) = P\;(\;A\;|\;S\;)\;P\;(\;B\;|\;S\;)$.

> This formula works out really well in general too:

> $$P\left(\;X_{1}X_{2} \dots X_{n}\;|\;S\;\right) = P\left(\;X_{1} |\;S\;\right) * P\left(\;X_{2} |\;S\;\right) ... P\left(\;X_{n} |\;S\;\right)$$



To see how this **conditional independence assumption** removes the joint probabilities from the numerator of Bayes' Theorem, $P(X_1, ..., X_n\;|\;S)\;P(S)$, take two features: $G$ (the email mentions 'guarantee') and $M$ (the email mentions 'millions'). First apply the definition of conditional probability, then apply the independence assumption:

$$P\;(\;SGM\;) = P\;(\;GM\;|\;S\;)\;P\;(\;S\;) = P\;(\;G\;|\;S\;)\;P\;(\;M\;|\;S\;)\;P\;(\;S\;)$$


> None of these probabilities require us to examine multiple features at once in our dataset, making them drastically easier to compute. For example, $P(\;A\;|\;S\;)$ could indicate just the probability of the word A occurring in a spam email!

In reality, features are unlikely to be conditionally independent.  But Naive Bayes makes exactly this assumption ... and it turns out to often work well despite this!

<a id="spam-application-multiple-features"></a>
### Spam application: Multiple Features

How is this used in practice? Let's combine the Naive Bayes simplification above with our original formula (the denominator is computed the same way as before, combined with our naive assumption):

 $$P\left(\;S\;|\;GM\;\right) = \frac{P(\;SGM\;)}{P(\;GM\;)} = \frac{P\left(\;GM\;|\;S\;\right)P\left(\;S\;\right)}{P(\;GM\;|\;S)P(\;S\;) + P(\;GM\;|\;\neg{S})P(\;\neg{S}\;)} = \frac{P\left(\;G\;|\;S\;\right)P\left(\;M\;|\;S\;\right)P\left(\;S\;\right)}{P(\;G\;|\;S)P(\;M\;|\;S)P(\;S\;) + P(\;G\;|\;\neg{S})P(\;M\;|\;\neg{S})P(\;\neg{S}\;)}$$
 
Typically, we compute this probability for each class (in this case, just S or not S), then predict the class with highest probability. Note for all of these, the denominator $P(GM)$ is constant. Hence, this formula is often written as "proportional" ($\propto$), considerably simplifying it. Instead of comparing the exact probabilities, we can just see how they score relative to each other. So:

 $$P\left(\;S\;|\;GM\;\right) \propto P\left(\;G\;|\;S\;\right)P\left(\;M\;|\;S\;\right)P\left(\;S\;\right)$$

So again: don't be scared by the denominator!
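
The following sketch shows that the proportional scores pick the same class as the fully normalized posteriors. All probabilities are hypothetical values chosen for illustration, and the labels `"spam"` and `"ham"` stand for $S$ and $\neg S$.

```python
from fractions import Fraction

# Hypothetical priors and per-word conditional probabilities (assumptions).
prior = {"spam": Fraction(1, 5), "ham": Fraction(4, 5)}
p_word = {
    "spam": {"G": Fraction(2, 5), "M": Fraction(3, 10)},
    "ham": {"G": Fraction(1, 20), "M": Fraction(1, 50)},
}

def score(label, words):
    """Unnormalized score: P(class) times the product of P(word | class)."""
    result = prior[label]
    for w in words:
        result *= p_word[label][w]
    return result

scores = {label: score(label, "GM") for label in prior}
evidence = sum(scores.values())  # P(GM) under the naive assumption
posteriors = {label: s / evidence for label, s in scores.items()}

print(scores)
print(posteriors)
```

Dividing by `evidence` rescales both scores by the same constant, so the ranking of the classes cannot change.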

<a id="production-issues"></a>
## Production Issues
---

Recall Naive Bayes is proportional to this:

 $$P\left(\;S\;|\;GM\;\right) \propto P\left(\;G\;|\;S\;\right)P\left(\;M\;|\;S\;\right)P\left(\;S\;\right)$$

Accidentally using a zero probability for any of these could present major problems -- the entire probability estimate would be zero!

- **New features.** What if a particular feature was never seen in our training data? Instead of using a zero probability, we should use a technique such as [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) to estimate a small non-zero probability for it.

- **Underflow.** Probabilities could be very small if some features rarely occur in some classes. Recall that floating point often gives us trouble due to its limited precision -- small floats tend toward zero. We can approach this problem by storing the logarithm of each probability $P_i$ instead of $P_i$ itself:

$$\log(P_1P_2) = \log P_1 + \log P_2$$

$$e^{\log P_1} = P_1$$

So: $$P_1P_2 \dots P_n = e^{\log P_1 + \dots + \log P_n}$$
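
Both issues can be sketched in a few lines. The helper `smoothed_probability` is a hypothetical name, and the counts and probabilities are made up for illustration.

```python
import math

def smoothed_probability(count, total, alpha=1.0, n_values=2):
    """Additive (Laplace) smoothing: (count + alpha) / (total + alpha * n_values)."""
    return (count + alpha) / (total + alpha * n_values)

# A word never seen in 10 spam emails still gets a small non-zero probability.
p_unseen = smoothed_probability(count=0, total=10)

# Multiplying many small probabilities underflows to 0.0 in floating point...
probs = [0.01] * 1000
running_product = 1.0
for p in probs:
    running_product *= p

# ...while the sum of their logarithms stays finite.
log_score = sum(math.log(p) for p in probs)

print(p_unseen, running_product, log_score)
```

Since the logarithm is monotonic, the class with the largest sum of logs is also the class with the largest product, so the log scores can be compared directly without exponentiating them.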

<a id="summary"></a>
## Summary
---

**Why is this Naive Bayes formula important?** With the independence assumption, we do not need to compute every joint probability! Even if no single training email contains both 'guarantee' and 'millions', we only need to compute the probability of each word separately: $P(\;G\;|\;S\;)$ and $P(\;M\;|\;S\;)$. 

These calculations can be quickly performed using our training data. The downside is that if spam is actually determined by some complex interaction between 'guarantee' and 'millions' (e.g. only the presence of one but not the other), then the independence assumption does not hold and our model will not have the capacity to predict spam correctly.

To make Naive Bayes a classifier, all we have to do is compute a score proportional to $P(y|X)$ for each class $y$ and pick the class with the largest score. In math notation, this is:

### $$ P(y \;|\; x_1, ..., x_n) \propto P(y) \prod_{i=1}^n P(x_i \;|\; y) \\
\downarrow \\
\hat{y} = arg \; \underset{y}{max} \; P(y) \prod_{i=1}^n P(x_i \;|\; y)$$

> Recall that $arg\underset{y}{max}$ means we find the class in vector of categories $y$ that gives us the maximum expression value!
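
As a sketch of the full classifier, the argmax rule can be applied to the GA toy data from the first section. This is an illustrative from-scratch version with hypothetical function names, no smoothing, and exact fractions; it is not how scikit-learn implements Naive Bayes.

```python
from fractions import Fraction

FEATURES = "GCS"
samples = {
    "Student": ["GCS", "GC", "GCS", "GCS", "GC", "C", "C", "GC", "GCS", "GCS"],
    "Not Student": ["none", "none", "C", "GC", "C", "none", "C", "C", "CS", "none"],
}
n_total = sum(len(rows) for rows in samples.values())

def feature_probability(rows, feature):
    """P(feature present | class) as a simple frequency.

    Features are uppercase letters, so the lowercase string "none" matches none of them.
    """
    return Fraction(sum(feature in row for row in rows), len(rows))

def score(combo, label):
    """P(y) * prod_i P(x_i | y) for a combination such as 'GC' or 'none'."""
    rows = samples[label]
    result = Fraction(len(rows), n_total)  # prior P(y)
    for feature in FEATURES:
        p_present = feature_probability(rows, feature)
        result *= p_present if feature in combo else 1 - p_present
    return result

def predict(combo):
    """Return the class with the highest score (the argmax over y)."""
    return max(samples, key=lambda label: score(combo, label))

for combo in ("GCS", "GC", "S", "none"):
    print(combo, predict(combo))
```

Unlike the joint table, this model produces a prediction for the combination $S$ even though it never appears in the data. It also shows the zero-probability issue: every sampled student has a computer, so any combination without $C$ gets a student score of exactly zero unless smoothing is applied.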

<a id="implementation-in-scikit-learn"></a>
## Implementation in Scikit-learn
---


- [GaussianNB documentation](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)
- [MultinomialNB documentation](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
- [BernoulliNB documentation](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html)

<img src="./images/naive-bayes.png">

The differences can be summarized as follows:
-    ***BernoulliNB*** is designed for binary/boolean features.
-    The ***multinomial Naive Bayes classifier*** is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as `tf-idf` may also work.
-    ***GaussianNB*** is designed for continuous features (which can be scaled between 0 and 1) that are assumed to be normally distributed within each class.
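
As a small illustration of matching the variant to the feature type, the GA toy data from earlier has three binary features, so `BernoulliNB` is the natural fit. This is a sketch under the assumption that scikit-learn and NumPy are installed; the `encode` helper is a hypothetical name.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def encode(combo):
    """Turn a string such as 'GC' into a 0/1 vector over the features G, C, S."""
    return [int(feature in combo) for feature in "GCS"]

students = ["GCS", "GC", "GCS", "GCS", "GC", "C", "C", "GC", "GCS", "GCS"]
non_students = ["none", "none", "C", "GC", "C", "none", "C", "C", "CS", "none"]

X = np.array([encode(combo) for combo in students + non_students])
y = np.array([1] * len(students) + [0] * len(non_students))  # 1 = Student

# The default alpha=1.0 applies Laplace smoothing to the per-feature estimates.
model = BernoulliNB()
model.fit(X, y)

predictions = model.predict(np.array([encode("GCS"), encode("none")]))
print(predictions)
```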

<a id="guided-practice-scikit-learn-implementation"></a>
## Guided practice: Scikit-learn implementation

In [2]:
from sklearn import naive_bayes
import numpy as np
import pandas as pd

data = pd.read_csv('./datasets/spam_base.csv')

In [3]:
data.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.40,0.41,0.42,0.778,0.43,0.44,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1


> Check: what do you think is going on with this dataset?

> Which sklearn NB implementation should we use?

In [4]:
feature_set = data.iloc[:, :-1]
target = data.iloc[:, -1]

classifier1 = naive_bayes.MultinomialNB().fit(feature_set, target)

In [6]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(classifier1, feature_set, target, cv=5))

[ 0.78718784  0.8165038   0.80978261  0.78346028  0.69314472]


> Check: is that good?

In [9]:
1. - np.mean(target)

0.606086956521739
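
One way to judge the cross-validation scores is to compare their mean with the majority-class baseline just computed. The sketch below reuses the numbers printed above and assumes the last column encodes spam as 1, so the baseline corresponds to always predicting "not spam".

```python
# Scores copied from the cross_val_score output above.
cv_scores = [0.78718784, 0.8165038, 0.80978261, 0.78346028, 0.69314472]

# 1 - mean(target): accuracy of always predicting the majority class.
baseline_accuracy = 0.606086956521739

mean_cv_accuracy = sum(cv_scores) / len(cv_scores)
improvement = mean_cv_accuracy - baseline_accuracy

print("mean CV accuracy: {:.3f}".format(mean_cv_accuracy))
print("baseline accuracy: {:.3f}".format(baseline_accuracy))
print("improvement over baseline: {:.3f}".format(improvement))
```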

<a id="conclusion"></a>
## Conclusion
---


How does Naive Bayes fit into your toolkit? What are the pros and cons? How do you choose between variants?

#### Additional Resources

- [An interesting slide deck from a Stanford MOOC which had a section on Naive Bayes](https://web.stanford.edu/class/cs124/lec/naivebayes.pdf)
- [A much more technical paper comparing Naive Bayes to Logistic Regression](https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf)
- [More exposition on Naive Bayes](http://blog.yhat.com/posts/naive-bayes-in-python.html)
- [Naive Bayes from scratch](http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)