## Setup

This guide was written in Python 3.6.

### Python and Pip

If you haven't already, please download [Python](https://www.python.org/downloads/) and [Pip](https://pip.pypa.io/en/stable/installing/).

Let's install the modules we'll need for this tutorial. Open your terminal and enter the following commands to install the required Python modules:

```
pip3 install scikit-learn==0.18.1
pip3 install nltk==3.2.4
```

## Introduction

In this tutorial set, we'll review the Naive Bayes algorithm used in the field of machine learning. Naive Bayes applies Bayes' Theorem to predict the class of a given data point, and it is typically very fast compared to other classification algorithms.

Because it assumes independence among predictors, the Naive Bayes model is easy to build and particularly useful for large datasets. Despite its simplicity, Naive Bayes can sometimes outperform more sophisticated classification methods.

This tutorial assumes you have prior experience with Python programming and probability. While I will review some of the principles of probability, this tutorial is **not** intended to teach you these fundamental concepts. If you need some background on this material, please see my tutorial [here](https://github.com/lesley2958/intro-stats).


### Bayes' Theorem

Recall Bayes' Theorem, which provides a way of calculating the *posterior probability*:

$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$

Before we go into more specifics of the Naive Bayes Algorithm, we'll go through an example of classification to determine whether a sports team will play or not based on the weather. Specifically, we'll classify whether or not a team will play if it is sunny.

To start, we'll load in the data, which you can find [here](https://github.com/lesley2958/ml-bayes/blob/master/data/weather.csv).

In [1]:
import pandas as pd
f1 = pd.read_csv("./data/weather.csv")

Before we go any further, let's take a look at the dataset we're working with. It consists of two columns (excluding the index), *Weather* and *Play*. The *Weather* column holds one of three possible weather categories: `Sunny`, `Overcast`, and `Rainy`. The *Play* column is a binary value of `Yes` or `No`, and indicates whether or not the sports team played that day.

In [2]:
f1.head(3)

Unnamed: 0,Weather,Play
0,Sunny,No
1,Overcast,Yes
2,Rainy,Yes


#### Frequency Table

If you recall from probability theory, frequencies are an important part of calculating the probability of a given class. In this section of the tutorial, we'll first convert the dataset into different frequency tables using the `groupby()` function. First, we retrieve the frequencies of each combination of the weather and play columns:

In [3]:
df = f1.groupby(['Weather','Play']).size()
print(df)

Weather   Play
Overcast  Yes     4
Rainy     No      3
          Yes     2
Sunny     No      2
          Yes     3
dtype: int64


It will also come in handy to split the frequencies by weather and yes/no. Let's start with the three weather frequencies:

In [4]:
df2 = f1.groupby('Weather').count()
print(df2)

          Play
Weather       
Overcast     4
Rainy        5
Sunny        5


And now for the frequencies of yes and no:

In [5]:
df1 = f1.groupby('Play').count()
print(df1)

      Weather
Play         
No          5
Yes         9


#### Likelihood Table

The frequencies of each class are important in calculating the likelihood, or the probability that a certain class will occur. Using the frequency tables we just created, we'll find the likelihoods of each weather condition and of yes/no. We'll accomplish this by adding a new column that divides the frequency column by the total number of rows in the dataset:

In [6]:
df1['Likelihood'] = df1['Weather']/len(f1)
df2['Likelihood'] = df2['Play']/len(f1)
print(df1)
print(df2)

      Weather  Likelihood
Play                     
No          5    0.357143
Yes         9    0.642857
          Play  Likelihood
Weather                   
Overcast     4    0.285714
Rainy        5    0.357143
Sunny        5    0.357143


Now we can use Bayes' Theorem to calculate the posterior probability for each class. The class with the highest posterior probability is the prediction.


#### Calculation

Now, let's get back to our question: *Will the team play if the weather is sunny?*

From this question, we can construct Bayes' Theorem. Because the *known* factor is that it is sunny, $P(A \mid B)$ becomes $P(Yes \mid Sunny)$. From there, it's just a matter of plugging in probabilities.

$$ P(Yes \mid Sunny) = \frac{P(Sunny \mid Yes) \, P(Yes)}{P(Sunny)} $$

Since we already created some likelihood tables, we can just index `P(Sunny)` and `P(Yes)` off the tables:

In [8]:
ps = df2['Likelihood']['Sunny']
py = df1['Likelihood']['Yes']

That leaves us with $P(Sunny \mid Yes)$. This is the probability that the weather is sunny given that the players played that day. In `df`, we see that the total number of `Yes` days under `Sunny` is 3. We take this number and divide it by the total number of `Yes` days, which we can get from `df1`:

In [9]:
psy = df['Sunny']['Yes']/df1['Weather']['Yes']

And finally, we can just plug these variables into Bayes' Theorem:

In [10]:
p = (psy*py)/ps
print(p)

0.6


This tells us that there's a 60% probability of the team playing if it's sunny. Because this is a binary classification of yes or no, a value greater than 50% indicates the team *will* play.
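To tie the calculation back to the rule that the highest posterior wins, here is a minimal sketch that recomputes the posterior for both classes directly from the frequency table. The `counts` dictionary and the `posterior` helper are illustrative names, not part of the original notebook, and the counts are copied by hand from the `groupby` output above.

```python
from fractions import Fraction

# Joint counts copied from the frequency table: (weather, play) -> number of days.
counts = {
    ("Overcast", "Yes"): 4,
    ("Rainy", "No"): 3,
    ("Rainy", "Yes"): 2,
    ("Sunny", "No"): 2,
    ("Sunny", "Yes"): 3,
}
total = sum(counts.values())  # 14 days in all


def posterior(play, weather):
    """P(play | weather) via Bayes' Theorem, using exact fractions."""
    n_play = sum(n for (w, p), n in counts.items() if p == play)
    n_weather = sum(n for (w, p), n in counts.items() if w == weather)
    likelihood = Fraction(counts.get((weather, play), 0), n_play)  # P(weather | play)
    prior = Fraction(n_play, total)                                # P(play)
    evidence = Fraction(n_weather, total)                          # P(weather)
    return likelihood * prior / evidence


posteriors = {play: posterior(play, "Sunny") for play in ("Yes", "No")}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

The two posteriors come out to 3/5 and 2/5, so they sum to 1 and `Yes` is the predicted class.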

### Naive Bayes Evaluation

Every classifier has pros and cons, whether that be in terms of computational power, accuracy, etc. In this section, we'll review the pros and cons of Naive Bayes.

#### Pros

Naive Bayes is incredibly easy and fast in predicting the class of test data. It also performs well in multi-class prediction.

When the assumption of independence holds, the Naive Bayes classifier can perform better than other models such as logistic regression, and it needs less training data to do so.

Naive Bayes also tends to perform better with categorical input variables than with numerical ones, which is why we're able to use it for text classification. For numerical variables, a normal distribution is typically assumed.

#### Cons

If a categorical variable has a category that was not observed in the training data set, the model will assign it a probability of 0 and will be unable to make a prediction. This is referred to as "Zero Frequency". To solve this, we can use a smoothing technique such as Laplace estimation.
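As a rough illustration of Laplace (additive) smoothing, the sketch below uses the weather data from earlier in this guide, where `Overcast` never occurs on a `No` day. The function name `smoothed_probability` is hypothetical, and `alpha=1.0` is assumed as the smoothing strength.

```python
def smoothed_probability(count, class_total, n_categories, alpha=1.0):
    """Estimate P(category | class) with additive (Laplace) smoothing."""
    return (count + alpha) / (class_total + alpha * n_categories)


# Counts of each weather category on the 5 "No" days, from the frequency table.
no_day_counts = {"Overcast": 0, "Rainy": 3, "Sunny": 2}
class_total = sum(no_day_counts.values())

unsmoothed = no_day_counts["Overcast"] / class_total  # 0.0, which zeroes out the posterior
smoothed = {
    weather: smoothed_probability(count, class_total, len(no_day_counts))
    for weather, count in no_day_counts.items()
}
print(unsmoothed, smoothed)
```

Every category now gets a nonzero probability, and the smoothed values still sum to 1 within the class.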

Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible that we get a set of predictors which are completely independent.


## Naive Bayes Types

With `scikit-learn`, we can implement Naive Bayes models in Python. There are three types of Naive Bayes models, all of which we'll review in the following sections.

### Gaussian

The Gaussian Naive Bayes Model is used in classification and assumes that features will follow a normal distribution. 

We begin an example by importing the needed modules:

In [11]:
from sklearn.naive_bayes import GaussianNB
import numpy as np

As always, we need predictor and target variables, so we assign those:

In [12]:
x = np.array([[-3,7],[1,5], [1,2], [-2,0], [2,3], [-4,0], [-1,1], [1,1], [-2,2], [2,7], [-4,1], [-2,7]])
y = np.array([3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4])

Now we can initialize the Gaussian Classifier:


In [14]:
model = GaussianNB()

Now we can train the model using the training sets:


In [15]:
model.fit(x, y)

GaussianNB(priors=None)

In [17]:
predicted = model.predict([[1,2],[3,4]])
print(predicted)

[3 4]
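To make the normal-distribution assumption concrete, here is a minimal from-scratch sketch of the same idea in plain Python. It is not scikit-learn's implementation: it assumes population variance (dividing by n) and applies no variance smoothing, and the helper names are illustrative.

```python
import math
from collections import defaultdict

# Same toy data as the scikit-learn example above.
X = [[-3, 7], [1, 5], [1, 2], [-2, 0], [2, 3], [-4, 0],
     [-1, 1], [1, 1], [-2, 2], [2, 7], [-4, 1], [-2, 7]]
y = [3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 4, 4]


def fit_gaussian_nb(X, y):
    """Return {label: (log_prior, means, variances)} estimated from the data."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    params = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # Population variance (divide by n), one value per feature.
        variances = [sum((v - m) ** 2 for v in col) / len(col)
                     for col, m in zip(columns, means)]
        params[label] = (math.log(len(rows) / len(X)), means, variances)
    return params


def log_posterior(row, log_prior, means, variances):
    """Unnormalized log P(class | row): log prior plus Gaussian log densities."""
    total = log_prior
    for value, mean, var in zip(row, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
    return total


def predict(params, row):
    """Pick the class with the highest unnormalized log posterior."""
    return max(params, key=lambda label: log_posterior(row, *params[label]))


params = fit_gaussian_nb(X, y)
predictions = [predict(params, row) for row in [[1, 2], [3, 4]]]
print(predictions)
```

Working in log space avoids multiplying many small densities together, which is the usual design choice for Naive Bayes implementations.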


### Multinomial

MultinomialNB implements the multinomial Naive Bayes algorithm and is one of the two classic Naive Bayes variants used in text classification. This classifier is suitable for classification with discrete features (such as word counts for text classification). 

The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts may also work.

First, we need some data, so we import numpy and generate a random matrix of integer counts, along with one label per row:


In [18]:
import numpy as np
x = np.random.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])

Now let's build the Multinomial Naive Bayes model: 


In [19]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(x, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Let's try an example:


In [20]:
print(clf.predict(x[2:3]))

[3]
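Random counts make it hard to see what the model is doing, so here is a small from-scratch sketch of the multinomial rule on hypothetical word counts. The vocabulary, documents, labels, and helper names are all invented for illustration, and `ALPHA = 1.0` is assumed to play the same role as the `alpha` smoothing parameter shown in the `MultinomialNB` output above.

```python
import math

# Hypothetical data: each row counts the words ["goal", "vote", "match"] in one document.
X = [[3, 0, 2], [2, 0, 1], [0, 4, 0], [0, 3, 1]]
y = ["sports", "sports", "politics", "politics"]
ALPHA = 1.0  # additive smoothing strength


def fit_multinomial_nb(X, y, alpha=ALPHA):
    """Return {label: (log_prior, per-feature log likelihoods)}."""
    n_features = len(X[0])
    params = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        feature_totals = [sum(col) for col in zip(*rows)]
        denom = sum(feature_totals) + alpha * n_features
        log_likelihoods = [math.log((c + alpha) / denom) for c in feature_totals]
        params[label] = (math.log(len(rows) / len(X)), log_likelihoods)
    return params


def predict(params, row):
    """Score each class as log prior + sum(count * log likelihood)."""
    def score(label):
        log_prior, log_likelihoods = params[label]
        return log_prior + sum(c * ll for c, ll in zip(row, log_likelihoods))
    return max(params, key=score)


params = fit_multinomial_nb(X, y)
sports_doc = predict(params, [1, 0, 1])    # mentions "goal" and "match"
politics_doc = predict(params, [0, 2, 0])  # mentions "vote" twice
print(sports_doc, politics_doc)
```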


### Bernoulli

Like MultinomialNB, this classifier is suitable for discrete data. BernoulliNB implements the Naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions, meaning there may be multiple features but each one is assumed to be a binary value. 

The decision rule for Bernoulli Naive Bayes is based on

![Bernoulli Naive Bayes decision rule](https://github.com/lesley2958/ml-bayes/blob/master/bernoulli.png?raw=true)
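For reference, and assuming the image above shows the rule as stated in the scikit-learn documentation, the Bernoulli likelihood of a single binary feature $x_i$ given class $y$ can be written as:

$$ P(x_i \mid y) = P(x_i = 1 \mid y) \, x_i + \left(1 - P(x_i = 1 \mid y)\right) (1 - x_i) $$

Unlike the multinomial rule, this explicitly penalizes the absence of a feature that usually indicates class $y$.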

In [21]:
import numpy as np
x = np.random.randint(2, size=(6, 100))
y = np.array([1, 2, 3, 4, 4, 5])

In [22]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(x, y)
print(clf.predict(x[2:3]))

[3]


### Tips for Improvement

If continuous features don't follow a normal distribution (recall that Gaussian Naive Bayes assumes one), you can apply a transformation to bring them closer to a normal distribution.
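One common choice, shown here only as an assumed example, is a log transform for strictly positive, right-skewed features. The sketch below uses made-up values and a simple moment-based skewness measure to show the transform pulling the data closer to symmetric.

```python
import math


def skewness(values):
    """Moment-based sample skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature (all values must be positive for a log transform).
raw = [1, 2, 2, 3, 4, 8, 20, 55]
logged = [math.log(v) for v in raw]

print(skewness(raw), skewness(logged))
```

A lower absolute skewness does not guarantee normality, so it is still worth checking the transformed feature before relying on the Gaussian assumption.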

As we mentioned before, if the test data set has a zero frequency issue, you can apply a smoothing technique such as "Laplace Correction" to predict the class.

As usual, you can remove correlated features: because each feature contributes independently to the prediction, correlated features are effectively counted twice, which can overinflate their importance.

## Joint Models

Suppose you have input data `x` and want to classify it into labels `y`. A <b>generative model</b> learns the joint probability distribution `p(x,y)`, while a discriminative model learns the <b>conditional probability distribution</b> `p(y|x)`.

Here's a simple example, with data of the form `(x, y)`:

```
(1,0), (1,0), (2,0), (2, 1)
```

`p(x,y)` is

```
      y=0   y=1
     -----------
x=1 | 1/2   0
x=2 | 1/4   1/4
```

Meanwhile, `p(y|x)` is

```
      y=0   y=1
     -----------
x=1 | 1     0
x=2 | 1/2   1/2
```

Notice that if you add all 4 probabilities in the first chart, they add up to 1, but if you do the same for the second chart, they add up to 2. This is because the probabilities in chart 2 are read row by row. Hence, `1+0=1` in the first row and `1/2+1/2 = 1` in the second.

The distribution `p(y|x)` is the natural distribution for classifying a given example `x` into a class `y`, which is why algorithms that model this directly are called discriminative algorithms. 

Generative algorithms model `p(x, y)`, which can be transformed into `p(y|x)` by applying Bayes' rule and then used for classification. However, the distribution `p(x, y)` can also be used for other purposes. For example, you could use `p(x,y)` to generate likely `(x, y)` pairs.
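Both uses of the joint distribution can be sketched in a few lines, using the four example pairs from this section. The helper name `conditional` and the fixed random seed are illustrative choices.

```python
import random
from collections import Counter
from fractions import Fraction

pairs = [(1, 0), (1, 0), (2, 0), (2, 1)]

# Generative view: estimate the joint distribution p(x, y) from counts.
joint = {pair: Fraction(n, len(pairs)) for pair, n in Counter(pairs).items()}


def conditional(y, x):
    """p(y | x) = p(x, y) / p(x), where p(x) sums the joint over all y."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return joint.get((x, y), Fraction(0)) / p_x


print(conditional(0, 1), conditional(1, 2))

# The joint can also generate new (x, y) pairs, which p(y | x) alone cannot do.
rng = random.Random(0)
outcomes = list(joint)
samples = rng.choices(outcomes, weights=[float(joint[o]) for o in outcomes], k=5)
print(samples)
```

The conditional values match the second table above: `p(y=0|x=1)` is 1 and `p(y=1|x=2)` is 1/2.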