# Lesson 5.07 Naïve Bayes

## What is Naïve Bayes?

Naïve Bayes relies on [Bayes' theorem](https://www.mathsisfun.com/data/bayes-theorem.html) to calculate conditional probabilities.

To understand Bayes' theorem, we first need to recall how conditional probabilities work.


**A quick example to understand this intuitively** - *Suppose you pick a card from a standard 52-card deck. What is the probability of drawing a queen given the card is a heart?*

* I have told you the condition: that the card is a heart. Therefore, we only have 13 options to choose from, since there are 13 hearts in a deck of cards. Out of these, only 1 card is a queen (there is one queen in each suit), so the probability of drawing a queen given the card is a heart is 1/13.
    
* It is important to note here that the probability of drawing a queen given the card is a heart is not the same as the probability of drawing a heart given the card is a queen! This would be 1/4.
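
The two conditional probabilities above can be checked by enumerating the deck and counting. This is only a minimal sketch: the `conditional_probability` helper and the rank and suit labels are illustrative names, not part of the lesson's code.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def conditional_probability(event, condition, outcomes):
    """P(event | condition), counting equally likely outcomes."""
    given = [o for o in outcomes if condition(o)]   # restrict to the condition
    hits = [o for o in given if event(o)]           # event within that subset
    return Fraction(len(hits), len(given))


def is_queen(card):
    return card[0] == "Q"


def is_heart(card):
    return card[1] == "hearts"


p_queen_given_heart = conditional_probability(is_queen, is_heart, deck)
p_heart_given_queen = conditional_probability(is_heart, is_queen, deck)

print(p_queen_given_heart)  # 1/13
print(p_heart_given_queen)  # 1/4
```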

If we know $P(B|A)$, Bayes' theorem lets us calculate $P(A|B)$ by relating the two conditional probabilities through $P(A)$ and $P(B)$.

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)}
\end{eqnarray*}
$$
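
As a quick check, the card example fits this formula. For this check only, take $A$ to be "the card is a queen" and $B$ to be "the card is a heart" (labels chosen here for illustration), so that $P(B|A) = \frac{1}{4}$, $P(A) = \frac{4}{52}$ and $P(B) = \frac{13}{52}$:

$$
\begin{eqnarray*}
P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} = \frac{\frac{1}{4} \cdot \frac{4}{52}}{\frac{13}{52}} = \frac{1}{13}
\end{eqnarray*}
$$

This matches the value of 1/13 found by counting hearts directly.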

- Let $A$ be that a message is spam.
- Let $B$ represent the words used in the message.

$$
\begin{eqnarray*}
\text{Bayes' Theorem: } P(A|B) &=& \frac{P(B|A)P(A)}{P(B)} \\
\Rightarrow P(\text{message is spam}|\text{words in message}) &=& \frac{P(\text{words in message}|\text{message is spam})P(\text{message is spam})}{P(\text{words in message})}
\end{eqnarray*}
$$

We want to calculate the probability that a message is spam **given** the words that are in the message! Our model can learn this from the training data.
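
To make the formula concrete, here is a small sketch that estimates each term from counts for a single word. The counts and the word "free" are made up for illustration and do not come from any dataset used in this lesson.

```python
from fractions import Fraction

# Hypothetical training counts (made up for illustration)
n_spam = 40             # spam messages
n_ham = 60              # non-spam messages
n_spam_with_free = 30   # spam messages containing the word "free"
n_ham_with_free = 6     # non-spam messages containing the word "free"

n_total = n_spam + n_ham

p_spam = Fraction(n_spam, n_total)                               # P(spam)
p_free_given_spam = Fraction(n_spam_with_free, n_spam)           # P("free" | spam)
p_free = Fraction(n_spam_with_free + n_ham_with_free, n_total)   # P("free")

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(p_spam_given_free)  # 5/6
```

The result agrees with counting directly: 30 of the 36 messages containing "free" are spam.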

## Naïve Bayes Assumptions

Naïve Bayes makes the assumption that all features are independent of one another **(this is why it is called *naïve*)**.

**Why is this assumption not realistic with our data?**
    
Words in text are rarely independent! Certain words change the meaning of a sentence when used with other words, and some words are more or less likely to follow others.

Despite this assumption not being realistic with NLP data, we still use Naïve Bayes pretty frequently.
- It's a very fast modeling algorithm (which is great especially when we have lots of features and/or lots of data!).
- It often classifies well in practice, sometimes outperforming more complicated models.
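
Under the independence assumption, the likelihood of a whole message factorizes into a product of per-word terms, $P(w_1, \dots, w_n|\text{spam}) = \prod_i P(w_i|\text{spam})$. The sketch below scores a message this way. The per-word probabilities and class priors are hypothetical numbers chosen for illustration, and the function names are not from any library. Logs are summed instead of multiplying probabilities, which avoids numerical underflow on long messages.

```python
import math

# Hypothetical per-word likelihoods P(word | class); made-up numbers
p_word_given_spam = {"free": 0.30, "money": 0.20, "meeting": 0.01}
p_word_given_ham = {"free": 0.02, "money": 0.03, "meeting": 0.15}
p_spam, p_ham = 0.4, 0.6  # hypothetical class priors


def log_score(words, likelihoods, prior):
    """log P(class) + sum of log P(word | class): the naive product in log space."""
    return math.log(prior) + sum(math.log(likelihoods[w]) for w in words)


def posterior_spam(words):
    """Normalize the two class scores to get P(spam | words)."""
    s = math.exp(log_score(words, p_word_given_spam, p_spam))
    h = math.exp(log_score(words, p_word_given_ham, p_ham))
    return s / (s + h)


print(round(posterior_spam(["free", "money"]), 3))  # 0.985
print(round(posterior_spam(["meeting"]), 3))        # 0.043
```

This toy version only handles words that appear in both dictionaries.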

## Naïve Bayes Model Types

There are three common types of Naive Bayes models: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes.
- How do we pick which of the three models to use? It depends on our $X$ variable.

    - [Bernoulli Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB): most suitable when we have 0/1 variables.
    - [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB): most suitable when our variables are non-negative integers, such as word counts.
    - [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB): most suitable when our features are Normally distributed.
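
The pairing of model and feature type can be sketched on synthetic data. Everything generated below (the labels, the three feature matrices, and the `per_class` helper) is made up for illustration. The reported score is training accuracy, shown only to confirm that each model accepts its matching feature type.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)  # hypothetical binary labels, 100 samples


def per_class(value_if_1, value_if_0, n_features=5):
    """Return an (n_samples, n_features) parameter array chosen by class label."""
    column = np.where(y == 1, value_if_1, value_if_0).astype(float)
    return np.tile(column[:, None], (1, n_features))


# Synthetic features: the class shifts each feature's distribution
X_binary = rng.binomial(1, per_class(0.8, 0.2))    # 0/1 indicators
X_counts = rng.poisson(per_class(4.0, 1.0))        # non-negative integer counts
X_real = rng.normal(per_class(2.0, 0.0), 1.0)      # continuous, roughly Normal

models = {
    "BernoulliNB": (BernoulliNB(), X_binary),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "GaussianNB": (GaussianNB(), X_real),
}

scores = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)  # training accuracy
    print(name, scores[name])
```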

## Naive Bayes Classifiers

In [45]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score

In [37]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

### Load Data File
- Definition of columns can be found at this [link](https://archive.ics.uci.edu/ml/datasets/spambase)
- The first 48 columns give the frequency of a given word, as a percentage of all words in the e-mail
- The next 6 columns give the frequency of a given character, as a percentage of all characters in the e-mail
- The remaining columns refer to the average length of uninterrupted sequences of capital letters, length of longest uninterrupted sequence of capital letters, total number of capital letters in the e-mail, and whether the e-mail was considered spam (1) or not (0)

In [42]:
import pandas as pd

# Data file contains email attributes for determining the Spam status for subsequent incoming emails
# Data file contains the predictors of Spam Emails in the first 57 columns while the last column is the actual spam label
data = pd.read_csv('spambase.data', header=None)
data 

# display first row of data
# data.iloc[0]

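| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | ... | 0.000 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.010 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4596 | 0.31 | 0.00 | 0.62 | 0.0 | 0.00 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.232 | 0.0 | 0.000 | 0.000 | 0.000 | 1.142 | 3 | 88 | 0 |
| 4597 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.353 | 0.000 | 0.000 | 1.555 | 4 | 14 | 0 |
| 4598 | 0.30 | 0.00 | 0.30 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.102 | 0.718 | 0.0 | 0.000 | 0.000 | 0.000 | 1.404 | 6 | 118 | 0 |
| 4599 | 0.96 | 0.00 | 0.00 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.057 | 0.0 | 0.000 | 0.000 | 0.000 | 1.147 | 5 | 78 | 0 |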


In [33]:
# Assume that we have decided to use the first 48 columns as predictors for the last column which is the outcome variable
X = data.iloc[:, 0:48]

# Negative index means to search from back to front
# -1 means retrieve the last column
y = data.iloc[:, -1]

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=17)

### Using Bernoulli Naive Bayes to predict spam

In [38]:
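# BernoulliNB converts the X columns to 0/1 values before fitting
# binarize is a numeric threshold (default 0.0): values <= threshold are mapped to 0, values above it to 1.
# True is treated as the number 1.0 here, so the threshold actually used is 1.0 rather than the default 0.0.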
BernNB = BernoulliNB(binarize=True)
BernNB.fit(X_train, y_train)
print(BernNB)

# Optional - rename variable if it is clearer
y_expect = y_test

y_pred = BernNB.predict(X_test)

print(accuracy_score(y_expect, y_pred))

BernoulliNB(binarize=True)
0.8577633007600435
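
The effect of the threshold can be seen on a handful of values. This sketch uses `sklearn.preprocessing.binarize`, which applies the same rule (values greater than the threshold become 1, all others 0). The sample values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Made-up word-frequency values for a single row of features
values = np.array([[0.0, 0.05, 0.1, 0.64, 1.0, 1.23]])

at_0 = binarize(values, threshold=0.0).astype(int)    # default threshold
at_1 = binarize(values, threshold=1.0).astype(int)    # what binarize=True amounts to
at_01 = binarize(values, threshold=0.1).astype(int)   # a lower, hand-picked threshold

print(at_0.tolist())   # [[0, 1, 1, 1, 1, 1]]
print(at_1.tolist())   # [[0, 0, 0, 0, 0, 1]]
print(at_01.tolist())  # [[0, 0, 0, 1, 1, 1]]

# In Python, True compares equal to the number 1
print(True == 1.0)     # True
```

With a threshold of 1.0, almost every value collapses to 0, so most of the information in the features is lost.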


### Using Multinomial Naive Bayes to predict spam

In [39]:
MultiNB = MultinomialNB()
MultiNB.fit(X_train, y_train)
print(MultiNB)


y_pred = MultiNB.predict(X_test)

print(accuracy_score(y_expect, y_pred))

MultinomialNB()
0.8816503800217155


### Using Gaussian Naive Bayes to predict spam

In [40]:
GausNB = GaussianNB()
GausNB.fit(X_train, y_train)
print(GausNB)


y_pred = GausNB.predict(X_test)

print(accuracy_score(y_expect, y_pred))

GaussianNB()
0.8197611292073833


### Optimisation of Bernoulli Naive Bayes

In [44]:
# The best results so far come from binarize=0.1, found through trial and error.
# Values less than or equal to 0.1 are mapped to 0, all others to 1.
# A hyperparameter search such as scikit-learn's GridSearchCV could also be used to find the threshold value
BernNB = BernoulliNB(binarize=0.1)
BernNB.fit(X_train, y_train)
print(BernNB)

y_expect = y_test
y_pred = BernNB.predict(X_test)

print(accuracy_score(y_expect, y_pred))

BernoulliNB(binarize=0.1)
0.9109663409337676


We have raised the accuracy of the Bernoulli Naïve Bayes model from about 0.86 to 0.91 on this test split, making it the best of the 3 models.
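
The trial-and-error search for the threshold could be replaced by a cross-validated grid search. The sketch below shows one way to do this with `GridSearchCV`. It runs on synthetic data (made up so that it works without `spambase.data`), and the candidate thresholds in `param_grid` are an arbitrary choice, so the numbers it prints say nothing about the spam dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the spam features (assumption: not the real data)
rng = np.random.default_rng(17)
y = rng.integers(0, 2, size=400)
scale = np.where(y == 1, 0.5, 0.1)[:, None] * np.ones((1, 10))
X = rng.exponential(scale=scale)  # non-negative, frequency-like values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17
)

# Candidate thresholds to compare (arbitrary illustrative grid)
param_grid = {"binarize": [0.0, 0.05, 0.1, 0.2, 0.5, 1.0]}

search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)  # thresholds are compared on training folds only

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the refit best model on held-out data
```

Because the threshold is chosen by cross-validation on the training set, the test set is only used once, for the final accuracy estimate.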