In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

# Lecture 9 -  Naive Bayes Classifier

# Table of Contents
* [Lecture 9 -  Naive Bayes Classifier](#Lecture-9----Naive-Bayes-Classifier)
	* [Content](#Content)
	* [Learning Outcomes](#Learning-Outcomes)
	* [Example dataset - Wine](#Example-dataset---Wine)
* [Naive Bayes](#Naive-Bayes)
	* [Probabilities](#Probabilities)
	* [Joint probabilities](#Joint-probabilities)
	* [Probability density functions](#Probability-density-functions)
	* [Steps for calculating the classification for naive Bayes](#Steps-for-calculating-the-classification-for-naive-Bayes)
	* [Comparing Naive Bayes to k-NN](#Comparing-Naive-Bayes-to-k-NN)
* [Putting it all together in scikit-learn](#Putting-it-all-together-in-scikit-learn)


---

### Content

1. Naive Bayes
2. Scikit-learn and Naive Bayes

### Learning Outcomes

At the end of this lecture, you should be able to:

* explain how Naive Bayes works
* implement Naive Bayes in Python
* apply Naive Bayes to classification
* use the scikit-learn module to train and test Naive Bayes classifiers
---

### Example dataset - Wine

We will return to the Wine dataset to explore classification.

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mtpl
import seaborn as sns


%matplotlib inline

df = pd.read_csv(
    '../datasets/wine_data.csv',
     usecols=[0,6,7]
    )

df.columns=['Class','Magnesium','Flavanoids']

df['Class'].replace('3', 0, inplace=True)
df.to_csv('../datasets/wine_data_test.csv', header=None, index=None)

df.head(5)

Confirm we have 3 class labels:

In [None]:
df.Class.unique()


Confirm data types:

In [None]:
df.dtypes

Get counts for each class:

In [None]:
df.groupby('Class').count()

In [None]:
df.groupby('Class').count() / df.groupby('Class').count()['Magnesium'].sum()

In classification problems, the **ability to separate classes** from one another is the most important consideration. Histograms of the feature values per class can be a useful tool for **eyeballing** some features and getting a rough feeling for their **discriminative power**.

Here we are visualising the histograms of the two features for each of the three classes:

In [None]:
from matplotlib import pyplot as plt
plt.figure(figsize=(10,8))

colors = ('blue', 'red', 'green')

for label,color in zip(range(1,4), colors):
    mean = np.mean(df['Magnesium'][df['Class'] == label]) # class sample mean
    stdev = np.std(df['Magnesium'][df['Class'] == label]) # class standard deviation
    df['Magnesium'][df['Class'] == label].hist(alpha=0.3, # opacity level
             label=r'class {} ($\mu={:.2f}$, $\sigma={:.2f}$)'.format(label, mean, stdev),  # raw string keeps the LaTeX backslashes
             color=color,
             bins=15)

plt.title('Wine data set - Distribution of Magnesium content')
plt.xlabel('Magnesium content', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.legend(loc='upper right')

plt.show()

In [None]:
from matplotlib import pyplot as plt

plt.figure(figsize=(10,8))

colors = ('blue', 'red', 'green')

for label,color in zip(range(1,4), colors):
    mean = np.mean(df['Flavanoids'][df['Class'] == label]) # class sample mean
    stdev = np.std(df['Flavanoids'][df['Class'] == label]) # class standard deviation
    df['Flavanoids'][df['Class'] == label].hist(alpha=0.3, # opacity level
             label=r'class {} ($\mu={:.2f}$, $\sigma={:.2f}$)'.format(label, mean, stdev),  # raw string keeps the LaTeX backslashes
             color=color,
             bins=15)

plt.title('Wine data set - Distribution of Flavanoids content')
plt.xlabel('Flavanoids content', fontsize=14)
plt.ylabel('count', fontsize=14)
plt.legend(loc='upper right')

plt.show()

# Naive Bayes - Probability of the Causes

Naive Bayes (NB) is a classical statistical machine learning algorithm. It is based on one of the most important equations both in statistics and in science as a whole - the **Bayes Theorem**.

The Bayes Theorem is the foundation of **Bayesian Statistics**, a large branch of statistics that has increasing relevance in solving real-world data science problems. Advanced methods such as Bayesian networks, Hidden Markov Models, Markov random fields and probabilistic relational models are situated within this subfield.

With the Bayes Theorem the **evidence about the true state of the world is expressed in terms of degrees of belief (probabilities)**. As such, the Bayes Theorem deals with **conditional probabilities between different events**. 

*It allows us to calculate the probability of some event A occurring given that some evidence B is true.*

#### Probabilities

Say we are developing an email spam filter. We are dealing with two classes, **spam** and **non-spam**. We look through all the spam emails in our email account and we find that 20% of them have the word 'Viagra' in them, while 80% contain the word 'Bank'. We would express this as: 

\begin{equation}
p(Viagra) = 0.2
\end{equation}

and

\begin{equation}
p(Bank) = 0.8
\end{equation}

If we assume that the occurrences of the above words in the emails are independent of each other (which means that one does not influence the other), then the probability of encountering a spam email that contains both of these words is:

0.2 $\times$ 0.8 = 0.16, which is p(Viagra)p(Bank).

This can be fully expressed as:

\begin{equation}
p(Viagra\ and\ Bank | Spam)
\end{equation}

meaning the probability of Viagra and Bank occurring together, given a spam email.

The above in effect expresses:

\begin{equation}
p(Words | Class)
\end{equation}

meaning the probability of certain words occurring for a given class which could be spam or non-spam. However, we are interested in prediction, so for us the above needs to be inverted because we want to know what is the probability of something being spam or non-spam given certain words we have come across in an email. In effect, what we are interested in is:

\begin{equation}
p(Class | Words)
\end{equation}

Our problem is that 

\begin{equation}
p(Class | Words) \neq p(Words | Class)
\end{equation}


Fortunately, about 250 years ago, the British mathematician Reverend **Thomas Bayes** first proposed how to do this through a thought experiment. The great French mathematician **Pierre-Simon Laplace** also figured this out independently and gave it the mathematical formalism which we have today.

#### Joint probabilities

Since

\begin{equation}
p(A\ and\ B) = p(B)p(A) 
\end{equation}

**is not always true** (it holds only when A and B are independent), the joint probability of two events is generally expressed as follows:

\begin{equation}
p(A\ and\ B) = p(B|A)p(A)
\end{equation}

where p(B|A) represents the probability of B occurring, given that A has occurred. The above equation can equally be written as:

\begin{equation}
p(A\ and\ B) = p(A|B)p(B)
\end{equation}

By pulling the two pieces together we have:

\begin{equation}
p(B|A)p(A) = p(A|B)p(B)
\end{equation}

This now brings us to the Bayes Theorem and to the solution to our problem. If we make B the class label 'spam' (whose probability we would like to predict) and A the words in the email (like 'viagra' or 'bank'), then we can solve for it by rearranging this equation as:

\begin{equation}
P(B|A) = \frac{P(A | B)\, P(B)}{P(A)}
\end{equation}

Plugging our problem into the formula, we get:

\begin{equation}
P(Class|Words) = \frac{P(Words | Class)\, P(Class)}{P(Words)}
\end{equation}

Breaking the formula down, P(Words|Class) is called the **likelihood**, and we can calculate it from our training set. The likelihood answers the question: given all the spam emails, what is the probability that words like 'viagra' and 'bank' occur in a document? Under the naive assumption, it is simply the product of the probabilities of the individual words.

Naive Bayes is naive because it assumes independence amongst the features when calculating the likelihood, which in our example we express simply as p(Viagra)p(Bank).

P(Class) is called the **prior**. We can also calculate this from our dataset, as the proportion of emails that are classed as spam.

P(Words) is called the **normalising constant**. It is the probability of seeing this pattern of words without knowing what class it belongs to. It is the least important component because it is the same for every class, and it can be computed as the sum, over all classes, of likelihood times prior. P(Class|Words) is called the **posterior** and is the result we are looking for.
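As a quick sanity check of the theorem, the sketch below computes the posterior for a single word in two ways: through the Bayes formula (likelihood times prior, divided by the normalising constant) and by counting directly. The counts are hypothetical and chosen only for illustration; they are not the numbers used in the rest of this lecture.

```python
# Hypothetical counts for 100 emails, split by class and by whether
# a given word appears (made-up numbers, for illustration only).
counts = {
    ("spam", True): 8,
    ("spam", False): 32,
    ("non-spam", True): 3,
    ("non-spam", False): 57,
}
total = sum(counts.values())
n_spam = counts[("spam", True)] + counts[("spam", False)]
n_with_word = counts[("spam", True)] + counts[("non-spam", True)]

# Prior: proportion of all emails that are spam.
p_spam = n_spam / total

# Normalising constant: proportion of all emails containing the word.
p_word = n_with_word / total

# Likelihood: proportion of spam emails containing the word.
p_word_given_spam = counts[("spam", True)] / n_spam

# Posterior via the Bayes Theorem.
posterior_bayes = p_word_given_spam * p_spam / p_word

# Posterior counted directly: of the emails with the word, how many are spam?
posterior_direct = counts[("spam", True)] / n_with_word

print(posterior_bayes, posterior_direct)
```

Both routes give the same value, which is exactly what rearranging p(B|A)p(A) = p(A|B)p(B) promises.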

Let us continue the email spam filter example, but before we proceed we need to finish describing our dataset.

Say we were examining a total of 100 emails, both spam and non-spam. 40 of those were spam and the rest were not. Of the 60 emails that were not spam, the word 'Viagra' appeared in 5% and the word 'Bank' in 10%. Across the entire dataset, the probability of finding both 'Viagra' and 'Bank' in an email, irrespective of what class the email belonged to, was roughly 7% (this part can be left out when we only want to compare the classes).

Our goal is to find out which class label is the most probable given the particular word features:


In [None]:
# calculate p(non-spam | viagra and bank)
((0.05 * 0.1) * 0.6) / (0.07)

In [None]:
# calculate p(spam | viagra and bank)
((0.2 * 0.8) * 0.4) / (0.07)

Usually we compute the normalising constant directly as the sum of likelihood times prior over all classes, so that the posteriors sum to 1:

In [None]:
print('non-spam: ', (0.05 * 0.1) * 0.6 / ((0.05 * 0.1) * 0.6 + (0.2 * 0.8) * 0.4) )
print('spam: ', (0.2 * 0.8) * 0.4 / ((0.05 * 0.1) * 0.6 + (0.2 * 0.8) * 0.4) )


From this we can say that an email containing the words 'Viagra' and 'Bank' is more likely to be spam than non-spam, and we therefore assign the class label **spam** to the features ['Viagra','Bank'].

In many classification tasks, you have to deal with incomplete or missing values. As it turns out **Naive Bayes is really good for dealing with missing values and is able to produce a classification without having all the features, whereby the likelihood is simply calculated by excluding the particular missing feature.**
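A minimal sketch of this idea, reusing the word probabilities and priors from the spam example: the likelihood is the product over whichever words were actually observed, so a missing feature simply drops out of the product. The function name `posterior_for_observed` is an illustrative choice, not part of any library.

```python
# Per-class word probabilities and priors from the spam example.
likelihoods = {
    "spam": {"Viagra": 0.2, "Bank": 0.8},
    "non-spam": {"Viagra": 0.05, "Bank": 0.1},
}
priors = {"spam": 0.4, "non-spam": 0.6}


def posterior_for_observed(observed_words):
    """Return normalised class posteriors using only the observed words."""
    unnormalised = {}
    for label, prior in priors.items():
        score = prior
        for word in observed_words:
            score *= likelihoods[label][word]
        unnormalised[label] = score
    total = sum(unnormalised.values())
    return {label: score / total for label, score in unnormalised.items()}


both_words = posterior_for_observed(["Viagra", "Bank"])
bank_only = posterior_for_observed(["Bank"])  # the 'Viagra' feature is missing

print(both_words)
print(bank_only)
```

With only 'Bank' available the classifier still favours spam, just with less confidence than when both words are observed.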

The above simplistic example considered features that were **categorical** and were represented as frequencies. 

How would we apply Naive Bayes to the wine dataset where the features for 'Magnesium' and 'Flavanoids' are numerical?

### Probability density functions

Below is an example of the Gaussian (normal) probability density function (PDF).

In [None]:
import numpy as np
import scipy.stats as stats

h = np.random.randn(1000)
h.sort()
fit = stats.norm.pdf(h, np.mean(h), np.std(h))  # density of a normal fitted to the sample

plt.figure(figsize=(10,8))
plt.plot(h,fit,'-')
plt.title('Gaussian PDF')
plt.xlabel('X with mean 0 and std 1')
plt.ylabel('Probability density')



The PDF for the normal distribution is given by a rather formidable looking expression:
<div style="font-size: 150%;"> 
\begin{equation}
f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{(x-\mu)^2}{2\sigma^2} }
\end{equation}
</div>



Take a deep breath and relax though. 

All we need to do is calculate the mean $\mu$ and standard deviation $\sigma$ of a feature for each of the class labels, and plug them together with our value $x$ into the equation. The result is a probability density, which is proportional to the probability that our quantity lies within a small region around $x$.
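To make the link between density and probability concrete, the sketch below evaluates the formula at one point and compares density times a small width against the exact probability of that interval, obtained from the normal cumulative distribution function. The values of `x`, `mu`, `sigma` and `width` are assumed purely for illustration, and the helper names are hypothetical.

```python
import math


def normal_density(x, mu, sigma):
    """Gaussian probability density f(x, mu, sigma)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# Illustrative values (assumed, not taken from the wine data).
x, mu, sigma = 66.0, 73.0, 6.2
width = 0.1

density = normal_density(x, mu, sigma)

# Probability of landing in a small interval centred on x:
# approximately density * width, exactly a difference of CDF values.
approx_probability = density * width
exact_probability = (
    normal_cdf(x + width / 2, mu, sigma) - normal_cdf(x - width / 2, mu, sigma)
)

print(density, approx_probability, exact_probability)
```

Because the interval width is the same for every class, it cancels when classes are compared, which is why the density itself can stand in for the probability in Naive Bayes.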

### Steps for calculating the classification for naive Bayes

For each class:

1. Calculate the probability (or, for numerical features, the probability density) of each feature value and then calculate their product - this represents the likelihood component of the Bayes Theorem.

2. Multiply the likelihood by the prior for the given class (the proportion of all samples in the training dataset that belong to this class) - this gives us the unnormalised posterior for the class.

3. The class with the largest posterior is the class that wins the classification.

4. Divide the posterior of each class by the sum of the posteriors of all classes in order to normalise and to finish up with a more interpretable probability.
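The four steps above can be sketched on a toy problem as follows. The two classes, the feature names `f1` and `f2`, and all parameter values are hypothetical; the layout (a `[mean, std]` pair per feature plus a prior) is just one convenient way to hold the parameters.

```python
import math

# Hypothetical per-class parameters: [mean, std] per feature, plus a prior.
model = {
    "a": {"prior": 0.5, "f1": [1.0, 0.5], "f2": [2.0, 0.5]},
    "b": {"prior": 0.5, "f1": [3.0, 0.5], "f2": [4.0, 0.5]},
}
sample = {"f1": 1.1, "f2": 2.2}  # the observation we want to classify


def gaussian_density(x, mean, std):
    """Gaussian probability density at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))


scores = {}
for label, params in model.items():
    # Step 1: likelihood as the product of per-feature densities.
    likelihood = 1.0
    for feature, value in sample.items():
        mean, std = params[feature]
        likelihood *= gaussian_density(value, mean, std)
    # Step 2: multiply by the prior.
    scores[label] = likelihood * params["prior"]

# Step 3: the class with the largest score wins.
winner = max(scores, key=scores.get)

# Step 4: normalise so the class probabilities sum to 1.
total = sum(scores.values())
probabilities = {label: score / total for label, score in scores.items()}

print(winner, probabilities)
```

The sample sits close to the means of class `a`, so that class wins with a normalised probability close to 1.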

**Exercise:** write a function to calculate the PDF for a normal distribution as below:

In [None]:
def pdf_gaussian(x, mean, std):
    #YOUR CODE HERE
    
    return 
 

In [None]:
x = 66 
mean = 73
std = 6.2
pdf_gaussian(x, mean, std)



Given the wine dataset we saw previously: 

In [None]:
df.head(5)

In [None]:
df = pd.read_csv(
    '../datasets/wine_data.csv',
     usecols=[0,6,7]
    )
print(df.columns)
df.columns=['Class','Magnesium','Flavanoids']
df['Class'] = df['Class'].replace(3, 0)  # assign back; inplace on a selected column may not update df
df.head()

### Classifier Representation and Training

We are now going to implement the Naive Bayes classifier algorithm, and we are going to create a structure that stores and represents all the parameters that we need for this classifier.

Fortunately, we do not need very much for Naive Bayes. We need:

1. For each feature, the mean value for every class (**likelihood component**)
2. For each feature, the standard deviation for every class (**likelihood component**)
3. The probability for each class (**prior**)
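As a rough illustration of what such a structure could look like, the sketch below estimates the three ingredients from a handful of hypothetical rows using only the standard library. The data, the feature names and the function name `fit_parameters` are assumptions for illustration. `statistics.stdev` is the sample standard deviation, which matches the default of pandas `.std()`.

```python
from statistics import mean, stdev

# Hypothetical training rows: (class label, feature values).
rows = [
    (1, {"f1": 2.0, "f2": 3.0}),
    (1, {"f1": 4.0, "f2": 5.0}),
    (2, {"f1": 10.0, "f2": 1.0}),
    (2, {"f1": 12.0, "f2": 3.0}),
    (2, {"f1": 14.0, "f2": 5.0}),
]


def fit_parameters(rows):
    """Return {label: {'prior': p, feature: [mean, std], ...}}."""
    labels = sorted({label for label, _ in rows})
    feature_names = list(rows[0][1])
    model = {}
    for label in labels:
        class_rows = [features for row_label, features in rows if row_label == label]
        # Prior: share of the training rows that belong to this class.
        entry = {"prior": len(class_rows) / len(rows)}
        for name in feature_names:
            values = [features[name] for features in class_rows]
            # Likelihood parameters: per-class mean and standard deviation.
            entry[name] = [mean(values), stdev(values)]
        model[label] = entry
    return model


model = fit_parameters(rows)
print(model)
```

Everything needed at prediction time is in this dictionary, so the training data itself can be discarded once the parameters are stored.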


**Exercise:** Write a function that takes a data frame and a class label and returns a dictionary whose keys are the feature names in the dataset, each with an associated list of 2 elements, where the first is the mean and the second the standard deviation of the given feature:

In [None]:
def calculate_class_statistics(df, class_label):
    class_distr = {}
    feature_types = []
    
    feature_types =  df.columns[1:]
    means = (df[feature_types][df.Class == class_label]).mean()
    stds = (df[feature_types][df.Class == class_label]).std()
    
    #YOUR CODE HERE

       
    return class_distr



In [None]:
calculate_class_statistics(df, 1)
#SHOULD RETURN:
# {'Magnesium': [2.8401694915254234, 0.3389613523154669],
# 'Flavanoids': [2.982372881355932, 0.3974936086367577]}

**Exercise:** Write a function to train a NB classifier. It takes in a data frame and returns a dictionary where each key is a class label and each value is a dictionary containing the means and standard deviations of each feature for that class, as well as a key for the prior and its associated value:

In [None]:
def train_naive_bayes(df):
    NB_classifier = {}
    class_labels = df.Class.unique()
    
    #YOUR CODE HERE

    
    return NB_classifier



In [None]:
NB_classifer = train_naive_bayes(df)
NB_classifer


**Exercise:** We are now going to move to classification. Write a function that takes as input the dictionary for a single class (one value of the dictionary above) as well as a series object to classify, and returns the Bayes probability (likelihood times prior) for this class:

In [None]:
def calculate_NB_probability_single_class(nb_classifier, series_x):
    likelihood = 1.0 
    
    #YOUR CODE HERE

    
    return posterior



In [None]:
series_x = pd.Series([2.0, 2.3], index=['Flavanoids', 'Magnesium'])

calculate_NB_probability_single_class(NB_classifer[str(1)], series_x)
#SHOULD RETURN 0.0051879593513192053



**Exercise:** Write a function that classifies a series object into a class and returns the winning class together with its normalised probability:

In [None]:
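def classify_naive_bayes(NB_classifer, series_x):
    classification = -1
    classification_probabilities = []
    prob = 0.0
    total_probabilities = 0.0
    
    #YOUR CODE HERE
    for i in NB_classifer:
        pass  # compute the probability for class i and add it to the total
    
    #select winning class
    for i in range(len(classification_probabilities)):
        pass  # keep track of the class with the largest probability
    
    #normalize probability
    return classification, prob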



In [None]:
series_x = pd.Series([4.0, 2.3], index=['Flavanoids', 'Magnesium'])

winning_class, prob = classify_naive_bayes(NB_classifer, series_x)
winning_class, prob

#SHOULD RETURN ('1', 0.50456450520692542)


**Congratulations, you have now fully implemented, trained and deployed your first classifier!**

**Exercise:** Write a function that takes in an NB classifier and a dataset and classifies each sample in the dataset. It creates two new columns on the dataset, called 'Classification' and 'Probability', and returns the data frame:

In [None]:
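def classify_dataset(NB_classifer, df):
    res = []
    #YOUR CODE HERE
    for i in range(len(df)):
        pass  # classify row i and append [classification, probability] to res
    
    # collect the per-row results into the two new columns
    res_df = pd.DataFrame(res, columns=['Classification', 'Probability'])
    return pd.concat([df, res_df], axis=1)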



In [None]:
df_result = classify_dataset(NB_classifer, df)
df_result.head(100)



### Comparing Naive Bayes to k-NN

>  Naive Bayes is a linear classifier, while k-NN is not. The
curse of dimensionality and large feature sets are a problem for k-NN,
while Naive Bayes performs well. k-NN requires no training (just load
in the dataset), whereas Naive Bayes does. Both are examples of supervised
learning (the data comes labeled). 

Schutt, R., & O'Neil, C. (2013). Doing Data Science: Straight Talk from the Frontline. O'Reilly Media, Inc.


kNN is in many ways a special case of a supervised machine learning algorithm. It is unique in that the data itself is the model and no training of a classifier takes place explicitly. **kNN is particularly susceptible to deteriorating accuracy** if there are meaningless features in the dataset. Hence, feature analysis should always be performed together with dimensionality reduction if using kNN.

**Naive Bayes is more robust against outliers as well as uninformative features** than kNN. However, one has to be careful with Naive Bayes because it naively assumes independence between features, which is almost never true in reality. **Naive Bayes is therefore vulnerable when this assumption is strongly violated.** When there is a large presence of highly correlated and redundant features, the training will bias the final result towards those features, and the final probabilities are unlikely to be accurate enough for interpretation.
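The effect of redundant features can be sketched with the spam numbers from earlier. Here the 'Bank' feature is counted once and then twice, as if a perfectly correlated copy of it had been added to the dataset. The function name `spam_posterior` is an illustrative choice.

```python
# Priors and p(Bank | class) from the spam example.
priors = {"spam": 0.4, "non-spam": 0.6}
p_bank = {"spam": 0.8, "non-spam": 0.1}


def spam_posterior(copies):
    """Posterior of spam when the 'Bank' feature is counted `copies` times."""
    scores = {label: priors[label] * p_bank[label] ** copies for label in priors}
    return scores["spam"] / sum(scores.values())


once = spam_posterior(1)   # the feature appears once
twice = spam_posterior(2)  # the same evidence, duplicated

print(once, twice)
```

The duplicated feature carries no new information, yet the posterior for spam rises from about 0.84 to about 0.98, because the independence assumption counts the same evidence twice.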

**Exercise:** 

Go to https://archive.ics.uci.edu/ml/index.php and select 3 machine learning datasets. 

Load them into the notebook. Prepare them for classification.

Perform 10-fold stratified cross-validation using the built-in scikit-learn Naive Bayes.

Evaluate the generalisation using several metrics.

Perform feature selection and investigate if the generalisation has improved.
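As a starting point for this exercise, here is a minimal sketch of 10-fold stratified cross-validation with scikit-learn's `GaussianNB`. It assumes scikit-learn is installed and uses the copy of the Wine dataset bundled with scikit-learn (`load_wine`, all 13 features) as a stand-in for the UCI datasets you select; the choice of metrics, shuffling and `random_state` are likewise assumptions, not requirements.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Stand-in dataset (assumption): the Wine data bundled with scikit-learn.
X, y = load_wine(return_X_y=True)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Evaluate with more than one metric at a time.
results = cross_validate(
    GaussianNB(), X, y, cv=cv, scoring=["accuracy", "f1_macro"]
)

mean_accuracy = results["test_accuracy"].mean()
mean_f1 = results["test_f1_macro"].mean()

print("accuracy: {:.3f}".format(mean_accuracy))
print("macro F1: {:.3f}".format(mean_f1))
```

For the feature selection step, one option is to wrap a selector and the classifier in a scikit-learn pipeline, so that the selection is refitted inside each training fold rather than on the full dataset.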