# Naive Bayes Algorithm

Bayes’ Theorem provides a way to calculate the probability that a piece of data belongs to a given class, given our prior knowledge. It is stated as:

$$P(class \mid data) = \frac{P(data \mid class) \, P(class)}{P(data)}$$

where $P(class \mid data)$ is the probability of the class given the provided data, $P(data \mid class)$ is the likelihood of the data under that class, $P(class)$ is the prior probability of the class, and $P(data)$ is the overall probability of the data.

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called naive Bayes, or idiot Bayes, because the calculation of the probabilities for each class is simplified to make it tractable.
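To make the formula concrete, here is a minimal sketch of the arithmetic for a single observation and two classes. The prior and likelihood values are hypothetical, chosen only for illustration; they are not computed from the Iris data.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
priors = {"setosa": 0.5, "versicolor": 0.5}        # P(class)
likelihoods = {"setosa": 0.2, "versicolor": 0.6}   # P(data | class)

# P(data) is the sum of P(data | class) * P(class) over all classes.
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# Bayes' Theorem applied to each class.
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)
```

Because every class shares the same denominator, the posteriors sum to 1 and the class with the largest numerator wins.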

Rather than attempting to calculate the joint probability of all attribute values, the attributes are assumed to be conditionally independent given the class value. This is a very strong assumption, namely that the attributes do not interact, and it rarely holds in real data. Nevertheless, the approach often performs surprisingly well on data where the assumption does not hold.
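Written out, with the attributes denoted $x_1, \dots, x_n$ (notation introduced here for illustration only), the independence assumption lets the likelihood factor into one term per attribute:

$$P(class \mid x_1, \dots, x_n) \propto P(class) \prod_{i=1}^{n} P(x_i \mid class)$$

The denominator $P(data)$ is the same for every class, so it can be ignored when only the most probable class is needed.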

## Exercise 1 - Explore the Data

The test problem we will use is Iris classification. The dataset comprises 150 observations of iris flowers from three species. Each flower has four measurements: sepal length, sepal width, petal length, and petal width, all in centimeters. The attribute to predict is the species, which is one of setosa, versicolor, or virginica.

It is a standard dataset where the species is known for all instances. As such, we can split the data into training and test datasets and use the results to evaluate our algorithm implementation. Good classification accuracy on this problem is above 90%, typically 96% or better.

You can download the dataset for free from [UCI](https://archive.ics.uci.edu/ml/datasets/Iris); see the resources section for further details.
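If a local `iris.csv` is not at hand, one possible alternative is the copy of the dataset bundled with scikit-learn. This is a sketch, not the notebook's own loading step. The variable name `iris_alt` is hypothetical, and the column names are an assumption chosen to match the ones used in the rest of this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris()

# Assumed column names, chosen to match the CSV used in this notebook.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_alt = pd.DataFrame(bunch.data, columns=columns)

# Map the integer targets (0, 1, 2) back to species names.
iris_alt["species"] = bunch.target_names[bunch.target]
```

The bundled copy may differ slightly from other published versions of the data, so summary statistics are worth comparing before swapping one for the other.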

In [8]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import itertools

from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
%matplotlib inline
iris = pd.read_csv('iris.csv')

In [11]:
iris.head()
#print(iris.info)
#iris.describe()


| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [16]:
iris.groupby('species').std()

| species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| setosa | 0.35249 | 0.379064 | 0.173664 | 0.105386 |
| versicolor | 0.516171 | 0.313798 | 0.469911 | 0.197753 |
| virginica | 0.63588 | 0.322497 | 0.551895 | 0.27465 |


In [17]:
iris.groupby('species').mean()

| species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| setosa | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |


In [18]:
iris['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [19]:
import operator
from scipy.stats import norm
from functools import reduce


## Exercise 2 - Build a NaiveBayes Class

The derivation can be [found here on Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

The general steps are:
1. Separate By Class
2. Summarize Dataset
3. Summarize Data By Class
4. Generate Gaussian Probability Density Function
5. Class Probabilities

In [79]:
class NaiveBayes:
    """Gaussian Naive Bayes classifier for numeric features."""

    def ComputeGaussPDF(self, X, y):
        """Fit one normal distribution per (class, feature) pair."""
        gauss_dict = {}
        # Group the feature rows by class label.
        for category, rows in X.groupby(y):
            # Frozen scipy distributions built from the sample mean and std.
            gauss_dict[category] = {
                col: norm(loc=rows[col].mean(), scale=rows[col].std())
                for col in rows.columns
            }
        return gauss_dict

    def fit(self, X, y):
        # Prior P(class): relative frequency of each label in the training data.
        self.prior = y.value_counts() / len(y)
        self.gauss = self.ComputeGaussPDF(X, y)
        
    def predict(self, X):
        probs = {}
        for category in self.gauss:
            # Per-feature likelihoods P(x_i | class) for every row of X.
            temp_prob = []
            for col in self.gauss[category]:
                temp_prob.append(self.gauss[category][col].pdf(X[col]))

            # Naive assumption: multiply the likelihoods, then weight by the prior.
            probs[category] = reduce(operator.mul, temp_prob, 1) * self.prior[category]

        # One row per sample, one column per class.
        probs = pd.DataFrame(probs, index=X.index)
        # Normalize each row so the class probabilities sum to 1.
        probs = probs.div(probs.sum(axis=1), axis=0)

        # The predicted class is the column with the highest probability.
        return probs.idxmax(axis=1)

## Exercise 3 - Try it out on the Iris Dataset

In [80]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

train, test = train_test_split(iris, test_size=.5, random_state=0)

x_train = train.drop("species", axis=1)
y_train = train["species"]
x_test = test.drop("species", axis=1)
y_test = test["species"]

nb = NaiveBayes()
nb.fit(x_train, y_train)
accuracy_score(y_test, nb.predict(x_test))

## Exercise 4 - Check via Statsmodels or Scikit-learn

In [82]:
from sklearn.naive_bayes import GaussianNB
GaussianNB().fit(x_train,y_train).score(x_test, y_test)

0.94666666666666666

## Additional Optional Exercises

- Proper documentation for class methods and attributes
- Build with NumPy methods and compare computation time
- Calculate class probabilities as a ratio
- Take the log probabilities
- Update the implementation to support nominal attributes
- Use a different density function instead of Gaussian (e.g., multinomial, Bernoulli, or kernel)
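
As a starting point for the log-probability exercise, the sketch below replaces the product of densities with a sum of log densities, which avoids numerical underflow when there are many features. The class parameters are hypothetical values, not estimates from the Iris data, and the array layout (one row per class) is an assumption of this sketch.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Hypothetical per-class parameters for two features.
log_prior = np.log(np.array([0.5, 0.5]))
means = np.array([[1.0, 2.0], [3.0, 0.5]])  # shape (n_classes, n_features)
stds = np.array([[0.2, 0.2], [0.2, 0.2]])

x_new = np.array([1.1, 2.1])

# The sum of per-feature log densities replaces the product of densities.
log_joint = log_prior + norm.logpdf(x_new, loc=means, scale=stds).sum(axis=1)

# Normalize in log space; logsumexp keeps the denominator stable.
log_posterior = log_joint - logsumexp(log_joint)
posterior = np.exp(log_posterior)
```

The predicted class is `log_posterior.argmax()`, so the final exponentiation is only needed when actual probabilities are required.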