**This notebook builds the Gaussian Naive Bayes algorithm from scratch, applies it to the iris dataset, and compares the result with the sklearn implementation.**

In [143]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%pylab inline

Populating the interactive namespace from numpy and matplotlib




In [266]:
from sklearn.datasets import load_iris

In [267]:
iris=load_iris()

In [268]:
x=iris.data
y=iris.target


In [269]:
from sklearn.model_selection import train_test_split

In [271]:
x_train, x_test, y_train, y_test=train_test_split(x,y, random_state=5) #split into test and train

Now let's build the Gaussian Naive Bayes class
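
For reference, the decision rule that the class below implements can be written out as follows. The notation is introduced here for illustration and is not part of the original notebook: $c$ is a class, $x_j$ is the $j$-th feature of a point, and $\mu_{c,j}$, $\sigma^2_{c,j}$ are the mean and variance of feature $j$ among the training points of class $c$.

$$
p(x_j \mid c) = \frac{1}{\sqrt{2\pi\sigma^2_{c,j}}} \exp\left(-\frac{(x_j-\mu_{c,j})^2}{2\sigma^2_{c,j}}\right)
$$

$$
\hat{y} = \arg\max_{c} \; P(c) \prod_{j} p(x_j \mid c)
$$

Here $P(c)$ is the fraction of training points in class $c$ (stored as `p_class` in the code), and the product over features is where the "naive" independence assumption enters.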

In [257]:
class GaussianNB():
    
    def __init__(self):
        pass
    
    def fit(self, X,y):
        
        def separate_classes(X,y):
            '''
            This function separates features by class
            '''
            
            classes=[[x for x,t in zip(X,y) if t==c] for c in np.unique(y)]
            return classes
        
        
        def make_class_dicts(classes, labels):
            '''
            This function creates dictionaries for each class containing required stats
            '''
            
            num_samples=len(labels)
            class_dict=[]

            for x,y in zip(classes, np.unique(labels)):
                b={}
                b['class']=y
                b['p_class']=len(x)/num_samples
                b['mean']=np.mean(x, axis=0)
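                # per-feature variance; the original run used np.mean here by mistake,
                # so the score recorded below reflects that version
                b['var']=np.var(x, axis=0)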
                class_dict.append(b)

            return class_dict
        
        # for the fit method, the overall goal is to make these class dicts for each of our classes
        # we can then use these class dicts to get probabilities that new points belong to each class
        
        self.classes=separate_classes(X, y)
        self.class_dict=make_class_dicts(self.classes, y)
        
    
    def predict(self, X):
        
        def find_proba_for_each_class(point, class_dict):
            '''
            This function compares a point to each class and returns the prob of belonging to each class
            '''
            
            probs=[]
            for y in class_dict:
                p = 1/(np.sqrt(2*np.pi*y['var'])) * np.exp((-(point-y['mean'])**2)/(2*y['var']))
                p=np.prod(p)*y['p_class']
                probs.append((p, y['class']))
            return probs
        
        def get_labels(probs):
            '''
            This function returns an array with the class label belonging to highest probability 
            '''
            return [max(x, key=lambda x:x[0])[1] for x in probs]
        
        # the basis of the predict method is to score each point against each class
        # (class prior times the product of the per-feature Gaussian densities)
        # then we simply return the class label associated with the greatest score
        
        probs=np.array([find_proba_for_each_class(point, self.class_dict) for point in X ])
        
        self.predicted_labels=get_labels(probs)
        
        return self.predicted_labels
    
    
    def score(self, X, y): 
        
        predicted_labels=self.predict(X)
        
        score=0
        
        for i in range(len(y)):
            
            if predicted_labels[i]==y[i]: #check to see if labels match
                score+=1
                
        return score/len(y) #this returns the accuracy (or ratio of correct guesses)


In [258]:
nb=GaussianNB()

In [272]:
nb.fit(x_train,y_train)

In [280]:
nb.score(x_test,y_test)

0.8947368421052632
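
The `predict` method above multiplies the per-feature Gaussian densities directly. With only four features that is fine, but with many features the product can underflow to zero, so a common variant sums log densities instead. The following is a minimal sketch of that variant, not the notebook's method: `log_joint` and `toy_class_dict` are hypothetical names, the statistics are made up, and each dict is assumed to have the same keys as the entries of `class_dict`.

```python
import numpy as np

def log_joint(point, class_dict):
    """Return (log score, class label) pairs for one sample.

    Each entry of class_dict is assumed to hold 'class', 'p_class',
    'mean' and 'var', mirroring the dicts built in fit().
    """
    scores = []
    for stats in class_dict:
        # log of the Gaussian density, one term per feature
        log_pdf = (-0.5 * np.log(2 * np.pi * stats['var'])
                   - (point - stats['mean']) ** 2 / (2 * stats['var']))
        # summing logs replaces multiplying densities
        total = np.sum(log_pdf) + np.log(stats['p_class'])
        scores.append((total, stats['class']))
    return scores

# toy, made-up statistics for two classes with two features each
toy_class_dict = [
    {'class': 0, 'p_class': 0.5,
     'mean': np.array([0.0, 0.0]), 'var': np.array([1.0, 1.0])},
    {'class': 1, 'p_class': 0.5,
     'mean': np.array([3.0, 3.0]), 'var': np.array([1.0, 1.0])},
]

scores = log_joint(np.array([2.5, 2.9]), toy_class_dict)
# the label with the largest log score is the prediction
predicted = max(scores, key=lambda s: s[0])[1]
```

Because the logarithm is monotonic, the class with the largest log score is the same class that would win under the plain product.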

Our model achieved an accuracy of 89.5%. Now let's see how this compares to sklearn.

In [274]:
from sklearn import naive_bayes

In [275]:
sk_nb=naive_bayes.GaussianNB()

In [283]:
sk_nb.fit(x_train, y_train)

GaussianNB(priors=None)

In [277]:
sk_nb.score(x_test, y_test)

0.9210526315789473

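Oh no, the sklearn implementation reached 92.1% accuracy, which is one more correct prediction out of the 38 test points than the 89.5% recorded above. sklearn does add a small smoothing term to the variances by default, but that term is tiny, so a gap like this more likely comes from a difference in the per-class variance estimates than from smoothing.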
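
To make the smoothing idea concrete, here is a rough sketch of additive variance smoothing. It is modelled on the `var_smoothing` parameter documented for recent sklearn releases (default `1e-9`, described as a portion of the largest feature variance added to all variances), and is not sklearn's actual code. The function name `class_variances` and the toy data are hypothetical.

```python
import numpy as np

def class_variances(X, y, var_smoothing=1e-9):
    """Per-class, per-feature variances plus a small additive smoothing term."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # smoothing term: a tiny fraction of the largest feature variance in X
    epsilon = var_smoothing * X.var(axis=0).max()
    return {c: X[y == c].var(axis=0) + epsilon for c in np.unique(y)}

# toy, made-up data: the second feature is constant within class 0
X_toy = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [7.0, 1.0], [9.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1])

smoothed = class_variances(X_toy, y_toy)
unsmoothed = class_variances(X_toy, y_toy, var_smoothing=0.0)
```

In the toy data, the unsmoothed variance of the second feature in class 0 is exactly zero, which would cause a division by zero in the Gaussian density. The smoothed version keeps it slightly positive while leaving the other variances essentially unchanged.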