# Naive Bayes Classifier from scratch

Naive Bayes is a popular supervised learning technique. It is simple yet effective, and it scales well to large datasets and sparse matrices such as preprocessed text, where each document becomes a vector with thousands of dimensions, one per word in the vocabulary. It is commonly used for text projects such as sentiment analysis and document categorization, and for predicting categorical labels in projects such as email spam classification. This notebook implements the Gaussian variant with NumPy, modelling each continuous feature with a per-class normal distribution.

<img src='http://shatterline.com/blog/wp-content/uploads/2013/09/bayes-pictorial5.png' height='500px' width='900px'/>

In [3]:
import numpy as np
import pandas as pd

In [101]:
class GaussianNaiveBayes():
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._priors = np.zeros(n_classes, dtype=np.float64)
        
        # calculating the mean, variance and prior P(H) for each class
        for i, c in enumerate(self._classes):
            X_for_class_c = X[y==c]
            self._mean[i, :] = X_for_class_c.mean(axis=0)
            self._var[i, :] = X_for_class_c.var(axis=0)
            # prior P(c) = number of samples in class c / total number of samples
            self._priors[i] = X_for_class_c.shape[0]/float(n_samples)
            
    
    def _calculate_likelihood(self, class_idx, x):
        '''
        Per-feature Gaussian likelihoods P(x_j | class) for one sample x,
        using the mean and variance stored for class_idx during fit.
        Returns an array of shape (n_features,).
        '''
        mean = self._mean[class_idx]
        var = self._var[class_idx]
        num = np.exp(-(x-mean)**2 / (2*var))
        denom = np.sqrt(2 * np.pi * var)
        return num/denom
    
    
    def predict(self, X):
        y_pred = [self._classify_sample(x) for x in X]
        return np.array(y_pred)
    
    
    def _classify_sample(self, x):
        posteriors = []
        # compute the log-posterior (up to a constant) for each class
        for i, _ in enumerate(self._classes):
            prior = np.log(self._priors[i])
            log_likelihood = np.sum(np.log(self._calculate_likelihood(i, x)))
            posteriors.append(prior + log_likelihood)
        # return the class with the highest log-posterior
        return self._classes[np.argmax(posteriors)]
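
`_calculate_likelihood` evaluates the Gaussian density and `_classify_sample` then takes its logarithm. When a feature value lies far from the class mean, the density can underflow to `0.0` and its log becomes `-inf`. A minimal sketch of an alternative, which is not part of the class above, is to compute the log-density directly from $\log \mathcal{N}(x \mid \mu, \sigma^2) = -\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$. The helper names `log_of_density` and `gaussian_log_density` are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_of_density(x, mean, var):
    # Same order of operations as the class above: density first, then log.
    density = np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    with np.errstate(divide="ignore"):  # silence the log(0) warning
        return np.log(density)

def gaussian_log_density(x, mean, var):
    # Log-density computed directly, so no intermediate value underflows.
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# Toy values, made up for illustration: the second feature is 100 standard
# deviations away from the mean.
x = np.array([0.5, 100.0])
mean = np.array([0.0, 0.0])
var = np.array([1.0, 1.0])

naive = log_of_density(x, mean, var)
direct = gaussian_log_density(x, mean, var)
print(naive)   # second entry is -inf because exp(-5000) underflows to 0.0
print(direct)  # second entry stays finite
```

For values near the mean the two functions agree; they differ only once the density itself underflows.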

## Loading data

In [102]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import datasets
import time

In [103]:
X, y = datasets.make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# splitting data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## Training the model and comparing it with scikit-learn's GaussianNB

In [105]:
start = time.perf_counter()
nb = GaussianNaiveBayes()
nb.fit(X_train, y_train)
predictions = nb.predict(X_test)
end = time.perf_counter()
print(f'Numpy NB accuracy: {accuracy_score(y_test, predictions)}')
print(f'Finished in {round(end-start, 3)} second(s)')

Numpy NB accuracy: 0.796
Finished in 0.026 second(s)


In [106]:
from sklearn.naive_bayes import GaussianNB
start = time.perf_counter()
sk_nb = GaussianNB()
# fit once and predict once, so the timing covers the same work as the NumPy version
sk_nb.fit(X_train, y_train)
sk_predictions = sk_nb.predict(X_test)
end = time.perf_counter()
print(f"scikit-learn Naive Bayes accuracy: {accuracy_score(y_test, sk_predictions)}")
print(f'Finished in {round(end-start, 3)} second(s)')

scikit-learn Naive Bayes accuracy: 0.796
Finished in 0.004 second(s)
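
The timings above cover both `fit` and `predict`. One likely contributor to the gap, though it is not benchmarked here, is that `predict` in the NumPy class classifies one sample at a time in a Python loop. A sketch of a vectorized alternative using NumPy broadcasting follows. The function `predict_vectorized` and its argument names are hypothetical; it takes the arrays that `fit` stores (`_classes`, `_mean`, `_var`, `_priors`) as plain arguments so that the block runs standalone.

```python
import numpy as np

def predict_vectorized(X, classes, mean, var, priors):
    # X has shape (n_samples, n_features); mean and var have shape
    # (n_classes, n_features). Broadcasting yields an array of shape
    # (n_samples, n_classes, n_features).
    diff = X[:, None, :] - mean[None, :, :]
    log_lik = (-0.5 * np.log(2 * np.pi * var)[None, :, :]
               - diff ** 2 / (2 * var[None, :, :]))
    # Sum over features and add the log-prior: shape (n_samples, n_classes).
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

# Toy parameters, made up for illustration only.
classes = np.array([0, 1])
mean = np.array([[0.0, 0.0], [5.0, 5.0]])
var = np.ones((2, 2))
priors = np.array([0.5, 0.5])
X_demo = np.array([[0.2, -0.1], [4.8, 5.3], [1.0, 0.5]])

demo_pred = predict_vectorized(X_demo, classes, mean, var, priors)
print(demo_pred)  # [0 1 0]
```

With a fitted `GaussianNaiveBayes` instance `nb`, the call would be `predict_vectorized(X_test, nb._classes, nb._mean, nb._var, nb._priors)`. The intermediate array holds n_samples × n_classes × n_features values, so very large inputs may be better processed in batches.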
