
In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal as mvn

**Bayes' Theorem**

\begin{align}
    P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)}
\end{align}

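*Notes*

* Each likelihood P(X | Y) is modelled as a Gaussian, with the features treated as conditionally independent given the class (the "naive" assumption), so only a mean and a variance per feature and class are needed
* P(X) can be ignored because it is the same for every class, so it does not change which P(Y | X) is largest
* Taking the log of the numerator turns the product into a sum, so classes can be compared using log(P(X | Y)) + log(P(Y))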
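The reason for working with logs can be illustrated with a small sketch. The feature count and values below are made up for illustration: with many features, multiplying the individual densities underflows to zero in floating point, while the sum of log-densities stays finite.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical example: 1000 features, each observed two standard
# deviations away from its class mean
x = np.full(1000, 2.0)

# Multiplying the raw densities (each about 0.054) underflows float64
product = np.prod(norm.pdf(x, loc=0.0, scale=1.0))

# Summing the log-densities gives a large negative but finite number
log_sum = np.sum(norm.logpdf(x, loc=0.0, scale=1.0))

print(product)
print(log_sum)
```

Once the product has collapsed to zero for every class, the classes can no longer be ranked, whereas the log-scores remain comparable.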



**PSEUDO CODE**

In [None]:
'''
1. For each class, estimate the per-feature means and variances for the likelihoods, and the prior probability P(Y)
2. For each row of X_test, calculate the log-posterior (up to a constant) for each class
3. For each row of X_test, compare the log-posteriors and pick the class with the largest one
4. Return the predicted class for each row in X_test
'''
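Step 2 can be written out explicitly. The notation here is introduced for illustration and is not from the original notebook: $c$ is a class, $d$ the number of features, and $\mu_{c,j}$, $\sigma_{c,j}^2$ the mean and variance of feature $j$ within class $c$. Assuming Gaussian likelihoods and features that are conditionally independent given the class, the score compared across classes is

\begin{align}
    \log P(Y = c) + \log P(X = x | Y = c) &= \log P(Y = c) + \sum_{j=1}^{d} \log \mathcal{N}(x_j | \mu_{c,j}, \sigma_{c,j}^2) \\
    \log \mathcal{N}(x_j | \mu_{c,j}, \sigma_{c,j}^2) &= -\frac{1}{2} \log(2 \pi \sigma_{c,j}^2) - \frac{(x_j - \mu_{c,j})^2}{2 \sigma_{c,j}^2}
\end{align}

This equals $\log P(Y = c | X = x)$ up to the term $-\log P(X = x)$, which is the same for every class and so does not affect the comparison in step 3.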

**BUILDING THE NAIVE BAYES CLASSIFIER**

In [None]:
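class NaiveBayes():
  '''Gaussian Naive Bayes classifier with a diagonal covariance per class.'''

  def fit(self, X, y, smoothing=1e-2):
    # Work on NumPy arrays so DataFrames and arrays behave the same
    # (np.var uses ddof=0, whereas pandas defaults to ddof=1)
    self.X_train = np.asarray(X)
    self.y_train = np.asarray(y)

    self.classes = np.unique(self.y_train)

    self.x_means = {}
    self.x_vars = {}
    self.priors = {}

    for c in self.classes:
      temp_x = self.X_train[self.y_train == c]

      self.x_means[c] = temp_x.mean(axis=0)
      # smoothing keeps the variance positive when a feature is constant within a class
      self.x_vars[c] = temp_x.var(axis=0) + smoothing
      self.priors[c] = len(temp_x) / len(self.y_train)

  def predict(self, X):

    X = np.asarray(X)

    # for each row of X, calculate the log-posterior (up to a constant) for each class
    posteriors = []

    for x in X:
      temp_list = []

      for c in self.classes:
        # a 1-D variance vector is treated by mvn as a diagonal covariance
        posterior = mvn.logpdf(x, self.x_means[c], self.x_vars[c]) + np.log(self.priors[c])
        temp_list.append(posterior)

      posteriors.append(temp_list)

    # map the arg-max column back to its class label rather than returning the raw index
    return self.classes[np.argmax(posteriors, axis=1)]

  def score(self, X, y):
    # fraction of rows whose predicted class matches y
    y_pred = self.predict(X)
    return np.mean(y_pred == np.asarray(y))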
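The `mvn.logpdf` call in `predict` receives a 1-D vector of variances, which SciPy interprets as the diagonal of the covariance matrix. The sketch below, using made-up means and variances for a single class, checks that this is the same as summing independent univariate Gaussian log-densities, which is exactly the naive independence assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn, norm

# Made-up values for one class: a sample row, per-feature means and variances
x = np.array([1.0, 0.0, 0.0])
means = np.array([0.6, 0.8, 0.8])
variances = np.array([0.25, 0.17, 0.17])

# A 1-D `cov` is taken as the diagonal entries of the covariance matrix
joint = mvn.logpdf(x, mean=means, cov=variances)

# Sum of univariate Gaussian log-densities (scale is the standard deviation)
per_feature = norm.logpdf(x, loc=means, scale=np.sqrt(variances)).sum()

print(np.isclose(joint, per_feature))
```

Passing a full 2-D covariance matrix instead would model correlations between features, and the classifier would no longer be "naive".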

**TESTING THE MODEL**

In [None]:
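# Toy spam dataset: three binary word indicators and the spam label
test = pd.DataFrame(data={'money':[1,0,1,0,1,1,0,0,1,0],
                          'free':[1,1,0,1,1,0,0,1,0,0],
                          'pills':[1,1,1,1,0,0,0,0,0,0],
                          'spam':[1,1,1,1,1,0,0,0,0,0]},columns=['money','free','pills','spam'])

# Train on all ten rows
X_train = test.drop('spam',axis=1)
y_train = test['spam']

# Note: the "test" rows are the last two rows of the training data, so the
# score measures fit to data already seen rather than generalization
X_test = test.iloc[8:].drop('spam',axis=1)
y_test = test.iloc[8:]['spam']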
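Since `X_test` above is taken from rows 8 and 9 of the same frame used for training, a held-out split may be preferable when the goal is to estimate generalization. A minimal sketch follows; the 80/20 ratio, the seed, and the `_tr`/`_te` variable names are arbitrary choices for illustration, not part of the original notebook.

```python
import pandas as pd

# Same toy data as in the cell above
data = pd.DataFrame({'money': [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
                     'free':  [1, 1, 0, 1, 1, 0, 0, 1, 0, 0],
                     'pills': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                     'spam':  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]})

# Shuffle once with a fixed seed, then cut at 80%
shuffled = data.sample(frac=1, random_state=0)
n_train = int(0.8 * len(shuffled))

train_part = shuffled.iloc[:n_train]
test_part = shuffled.iloc[n_train:]

X_tr, y_tr = train_part.drop('spam', axis=1), train_part['spam']
X_te, y_te = test_part.drop('spam', axis=1), test_part['spam']

print(len(X_tr), len(X_te))
```

With only ten rows, any score computed on two held-out rows will be very noisy, so this is a pattern to follow on larger data rather than a meaningful evaluation here.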

In [None]:
nb = NaiveBayes()
nb.fit(X_train,y_train)

nb.predict(X_test)

array([0, 0])
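`predict` returns only the winning class. If normalized posteriors P(Y | X) are wanted, the P(X) term that was dropped can be restored by summing over classes in log space. The sketch below uses hypothetical log-scores rather than values produced by the model above.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical unnormalized scores log P(x | y) + log P(y):
# one row per sample, one column per class
log_joint = np.array([[-1.2, -4.0],
                      [-0.5, -6.3]])

# log P(x) is the log-sum-exp over classes; subtracting it normalizes each row
log_posterior = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
posterior = np.exp(log_posterior)

print(posterior.sum(axis=1))      # each row sums to 1
print(posterior.argmax(axis=1))   # same winner as before normalization
```

Because the same constant is subtracted from every class in a row, the arg-max is unchanged, which is why the classifier can skip this step.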

In [None]:
nb.score(X_test, y_test)

1.0