# Naive Bayes from scratch using numpy

### Bayes' Theorem: $P(A|B) = \frac {P(B|A) \cdot P(A)} {P(B)}$

$P(y|X) = \dfrac {P(X|y) \cdot P(y)} {P(X)} = \dfrac {\prod P(x_i|y) \cdot P(y)} {P(X)}$

Where:
 - $P(A)$ is the probability of $A$ occurring
 - $P(A|B)$ is the conditional probability of $A$ occurring given that $B$ has occurred
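
As a quick numeric check of Bayes' theorem, here is a minimal sketch. All three input probabilities are made-up values chosen for illustration, and $P(B)$ is obtained from the law of total probability.

```python
# Hypothetical numbers, chosen only for illustration.
p_a = 0.01              # P(A): prior probability of A
p_b_given_a = 0.90      # P(B|A)
p_b_given_not_a = 0.05  # P(B|not A)

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # 0.1538
```

Even though $P(B|A)$ is high here, the small prior $P(A)$ keeps $P(A|B)$ low.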

"Naive" refers to the assumption that the features are conditionally independent of one another given the class, which is what allows $P(X|y)$ to be factored into $\prod P(x_i|y)$.

### Classification result = Class with highest posterior probability

$y = \arg\max_y P(y|X) = \arg\max_y \left( \prod P(x_i|y) \cdot P(y) \right)$  
$\because P(X)$ is a positive constant that does not depend on $y$
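
The following sketch illustrates why the denominator can be dropped. The three scores are hypothetical values of $P(X|y) \cdot P(y)$ for three classes, not outputs of any model in this notebook.

```python
import numpy as np

# Hypothetical unnormalized scores P(X|y) * P(y) for three classes.
joint = np.array([0.002, 0.010, 0.004])

# P(X) is the sum of the joint over all classes.
evidence = joint.sum()
posterior = joint / evidence  # true posteriors, which sum to 1

# Dividing by a positive constant never changes which entry is largest.
best_unnormalized = int(np.argmax(joint))
best_normalized = int(np.argmax(posterior))
print(best_unnormalized, best_normalized)  # 1 1
```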

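Because the product of many small probabilities can underflow to zero in floating point, we compare logarithms instead. The logarithm is monotonically increasing, so the argmax is unchanged.  
$y = \arg\max_y \left( \sum \log P(x_i|y) + \log P(y) \right)$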
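
A small sketch of the numerical problem that working in log space avoids. The likelihood values and the feature count of 500 are hypothetical, picked so that the direct product underflows in float64.

```python
import numpy as np

# Hypothetical per-feature likelihoods for two classes, 500 features each.
likelihoods_a = np.full(500, 1e-3)
likelihoods_b = np.full(500, 2e-3)

# The direct products underflow to 0.0, so the two classes become indistinguishable.
prod_a = np.prod(likelihoods_a)
prod_b = np.prod(likelihoods_b)

# Sums of logs stay finite and preserve the ordering between the classes.
log_a = np.sum(np.log(likelihoods_a))
log_b = np.sum(np.log(likelihoods_b))

print(prod_a, prod_b)  # 0.0 0.0
print(log_a < log_b)   # True
```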
  
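- $P(y)$ = Prior probability = relative frequency of each class in the training data
- $P(x_i|y)$ = Likelihood = class-conditional probability, modeled as a Gaussian: $\dfrac 1 {\sqrt {2 \pi \sigma_y^2}} \cdot e^{- \bigg(\dfrac {(x_i - \mu_y)^2} {2 \sigma_y^2}\bigg)}$, where $\mu_y$ and $\sigma_y^2$ are the mean and variance of feature $x_i$ within class $y$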
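
To make the two quantities concrete, here is a minimal sketch on a tiny made-up dataset with one feature and two classes. The helper name `gaussian_pdf` and all data values are assumptions for illustration only.

```python
import numpy as np

# Tiny hypothetical dataset: one feature, two classes.
X = np.array([[1.0], [2.0], [3.0], [6.0], [8.0]])
y = np.array([0, 0, 0, 1, 1])

# P(y): relative frequency of each class
classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)

def gaussian_pdf(x, mean, var):
  # Univariate normal density with the given mean and variance
  return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

# Per-class, per-feature mean and variance
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

# P(x_i|y) for a new point, one row per class
x_new = np.array([2.5])
likelihoods = gaussian_pdf(x_new, means, variances)

print(priors)               # [0.6 0.4]
print(likelihoods.ravel())  # class 0 is far more likely than class 1
```

The point 2.5 lies close to the class-0 mean of 2.0 and far from the class-1 mean of 7.0, so its likelihood under class 0 is much larger.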

In [None]:
import numpy as np

In [None]:
class NaiveBayes:
  def fit(self, X, y):
    nSamples, nFeatures = X.shape
    self._classes = np.unique(y)
    nClasses = len(self._classes)

    # Per-class feature means, feature variances, and class priors
    self._mean = np.zeros((nClasses, nFeatures))
    self._var = np.zeros((nClasses, nFeatures))
    self._priors = np.zeros(nClasses)

    for idx, _class in enumerate(self._classes):
      X_c = X[y == _class]  # training rows that belong to this class
      self._mean[idx, :] = X_c.mean(axis=0)
      self._var[idx, :] = X_c.var(axis=0)
      # Prior = fraction of training samples in this class
      self._priors[idx] = X_c.shape[0] / nSamples

  def predict(self, X):
    # Classify each sample (row of X) independently
    yPred = [self._predict(x) for x in X]
    return np.array(yPred)

  def _predict(self, x):
    posteriors = []

    for idx in range(len(self._classes)):
      # Log prior plus the sum of the per-feature log likelihoods
      logPrior = np.log(self._priors[idx])
      logLikelihood = np.sum(np.log(self._pdf(idx, x)))
      posteriors.append(logLikelihood + logPrior)

    # Pick the class with the highest (unnormalized) log posterior
    return self._classes[np.argmax(posteriors)]

  def _pdf(self, idx, x):
    # Gaussian density of each feature of x under class idx
    mean, var = self._mean[idx], self._var[idx]
    return np.exp(-((x-mean)**2 / (2*var))) / np.sqrt(2*np.pi*var)

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.make_classification(
    n_samples=1000, n_features=10, n_classes=2, random_state=0)

XTrain, XTest, YTrain, YTest = train_test_split(
    X, y, test_size=0.2, random_state=0)

classifier = NaiveBayes()
classifier.fit(XTrain, YTrain)
predictions = classifier.predict(XTest)


def accuracy(yTest, yPred):
  return np.sum(yTest == yPred) / len(yTest)


acc = accuracy(YTest, predictions)

In [None]:
acc