Naive Bayes classifiers are a family of classification algorithms based on **Bayes’ Theorem**. They are not a single algorithm; all of them share a common principle: every feature is assumed to be conditionally independent of every other feature, given the class label.

**Gaussian Naive Bayes**: the continuous values associated with each feature are assumed to follow a Gaussian (normal) distribution within each class. When plotted, this distribution gives a bell-shaped curve that is symmetric about the mean of the feature values.

**Multinomial Naive Bayes**: feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification. It is used when we have discrete data (e.g. movie ratings ranging from 1 to 5, where each rating occurs with a certain frequency). In text learning, the count of each word is used to predict the class or label.

**Bernoulli Naive Bayes**: in the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features (i.e. whether a word occurs in a document or not) are used rather than term frequencies (i.e. how often a word occurs in the document).
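The difference between term-frequency and term-occurrence features can be sketched in a few lines of plain Python. The vocabulary, the document, and the helper names `term_counts` and `term_occurrence` are hypothetical and chosen only for illustration.

```python
from collections import Counter

def term_counts(tokens, vocabulary):
    """Multinomial-style features: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def term_occurrence(tokens, vocabulary):
    """Bernoulli-style features: 1 if the word occurs at all, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical vocabulary and document, used only for illustration.
vocabulary = ["free", "money", "meeting"]
tokens = "free money free free".split()

count_features = term_counts(tokens, vocabulary)       # [3, 1, 0]
binary_features = term_occurrence(tokens, vocabulary)  # [1, 1, 0]
print(count_features, binary_features)
```

The multinomial model would consume the first vector and the Bernoulli model the second; the Bernoulli features deliberately discard how often a word appears.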

In [1]:
import pandas as pd
import numpy as np

### Multinomial and Bernoulli Naive Bayes
Let's look at the following example:

| X1 | X2 | X3 | y |
|--------|--------|--------|-------|
| A      | N      | E      | 0     |
| A      | M      | E      | 0     |
| B      | M      | G      | 0     |
| B      | L      | F      | 0     |
| A      | K      | G      | 1     |
| B      | L      | E      | 1     |
| A      | M      | G      | 1     |
| B      | N      | F      | 1     |
| B      | L      | G      | 1     |


$$ P(y \mid X) = \frac{ P(X \mid y) \, P(y) }{ P(X) } $$

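Naive Bayes assumes that the features are conditionally independent given the class, so $ P(X \mid y) = P(x_1 \mid y) P(x_2 \mid y) \cdots P(x_n \mid y) $:

$$ P(y \mid x_1,x_2,...,x_n) = \frac{ P(y) \prod_{i=1}^n P(x_i \mid y) }{ P(x_1,x_2,...,x_n) } $$

Since the denominator $ P(x_1,x_2,...,x_n) $ is the same for both classes $ y \in \{0,1\} $, we can ignore it when comparing them: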

$$ \hat{y} = \arg\max_y \, P(y) \prod_{i=1}^n P(x_i \mid y) $$

Total samples = 9

$ P(y=0) = \frac{4}{9}, P(y=1) = \frac{5}{9} $

$ P(X_1=A \mid y=0) = \frac{2}{4}, P(X_1=B \mid y=0) = \frac{2}{4}, P(X_1=A \mid y=1) = \frac{2}{5}, P(X_1=B \mid y=1) = \frac{3}{5} $

$ P(X_2=K \mid y=0) = \frac{0}{4}, P(X_2=L \mid y=0) = \frac{1}{4}, P(X_2=M \mid y=0) = \frac{2}{4}, P(X_2=N \mid y=0) = \frac{1}{4}, P(X_2=K \mid y=1) = \frac{1}{5}, P(X_2=L \mid y=1) = \frac{2}{5}, P(X_2=M \mid y=1) = \frac{1}{5}, P(X_2=N \mid y=1) = \frac{1}{5} $

$ P(X_3=E \mid y=0) = \frac{2}{4}, P(X_3=F \mid y=0) = \frac{1}{4}, P(X_3=G \mid y=0) = \frac{1}{4}, P(X_3=E \mid y=1) = \frac{1}{5}, P(X_3=F \mid y=1) = \frac{1}{5}, P(X_3=G \mid y=1) = \frac{3}{5} $
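The priors and conditional probabilities listed above can be recomputed by simple counting. The following is a minimal sketch that works directly on the letters of the example table and uses exact fractions; the helper name `conditional` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Rows of the example table as (X1, X2, X3, y).
rows = [
    ("A", "N", "E", 0), ("A", "M", "E", 0), ("B", "M", "G", 0), ("B", "L", "F", 0),
    ("A", "K", "G", 1), ("B", "L", "E", 1), ("A", "M", "G", 1), ("B", "N", "F", 1),
    ("B", "L", "G", 1),
]

# Class priors: share of rows belonging to each class.
class_counts = Counter(row[-1] for row in rows)
priors = {label: Fraction(n, len(rows)) for label, n in class_counts.items()}

def conditional(feature_index, value, label):
    """Relative frequency of `value` in column `feature_index` among rows of class `label`."""
    matches = sum(1 for row in rows if row[-1] == label and row[feature_index] == value)
    return Fraction(matches, class_counts[label])

print(priors[0], priors[1])    # 4/9 5/9
print(conditional(0, "A", 1))  # 2/5
print(conditional(1, "K", 0))  # 0
print(conditional(2, "G", 1))  # 3/5
```

Note that `Fraction` reduces its results, so a value written as 2/4 in the text is printed as 1/2.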

Let's compute the class probabilities for the sample $ X_1=B, X_2=M, X_3=E $:

$ P(y=0 \mid X_1=B, X_2=M, X_3=E) \propto P(y=0) \cdot P(X_1=B \mid y=0) \cdot P(X_2=M \mid y=0) \cdot P(X_3=E \mid y=0) = \frac{4}{9} \cdot \frac{2}{4} \cdot \frac{2}{4} \cdot \frac{2}{4} = 0.05555 $

$ P(y=1 \mid X_1=B, X_2=M, X_3=E) \propto P(y=1) \cdot P(X_1=B \mid y=1) \cdot P(X_2=M \mid y=1) \cdot P(X_3=E \mid y=1) = \frac{5}{9} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{1}{5} = 0.01333 $

Normalizing the two scores so that they sum to 1 gives the posterior probabilities:

$ P(y=0 \mid X) = \frac{0.05555}{0.05555+0.01333} = 0.8065, P(y=1 \mid X) = \frac{0.01333}{0.05555+0.01333} = 0.1935 $
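The arithmetic of this worked example can be checked with exact fractions. This is a small sketch that recounts the example table for the sample (B, M, E); the helper name `unnormalized_score` is hypothetical.

```python
from fractions import Fraction

# Rows of the example table as (X1, X2, X3, y).
rows = [
    ("A", "N", "E", 0), ("A", "M", "E", 0), ("B", "M", "G", 0), ("B", "L", "F", 0),
    ("A", "K", "G", 1), ("B", "L", "E", 1), ("A", "M", "G", 1), ("B", "N", "F", 1),
    ("B", "L", "G", 1),
]

def unnormalized_score(sample, label):
    """P(y) times the product of P(x_i | y), both estimated by counting rows."""
    in_class = [row for row in rows if row[-1] == label]
    score = Fraction(len(in_class), len(rows))  # class prior
    for i, value in enumerate(sample):
        matches = sum(1 for row in in_class if row[i] == value)
        score *= Fraction(matches, len(in_class))
    return score

sample = ("B", "M", "E")
scores = {label: unnormalized_score(sample, label) for label in (0, 1)}

# Dividing by the sum of the scores turns them into posterior probabilities.
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}

print(scores[0], scores[1])                    # 1/18 1/75
print(float(posterior[0]), float(posterior[1]))
print(max(posterior, key=posterior.get))       # 0
```

The exact posteriors are 25/31 and 6/31, which agree with the rounded values above.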

In [2]:
X_temp = np.array( [ ['A','N','E'], ['A','M','E'], ['B','M','G'], ['B','L','F'], 
                ['A','K','G'], ['B','L','E'], ['A','M','G'], ['B','N','F'], ['B','L','G']])
y = np.array( [0,0,0,0,1,1,1,1,1] )

X = np.zeros_like(X_temp, dtype=int)  # np.int was removed in NumPy 1.24; use the builtin int

from sklearn.preprocessing import LabelEncoder
X[:,0] = LabelEncoder().fit_transform(X_temp[:,0])
X[:,1] = LabelEncoder().fit_transform(X_temp[:,1])
X[:,2] = LabelEncoder().fit_transform(X_temp[:,2])
X.shape

(9, 3)
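`LabelEncoder` assigns integer codes to the distinct values of a column in sorted order. The pure-Python sketch below mimics that mapping (the helper name `sorted_label_codes` is hypothetical, not part of scikit-learn) to show how the sample (B, M, E) from the worked example becomes the encoded row `[1, 2, 0]`.

```python
def sorted_label_codes(values):
    """Map each distinct value to its rank in sorted order (0, 1, 2, ...)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

# Columns of the example table.
x1 = ["A", "A", "B", "B", "A", "B", "A", "B", "B"]
x2 = ["N", "M", "M", "L", "K", "L", "M", "N", "L"]
x3 = ["E", "E", "G", "F", "G", "E", "G", "F", "G"]

# One independent mapping per column, as in the cell above.
codes = [sorted_label_codes(column) for column in (x1, x2, x3)]

sample = ("B", "M", "E")
encoded = [codes[i][value] for i, value in enumerate(sample)]
print(codes[1])  # {'K': 0, 'L': 1, 'M': 2, 'N': 3}
print(encoded)   # [1, 2, 0]
```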

In [3]:
from collections import defaultdict
summary = []
for label in sorted(np.unique(y)):
    X_label = X[y==label]
    array = []
    for feat_ind in range(X_label.shape[1]):
        probs = defaultdict(float)
        feats,cnts = np.unique(X_label[:,feat_ind], return_counts=True)
        for feat,cnt in zip(feats,cnts):
            probs[feat] = cnt/(y==label).sum()
        array.append(probs)
    summary.append(array)

def predict_row(x):
    probs = []
    labels, cnts = np.unique( y, return_counts=True )
    for label, cnt in zip(labels,cnts):
        prob = cnt/sum(cnts)
        for feat_ind in range(len(x)):
            prob *= summary[int(label)][feat_ind][x[feat_ind]]
        probs.append(prob)
    return np.array(probs)/sum(probs)
predict_row( [1,2,0] )  # encoded sample (X1=B, X2=M, X3=E)

array([0.80645161, 0.19354839])

### Gaussian Naive Bayes

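$$ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\bigg( -\frac{ (x_i-\mu_{y,i})^2 }{ 2\sigma_{y,i}^2 } \bigg) $$

Here $ \mu_{y,i} $ and $ \sigma_{y,i} $ are the mean and standard deviation of feature $ i $ among the training samples of class $ y $.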
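To make the density concrete, here is a minimal standard-library sketch that evaluates it for a single feature. The training values in `feature_by_class` are hypothetical, and the sample standard deviation (`statistics.stdev`, which divides by n-1) is an assumption made to match the correction used in the code cell that follows.

```python
import math
import statistics

def gaussian_pdf(x, mean, stdev):
    """Density of a normal distribution N(mean, stdev**2) evaluated at x."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

# Hypothetical one-feature training values for two classes.
feature_by_class = {0: [1.0, 2.0, 3.0], 1: [7.0, 8.0, 9.0]}

# Per-class mean and sample standard deviation of the feature.
params = {
    label: (statistics.mean(values), statistics.stdev(values))
    for label, values in feature_by_class.items()
}

# Likelihood of a new value under each class; the larger one wins
# when the class priors are equal, as they are here.
x = 2.5
likelihoods = {label: gaussian_pdf(x, mu, sigma) for label, (mu, sigma) in params.items()}
print(likelihoods)
print(max(likelihoods, key=likelihoods.get))  # 0
```

The density is symmetric about the mean, so `gaussian_pdf(1.5, 2.0, 1.0)` and `gaussian_pdf(2.5, 2.0, 1.0)` return the same value.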

In [8]:
def summarize(X, y):
    # Per-class feature means, sample standard deviations (ddof=1) and row counts.
    summary = []
    for label in sorted(np.unique(y)):
        X_label = X[ y==label ]
        summary.append( (X_label.mean(axis=0), X_label.std(axis=0, ddof=1), len(X_label)) )
    return summary

def calculate_probability(x, mean, stdev):
    # Gaussian density of x under N(mean, stdev**2).
    exponent = np.exp(-((x-mean)**2 / (2 * stdev**2 )))
    return (1 / (np.sqrt(2 * np.pi) * stdev)) * exponent

def predict_proba( x ):
    # Uses the module-level `summary`, which must be fitted before calling this.
    probs = []
    total = sum(cnt for _, _, cnt in summary)
    for means,stds,cnt in summary:
        prob = cnt/total  # class prior
        for feature_index in range(len(x)):
            prob = prob*calculate_probability( x[feature_index], means[feature_index], stds[feature_index] )
        probs.append(prob)
    return np.array(probs)/sum(probs)

In [9]:
data = pd.read_csv('data_banknote.txt',header=None).values
X = data[:,:-1]
y = data[:,-1]
# Fit the per-class statistics on the banknote data, not on the toy table above.
summary = summarize(X, y)

y_pred =  np.argmax(np.array([predict_proba(x) for x in X]), axis=1)
(y_pred == y).sum()/len(y)

0.8411078717201166