# Naive Bayes

Naive Bayes is a classification algorithm that uses Bayes' rule from probability theory to determine the class of an object by comparing the probabilities that it comes from different class distributions. The probability that an unknown data sample $\textbf{x}$ belongs to class $C_{k}$ is given by:

$P(C_{k}|\textbf{x}) = \frac{P(C_{k})P(\textbf{x}|C_{k})}{P(\textbf{x})}$
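As a small numeric illustration of this rule, the sketch below plugs in made-up priors and likelihoods for two classes (the numbers are assumptions, not values from the dataset). It shows that dividing by $P(\textbf{x})$ rescales the scores so they sum to one but does not change which class scores highest.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
priors = np.array([0.65, 0.35])        # P(C_0), P(C_1)
likelihoods = np.array([0.02, 0.05])   # P(x | C_0), P(x | C_1)

numerators = priors * likelihoods      # P(C_k) * P(x | C_k)
evidence = numerators.sum()            # P(x), the same for every class
posteriors = numerators / evidence     # P(C_k | x)

# The predicted class is the same with or without the normalization
predicted_from_numerators = int(numerators.argmax())
predicted_from_posteriors = int(posteriors.argmax())
print(predicted_from_numerators, predicted_from_posteriors)
```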

In practice, the denominator $P(\textbf{x})$ is difficult to calculate. Luckily, it is the same for all classes, so it can be treated as a constant and omitted; we only need to pay attention to the numerator. Assume that there are $N$ training samples, $N_{k}$ of which belong to class $C_{k}$; then the prior probability is:

$P(C_{k}) = \frac{N_{k}}{N}$
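The prior is simply the fraction of training samples that carry each label. A minimal sketch of that count follows; the helper `class_priors` and the toy labels are hypothetical and not part of the notebook's data.

```python
import numpy as np

def class_priors(labels):
    """Return {class: fraction of samples that belong to that class}."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    return {c.item(): float(n / labels.size) for c, n in zip(classes, counts)}

# Hypothetical toy labels: 3 positives out of 8 samples
toy_labels = [0, 1, 0, 0, 1, 0, 1, 0]
priors = class_priors(toy_labels)
print(priors)  # {0: 0.625, 1: 0.375}
```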

To compute the likelihood $P(\textbf{x}|C_{k})$, we assume that the samples of each class are drawn from a Gaussian distribution whose mean and variance are estimated from the training samples of that class. Thus, the likelihood can be computed as:

$P(\textbf{x}|C_{k}) = \mathcal{N}(\textbf{x}; \mu_{k}, \sigma_{k}^{2})$
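Estimating the Gaussian parameters amounts to taking a per-feature mean and variance separately for each class. The sketch below shows this on a toy array; the helper `fit_gaussians` and the data are assumptions made for illustration.

```python
import numpy as np

def fit_gaussians(X, y):
    """Return {class: (per-feature mean, per-feature variance)}."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]  # rows that belong to class c
        params[c.item()] = (Xc.mean(axis=0), Xc.var(axis=0))
    return params

# Hypothetical toy data: two features, two samples per class
toy_X = [[1.0, 10.0], [3.0, 14.0], [6.0, 20.0], [8.0, 24.0]]
toy_y = [0, 0, 1, 1]
params = fit_gaussians(toy_X, toy_y)
for label, (mean, var) in params.items():
    print(label, mean, var)
```

Note that `var` here is the population variance (`ddof=0`), which matches the default of `np.nanvar` used in the notebook code.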

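Naive Bayes further assumes that the features are conditionally independent given the class, so the likelihood factorizes into a product of per-feature terms. To avoid numerical underflow, it is typical to work with the logarithm of the probability (the logarithm is monotonic, so the ranking of the classes is unchanged). Up to the additive constant $-\log P(\textbf{x})$, which is shared by all classes, the final formula becomes:

$\log P(C_{k}|\textbf{x}) = \log P(C_{k}) + \log P(x_{1}|C_{k}) + \log P(x_{2}|C_{k}) + ... + \log P(x_{t}|C_{k}) + \text{const}$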

where $t$ is the dimension of the feature vector.
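Two points about the log form can be checked numerically. First, a product of many small densities underflows to zero while the sum of their logarithms stays finite. Second, the sum of per-feature Gaussian log densities equals the log density of a multivariate Gaussian with a diagonal covariance. The sketch below uses made-up numbers for both checks.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# 1) Underflow: 1000 hypothetical per-feature densities of 0.1 each
densities = np.full(1000, 0.1)
product = np.prod(densities)        # underflows to 0.0 in double precision
log_sum = np.log(densities).sum()   # stays finite, roughly -2302.6

# 2) Sum of per-feature log densities vs. diagonal multivariate Gaussian
x = np.array([1.0, 2.0, 3.0])
mean = np.array([0.5, 2.5, 2.0])
var = np.array([1.0, 4.0, 0.25])

# norm expects the standard deviation as `scale`, hence the square root
log_lik_sum = norm.logpdf(x, loc=mean, scale=np.sqrt(var)).sum()
log_lik_mvn = multivariate_normal.logpdf(x, mean=mean, cov=np.diag(var))

print(product, log_sum)
print(np.isclose(log_lik_sum, log_lik_mvn))
```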

In [1]:
import pandas as pd
import numpy as np
from math import log
from scipy.stats import multivariate_normal

In [3]:
# import data
data = pd.read_csv('pima-indians-diabetes.csv', header = None).values

# Naive Bayes

split = int(0.8*data.shape[0]) # index separating the 80% training part from the 20% testing part
acc = []

# repeat the random train/test split 10 times (repeated hold-out validation)

for i in range(10):
    
    rand_idx = np.arange(data.shape[0])
    np.random.shuffle(rand_idx) 
    train_data = data[rand_idx[0:split]]
    test_data = data[rand_idx[split:]]
    
    # data is either labeled as "1"(positive) or "0"(negative)
    train_pos = train_data[np.where(train_data.T[-1] == 1)]
    train_neg = train_data[np.where(train_data.T[-1] == 0)]
    test_pos = test_data[np.where(test_data.T[-1] == 1)]
    test_neg = test_data[np.where(test_data.T[-1] == 0)]
    
    # calculate per-feature mean and variance of positive and negative training samples, respectively
    pos_mean = np.nanmean(train_pos, axis=0)
    pos_var = np.nanvar(train_pos, axis=0)
    neg_mean = np.nanmean(train_neg, axis=0)
    neg_var = np.nanvar(train_neg, axis=0)
    
    err = 0
    
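    # calculate prior logarithmic probability from the training set only,
    # so that no label information from the test set leaks into the model
    pos_prior_prob = log(train_pos.shape[0]/train_data.shape[0])
    neg_prior_prob = log(train_neg.shape[0]/train_data.shape[0])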
    
    for row in test_data:         
        
        # compute and compare the log posterior (up to a shared constant) of both cases to predict label of test data;
        # a 1-D covariance is treated as the diagonal, and logpdf avoids the underflow of log(pdf)
        pos_prob = pos_prior_prob + multivariate_normal.logpdf(row[0:-1], pos_mean[0:-1], pos_var[0:-1])
        neg_prob = neg_prior_prob + multivariate_normal.logpdf(row[0:-1], neg_mean[0:-1], neg_var[0:-1])

        if pos_prob >= neg_prob:
            label = 1
        else:
            label = 0
    
        if label != row[-1]:
            err += 1
            
    acc.append(1 - err/test_data.shape[0])

In [4]:
print(acc)

[0.7727272727272727, 0.7987012987012987, 0.7142857142857143, 0.7402597402597403, 0.7207792207792207, 0.7727272727272727, 0.7402597402597403, 0.7727272727272727, 0.6818181818181819, 0.7922077922077921]
