# Naive Bayes Classifier

## 1. Naive Bayes algorithm

This IPython notebook implements the naive Bayes algorithm, which is easy to understand and gives fairly good accuracy on some problems.

**The question we are going to answer:**

This is a simple example taken from Wikipedia, used here to show how the Naive Bayes algorithm works:

### Problem: classify whether a given person is a male or a female based on the measured features. The features include height, weight, and foot size.

In [4]:
## Import the packages

import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk

%matplotlib notebook

In [5]:
## dataset columns: [height(feet), weight(lbs), foot size(inches), gender(male:1, female:0)]


dataset = np.array([[6,180,12,1],[5.92,190,11,1],[5.58,170,12,1],[5.92,165,10,1],[5,100,6,0],[5.5,150,8,0],[5.42,130,7,0],[5.75,150,9,0]],dtype=float)

In [6]:
dataset

array([[   6.  ,  180.  ,   12.  ,    1.  ],
       [   5.92,  190.  ,   11.  ,    1.  ],
       [   5.58,  170.  ,   12.  ,    1.  ],
       [   5.92,  165.  ,   10.  ,    1.  ],
       [   5.  ,  100.  ,    6.  ,    0.  ],
       [   5.5 ,  150.  ,    8.  ,    0.  ],
       [   5.42,  130.  ,    7.  ,    0.  ],
       [   5.75,  150.  ,    9.  ,    0.  ]])

In [7]:
train_x = dataset[:,:-1]

target = dataset[:,-1].reshape((1,dataset.shape[0]))

In [8]:
target

array([[ 1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.]])
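Before building the model, it can help to look at the per-class statistics directly. The following is a minimal sketch, assuming the same eight rows as above; the names `data`, `features`, and `labels` are illustrative and not part of the notebook, and `labels` is kept 1-D here rather than in the `(1, m)` shape used for `target`.

```python
import numpy as np

# Same eight rows as the notebook's dataset:
# [height (feet), weight (lbs), foot size (inches), gender (male:1, female:0)]
data = np.array([[6, 180, 12, 1], [5.92, 190, 11, 1], [5.58, 170, 12, 1],
                 [5.92, 165, 10, 1], [5, 100, 6, 0], [5.5, 150, 8, 0],
                 [5.42, 130, 7, 0], [5.75, 150, 9, 0]], dtype=float)

features = data[:, :-1]
labels = data[:, -1]  # 1-D label vector (illustrative; the notebook uses shape (1, m))

# A boolean mask selects the rows of one class; axis=0 reduces down each column.
female_mean = features[labels == 0].mean(axis=0)
male_mean = features[labels == 1].mean(axis=0)

# ddof=1 gives the sample variance (divide by m - 1).
female_var = features[labels == 0].var(axis=0, ddof=1)
male_var = features[labels == 1].var(axis=0, ddof=1)

print("female mean:", female_mean, "var:", female_var)
print("male mean:  ", male_mean, "var:", male_var)
```

The two classes differ most clearly in weight and foot size, which is why those features dominate the classification of the test point later on.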

## Implement our Bayes algorithm
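The notebook does not write the decision rule out, so the following is a hedged summary of what the functions in this section appear to compute. It assumes each feature is Gaussian and conditionally independent given the class; the symbols $\mu_{c,i}$ and $\sigma_{c,i}^{2}$ (mean and variance of feature $i$ within class $c$) are notation introduced here for illustration.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} p(x_i \mid c),
\qquad
p(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right)
```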

In [75]:
## define Gaussian distribution function

def Gauss(x, mean, var):
    """
    Gaussian probability density of x, assuming the data is normally distributed with the given mean and variance.
    """
    coef = 1/np.sqrt(2*np.pi*var)
    expo = np.exp(-(x-mean)**2/(2*var))
    return coef*expo


def Mean_array(train_X,target):
    """
    Calculate the per-class mean of each feature (column of the dataset).
    These means are part of the trained model.
    
    input: 1.training array with size m*n
           m: the number of observations
           n: the number of features
           
           2.target array with size 1*m
           m: the number of observations
    
    return: numpy array of mean values. with size k*n
            n: the number of features
            k: the number of class
    """
    
    k = len(np.unique(target))
    
    # use c for the class label so it does not shadow k (the number of classes)
    mean_array = np.array([[train_X[(target==c)[0]][:,i].mean() for i in range(train_X.shape[1])] for c in np.unique(target)])
    
    assert mean_array.shape==(k,train_X.shape[1])
    
    return mean_array


def Var_array(train_X,target):
    """
    Calculate the per-class sample variance (ddof=1) of each feature (column of the dataset).
    These variances are part of the trained model.
    
    input: 1.training array with size m*n
           m: the number of observations
           n: the number of features
           
           2.target array with size 1*m
           m: the number of observations
    
    return: numpy array of variance values. with size k*n
            n: the number of features
            k: the number of class
    """
    
    k = len(np.unique(target))
    
    # use c for the class label so it does not shadow k (the number of classes)
    var_array = np.array([[train_X[(target==c)[0]][:,i].var(ddof=1) for i in range(train_X.shape[1])] for c in np.unique(target)])
    
    assert var_array.shape==(k,train_X.shape[1])
    
    return var_array


def find_class(A):
    """
    Helper that returns the index of the largest value in A,
    i.e. the index of the most probable class.
    """
    return int(np.argmax(A))


def Cal_once(in_array,mean_matrix,var_matrix,target,prior_p):
    """
    Helper that scores a single test case.
    
    input: 1. in_array with size (num_of_features,)
           2. mean_matrix with size (num_of_class, num_of_features)
           3. var_matrix with size (num_of_class, num_of_features)
           4. target with size (1, num_of_observations)
           5. prior_p with size (1, num_of_classes)
           
    return: array of unnormalized posteriors (prior * likelihood) with size (num_of_classes, )
    """
    
    num_of_features = in_array.shape[0]
    num_of_class = len(np.unique(target))
    
    
    # product over features of the per-feature Gaussian densities, one value per class
    likelihood = np.array([[Gauss(in_array[i],mean_matrix[c,i],var_matrix[c,i]) for i in range(num_of_features)] for c in range(num_of_class)]).prod(axis=1)
    result = np.array([ prior_p[0][i]*likelihood[i] for i in range(num_of_class)])
    
    return result


def Predict(train_X,target,test_x,prior_p):
    """
    Predict the class label of every row in the test dataset,
    using means and variances estimated from the training data.
    
    input: 1.training array with size m*n
           m: the number of observations
           n: the number of features
           
           2. target array with size 1*m
           m: the number of observations
           
           3. test set with size num_of_test*n
           num_of_test: the number of test
           n: the number of features
           
           4. prior_p with size 1*num_of_class, this is the prior probability of Bayes theorem.
           num_of_class: the number of classes
           
    return: numpy array of predicted class labels, with size (num_of_test,)
            num_of_test: the number of test
    """
    
    num_of_observations = train_X.shape[0]
    num_of_features = train_X.shape[1]
    num_of_class = len(np.unique(target))
    
    mean_matrix = Mean_array(train_X,target)
    var_matrix = Var_array(train_X,target)
    
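    # score every test row (not every class) and map the winning index back to its label
    classes = np.unique(target)
    result = np.array([classes[find_class(Cal_once(test_x[i,:],mean_matrix,var_matrix,target,prior_p))] for i in range(test_x.shape[0])])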
    
    
    return result


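Multiplying many small densities, as `Cal_once` does, can underflow when there are many features. One common workaround is to sum log-densities instead. The block below is a sketch of that idea, not the notebook's method: `log_scores`, `X`, `y`, and `sample` are hypothetical names, the data is copied from the dataset cell, and labels are kept 1-D for simplicity.

```python
import numpy as np

# Rows copied from the notebook's dataset cell:
# [height (feet), weight (lbs), foot size (inches), gender (male:1, female:0)]
data = np.array([[6, 180, 12, 1], [5.92, 190, 11, 1], [5.58, 170, 12, 1],
                 [5.92, 165, 10, 1], [5, 100, 6, 0], [5.5, 150, 8, 0],
                 [5.42, 130, 7, 0], [5.75, 150, 9, 0]], dtype=float)
X, y = data[:, :-1], data[:, -1]
classes = np.unique(y)  # sorted class labels: [0., 1.]


def log_scores(x, X, y, priors):
    """Return log(prior * likelihood) for each class, in np.unique(y) order."""
    scores = []
    for c, prior in zip(np.unique(y), priors):
        rows = X[y == c]
        mean = rows.mean(axis=0)
        var = rows.var(axis=0, ddof=1)  # sample variance, as in Var_array
        # log of the Gaussian density, evaluated for every feature at once
        log_pdf = -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
        scores.append(np.log(prior) + log_pdf.sum())
    return np.array(scores)


sample = np.array([6, 130, 8], dtype=float)
scores = log_scores(sample, X, y, priors=[0.5, 0.5])
predicted = classes[np.argmax(scores)]

print(scores)
print(predicted)
```

Because the logarithm is monotonic, the class with the largest log-score is the same class that maximizes the original product.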

In [79]:
## Test:

gender = {0:'female',1:'male'}  # matches the dataset encoding (male:1, female:0)

mean_matrix = Mean_array(train_x,target)
var_matrix  = Var_array(train_x, target)

prior_p = np.array([[0.5,0.5]])

test = np.array([6,130,8])

gen = find_class(Cal_once(test,mean_matrix,var_matrix,target,prior_p))

print (Cal_once(test,mean_matrix,var_matrix,target,prior_p))
print (find_class(Cal_once(test,mean_matrix,var_matrix,target,prior_p)))
print ('The possible gender of the test data is: '+gender[gen])

[  5.37790918e-04   6.19707184e-09]
0
The possible gender of the test data is: female
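The two printed numbers are unnormalized scores (prior times likelihood), so they do not sum to one. As a small sketch, dividing each score by their sum turns them into posterior probabilities. The values below are copied from the printed output; `scores`, `evidence`, and `posterior` are illustrative names.

```python
# Unnormalized scores copied from the printed output above
# (index 0 is class 0, index 1 is class 1).
scores = [5.37790918e-04, 6.19707184e-09]

# The evidence is the sum of the per-class scores.
evidence = sum(scores)
posterior = [s / evidence for s in scores]

print(posterior)
```

Class 0 ends up with almost all of the probability mass, which matches the index returned by `find_class`.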
