# Naive Bayes Classifier

## 1. Naive Bayes algorithm

This IPython notebook implements the Gaussian Naive Bayes algorithm, which is easy to understand and gives fairly good accuracy on some problems.

**The question we are going to answer:**

This is a simple example taken from Wikipedia; we use it to show how the Naive Bayes algorithm works:

### Problem: classify whether a given person is a male or a female based on the measured features. The features include height, weight, and foot size.

In [1]:
## Import the packages

import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import sys

%matplotlib notebook

In [2]:
## dataset columns: [height(feet), weight(lbs), foot size(inches), gender(male:1, female:0)]


dataset = np.array([[6,180,12,1],[5.92,190,11,1],[5.58,170,12,1],[5.92,165,10,1],[5,100,6,0],[5.5,150,8,0],[5.42,130,7,0],[5.75,150,9,0]],dtype=float)

In [3]:
dataset

array([[   6.  ,  180.  ,   12.  ,    1.  ],
       [   5.92,  190.  ,   11.  ,    1.  ],
       [   5.58,  170.  ,   12.  ,    1.  ],
       [   5.92,  165.  ,   10.  ,    1.  ],
       [   5.  ,  100.  ,    6.  ,    0.  ],
       [   5.5 ,  150.  ,    8.  ,    0.  ],
       [   5.42,  130.  ,    7.  ,    0.  ],
       [   5.75,  150.  ,    9.  ,    0.  ]])

In [4]:
train_x = dataset[:,:-1]

target = dataset[:,-1].reshape((1,dataset.shape[0]))

In [5]:
target

array([[ 1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.]])
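
The labels are kept as a 1×m row vector, so comparing them with a class label gives a 2-D boolean array; taking row `[0]` of that comparison yields a 1-D mask that selects the observations of one class. A minimal sketch of that selection, using four rows of the dataset above (the names `X`, `y_row`, `mask`, and `female_rows` are illustrative and not part of the notebook):

```python
import numpy as np

# Four rows taken from the dataset above: [height, weight, foot size]
X = np.array([[6.0, 180.0, 12.0],
              [5.92, 190.0, 11.0],
              [5.0, 100.0, 6.0],
              [5.5, 150.0, 8.0]])

# Labels stored as a 1 x m row vector (male: 1, female: 0)
y_row = np.array([[1.0, 1.0, 0.0, 0.0]])

# (y_row == 0) has shape (1, 4); indexing with [0] gives a 1-D mask of length 4
mask = (y_row == 0)[0]
female_rows = X[mask]

print(mask)         # boolean mask over the four observations
print(female_rows)  # the two rows labelled 0
```

The same mask can then be used to compute per-class statistics, for example `female_rows.mean(axis=0)`.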

## Implement our Bayes algorithm
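
For reference, the functions in the next cell can be read as one way of implementing the usual Gaussian Naive Bayes decision rule. The notation is an assumption made here for illustration: $x_i$ is feature $i$ of the test point, $\mu_{c,i}$ and $\sigma_{c,i}^{2}$ are the mean and variance of feature $i$ within class $c$, and $P(c)$ is the prior.

```latex
\hat{y} = \arg\max_{c} \; \log\!\Big( P(c) \prod_{i=1}^{n} p(x_i \mid c) \Big),
\qquad
p(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}
\exp\!\left( -\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}} \right)
```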

In [6]:
## define Gaussian distribution function

def Gauss(x, mean, var):
    """
    This function evaluates the Gaussian (normal) probability density at x, assuming the data is normally distributed.
    """
    coef = 1/np.sqrt(2*np.pi*var)
    expo = np.exp(-(x-mean)**2/(2*var))
    return coef*expo


def Mean_array(train_X,target):
    """
    This function calculates the mean of each feature (column of the dataset) within each class;
    these per-class means are part of the trained model.
    
    input: 1.training array with size m*n
           m: the number of observations
           n: the number of features
           
           2.target array with size 1*m
           m: the number of observations
    
    return: numpy array of mean values. with size k*n
            n: the number of features
            k: the number of class
    """
    
    k = len(np.unique(target))
    
    # one row of per-feature means for each class label c
    mean_array = np.array([train_X[(target==c)[0]].mean(axis=0) for c in np.unique(target)])
    
    assert mean_array.shape==(k,train_X.shape[1])
    
    return mean_array


def Var_array(train_X,target):
    """
    This function calculates the sample variance of each feature (column of the dataset) within each class;
    these per-class variances are part of the trained model.
    
    input: 1.training array with size m*n
           m: the number of observations
           n: the number of features
           
           2.target array with size 1*m
           m: the number of observations
    
    return: numpy array of variance values. with size k*n
            n: the number of features
            k: the number of class
    """
    
    k = len(np.unique(target))
    
    # one row of per-feature sample variances (ddof=1) for each class label c
    var_array = np.array([train_X[(target==c)[0]].var(axis=0, ddof=1) for c in np.unique(target)])
    
    assert var_array.shape==(k,train_X.shape[1])
    
    return var_array


def find_class(A):
    """
    This is a helper function that returns the index of the largest value in A (the predicted class index).
    """
    vmax = -sys.maxsize; midx = -1
    for i in range(len(A)):
        if (A[i]>vmax):
            vmax=A[i]
            midx=i
    return midx


def Cal_once(in_array,mean_matrix,var_matrix,target,prior_p):
    """
    This is the helper function that scores a single test case.
    
    input: 1. in_array with size (num_of_features,)
           2. mean_matrix with size (num_of_class, num_of_features)
           3. var_matrix with size (num_of_class, num_of_features)
           4. target with size (1, num_of_observations)
           5. prior_p with size (1, num_of_classes)
           
    return: array of log(prior * likelihood) for each class, with size (num_of_classes, )
    """
    
    num_of_features = in_array.shape[0]
    num_of_class = len(np.unique(target))
    
    
    # likelihood[c] is the product over features of the Gaussian density under class c
    likelihood = np.array([[Gauss(in_array[i],mean_matrix[c,i],var_matrix[c,i]) for i in range(num_of_features)] for c in range(num_of_class)]).prod(axis=1)
    result = np.array([ prior_p[0][i]*likelihood[i] for i in range(num_of_class)])
    
    return np.log(result)


def Predict(train_X,target,test_x,prior_p):
    """
    This function fits the model on the training data (per-class means and variances)
    and predicts a class for every row of the test set.
    
    input: 1.training array with size m*n
           m: the number of observations
           n: the number of features
           
           2. target array with size 1*m
           m: the number of observations
           
           3. test set with size num_of_test*n
           num_of_test: the number of test
           n: the number of features
           
           4. prior_p with size 1*num_of_class, this is the prior probability of Bayes theorem.
           num_of_class: the number of classes
           
    return: Y_prediction, a numpy array of predicted class indices with size (1, num_of_test)
            (the index equals the class label when the labels are 0..num_of_class-1)
            num_of_test: the number of test cases
    """
    
    num_of_observations = train_X.shape[0]
    num_of_features = train_X.shape[1]
    num_of_class = len(np.unique(target))
    
    mean_matrix = Mean_array(train_X,target)
    var_matrix = Var_array(train_X,target)
    
    Y_prediction=np.zeros((1,test_x.shape[0]),dtype=int)
    
    result = np.array([Cal_once(test_x[i,:],mean_matrix,var_matrix,target,prior_p) for i in range(test_x.shape[0])])
    
    for i in range(result.shape[0]):
        Y_prediction[0,i]=find_class(result[i,:])

    return Y_prediction


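def model_accuracy(test_y, result):
    """
    Return the fraction of predictions in result that match the labels in test_y.
    Both arrays must be 1-D and have the same shape.
    """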
    
    assert test_y.shape == result.shape
    
    s=0.0
    for i in range(len(test_y)):
        if (test_y[i]==result[i]):
            s=s+1
    
    return s/len(test_y)
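
Multiplying many small densities and only then taking the log can underflow to zero when there are many features. A common alternative is to sum log-densities instead. The sketch below is not the implementation above: `gaussian_log_pdf` and `log_scores` are hypothetical names, the labels are assumed to be 1-D, and the sample variance (`ddof=1`) is used to match `Var_array`.

```python
import numpy as np

def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian density, computed without exponentiating."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

def log_scores(x, X, y, priors):
    """Return log(prior) + sum of per-feature log-densities for each class."""
    scores = []
    for c, p in zip(np.unique(y), priors):
        Xc = X[y == c]                   # rows belonging to class c
        mean = Xc.mean(axis=0)           # per-feature mean
        var = Xc.var(axis=0, ddof=1)     # per-feature sample variance
        scores.append(np.log(p) + gaussian_log_pdf(x, mean, var).sum())
    return np.array(scores)

# Same toy data as the notebook: [height, weight, foot size], male: 1, female: 0
X = np.array([[6, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10],
              [5, 100, 6], [5.5, 150, 8], [5.42, 130, 7], [5.75, 150, 9]], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

scores = log_scores(np.array([6.0, 130.0, 8.0]), X, y, priors=[0.5, 0.5])
print(scores)             # one log score per class, in the order [0, 1]
print(np.argmax(scores))  # index of the most probable class
```

On this small dataset the scores should agree with the product-then-log approach up to floating-point error; the difference only matters when the product of densities becomes too small to represent.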

In [7]:
## Test1:

gender = {1:'male',0:'female'}

mean_matrix = Mean_array(train_x,target)
var_matrix  = Var_array(train_x, target)

prior_p = np.array([[0.5,0.5]])

test = np.array([6,130,8])

gen = find_class(Cal_once(test,mean_matrix,var_matrix,target,prior_p))

print (Cal_once(test,mean_matrix,var_matrix,target,prior_p))
print (find_class(Cal_once(test,mean_matrix,var_matrix,target,prior_p)))
print ('The possible gender of the test data is: '+gender[gen])


## Test2:

gender = {1:'male',0:"female"}

mean_matrix = Mean_array(train_x,target)
var_matrix = Var_array(train_x,target)

prior_p = np.array([[0.5,0.5]])

test = np.array([[6,130,8],[6,180,10]])

gen = Predict(train_x,target,test,prior_p)
print (', '.join(['The '+str(i)+'th test item is '+gender[g] for i, g in enumerate(gen[0])]))

[ -7.5280407  -18.89918894]
0
The possible gender of the test data is: female
The 0th test item is female, The 1th test item is male
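
The two numbers printed for Test1 are unnormalised log posteriors, so they do not sum to one. To read them as probabilities, they can be normalised with the log-sum-exp trick. A minimal sketch that takes the printed values as input (`normalise_log_scores` is a hypothetical helper, not part of the notebook):

```python
import math

def normalise_log_scores(log_scores):
    """Turn unnormalised log posteriors into probabilities that sum to 1."""
    m = max(log_scores)  # subtract the max so the largest term is exp(0) = 1
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Log scores printed by Test1, in class order [female (0), male (1)]
probs = normalise_log_scores([-7.5280407, -18.89918894])
print(probs)
```

Subtracting the maximum before exponentiating keeps the computation stable even when the log scores are very negative.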


## 2. Real-life problem: two-class classification

This is the male and female height and weight example that we have discussed many times before. We use it here as a special case with only two classes.

In [8]:
import pandas as pd
dflog = pd.read_csv("01_heights_weights_genders.csv")
dflog.head()

  Gender     Height      Weight
0   Male  73.847017  241.893563
1   Male  68.781904  162.310473
2   Male  74.110105  212.740856
3   Male  71.730978  220.042470
4   Male  69.881796  206.349801


In [9]:
dflog['gender']=dflog['Gender'].apply(lambda x:1 if x=='Male' else 0)

In [10]:
dflog.head()

  Gender     Height      Weight  gender
0   Male  73.847017  241.893563       1
1   Male  68.781904  162.310473       1
2   Male  74.110105  212.740856       1
3   Male  71.730978  220.042470       1
4   Male  69.881796  206.349801       1


In [11]:
dflog['gender'].value_counts()

1    5000
0    5000
Name: gender, dtype: int64
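
The counts above show a balanced dataset, which is consistent with the uniform prior of 0.5 per class used in this notebook. When the classes are not balanced, one option is to estimate the priors from the label frequencies. A small sketch with made-up labels (the `labels` list and the `estimate_priors` helper are hypothetical and not part of the notebook):

```python
from collections import Counter

def estimate_priors(labels):
    """Return {class: relative frequency} for a sequence of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts[c] / n for c in sorted(counts)}

labels = [1, 1, 1, 0]  # made-up, deliberately unbalanced labels
priors = estimate_priors(labels)
print(priors)  # {0: 0.25, 1: 0.75}
```

For a balanced set such as the 5000/5000 split above, the same helper returns 0.5 for each class.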

In [12]:
## train test splitting

from sklearn.model_selection import train_test_split

# Split the data into a training and test set.
X_train, X_test, y_train, y_test = train_test_split(dflog[['Height','Weight']].values,dflog['gender'].values,random_state=5)

In [13]:
## check the array shapes before reshaping the labels for my own algorithm

print (X_train.shape)
print (X_test.shape)
print (y_train.shape)
print (y_test.shape)

(7500, 2)
(2500, 2)
(7500,)
(2500,)


In [17]:
## Train the model:

y_train_my = y_train.reshape((1,y_train.shape[0]))
y_test_my = y_test.reshape((1, y_test.shape[0]))

prior_p = np.array([[0.5,0.5]])

res_train = Predict(X_train,y_train_my,X_train,prior_p)

res_train = res_train.reshape((res_train.shape[1],))

print("The training set accuracy is {}".format(model_accuracy(y_train,res_train)))

res_test = Predict(X_train,y_train_my,X_test,prior_p)
res_test = res_test.reshape((res_test.shape[1],))

print ("The test set accuracy is {}".format(model_accuracy(y_test,res_test)))

The training set accuracy is 0.8842666666666666
The test set accuracy is 0.896


In [18]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

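# Fit scikit-learn's Gaussian Naive Bayes. The class priors are estimated from
# the training labels here; to fix them at 0.5 each, construct the classifier
# as GaussianNB(priors=np.array([0.5, 0.5])) before calling fit.
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)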


print ("The train set accuracy is {}".format(accuracy_score(y_train,clf.predict(X_train))))
print ("The test set accuracy is {}".format(accuracy_score(y_test, y_pred)))

The train set accuracy is 0.8842666666666666
The test set accuracy is 0.896
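
The hand-written model uses the sample variance (`ddof=1`), while NumPy's default `var` is the population variance (`ddof=0`). As far as I know, scikit-learn's `GaussianNB` uses the latter plus a small smoothing term; treat that as an assumption to check against the scikit-learn documentation. With thousands of samples per class the two estimates differ very little, which is consistent with the identical accuracies above. The sketch below compares the two estimates on the four male heights from section 1, where the gap is easy to see:

```python
import numpy as np

# Male heights from the toy dataset in section 1
male_heights = np.array([6.0, 5.92, 5.58, 5.92])

sample_var = male_heights.var(ddof=1)      # divides by n - 1
population_var = male_heights.var(ddof=0)  # divides by n (NumPy default)

n = len(male_heights)
print(sample_var, population_var)
# The ratio is (n - 1) / n, so it approaches 1 as n grows
print(population_var / sample_var, (n - 1) / n)
```

With n = 4 the ratio is 0.75; with several thousand samples per class it is within a fraction of a percent of 1.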
