# Naïve Bayes Gaussian Classification with [`sklearn.naive_bayes.GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

## [Dr Kieu's Lecture Notes on Naïve Bayes Classification](http://120.108.116.237/~ktduc/DA/Lecs/Topic01%20Classification%20Basics%20Jiawei%20Han.pdf)
Slides 51 to 80

## Summary

## Python Implementation from Lab

### Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, KFold   # for making train, test data splits
from sklearn.naive_bayes import GaussianNB

### Viewing the data from `diabetes.csv` with [`pandas`](https://pandas.pydata.org/pandas-docs/stable/10min.html#min)
Reading the data into a pandas dataframe and printing the data head

In [2]:
data = pd.read_csv("diabetes.csv")
print(data.head())   #prints the first 5 rows in the dataframe

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


Determining the number of distinct class labels

In [3]:
print('Distinct class labels and counts: \n{}'.format(data['Outcome'].value_counts()))

Distinct class labels and counts: 
0    500
1    268
Name: Outcome, dtype: int64


Viewing the data type for each attribute

In [4]:
print(data.dtypes) 

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object


### Data cleaning

In [5]:
inputs = data.loc[:,:'Age']  #takes all columns up to and including the column labelled 'Age'
target = data['Outcome']  #gets the column labelled 'Outcome'

# convert to NumPy arrays (this drops the column labels and the index)
X = inputs.values
y = target.values

# the data is split so that 75% is used for training and 25% for testing (the train_test_split default)
x_train, x_test, y_train, y_test = train_test_split(X, y)
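
The 75/25 split comes from the default `test_size` of `train_test_split`. The sketch below checks this on synthetic data that only mimics the shape of `diabetes.csv` (768 rows, 8 features). The names `X_demo` and `y_demo` are made up for illustration, and the `random_state` and `stratify` arguments are optional extras that the lab code does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (assumption): 768 rows, 8 features,
# 500 negative and 268 positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(768, 8))
y_demo = np.array([0] * 500 + [1] * 268)

# Default call: 25% of the rows go to the test set.
a_train, a_test, b_train, b_test = train_test_split(X_demo, y_demo)
print(a_train.shape, a_test.shape)

# Optional: fix the shuffle for reproducibility and keep the class ratio
# roughly equal in both parts.
s_train, s_test, t_train, t_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
print(t_train.mean(), t_test.mean())  # share of positive labels in each part
```

Without `random_state` the split changes on every run, so the confusion matrix and accuracy reported later will vary slightly between runs.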

### Performing the Gaussian Classification

In [6]:
clf = GaussianNB()  # Gaussian naive Bayes classifier (rather than a decision tree)

clf.fit(x_train, y_train)  # training the model on the training set

GaussianNB(priors=None)
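
To make the fitting step less of a black box, here is a minimal sketch of what a Gaussian naive Bayes model estimates: a prior, a per-feature mean and a per-feature variance for each class, combined into a log posterior. It runs on made-up synthetic data (`X_toy`, `y_toy`), not on `diabetes.csv`, and it is an illustration rather than scikit-learn's actual implementation. The small `eps` term is an assumption that mirrors the default variance smoothing of recent scikit-learn versions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two classes with different feature means.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=2.0, scale=1.5, size=(60, 3)),
])
y_toy = np.array([0] * 100 + [1] * 60)

classes = np.unique(y_toy)
priors = np.array([(y_toy == c).mean() for c in classes])            # P(class)
means = np.array([X_toy[y_toy == c].mean(axis=0) for c in classes])  # per-class feature means
eps = 1e-9 * X_toy.var(axis=0).max()                                 # assumed smoothing term
variances = np.array([X_toy[y_toy == c].var(axis=0) for c in classes]) + eps

def log_posterior(X):
    """Unnormalised log P(class | x) for every row, shape (n_samples, n_classes)."""
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    return np.log(priors)[None, :] + log_lik

# Predict the class with the largest log posterior and compare with scikit-learn.
manual_pred = classes[np.argmax(log_posterior(X_toy), axis=1)]
sk_model = GaussianNB().fit(X_toy, y_toy)
agreement = np.mean(manual_pred == sk_model.predict(X_toy))
print(agreement)
```

The "naive" part is the sum over features inside `log_posterior`: each feature contributes its own one-dimensional Gaussian term, as if the features were independent given the class.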

### A quick Prediction and Accuracy Test
[**`confusion_matrix`**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)  
By definition a confusion matrix $C$ is such that $C_{i,j}$ is equal to the number of observations known to be in group $i$ but predicted to be in group $j$.

Thus in binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$.
$$
\begin{bmatrix}
    TN & FP \\
    FN & TP
\end{bmatrix}
$$
  
  
[**`accuracy_score`**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
\begin{equation}
    \text{accuracy} = \frac{TP+TN}{P+N}, \qquad P = TP+FN, \quad N = TN+FP
\end{equation}

In [7]:
y_pred = clf.predict(x_test)  # testing the model

print('Confusion Matrix: \n{}\n'.format(confusion_matrix(y_test, y_pred)))
print('Accuracy Score: \n{}'.format(accuracy_score(y_test, y_pred)*100))

Confusion Matrix: 
[[106  25]
 [ 22  39]]

Accuracy Score: 
75.52083333333334
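
As a quick check, the accuracy can be recomputed by hand from the confusion matrix printed above. The sketch below copies the four counts from that output and uses the layout described earlier (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[106, 25],
               [22, 39]])

# ravel() flattens row by row, giving TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

# Correct predictions divided by all test observations.
accuracy = (tp + tn) / cm.sum()
print(accuracy * 100)
```

The result, 145 correct out of 192, agrees with the value returned by `accuracy_score`.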


### Rapid accuracy testing with [`sklearn.model_selection.KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
`KFold` partitions the data into $k$ folds (in this case $k=5$). Each fold is used once as the test set while the remaining $k-1$ folds form the training set, which gives a quick accuracy estimate from **multiple models**.  
  
  
**NOTE:** `KFold` yields arrays of indices that refer to rows of the dataset, whereas `train_test_split` returns the split data itself.  
  
  

- `get_n_splits([X, y, groups])`: returns the number of splitting iterations in the cross-validator.
- `split(X[, y, groups])`: generates NumPy arrays of indices that split the data into training and test sets.
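
To see what these index arrays look like, the sketch below runs `KFold` on a tiny made-up array of 10 rows (`X_small` is hypothetical, chosen only so the folds are easy to read).

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten rows, two columns; the values themselves do not matter here.
X_small = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)
print(kf.get_n_splits(X_small))  # number of train/test splits

# split() yields one (train indices, test indices) pair per fold.
folds = list(kf.split(X_small))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

With the default `shuffle=False`, the test folds are consecutive blocks of rows, so every row appears in exactly one test fold.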

In [8]:
k_fold = KFold(n_splits=5)
k_fold.get_n_splits(X)        # returns the number of splitting iterations (5)
accuracies = []

# new model for accuracy testing on k subsets of diabetes.csv
clf2 = GaussianNB()
for train_idx, test_idx in k_fold.split(X):     # split yields index arrays for each train/test split
    train_X, test_X = X[train_idx], X[test_idx]
    train_y, test_y = y[train_idx], y[test_idx]
    clf2.fit(train_X, train_y)
    predictions = clf2.predict(test_X)
    accuracy = accuracy_score(test_y, predictions)*100
    accuracies.append(accuracy)
    
print('Accuracies for 5 tests: {}'.format(accuracies))    
print("The mean accuracy is: ", np.mean(accuracies))

Accuracies for 5 tests: [75.32467532467533, 71.42857142857143, 74.67532467532467, 80.3921568627451, 74.50980392156863]
The mean accuracy is:  75.26610644257703
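
The same kind of loop can also be written with `sklearn.model_selection.cross_val_score`, which is not used in the lab but performs the fit, predict and score steps for each fold. The sketch below uses synthetic data from `make_classification` (an assumption, so that it runs without `diabetes.csv`) and confirms that both approaches give the same per-fold accuracies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data standing in for the lab dataset.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
cv = KFold(n_splits=5)

# Manual loop, in the style of the lab code.
manual_scores = []
for train_idx, test_idx in cv.split(X_syn):
    model = GaussianNB().fit(X_syn[train_idx], y_syn[train_idx])
    manual_scores.append(model.score(X_syn[test_idx], y_syn[test_idx]))

# One call that fits a fresh copy of the estimator on every fold.
cv_scores = cross_val_score(GaussianNB(), X_syn, y_syn, cv=cv)

print(np.allclose(manual_scores, cv_scores))
print(cv_scores.mean())
```

The scores here are fractions between 0 and 1; multiply by 100 to compare them with the percentages printed by the lab code.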
