## Naive Bayes Classifier

This Naive Bayes tutorial is based on the post "Evaluating a Classification Model", available at http://www.ritchieng.com/machine-learning-evaluate-classification-model/

## The Data

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Dataset source: https://www.kaggle.com/uciml/pima-indians-diabetes-database/version/1#


In [34]:
# read the data into a Pandas DataFrame
import pandas as pd

df = pd.read_csv('pima_indians_diabetes.csv')
df.head()


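|Unnamed: 0|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin|BMI|DiabetesPedigreeFunction|Age|Outcome|
|----------|-----------|-------|-------------|-------------|-------|---|------------------------|---|-------|
|0         |6          |148    |72           |35           |0      |33.6|0.627                  |50 |1      |
|1         |1          |85     |66           |29           |0      |26.6|0.351                  |31 |0      |
|2         |8          |183    |64           |0            |0      |23.3|0.672                  |32 |1      |
|3         |1          |89     |66           |23           |94     |28.1|0.167                  |21 |0      |
|4         |0          |137    |40           |35           |168    |43.1|2.288                  |33 |1      |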


In [35]:
# define X and y
X = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]

y = df['Outcome']

In [36]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [37]:
# train a Gaussian Naive Bayes model on the training set
from sklearn.naive_bayes import GaussianNB

# instantiate model
nb = GaussianNB()

# fit model
nb.fit(X_train, y_train)

GaussianNB(priors=None)

In [49]:
# make class predictions for the testing set
y_pred_class = nb.predict(X_test)

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0])

**Classification accuracy**: proportion of correct predictions

In [39]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.765625


**Confusion matrix**: a table that describes the performance of a classification model

In [40]:
# IMPORTANT: first argument is true values, second argument is predicted values
# this produces a 2x2 numpy array (matrix)
print(metrics.confusion_matrix(y_test, y_pred_class))

[[114  16]
 [ 29  33]]


|n = 192     |Predicted = 0|Predicted = 1|
|------------|-------------|-------------|
|Actual = 0  |114          |16           |
|Actual = 1  |29           |33           |

**Basic terminology**

* True Positives (TP): we correctly predicted that they do have diabetes: 33
* True Negatives (TN): we correctly predicted that they don't have diabetes: 114
* False Positives (FP): we incorrectly predicted that they do have diabetes (a "Type I error"): 16
* False Negatives (FN): we incorrectly predicted that they don't have diabetes (a "Type II error"): 29
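
To make the four terms concrete, the sketch below tallies them by hand from two short label lists. The labels are invented for illustration and are not taken from the diabetes test set; the result is arranged in the same row/column layout that `confusion_matrix` uses (rows are actual values, columns are predicted values).

```python
# Toy labels, invented for illustration (1 = has diabetes, 0 = does not)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

# Count each of the four outcomes by comparing actual and predicted labels
TP = sum(1 for t, p in pairs if t == 1 and p == 1)
TN = sum(1 for t, p in pairs if t == 0 and p == 0)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)

# Same layout as the confusion matrix: rows = actual, columns = predicted
confusion = [[TN, FP],
             [FN, TP]]
print(confusion)
```

Every sample falls into exactly one of the four cells, so the counts always add up to the number of samples.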

In [41]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
print(confusion)
#[row, column]
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]

[[114  16]
 [ 29  33]]


In [42]:
# Classification Accuracy: Overall, how often is the classifier correct?

# use float to perform true division, not integer division
print((TP + TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))

0.765625
0.765625


In [43]:
# Classification Error: Overall, how often is the classifier incorrect?

# Also known as "Misclassification Rate"
classification_error = (FP + FN) / float(TP + TN + FP + FN)

print(classification_error)
print(1 - metrics.accuracy_score(y_test, y_pred_class))

0.234375
0.234375


**Sensitivity**: When the actual value is positive, how often is the prediction correct?

* Something we want to maximize
* How "sensitive" is the classifier to detecting positive instances?
* Also known as "True Positive Rate" or "Recall"
* TP / all positive
    * all positive = TP + FN

In [44]:
sensitivity = TP / float(FN + TP)

print(sensitivity)
print(metrics.recall_score(y_test, y_pred_class))

0.532258064516
0.532258064516


**Specificity**: When the actual value is negative, how often is the prediction correct?

* Something we want to maximize
* How "specific" (or "selective") is the classifier in predicting positive instances?
* TN / all negative
    * all negative = TN + FP

In [45]:
# use float to perform true division, as in the other cells
specificity = TN / float(TN + FP)

print(specificity)

0.876923076923
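
Unlike sensitivity, specificity is computed by hand here rather than checked against a `metrics` function. One possible cross-check, sketched below, relies on the fact that specificity is simply the recall of the negative class, so `recall_score` with `pos_label=0` should return the same number. The label lists are rebuilt from the confusion-matrix counts (114, 16, 29, 33) and are only a stand-in for `y_test` and `y_pred_class`.

```python
from sklearn import metrics

# Stand-in labels rebuilt from the confusion-matrix counts (sample order is arbitrary)
y_true = [0] * 130 + [1] * 62                          # 130 actual negatives, 62 actual positives
y_pred = [0] * 114 + [1] * 16 + [0] * 29 + [1] * 33    # TN, FP, FN, TP in that order

# Treating class 0 as the "positive" label turns recall into specificity
specificity = metrics.recall_score(y_true, y_pred, pos_label=0)
print(specificity)
```

In the notebook itself the same call would take `y_test` and `y_pred_class` as its first two arguments.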


**False Positive Rate**: When the actual value is negative, how often is the prediction incorrect?

In [46]:
false_positive_rate = FP / float(TN + FP)

print(false_positive_rate)
print(1 - specificity)

0.123076923077
0.123076923077


**Precision**: When a positive value is predicted, how often is the prediction correct?

How "precise" is the classifier when predicting positive instances?

In [47]:
precision = TP / float(TP + FP)

print(precision)
print(metrics.precision_score(y_test, y_pred_class))

0.673469387755
0.673469387755
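
The metrics above all derive from the same four counts, so they can be gathered into one helper. The function name `summarize_confusion` is hypothetical and introduced only for this sketch; the counts passed in are the ones from the confusion matrix in this notebook.

```python
def summarize_confusion(tp, tn, fp, fn):
    """Return the metrics discussed above, computed from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "classification_error": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),          # true positive rate / recall
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (tn + fp),  # equals 1 - specificity
        "precision": tp / (tp + fp),
    }

# Counts taken from the confusion matrix above
summary = summarize_confusion(tp=33, tn=114, fp=16, fn=29)
for name, value in summary.items():
    print(name, value)
```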


**Receiver Operating Characteristic (ROC) Curve**

In [53]:
%matplotlib inline
import matplotlib.pyplot as plt

# IMPORTANT: first argument is true values, second argument is predicted probabilities
# predict_proba returns one column per class; keep only column 1, the predicted probability of class 1
# passing the full two-column array instead raises an error such as "ValueError: bad input shape (192, 2)"
y_pred_prob = nb.predict_proba(X_test)[:, 1]

# we do not use y_pred_class, because it will give incorrect results without generating an error
# roc_curve returns 3 objects: fpr, tpr, thresholds
# fpr: false positive rate
# tpr: true positive rate
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)

# set the font size before drawing so it applies to the whole figure
plt.rcParams['font.size'] = 12
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
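
Each point on a ROC curve corresponds to one decision threshold applied to the predicted probabilities. As a rough illustration of that idea, the sketch below computes the (false positive rate, true positive rate) pair by hand at three thresholds. The labels and probabilities are invented, and `roc_point` is a hypothetical helper written for this example, not a scikit-learn function.

```python
# Invented true labels and predicted probabilities of class 1 (illustration only)
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

def roc_point(y_true, y_prob, threshold):
    """Return (fpr, tpr) when class 1 is predicted for probabilities >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn), tp / (tp + fn)

thresholds = (0.25, 0.5, 0.75)
points = [roc_point(y_true, y_prob, t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, points):
    print(t, fpr, tpr)
```

Raising the threshold makes the classifier predict class 1 less often, so neither rate can increase as the threshold goes up; sweeping it from low to high moves along the curve from the top right toward the bottom left.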