In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

## Reload our Naive Bayes Classifier from 2.2

Here we'll quickly reload the Naive Bayes classifier from earlier. This is all code you've seen before. It is worth noting how little code is actually required to generate this model. It's a relatively simple exercise, and SKLearn makes it impressively easy.

In [2]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter='\t', header=None)
sms_raw.columns = ['spam', 'message']

# Enumerate our spammy keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

for key in keywords:
    # Pad the keyword with spaces so it only matches as a standalone word.
    sms_raw[key] = sms_raw.message.str.contains(
        ' ' + key + ' ',
        case=False
    )

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

## Success Rate

Now we have our model as well as the predictions it returned.

The first thing to note is which data are directly comparable for model evaluation: our target and y_pred variables. The target holds the actual outcomes, whether each message was spam or ham. The y_pred holds the outcomes predicted by our classifier. Both are ordered arrays with one result for each row of the dataframe. Where the two agree, our model correctly predicted whether a given message was spam or ham. Where they disagree, our model was incorrect.

The most basic measure of success, then, is how often our model was correct. This is called the accuracy. It's a metric you've seen before, as it was our method of evaluation in the previous lesson, but translated from a count to a rate or percentage.

Go ahead and calculate it in the cell below. If you're stuck, look back at the previous lesson. If you haven't yet, make your own copy of this notebook to work with locally so you don't lose your work.

In [5]:
# Calculate the accuracy of your model here.
accuracy = (target == y_pred).sum()/target.shape[0]
print('Model accuracy: {:.2f}%'.format(accuracy*100))

Model accuracy: 89.16%


You should get __89.16%__: 4968 messages classified correctly and 604 classified incorrectly, out of 5572 in total.
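The comparison behind this number can also be seen on a toy example. The sketch below uses two made-up boolean arrays (`actual` and `predicted` are hypothetical names and values, not the SMS data) and computes accuracy as the mean of the element-wise agreement, which is equivalent to dividing the number of matches by the number of rows.

```python
import numpy as np

# Hypothetical labels for five messages: True = spam, False = ham.
actual = np.array([True, False, False, True, False])
predicted = np.array([True, False, True, False, False])

# Element-wise comparison gives True wherever the prediction matches.
matches = actual == predicted

# The mean of a boolean array is the fraction of True values.
accuracy = matches.mean()
print('Toy accuracy: {:.2f}%'.format(accuracy * 100))  # Toy accuracy: 60.00%
```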

## Confusion Matrix

The next level of analysis for a classifier is often a confusion matrix. This is a matrix that shows the count of each possible combination of target and prediction. In our case, it shows the counts for when a message was ham and we predicted ham, when a message was ham and we predicted spam, when a message was spam and we predicted ham, and when a message was spam and we predicted spam.

SKLearn has a built-in confusion matrix function, so let's quickly import that and generate one here. In its output, each row is an actual class and each column is a predicted class, with ham first and spam second.

In [4]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target, y_pred)

array([[4770,   55],
       [ 549,  198]])
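
To make this output easier to read, the sketch below copies the four counts into a NumPy array and gives each cell a name. It assumes scikit-learn's layout of actual classes in rows and predicted classes in columns, with ham (False) listed before spam (True), and treats spam as the positive class. The variable names are illustrative.

```python
import numpy as np

# Counts copied from the confusion matrix output above.
cm = np.array([[4770, 55],
               [549, 198]])

# Row-major flattening yields: true negatives, false positives,
# false negatives, true positives (spam is the positive class).
tn, fp, fn, tp = cm.ravel()

correct = tn + tp      # ham predicted ham + spam predicted spam
incorrect = fp + fn    # ham predicted spam + spam predicted ham
print(correct, incorrect)  # 4968 604
```

The two totals match the counts of correctly and incorrectly classified messages behind the accuracy figure.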

## DRILL:

It's worth calculating these statistics with code so that you fully understand how they work, so here is your task for the cell below. Manually generate (__meaning don't use the SKLearn function__) your own confusion matrix and print it along with the sensitivity (the share of actual spam that the model flagged as spam) and the specificity (the share of actual ham that the model identified as ham).

In [18]:
# Build your confusion matrix and calculate sensitivity and specificity here.
# The result is named manual_cm so it does not shadow sklearn's confusion_matrix.
pos_index = target.index[target]   # rows that are actually spam
neg_index = target.index[~target]  # rows that are actually ham
manual_cm = np.zeros((2, 2), dtype=int)
manual_cm[0, 0] = (target[neg_index] == y_pred[neg_index]).sum()  # ham predicted as ham
manual_cm[0, 1] = len(neg_index) - manual_cm[0, 0]                # ham predicted as spam
manual_cm[1, 1] = (target[pos_index] == y_pred[pos_index]).sum()  # spam predicted as spam
manual_cm[1, 0] = len(pos_index) - manual_cm[1, 1]                # spam predicted as ham
print(manual_cm)

[[4770   55]
 [ 549  198]]
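
The drill also asks for sensitivity and specificity. One way to get them, sketched here from the counts printed above rather than from the notebook's variables, is to divide each diagonal cell by its row total. Treating spam as the positive class is an assumption, consistent with the target being True for spam. The variable names are illustrative.

```python
# Counts copied from the matrix printed above (rows: actual, columns: predicted).
true_neg, false_pos = 4770, 55    # actual ham
false_neg, true_pos = 549, 198    # actual spam

# Sensitivity: share of actual spam that was flagged as spam.
sensitivity = true_pos / (true_pos + false_neg)

# Specificity: share of actual ham that was identified as ham.
specificity = true_neg / (true_neg + false_pos)

print('Sensitivity: {:.2f}%'.format(sensitivity * 100))  # Sensitivity: 26.51%
print('Specificity: {:.2f}%'.format(specificity * 100))  # Specificity: 98.86%
```

Read together, the two rates show that the model rarely mislabels ham but catches only about a quarter of the spam.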
