Accuracy and Error Types

Using a Naive Bayes classifier to classify spam SMS messages.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
sms_raw = pd.read_csv('SMSSpamCollection.txt', delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

# spam keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

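for key in keywords:
    # Pad the keyword with spaces so it only matches as a whole word.
    # Note that this misses keywords at the very start or end of a message.
    sms_raw[key] = sms_raw.message.str.contains(
        ' ' + key + ' ',
        case=False
    )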

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

# display results.
print("Number of mislabeled points out of a total {} points : {}".format(
    data.shape[0],
    (target != y_pred).sum()
))

print("Fraction of correctly classified messages: {}".format((target == y_pred).sum() / data.shape[0]))

Number of mislabeled points out of a total 5572 points : 604
Fraction of correctly classified messages: 0.8916008614501076


In [3]:
sms_raw.head()

index,spam,message,click,offer,winner,buy,free,cash,urgent,allcaps
0,False,"Go until jurong point, crazy.. Available only ...",False,False,False,False,False,False,False,False
1,False,Ok lar... Joking wif u oni...,False,False,False,False,False,False,False,False
2,True,Free entry in 2 a wkly comp to win FA Cup fina...,False,False,False,False,False,False,False,False
3,False,U dun say so early hor... U c already then say...,False,False,False,False,False,False,False,False
4,False,"Nah I don't think he goes to usf, he lives aro...",False,False,False,False,False,False,False,False


In [4]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target, y_pred)

array([[4770,   55],
       [ 549,  198]], dtype=int64)

In [7]:
print((target == 1).sum())
print((target == 0).sum())
print((y_pred == 1).sum())
print((y_pred == 0).sum())
print((target == y_pred).sum())
print((target != y_pred).sum())

747
4825
253
5319
4968
604


The four cells of the confusion matrix can also be computed without sklearn.

In [13]:
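print(sum((target == 0) & (y_pred == 0)))  # true negatives: non-spam predicted as non-spam
print(sum((target == 1) & (y_pred == 0)))  # false negatives: spam predicted as non-spam
print(sum((target == 0) & (y_pred == 1)))  # false positives: non-spam predicted as spam
print(sum((target == 1) & (y_pred == 1)))  # true positives: spam predicted as spam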

4770
549
55
198
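
To make the cell layout explicit, the four counts can be arranged the way sklearn's confusion_matrix arranges them: rows are the actual class and columns are the predicted class. The following is a minimal sketch in plain Python. The helper name confusion_counts is hypothetical, and the label lists are stand-ins rebuilt from the counts printed above rather than loaded from the dataset.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (tn, fp, fn, tp) for two equal-length sequences of boolean labels."""
    pairs = Counter((bool(a), bool(p)) for a, p in zip(actual, predicted))
    tn = pairs[(False, False)]
    fp = pairs[(False, True)]
    fn = pairs[(True, False)]
    tp = pairs[(True, True)]
    return tn, fp, fn, tp

# Assumption: stand-in labels rebuilt from the counts printed above
# (4770 TN, 55 FP, 549 FN, 198 TP), not the real message order.
actual    = [False] * 4770 + [False] * 55 + [True] * 549 + [True] * 198
predicted = [False] * 4770 + [True] * 55 + [False] * 549 + [True] * 198

tn, fp, fn, tp = confusion_counts(actual, predicted)

# Rows: actual class (non-spam, spam). Columns: predicted class (non-spam, spam).
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[4770, 55], [549, 198]]
```

Called on the notebook's own variables, confusion_counts(target, y_pred) should return the same four numbers.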


Sensitivity is the percentage of actual positives that the model correctly identifies, in our case 198/747, or 27%.
It shows how good the model is at catching positives, which here are the spam messages.

Specificity is the counterpart for negatives: the percentage of actual negatives correctly identified, in our case 4770/4825, or 99%.
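
Written in terms of the confusion-matrix cells, with TP, FN, TN and FP denoting the true-positive, false-negative, true-negative and false-positive counts, the two rates work out as:

```latex
\begin{align*}
\text{sensitivity} &= \frac{TP}{TP + FN} = \frac{198}{198 + 549} = \frac{198}{747} \approx 0.265 \\
\text{specificity} &= \frac{TN}{TN + FP} = \frac{4770}{4770 + 55} = \frac{4770}{4825} \approx 0.989
\end{align*}
```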

Using Python, we have:

In [16]:
sensitivity = (sum((target == 1) & (y_pred == 1)))/((target == 1).sum())
specificity = (sum((target == 0) & (y_pred == 0)))/((target == 0).sum())

print('sensitivity: ', sensitivity)
print('specificity: ', specificity)

sensitivity:  0.26506024096385544
specificity:  0.9886010362694301
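
These two rates also help explain the overall accuracy of about 0.89: accuracy is the average of sensitivity and specificity, weighted by the share of each class. The sketch below checks this using only the four counts from the confusion matrix. The variable names are illustrative, and the counts are copied by hand rather than recomputed from the data.

```python
# Counts copied from the confusion matrix above.
tn, fp, fn, tp = 4770, 55, 549, 198

positives = tp + fn   # 747 actual spam messages
negatives = tn + fp   # 4825 actual non-spam messages
total = positives + negatives

sensitivity = tp / positives
specificity = tn / negatives
accuracy = (tp + tn) / total

# Accuracy as the class-share-weighted average of the two rates.
weighted = sensitivity * positives / total + specificity * negatives / total

print(round(sensitivity, 4), round(specificity, 4), round(accuracy, 4))
# 0.2651 0.9886 0.8916
```

Because roughly 87% of the messages are non-spam, the high specificity dominates the accuracy figure even though most spam is missed.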
