# Evaluating classifiers
Unit 2 / Lesson 3


So you’ve made your first classifier!
Now the tough question: is it any good?
Evaluating models is an essential part of the data science process, and it can get as detailed or complex as the data scientist wants it to.

In this lesson we’ll cover some of the ways you could evaluate a classifier, using the spam filter from the previous lesson as the key example.

# Accuracy and Error Types
Unit 2 / Lesson 3 / Assignment 1


### Reload our Naive Bayes Classifier from 2.2

Here we'll quickly reload the Naive Bayes classifier from our last lesson.
This is all code you've seen before, but it is worth noting how little code is actually required to generate this model.
It's a relatively simple exercise, and SKLearn makes it impressively easy.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [16]:
# Grab and process the raw data
PATH = (
    "https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
    "master/sms_spam_collection/SMSSpamCollection"
)

sms_raw = pd.read_csv(PATH, delimiter='\t', header=None)
sms_raw.columns = ['spam', 'message']

# enumerate spam keywords
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

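# Flag each keyword, padded with spaces so only whole words match.
# Note: a keyword at the very start or end of a message, or next to
# punctuation, has no space on one side and is therefore not flagged
# (see 'Free entry ...' in row 2 of the output below).
for key in keywords:
    sms_raw[key] = sms_raw.message.str.contains(
        ' ' + key + ' ',
        case=False
    )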

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')

sms_raw.head()

| | spam | message | click | offer | winner | buy | free | cash | urgent | allcaps |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | Go until jurong point, crazy.. Available only ... | False | False | False | False | False | False | False | False |
| 1 | False | Ok lar... Joking wif u oni... | False | False | False | False | False | False | False | False |
| 2 | True | Free entry in 2 a wkly comp to win FA Cup fina... | False | False | False | False | False | False | False | False |
| 3 | False | U dun say so early hor... U c already then say... | False | False | False | False | False | False | False | False |
| 4 | False | Nah I don't think he goes to usf, he lives aro... | False | False | False | False | False | False | False | False |


In [17]:
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

# display results
print('Number of mislabeled points out of a total {}: {}'.format(
    data.shape[0], (target != y_pred).sum()
))
print('Model accuracy:',
      ((data.shape[0] - (target != y_pred).sum()) / data.shape[0]*100),
      '%')

Number of mislabeled points out of a total 5572: 604
Model accuracy: 89.16008614501077 %


### Success Rate

Now we have our __model__ as well as our returned predictions.

The first thing to note is _which data are directly comparable for model evaluation:_ our __target__ and __y_pred__ variables.
___Target__ holds the actual outcomes_: whether each message was spam or ham.
___y_pred__ holds the outcomes predicted by our classifier_.
Both are ordered arrays with one result for each row of the dataframe.
When the two agree, our model successfully predicted whether a given message was spam or ham.
When they disagree, our model was incorrect.

_The most basic measure of __success__, then, is how often our model was correct._
This is called the __accuracy__.
You've seen this metric before: it was our method of evaluation in the previous lesson, translated from a count of errors to a rate or percentage.
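
As a minimal sketch of that count-to-rate translation, the snippet below compares two small made-up arrays and takes the mean of their elementwise agreement. The names `target_toy` and `y_pred_toy` are hypothetical stand-ins, not the SMS data.

```python
import numpy as np

# Hypothetical toy labels (True = spam, False = ham), not the SMS data
target_toy = np.array([True, False, False, True, False])
y_pred_toy = np.array([True, False, True, False, False])

# Elementwise agreement between actual and predicted labels
correct = target_toy == y_pred_toy

accuracy = correct.mean()        # fraction of rows where the two agree
error_count = (~correct).sum()   # the raw count of mislabeled rows

print('Accuracy:', accuracy)
print('Mislabeled points:', error_count)
```

The same idea should carry over to the notebook's `target` and `y_pred`, where the mean of `target == y_pred` is the accuracy as a fraction rather than a percentage.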

__Success rate__ (another name for accuracy) is a popular way to evaluate a model, and it is what most people get excited about when discussing one.
For a data scientist, however, success rate alone is usually not sufficient.

Not all errors are created equal.
Think of the situation we're currently working with: a spam filter.
Are all types of errors equal here?
If you were using this to remove messages from your inbox, letting in a spam message is not nearly as egregious as throwing out a real (and quite possibly very important) message.
Knowing more about the kinds of errors you're generating can therefore be incredibly useful.

Understanding how your model is failing can be key to improving it.
If a certain outcome is not being predicted accurately you may want to focus on engineering more features to identify that outcome.

### Confusion Matrix

The next level of analysis of your __classifier__ is often something called a __Confusion Matrix__.
This is a matrix that _shows the count of each possible combination of target and prediction._
So in our case, it will show the counts for when a message was ham and we predicted ham, when a message was ham and we predicted spam, when a message was spam and we predicted ham, and when a message was spam and we predicted spam.

SKLearn has a built-in __confusion matrix__ function.

In [8]:
from sklearn.metrics import confusion_matrix
confusion_matrix(target, y_pred)

array([[4770,   55],
       [ 549,  198]], dtype=int64)

Here the rows are the actual classes and the columns are the predicted classes, each ordered ham (False) first and spam (True) second.

So what do we learn?

We learn the majority of our error is coming from times where we failed to identify a spam message.
549 of our 604 errors are from failing to identify spam.
So we need to get a little bit better at identifying spam messages.

But before we move on or iterate on the model, let's talk about some key terms that you may run into when thinking about this kind of matrix.

Let's assume our goal is to identify spam (rather than identify ham).

Firstly, when we talk about errors in a binary classifier (where there are only two outcomes) we're generally referring to two kinds of errors.
A __false positive__ is when we identify something as spam that is not. In this case we had 55 of these. _This is sometimes also called a "Type I Error" or a "false alarm"_.

A __false negative__ is therefore when we mistakenly identify something as not spam when it is. We had 549 of these. _This is also called a "Type II Error" or a "miss"_.
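
To tie these names to the matrix, here is a small sketch that copies the counts from the `confusion_matrix` output above into a NumPy array and unpacks them. The variable names `tn`, `fp`, `fn` and `tp` are our own labels, and the unpacking order assumes rows are actual, columns are predicted, and ham comes before spam in both.

```python
import numpy as np

# Counts copied from the confusion_matrix(target, y_pred) output above.
# Rows are actual, columns are predicted, each ordered ham then spam.
cm = np.array([[4770, 55],
               [549, 198]])

# ravel() flattens row by row: the actual-ham row, then the actual-spam row
tn, fp, fn, tp = cm.ravel()

print('True negatives (ham kept):    ', tn)
print('False positives (ham flagged):', fp)
print('False negatives (spam missed):', fn)
print('True positives (spam caught): ', tp)
print('Total errors:', fp + fn)
```

In the notebook itself, `confusion_matrix(target, y_pred).ravel()` should give the same four numbers without copying them by hand.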

This also brings us to the distinction between sensitivity and specificity.

__Sensitivity__ is the percentage of positives correctly identified, in our case 198/747 or 27%. This shows how good we are at catching positives, or how sensitive our model is to identifying positives.

__Specificity__ is the counterpart for the negative class: the percentage of negatives correctly identified, 4770/4825 or 99%.

Again this confirms that we're not great at identifying spam, though we do label ham quite accurately. You should get familiar with these terms: in practice they are often used with little explanation, and you will be expected to understand them.
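
The two percentages can be checked with a few lines of arithmetic. This is only a sketch using the counts from the confusion matrix output above; the variable names are our own.

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 4770, 55, 549, 198

# Sensitivity: share of actual spam that we caught
sensitivity = tp / (tp + fn)   # 198 / 747

# Specificity: share of actual ham that we left alone
specificity = tn / (tn + fp)   # 4770 / 4825

print('Sensitivity: {:.1%}'.format(sensitivity))
print('Specificity: {:.1%}'.format(specificity))
```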

### DRILL:

It's worth calculating these with code so that you fully understand how these statistics work.
Manually generate (__meaning don't use the SKLearn function__) your own confusion matrix and print it along with the sensitivity and specificity.

In [28]:
y_actu = pd.Series(sms_raw['spam'], name='actual')
y_pred = pd.Series(y_pred, name='predicted')
df_confusion = pd.crosstab(y_actu, y_pred,
                           rownames=['actual'], colnames=['predicted'], margins=True)
df_confusion

| actual (rows) / predicted (columns) | False | True | All |
|---|---|---|---|
| False | 4770 | 55 | 4825 |
| True | 549 | 198 | 747 |
| All | 5319 | 253 | 5572 |
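
The crosstab gives the counts but not the two rates the drill asks for. One possible way to finish the drill without any helper function is to count each cell with boolean masks. The sketch below uses a tiny made-up pair of arrays so that it runs on its own; `actual_toy` and `predicted_toy` are hypothetical names and values, not the SMS data.

```python
import numpy as np

# Hypothetical toy labels (True = spam, False = ham), not the SMS data
actual_toy = np.array([True, True, True, False, False, False, False, False])
predicted_toy = np.array([True, False, False, False, False, False, True, False])

# Count each combination of actual and predicted with boolean masks
tp = np.sum(actual_toy & predicted_toy)     # spam, predicted spam
fn = np.sum(actual_toy & ~predicted_toy)    # spam, predicted ham
fp = np.sum(~actual_toy & predicted_toy)    # ham, predicted spam
tn = np.sum(~actual_toy & ~predicted_toy)   # ham, predicted ham

# Same layout as the SKLearn output: rows actual, columns predicted
manual_cm = np.array([[tn, fp],
                      [fn, tp]])

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(manual_cm)
print('Sensitivity:', sensitivity)
print('Specificity:', specificity)
```

In the notebook you could try the same masks on `np.asarray(target)` and `np.asarray(y_pred)`; the counts should then match the crosstab.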
