In [14]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

In [9]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

# Enumerate our spammy keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

for key in keywords:
    sms_raw[str(key)] = sms_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

In [10]:
# Test your model with different holdout groups.

from sklearn.model_selection import train_test_split
# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.884304932735426
Testing on Sample: 0.8916008614501076


These scores are close: about 0.884 on the 20% holdout versus 0.892 when testing on the same sample we trained on. It doesn't seem like our model is overfitting. Part of the reason is that the model is so simple (more on that in a bit). But we should check whether any other issues are lurking here, so let's try a more robust evaluation technique: cross validation.

## Cross Validation

Cross validation is a more robust version of holdout groups. Instead of creating just one holdout, you create several.

The way it works is this: start by breaking up your data into several equally sized pieces, or __folds__. Let's say you make _x_ folds. You then go through the training and testing process _x_ times, each time with a different fold held out from the training data and used as the test set. The number of folds is up to you, and it determines how much data lands in each test set: with _x_ folds, each test set holds roughly 1/_x_ of the observations. At its most extreme, you create as many folds as you have observations in your data set. This kind of cross validation has a special name: __Leave One Out__. Leave one out is useful if you're worried about single observations skewing your model, whereas larger folds combat more general overfitting.
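The fold logic described above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements it; the helper name `k_fold_indices` is hypothetical, and it splits positions in order without shuffling.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs, one per fold, without shuffling."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Ten observations in five folds: each test set holds two observations.
for train_idx, test_idx in k_fold_indices(10, 5):
    print("train:", train_idx, "test:", test_idx)

# Leave One Out is the special case where n_folds equals n_samples.
loo = list(k_fold_indices(4, 4))
```

Every observation appears in exactly one test set, so each one is used for testing once and for training in all the other rounds.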



In [11]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([0.89784946, 0.89426523, 0.89426523, 0.890681  , 0.89605735,
       0.89048474, 0.88150808, 0.89028777, 0.88489209, 0.89568345])

That's exactly what we'd hope to see. The array that `cross_val_score` returns is a series of accuracy scores, each computed with a different holdout group. If our model were overfitting, its performance would depend heavily on which rows it happened to train on, and those scores would fluctuate from fold to fold. Instead, ours are relatively consistent, ranging from about 0.88 to 0.90.
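One way to put a number on "relatively consistent" is to summarize the fold scores. The sketch below copies the ten values from the output above and uses only the standard library; which summary statistics to report is a judgment call, not something prescribed by this lesson.

```python
from statistics import mean, stdev

# Accuracy scores copied from the cross_val_score output above.
scores = [0.89784946, 0.89426523, 0.89426523, 0.890681, 0.89605735,
          0.89048474, 0.88150808, 0.89028777, 0.88489209, 0.89568345]

print("mean:   {:.4f}".format(mean(scores)))
print("stdev:  {:.4f}".format(stdev(scores)))
print("spread: {:.4f}".format(max(scores) - min(scores)))
```

A small standard deviation relative to the mean suggests the model performs about the same no matter which fold is held out.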

Above we used the SKLearn built-in functions for both holdout validation and cross validation, the documentation for which can be found [here](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators-with-stratification-based-on-class-labels). However, the output is somewhat limited. By default `cross_val_score` calls the estimator's `score` method, which for a classifier is accuracy. You can change the metric with the `scoring` parameter, but you still don't get all of the error types or outputs you may be interested in. That's why it's not uncommon for people to code up their own cross validation.
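As a rough illustration of adjusting what is returned, scikit-learn's `cross_validate` accepts a list of metric names through `scoring`. The data below is a hypothetical random stand-in, not the SMS data, so the printed numbers mean nothing beyond showing the shape of the result.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB

# Hypothetical stand-in data: 500 rows of 4 boolean features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4)).astype(bool)
# The label is positive about 80% of the time the first feature is set.
y = (X[:, 0] & (rng.random(500) < 0.8)).astype(int)

results = cross_validate(
    BernoulliNB(), X, y, cv=5,
    scoring=['accuracy', 'precision', 'recall'],
)

# Each requested metric comes back as an array with one value per fold.
for name in ['test_accuracy', 'test_precision', 'test_recall']:
    print(name, results[name].round(3))
```

Even with several metrics you only get one number per metric per fold, which is why hand-rolled loops are still common when you want the full confusion matrix.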

To make sure you understand how cross validation works, try to code it up yourself below, not relying on SKLearn:


In [44]:
# Implement your own cross validation with your spam model.
def custom_cross_val_score(func, data, target, cv=10):
    # Shuffle the row labels, then cut them into `cv` nearly equal folds.
    shuffled_index = data.sample(frac=1).index.to_numpy()
    folds = np.array_split(shuffled_index, cv)
    res = np.empty(shape=(cv, 1))
    for i, fold in enumerate(folds):
        # Rows in the current fold are the test set; all other rows train.
        # This assumes data and target share the same index.
        in_fold = data.index.isin(fold)
        X_train, X_test = data[~in_fold], data[in_fold]
        y_train, y_test = target[~in_fold], target[in_fold]
        res[i] = func.fit(X_train, y_train).score(X_test, y_test)
    print(res)
    
custom_cross_val_score(func=bnb, data=data, target=target, cv=10)

[[0.87634409]
 [0.89605735]
 [0.87253142]
 [0.91202873]
 [0.90305206]
 [0.86355476]
 [0.91202873]
 [0.88689408]
 [0.90843806]
 [0.88150808]]


## What's a good score?

With this model we've been getting accuracy scores around .89. Intuitively that seems like a pretty good score, but at the start of this lesson we mentioned different kinds of error. We also mentioned class imbalance. Both of these things are at play here. Using the topics we introduced earlier in this lesson, try to do a more in-depth evaluation of the model, looking at the kinds of errors we're generating and what accuracy we'd get if we just guessed. You may want to use what's known as a [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) to show the different kinds of errors.


In [52]:
# Perform your additional evaluation here.
def confusion_cross_val_score(func, data, target, cv=10):
    # Shuffle the row labels, then cut them into `cv` nearly equal folds.
    shuffled_index = data.sample(frac=1).index.to_numpy()
    folds = np.array_split(shuffled_index, cv)
    res = np.empty(shape=(cv * 2, 2))  # one 2x2 matrix per fold, stacked
    for i, fold in enumerate(folds):
        # Rows in the current fold are the test set; all other rows train.
        # This assumes data and target share the same index.
        in_fold = data.index.isin(fold)
        X_train, X_test = data[~in_fold], data[in_fold]
        y_train, y_test = target[~in_fold], target[in_fold]
        y_pred = func.fit(X_train, y_train).predict(X_test)
        row = i * 2
        # Rows are true classes, columns are predicted classes.
        res[row:row+2, 0:2] = confusion_matrix(y_test, y_pred)
        print('Confusion matrix from run number:{}\n'.format(i), res[row:row+2, 0:2], '\n')

confusion_cross_val_score(func=bnb, data=data, target=target, cv=10)

Confusion matrix from run number:0
 [[476.   3.]
 [ 50.  29.]] 

Confusion matrix from run number:1
 [[474.   7.]
 [ 54.  23.]] 

Confusion matrix from run number:2
 [[473.   6.]
 [ 57.  21.]] 

Confusion matrix from run number:3
 [[480.   7.]
 [ 53.  17.]] 

Confusion matrix from run number:4
 [[475.   7.]
 [ 59.  16.]] 

Confusion matrix from run number:5
 [[479.   4.]
 [ 61.  13.]] 

Confusion matrix from run number:6
 [[476.   6.]
 [ 60.  15.]] 

Confusion matrix from run number:7
 [[489.   5.]
 [ 42.  21.]] 

Confusion matrix from run number:8
 [[479.   4.]
 [ 50.  24.]] 

Confusion matrix from run number:9
 [[469.   6.]
 [ 63.  19.]] 
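To read these matrices, it helps to turn one of them into rates. The sketch below uses the counts from run number 0 above and relies on scikit-learn's layout (true classes in rows, predicted classes in columns, labels sorted so `False` comes before `True`). The variable names are just for illustration.

```python
# Counts from the first confusion matrix printed above (run number 0).
# Row 0 is true ham (not spam), row 1 is true spam.
tn, fp = 476, 3    # ham correctly kept, ham wrongly flagged as spam
fn, tp = 50, 29    # spam missed, spam correctly caught
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
# Baseline: always predict the dominant class (ham) and never flag anything.
baseline = max(tn + fp, fn + tp) / total
sensitivity = tp / (tp + fn)   # share of spam we actually catch
specificity = tn / (tn + fp)   # share of ham we correctly leave alone

print("accuracy:    {:.3f}".format(accuracy))
print("baseline:    {:.3f}".format(baseline))
print("sensitivity: {:.3f}".format(sensitivity))
print("specificity: {:.3f}".format(specificity))
```

For this fold, accuracy is about 0.905 against a baseline of about 0.858, and the model catches only about 37% of the spam while almost never flagging a legitimate message.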




## Thinking like a Data Scientist

How you choose to validate your model in real life will depend upon the kind of data you're working with and the kinds of concerns you have about the model's performance. Remember, your model is trained to fit the data you feed it, so if the situation changes, your model will become less accurate. For example, if there are seasonal changes to your observed variable but you only train on one month's data, you're going to have a problem. You could test for that by seeing how accurate your model is with a specific time period as your holdout, rather than a random sample. We'll cover techniques for dealing with time in more detail later.
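A time-based holdout can be as simple as splitting on a cutoff date instead of sampling at random. The rows, dates, and the helper `holdout_by_date` below are all hypothetical and only meant to show the shape of the idea.

```python
from datetime import date

# Hypothetical observations as (date, feature, label) tuples.
observations = [
    (date(2023, 1, 15), 0.2, 0),
    (date(2023, 2, 3), 0.4, 0),
    (date(2023, 2, 20), 0.9, 1),
    (date(2023, 3, 8), 0.7, 1),
    (date(2023, 3, 25), 0.1, 0),
]

def holdout_by_date(rows, cutoff):
    """Train on rows before `cutoff`, hold out rows on or after it."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test

# Hold out March: the model never sees it during training.
train, test = holdout_by_date(observations, date(2023, 3, 1))
print(len(train), "training rows,", len(test), "holdout rows")
```

If accuracy on the held-out period is much worse than on a random holdout, that is a hint the relationship between features and outcome shifts over time.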

## Overfitting and Naive Bayes

Overfitting is always possible, but some models are more susceptible to it than others. Naive Bayes is actually pretty good at avoiding overfitting. This is largely because its assumptions are so simple, particularly the assumption that the features are independent of one another within each class. One source of overfitting is a model trying to map complex interactions between variables that aren't really there or aren't significant. Naive Bayes cannot do this because it assumes the features are all independent and therefore not interacting. It's a nice characteristic at times, but it does mean the model doesn't take into account how your features affect each other.
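The independence assumption is easiest to see in the arithmetic: each feature contributes its own factor to the class score, and no term involves two features at once. The probabilities below are hypothetical values chosen for illustration, not estimates from the SMS data, and the function is a sketch rather than scikit-learn's implementation.

```python
# Hypothetical per-class probabilities that each keyword feature is True.
p_feature_given_class = {
    'spam': {'free': 0.30, 'cash': 0.10},
    'ham':  {'free': 0.02, 'cash': 0.01},
}
# Hypothetical class priors.
prior = {'spam': 0.13, 'ham': 0.87}

def posterior(features):
    """Return P(class | features) under the naive independence assumption."""
    joint = {}
    for label in prior:
        p = prior[label]
        # One independent factor per feature: no interaction terms.
        for name, present in features.items():
            p_true = p_feature_given_class[label][name]
            p *= p_true if present else (1 - p_true)
        joint[label] = p
    total = sum(joint.values())
    return {label: p / total for label, p in joint.items()}

print(posterior({'free': True, 'cash': True}))
print(posterior({'free': False, 'cash': False}))
```

Because the score is a plain product, the model has no way to learn that two keywords together mean something different from what each means alone.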

Also, one final note on our models here. They weren't overfitting, but they weren't telling us much either. They were just barely more accurate than always predicting the dominant class. Discuss with your mentor why that is and what you could do to improve the model.
