Another source of error = OVERFITTING
* Overfitting = when a model is so excessively complex that it starts to catch random noise instead of describing the true underlying relationship
* typically shows up as a model that evaluates as more accurate than it really is.
* in most situations you shouldn't be able to build a perfect model --> some error is to be expected.
* it is extremely common and easy to do

We've been using the same data to train the model and to see how well the model is doing.
* the danger of this approach may already be apparent.
* if we create a very elaborate model, it will pick up on nuances of the data that are just random noise.
* if we evaluate the model on the training data, that ability to pick up noise will show up as accuracy
* in reality, that accuracy isn't real, and this isn't how we'd really want to evaluate a model
* we generally don't care about predicting things we already know.
* we care about other data, new info, or other situations
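
A minimal sketch of the effect described above, assuming scikit-learn and NumPy are available. The features and labels here are pure random noise (made up for illustration, not the spam data), so there is no real relationship to learn, and the unconstrained decision tree is only a stand-in for "a very elaborate model".

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Pure noise: the labels have no relationship to the features.
X_train, y_train = rng.rand(400, 5), rng.randint(0, 2, 400)
X_new, y_new = rng.rand(400, 5), rng.randint(0, 2, 400)

# An unconstrained tree keeps splitting until it has memorized every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # evaluated on data it has already seen
new_acc = tree.score(X_new, y_new)        # evaluated on data it has never seen

print("Accuracy on training data:", train_acc)
print("Accuracy on new data:", new_acc)
```

The training score comes out perfect even though there is nothing to learn, while the score on new data should sit near chance (about 0.5).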



### Holdout Groups

Simplest way to combat overfitting is with --> Holdout group (holdback group)
* You do not include all of your data in your training set, and instead reserve some of it exclusively for testing.
* there is a cost to having less training data, but evaluation will be more reliable

when directly comparing two models that are based on different techniques or different specifications, this holdout method guards against being fooled by overfitting.
* overfit models will see a drop in success rate outside of their training data, so their performance will not be artificially inflated as it would be if you trained and validated your model using the whole data set.
* this is b/c they got really good at matching the patterns within the data they were trained with, but what they learned was random noise rather than the things that matter.
* when they try to match that random noise on new data their accuracy suffers.

how much data you choose to keep in a holdout is really up to you and depends on how much and what kind of data you have to begin with, as well as what kind of model you're training.
* should check how much variance your model has as you add more data, as well as how much data it would take to maintain a reasonably representative test sample.
* 30% is a common starting point, but really anything from 50% down to 1% of the original dataset is reasonable.
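
One rough way to reason about holdout size is to ask how noisy the accuracy estimate itself is. The sketch below uses the binomial standard error, sqrt(p(1-p)/n), which assumes the test rows are independent. The dataset size of 5,000 rows is hypothetical, and 0.89 is just the rough accuracy seen in these notes.

```python
import math

def accuracy_std_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimate from n_test independent rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

n_rows = 5000      # hypothetical dataset size, for illustration only
accuracy = 0.89    # roughly the accuracy seen in these notes

for holdout in (0.01, 0.10, 0.30, 0.50):
    n_test = round(n_rows * holdout)
    se = accuracy_std_error(accuracy, n_test)
    print(f"holdout={holdout:.0%}  test rows={n_test}  std error ~ {se:.4f}")
```

A smaller holdout leaves more data for training but makes the reported score jumpier, which is the trade-off the bullets above describe.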

In [1]:
import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Grab and process the raw data.
data_path = ("https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
             "master/sms_spam_collection/SMSSpamCollection"
            )
sms_raw = pd.read_csv(data_path, delimiter= '\t', header=None)
sms_raw.columns = ['spam', 'message']

# Enumerate our spammy keywords.
keywords = ['click', 'offer', 'winner', 'buy', 'free', 'cash', 'urgent']

for key in keywords:
    sms_raw[str(key)] = sms_raw.message.str.contains(
        ' ' + str(key) + ' ',
        case=False
    )

sms_raw['allcaps'] = sms_raw.message.str.isupper()
sms_raw['spam'] = (sms_raw['spam'] == 'spam')
data = sms_raw[keywords + ['allcaps']]
target = sms_raw['spam']

from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
y_pred = bnb.fit(data, target).predict(data)

In [3]:
# Test your model with different holdout groups.

from sklearn.model_selection import train_test_split
# Use train_test_split to create the necessary training and test groups
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=20)
print('With 20% Holdout: ' + str(bnb.fit(X_train, y_train).score(X_test, y_test)))
print('Testing on Sample: ' + str(bnb.fit(data, target).score(data, target)))

With 20% Holdout: 0.884304932735426
Testing on Sample: 0.8916008614501076


### Cross Validation

The more robust version of holdout groups--> cross validation
* instead of creating just one holdout, you create several

The way it works: 
* break up data into several equally sized pieces (folds)
    * let's say you make x folds.
* go through the training and testing process x times, each time with a different fold held out from the training data and used as the test set.
* number of folds you create is up to you, but will depend on how much data you want in your testing set.
* At the extreme, you create as many folds as you have observations. This kind of cross validation is called --> Leave one out
    * useful if you're worried about a single obs skewing your model, whereas larger folds combat more general overfitting.
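
A short sketch of how fold count relates to leave-one-out, assuming scikit-learn's `KFold` and `LeaveOneOut` splitters. The 10-row array is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(20).reshape(10, 2)  # 10 made-up observations, 2 features each

kfold = KFold(n_splits=5)
loo = LeaveOneOut()

# Number of train/test rounds each splitter will produce for X.
print("5-fold splits:", kfold.get_n_splits(X))
print("Leave-one-out splits:", loo.get_n_splits(X))

# Each split yields positional indices for the training rows and the held-out rows.
for train_idx, test_idx in kfold.split(X):
    print("train:", train_idx, "test:", test_idx)
```

With 5 folds, each round holds out 2 of the 10 rows. Leave-one-out holds out a single row per round, so it runs once per observation.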

In [4]:
from sklearn.model_selection import cross_val_score
cross_val_score(bnb, data, target, cv=10)

array([0.89784946, 0.89426523, 0.89426523, 0.890681  , 0.89605735,
       0.89048474, 0.88150808, 0.89028777, 0.88489209, 0.89568345])

the array that cross_val_score returns = series of accuracy scores, with a different holdout group each time. if our model is overfitting by a variable amount, those scores will fluctuate.
* the one above is relatively consistent.
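
To put a number on "fluctuate", one option is to summarize the fold scores with a mean, a standard deviation, and a range. A small sketch using the scores from the output above and only the standard library:

```python
import statistics

# Scores copied from the cross_val_score output above.
scores = [0.89784946, 0.89426523, 0.89426523, 0.890681, 0.89605735,
          0.89048474, 0.88150808, 0.89028777, 0.88489209, 0.89568345]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)       # sample standard deviation across folds
score_range = max(scores) - min(scores)    # gap between best and worst fold

print(f"mean={mean_score:.4f}  std={std_score:.4f}  range={score_range:.4f}")
```

A spread that is small relative to the mean suggests the folds agree with each other. What counts as "small" is a judgment call that depends on the problem.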



To make sure you understand how cross validation works, try to code it up yourself below, not relying on SKLearn:

In [1]:
# Implement your own cross validation with your spam model.
# Assumes `data`, `target`, `np`, and `BernoulliNB` from the cells above.
n_folds = 5  # each fold holds about 20% of the rows

# 1.) Shuffle the row positions once, then cut them into roughly equal folds.
#     The same row positions index both `data` and `target`, so they stay aligned.
rng = np.random.RandomState(1)
folds = np.array_split(rng.permutation(len(data)), n_folds)

# 2.) Hold out each fold in turn: train on the other folds, test on the held-out one.
scores = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    model = BernoulliNB().fit(data.iloc[train_idx], target.iloc[train_idx])
    scores.append(model.score(data.iloc[test_idx], target.iloc[test_idx]))

# 3.) One accuracy score per fold.
print(scores)

### What's a good score?
When we look at our model, we're getting accuracy scores around .89
* seems like a pretty good score, but there are different kinds of errors
* Class imbalance
* Both are at play here
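
A small sketch of why accuracy alone can mislead when classes are imbalanced. The labels are made up for illustration (87 legitimate messages, 13 spam), and the "model" simply predicts not-spam every time.

```python
# Made-up labels for illustration: 87 legitimate messages and 13 spam.
actual = [False] * 87 + [True] * 13

# A "model" that ignores its input and always predicts not-spam.
predicted = [False] * len(actual)

# Count the four possible outcomes.
true_pos = sum(a and p for a, p in zip(actual, predicted))
false_neg = sum(a and not p for a, p in zip(actual, predicted))
false_pos = sum(p and not a for a, p in zip(actual, predicted))
true_neg = sum(not a and not p for a, p in zip(actual, predicted))

accuracy = (true_pos + true_neg) / len(actual)
spam_caught = true_pos / (true_pos + false_neg)  # share of real spam that was flagged

print("accuracy:", accuracy)
print("false positives:", false_pos, "false negatives:", false_neg)
print("share of spam caught:", spam_caught)
```

The accuracy looks respectable, yet every error is a missed spam message and no spam is caught at all. That is why the kind of error matters alongside the overall score.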

### Thinking like a data scientist
how you choose to validate a model will depend on the kind of data you have and your concerns
* model is trained to fit the data you feed it


### Overfitting and Naive Bayes
Naive Bayes is good for avoiding overfitting
* B/c its assumptions are so simple
* particularly the assumed independence b/w any two ind variables
* one source of overfitting is when a model tries to map complex interactions b/w variables that aren't really there or aren't significant
* naive bayes can't do this b/c it assumes they are all ind and not interacting
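
Written out, the independence assumption means each class is scored as a prior times one term per feature. A sketch in standard notation, where $y$ is the class (spam or not) and $x_1, \dots, x_n$ are the features (here, the keyword and allcaps flags):

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

No term involves two features at once, so the model has no parameter it could use to fit an interaction between features.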
