# Cross-Validation of Machine Learning Models

## Overfitting

As we develop models to try and predict outcomes, we are faced with an important trade-off that might not be obvious at first. How good should we make our model?

Believe it or not, there are real risks to creating models that appear to be highly performant as we design them. The risk with these amazing models is that they will perform very well on data that we have access to right now, but will perform at a much lower level when used in the real world. This happens when our model is **overfit** to the available data. **Overfitting** occurs when our model assumes that the available data says more about the real world than the data is actually capable of predicting.

For example, for many consecutive years it was possible to predict stock market performance in the United States based on outcomes in the Super Bowl (the championship game of the National Football League). Anyone with any sense would agree that it is highly unlikely that a single sporting event (even if it is one of the largest annual sporting events in the world) could change the performance of the US stock markets. These stock markets have orders of magnitude greater value than ANY sporting event. If our model of the stock market decided that the outcome of this game was a predictor of the stock market, then we would say that our model is **overfitted** to the data, and is making associations that are unlikely to remain true in the future.

We can overfit a model in several ways.

First, if we are careless about the variables included in our model, then we may end up with spurious correlations like that between the stock market and the Super Bowl. In these cases, we begin to believe that a relationship between two unrelated outcomes exists. This relationship may *appear* true in some arbitrary time span, but is unlikely to continue into the future, since the relationship was likely random. Thus, our model believes something about the world will predict outcomes, but we know that this is not probable.

![](https://www.cookandbynum.com/wp-content/uploads/2018/11/per-capital-cheese.png)

Second, we can overfit a model by choosing a model of high complexity. As we increase the complexity of our models, we increase the probability that our model will "bend" itself in order to match our data. We can imagine a random collection of observations that have a linear relationship in realtiy, but the relationship is noisy. In this case, a linear model is the best model to represent the system. On the other hand, we might be able to reduce the error in our model when measured against available data by choosing a non-linear model. Doing so will improve our perceived accuracy when we test our model, but will decrease its ability to make accurate predictions in the real world. The image below represents this kind of overfitting:

![](https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Overfitted_Data.png/300px-Overfitted_Data.png)

## How does Cross-Validation help?

Cross-validation is a key component of preventing overfitting. The key concept behind cross-validation is that we create multiple models using sub-samples of our data in order to verify that the model will behave similarly when given new data that it has not yet observed. The concept is similar to our train-test splitting that we have done with previous models, but performed several times.

The most common form of cross-validation is called **$k$-fold cross validation**. The concept is very simple:

1. Separate training and testing data
2. Break the training data into $k$ equally-sized portions (typically done through random sampling).
3. Train the model $k$ times using identical model parameters, and where each iteration uses a different combination of training and testing data
4. Record the performance of each iteration
5. Calculate the average performance of the $k$ models
6. If performance is satisfactory, train the model on the full trainind data set, and then test performance on the testing data for final validation of predictive ability
    - If the model is not satisfactory, refine the model parameters and go back to step 2.

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" width=500></img>

By implementing $k$-fold cross validation, we are better able to understand whether or not our model is robust to new observations. For example, I might see that the accuracy of one split is very different from the accuracy of another split. This will give me information about how reliable the model will be as new real-world data is fed into the model. This provides more realistic expectations for the performance of a model than we might otherwise have.

## Using Representative Data

No matter how we attempt to predict performance in the real world, the most important factor is that we ensure that the data we use to train our model actually resembles the data that we will observe in practice. No matter how well-trained our model is, the model **will fail** if it has not been trained on representative data!

## Implementing Cross-Validation

We will implement **stratified** $k$-fold cross validation. **Stratification** simply means that we will break our data apart so that our dependent variable (or label) is evenly mixed in each fold of data, and is particularly valuable where some outcomes are infrequent, and we want to ensure that our model is always given the opportunity to observe those outcomes. In more balanced data, using stratification is of less importance.

While in practice we should first split our data into training and testing data, we will skip this step below in order to keep our example as simple as possible.

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier as dt
from sklearn.metrics import accuracy_score

mnist = pd.read_csv("https://github.com/dustywhite7/pythonMikkeli/blob/master/exampleData/mnistTrain.csv?raw=true")

# Separate our features from our labels
y = mnist['Label']
x = mnist.drop('Label', axis=1)

# Make 5 folds in the data
skf = StratifiedKFold(n_splits=5)

# Create the model
clf = dt(max_depth=15)

# Create a list to store accuracy values
accuracy = []
n=1

# For loop to train the model on each fold
for train_index, test_index in skf.split(x, y):
    # Store the folded data
    x_train = x.loc[train_index, :]
    x_test = x.loc[test_index, :]
    y_train =  y[train_index]
    y_test = y[test_index]
    
    # Fit the model
    clf.fit(x_train, y_train)
    
    # Calculate model accuracy on left-out data
    acc = accuracy_score(clf.predict(x_test), y_test)
    
    # Print results
    print("Fold {0} Accuracy: {1}%".format(n, round(acc*100, 2)))
    
    # Store results
    accuracy.append(acc)
    
    # Add one to our label count
    n+=1
    
# Print overall results
print("\nAverage Accuracy: {}%".format(round(np.mean(accuracy)*100, 2)))
print("Accuracy Standard Deviation: {}%".format(round(np.std(accuracy)*100), 2))


Fold 1 Accuracy: 74.08%
Fold 2 Accuracy: 74.55%
Fold 3 Accuracy: 75.3%
Fold 4 Accuracy: 76.35%
Fold 5 Accuracy: 72.62%

Average Accuracy: 74.58%
Accuracy Standard Deviation: 1.0%


The `StratifiedKFold` object is created with the number of folds that we specify through the argument `n_splits`. This object contains lists that indicate the index values of the observations that belong to each fold. We are then able to *iterate* over the object to create our training and testing groups. 

The rest of our code simply trains models on each split of the data, and reports the accuracy of the model in each case. We then report the average accuracy across all splits together with the standard deviation of accuracy. These measures tell us that our accuracy is highly consistent across splits, and so we should expect the model to continue to perform at near these levels on similar data.

**Solve it:**

Implement cross-validation using either decision tree or random forest classifiers to gauge the robustness of models trained to predict whether or not individuals called by insurance salespeople will sign up to buy car insurance during the call. The data can be accessed [here](https://raw.githubusercontent.com/dustywhite7/pythonMikkeli/master/exampleData/carInsuranceTrain.csv). 

You should implement 10-fold cross-validation, and should report the accuracy of each fold, as well as the average model accuracy. Place your code in the cell below. You will be graded based on the following:

- Preparing the data to be used in a classification model [1 point]
- A working `sklearn` decision tree or random forest classifier [1 point]
- 10-fold cross-validation implemented on the classifier [1 point]
- Accuracy reported for each fold [1 point]
- Overall average accuracy reported [1 point]