<center><h2>Cross Validation</h2></center>
<center><img src="https://i.stack.imgur.com/c6ECF.png" width="70%"/></center>

By The End Of This Session You Should Be Able To:
----

- Define Cross Validation (CV)
- Explain why and how to do CV
- Explain the difference between train, validate, and test datasets

<center><h2>What is the goal of Machine Learning?</h2></center>

Learn a function from data that can __generalize__ to novel data

<center><h2>ML Workflow</h2></center>

<center><img src="images/img-26.png" width="75%"/></center>

Cross-Validation (CV)
-----

Primary idea: Do __not__ use all your data for training.

Instead, split your data into 3 separate groups:

1. Test set - Held out until the very end; used once to evaluate the final model with the final metric
1. Training set - Used to fit a model for a specific hyperparameter configuration
1. Validation set - Paired with a training set to test how well each hyperparameter configuration generalizes

<center><img src="images/three_way.png" width="75%"/></center>

<center><img src="images/train_test.png" width="90%"/></center>
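One way to produce the three groups is to call scikit-learn's `train_test_split` twice. This is a minimal sketch on made-up toy data; the 60/20/20 proportions and the `random_state` values are assumptions chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 observations, 2 features, 2 balanced classes
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Then split the remainder: 25% of the remaining 80% is 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Splitting off the test set first keeps it out of every later decision about the model.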

Definitions
------

Training set: Data points used to train the model.

Validation set: Data points used to repeatedly check the performance of your model while you are still developing it (e.g., tuning hyperparameters).

Testing set: Data points used to check the performance once training is __completely finished__.

Common uses of validation set
-----

- Compare different hyperparameter configurations
- Compare different feature engineering representations or feature selection
- Compare different algorithms
- Estimate the variability of model performance (e.g., the mean and spread of scores across different splits)

Source: https://www.quora.com/What-is-the-definition-of-development-set-in-machine-learning
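As an illustration of the first use above, here is a minimal sketch that compares a few hyperparameter configurations on a validation set. The iris dataset, the k-nearest-neighbors model, and the candidate values are assumptions picked for the example; they do not come from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set first, then split the rest into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Compare hyperparameter configurations using the validation set only
candidate_ks = [1, 5, 15]  # illustrative values, not recommendations
val_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)
print(val_scores, "best k:", best_k)

# The test set is used exactly once, after the choice has been made
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
print("test accuracy:", final_model.score(X_test, y_test))
```

The same loop could compare feature sets or algorithms instead of values of `n_neighbors`.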

k-fold CV
------

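The training set is split into k smaller sets (folds).

For each of the k folds:

- A model is trained using the other k-1 folds as training data.
- The resulting model is validated on the remaining fold.

The reported CV performance is the average of the k validation scores. The hold-out test set is not part of this loop; it is used only once, for the final evaluation.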

Source: https://scikit-learn.org/stable/modules/cross_validation.html

In [11]:
reset -fs

In [12]:
# Simulate splitting a dataset of 20 observations into 5 folds
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=False).split(range(20))

print(f"{'Iteration'} {'Training set observations':^48} {'Validate set observations'}")
for iteration, data in enumerate(kf, start=1):
    print(f"{iteration:^9} {data[0]} {data[1]}")

Iteration            Training set observations             Validate set observations
    1     [ 4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [0 1 2 3]
    2     [ 0  1  2  3  8  9 10 11 12 13 14 15 16 17 18 19] [4 5 6 7]
    3     [ 0  1  2  3  4  5  6  7 12 13 14 15 16 17 18 19] [ 8  9 10 11]
    4     [ 0  1  2  3  4  5  6  7  8  9 10 11 16 17 18 19] [12 13 14 15]
    5     [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15] [16 17 18 19]


Source: https://github.com/justmarkham/scikit-learn-videos/blob/master/07_cross_validation.ipynb

Common Sizes for Train Test Splits
------

Train / test:

- 70/30 
- 80/20
- 90/10

Mostly an empirical choice based on domain complexity and size of the data.

Common Sizes for Folds
------

- k=5 
- k=10 (tends to be the most popular)

Again, an empirical choice based on the number of hyperparameters and the size of the data.

scikit-learn makes CV easy
-------

You'll see an example later.

Let's Compare Different Model Evaluation Procedures 
-----

1. Training and testing on the same data
1. Train/test split
1. K-fold cross-validation

Training and testing on the same data
----

Rewards overly-complex models that overfit the training data.

Might not necessarily generalize.
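A small sketch of the problem, assuming the iris dataset and a 1-nearest-neighbor classifier (both chosen here for illustration). A 1-NN model effectively memorizes its training data, so scoring it on that same data says little about how it handles new points.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1-nearest-neighbor: a very flexible model that memorizes the training data
knn = KNeighborsClassifier(n_neighbors=1)

# Train and test on the same data
same_data_acc = knn.fit(X, y).score(X, y)

# Same model type, but scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
held_out_acc = knn.fit(X_train, y_train).score(X_test, y_test)

print(f"same data: {same_data_acc:.3f}  held out: {held_out_acc:.3f}")
```

The held-out score is usually the more honest of the two numbers.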

Check for understanding
-----

When would you want to train and test on the same data?

- Have very little data to begin with
- Very small time budget (e.g., learning in a real-time system)
- Sure that the new data is the same as the training data

Train/test split
-----

Split the dataset into two sets so that the model is trained and tested on different data.

Better estimate of out-of-sample performance, but it can still be a high-variance estimate, because the result depends on which observations happen to land in the test set.

Useful because:

- Common
- Simple
- Flexible
- Fast
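The variance of a single train/test split can be seen by repeating the split with different random seeds. This is a hedged sketch: the iris dataset, the 5-nearest-neighbor model, the 70/30 split, and the seeds are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Repeat the same 70/30 split with different seeds and record test accuracy
scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

print("accuracy per split:", [round(s, 3) for s in scores])
print("spread (max - min):", round(max(scores) - min(scores), 3))
```

Any spread between the printed accuracies comes only from which observations ended up in the test set, since the model and data are otherwise identical.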

K-fold cross-validation
-----

Systematically create k train/validation splits and average the k validation scores

Runs roughly k times slower than a single train/test split, because k models are fit
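A minimal sketch of k-fold CV with scikit-learn's `cross_val_score`, which fits and scores the model once per fold. The dataset (iris), the model, and `cv=5` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# cv=5 requests 5 folds: the model is fit 5 times, each time validated on a different fold
scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")

print("accuracy per fold:", scores.round(3))
print(f"mean: {scores.mean():.3f}  std: {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the estimate depends on the particular split.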

Protip: Stratified Sampling
-----

For classification problems, stratified sampling is recommended for creating the folds

Each of the k folds should contain roughly the same class proportions as the full dataset

scikit-learn's `cross_val_score` function does this by default when the estimator is a classifier and `cv` is an integer (or left unset)
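To see what stratification changes, this sketch compares the class counts in each validation fold for `KFold` and `StratifiedKFold`. The imbalanced toy labels (15 of class 0, 5 of class 1) are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels (illustrative): 15 observations of class 0, 5 of class 1
y = np.array([0] * 15 + [1] * 5)
X = np.zeros((20, 1))  # features are irrelevant for the split itself

def fold_class_counts(splitter):
    """Return the per-class counts of each validation fold."""
    return [
        np.bincount(y[val_idx], minlength=2).tolist()
        for _, val_idx in splitter.split(X, y)
    ]

plain = fold_class_counts(KFold(n_splits=5, shuffle=False))
stratified = fold_class_counts(StratifiedKFold(n_splits=5))

print("KFold           :", plain)
print("StratifiedKFold :", stratified)  # every fold: [3, 1]
```

Without stratification, the unshuffled folds here include some that contain only one class, which makes their validation scores hard to interpret.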

Caveat: Moar Data
-------

The dev (validation) and test sets should be big enough for their results to be representative of the performance of the model.

If the dev set has 100 examples, the dev accuracy can vary a lot depending on the chosen dev set. For bigger datasets (>1M examples), the dev and test sets can each have around 10,000 examples, for instance (only 1% of the total data each).
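A rough way to quantify this is the binomial standard error of an accuracy estimate, sqrt(p(1-p)/n). The sketch below assumes independent examples and a hypothetical true accuracy of 0.9; both are simplifications made for illustration.

```python
import math

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate with true accuracy p and n examples."""
    return math.sqrt(p * (1 - p) / n)

p = 0.9  # hypothetical true accuracy, chosen for illustration
for n in (100, 10_000):
    se = accuracy_standard_error(p, n)
    print(f"n={n:>6}: standard error ~ {se:.3f}")
# n=   100: standard error ~ 0.030
# n= 10000: standard error ~ 0.003
```

Under these assumptions, a 100-example dev set leaves the accuracy uncertain by a few percentage points, while 10,000 examples narrows that to a few tenths of a point.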

Summary
-----

- (Almost) always do train/test splits
- Whenever possible do k-fold CV
- scikit-learn makes it easy to do the right thing

Example code
------

- https://github.com/justmarkham/scikit-learn-videos/blob/master/07_cross_validation.ipynb

- https://github.com/justmarkham/scikit-learn-videos/blob/master/08_grid_search.ipynb

<br>
<br> 
<br>

----