# Module 2: Cross-Validation

In this lab you will learn about another important methodology for evaluating a machine learning model,
namely **cross-validation**.
It splits the dataset into multiple folds, trains the model on all but one fold, validates on the held-out fold, and rotates until every fold has served as the validation set.
This yields a more reliable estimate of how well the model is likely to generalize to an independent dataset.
Cross-validation is widely used for estimating test error for the following reasons:

1. It provides a less biased evaluation than a single train/validation split, which in turn helps you detect and reduce overfitting.
2. It provides a reliable way to validate a model when no explicit validation set is available.

We are going to fit a **Gaussian Naive Bayes model** to the **red wine quality** dataset, run 5-fold and 10-fold cross-validation, and compare the results.
There are several variations of cross-validation; we will take a closer look at **K-fold cross-validation**.
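
To make the fold rotation concrete, here is a minimal sketch using `sklearn.model_selection.KFold` on a toy array of 10 samples. The `toy` array is made up for illustration and is not part of the wine dataset. Each sample should land in the test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with a single feature (illustration only)
toy = np.arange(10).reshape(-1, 1)

# Each split is a pair of index arrays: (training indices, test indices)
splits = list(KFold(n_splits=5).split(toy))
for round_no, (train_idx, test_idx) in enumerate(splits):
    print(round_no, 'train', train_idx, 'test', test_idx)
```

With 5 folds, every round trains on 8 samples and tests on the remaining 2.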

sklearn API reference:

+ [sklearn.model_selection.cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn.naive_bayes import GaussianNB

# Uncomment following line to view original output.
# np.random.seed(18937)

## Load Dataset

Load the dataset from file into a pandas DataFrame.

In [None]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)
dataset.describe()

## Cross-validation with sklearn

In this example, we use a few of the feature columns as input **X** and the `quality` column as output **y**.
We then perform a 5-fold cross-validation using **cross_val_score()**,
which splits the data into 5 folds (based on the **cv** argument).
In each round it fits the model on 4 folds and scores it on the remaining fold.
It returns the 5 scores, from which you can calculate the mean and variance of the score.
This lets you use cross-validation both to tune parameters and to get an estimate of the score.

Note that the cross-validation process involves fitting the model by definition,
so you don't need to fit the model prior to cross-validation.
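
A quick way to convince yourself of this, sketched on synthetic data from `make_classification` (an assumption standing in for the wine dataset): `cross_val_score()` fits clones of the estimator, so the object you pass in remains unfitted afterwards.

```python
import sklearn.model_selection
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

unfitted = GaussianNB()
demo_scores = sklearn.model_selection.cross_val_score(unfitted, X_demo, y_demo, cv=5)

# A fitted GaussianNB exposes `classes_`; the object we passed in never gets it
print('number of scores:', len(demo_scores))
print('original model fitted?', hasattr(unfitted, 'classes_'))
```

If you want a fitted model for later predictions, you still need to call `fit()` yourself.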

In [None]:
model = GaussianNB()

# Convert the loaded dataset (data frame) into a multi-dimensional array with columns 1,2,6,9,10 as input data
X = np.array(dataset.iloc[:,:-1])[:, [1,2,6,9,10]]
# Slice out the quality column as the expected value.
y = np.array(dataset.quality)

# Do the cross-validation
sklearn.model_selection.cross_val_score(model, X, y, cv=5)

The output above shows the 5 scores from the 5-fold cross-validation.
For each round of cross-validation, the model was fit on 4 of the folds and scored on the one held out.
You should see different model scores, five in this case.
This indicates that some training sets produced models that scored better on their held-out fold than others.

Next, we will get familiar with this workflow by implementing our own version,
then discuss the results.

## Create folds

The original dataset should be **randomly** sampled into equal-sized folds.
Here the random shuffle was already done when we loaded the dataset.
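
The shuffle matters because contiguous folds cut from sorted rows would each cover only part of the label range. A minimal sketch of the same shuffle idiom on a toy DataFrame follows; the column names and the `random_state` value are made up for illustration.

```python
import pandas as pd

# Toy frame sorted by label (hypothetical data, not the wine dataset)
toy = pd.DataFrame({'feature': range(10), 'label': [0] * 5 + [1] * 5})

# Same idiom as in the loading cell, with a fixed seed for repeatability
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled)
print('index reset:', list(shuffled.index) == list(range(10)))
```

Passing `random_state` is one way to get the same shuffle on every run; the loading cell above leaves it unset, so its folds change between runs.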

Now we split the data into 5 folds. 
This can be achieved using **array_split()** from numpy.

In [None]:
help(np.array_split)

In [None]:
X_folds = np.array_split(X, 5) # split the array row-wise (axis=0) into 5 chunks
[i.shape for i in X_folds]

We have around 1600 entries and 5 features in the input, and the shapes confirm that the split looks right.
The following demonstrates how **array_split()** behaves when the dataset size isn't evenly divisible by the number of folds.
It ensures that the folds are divided as evenly as possible.
The same could be achieved via array slicing, but the code would be more complicated.

In [None]:
for t in range(120, 130):
    print(t, 'entries into', 10, 'folds:', [i.shape[0] for i in np.array_split(np.zeros(t), 10)])
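
For comparison, here is a rough sketch of the slicing approach. `split_by_slicing` is a hypothetical helper written for this illustration; it hands the leftover entries to the first folds, which is how `np.array_split` is documented to behave.

```python
import numpy as np

def split_by_slicing(arr, n_folds):
    """Split arr into n_folds contiguous chunks; the first chunks absorb the remainder."""
    base, extra = divmod(len(arr), n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        stop = start + base + (1 if i < extra else 0)
        folds.append(arr[start:stop])
        start = stop
    return folds

data = np.arange(123)
manual = split_by_slicing(data, 10)
reference = np.array_split(data, 10)

print('slicing    :', [len(f) for f in manual])
print('array_split:', [len(f) for f in reference])
```

The bookkeeping for the remainder is what makes the slicing version longer than the one-line `array_split()` call.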

The same goes for the y folds.

In [None]:
y_folds = np.array_split(y, 5)
[i.shape for i in y_folds]

## Cross-validation

For each round **i**:
1. concatenate all folds _except fold **#i**_ to create the training set and fit the model
2. then score the model based on the fold **#i** that was withheld from training.

Each round is similar to what's been covered in Module 1: Train and Validate.

In [None]:
for i in range(5):
    X_train = np.concatenate([X_folds[j] for j in range(5) if j!=i])
    X_test = X_folds[i]
    y_train = np.concatenate([y_folds[j] for j in range(5) if j!=i])
    y_test = y_folds[i]
    print('CV', i,
          'X_train', X_train.shape, 'X_test', X_test.shape,
          'y_train', y_train.shape, 'y_test', y_test.shape)
    model.fit(X_train, y_train)
    print('Score:', round(model.score(X_test, y_test), 3))

## Putting things together

Now we can replicate the general functionality of **cross_val_score()** from sklearn, 
and have a better understanding of the cross-validation workflow.

**Note:** As an exercise to help you get in the habit of cognitively processing the code you read, instead of just running it,
you could comment each code line with your interpretation.

In [None]:
def cross_val_score(model, X, y, cv = 10):
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    
    for i in range(cv):
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j!=i])
        X_test = X_folds[i]
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j!=i])
        y_test = y_folds[i]
        model.fit(X_train, y_train)
        yield model.score(X_test, y_test)



print('Our CV:', list(cross_val_score(model, X, y, cv=5)))
print('sklearn CV:', sklearn.model_selection.cross_val_score(model, X, y, cv=5))
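
The two lines of scores will usually not match exactly. For a classifier and an integer `cv`, sklearn's `cross_val_score()` uses stratified folds (`StratifiedKFold`), whereas our version cuts plain contiguous folds. As a sketch on synthetic data (an assumption standing in for the wine dataset), passing an explicit unshuffled `KFold` should make sklearn reproduce the plain split. `manual_cv_scores` is a hypothetical helper that mirrors the function above.

```python
import numpy as np
import sklearn.model_selection
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

def manual_cv_scores(model, X, y, cv):
    """Plain k-fold scores using contiguous folds from np.array_split."""
    X_folds = np.array_split(X, cv)
    y_folds = np.array_split(y, cv)
    scores = []
    for i in range(cv):
        X_train = np.concatenate([X_folds[j] for j in range(cv) if j != i])
        y_train = np.concatenate([y_folds[j] for j in range(cv) if j != i])
        model.fit(X_train, y_train)
        scores.append(model.score(X_folds[i], y_folds[i]))
    return np.array(scores)

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

ours = manual_cv_scores(GaussianNB(), X_demo, y_demo, cv=5)
plain = sklearn.model_selection.cross_val_score(
    GaussianNB(), X_demo, y_demo, cv=sklearn.model_selection.KFold(n_splits=5))

print('manual:', ours)
print('KFold :', plain)
print('match :', np.allclose(ours, plain))
```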

## 5-fold vs 10-fold cross-validation

While the implementation of **k**-fold cross-validation is straightforward, 
it's important that we understand the strengths and limitations of this methodology before its application.

In [None]:
s5 = sklearn.model_selection.cross_val_score(model, X, y, cv=5)
s10 = sklearn.model_selection.cross_val_score(model, X, y, cv=10)
print('5-fold mean', np.mean(s5), 'variance', np.var(s5))
print('10-fold mean', np.mean(s10), 'variance', np.var(s10))

In [None]:
print('5-fold scores', s5)
print('10-fold scores', s10)

It is a known issue that cross-validated scores can have large variance, especially on smaller datasets.
Here we compare 5-fold and 10-fold cross-validation, and 10-fold cross-validation shows the higher variance.

+ A larger number of folds usually means less bias. However, as we use more folds, each test fold gets smaller, and the variance of the cross-validation scores increases.
+ Too many folds means that few distinct sample combinations are possible, which limits how much the rounds differ from one another. That is to say, the training data of the different rounds will overlap heavily.
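
The arithmetic behind both points can be sketched as follows. `fold_stats` is a hypothetical helper, and the sample count of 1600 is a rounded assumption based on the dataset size mentioned earlier. Two rounds of k-fold cross-validation share k-2 of the k-1 folds in their training sets.

```python
def fold_stats(n_samples, k):
    """Return (approximate test-fold size, training share, training overlap between two rounds)."""
    test_size = n_samples // k       # samples held out in each round
    train_share = (k - 1) / k        # fraction of the data used for training in each round
    overlap = (k - 2) / (k - 1)      # fraction of one round's training set reused by another round
    return test_size, train_share, overlap

n_samples = 1600  # rounded assumption, roughly the size of the dataset used here
for k in (3, 5, 10, 20):
    test_size, train_share, overlap = fold_stats(n_samples, k)
    print(f'k={k:2d}  test fold ~{test_size:4d}  train share {train_share:.2f}  overlap {overlap:.2f}')
```

As k grows, the test folds shrink while the training sets of different rounds become almost identical.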

In [None]:
# Note: the second argument is a list comprehension that runs cross_val_score() once for each number of folds
plt.scatter([3,5,6,7,8,9,10],
    [np.var(sklearn.model_selection.cross_val_score(model, X, y, cv=i))*100 for i in [3,5,6,7,8,9,10]])

The figure above shows how the variance of the scores changes with the number of folds used.

In order to lower the variance of the cross-validation result, you should repeat the cross-validation with new random splits.
If possible, use a number of folds that is a divisor of the sample size.
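
One way to repeat cross-validation with fresh random splits is sklearn's `RepeatedStratifiedKFold`. The sketch below uses synthetic data from `make_classification` as a stand-in for the wine dataset; the number of repeats and the seed are arbitrary choices.

```python
import numpy as np
import sklearn.model_selection
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption, not the wine dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation repeated 10 times, reshuffling before each repeat
repeated = sklearn.model_selection.RepeatedStratifiedKFold(
    n_splits=5, n_repeats=10, random_state=0)
scores = sklearn.model_selection.cross_val_score(GaussianNB(), X_demo, y_demo, cv=repeated)

# Scores arrive repeat by repeat, so each row below is one full 5-fold run
per_repeat_mean = scores.reshape(10, 5).mean(axis=1)
print('all scores:', scores.shape)
print('mean of each repeat:', np.round(per_repeat_mean, 3))
print('overall mean:', round(scores.mean(), 3))
```

Averaging over repeats smooths out the luck of any single split, at the cost of fitting the model more times.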

_Limitations of cross-validation are mostly relevant to small datasets._

## Conclusion

In this lab we learned about:

+ The cross-validation workflow and its implementation
+ How 5-fold and 10-fold cross-validation compare
+ Strengths and limitations of k-fold cross-validation