# Cross Validation

Use the sklearn ```train_test_split``` helper to partition the data into a training set and a test (holdout) set.

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import datasets 
from sklearn import svm

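# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release to run
boston = datasets.load_boston()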
boston.data.shape, boston.target.shape

((506, 13), (506,))

In [2]:
# Hold out 40% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.4, random_state=0)
regression = svm.SVR(kernel='linear', C=1).fit(X_train, y_train)
# For regressors, score() returns the coefficient of determination (R^2)
regression.score(X_test, y_test)

0.667431382173115

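When a model's hyperparameters, such as ```C``` above, are tuned against the test set, the model can overfit to that test set: the settings are adjusted until the score looks best, so the test score no longer reflects performance on unseen data.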

To solve this, we can hold out a separate validation set, in addition to the training and test sets, and use it to find the best values of these hyperparameters. However, this drastically reduces the amount of data left for training, and the results can depend on the particular random choice of training and validation sets.
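The three-way split described above can be sketched as follows. This is a minimal illustration, assuming synthetic data from `make_regression` as a stand-in for the housing data, a 60/20/20 split, and an arbitrary list of candidate `C` values; none of these choices come from the original notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Carve off 20% as the final test set, then split the remainder 75/25,
# giving 60% train, 20% validation, 20% test overall
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Choose C using the validation set only (candidate values are illustrative)
candidate_Cs = [0.01, 0.1, 1, 10]
val_scores = {
    C: SVR(kernel='linear', C=C).fit(X_train, y_train).score(X_val, y_val)
    for C in candidate_Cs
}
best_C = max(val_scores, key=val_scores.get)

# Touch the test set once, at the very end
final_model = SVR(kernel='linear', C=best_C).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print(len(X_train), len(X_val), len(X_test))
print('best C: %s, test R^2: %.3f' % (best_C, test_score))
```

Only 120 of the 200 samples are used for fitting here, which is the data cost mentioned above.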

We can use Cross Validation (CV) as a solution to this problem. A test set should still be held out for the final evaluation, but a separate validation set is no longer needed. In k-fold CV, the training set is split into k smaller sets, each called a fold. The following is done for each of the k folds:

- A model is trained using the other k-1 folds as training data
- The model is then validated on the remaining fold

The reported performance measure is the average of the scores computed on the k held-out folds. This can be computationally expensive, but it does not waste data.
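The k-fold procedure above can be written out by hand. The sketch below assumes synthetic `make_regression` data and k = 5 (both illustrative choices), and compares the manual loop with `cross_val_score` given the same splitter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

k = 5
kf = KFold(n_splits=k)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, score on the single held-out fold
    model = SVR(kernel='linear', C=1)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

# The CV estimate is the mean of the held-out scores
cv_estimate = np.mean(fold_scores)

# cross_val_score runs the same loop when given the same splitter
helper_scores = cross_val_score(SVR(kernel='linear', C=1), X, y, cv=kf)
print('manual mean: %.3f, helper mean: %.3f' % (cv_estimate, helper_scores.mean()))
```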

## Computing CV Metrics

In [5]:
from sklearn.model_selection import cross_val_score
regression = svm.SVR(kernel='linear', C=1)
scores = cross_val_score(regression, boston.data, boston.target, cv=5)
scores

array([0.77285459, 0.72771739, 0.56131914, 0.15056451, 0.08212844])

In [10]:
# For a regressor the scores are R^2 values, and std() ** 2 is their variance
print('Mean R^2: %.2f with variance %.2f' % (scores.mean(), scores.std() ** 2))

Mean R^2: 0.46 with variance 0.08


When the ```cv``` argument is an integer, ```cross_val_score``` uses the ```KFold``` strategy by default for regressors such as the one above (for classifiers it uses ```StratifiedKFold```).
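As a hedged illustration of the ```cv``` argument, the sketch below passes an integer and an explicit ```KFold``` object for a regressor and checks that they give the same scores. It assumes synthetic `make_regression` data. The shuffled splitter at the end is an optional variation that may help when the rows of a dataset are ordered.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
reg = SVR(kernel='linear', C=1)

# An integer cv and an unshuffled KFold with the same n_splits are equivalent here
scores_int = cross_val_score(reg, X, y, cv=5)
scores_kfold = cross_val_score(reg, X, y, cv=KFold(n_splits=5))

# Optional variation: shuffle the rows before splitting
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores_shuffled = cross_val_score(reg, X, y, cv=shuffled_cv)

print(np.allclose(scores_int, scores_kfold))
print(scores_shuffled.mean())
```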

# K-Fold

Divides all samples into k groups, called folds, of equal size (if possible). The prediction function is learned using k - 1 folds, and the fold left out is used for testing.

Example of 2-fold CV on dataset with 4 samples:

In [12]:
from sklearn.model_selection import KFold

X = ['a', 'b', 'c', 'd']
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print('%s %s' % (train, test))
# splits the array in two; each half serves once as the test set while the other half is the training set

[2 3] [0 1]
[0 1] [2 3]
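When the number of samples is not divisible by k, the folds cannot all be the same size. A small sketch, assuming 5 samples and 2 folds as an illustrative case:

```python
import numpy as np
from sklearn.model_selection import KFold

# 5 samples cannot be divided evenly into 2 folds
X = np.arange(5)
kf = KFold(n_splits=2)

# Record the size of each test fold
test_sizes = [len(test_idx) for _, test_idx in kf.split(X)]
print(test_sizes)
```

The first fold receives the extra sample, so the test folds have sizes 3 and 2.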


## Stratified K-Fold

Variation of k-Fold which returns stratified folds, each containing approximately the same percentage of samples of each target class as the complete set.

3-Fold CV with 10 samples from two unbalanced classes:

In [14]:
from sklearn.model_selection import StratifiedKFold

X = np.ones(10)
y = [0,0,0,0,1,1,1,1,1,1]
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print('%s %s' % (train, test))
# the numbers are indices; the split ensures both classes in y appear in every train and test set
# the point of stratification is to make each fold representative of the data as a whole

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
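To make the effect of stratification visible, the sketch below counts the classes in each test fold for plain ```KFold``` and for ```StratifiedKFold``` on the same 10 labels. The helper `class_counts_per_test_fold` is a hypothetical name introduced for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.ones(10)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

def class_counts_per_test_fold(splitter):
    """Return [count of class 0, count of class 1] for each test fold."""
    counts = []
    for _, test_idx in splitter.split(X, y):
        counts.append(np.bincount(y[test_idx], minlength=2).tolist())
    return counts

plain_counts = class_counts_per_test_fold(KFold(n_splits=3))
stratified_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3))

print('KFold:          ', plain_counts)
print('StratifiedKFold:', stratified_counts)
```

With plain ```KFold``` and unshuffled labels, the first test fold contains only class 0, while every stratified test fold contains both classes.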


In [15]:
X

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [16]:
y

[0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

In [21]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import svm
from sklearn.pipeline import make_pipeline
# using train/test data created above from boston data
# pipeline chains together data preprocessing steps, can complete CV in just a few lines
pipe_svm = make_pipeline(
    StandardScaler(), PCA(n_components=2), 
    svm.SVR(kernel='linear', C=1)
)
pipe_svm.fit(X_train, y_train)
y_pred = pipe_svm.predict(X_test)
# score() on a regression pipeline reports R^2, not classification accuracy
print('R^2: %.3f' % pipe_svm.score(X_test, y_test))

R^2: 0.391


In [23]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    estimator=pipe_svm, X=X_train, y=y_train, 
    cv=10, n_jobs=-1
)
# the default scorer for a regressor is R^2
print('CV R^2: %.3f' % np.mean(scores))

CV R^2: 0.428
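Returning to the hyperparameter problem from the start of this section, cross validation on the training set can stand in for a separate validation set when choosing ```C```. The sketch below is one way this might look. It assumes synthetic `make_regression` data, an arbitrary list of candidate values, and a hypothetical helper `build_pipeline`; it is not the procedure used in the cells above.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

def build_pipeline(C):
    # Scaling and PCA are refit inside every CV fold because they are part of the pipeline
    return make_pipeline(StandardScaler(), PCA(n_components=2), SVR(kernel='linear', C=C))

# Compare candidate C values by mean CV score on the training set only
candidate_Cs = [0.1, 1, 10]
mean_cv_scores = {
    C: cross_val_score(build_pipeline(C), X_train, y_train, cv=5).mean()
    for C in candidate_Cs
}
best_C = max(mean_cv_scores, key=mean_cv_scores.get)

# Refit on the full training set and evaluate once on the held-out test set
final_pipe = build_pipeline(best_C).fit(X_train, y_train)
test_r2 = final_pipe.score(X_test, y_test)
print('best C: %s, test R^2: %.3f' % (best_C, test_r2))
```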
