# Cross Validation - REVISION

The goal of cross-validation is to test a model's ability to predict new data that was not used to fit it. This helps flag problems such as overfitting or underfitting, and gives insight into how the model will generalize to an independent dataset (that is, unseen data, for instance from a real problem).
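As a rough illustration of why held-out data matters, the sketch below compares accuracy on the training samples with accuracy on samples the model never saw. The decision tree classifier and the 100-sample hold-out are assumptions chosen for illustration; a large gap between the two numbers is the kind of overfitting signal that cross-validation is meant to surface.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Hold out the last 100 samples; the model never sees them during fitting.
X_train, y_train = X[:-100], y[:-100]
X_test, y_test = X[-100:], y[-100:]

# Assumption: an unpruned decision tree, picked because it tends to
# memorize its training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"training accuracy: {train_acc:.3f}")
print(f"held-out accuracy: {test_acc:.3f}")
```

A single hold-out split like this gives only one estimate; repeating it over several folds gives a more stable picture.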


## Traditional KFold Cross Validation
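The simplest (though slightly code-intensive) way to do this is to split the available data into several folds (say n), then train on n-1 folds and test on the remaining fold, repeating so that each fold serves once as the test set.

As a baseline, a single hold-out split trains on all but the last 100 samples and scores on those 100: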

In [1]:
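from sklearn import datasets, svm

# Load the 8x8 handwritten-digit images (1797 samples, 64 features each)
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

# Linear-kernel support vector classifier
svc = svm.SVC(C=1, kernel='linear')

# Train on all but the last 100 samples, then score on those 100
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])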


0.97999999999999998

Now we can split the data into folds and test the accuracy on each fold in turn:

In [2]:
import numpy as np

# Split features and labels into 3 folds of equal size
X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = []
for k in range(3):
    # Copy the fold lists so that pop() does not alter the originals
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    # Fit on the remaining folds and score on the held-out fold
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)

[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
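The loop above can be wrapped into a reusable generator of index pairs. This is a minimal sketch; the name manual_kfold_indices is hypothetical and is not a scikit-learn function. It assumes at least two folds and does no shuffling.

```python
import numpy as np


def manual_kfold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # array_split tolerates n_samples not being divisible by n_folds
    folds = np.array_split(np.arange(n_samples), n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        # Every fold except the k-th one forms the training set
        train_idx = np.concatenate(folds[:k] + folds[k + 1:])
        yield train_idx, test_idx


for train_idx, test_idx in manual_kfold_indices(10, 3):
    print("train:", train_idx, "| test:", test_idx)
```

Because the helper yields indices rather than data, the same pairs can be used to index both the feature array and the label array.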


## Scikit-learn makes the job easy

In [3]:
from sklearn.model_selection import KFold, cross_val_score
X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
k_fold = KFold(n_splits=5)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))

Train: [2 3 4 5 6 7 8 9] | test: [0 1]
Train: [0 1 4 5 6 7 8 9] | test: [2 3]
Train: [0 1 2 3 6 7 8 9] | test: [4 5]
Train: [0 1 2 3 4 5 8 9] | test: [6 7]
Train: [0 1 2 3 4 5 6 7] | test: [8 9]
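
In the output above, each test fold is a consecutive block of indices, so data sorted by class (like the toy list X) can produce folds that miss whole classes. One possible remedy, sketched here as an assumption and not something this walkthrough relies on, is to pass shuffle=True to KFold. The seed random_state=0 is arbitrary and only makes the run repeatable.

```python
from sklearn.model_selection import KFold

X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]

# shuffle=True permutes the indices before cutting them into folds
shuffled_k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

seen = []
for train_indices, test_indices in shuffled_k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
    seen.extend(test_indices.tolist())

# Each sample still lands in exactly one test fold
print(sorted(seen) == list(range(len(X))))
```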


The splits produced by k_fold can drive the fit-and-score loop directly:

In [4]:
[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
 for train, test in k_fold.split(X_digits)]

[0.96388888888888891,
 0.92222222222222228,
 0.96378830083565459,
 0.96378830083565459,
 0.93036211699164351]

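The cross_val_score helper computes the same per-fold scores in a single call; n_jobs=-1 spreads the folds across all available CPU cores:

In [5]: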
cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)

array([ 0.96388889,  0.92222222,  0.9637883 ,  0.9637883 ,  0.93036212])
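
The per-fold scores are usually condensed into one number plus a measure of spread. The sketch below rebuilds the estimator and splitter so that it runs on its own; reporting mean and standard deviation is a common convention assumed here, not something the helper does automatically.

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

digits = datasets.load_digits()
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=5)

scores = cross_val_score(svc, digits.data, digits.target, cv=k_fold)

# The mean is a single performance estimate; the standard deviation
# shows how much the score varies from fold to fold.
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```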

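By default cross_val_score uses the estimator's own score method (accuracy for SVC). The scoring argument selects a different metric, here macro-averaged precision:

In [6]:
cross_val_score(svc, X_digits, y_digits, cv=k_fold,
                scoring='precision_macro')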

array([ 0.96578289,  0.92708922,  0.96681476,  0.96362897,  0.93192644])
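
To make the 'precision_macro' option less opaque, the sketch below recomputes it by hand for each fold with precision_score(average='macro') and checks the result against cross_val_score. The explicit loop and the use of clone to get a fresh estimator per fold are illustrative choices, written under the assumption that the helper follows the same fit, predict, score sequence.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.metrics import precision_score
from sklearn.model_selection import KFold, cross_val_score

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=5)

manual_scores = []
for train, test in k_fold.split(X_digits):
    # Fit a fresh copy of the estimator on the training folds
    model = clone(svc).fit(X_digits[train], y_digits[train])
    predictions = model.predict(X_digits[test])
    # 'macro' computes precision per class, then takes the unweighted mean
    manual_scores.append(
        precision_score(y_digits[test], predictions, average='macro'))

helper_scores = cross_val_score(svc, X_digits, y_digits, cv=k_fold,
                                scoring='precision_macro')
print(np.allclose(manual_scores, helper_scores))
```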

# Source:
 
1. https://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html