Goal
=====

In this session we focus on cross-validation and on several data-splitting strategies.

Dataset Information
=====================

In this example, we use the iris dataset (150 samples, 4 features, 3 classes). We first split the data into a training set and a test set.



Load the dataset
======


Load the libraries and data

In [22]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

X, y = datasets.load_iris(return_X_y=True)
X.shape, y



((150, 4),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]))

In [32]:
# Simple splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)


# A Support Vector Machine with a linear kernel is used as the classifier.
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

1.0
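
A single hold-out score depends on which samples happen to land in the test set. The sketch below is an illustration rather than part of the original analysis: it repeats the same split with a few values of `random_state` (the seeds 0 to 4 are an arbitrary assumption) and collects the test accuracy each time.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

holdout_scores = []
for seed in range(5):  # seeds 0..4 are arbitrary, illustrative choices
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = svm.SVC(kernel='linear', C=1).fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_te, y_te))

# One accuracy per seed; any spread comes only from the choice of split
print(holdout_scores)
```

If the scores differ from seed to seed, that spread is the reason to average over several folds instead of trusting one split.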

cross_validate function and multiple metric evaluation
======

In [39]:
from sklearn.model_selection import cross_validate

# Metrics computed on each fold: micro- and macro-averaged recall and precision.
scoring = ['recall_micro', 'precision_micro', 'recall_macro', 'precision_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
# cv is the number of folds or a splitter object. The default is 5 folds.
scores = cross_validate(clf, X, y, scoring=scoring, cv=5)
# Dictionary with the fit/score times and one array of 5 scores per metric.
scores


{'fit_time': array([0.       , 0.       , 0.       , 0.       , 0.0156312]),
 'score_time': array([0.00852942, 0.        , 0.01552677, 0.        , 0.        ]),
 'test_recall_micro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ]),
 'test_precision_micro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ]),
 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ]),
 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ])}

In [40]:
# Aggregation metric: mean and std
np.mean(scores['test_recall_micro']), np.std(scores['test_recall_micro'])

(0.9800000000000001, 0.016329931618554516)
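
The `cv` argument also accepts a splitter object instead of an integer. As a sketch, the cell below passes a shuffled `StratifiedKFold` to `cross_validate` and aggregates every test metric; the choice of `shuffle=True` and `random_state=0` is an assumption made for illustration, and the name `cv_scores` is hypothetical.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# Explicit splitter: 5 stratified folds, shuffled with a fixed seed (illustrative choice)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_validate(
    clf, X, y, scoring=['recall_macro', 'precision_macro'], cv=cv)

# Mean and standard deviation of every test metric
for name, values in cv_scores.items():
    if name.startswith('test_'):
        print(name, np.mean(values), np.std(values))
```

With an integer `cv` and a classifier, scikit-learn uses an unshuffled `StratifiedKFold`, so passing the splitter explicitly mainly gives control over shuffling and the seed.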

Cross-Validation data splitting iterators
=====
k-fold
=====
KFold divides all the samples into k groups of samples, called folds, of equal size (if possible). If k = n, where n is the number of samples, this is equivalent to the Leave One Out strategy. The prediction function is learned using k - 1 folds, and the fold left out is used for testing.

Example of 2-fold cross-validation on a dataset with 4 samples:

In [15]:
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
y = np.array([0, 1, 0, 1])
# After the loop, `train` and `test` still hold the indices of the last fold (fold 2)
X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]

[2 3] [0 1]
[0 1] [2 3]
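
The link between k-fold and Leave One Out can be checked directly. The following sketch, on a hypothetical toy array of 4 samples, sets the number of folds equal to the number of samples and compares the resulting splits with those of `LeaveOneOut`.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(8).reshape(4, 2)  # 4 toy samples (hypothetical data)

# One fold per sample: n_splits equals the number of samples
kf_splits = [(tr.tolist(), te.tolist())
             for tr, te in KFold(n_splits=len(X_toy)).split(X_toy)]
loo_splits = [(tr.tolist(), te.tolist())
              for tr, te in LeaveOneOut().split(X_toy)]

# Compare the two lists of (train, test) index pairs
print(kf_splits == loo_splits)
```

The comparison relies on `KFold` not shuffling by default, so both splitters visit the samples in the same order.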


Stratified k-fold
====
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

Here is an example of stratified 3-fold cross-validation on a dataset with 50 samples from two unbalanced classes. We show the number of samples in each class and compare with KFold.

In [None]:
from sklearn.model_selection import StratifiedKFold, KFold
import numpy as np
X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))

kf = KFold(n_splits=3)
for train, test in kf.split(X, y):
    print('train -  {}   |   test -  {}'.format(
        np.bincount(y[train]), np.bincount(y[test])))
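
One practical consequence of stratification is that the minority class appears in every test fold. The sketch below reuses the same 45/5 dataset and checks this for both splitters; the helper name `minority_in_test` is hypothetical and only serves the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X, y = np.ones((50, 1)), np.hstack(([0] * 45, [1] * 5))

def minority_in_test(splitter):
    """Return, per fold, whether the test set holds at least one class-1 sample."""
    return [bool((y[test] == 1).any()) for _, test in splitter.split(X, y)]

print("StratifiedKFold:", minority_in_test(StratifiedKFold(n_splits=3)))
print("KFold:          ", minority_in_test(KFold(n_splits=3)))
```

Because the class-1 samples sit at the end of `y` and `KFold` does not shuffle by default, some of its test folds contain no minority samples at all, which makes per-class metrics on those folds meaningless.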

Leave One Out (LOO)
====
LeaveOneOut (or LOO) is a simple cross-validation. Each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data, as only one sample is removed from the training set:

In [9]:
from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]
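
To see what this means for a real model, the sketch below runs Leave One Out on the iris data with `cross_val_score`. It is an illustration under the assumption that the same linear SVC as earlier in this session is used; the name `loo_scores` is hypothetical.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# One fit per sample: each test set holds a single sample,
# so each individual score is either 0.0 or 1.0
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

# Number of fits and the overall accuracy estimate
print(len(loo_scores), np.mean(loo_scores))
```

The cost is one model fit per sample, which is cheap here but can become expensive on large datasets.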


Group k-fold
====
GroupKFold is a variation of k-fold which ensures that the same group is not represented in both the testing and training sets. For example, if the data is obtained from different subjects, with several samples per subject, and if the model is flexible enough to learn from highly person-specific features, it could fail to generalize to new subjects. GroupKFold makes it possible to detect this kind of overfitting.

Imagine you have three subjects, each with an associated number from 1 to 3:

In [10]:
from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
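
The guarantee can be verified by looking at group labels rather than sample indices. This sketch reuses the same data (converted to NumPy arrays so that index arrays can be applied directly) and records, for each split, which groups appear on both sides; the list name `overlaps` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]).reshape(-1, 1)
y = np.array(["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

overlaps = []
for train, test in GroupKFold(n_splits=3).split(X, y, groups=groups):
    # Groups present in both the training and the test part of this split
    shared = set(groups[train].tolist()) & set(groups[test].tolist())
    overlaps.append(shared)
    print("test groups:", sorted(set(groups[test].tolist())),
          "| shared with train:", shared)
```

An empty set for every split means that no subject contributes samples to both training and testing.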


StratifiedGroupKFold
====
StratifiedGroupKFold is a cross-validation scheme that combines both StratifiedKFold and GroupKFold. The idea is to try to preserve the distribution of classes in each split while keeping each group within a single split. That might be useful when you have an unbalanced dataset so that using just GroupKFold might produce skewed splits.

Example:

In [11]:
from sklearn.model_selection import StratifiedGroupKFold
X = list(range(18))
y = [1] * 6 + [0] * 12
groups = [1, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 4, 5, 5, 5, 6, 6, 6]
sgkf = StratifiedGroupKFold(n_splits=3)
for train, test in sgkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[ 0  2  3  4  5  6  7 10 11 15 16 17] [ 1  8  9 12 13 14]
[ 0  1  4  5  6  7  8  9 11 12 13 14] [ 2  3 10 15 16 17]
[ 1  2  3  8  9 10 12 13 14 15 16 17] [ 0  4  5  6  7 11]


Leave One Group Out
====
LeaveOneGroupOut is a cross-validation scheme where each split holds out samples belonging to one specific group. Group information is provided via an array that encodes the group of each sample.

Each training set is thus constituted by all the samples except the ones related to a specific group. This is the same as LeavePGroupsOut with n_groups=1 and the same as GroupKFold with n_splits equal to the number of unique labels passed to the groups parameter.

For example, in the case of multiple experiments, LeaveOneGroupOut can be used to create a cross-validation based on the different experiments: each training set uses the samples of all the experiments except one:

In [12]:
from sklearn.model_selection import LeaveOneGroupOut

X = [1, 5, 10, 50, 60, 70, 80]
y = [0, 1, 1, 2, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3, 3]
logo = LeaveOneGroupOut()
for train, test in logo.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[2 3 4 5 6] [0 1]
[0 1 4 5 6] [2 3]
[0 1 2 3] [4 5 6]
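
The equivalences stated above can be checked on the same data. The sketch below compares the test sets produced by `LeaveOneGroupOut`, `LeavePGroupsOut(n_groups=1)` and `GroupKFold` with one fold per group. Fold order may differ between splitters, so the comparison uses sets; the helper name `held_out_sets` is hypothetical.

```python
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, LeavePGroupsOut

X = [1, 5, 10, 50, 60, 70, 80]
y = [0, 1, 1, 2, 2, 2, 2]
groups = [1, 1, 2, 2, 3, 3, 3]

def held_out_sets(splitter):
    """Collect the test index sets a splitter produces, ignoring fold order."""
    return {tuple(test.tolist())
            for _, test in splitter.split(X, y, groups=groups)}

logo_sets = held_out_sets(LeaveOneGroupOut())
lpgo_sets = held_out_sets(LeavePGroupsOut(n_groups=1))
gkf_sets = held_out_sets(GroupKFold(n_splits=len(set(groups))))

# All three splitters should hold out the same groups
print(logo_sets == lpgo_sets == gkf_sets)
```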


Running cross-validation with an explicit loop
====


In [45]:
from sklearn.metrics import accuracy_score, classification_report

X, y = datasets.load_iris(return_X_y=True)

# Hold out 20% of the data as a test set for the final performance estimate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = svm.SVC(kernel='linear', C=1, random_state=0)

kf = KFold(n_splits=5)
for i, (train_index, test_index) in enumerate(kf.split(X_train, y_train), start=1):
    X_trainFold = X_train[train_index]
    X_testFold = X_train[test_index]
    y_trainFold = y_train[train_index]
    y_testFold = y_train[test_index]

    # Train on the training folds, then score on the held-out validation fold
    clf.fit(X_trainFold, y_trainFold)
    print(f"Accuracy for the fold no. {i} on the validation set: {accuracy_score(y_testFold, clf.predict(X_testFold))}")

# Hold-out evaluation: refit on the full training set, then score on the untouched test set
clf.fit(X_train, y_train)
print(f"Accuracy for the hold-out: {accuracy_score(y_test, clf.predict(X_test))}")


Accuracy for the fold no. 1 on the validation set: 0.9583333333333334
Accuracy for the fold no. 2 on the validation set: 0.9166666666666666
Accuracy for the fold no. 3 on the validation set: 1.0
Accuracy for the fold no. 4 on the validation set: 1.0
Accuracy for the fold no. 5 on the validation set: 0.875
Accuracy for the hold-out: 1.0
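
The explicit loop is useful when custom logic is needed inside each fold. For plain scoring, the same folds can be handed to `cross_val_score`. The sketch below runs both on the training portion and compares the per-fold accuracies; the names `manual_scores` and `helper_scores` are hypothetical.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = svm.SVC(kernel='linear', C=1, random_state=0)
kf = KFold(n_splits=5)

# Manual loop over the folds of the training portion
manual_scores = []
for train_index, val_index in kf.split(X_train):
    clf.fit(X_train[train_index], y_train[train_index])
    manual_scores.append(
        accuracy_score(y_train[val_index], clf.predict(X_train[val_index])))

# Same folds handled by cross_val_score (default scoring for a classifier is accuracy)
helper_scores = cross_val_score(clf, X_train, y_train, cv=kf)

print(np.allclose(manual_scores, helper_scores))
```

Both approaches leave the test set untouched, so it remains available for the final hold-out evaluation.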
