In [2]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style("whitegrid")
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

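# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older release to run as written.
boston = datasets.load_boston()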
boston.data.shape, boston.target.shape


((506, 13), (506,))

We can now quickly sample a training set while holding out 40% of the data for testing (evaluating our regressor):

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.4, random_state=0)

X_train.shape, y_train.shape

X_test.shape, y_test.shape

regression = svm.SVR(kernel='linear', C=1).fit(X_train, y_train)
regression.score(X_test, y_test)

0.667431382173115

When evaluating different settings ("hyperparameters") for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally.

This way knowledge about the test set can "leak" into the model and evaluation metrics no longer report on generalization performance.

To solve this problem, yet another part of the dataset can be held out as a so-called "validation set": training proceeds on the training set, evaluation is then done on the validation set, and when the experiment seems to be successful, final evaluation is done on the test set.

However, by partitioning the available data into three sets, we drastically reduce the number of samples that can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
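To make the cost of a three-way split concrete, here is a minimal sketch that cuts 506 sample indices (the size of the Boston data above) into train, validation and test parts. The helper name `three_way_split` and the 60/20/20 fractions are assumptions chosen for illustration, not something the notebook itself uses.

```python
import random


def three_way_split(n_samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle sample indices and cut them into train/validation/test lists.

    Hypothetical helper: the fractions and the seed are illustrative choices.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded shuffle, so the split is reproducible
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(506)
print(len(train_idx), len(val_idx), len(test_idx))
```

Under these assumed fractions only about 60% of the samples remain for fitting, and a different seed gives a different (train, validation) pair, which is the instability the paragraph above describes.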

A solution to this problem, as discussed earlier, is a procedure called cross-validation (CV). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets.
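The splitting step of k-fold CV can be sketched from scratch in a few lines. This is an illustrative re-implementation under simple assumptions (consecutive folds, no shuffling), not scikit-learn's own code; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds.

    Hypothetical helper: when n_samples is not divisible by k, the first
    n_samples % k folds receive one extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))  # the held-out fold
        train = list(range(0, start)) + list(range(stop, n_samples))  # the other k - 1 folds
        yield train, test
        start = stop


splits = list(k_fold_indices(10, 3))
for train, test in splits:
    print(train, test)
```

Each sample lands in exactly one held-out fold, so a model is fitted k times and every sample is used for evaluation once.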

<b>Computing cross-validated metrics</b>

In [6]:
from sklearn.model_selection import cross_val_score
regression = svm.SVR(kernel='linear', C=1)
scores = cross_val_score(regression, boston.data, boston.target, cv=5)
scores

array([0.77285459, 0.72771739, 0.56131914, 0.15056451, 0.08212844])

The mean score and an approximate 95% confidence interval of the score estimates (the mean plus or minus two standard deviations) are hence given by:

In [7]:
print("R^2 : %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

R^2 : 0.46 (+/- 0.58)
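As a cross-check, the summary can be recomputed from the five fold scores printed earlier using only the standard library. This sketch assumes the interval is taken as the mean plus or minus two population standard deviations, which is only a rough stand-in for a 95% interval with so few folds.

```python
import statistics

# Fold scores copied from the cross_val_score output shown earlier.
fold_scores = [0.77285459, 0.72771739, 0.56131914, 0.15056451, 0.08212844]

mean_score = statistics.mean(fold_scores)
std_score = statistics.pstdev(fold_scores)  # population std, matching NumPy's default (ddof=0)
half_width = 2 * std_score                  # two standard deviations, not the std squared

print("%0.2f (+/- %0.2f)" % (mean_score, half_width))
```

Note that the standard deviation is doubled rather than squared: squaring it gives the variance (about 0.08 for these scores), which is not an interval half-width.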


By default, the score computed at each CV iteration is the score method of the estimator (the coefficient of determination R^2 for a regressor such as SVR). It is possible to change this by using the scoring parameter:


In [10]:
from sklearn import metrics
scores = cross_val_score(
        regression, boston.data, boston.target, cv=5, 
        scoring='neg_mean_squared_error')
scores

array([ -7.84451123, -24.78772444, -35.13272326, -74.50555945,
       -24.40465975])
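The scores above are negative because scikit-learn negates error metrics so that a higher score is always better. A small sketch, using only the values printed above, flips the sign back and takes the square root to get a per-fold RMSE in the units of the target:

```python
import math

# Negated MSE values copied from the output above.
neg_mse = [-7.84451123, -24.78772444, -35.13272326, -74.50555945, -24.40465975]

mse = [-score for score in neg_mse]       # undo the sign flip
rmse = [math.sqrt(value) for value in mse]  # back to the units of the target
mean_mse = sum(mse) / len(mse)

print(["%.2f" % value for value in rmse])
print("mean MSE: %.2f" % mean_mse)
```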

When the "cv" argument is an integer, "cross_val_score" uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin.
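An integer "cv" gives consecutive, unshuffled folds. When more control is wanted, a splitter object can be passed instead. The sketch below uses synthetic data from make_regression as a stand-in for a real dataset; the variable names and the choice to shuffle are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic regression data (an assumption; not the Boston data used above).
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

# An explicit splitter: 5 folds, shuffled once with a fixed seed.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

model = SVR(kernel="linear", C=1)
demo_scores = cross_val_score(model, X_demo, y_demo, cv=cv)
print(demo_scores)
```

Shuffling can matter when the rows of a dataset are ordered in some systematic way, since consecutive folds would then differ from one another more than random ones.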

<b>K-fold</b>

KFold divides all the samples into k groups of samples, called folds, of equal size where possible (if k = n, this is equivalent to the Leave One Out strategy). The prediction function is learned using k - 1 folds, and the fold left out is used for testing.
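The equivalence with Leave One Out for k = n can be checked directly. This is a small sketch on a four-element toy list; the variable names are illustrative.

```python
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = ["a", "b", "c", "d"]

# KFold with as many splits as samples: each test fold holds one sample.
kfold_splits = [(train.tolist(), test.tolist())
                for train, test in KFold(n_splits=len(X_toy)).split(X_toy)]

# LeaveOneOut on the same data.
loo_splits = [(train.tolist(), test.tolist())
              for train, test in LeaveOneOut().split(X_toy)]

for train, test in kfold_splits:
    print(train, test)
print(kfold_splits == loo_splits)
```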

In [11]:
from sklearn.model_selection import KFold

In [12]:
X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("%s %s" % (train, test))

[2 3] [0 1]
[0 1] [2 3]


<b>Stratified k-fold</b>

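StratifiedKFold is a variation of k-fold that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the complete set. Example of stratified 3-fold cross-validation on a dataset with 10 samples from two slightly unbalanced classes: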

In [13]:
from sklearn.model_selection import StratifiedKFold

In [15]:
X = np.ones(10)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
skf = StratifiedKFold(n_splits=3)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

[2 3 6 7 8 9] [0 1 4 5]
[0 1 3 4 5 8 9] [2 6 7]
[0 1 2 4 5 6 7] [3 8 9]
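To see what stratification buys, the class labels in each test fold can be tallied. The sketch below copies the test indices from the output above rather than recomputing them, so it reflects that particular run.

```python
from collections import Counter

y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Test-fold indices copied from the StratifiedKFold output above.
test_folds = [[0, 1, 4, 5], [2, 6, 7], [3, 8, 9]]

fold_counts = [Counter(y[i] for i in fold) for fold in test_folds]
for fold, counts in zip(test_folds, fold_counts):
    print(fold, dict(counts))
```

Every test fold contains samples from both classes, in roughly the 4:6 proportion of the full label vector; a plain consecutive split of the same data would put only class 0 in the first fold.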


In [17]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from sklearn import svm
from sklearn.pipeline import make_pipeline

pipe_svm = make_pipeline(StandardScaler(),
                            PCA(n_components=2),
                            svm.SVR(kernel='linear', C=1))

pipe_svm.fit(X_train, y_train)
y_pred = pipe_svm.predict(X_test)
print('Test R^2 : %.3f' % pipe_svm.score(X_test, y_test))

Test R^2 : 0.391


In [18]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_svm,
                        X=X_train,
                        y=y_train,
                        cv=10,
                        n_jobs=1)

print('CV R^2 scores: %s' % scores)

CV R^2 scores: [0.63971176 0.43579197 0.46977821 0.25027246 0.5124364  0.26221374
 0.30877195 0.54528563 0.37810066 0.47313549]
