- Notebook create date: 09 April 2019
- Notebook modified date: 

##### Introduction

- Cross-validation (CV) is a model evaluation method that repeatedly splits a dataset into `train` and `test` parts, so that the model is always scored on data it was not fitted on.
- It can also be used for `tuning` the hyperparameters of a model.
- To avoid **overfitting**, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set `X_test, y_test`.
- In scikit-learn a random split into training and test sets can be quickly computed with the `train_test_split` helper function. Let’s load the iris data set to fit a linear support vector machine on it:

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
import warnings
warnings.simplefilter('ignore')

In [8]:
# load the iris dataset
iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)

(150, 4) (150,)


Let's look at the function signature for `train_test_split()`:

In [31]:
help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float, int or None, optional (default=0.25)
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. By default, 


In [9]:
# Now sample a training set while holding out 40% of the data for testing (evaluating) our classifier:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=2019)

In [11]:
print(X_train.shape, X_test.shape)

(90, 4) (60, 4)


In [12]:
print(y_train.shape, y_test.shape)

(90,) (60,)


In [13]:
# Build a classifier object
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

In [15]:
# Now test the classifier on test data
print(clf.score(X_test, y_test))

1.0


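WHOA! Perfect accuracy on the held-out test set. Keep in mind that this is a single random split of a small, easy dataset, so the score may be optimistic.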
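A single split can be lucky. The following is a minimal sketch, not part of the original run, that refits the same classifier on a few different random splits to see how much the held-out score moves. The seeds and the name `split_scores` are arbitrary choices made for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

split_scores = {}
for seed in (0, 1, 2, 3, 4):  # arbitrary seeds, chosen only for illustration
    # Same 60/40 split as above, but with a different shuffle each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=seed)
    clf = svm.SVC(kernel="linear", C=1).fit(X_tr, y_tr)
    split_scores[seed] = clf.score(X_te, y_te)

print(split_scores)
```

If the scores differ noticeably between seeds, a single train/test split is not a stable estimate of generalization performance.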

When evaluating different settings (“hyperparameters”) for estimators, such as the `C` setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
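As a rough illustration of the three-way partition described above, `train_test_split` can be applied twice. The split fractions and the names `X_rest`, `X_val` are assumptions made for this sketch, not values taken from the notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# First hold out a final test set (20% here, an illustrative choice)
X_rest, X_test, y_rest, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=2019)

# Then split what is left into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=2019)

print(X_train.shape, X_val.shape, X_test.shape)
```

Note how few samples remain for training once both a validation set and a test set have been carved out of 150 rows.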

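A solution to this problem is a procedure called `cross-validation` (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training data is split into k folds; the model is trained on k-1 of them and scored on the remaining one, and this is repeated so that each fold serves once as the validation data. The k scores are then averaged.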
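One way to put this into practice is to let cross-validation on the training data choose `C`, and to touch the test set only once at the end. The sketch below uses `GridSearchCV` for that; the candidate values for `C` are assumptions picked for illustration, not recommendations.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=2019)

# Candidate values for C are illustrative assumptions
param_grid = {"C": [0.1, 1, 10]}

# Each candidate is scored by 5-fold CV on the training data only
search = GridSearchCV(svm.SVC(kernel="linear"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)

# The held-out test set is used once, for the final estimate
test_score = search.score(X_test, y_test)
print(test_score)
```

Because the test set plays no part in choosing `C`, the final score is not affected by the leakage described earlier.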

#### Computing cross-validated metrics

The simplest way to use cross-validation is to call the `cross_val_score` helper function on the estimator and the dataset.

In [19]:
from sklearn.model_selection import cross_val_score

In [20]:
help(cross_val_score)

Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score='raise-deprecating')
    Evaluate a score by cross-validation
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.
    
    X : array-like
        The data to fit. Can be for example a list, or an array.
    
    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.
    
    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.
    
    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentat

In [23]:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])
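To see what the helper is doing, the five scores can be reproduced by hand. This sketch assumes that, for a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling, which is the documented behaviour in the scikit-learn versions I am aware of. The names `manual_scores` and `helper_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC(kernel="linear", C=1)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit a fresh copy on four folds, score it on the remaining fold
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

helper_scores = cross_val_score(clf, X, y, cv=5)
print(manual_scores)
print(helper_scores)
```

Each sample ends up in a test fold exactly once, so all 150 rows contribute to the evaluation.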

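The mean score and a band of two standard deviations around it are given below. This band is often quoted as an approximate 95% interval, but with only five fold scores it is a rough indication of spread rather than a true confidence interval.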

In [24]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.98 (+/- 0.03)


By default, the score computed at each CV iteration comes from the estimator's `score` method. This can be changed with the `scoring` parameter.
See the scoring parameter [help](https://scikit-learn.org/stable/modules/model_evaluation.html) for the available options, in particular sub-section 3.3.1.1.

In [25]:
from sklearn import metrics

In [47]:
scores_1 = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')
scores_1

array([0.96658312, 1.        , 0.96658312, 0.96658312, 1.        ])

In [48]:
scores_2 = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='accuracy')
scores_2

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [49]:
scores_3 = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_micro')
scores_3

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])
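The `accuracy` and `f1_micro` arrays above are identical, which is expected: for single-label multiclass problems, micro-averaged F1 equals accuracy. As a sketch, the three metrics can also be computed in one pass with `cross_validate`, which accepts a list of scorer names. The variable name `results` is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)

# One round of 5-fold CV, scored with several metrics at once
results = cross_validate(clf, iris.data, iris.target, cv=5,
                         scoring=["accuracy", "f1_macro", "f1_micro"])

for name in ("test_accuracy", "test_f1_macro", "test_f1_micro"):
    print(name, results[name].mean())

# Micro-averaged F1 matches accuracy fold by fold on this task
print(np.allclose(results["test_accuracy"], results["test_f1_micro"]))
```

This avoids refitting the model once per metric, which matters more on larger datasets than on iris.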

Reference: https://scikit-learn.org/stable/modules/cross_validation.html