http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

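When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is a risk of **overfitting on the test set**, because the parameters can be tweaked until the estimator performs optimally on that particular set. Cross-validation addresses this by evaluating the estimator on several train/test splits instead of a single held-out set.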

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()

clf = SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores                                              

array([ 0.96666667,  1.        ,  0.96666667,  0.96666667,  1.        ])

In [3]:
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))                                             

Accuracy: 0.98 (+/- 0.03)


In [4]:
from sklearn import metrics
scores = cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_macro')
scores 

array([ 0.96658312,  1.        ,  0.96658312,  0.96658312,  1.        ])

* The scoring method: http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
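
When more than one metric is of interest, `cross_validate` from `sklearn.model_selection` accepts a list of scorer names and returns a dict of per-fold results. The following is a minimal sketch, assuming the same iris data and linear SVC as above; the variable names (`linear_svc`, `results`) are only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
linear_svc = SVC(kernel='linear', C=1)

# Evaluate two metrics on the same 5 folds in a single call.
results = cross_validate(linear_svc, X, y, cv=5,
                         scoring=['accuracy', 'f1_macro'])

# Each requested metric is stored under a 'test_<name>' key,
# with one value per fold.
print(results['test_accuracy'])
print(results['test_f1_macro'])
```

The returned dict also contains `fit_time` and `score_time` arrays, which can help when comparing the cost of different estimators.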

**It is also possible to use other cross validation strategies by passing a cross validation iterator instead, for instance:**

In [5]:
from sklearn.model_selection import ShuffleSplit
# Three random 70/30 train/test splits; random_state fixes the shuffling.
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
cross_val_score(clf, iris.data, iris.target, cv=cv)

array([ 0.97777778,  0.97777778,  1.        ])

### Scaling:

In [8]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
clf = SVC(C=1).fit(X_train_transformed, y_train)
X_test_transformed = scaler.transform(X_test)
clf.score(X_test_transformed, y_test)  

0.93333333333333335
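
The cell above fits the scaler on the training set only and reuses it on the test set. Under cross-validation the same rule should hold for every fold, and one way to get that is to wrap the scaler and the classifier in a pipeline. This is a sketch under the assumption that `StandardScaler` followed by `SVC(C=1)` is the model of interest; `scaled_svc` and `pipeline_scores` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The scaler is re-fit on the training part of each fold, so no
# statistics from the held-out part are used during training.
scaled_svc = make_pipeline(StandardScaler(), SVC(C=1))
pipeline_scores = cross_val_score(scaled_svc, X, y, cv=5)
print(pipeline_scores.mean())
```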

### Predictions:

In [9]:
from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted) 

0.97999999999999998

## Splitting Strategy

### Cross-validation iterators for i.i.d data
When data are i.i.d. (independent and identically distributed), the occurrence of each value does not depend on the previous values. Because of this, splitting the samples at random is good enough.

* K-fold

In [11]:
import numpy as np
from sklearn.model_selection import KFold

X = ["a", "b", "c", "d"]
kf = KFold(n_splits=2)
for train, test in kf.split(X):
    print("The indexes for training and testing sets: %s %s" % (train, test))

The indexes for training and testing sets: [2 3] [0 1]
The indexes for training and testing sets: [0 1] [2 3]


* Leave One Out (LOO)
* Leave p Out (LPO)
* ShuffleSplit
  * ShuffleSplit is a good alternative to KFold cross validation that allows finer control over the number of iterations and the proportion of samples on each side of the train / test split.
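
Leave One Out and Leave p Out can be tried on the same four-element toy list used for K-fold. The sketch below only enumerates the index splits; `loo_splits` and `lpo_splits` are illustrative names.

```python
from sklearn.model_selection import LeaveOneOut, LeavePOut

X = ["a", "b", "c", "d"]

# Leave One Out: each sample is the test set exactly once.
loo_splits = [(train.tolist(), test.tolist())
              for train, test in LeaveOneOut().split(X)]

# Leave p Out with p=2: every pair of samples is a test set once.
lpo_splits = [(train.tolist(), test.tolist())
              for train, test in LeavePOut(p=2).split(X)]

print(len(loo_splits), len(lpo_splits))
for train, test in lpo_splits:
    print("%s %s" % (train, test))
```

Because the number of Leave p Out splits grows combinatorially with the number of samples, it is usually only practical on small datasets.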

### Cross-validation iterators with stratification based on class labels
When some class labels have many more samples than others (class imbalance), stratified splitting keeps the proportion of each class roughly the same in every fold.
* StratifiedKFold
* StratifiedShuffleSplit

In [13]:
from sklearn.model_selection import StratifiedShuffleSplit

X = np.ones(12)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
skf = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train, test in skf.split(X, y):
    print("%s %s" % (train, test))

[ 6  3  1 10  8  2  9  5  4] [ 0 11  7]
[ 7  0 10  4  2 11  8  6  1] [3 5 9]
[ 1 10  0 11  2  7  6  9  8] [5 4 3]
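
StratifiedKFold can be checked the same way. The sketch below reuses the 4-versus-8 label vector from the cell above and counts how many samples of each class land in each test fold; `fold_class_counts` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.ones(12)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2)

# For each test fold, count the samples of class 0 and class 1.
fold_class_counts = [np.bincount(y[test], minlength=2).tolist()
                     for _, test in skf.split(X, y)]
print(fold_class_counts)
```

Each test fold keeps the 1:2 class ratio of the full label vector.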


### Cross-validation iterators for grouped data
In this case we would like to know if a model trained on a particular set of groups generalizes well to the unseen groups. To measure this, we need to ensure that all the samples in the validation fold come from groups that are not represented at all in the paired training fold.
* GroupKFold
* LeaveOneGroupOut
* LeavePGroupsOut
* GroupShuffleSplit

In [14]:
from sklearn.model_selection import GroupKFold

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]

gkf = GroupKFold(n_splits=3)
for train, test in gkf.split(X, y, groups=groups):
    print("%s %s" % (train, test))

[0 1 2 3 4 5] [6 7 8 9]
[0 1 2 6 7 8 9] [3 4 5]
[3 4 5 6 7 8 9] [0 1 2]
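
LeaveOneGroupOut holds out one whole group per split. The sketch below reuses the toy data from the GroupKFold cell, converted to NumPy arrays so the group labels can be indexed, and checks that no group appears on both sides of a split; `logo_splits` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10]).reshape(-1, 1)
y = np.array(["a", "b", "b", "b", "c", "c", "c", "d", "d", "d"])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

logo = LeaveOneGroupOut()
logo_splits = list(logo.split(X, y, groups=groups))

for train, test in logo_splits:
    # The groups seen in training never overlap the held-out group.
    assert set(groups[train]).isdisjoint(groups[test])
    print("%s %s" % (train, test))
```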


### Cross validation of time series data
Time series data is characterised by the correlation between observations that are near in time (autocorrelation). However, classical cross-validation techniques such as KFold and ShuffleSplit assume the samples are independent and identically distributed, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data. Therefore, it is very important to evaluate our model for time series data on the “future” observations least like those that are used to train the model. To achieve this, one solution is provided by TimeSeriesSplit.
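
A minimal sketch of TimeSeriesSplit on six toy observations, assumed to be ordered by time. In every split the training indices come strictly before the test indices, so the model is always evaluated on "future" samples; `splits` is an illustrative name.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six observations, assumed to be ordered by time (toy data).
X = np.arange(6).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    # Each successive training set is a superset of the previous one.
    print("%s %s" % (train, test))
```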

## A note on shuffling

If the data ordering is not arbitrary (e.g. samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if samples correspond to news articles, and are ordered by their time of publication, then shuffling the data will likely lead to a model that is overfit and an inflated validation score: it will be tested on samples that are artificially similar (close in time) to training samples.

Some cross validation iterators, such as KFold, have an inbuilt option to shuffle the data indices before splitting them. Note that:
* This consumes less memory than shuffling the data directly.
* By default no shuffling occurs, including for the (stratified) K fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
* The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.
* To ensure results are repeatable (on the same platform), use a fixed value for random_state.
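
The effect of `shuffle` and `random_state` can be seen on ten toy samples. This is a sketch: `collect_test_folds` is a hypothetical helper written for this note, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

def collect_test_folds(cv):
    """Return the test indices of every split as plain lists."""
    return [test.tolist() for _, test in cv.split(X)]

# Default: no shuffling, so the folds are contiguous blocks.
unshuffled = collect_test_folds(KFold(n_splits=5))

# Shuffled with a fixed seed: two separate iterators give the same folds.
seeded_a = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=0))
seeded_b = collect_test_folds(KFold(n_splits=5, shuffle=True, random_state=0))

print(unshuffled)
print(seeded_a == seeded_b)
```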