# A Gentle Introduction to k-fold Cross-Validation

By Jason Brownlee on August 3, 2020. [Original article](https://machinelearningmastery.com/k-fold-cross-validation/), filed in [Statistics](https://machinelearningmastery.com/category/statistical-methods/).

Cross-validation is a statistical method used to estimate the skill of machine learning models.

It is commonly used in applied machine learning to `compare and select a model` for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.

## Tutorial Overview
This tutorial is divided into 5 parts; they are:

1. k-Fold Cross-Validation
    - General procedure
2. Configuration of k
3. Worked Example
4. Cross-Validation API
5. Variations on Cross-Validation

## 1. k-Fold Cross-Validation
Cross-validation is a __resampling procedure__ used to evaluate machine learning models `on a limited data sample`.

The procedure has a `single parameter called k that refers to the number of groups that a given data sample is to be split into`. As such, the procedure is often called k-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

### The general procedure is as follows:

1. Shuffle the dataset randomly.
2. Split the dataset into k groups. This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. In the first iteration, the __first fold__ is treated `as a validation set`, and the method is __fit on the remaining k − 1 folds__; each of the other folds then takes its turn as the validation set.
3. For each unique group:    
    - 3.1 Take the group as a hold out or test data set
    - 3.2 Take the remaining groups as a training data set
    - 3.3 Fit a model on the training set and evaluate it on the test set
    - 3.4 Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores
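
The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `k_fold_scores`, the toy "model" (which only predicts the training mean), and the choice of mean absolute error as the score are all assumptions made for the example.

```python
import random
from statistics import mean

def k_fold_scores(data, k, seed=1):
    """Return one evaluation score per fold for a toy mean-predictor model."""
    # 1. shuffle a copy of the dataset
    rows = list(data)
    random.Random(seed).shuffle(rows)
    # 2. split into k groups of approximately equal size
    folds = [rows[i::k] for i in range(k)]
    scores = []
    # 3. each group takes one turn as the hold out set
    for i, test in enumerate(folds):
        # 3.2 the remaining groups form the training set
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        # 3.3 "fit" the toy model: it simply memorizes the training mean
        prediction = mean(train)
        # evaluate with mean absolute error on the held out group
        score = mean(abs(x - prediction) for x in test)
        # 3.4 retain the score; the model is discarded on the next iteration
        scores.append(score)
    return scores

scores = k_fold_scores([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], k=3)
# 4. summarize the sample of scores
print('scores: %s, mean: %.3f' % (scores, mean(scores)))
```

Each observation lands in exactly one fold, so it is held out once and used for training k-1 times.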

__Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each observation is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times__.

It is also __important__ that any `preparation of the data prior to fitting the model occur on the CV-assigned training dataset within the loop rather than on the broader data set`. This also applies to any tuning of hyperparameters. A failure to perform these operations within the loop may result in data leakage and an optimistic estimate of the model skill.
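
One way to keep data preparation inside the loop with scikit-learn is to wrap it in a pipeline, so the transform is re-fit on the training folds of each split. The sketch below is an assumption-laden illustration: the synthetic dataset, the choice of `StandardScaler`, and the `LogisticRegression` model are not from the tutorial and stand in for whatever preparation and model you actually use.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, used here only for illustration
X, y = make_classification(n_samples=100, n_features=5, random_state=1)

# the scaler is a pipeline step, so for every split it is fit on the
# training folds only and then applied to the held out fold
model = make_pipeline(StandardScaler(), LogisticRegression())

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=kfold)
print('accuracy per fold: %s' % scores)
```

Fitting the scaler on the full dataset before calling `cross_val_score` would let statistics from the held out folds leak into training, which is the situation the paragraph above warns about.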

The results of a k-fold cross-validation run are often `summarized with the mean of the model skill scores`. It is also good practice to include a measure of the variance of the skill scores, such as the `standard deviation` or `standard error`.
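
As a small illustration of this summary, the snippet below computes the mean, sample standard deviation, and standard error for a set of fold scores. The five accuracy values are hypothetical and were made up for the example.

```python
from math import sqrt
from statistics import mean, stdev

# hypothetical accuracy scores from a 5-fold run (made up for illustration)
scores = [0.81, 0.79, 0.84, 0.80, 0.76]

avg = mean(scores)
sd = stdev(scores)            # sample standard deviation (n - 1 denominator)
se = sd / sqrt(len(scores))   # standard error of the mean
print('mean=%.3f, std=%.3f, stderr=%.3f' % (avg, sd, se))
```

The standard error here treats the fold scores as independent, which is only approximately true because the training sets of different folds overlap.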

## 2. Configuration of k
The k value must be chosen carefully for your data sample.

A __poorly chosen value for k__ may result in a misrepresentative idea of the skill of the model, such as a `score with a high variance` (one that may change a lot based on the data used to fit the model) or a `high bias` (such as an overestimate of the skill of the model).

Three common tactics for choosing a value for k are as follows:

- __Representative__: The value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
- __k=10__: The value for k is fixed to 10, a value that has been found through experimentation to generally result in a model skill estimate with low bias and a modest variance.
- __k=n__: The value for k is fixed to n, where n is the size of the dataset, to give each observation an opportunity to be used as the hold out dataset. This approach is called __leave-one-out cross-validation__.
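
To get a feel for how k changes the size of each train/test group, the sketch below computes the fold sizes for a few values of k. The helper `fold_sizes` is a hypothetical name introduced for this example, and the dataset size of 100 is arbitrary.

```python
def fold_sizes(n, k):
    """Sizes of k folds that are as equal as possible for n observations."""
    if not 2 <= k <= n:
        raise ValueError('k must be between 2 and n')
    base, extra = divmod(n, k)
    # the first `extra` folds receive one additional observation
    return [base + 1 if i < extra else base for i in range(k)]

n = 100
for k in (3, 10, n):
    sizes = fold_sizes(n, k)
    print('k=%d: test size per fold %d to %d, train size at least %d'
          % (k, min(sizes), max(sizes), n - max(sizes)))
```

With k=n every test fold holds a single observation, which is the leave-one-out case described above.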

## 3. Worked Example

Imagine we have a data sample with 6 observations:

$[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]$

The first step is to pick a value for k in order to determine the number of folds used to split the data. Here, we will `use a value of k=3`. That __means we will `shuffle the data` and then split the data into 3 groups__. Because we have 6 observations, each group will have an equal number of 2 observations.
For example:

- $Fold1: [0.5, 0.2]$
- $Fold2: [0.1, 0.3]$
- $Fold3: [0.4, 0.6]$

Three models are trained and evaluated with each fold given a chance to be the held out test set.
For example:

- __Model1__: Trained on Fold1 + Fold2, Tested on Fold3
- __Model2__: Trained on Fold2 + Fold3, Tested on Fold1
- __Model3__: Trained on Fold1 + Fold3, Tested on Fold2

The skill scores are collected for each model and summarized for use.
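
The same bookkeeping can be written out in a few lines of plain Python. This sketch only assembles the train and test sets from the three folds listed above; no model is fit, and the loop visits the folds in order (Fold1 held out first), so the ordering differs from the Model1 to Model3 numbering.

```python
# folds from the worked example above
folds = {
    'Fold1': [0.5, 0.2],
    'Fold2': [0.1, 0.3],
    'Fold3': [0.4, 0.6],
}

splits = []
for test_name, test in folds.items():
    # every fold other than the held out one goes into the training set
    train = [x for name, fold in folds.items() if name != test_name for x in fold]
    splits.append((train, test))
    print('test on %s: train=%s, test=%s' % (test_name, train, test))
```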

## 4. Cross-Validation API

The scikit-learn library provides an implementation that will split a given data sample up.

The KFold() scikit-learn class can be used. It takes as arguments the number of splits, whether or not to shuffle the sample, and the seed for the pseudorandom number generator used prior to the shuffle.

The example below creates an instance that splits a dataset into `3 folds`, `shuffles prior` to the split, and uses a seed of `1` for the pseudorandom number generator.

```python
from sklearn.model_selection import KFold

# 3 folds, shuffle before splitting, fixed seed for reproducibility
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
print(kfold)
```

```
KFold(n_splits=3, random_state=1, shuffle=True)
```

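The `split()` function can then be called with the data sample as an argument. Iterating over the result yields, for each fold in turn, the arrays of indices into the original data sample to use for the train and test sets.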

```python
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (train, test))
```

```
train: [0 3 4 5], test: [1 2]
train: [1 2 3 5], test: [0 4]
train: [0 1 2 4], test: [3 5]
```


Running the example prints the indices of the observations chosen for each train and test set. These indices can be used directly on the original data array to retrieve the observation values, as in the complete example that follows.

```python
# scikit-learn k-fold cross-validation
from numpy import array
from sklearn.model_selection import KFold

# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# prepare cross validation
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
```

```
train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]
```
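
The indices returned by `split()` are usually used to fit and evaluate a model rather than just printed. The sketch below shows one way that might look. The synthetic regression dataset, the `LinearRegression` model, and the mean absolute error metric are assumptions chosen for illustration and do not appear in the tutorial.

```python
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# synthetic data, used here only for illustration
X, y = make_regression(n_samples=60, n_features=3, noise=0.5, random_state=1)

kfold = KFold(n_splits=3, shuffle=True, random_state=1)

scores = []
for train, test in kfold.split(X):
    # fit a fresh model on the training folds
    model = LinearRegression()
    model.fit(X[train], y[train])
    # evaluate on the held out fold and keep only the score
    scores.append(mean_absolute_error(y[test], model.predict(X[test])))

# summarize with the mean and standard deviation of the fold scores
print('MAE: %.3f (%.3f)' % (mean(scores), std(scores)))
```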


## 5. Variations on Cross-Validation
There are a number of variations on the k-fold cross validation procedure.

Five commonly used variations are as follows:

- __Train/Test Split__: Taken to one extreme, `k may be set to 2` (not 1), which divides the data into two groups; using one group for training and the other for testing gives a single train/test split to evaluate the model.
- __LOOCV__: Taken to another extreme, `k may be set to the total number of observations` in the dataset such that each observation is given a chance to be held out of the dataset. This is called __leave-one-out cross-validation__, or LOOCV for short.
- __Stratified__: The splitting of data into folds may be governed by criteria such as `ensuring that each fold has the same proportion of observations` with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
- __Repeated__: This is where the k-fold cross-validation `procedure is repeated n times`, where importantly, the `data sample is shuffled prior to each repetition`, which results in a different split of the sample.
- __Nested__: This is where `k-fold cross-validation is performed within each fold of cross-validation, often to perform hyperparameter` tuning during model evaluation. This is called __nested cross-validation__ or __double cross-validation__.
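
scikit-learn offers a class or function for each of these variations. The sketch below is a rough tour under stated assumptions: the tiny 12-row dataset, the synthetic classification data, the `KNeighborsClassifier` model, and the `n_neighbors` grid are all made up for illustration.

```python
from numpy import arange, array
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, KFold, LeaveOneOut,
                                     RepeatedKFold, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# tiny illustrative dataset: 12 rows, 8 of class 0 and 4 of class 1
X = arange(12).reshape(-1, 1)
y = array([0] * 8 + [1] * 4)

# train/test split: a single partition of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

# LOOCV: one split per observation
loo_splits = list(LeaveOneOut().split(X))

# stratified: every test fold keeps the 2:1 class ratio
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
strat_test_classes = [sorted(y[test].tolist()) for _, test in skf.split(X, y)]

# repeated: 4 folds repeated 3 times with a different shuffle each time
rkf = RepeatedKFold(n_splits=4, n_repeats=3, random_state=1)
n_repeated = rkf.get_n_splits(X)

# nested: the inner loop tunes a hyperparameter, the outer loop evaluates
Xc, yc = make_classification(n_samples=60, n_features=5, random_state=1)
inner = KFold(n_splits=3, shuffle=True, random_state=1)
outer = KFold(n_splits=4, shuffle=True, random_state=1)
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3]}, cv=inner)
nested_scores = cross_val_score(search, Xc, yc, cv=outer)

print('LOOCV splits: %d' % len(loo_splits))
print('stratified test folds: %s' % strat_test_classes)
print('repeated splits: %d' % n_repeated)
print('nested scores: %s' % nested_scores)
```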

## Related Tutorials
- [How to Configure k-Fold Cross-Validation](https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/)
- [LOOCV for Evaluating Machine Learning Algorithms](https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/)
- [Nested Cross-Validation for Machine Learning with Python](https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/)
- [Repeated k-Fold Cross-Validation for Model Evaluation in Python](https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/)
- [How to Fix k-Fold Cross-Validation for Imbalanced Classification](https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/)
- [Train-Test Split for Evaluating Machine Learning Algorithms](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/)