# KFold and Other Cross-Validation Techniques

In Machine Learning, we usually divide the dataset into a training set, a validation set, and a test set:
* **Training set:** used to train the model - it can vary but it's typically 60-70% of the available data.
* **Validation set:** once we select a model that performs well on the training data, we run the model on the validation set. The validation set helps provide an unbiased evaluation of the model's performance. If the error is much higher on the validation set than on the training set, the model is overfitting. This is typically 10-20% of the available data.
* **Test set:** or **holdout set**, this data set contains data that has never been used in the training process and helps evaluate the final model. Typically 5-20% of the available data.
![data split ex](https://cdn-images-1.medium.com/max/1200/1*v1kM3rIPwxTYyTk5eIunFg.png)
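
A three-way split like the one above can be sketched by calling `train_test_split` twice. The toy data and the 60/20/20 proportions are assumptions chosen for illustration, not a recommendation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 observations with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First carve out 20% of the data as the test (holdout) set ...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ... then take 25% of the remaining 80% as validation (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```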

#### What's the problem with just using a train/test split?
Due to **sample variability** between the train and test set, our model can give a better prediction on the training data but fail to generalize on the test set (overfitting). This leads to a low training error rate but a high test error rate.
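
To get a feel for this variability, the sketch below fits the same model on several random train/test splits of one dataset and records the test MSE each time. The synthetic data from `make_regression`, the linear model, and the 30% test size are all assumptions made for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical noisy regression data
X, y = make_regression(n_samples=60, n_features=3, noise=20.0, random_state=0)

test_mse = []
for seed in range(5):
    # Same data and same model; only the random split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    test_mse.append(mean_squared_error(y_test, model.predict(X_test)))

# The test error differs from split to split
print([round(m, 1) for m in test_mse])
```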

When we split the dataset into train/validation/test sets, the model is trained on only a subset of the data. A model trained on fewer observations generally performs worse, so the validation error tends to **overestimate** the test error rate of a model fitted on the entire dataset.

#### Solution: Cross-Validation
Cross-validation is a technique that involves partitioning the data into subsets, training the model on one subset and using the other subset to evaluate the model's performance.

To reduce variability, we perform multiple rounds of cross-validation with different subsets from the same data. We combine the validation results from these multiple rounds to come up with an estimate of the model's predictive performance.

![cross val ex](https://cdn-images-1.medium.com/max/1200/1*sWQi89jsD84iYcWb4yBB3g.png)
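
As a minimal sketch of "multiple rounds, then combine", scikit-learn's `cross_val_score` runs the rounds and returns one score per round. The synthetic data, the linear model, and the choice of 5 rounds are assumptions for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical noisy regression data
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# scikit-learn reports the negated MSE so that "higher is better"
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
mse_per_round = -scores

print(mse_per_round)         # one MSE per round
print(mse_per_round.mean())  # combined estimate of predictive performance
```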

### Implementation

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split

data = np.array([[1, 2], [3, 4]])

train, validation = train_test_split(data, test_size=0.5, random_state=42)

print(f"train: {train}")
print(f"validation: {validation}")

train: [[1 2]]
validation: [[3 4]]


## Common Types of Cross-Validation
* LOOCV - Leave one out cross-validation
* KFold
* Stratified cross-validation
* Time series cross-validation

## Leave one out cross-validation (LOOCV)
In LOOCV, we divide the dataset into two parts:
1. A single observation which is our test data
2. All other observations from the dataset, forming our training data

If we have a dataset with $n$ observations, then the training data contains $n-1$ observations and the test set contains one observation. This process is repeated $n$ times, once for each data point, and generates $n$ Mean Squared Error (MSE) values, which are averaged to give the LOOCV estimate.

![LOOCV ex](https://cdn-images-1.medium.com/max/1200/1*AVVhcmOs7WCBnpNhqi-L6g.png)
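
The sketch below computes a LOOCV error estimate by passing `LeaveOneOut()` as the `cv` argument of `cross_val_score`. The toy data (a noisy line) and the linear model are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical toy data: y is roughly 2*x plus some noise
rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=10)

# One fit per observation; each score is the negated squared error
# on the single held-out point
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)
squared_errors = -neg_mse

print(len(squared_errors))    # 10, one error per observation
print(squared_errors.mean())  # LOOCV estimate of the test MSE
```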

##### Advantages
* Far less bias, as each model is trained on almost the entire dataset ($n-1$ observations), compared to the validation set approach where we only use a subset of the data for training
* No randomness in the train/test data, as performing LOOCV multiple times will yield the same results

##### Disadvantages
* MSE will vary as each test set is a single observation, introducing variability. If the data point is an outlier, then the variability will be much higher.
* Execution is expensive as the model has to be fitted $n$ times.

### Implementation

In [4]:
from sklearn.model_selection import LeaveOneOut

X = np.array([[1,2], [3,4]])
y = np.array([1,2])

loo = LeaveOneOut()
loo.get_n_splits(X)

for train_index, test_index in loo.split(X):
    print(f"train: {train_index}, validation: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

train: [1], validation: [0]
train: [0], validation: [1]


## KFold cross validation

This technique involves **randomly dividing the dataset into $k$ groups/folds** of approximately equal size. The **first fold is kept for testing** and **the model is trained on the remaining $k-1$ folds**.

The process is repeated $k$ times and each time a different fold is used for validation.
![Kfold ex](https://cdn-images-1.medium.com/max/1200/1*M9amI9hGx45i9k5ORS6b8w.png)

As we repeat the process $k$ times, we get $k$ Mean Squared Errors (`MSE_1`,`MSE_2`,...,`MSE_k`), so the KFold CV error is computed by taking the average of the MSE over the $k$ folds:

$$
cv_{k} = \frac{1}{k} \sum^{k}_{i=1} MSE_i
$$
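
The average above can be computed by hand with a short loop. This is a minimal sketch: the synthetic data, the linear model, and $k=5$ are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_mse = []  # MSE_1, ..., MSE_k
for train_index, test_index in kf.split(X):
    model = LinearRegression().fit(X[train_index], y[train_index])
    predictions = model.predict(X[test_index])
    fold_mse.append(mean_squared_error(y[test_index], predictions))

cv_k = sum(fold_mse) / k  # average of the k fold errors
print(fold_mse)
print(cv_k)
```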

*Note - LOOCV is a variant of KFold where $k=n$, the number of observations*
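
The relationship between LOOCV and KFold can be checked directly: with as many folds as observations (and no shuffling), `KFold` produces the same splits as `LeaveOneOut`. The four-row array is a toy example:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
n = len(X)

# Collect (train indices, test indices) pairs from both splitters
kfold_splits = [
    (train.tolist(), test.tolist()) for train, test in KFold(n_splits=n).split(X)
]
loo_splits = [
    (train.tolist(), test.tolist()) for train, test in LeaveOneOut().split(X)
]

print(kfold_splits == loo_splits)  # True
```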

Typical values of $k$ in KFold are 5 or 10. When $k$ is 10, we call it *10-fold cross validation*.

##### Advantages
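* Computation time is reduced compared to LOOCV, as we repeat the process only 10 times (in the case of $k=10$) instead of $n$ times
* Reduces bias compared to a single validation split, since each model is trained on a larger share of the data
* Every data point gets to be tested exactly once and is used in training $k-1$ times.
* Averaging over $k$ folds gives a less variable estimate than a single validation split. Note that a very large $k$ (approaching LOOCV) can increase the variance again, because the training sets overlap almost completely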

##### Disadvantages
* Still computationally expensive, as the training algorithm has to be rerun from scratch $k$ times.

### Implementation

In [13]:
from sklearn.model_selection import KFold

X = np.array([[1,2], [3,4], [5,6], [7,8]])
y = np.array([1,2,3,4])

kf = KFold(n_splits=2, random_state=None, shuffle=False)

print(kf.get_n_splits(X)) # >> 2

for train_index, test_index in kf.split(X):
    print(f"train: {train_index}, validation: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

2
train: [2 3], validation: [0 1]
train: [0 1], validation: [2 3]


## Stratified cross validation
Stratification is a technique where we rearrange the data so that each fold is a good representation of the whole dataset. Each fold keeps approximately the same proportion of each class as the full dataset. This approach **ensures that one class of data is not overrepresented**, especially when the target variable is unbalanced.

For example, in a binary classification problem where we want to predict if a passenger on the Titanic survived or not, we ensure that each fold has roughly the same percentage of passengers who survived and passengers who did not survive as the full dataset.

![Stratified ex](https://cdn-images-1.medium.com/max/1200/1*TuWV2i98KmBxX5qkz_gX9g.png)

**Stratified cross validation helps with reducing both bias and variance**
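
The effect of stratification is easiest to see on an imbalanced target. The sketch below counts the positive examples that land in each test fold with plain `KFold` and with `StratifiedKFold`. The 16/4 class split is a made-up example:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced target: 16 negatives followed by 4 positives
y = np.array([0] * 16 + [1] * 4)
X = np.zeros((len(y), 1))  # the features do not matter for the split

positives_per_fold = {}
splitters = {"KFold": KFold(n_splits=4), "StratifiedKFold": StratifiedKFold(n_splits=4)}
for name, splitter in splitters.items():
    positives_per_fold[name] = [
        int(y[test_index].sum()) for _, test_index in splitter.split(X, y)
    ]
    print(name, positives_per_fold[name])

# KFold [0, 0, 0, 4]
# StratifiedKFold [1, 1, 1, 1]
```

Without shuffling, plain `KFold` puts every positive example in the last fold, while `StratifiedKFold` gives each fold the same 20% share of positives as the full dataset.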

### Implementation

In [17]:
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2, random_state=None, shuffle=False)

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]


## Time series cross validation
Splitting time series data randomly does not work because it breaks the temporal order of the data: the model could be trained on observations that come after the ones it is tested on.

For time series cross validation, we use **forward chaining**, also called **rolling-origin** evaluation. Here, each day (or time period) is a test group and we consider all the previous days (or time periods) as the training set.

Below, `D1`, `D2`, etc are each day's data. Days in blue are used for training and days in yellow are used for testing.

![time series ex](https://cdn-images-1.medium.com/max/1200/1*WMJCAkveTgbdBveMMMZtUg.png)

We start training the model with a minimum number of observations and use the next day's data to test the model, then keep moving forward through the data set. This ensures we respect the time series aspect of our data.
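
As a sketch of the two common variants, `TimeSeriesSplit` grows the training window by default, while its `max_train_size` parameter caps the window so that it slides forward instead. The six "daily" observations and the window size of 2 are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Six hypothetical daily observations, in chronological order
X = np.arange(6).reshape(-1, 1)

splitters = {
    "expanding": TimeSeriesSplit(n_splits=3),                  # training window grows
    "sliding": TimeSeriesSplit(n_splits=3, max_train_size=2),  # keeps the last 2 days only
}

splits = {}
for name, splitter in splitters.items():
    splits[name] = [
        (train.tolist(), test.tolist()) for train, test in splitter.split(X)
    ]
    for train, test in splits[name]:
        print(name, "train:", train, "test:", test)

# expanding train: [0, 1, 2] test: [3]
# expanding train: [0, 1, 2, 3] test: [4]
# expanding train: [0, 1, 2, 3, 4] test: [5]
# sliding train: [1, 2] test: [3]
# sliding train: [2, 3] test: [4]
# sliding train: [3, 4] test: [5]
```

In both variants every training index comes before the test index, so the model is never evaluated on data older than what it was trained on.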

### Implementation

In [18]:
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([1,2,3,4,5,6])

tscv = TimeSeriesSplit(max_train_size=None, n_splits=5)

for train_index, test_index in tscv.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
