# Improve Your Model Performance using Cross Validation 

If you train your machine learning model several times, even with the same configuration, you may notice that its performance goes up and down. Why does that happen? Because a machine learning model only approximates the data, there is always some __uncertainty__ in its results. 

We want to limit the __uncertainty__ in our models, so that _the model can produce consistent results on unseen data_. In other words, if the model uncertainty is too high, the model may produce unreliable results.

## Why do models lose stability?
Let’s understand this using the below snapshot illustrating the fit of various models:

![Model Stability](https://www.analyticsvidhya.com/wp-content/uploads/2015/11/15.png)

Here, we are trying to find the relationship between size and price. The three plots show three different fits:

1. In the first plot, the model fits the data loosely and has a high error on the training data points. This is an example of “Underfitting”.
2. In the second plot, we just found the right relationship between price and size, i.e., low training error and generalization of the relationship. 
3. In the third plot, we found a relationship which has almost zero training error. This is because the relationship is developed by considering each deviation in the data point (including noise), i.e., the model is too sensitive and captures random patterns which are present only in the current dataset. This is an example of “Overfitting”. 
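To make the three cases concrete, here is a minimal sketch on synthetic data. The quadratic ground truth, the noise level, and the polynomial degrees are all assumptions chosen for illustration; they are not taken from the plots above. The sketch fits polynomials of increasing degree and compares the training error with the error on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "size vs. price" data: a quadratic trend plus noise (assumed for illustration)
x = rng.uniform(0, 1, size=40)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Hold out half of the points to estimate the error on unseen data
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def mse(coefs, x_values, y_values):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coefs, x_values) - y_values) ** 2))

train_errors, test_errors = {}, {}
for degree in (1, 2, 8):
    coefs = np.polyfit(x_train, y_train, deg=degree)  # least-squares polynomial fit
    train_errors[degree] = mse(coefs, x_train, y_train)
    test_errors[degree] = mse(coefs, x_test, y_test)
    print(f"degree {degree}: train MSE = {train_errors[degree]:.4f}, "
          f"test MSE = {test_errors[degree]:.4f}")
```

With a least-squares fit, the training error can only go down as the degree grows, so it cannot tell you when the model starts fitting noise. The error on the held-out points is the number to watch.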

A common practice in data science competitions is to iterate over various models to find a better performing model. However, it becomes difficult to distinguish whether this improvement in score is coming because we are capturing the relationship better, or we are just over-fitting the data. To find the right answer for this question, we use validation techniques. This method helps us in achieving more generalized relationships.



## Before Cross Validation

So far, in all but one of the tutorials in this class, we have been using the traditional train-test split method for validation purposes. This method is called the __(_fixed_) hold-out method__. In this method, a fixed portion of the data (e.g. _20%_) is reserved for evaluation purposes. 

Refresh your memory with the code below. 

In [0]:
# import the required packages
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import *
import pandas as pd
import numpy as np


from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import warnings
warnings.filterwarnings('ignore')

In [0]:
# load the breast cancer dataset
my_data = load_breast_cancer()
my_data.keys()

In [0]:
my_df = pd.DataFrame(my_data.data, columns=my_data.feature_names)
my_df.head()

In [0]:
my_df['label'] = my_data.target
my_df.head()

In [0]:
# split the data into train/test
# test takes up 20% of the data
X_train, X_test,y_train, y_test\
    = train_test_split(my_data.data, my_data.target, test_size=0.2, random_state=2020)

In [0]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

As we discussed, we sometimes want to reserve a portion of the data for model optimization purposes - that portion is called the validation set. Together with the training and test sets, this gives the _three-way hold-out method_.

![three way hold out](https://i.stack.imgur.com/pXAfX.png)

We can also do that with `train_test_split()`. Say we want to reserve `20%` of the original data for validation.

In [0]:
X_train, X_val, y_train, y_val\
   = train_test_split(X_train, y_train, test_size=0.25, random_state=2020) # 0.25 x 0.8 = 0.2

In [0]:
X_train.shape, X_val.shape, y_train.shape, y_val.shape
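The `0.25` in the second split comes from dividing the desired validation share by the share of data that is left after the test set is removed. A small sketch of that arithmetic follows; the helper name `second_split_fraction` is hypothetical and is not part of `sklearn`.

```python
def second_split_fraction(val_fraction, test_fraction):
    """Fraction to pass as `test_size` in the second split so that the
    validation set ends up being `val_fraction` of the ORIGINAL data."""
    remaining = 1.0 - test_fraction  # share left after the test set is removed
    return val_fraction / remaining

# 20% test, 20% validation -> second split must take 0.2 / 0.8 of the remainder
print(round(second_split_fraction(val_fraction=0.2, test_fraction=0.2), 4))

# 20% test, 10% validation -> second split must take 0.1 / 0.8 of the remainder
print(round(second_split_fraction(val_fraction=0.1, test_fraction=0.2), 4))
```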

Now we can try to fit the model multiple times to observe the variance in model performances with the __hold out method__.

In [0]:
clf = SVC(C=1.0)
for i in range(10):
  fit = clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print('Accuracy for round', i, ':', round(accuracy_score(y_test, y_pred), 4)) # no variance

We can observe that there is no variance in the results: the training and test sets are identical in every round, and fitting an `SVC` on the same data is deterministic, so `sklearn` arrives at the same model each time.

If we want to see some variance, we need to have different training/test sets (with the same 80:20 split). We can do that via the __Repeated Holdout__ method. See the code below.

In [0]:
for i in range(10):
  # remove `random_state` so we have different training/test sample in every iteration
  X_train, X_test,y_train, y_test\
    = train_test_split(my_data.data, my_data.target, test_size=0.2)
  fit = clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print('Accuracy for round', i, ':', round(accuracy_score(y_test, y_pred), 4)) # some variance

Now you can see the results going up and down from round to round (variance). 
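To put a number on that round-to-round variation, one option is to collect the scores and summarize them. The sketch below is one possible way to do it, not the only one: it reloads the data under its own variable names (`X`, `y`, `model`, `scores` are choices made here) and uses a different seed per round, so the splits differ but the run is reproducible.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = SVC(C=1.0)

scores = []
for seed in range(10):
    # a different seed per round gives a different (but reproducible) 80:20 split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te)))

scores = np.array(scores)
print("mean accuracy:", round(scores.mean(), 4))
print("standard deviation:", round(scores.std(), 4))
```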

So what is wrong with the hold-out and the repeated hold-out methods? Since each estimate comes from training and testing on a single split, you may get a "lucky draw" of your data (in which your model outperforms its actual ability), or even worse, an "unlucky draw" (in which your model underperforms it). We do not want either situation - we want a __fair estimate__ of the model performance.

In other words, you want your model to be exposed to as much data as you can, so the model can learn a comprehensive pattern (not a partial image) from your data. Since we cannot use all the data for training and still have unseen data to test on, we need __Cross Validation__.

## What is Cross Validation?
Cross Validation is a technique which involves reserving a particular sample of a dataset on which you do not train the model. Later, you test your model on this sample before finalizing it.

Here are the steps involved in k-fold cross validation:

1. Split your dataset into K (roughly) equal folds, and reserve 1 fold for evaluation/optimization purposes - note these two are related but different;
2. Train the model on the remaining (K-1) folds and use the reserved fold as the test (validation) set. This helps you gauge your model’s performance on data it has not seen;
3. Repeat until every fold has served as the test (validation) set once, then combine the K scores (e.g. by averaging them).

If your model delivers a positive result on validation data, go ahead with the current model. 
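The splitting logic in the steps above can be sketched by hand with NumPy. This is an illustration only: `k_fold_indices` is a hypothetical helper written for this tutorial, and the splitters that ship with `sklearn` may differ in details.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, using each fold once as the test set."""
    folds = np.array_split(np.arange(n_samples), k)  # K roughly equal folds
    for i in range(k):
        test_idx = folds[i]  # the reserved fold for this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 10 data points, 3 folds: every point is used for testing exactly once
for train_idx, test_idx in k_fold_indices(n_samples=10, k=3):
    print("train:", train_idx, "test:", test_idx)
```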

Even though k-fold cross validation is the most popular type, do not assume that it is the _only_ cross validation method.

## Common Methods for Cross Validation

Cross Validation (CV) is a family of sampling/model evaluation/model optimization methods. In the stats context, it is a sampling method. In the machine learning context, it is widely used for model evaluation and/or model optimization purposes. 

There is a recommendation that every model needs to go through CV once, either for model evaluation or model optimization purposes. 

Here is a list of common CV methods:
- Leave-One-Out Cross Validation (LOOCV) (_most extreme_)
- K-fold Cross Validation (_most popular_)
- Repeated K-fold Cross Validation
- Stratified K-fold Cross Validation (_best for imbalanced data_)
- Cross Validation for Time Series (_fairly popular right now_)

Let's see how to implement them one by one.

### Leave-One-Out Cross Validation (LOOCV)

LOOCV is the most extreme CV method. In every iteration, only __one data point__ is used for testing, and the remainder of the data is used for training.

Pros:
- Model is fit to almost the whole dataset; very little chance of having a "lucky/unlucky" draw;

Cons:
- Training is slow;
- Variance in model performances is high.

Even though `sklearn` has its own `LeaveOneOut` method, we can get the same splits from `KFold()` by setting `n_splits` to the number of data points.

In [0]:
X = my_data.data

kf = KFold(n_splits=len(X)) # split the data 

# look at the first 3 iterations
i = 0
for train_index, test_index in kf.split(X):
  print("Training data contains", len(train_index), "data points")
  print("TEST data contains", len(test_index), "data points")     
  i += 1
  if i >= 3:
    break


We can see in every iteration, the training dataset contains $569 - 1 = 568$ instances, and the test set contains $1$ data point. 
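For comparison, here is a short sketch using the dedicated `LeaveOneOut` class from `sklearn.model_selection`. The variable names (`X_loo`, `loo`, `n_rounds`) are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut

X_loo, _ = load_breast_cancer(return_X_y=True)

loo = LeaveOneOut()
n_rounds = loo.get_n_splits(X_loo)  # one round per data point
print("number of rounds:", n_rounds)

# inspect the first train/test pair
train_index, test_index = next(loo.split(X_loo))
print(len(train_index), "training points,", len(test_index), "test point")
```

The number of rounds equals the number of data points, which is why training is slow on larger datasets.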

Even though we can implement LOOCV in `sklearn`, it needs one model fit per data point, which quickly becomes impractical as the dataset grows. So we will stop here and move on to the next method, K-fold CV.

### K-fold Cross Validation

This is the most popular method in the context of CV.

Pros:
- Most balanced method;
- Can be used for both model evaluation and optimization purposes

Cons:
- If the dataset is too small and K is too large, the model might underfit
- If K is too small, the model may overfit

For __evaluation purposes__, we can simply use the `cross_val_score` function.
- it takes the model, features, target, and K (the `cv` argument) as function parameters
- by default the returned values come from the estimator's default scorer (classification accuracy for our `SVC` model)

The code below performs a 5-fold CV using the `SVC` model above on our data - you can see five different _accuracy_ scores.


In [0]:
y = my_data.target
cross_val_score(clf, X, y, cv=5)

In [0]:
# change k to 3
cross_val_score(clf, X, y, cv=3)

In [0]:
# change k to 10
cross_val_score(clf, X, y, cv=10)

For the final score of the model, we usually report the __average__ across the `k` iterations.

In [0]:
print('final model accuracy:', cross_val_score(clf, X, y, cv=10).mean())

You can also check how much the results vary across folds by looking at the _standard deviation_ of the scores.

In [0]:
print('model accuracy standard deviation:', np.std(cross_val_score(clf, X, y, cv=10)))

You can also specify a different metric through the `scoring` parameter. For instance, we may want to focus on the _f1-score_ or the _ROC/AUC_ metric. All supported scoring metrics are listed [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).

We can do that using the code below.

In [0]:
cross_val_score(clf, X, y, cv=10, scoring='f1')

In [0]:
cross_val_score(clf, X, y, cv=10, scoring='roc_auc')
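If you want several metrics from the same folds without rerunning the CV once per metric, `sklearn.model_selection.cross_validate` accepts a list of scorer names. The sketch below is a self-contained example; the variable names (`X_cv`, `y_cv`, `results`) and the choice of 5 folds are assumptions made here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X_cv, y_cv = load_breast_cancer(return_X_y=True)

# one CV run, three metrics; results is a dict of arrays (one value per fold)
results = cross_validate(SVC(C=1.0), X_cv, y_cv, cv=5,
                         scoring=["accuracy", "f1", "roc_auc"])

for name in ("test_accuracy", "test_f1", "test_roc_auc"):
    print(name, results[name].round(4))
```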

The `cross_val_score` method is a shortcut for model evaluation purposes. The regular method is `KFold` - if we want to use the K-fold CV for model optimization purposes, we can use `KFold`, or specific methods for hyperparameter tuning that we will see next week.
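To see what the shortcut does for you, here is a minimal sketch that runs the K-fold loop by hand with `KFold` and compares the result with `cross_val_score` on the same splitter. The variable names and the shuffled 5-fold setup are choices made for this example. A fresh copy of the model is created for every fold with `sklearn.base.clone`, so no fold sees a model fitted on another fold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X_kf, y_kf = load_breast_cancer(return_X_y=True)
base_model = SVC(C=1.0)
kf_eval = KFold(n_splits=5, shuffle=True, random_state=2020)

manual_scores = []
for train_index, test_index in kf_eval.split(X_kf):
    fold_model = clone(base_model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_kf[train_index], y_kf[train_index])
    y_pred = fold_model.predict(X_kf[test_index])
    manual_scores.append(accuracy_score(y_kf[test_index], y_pred))

# passing the same splitter object makes cross_val_score use the same folds
shortcut_scores = cross_val_score(base_model, X_kf, y_kf, cv=kf_eval)

print("manual:  ", np.round(manual_scores, 4))
print("shortcut:", np.round(shortcut_scores, 4))
```

The manual loop is longer, but it gives you access to each fold's fitted model and predictions, which is what you need once you go beyond plain evaluation.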

### Repeated K-fold Cross Validation

Repeated K-fold CV is basically conducting the K-Fold cross validation `i` times. See the comparison below.

In [0]:
# create some synthetic data for illustration
X_data = np.random.randint(5, size=(9, 2))
X_data

Regular K-fold CV:

In [0]:
kf = KFold(n_splits=3)  # no shuffling here; random_state only applies when shuffle=True
for train_index, test_index in kf.split(X_data):
      print("Train:")
      print(X_data[train_index])
      print("Test:")
      print(X_data[test_index])
      print('\n')

Repeated K-fold CV:
Note that the folds differ between repeats: each repeat reshuffles the data before splitting, so it is not a simple copy of the previous round.

In [0]:
rkf = RepeatedKFold(n_splits=3, n_repeats=5, random_state=2020)
for train_index, test_index in rkf.split(X_data):
      print("Train:")
      print(X_data[train_index])
      print("Test:")
      print(X_data[test_index])
      print('\n')
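A quick way to check the structure of the repeated splits is to count them and group the test folds by repeat. This is a small sketch on nine index values; `toy`, `rkf_demo`, and `test_folds` are names introduced here.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

toy = np.arange(9)  # nine data points, identified by their index
rkf_demo = RepeatedKFold(n_splits=3, n_repeats=5, random_state=2020)

# 3 folds per repeat x 5 repeats
print("total train/test pairs:", rkf_demo.get_n_splits(toy))

# collect the test fold of every pair, then group them by repeat
test_folds = [sorted(test_idx.tolist()) for _, test_idx in rkf_demo.split(toy)]
for r in range(5):
    print("repeat", r, test_folds[3 * r: 3 * r + 3])
```

Within each repeat, the three test folds together cover all nine points exactly once; across repeats, the grouping of points into folds changes.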

### Stratified K-fold Cross Validation

Stratified K-fold CV is particularly useful when the data is imbalanced, because each fold keeps roughly the same class proportions as the full dataset. See the code below for the use of the `StratifiedKFold` method.

In [0]:
skf = StratifiedKFold(n_splits=5, random_state=None)
# X is the feature set and y is the target
for train_index, val_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", val_index)
    # select the rows of each fold by index
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]


As said above, for model evaluation purposes, we can simply use the `cross_val_score` function. In the `cross_val_score` function, if the `cv` value is an _integer_, the model (e.g. `clf`) is a classifier, and `y` is _categorical_ (e.g. _binary_ in this case), `StratifiedKFold` is used. In all other cases, `KFold` is used. In short, `cross_val_score` by default applies stratified K-fold CV for classification problems.
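The effect of stratification is easiest to see on a deliberately imbalanced toy label vector. The data below is hypothetical (90 negatives followed by 10 positives) and only serves to count how many positives land in each test fold under the two splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((100, 1))  # features are irrelevant for the split itself

# number of positive samples in each test fold
plain = [int(y_toy[test].sum()) for _, test in KFold(n_splits=5).split(X_toy)]
strat = [int(y_toy[test].sum())
         for _, test in StratifiedKFold(n_splits=5).split(X_toy, y_toy)]

print("positives per test fold, KFold:          ", plain)
print("positives per test fold, StratifiedKFold:", strat)
```

With plain `KFold` and no shuffling, all positives end up in the last fold, so four of the five test folds contain no positive sample at all. `StratifiedKFold` spreads them evenly.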

### Cross Validation for Time Series

Time series data is special because the time order is part of the data. Splitting a time-series dataset randomly does not work: the model would be trained on observations that come after the ones it is tested on. For a time series forecasting problem, we therefore perform cross validation in a forward chaining manner.

```
fold 1: training [1], test [2]
fold 2: training [1 2], test [3]
fold 3: training [1 2 3], test [4]
fold 4: training [1 2 3 4], test [5]
fold 5: training [1 2 3 4 5], test [6]
.
.
.
fold n: training [1 2 3 ….. n-1], test [n]
```

In [0]:
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X_data):
      print("Train:")
      print(X_data[train_index])
      print("Test:")
      print(X_data[test_index])
      print('\n')
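To confirm the forward chaining behavior, the sketch below runs `TimeSeriesSplit` on index values instead of real data, so the output shows directly which time steps go where. The 12-step series is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(12)  # stand-in for 12 time-ordered observations
splits = list(TimeSeriesSplit(n_splits=3).split(series))

for fold, (train_index, test_index) in enumerate(splits, start=1):
    print(f"fold {fold}: train {train_index.tolist()}, test {test_index.tolist()}")
```

In every fold the training indices all come before the test indices, and the training window grows from one fold to the next.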

## Conclusion

In this tutorial, we discussed the importance of cross validation in machine learning models. In particular, we focused on the __model evaluation__ use of CV, so you should be comfortable using the `cross_val_score` function in your model evaluation phase.

We also surveyed the most popular CV methods - `sklearn` supports most of them natively. In other projects, you may want to use other CV methods.

In next week's tutorial, we will use CV for another important purpose: model optimization. Till then, try CV on your own to evaluate your models.