# **Cross Validation (CV)**

When training a model, the first thing we do is:
>divide the data into training and test sets

The model is trained on the training set. <br>
Then, to measure the accuracy of its predictions, the model is evaluated on data it has never seen before: the test set. 

Once we have a model that is close to our performance criteria, we can do hyperparameter tuning to select the best hyperparameters for the model. 

The problem with this approach is:
>We may end up fitting the hyperparameters to the test set. <br>
**This leads to a risk of overfitting on the test set**, because the hyperparameters can be tweaked until the model performs optimally on that particular set, so the test score no longer reflects performance on unseen data. <br>

A common method to solve this problem is to hold out a further part of the data as a validation set. We then train on the training set, tune the hyperparameters against the validation set, and when ready, do a final evaluation on the test set.
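
A three-way split can be sketched with two calls to `train_test_split`. This is a minimal illustration, not a prescribed recipe: the toy data, the 60/20/20 proportions, and the `X_rest`/`X_val` names are assumptions. It follows the common convention in which the validation set is used for tuning and the test set is kept for the final evaluation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples with 2 features each, labels 0..99.
X = np.arange(200).reshape((100, 2))
y = np.arange(100)

# First hold out 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Then split the remaining 80% into training (75%) and validation (25%),
# which gives 60% / 20% / 20% of the original data overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Only 60% of the data is left for training here, which illustrates the cost of holding out two separate sets.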

The downsides of this method are: 
1. It reduces the amount of data available for training the model. 
2. The measured performance varies depending on how the data happens to be split at random. 

These problems can be addressed by using **cross-validation (CV)**.

With cross-validation, a separate validation set is no longer necessary. We need only the training set and the test set; the test set is still held out for the final evaluation. 
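
As a hedged sketch of this workflow, the example below cross-validates a model on the training set with `cross_val_score` and touches the test set only once at the end. The synthetic dataset, the choice of `LogisticRegression`, and `cv=5` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic classification data (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set; it is not used during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# 5-fold cross-validation on the training set replaces a fixed validation set.
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy per fold:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())

# Final evaluation: refit on the full training set, score once on the test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```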

There are several cross-validation methods: <br>
- Holdout
- K-fold CV
- Repeated random sub-sampling (shuffle split)

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# **Holdout**
This is the basic method for testing models. <br>
1. Split the dataset into training and test sets.<br>
2. Train the model on the training set.<br>
3. Test the model on the test set. 
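
The three steps above can be sketched end to end with an actual model. The iris dataset and the `KNeighborsClassifier` are assumptions chosen only to keep the example short; any estimator with `fit` and `score` would follow the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset (assumption): 150 iris flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# 1. Split the dataset into training and test sets (80% / 20%).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=4)

# 2. Train the model on the training set only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 3. Test the model on data it has never seen.
accuracy = model.score(X_test, y_test)
print("Holdout accuracy:", accuracy)
```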

**Create synthetic data**<br>
The data has 10 samples, each consisting of a two-feature data point and its label.<br>

In [None]:
X, y = np.arange(20).reshape((10, 2)), range(10)
X

In [None]:
list(y)

**Train-test split**

Split the dataset into training and test sets. Use 80% of the data for training and 20% for testing.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=4)

In [None]:
X_train

In [None]:
y_train

In [None]:
X_test

In [None]:
y_test

Since this is the method we normally use to test the model, we have a lot of examples of the Holdout method. <br>
We can stop here. 

# **K-Fold Cross Validation**

Create data

In [None]:
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 5], [0, 2], [3, 4], [5, 4], [2, 1]])
y = np.array([1, 2, 3, 4, 5, 6])

Set up the K-fold splitter

In [None]:
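# Split the data into 2 consecutive folds (KFold does not shuffle by default)
kf = KFold(n_splits=2)
print(kf.get_n_splits(X))  # number of train/test splits that will be generated
print(kf)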

In [None]:
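# Each iteration yields the row indices of one train/test split
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    # Use the index arrays to select the matching rows and labels
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]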
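
The loop above only produces the splits. A minimal sketch of what typically happens inside it is shown below: a model is fitted on each training fold and scored on the matching test fold, and the scores are then averaged. The synthetic regression data, the `LinearRegression` model, and the use of 5 shuffled folds are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (assumption, for illustration only).
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# 5 folds, shuffled once before splitting so the folds are not order dependent.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_index, test_index in kf.split(X):
    # Fit a fresh model on the training part of this fold.
    model = LinearRegression()
    model.fit(X[train_index], y[train_index])
    # Score (R^2) on the held-out part of this fold.
    fold_scores.append(model.score(X[test_index], y[test_index]))

fold_scores = np.array(fold_scores)
print("R^2 per fold:", fold_scores)
print("Mean R^2:", fold_scores.mean())
```

Every sample is used for testing exactly once, so the averaged score depends less on one particular split than a single holdout estimate does.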

# **Random Shuffle Split**

Create data

In [None]:
from sklearn.model_selection import ShuffleSplit
X = np.arange(10)
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
X

Shuffle split

In [None]:
# Each of the 5 iterations draws a new random train/test split of the indices
for train_index, test_index in ss.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
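
A `ShuffleSplit` object can also be passed as the `cv` argument of `cross_val_score`, which fits and scores the model once per random split. The sketch below assumes the iris dataset and a `LogisticRegression` model purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Illustrative dataset (assumption): 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# 5 independent random splits, each holding out 25% of the data for testing.
ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=ss)
print("Accuracy per split:", scores)
print("Mean accuracy:", scores.mean())
```

Unlike K-fold, the test sets of different splits are drawn independently, so a sample may appear in several test sets or in none.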

https://scikit-learn.org/stable/modules/cross_validation.html
