In [24]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

In [47]:
data = datasets.load_iris()

## The Holdout Method
In this approach, we reserve 50% of the dataset for validation and use the remaining 50% for model training. The major disadvantage is that, because the model is trained on only half of the data, it may miss useful patterns in the data, which leads to a higher bias.
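
As a minimal sketch of how a hold-out split is used to score a model, the cell below splits both the features and the labels of the iris data, fits a classifier on the training half, and measures accuracy on the validation half. The choice of `LogisticRegression` and the names `X_iris`, `y_iris`, and `holdout_accuracy` are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target

# Reserve 50% of the samples for validation, the rest for training
X_train, X_val, y_train, y_val = train_test_split(
    X_iris, y_iris, test_size=0.50, random_state=5
)

# Illustrative model choice; any scikit-learn estimator would work here
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the held-out half is the hold-out estimate of performance
holdout_accuracy = model.score(X_val, y_val)
print("train size:", len(X_train), "validation size:", len(X_val))
print("hold-out accuracy:", holdout_accuracy)
```

Changing `random_state` changes which samples land in each half, so the estimate can move noticeably from one split to the next.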

In [46]:
# The Hold-Out Method
train, validation = train_test_split(data.data, test_size=0.50, random_state=5)

array([[5.7, 3. , 4.2, 1.2],
       [6.9, 3.1, 5.4, 2.1],
       [6.8, 2.8, 4.8, 1.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.8, 2.7, 5.1, 1.9],
       [4.3, 3. , 1.1, 0.1],
       [4.8, 3.4, 1.9, 0.2],
       [5.2, 2.7, 3.9, 1.4],
       [4.8, 3. , 1.4, 0.3],
       [4.9, 3.1, 1.5, 0.2],
       [7.9, 3.8, 6.4, 2. ],
       [5. , 2.3, 3.3, 1. ],
       [4.6, 3.2, 1.4, 0.2],
       [6.5, 3. , 5.5, 1.8],
       [4.9, 3.1, 1.5, 0.1],
       [6. , 2.2, 5. , 1.5],
       [5.5, 2.6, 4.4, 1.2],
       [5.8, 4. , 1.2, 0.2],
       [5.4, 3.9, 1.3, 0.4],
       [6.4, 2.7, 5.3, 1.9],
       [6. , 3.4, 4.5, 1.6],
       [5.5, 2.5, 4. , 1.3],
       [5.5, 4.2, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [6.9, 3.2, 5.7, 2.3],
       [6. , 2.9, 4.5, 1.5],
       [6.1, 3. , 4.6, 1.4],
       [4.6, 3.1, 1.5, 0.2],
       [5.7, 2.8, 4.5, 1.3],
       [6. , 3. , 4.8, 1.8],
       [5.8, 2.7, 4.1, 1. ],
       [4.8, 3.4, 1.6, 0.2],
       [6. , 2.2, 4. , 1. ],
       [6.4, 3.1, 5.5, 1.8],
       [6.7, 2

## Leave one out cross validation (LOOCV)

In this approach, we reserve only one data point from the available dataset and train the model on the rest of the data. This process is repeated for each data point. The approach has its own advantages and disadvantages:

> We make use of almost all data points for training in every iteration, hence the bias will be low.

> We repeat the cross validation process $n$ times (where $n$ is the number of data points), which results in a higher execution time.

> Each iteration tests the model against a single data point, so the individual estimates vary widely. If that data point turns out to be an outlier, the estimate for that iteration is heavily distorted.

In [70]:
# The Leave-One-Out Method
from sklearn.model_selection import LeaveOneOut
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
loo = LeaveOneOut()
loo.get_n_splits(X)

for train_index, val_index in loo.split(X):
    print("train:", train_index, "validation:", val_index)
    X_train, X_test = X[train_index], X[val_index]
    y_train, y_test = y[train_index], y[val_index]

train: [1] validation: [0]
train: [0] validation: [1]
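
To make the cost and the single-point evaluation concrete, here is a hedged sketch that runs LOOCV on the iris data with `cross_val_score`. The classifier (`LogisticRegression`) and the variable names are assumptions chosen for illustration, not part of the original example.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target

# One fit per data point: 150 samples means 150 separate models
loo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_iris, y_iris, cv=LeaveOneOut()
)

print("number of fits:", len(loo_scores))
# Each split is scored on a single sample, so a score is either 0.0 or 1.0
print("distinct per-split scores:", np.unique(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

Only the average over all $n$ splits is meaningful; any single split says whether one point was classified correctly and nothing more.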


- LOOCV leaves one data point out. Similarly, you could leave $p$ data points out, giving a validation set of size $p$ in each iteration. This is called LPOCV (Leave P Out Cross Validation).
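
scikit-learn provides `LeavePOut` for this. The sketch below uses a four-sample toy array (`X_toy`, a name assumed for illustration) and $p = 2$, so every pair of samples serves as the validation set once.

```python
import numpy as np
from sklearn.model_selection import LeavePOut

X_toy = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

lpo = LeavePOut(p=2)
# The number of splits is "n choose p", which grows very quickly with n
print("number of splits:", lpo.get_n_splits(X_toy))

lpo_splits = list(lpo.split(X_toy))
for train_index, val_index in lpo_splits:
    print("train:", train_index, "validation:", val_index)
```

Because the number of splits is combinatorial in the dataset size, LPOCV is usually practical only for small datasets.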

## k-fold cross validation

From the above two validation methods, we’ve learnt:

1) We should train the model on a large portion of the dataset. Otherwise we’ll fail to read and recognise the underlying trend in the data. This will eventually result in a higher bias.

2) We also need a good ratio of testing data points. As we have seen above, too few testing data points can lead to a high variance in the estimate of the model's effectiveness.
    
3) We should iterate on the training and testing process multiple times, changing the train and test dataset distribution each time. This helps in validating the model effectiveness properly.


Do we have a method which takes care of all these 3 requirements?

Yes! That method is known as “k-fold cross validation”. It’s easy to follow and implement. Below are the steps for it:

   1) Randomly split your entire dataset into $k$ “folds”
    
   2) For each fold in your dataset, build your model on the other $k-1$ folds. Then, test the model on the held-out fold to check its effectiveness
    
   3) Record the error you see on each of the predictions
    
   4) Repeat this until each of the $k$-folds has served as the test set
    
   5) The average of your $k$ recorded errors is called the cross-validation error and will serve as your performance metric for the model
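
The steps above can be sketched as an explicit loop. This is a minimal illustration on the iris data, assuming a `LogisticRegression` model and misclassification rate as the error; the names `fold_errors` and `cv_error` are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target

k = 5
# Step 1: randomly split the dataset into k folds
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_errors = []
# Step 4: the loop visits every fold once as the held-out fold
for train_index, val_index in kf.split(X_iris):
    # Step 2: build the model on k-1 folds, test it on the remaining fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X_iris[train_index], y_iris[train_index])
    accuracy = model.score(X_iris[val_index], y_iris[val_index])
    # Step 3: record the error for this fold
    fold_errors.append(1.0 - accuracy)

# Step 5: the average of the k errors is the cross-validation error
cv_error = np.mean(fold_errors)
print("per-fold errors:", fold_errors)
print("cross-validation error:", cv_error)
```

In practice `cross_val_score` wraps this loop in a single call; the explicit version is shown only to mirror the numbered steps.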


Now, one of the most commonly asked questions is, “How do we choose the right value of $k$?”.

A lower value of $k$ gives a more biased estimate, because each model is trained on a smaller share of the data, and is hence undesirable. On the other hand, a higher value of $k$ is less biased, but can suffer from large variability and takes longer to run. A smaller value of $k$ takes us towards the validation set approach, whereas a higher value of $k$ leads towards the LOOCV approach.

More precisely, LOOCV is equivalent to $n$-fold cross validation, where $n$ is the number of training examples.
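
This equivalence can be checked directly: with $k = n$ and no shuffling, `KFold` produces the same splits as `LeaveOneOut`. The toy array `X_toy` below is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(12).reshape(6, 2)  # 6 samples, 2 features
n = len(X_toy)

# Collect (train indices, validation indices) for both splitters
kfold_splits = [
    (train.tolist(), val.tolist()) for train, val in KFold(n_splits=n).split(X_toy)
]
loo_splits = [
    (train.tolist(), val.tolist()) for train, val in LeaveOneOut().split(X_toy)
]

print("same splits:", kfold_splits == loo_splits)
```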

In [71]:
# K-fold Cross Validation
from sklearn.model_selection import KFold 
kf = KFold(n_splits=2, random_state=None) 

for train_index, val_index in kf.split(X):
    print("Train:", train_index, "Validation:", val_index)
    X_train, X_test = X[train_index], X[val_index]
    y_train, y_test = y[train_index], y[val_index]

Train: [1] Validation: [0]
Train: [0] Validation: [1]


## Stratified k-fold cross validation

Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole. For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold, each class comprises about half the instances.

It is generally a better approach when dealing with both bias and variance. A randomly selected fold might not adequately represent the minority class, particularly in cases where there is a huge class imbalance.
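
The effect is easiest to see on imbalanced labels. The sketch below builds a synthetic label vector with 90 majority and 10 minority samples (an assumption for illustration, as are the names `y_imb`, `X_imb`, and `minority_counts`) and counts the minority samples that land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 90% class 0, 10% class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((len(y_imb), 1))  # feature values do not affect the split

def minority_counts(splitter):
    """Number of class-1 samples in each validation fold."""
    return [
        int(y_imb[val_index].sum())
        for _, val_index in splitter.split(X_imb, y_imb)
    ]

plain_counts = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
strat_counts = minority_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("plain k-fold:     ", plain_counts)
print("stratified k-fold:", strat_counts)
```

With stratification every validation fold receives the same share of the minority class, whereas the plain random folds carry no such guarantee.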

In [69]:
# Stratified K-fold Cross Validation
from sklearn.model_selection import StratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2, random_state=None)
#skf.get_n_splits(X, y)

for train_index, val_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", val_index)
    X_train, X_test = X[train_index], X[val_index]
    y_train, y_test = y[train_index], y[val_index]

Train: [1 3] Validation: [0 2]
Train: [0 2] Validation: [1 3]


Having said that, if the training set does not adequately represent the entire population, then using a stratified $k$-fold might not be the best idea. In such cases, one should use a simple $k$-fold cross validation with repetition.

In repeated cross-validation, the cross-validation procedure is repeated $n$ times (here $n$ is the number of repetitions, not the number of data points), yielding $n$ random partitions of the original sample. The $n$ results are again averaged (or otherwise combined) to produce a single estimation.
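
As a hedged sketch of the averaging step, the cell below scores a classifier on the iris data with `RepeatedKFold` passed to `cross_val_score`. The model, the fold and repeat counts, and the variable names are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target

# 5 folds repeated 3 times, each repeat using a different random partition
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
repeated_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_iris, y_iris, cv=rkf
)

# One score per fold per repeat, combined into a single estimate
print("number of scores:", len(repeated_scores))
print("mean accuracy:", repeated_scores.mean())
print("standard deviation:", repeated_scores.std())
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the estimate depends on the particular partition.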

In [58]:
#Repeated K-Fold Cross Validation
from sklearn.model_selection import RepeatedKFold
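rkf = RepeatedKFold(n_splits=2, n_repeats=10, random_state=None)
# X is the feature set and y is the target. The two-sample toy data from the
# leave-one-out example is redefined here so the cell matches the output below.
X = np.array([[1, 2], [3, 4]])
y = np.array([1, 2])
for train_index, val_index in rkf.split(X):
    print("Train:", train_index, "Validation:", val_index)
    X_train, X_test = X[train_index], X[val_index]
    y_train, y_test = y[train_index], y[val_index]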

Train: [1] Validation: [0]
Train: [0] Validation: [1]
Train: [1] Validation: [0]
Train: [0] Validation: [1]
Train: [1] Validation: [0]
Train: [0] Validation: [1]
Train: [1] Validation: [0]
Train: [0] Validation: [1]
Train: [0] Validation: [1]
Train: [1] Validation: [0]
Train: [0] Validation: [1]
Train: [1] Validation: [0]
Train: [1] Validation: [0]
Train: [0] Validation: [1]
Train: [0] Validation: [1]
Train: [1] Validation: [0]
Train: [1] Validation: [0]
Train: [0] Validation: [1]
Train: [1] Validation: [0]
Train: [0] Validation: [1]


## References 

(https://www.analyticsvidhya.com/blog/2018/05/improve-model-performance-cross-validation-in-python-r/)