## Hold Out Cross Validation
Here we split the data into two sets: a training set and a test set. The model is fitted on the training set and evaluated on the test set. The split ratio (commonly 70:30 or 80:20, sometimes 60:40) depends on the use case and on how much data is available.

In [1]:
from sklearn.model_selection import train_test_split

In [3]:
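X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Hold out 30% of the samples for testing; random_state makes the shuffle reproducible
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
print("Train:", X_train, "Test:", X_test)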

Train: [100, 20, 70, 80, 40, 10, 60] Test: [30, 90, 50]
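The split above only partitions a list. As a minimal sketch of how a hold-out split is typically used to score a model, the example below assumes the Iris dataset and a decision tree as stand-ins; neither choice comes from the original example, and the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris is used only as a convenient stand-in dataset.
X_iris, y_iris = load_iris(return_X_y=True)

# Hold out 30% of the rows; stratify keeps class proportions similar in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

# Fit on the training set only.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_tr, y_tr)

# Accuracy on rows the model never saw during training.
holdout_accuracy = model.score(X_te, y_te)
print("Train size:", len(X_tr), "Test size:", len(X_te))
print("Hold-out accuracy:", round(holdout_accuracy, 3))
```

Because the score comes from a single split, it can vary noticeably with `random_state`, which is one motivation for the repeated schemes in the following sections.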


## Leave One Out Cross Validation
This is a simple technique in which the training data includes every observation except one, and that single held-out observation is used for testing.

For n samples, we have n different training sets. 

Each model is trained on almost all of the data, but fitting n models on n different training sets makes this approach computationally very expensive for large datasets.

In [4]:
from sklearn.model_selection import LeaveOneOut
X = [10,20,30,40,50,60,70,80,90,100]
loo = LeaveOneOut()

# Each iteration yields index arrays: all samples but one for training, one for testing
for train, test in loo.split(X):
    print("%s %s" % (train, test))

[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
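To make the cost concrete, here is a hedged sketch that scores a model with leave-one-out via `cross_val_score`. The Iris dataset and the decision tree are assumptions chosen for illustration; the point is that the number of fits equals the number of samples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris (150 samples) stands in for real data.
X_iris, y_iris = load_iris(return_X_y=True)
loo_cv = LeaveOneOut()

# One fit per sample. With a single test row, each accuracy score is 0.0 or 1.0.
loo_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_iris, y_iris, cv=loo_cv
)

print("Number of fits:", len(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

The mean of these per-sample scores is the leave-one-out estimate of accuracy.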


## K-Fold Cross-Validation

KFold divides the samples into k groups (folds) of approximately equal size. In each iteration, k-1 folds are used for training and the remaining one is used for testing.
This process is repeated k times, so that each fold serves as the test set exactly once.
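"Approximately equal" matters when the number of samples is not divisible by k. The short sketch below uses a hypothetical array of 10 samples and 3 folds to show how scikit-learn distributes the remainder: the first folds receive one extra sample each.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10)  # 10 samples, not divisible by 3
kf_demo = KFold(n_splits=3)

# Size of the test fold in each of the 3 iterations.
fold_sizes = [len(test_idx) for _, test_idx in kf_demo.split(X_demo)]
print(fold_sizes)  # [4, 3, 3]
```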

##### KFold(n_splits=5, *, shuffle=False, random_state=None)

* n_splits: int, default=5. The number of folds.
* shuffle: bool, default=False. Whether to shuffle the data before splitting it into batches. Samples within each split are not shuffled.
* random_state: int, default=None. Controls the randomness of each fold. It affects the ordering of the indices only when shuffle=True; otherwise it has no effect.

In [5]:
import numpy as np
from sklearn.model_selection import KFold
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

In [6]:
X = ["a",'b','c','d','e','f']
kf = KFold(n_splits=3,shuffle=False,random_state=None)

In [7]:
print(kf)

KFold(n_splits=3, random_state=None, shuffle=False)


In [4]:
# Print the train and test indices produced for each fold
for train, test in kf.split(X):
    print("Train:", train, "Test:", test)

Train: [2 3 4 5] Test: [0 1]
Train: [0 1 4 5] Test: [2 3]
Train: [0 1 2 3] Test: [4 5]
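The folds above are contiguous because `shuffle=False`. As a small sketch of the `shuffle` and `random_state` parameters described earlier, the example below compares an unshuffled split with two shuffled splits that share a seed. The helper `collect_test_folds` and the seed value 42 are illustrative choices, not part of the original notebook.

```python
from sklearn.model_selection import KFold

letters = ["a", "b", "c", "d", "e", "f"]

def collect_test_folds(kfold):
    """Return the test indices of every fold as plain lists."""
    return [test_idx.tolist() for _, test_idx in kfold.split(letters)]

unshuffled = collect_test_folds(KFold(n_splits=3))
shuffled_a = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=42))
shuffled_b = collect_test_folds(KFold(n_splits=3, shuffle=True, random_state=42))

print("No shuffle:   ", unshuffled)
print("Shuffle, seed:", shuffled_a)
print("Same seed reproduces the split:", shuffled_a == shuffled_b)
```

Shuffling is generally worth enabling when the rows are ordered, for example sorted by class or by collection date, since contiguous folds would then not be representative.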


## Stratified K-Fold 

This technique is a variation of K-Fold that returns stratified folds.
Each fold preserves, as far as possible, the percentage of samples of each class present in the full dataset.
* It generates test sets such that all sets contain the same distribution of classes, or as close to it as possible

##### sklearn.model_selection.StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)

In [9]:
from sklearn.model_selection import StratifiedKFold

In [10]:
X = np.array([[1,2],[3,4],[5,6],[7,8],[9,10],[11,12]])
y= np.array([0,0,1,0,1,1])
skf = StratifiedKFold(n_splits=3,random_state=None,shuffle=False)

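# split() needs y so that it can balance the classes across folds
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    # Use the index arrays to select the rows for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]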


Train: [1 3 4 5] Test: [0 2]
Train: [0 2 3 5] Test: [1 4]
Train: [0 1 2 4] Test: [3 5]
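To see why stratification matters, the sketch below compares plain `KFold` with `StratifiedKFold` on a hypothetical imbalanced, class-sorted label array (8 samples of class 0 followed by 4 of class 1). The data and the helper `positives_per_test_fold` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels, sorted by class.
y_demo = np.array([0] * 8 + [1] * 4)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not influence the split

def positives_per_test_fold(splitter):
    """Count class-1 samples in each test fold."""
    return [int(y_demo[test_idx].sum()) for _, test_idx in splitter.split(X_demo, y_demo)]

kfold_counts = positives_per_test_fold(KFold(n_splits=4))
stratified_counts = positives_per_test_fold(StratifiedKFold(n_splits=4))

print("KFold:          ", kfold_counts)       # [0, 0, 1, 3]
print("StratifiedKFold:", stratified_counts)  # [1, 1, 1, 1]
```

With plain `KFold`, two test folds contain no class-1 samples at all, while `StratifiedKFold` keeps the 2:1 class ratio in every fold.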
