# Types of Cross Validation

* **Leave one out cross validation (LOOCV)**: LOOCV is the extreme case of k-fold cross-validation in which k equals the number of examples, which makes it the most computationally expensive variant. One model is fitted and evaluated for each example in the dataset.
* **K-fold cross validation**:
    * The general procedure is as follows:
        * Shuffle the dataset randomly.
        * Split the dataset into k groups.
        * For each unique group:
            * Take the group as a hold-out (test) data set.
            * Take the remaining groups as a training data set.
            * Fit a model on the training set and evaluate it on the test set.
            * Retain the evaluation score and discard the model.
        * Summarize the skill of the model using the sample of model evaluation scores.
* **Stratified cross validation**: When the classes are imbalanced in particular, use stratified k-fold cross-validation (for example, stratified 10-fold), which ensures that the proportion of positive to negative examples found in the original distribution is preserved in every fold. See https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html
* **Time-series cross validation**
* **Spatial cross validation**: https://towardsdatascience.com/spatial-cross-validation-using-scikit-learn-74cb8ffe0ab9
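
The k-fold procedure listed above can be sketched by hand, without scikit-learn, to make each step visible. This is only an illustrative sketch: `k_fold_indices`, `majority_class` and `k_fold_scores` are hypothetical helper names, the labels are made-up toy data, and the "model" is a stand-in that always predicts the most common training label.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def majority_class(labels):
    """Hypothetical stand-in for a model: predict the most common training label."""
    # sorted() makes the tie-break deterministic.
    return max(sorted(set(labels)), key=labels.count)

def k_fold_scores(y, k=5, seed=0):
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        prediction = majority_class([y[j] for j in train_idx])  # "fit"
        accuracy = mean(1.0 if y[j] == prediction else 0.0 for j in test_idx)  # "evaluate"
        scores.append(accuracy)  # retain the score, discard the model
    return scores

# Toy labels (assumption): 60 benign, 40 malignant.
labels = ["B"] * 60 + ["M"] * 40
scores = k_fold_scores(labels, k=5)
print(scores)
print(mean(scores))  # summary of the k evaluation scores
```

In practice `cross_val_score` with a `KFold` splitter performs the same loop; the manual version is only meant to show where the fit, the evaluation and the final summary happen.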

In [25]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import (
    train_test_split, KFold, cross_val_score,
    StratifiedKFold, ShuffleSplit, LeaveOneOut,
)

import os
import sys


In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/krishnaik06/Types-Of-Cross-Validation/main/cancer_dataset.csv')

In [4]:
df.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [5]:
X = df.iloc[:,2:]
y = df.iloc[:,1]

In [6]:
X = X.dropna(axis=1)

In [7]:
X.head()

,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


# HoldOut Validation Approach - Train & Test Split

In [10]:
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.7,random_state=0)
model = DecisionTreeClassifier()
model.fit(X_train,y_train)
result = model.score(X_test,y_test)
print(result)

0.9181286549707602


# K-fold Cross Validation

In [14]:
model = DecisionTreeClassifier()

# KFold yields train/test indices that split the data into k consecutive folds
# (without shuffling by default).
kFold_validation = KFold(n_splits=10)

result = cross_val_score(model,X,y,cv=kFold_validation)
print(result)
print(np.mean(result))

[0.92982456 0.9122807  0.89473684 0.94736842 0.9122807  0.98245614
 0.9122807  0.94736842 0.92982456 0.94642857]
0.931484962406015
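
`KFold` does not shuffle by default, whereas the general procedure starts with a shuffle. If the rows happen to be ordered (for example by class or by collection date), consecutive folds may not be representative. The sketch below uses ten made-up samples (`X_demo` is a hypothetical array, not the cancer data) to contrast the default with `shuffle=True`; the `random_state=0` value is an arbitrary choice for repeatability.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # ten toy samples (assumption)

# Default behaviour: consecutive folds, no shuffling.
plain = KFold(n_splits=5)
plain_tests = [test.tolist() for _, test in plain.split(X_demo)]
print(plain_tests)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True permutes the indices before splitting;
# random_state makes the permutation repeatable.
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_tests = [test.tolist() for _, test in shuffled.split(X_demo)]
print(shuffled_tests)
```

Either way, every sample appears in exactly one test fold; shuffling only changes which samples are grouped together.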


# Stratified K-fold Cross Validation

In [19]:
skfold=StratifiedKFold(n_splits=5)
model=DecisionTreeClassifier()
scores=cross_val_score(model,X,y,cv=skfold)

print(scores)
print(np.mean(scores))

[0.90350877 0.92982456 0.9122807  0.92105263 0.90265487]
0.9138643067846607
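
To see what stratification changes, it can help to compare the class balance of the test folds produced by `KFold` and `StratifiedKFold`. The sketch below uses made-up, deliberately sorted and imbalanced labels (`y_demo`, 90 negatives followed by 10 positives); `positive_rate_per_fold` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels (assumption): 90 negatives followed by 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # feature values do not affect how the folds are built

def positive_rate_per_fold(splitter):
    """Fraction of positive labels in each test fold."""
    return [float(y_demo[test].mean()) for _, test in splitter.split(X_demo, y_demo)]

plain_rates = positive_rate_per_fold(KFold(n_splits=5))
strat_rates = positive_rate_per_fold(StratifiedKFold(n_splits=5))

print(plain_rates)  # positives are concentrated in the last fold
print(strat_rates)  # every fold keeps the overall 10% positive rate
```

With plain `KFold` on sorted labels, four folds contain no positive example at all, so their scores say nothing about the minority class; the stratified folds avoid this.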


# Leave One Out Cross Validation

In [20]:
model=DecisionTreeClassifier()
leave_validation=LeaveOneOut()
results=cross_val_score(model,X,y,cv=leave_validation)

In [23]:
print(np.mean(results))

0.9244288224956063
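
The cost of LOOCV comes from the number of splits: each sample is the test set exactly once, so `cross_val_score` fits one model per row of `X`. A small sketch on six made-up samples (`X_demo` is a hypothetical array) shows this and checks that the splits match k-fold with k equal to the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_demo = np.arange(6).reshape(-1, 1)  # six toy samples (assumption)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X_demo)
loo_tests = [test.tolist() for _, test in loo.split(X_demo)]
print(n_fits)     # 6: one split, and hence one model fit, per sample
print(loo_tests)  # [[0], [1], [2], [3], [4], [5]]

# The same test sets come from k-fold with k equal to the number of samples.
kfold_tests = [test.tolist() for _, test in KFold(n_splits=len(X_demo)).split(X_demo)]
print(kfold_tests == loo_tests)  # True
```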


# Repeated Random Test-Train Splits
This technique is a hybrid of the traditional train-test split and k-fold cross-validation. The data is randomly split into a training set and a test set, the model is fitted and evaluated, and the whole process is repeated several times, as in cross-validation. Unlike k-fold, the test sets of different repetitions are drawn independently, so they may overlap.

In [26]:
model=DecisionTreeClassifier()
ssplit=ShuffleSplit(n_splits=10,test_size=0.30)
results=cross_val_score(model,X,y,cv=ssplit)

In [27]:
results

array([0.91812865, 0.88304094, 0.91812865, 0.92982456, 0.93567251,
       0.95321637, 0.94152047, 0.94736842, 0.9005848 , 0.88888889])

In [28]:
np.mean(results)

0.9216374269005849
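
Because each repetition draws its test set independently, a sample can be tested several times or not at all. The sketch below illustrates this on twenty made-up samples (`X_demo` is a hypothetical array); `random_state=0` is an added assumption so that the splits are repeatable, which the cell above does not set.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(20).reshape(-1, 1)  # twenty toy samples (assumption)

# Four independent random 70/30 splits.
ssplit_demo = ShuffleSplit(n_splits=4, test_size=0.30, random_state=0)
test_sets = [set(test.tolist()) for _, test in ssplit_demo.split(X_demo)]

for i, test in enumerate(test_sets):
    print(i, sorted(test))

# Count how often each sample landed in a test set across the repeats.
times_tested = [sum(idx in test for test in test_sets) for idx in range(len(X_demo))]
print(times_tested)
```

Four test sets of six samples cover 24 slots for only 20 samples, so at least one sample is necessarily tested more than once; in k-fold, by contrast, every sample is tested exactly once.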