# **Cross-Validation Techniques**

>  [**Cross-Validation: Evaluating Estimator Performance**](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)

>  [**sklearn.model_selection: Model Selection**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)

Cross-validation is a resampling procedure used to evaluate machine learning models and to estimate how a model will perform on an independent test dataset.

**Why Is Cross-Validation Important?**

We often randomly split the dataset into training and validation sets to develop a machine learning model. The training data is used to train the ML model, and the same model is tested on independent validation data to evaluate the performance of the model.

When the `random_state` of the split changes, the accuracy of the model changes too, so a single split does not give a stable accuracy figure. The test data must also be kept independent of the training data so that no data leakage occurs. Yet model performance still needs to be evaluated while the model is being developed on the training data. This is where cross-validation comes in.
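The dependence on `random_state` can be seen in a small experiment. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the diabetes dataset used later) and repeats the same 80/20 split with five different seeds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any small classification set would do).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

scores = []
for seed in range(5):
    # Only the split seed changes; the model configuration stays fixed.
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(clf.score(X_va, y_va))

print(scores)                      # one accuracy per split seed
print(max(scores) - min(scores))   # spread caused by the split alone
```

Any spread in the printed scores comes from the choice of split rather than from the model, which is the instability that averaging over several folds is meant to reduce.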

**Data needs to be split into:**

*   **Training Data:** Used for model development.
*   **Validation Data:** Used for validating the performance of the same model.

![CV.png](https://www.upgrad.com/blog/wp-content/uploads/2020/01/data-preprocessing-machine-learning-5.png)

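In simple terms, cross-validation makes better use of the available data, because the roles of training and validation data rotate instead of being fixed by a single split.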

In [1]:
# Import Library.
import pandas as pd
import numpy as np

> [**Kaggle Dataset**](https://www.kaggle.com/mathchi/diabetes-data-set)

In [2]:
# Load Dataset.
data = pd.read_csv(
    "https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
)
data.head()

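|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |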


In [3]:
# Dataset Summary.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
# Split the dataset into features and target values.
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Split the dataset into training and test set first, so the test rows stay unseen.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature Scaling: fit the scaler on the training set only, then apply it to both sets.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
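Scaling is itself a fitted step, so its statistics should come from training rows only if the test data is to stay independent. One possible way to keep that guarantee is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below is illustrative only: it runs on synthetic stand-in data, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the diabetes dataset).
X_raw, y_raw = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on whatever data is passed to fit(), here X_tr only.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
pipe.fit(X_tr, y_tr)

# The learned scaling statistics match the training rows, not the full dataset.
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
print(pipe.score(X_te, y_te))
```

A pipeline like this can also be passed to `cross_val_score` in place of a bare estimator, in which case the scaler is refitted on the training part of every fold.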

# **HoldOut Validation Approach - Train and Test Split.**

The holdout technique is a non-exhaustive validation method that randomly splits the dataset once into training and validation (test) sets. The training portion is usually larger than the test portion. The training data is used to fit the model, and the held-out data evaluates its performance. More training data generally yields a better model, but every row set aside for validation is a row the model cannot learn from.

*   A purely random holdout split can distort class proportions, so it is not well suited to an imbalanced dataset.
*   A sizeable share of the data is withheld from training the model.
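The imbalance concern in the first point can be reduced by asking `train_test_split` to stratify on the labels. The sketch below assumes a toy label vector with 10% positives (hypothetical data) and shows that the positive rate is kept in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption): 90 negatives, 10 positives.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

# stratify=y_imb keeps the class ratio the same in both parts of the split.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=42
)

print(y_tr.mean(), y_va.mean())  # positive rate in train and validation parts
```

With these toy sizes the split is exact: 8 of 80 training labels and 2 of 20 validation labels are positive.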

In [5]:
# Decision Tree Classifier.
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier().fit(X_train, y_train)
result = model.score(X_test, y_test)
print(result)

0.7402597402597403


# **K-Fold Cross Validation**

> [**sklearn.model_selection.KFold**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold)

In $K$-fold cross-validation, the original dataset is partitioned into $K$ folds of (approximately) equal size. In each iteration, one fold is used as the validation set and the remaining $(K-1)$ folds together form the training set.

![Kfold.gif](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4b/KfoldCV.gif/1920px-KfoldCV.gif)

Illustration of $K$-fold cross-validation with $n = 12$ observations and $K = 3$. After the data is shuffled, a total of 3 models are trained and tested. The process is repeated $K$ times, so that each group serves as the validation set exactly once while the remaining groups form the training data.
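The same $n = 12$, $K = 3$ setup can be reproduced with `KFold` to inspect the index sets directly. This is a minimal sketch on a toy array (an assumption). Note that `KFold` does not shuffle unless `shuffle=True` is passed, so the folds here are contiguous.

```python
import numpy as np
from sklearn.model_selection import KFold

# Twelve toy observations, identified by their row index.
X_toy = np.arange(12).reshape(-1, 1)

kf = KFold(n_splits=3)  # default shuffle=False gives contiguous folds
folds = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in kf.split(X_toy)
]

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: validate on {val_idx}, train on {len(train_idx)} rows")
# fold 1: validate on [0, 1, 2, 3], train on 8 rows
# fold 2: validate on [4, 5, 6, 7], train on 8 rows
# fold 3: validate on [8, 9, 10, 11], train on 8 rows
```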

### **Diagram of $K$-fold cross-validation.**

![KFOLD.png](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/1920px-K-fold_cross_validation_EN.svg.png)

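The final accuracy of the model is the mean of the accuracies that the $K$ models achieve on their respective validation folds. That is, $acc_{CV} = \frac{1}{K}\sum_{i=1}^{K} acc_{i}$

*   The performance estimate has relatively low bias, and the computational cost is modest because only $K$ models are trained.
*   The entire dataset is utilized for both training and validation.
*   Plain $K$-fold does not preserve class proportions within folds, so it is not well suited to an imbalanced dataset.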
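The mean above can be computed by hand to see what `cross_val_score` does internally for a classifier. The following is a sketch on synthetic stand-in data (an assumption), with a fixed `random_state` on the tree so the manual loop and the library call are comparable.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = DecisionTreeClassifier(random_state=0)
kf = KFold(n_splits=5)

fold_acc = []
for train_idx, val_idx in kf.split(X_demo):
    # A fresh copy of the model is fitted on the K-1 training folds...
    fitted = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    # ...and scored (accuracy) on the single held-out fold.
    fold_acc.append(fitted.score(X_demo[val_idx], y_demo[val_idx]))

acc_cv = sum(fold_acc) / len(fold_acc)  # acc_CV = (1/K) * sum of acc_i
reference = cross_val_score(model, X_demo, y_demo, cv=kf)
print(acc_cv, reference.mean())
```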


In [6]:
from sklearn.model_selection import KFold, cross_val_score

model = DecisionTreeClassifier()
kfold_val = KFold(10)
results = cross_val_score(model, X_train, y_train, cv=kfold_val)
print(results)
print(np.mean(results))

[0.72580645 0.77419355 0.82258065 0.72580645 0.60655738 0.80327869
 0.63934426 0.67213115 0.7704918  0.78688525]
0.7327075621364358


# **Stratified K-fold Cross Validation**

> [**sklearn.model_selection.StratifiedKFold**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold)

The cross-validation techniques discussed above may not work well with an imbalanced dataset. Stratified $K$-fold cross-validation addresses this problem.

In stratified $K$-fold cross-validation, the dataset is partitioned into $K$ folds such that each fold preserves, as closely as possible, the class proportions of the full dataset. This ensures that no class is over- or under-represented in the validation or training data, which matters most when the dataset is imbalanced.

![SKfold.png](https://dataaspirant.com/wp-content/uploads/2020/12/8-Stratified-K-Fold-Cross-Validation.png)

In stratified $K$-fold cross-validation, each fold has approximately the same class distribution as the whole dataset. The final score is the mean of the per-fold scores. Stratified $K$-fold cross-validation works well for an imbalanced dataset.
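The difference from plain $K$-fold is easiest to see on sorted, imbalanced labels. The sketch below assumes a toy label vector with 90 negatives followed by 10 positives (hypothetical data) and counts the positives that land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels (assumption): 90 negatives followed by 10 positives.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # the features are irrelevant for this illustration

# Number of positive samples in each validation fold.
plain = [int(y_imb[val].sum()) for _, val in KFold(n_splits=5).split(X_imb)]
strat = [
    int(y_imb[val].sum())
    for _, val in StratifiedKFold(n_splits=5).split(X_imb, y_imb)
]

print(plain)  # [0, 0, 0, 0, 10]
print(strat)  # [2, 2, 2, 2, 2]
```

With plain `KFold`, four of the five validation folds contain no positive sample at all, whereas `StratifiedKFold` keeps the 10% positive rate in every fold.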

In [7]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = DecisionTreeClassifier()
skfold = StratifiedKFold(n_splits=10)
results = cross_val_score(model, X_train, y_train, cv=skfold)
print(results)
print(np.mean(results))

[0.74193548 0.70967742 0.70967742 0.70967742 0.6557377  0.78688525
 0.72131148 0.68852459 0.75409836 0.70491803]
0.7182443151771549


# **Leave-P-Out Cross Validation (LpOCV)**

> [**sklearn.model_selection.LeavePOut**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePOut.html)

Leave-P-Out cross-validation ($LpOCV$) is an exhaustive cross-validation technique that uses $p$ observations as validation data while the remaining $n - p$ observations are used to train the model. This is repeated for every possible way of choosing $p$ observations from the original sample, which gives $\binom{n}{p}$ train/validation splits in total.
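The number of splits grows quickly with $n$. The sketch below lists every validation pair for a toy sample of 4 observations, then counts the splits for 614 training rows. The figure 614 is an inference from the 768-row dataset and the 20% test split, not a value stated in the text.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

lpo = LeavePOut(p=2)

# Every way of holding out 2 of 4 toy observations.
toy = np.arange(4).reshape(-1, 1)
pairs = [val_idx.tolist() for _, val_idx in lpo.split(toy)]
print(pairs)  # [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]

# Number of splits for 614 training rows (assumed size, see the note above).
n_train = 614
n_splits = lpo.get_n_splits(np.zeros((n_train, 1)))
print(n_splits, comb(n_train, 2))  # 188191 188191
```

Each of those splits requires fitting a model, which is why Leave-P-Out is rarely practical beyond small datasets or small $p$.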

In [8]:
from sklearn.model_selection import LeavePOut, cross_val_score

model = DecisionTreeClassifier()
lpoCV = LeavePOut(2)
results = cross_val_score(model, X_train, y_train, cv=lpoCV)
print(results)
print(np.mean(results))

[1.  0.5 0.5 ... 0.5 1.  0.5]
0.7149412033519138


# **Leave-One-Out Cross Validation (LOOCV)**

> [**sklearn.model_selection.LeaveOneOut**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut)

Leave-One-Out cross-validation ($LOOCV$) is an exhaustive cross-validation technique. It is the special case of $LpOCV$ with $p=1$.

![LOOCV.gif](https://upload.wikimedia.org/wikipedia/commons/c/c7/LOOCV.gif)

Illustration of leave-one-out cross-validation ($LOOCV$) with $n = 8$ observations. A total of 8 models are trained and tested. For a dataset with $n$ rows, the $1^{st}$ row is selected for validation and the remaining $(n-1)$ rows are used to train the model. In the next iteration, the $2^{nd}$ row is selected for validation and the rest are used for training. The process is repeated $n$ times, until every row has been used for validation once.

Both of the above techniques ($LpOCV$ and $LOOCV$) are exhaustive cross-validation methods, that is, methods that train and test on every possible way of dividing the original sample into a training and a validation set.

*   Simple to understand and implement.
*   The performance estimate tends to have low bias, since each model is trained on nearly all of the data.
*   The computation time required is high, since one model is fitted per split.
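The $n = 8$ case can be checked directly. This minimal sketch on a toy array (an assumption) shows that `LeaveOneOut` produces the same validation indices as `LeavePOut(p=1)` and as `KFold` with as many folds as observations.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut

# Eight toy observations, identified by their row index.
X_toy = np.arange(8).reshape(-1, 1)

loo = [val.tolist() for _, val in LeaveOneOut().split(X_toy)]
lpo1 = [val.tolist() for _, val in LeavePOut(p=1).split(X_toy)]
kf_n = [val.tolist() for _, val in KFold(n_splits=8).split(X_toy)]

print(loo)                  # [[0], [1], [2], [3], [4], [5], [6], [7]]
print(loo == lpo1 == kf_n)  # True
```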

In [9]:
from sklearn.model_selection import LeaveOneOut, cross_val_score

model = DecisionTreeClassifier()
looCV = LeaveOneOut()
results = cross_val_score(model, X_train, y_train, cv=looCV)
print(results)
print(np.mean(results))

[1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1.
 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1.
 0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0.
 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1.
 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 0.
 1. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1.
 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1.
 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 0. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0.