<span style="color:orange; font-size:40px;">
    <div align=center><b>Cross-Validation</b></div>
</span>

Cross-validation is a statistical method for evaluating the performance of a machine learning model before it is put to use.

There are several ways to perform cross-validation:
- **Hold out method**
- **Leave One Out Cross-Validation**
- **K-fold Cross-Validation**
- **Stratified K-fold Cross-Validation**

<span style="color:cyan; font-size:30px">
    <b>Import Libraries</b>
</span>

In [3]:
import numpy as np
import pandas as pd

In [3]:
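# load the dataset so the preview below can run
adversiting = pd.read_csv('../datasets/Adversiting.csv')
adversiting.head()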

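|    | ID | TV    | Radio | Newspaper | Sales |
|---:|---:|------:|------:|----------:|------:|
| 0  | 1  | 230.1 | 37.8  | 69.2      | 22.1  |
| 1  | 2  | 44.5  | 39.3  | 45.1      | 10.4  |
| 2  | 3  | 17.2  | 45.9  | 69.3      | 9.3   |
| 3  | 4  | 151.5 | 41.3  | 58.5      | 18.5  |
| 4  | 5  | 180.8 | 10.8  | 58.4      | 12.9  |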


<span style="color:cyan; font-size:30px">
    <b>Hold Out</b>
</span>

<div align=center><img src='./img/hocv.png'/></div>

* Hold-out splits the entire dataset into a training set and a test set. A common proportion is 80% for training and 20% for testing, or 70%-30%, depending on the size of the dataset.

* The drawback is that the estimate rests on a single split: observations carrying important information may land in the test set and never be seen during training, so the result can change noticeably from one split to another.

* This method can be implemented using the ***train_test_split*** function from the *Scikit-Learn* library.

* **Note:** The hold-out method is not a cross-validation method.

<span style="color:cyan; font-size:20px">
    <b>Code Implementation</b>
</span>

In [1]:
from sklearn.model_selection import train_test_split

# importing data
data = pd.read_csv('../datasets/Adversiting.csv')

# preview data
data.head()

# split data in features and target
# features: TV, Radio, Newspaper | target -> Sales
X = data.iloc[:, 1:4]
y = data.iloc[:,-1]

# Train | Test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print( f'Train set size: {len(X_train)}' )
print( f'Test set size: {len(X_test)}' )

Train set size: 160
Test set size: 40
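
To illustrate how much a hold-out estimate can depend on the particular split, here is a minimal sketch. It does not use the advertising file; `X_demo` and `y_demo` are synthetic stand-ins (an assumption made so the example runs on its own), and `LinearRegression` is just one possible model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 3 numeric features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3.0 + 0.05 * X_demo[:, 0] + 0.2 * X_demo[:, 1] + rng.normal(0, 1.5, size=200)

holdout_scores = []
for seed in (0, 1, 2):
    # Each seed produces a different 80/20 split of the same data
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    # Score on the held-out 20% only
    holdout_scores.append(r2_score(y_te, model.predict(X_te)))

print([round(s, 3) for s in holdout_scores])
```

The three scores come from the same data and the same model; only the split differs, which is the variability that cross-validation tries to average out.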


<span style="color:cyan; font-size:30px">
    <b>Leave One Out Cross-Validation | LOOCV</b>
</span>

<div align=center> <img src='./img/loocv.png' /></div>

* LOOCV is a cross-validation method where a single observation (one record) is taken as test data and everything else is taken as training data.

* This method doesn't split the dataset into folds of several observations; each iteration takes one observation for testing the model and uses the rest of the data for training it.
* LOOCV is computationally expensive, because the model is fitted once per observation.
* It is useful when we have small datasets.
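
The cost can be made concrete by counting the model fits. The sketch below is only an illustration on synthetic data (`X_small` and `y_small` are assumed names, not part of the notebook's dataset); it scores a linear regression with `cross_val_score` and `LeaveOneOut`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption): 25 observations, 1 feature
rng = np.random.default_rng(0)
X_small = rng.uniform(0, 10, size=(25, 1))
y_small = 2.0 * X_small[:, 0] + rng.normal(0, 1.0, size=25)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X_small)  # one split, and one model fit, per observation

# R^2 is undefined on a single test sample, so an error metric is used instead
loo_scores = cross_val_score(
    LinearRegression(), X_small, y_small,
    cv=loo, scoring="neg_mean_squared_error",
)
loo_mse = -loo_scores.mean()
print(f"Model fits: {n_fits}, LOOCV MSE: {loo_mse:.3f}")
```

With 25 observations the model is trained 25 times; on a dataset with many thousands of rows the same approach quickly becomes impractical.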

<span style="color:cyan; font-size:20px">
    <b>Code Implementation</b>
</span>

In [17]:
from sklearn.model_selection import LeaveOneOut
X = np.array( [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210] )

cv = LeaveOneOut()

for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    print(F" Train: {X_train}, Test: {X_test} ")

 Train: [ 20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180 190 200 210], Test: [10] 
 Train: [ 10  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180 190 200 210], Test: [20] 
 Train: [ 10  20  40  50  60  70  80  90 100 110 120 130 140 150 160 170 180 190 200 210], Test: [30] 
 Train: [ 10  20  30  50  60  70  80  90 100 110 120 130 140 150 160 170 180 190 200 210], Test: [40] 
 Train: [ 10  20  30  40  60  70  80  90 100 110 120 130 140 150 160 170 180 190 200 210], Test: [50] 
 Train: [ 10  20  30  40  50  70  80  90 100 110 120 130 140 150 160 170 180 190 200 210], Test: [60] 
 Train: [ 10  20  30  40  50  60  80  90 100 110 120 130 140 150 160 170 180 190 200 210], Test: [70] 
 Train: [ 10  20  30  40  50  60  70  90 100 110 120 130 140 150 160 170 180 190 200 210], Test: [80] 
 Train: [ 10  20  30  40  50  60  70  80 100 110 120 130 140 150 160 170 180 190 200 210], Test: [90] 
 Train: [ 10  20  30  40  50  60  70  80  90 110 120 130 140 150 160 170 

<span style="color:cyan; font-size:30px">
    <b>K Fold Cross-Validation</b>
</span>

<div align=center> <img src='./img/kfoldcv.png' /></div>

* K-Fold cross-validation is a method where the whole dataset is divided into k groups, or folds, of (approximately) equal size.
   * e.g. if we have a dataset of 20 observations and set k to 5, then 20/5 = 4, so we get 5 folds of 4 observations each.
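
The arithmetic can be checked directly with `KFold`. The helper `fold_sizes` below is a hypothetical name introduced only for this sketch; it also shows what happens when the number of observations is not divisible by k.

```python
import numpy as np
from sklearn.model_selection import KFold

def fold_sizes(n_samples, k):
    """Return the size of the test fold in each of the k iterations."""
    indices = np.arange(n_samples)
    return [len(test_idx) for _, test_idx in KFold(n_splits=k).split(indices)]

sizes_20 = fold_sizes(20, 5)
sizes_21 = fold_sizes(21, 5)
print(sizes_20)  # [4, 4, 4, 4, 4]
print(sizes_21)  # [5, 4, 4, 4, 4]
```

When the division leaves a remainder, the extra observations go to the first folds, so fold sizes differ by at most one.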

**The procedure:**
1. Shuffle the dataset randomly.
2. Split the dataset into **k** groups.
3. For each unique group:
    1. Take the group as the hold-out (test) set
    2. Take the remaining groups as the training set
    3. Fit a model on the training set and evaluate it on the test set
    4. Retain the evaluation score and discard the model

**Set up K**

K is an integer that sets the number of groups, or folds, the dataset is split into; it also equals the number of iterations of the cross-validation procedure. Common ways to choose it:
* **Representative:** the value for k is chosen such that each train/test group of data samples is large enough to be statistically representative of the broader dataset.
   * e.g. if we have a dataset with 4000 records, we can choose k = 8, so each group has 500 observations.

* **k=5 | k=10:** These values are very common in machine learning; they split the data into 5 or 10 folds respectively.
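
The procedure above can be sketched as an explicit loop. This is a minimal illustration under assumptions: the data (`X_cv`, `y_cv`) is synthetic, and the model and metric (`LinearRegression`, mean squared error) are arbitrary choices rather than anything prescribed by the method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data (assumption): 100 observations, 2 features
rng = np.random.default_rng(0)
X_cv = rng.uniform(0, 10, size=(100, 2))
y_cv = 1.5 * X_cv[:, 0] - 0.5 * X_cv[:, 1] + rng.normal(0, 1.0, size=100)

# Shuffle the data and split it into k = 5 groups
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, test_idx in kf.split(X_cv):
    # A fresh model per fold: the previous one is discarded
    model = LinearRegression()
    # Fit on the k-1 remaining groups
    model.fit(X_cv[train_idx], y_cv[train_idx])
    # Evaluate on the held-out group and retain the score
    preds = model.predict(X_cv[test_idx])
    fold_mse.append(mean_squared_error(y_cv[test_idx], preds))

print(f"Per-fold MSE: {np.round(fold_mse, 3)}")
print(f"Mean MSE: {np.mean(fold_mse):.3f}")
```

The mean of the per-fold scores is usually reported as the cross-validated estimate; scikit-learn's `cross_val_score` wraps the same loop in a single call.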

<span style="color:cyan; font-size:20px">
    <b>Code Implementation</b>
</span>

In [22]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True)
# toy array of 21 values; 21 is not divisible by 5, so one fold gets 5 test samples and the others get 4
X = np.array( [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210] )

for i, (train_index, test_index) in enumerate(kfold.split(X)):
    print(f'Fold {i+1}:')
    print(" Train: %s, Test: %s " % (X[train_index], X[test_index]))
    

Fold 1:
 Train: [ 10  20  30  40  50  60  70  90 110 120 130 140 150 180 190 200], Test: [ 80 100 160 170 210] 
Fold 2:
 Train: [ 10  20  30  50  60  70  80 100 110 120 140 150 160 170 190 200 210], Test: [ 40  90 130 180] 
Fold 3:
 Train: [ 20  30  40  60  80  90 100 110 120 130 140 150 160 170 180 190 210], Test: [ 10  50  70 200] 
Fold 4:
 Train: [ 10  40  50  60  70  80  90 100 130 140 150 160 170 180 190 200 210], Test: [ 20  30 110 120] 
Fold 5:
 Train: [ 10  20  30  40  50  70  80  90 100 110 120 130 160 170 180 200 210], Test: [ 60 140 150 190] 


<span style="color:cyan; font-size:30px">
    <b>Stratified K Fold Cross-Validation</b>
</span>

<div align=center> <img src='./img/skfold.png' /></div>

* It is an extension of K-Fold CV designed for classification problems.
* Folds are made by preserving the percentage of samples of each class, so every fold has approximately the same class distribution as the full dataset.
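
A small comparison may make the difference from plain K-Fold clearer. The labels below are a made-up imbalanced example (80 negatives followed by 20 positives), and `positives_per_test_fold` is a hypothetical helper used only in this sketch.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels (assumption): 80 negatives followed by 20 positives
y_imb = np.array([0] * 80 + [1] * 20)
X_imb = np.arange(len(y_imb)).reshape(-1, 1)

def positives_per_test_fold(splitter):
    """Count the positive samples that land in each test fold."""
    return [int(y_imb[test_idx].sum()) for _, test_idx in splitter.split(X_imb, y_imb)]

plain = positives_per_test_fold(KFold(n_splits=5))
stratified = positives_per_test_fold(StratifiedKFold(n_splits=5))
print(plain)       # [0, 0, 0, 0, 20]
print(stratified)  # [4, 4, 4, 4, 4]
```

Without shuffling, plain K-Fold puts every positive sample in the last fold, while the stratified version keeps the 20% positive rate in every fold.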

<span style="color:cyan; font-size:20px">
    <b>Code Implementation</b>
</span>

In [28]:
from sklearn.model_selection import StratifiedKFold
X = np.array( [ [1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16] ] )
y = np.array( [1, 1, 1, 0, 1, 0, 1, 0] )

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

for i, (train, test) in enumerate( skf.split(X, y) ):
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    print(F"Fold {i+1}")
    print(f" X_train: {X_train}, y_train: {y_train}")

Fold 1
 X_train: [[ 1  2]
 [ 5  6]
 [11 12]
 [13 14]
 [15 16]], y_train: [1 1 0 1 0]
Fold 2
 X_train: [[ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]], y_train: [1 1 0 1 0]
Fold 3
 X_train: [[ 1  2]
 [ 3  4]
 [ 7  8]
 [ 9 10]
 [13 14]
 [15 16]], y_train: [1 1 0 1 1 0]


<span style="color:cyan; font-size:30px">
    <b>Group K Fold Cross-Validation</b>
</span>

<div align=center> <img src='./img/gkfold.png' /></div>

* Group K-Fold ensures that the same group never appears in both the train and test sets.
   * e.g. when medical data is collected from multiple patients, with several samples per patient, this method can hold out, for example, 2 patients for the test set and use everyone else for training; no patient's samples are ever split between the two sets.
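
The patient scenario can be checked with a short sketch. The patient identifiers and the array `X_pat` are hypothetical, invented for illustration; the check simply intersects the patients seen in train and test for each fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 4 patients with 3 samples each
patient_ids = np.repeat(["p1", "p2", "p3", "p4"], 3)
X_pat = np.arange(len(patient_ids)).reshape(-1, 1)

overlaps = []
test_patient_counts = []
for train_idx, test_idx in GroupKFold(n_splits=2).split(X_pat, groups=patient_ids):
    train_patients = set(patient_ids[train_idx])
    test_patients = set(patient_ids[test_idx])
    # Patients present on both sides of the split (should be none)
    overlaps.append(train_patients & test_patients)
    test_patient_counts.append(len(test_patients))
    print(f"Test patients: {sorted(test_patients)}, shared with train: {len(overlaps[-1])}")
```

With 4 equally sized groups and 2 splits, each test set holds 2 whole patients and the intersection with the training patients is empty in every fold.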

In [34]:
from sklearn.model_selection import GroupKFold

X = np.array( [[10, 20], [30, 40], [50, 60], [70, 80], [90,100], [110, 120]] )
y = np.array([1, 2, 3, 4, 5, 6])
groups = [1, 2, 1, 1, 2, 3]
group_kfold = GroupKFold(n_splits=3)

for i, (train_index, test_index) in enumerate( group_kfold.split(X,y, groups ) ):
    X_train, X_test = X[train_index], X[test_index]
    print(f" Train:{X_train}, Test: {X_test} ")

 Train:[[ 30  40]
 [ 90 100]
 [110 120]], Test: [[10 20]
 [50 60]
 [70 80]] 
 Train:[[ 10  20]
 [ 50  60]
 [ 70  80]
 [110 120]], Test: [[ 30  40]
 [ 90 100]] 
 Train:[[ 10  20]
 [ 30  40]
 [ 50  60]
 [ 70  80]
 [ 90 100]], Test: [[110 120]] 


<span style="color:cyan; font-size:30px">
    <b>Stratified Group K Fold Cross-Validation</b>
</span>

<div align=center> <img src='./img/sgkfold.png' /></div>

* A variation of StratifiedKFold that attempts to return stratified folds with non-overlapping groups.
* The folds are made by preserving the percentage of samples for each class as far as the group constraint allows.
* This method is useful for classification problems where the data also contains groups.

In [11]:
from sklearn.model_selection import StratifiedGroupKFold

X = np.array( [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170] )
y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
sgkf = StratifiedGroupKFold(n_splits=5)

for i, (train_index, test_index) in enumerate(sgkf.split(X, y, groups)):
    X_train, X_test = X[train_index], X[test_index]
    print(f"----------------Fold: {i+1}----------------------- ")
    print(f" Train: {X_train}, Train Groups: {groups[train_index]} ")
    print(f" Test: {X_test}, Test Groups: {groups[test_index]} ")

----------------Fold: 1----------------------- 
 Train: [ 10  20  30  40  50  60  70  80 130 140 150 160 170], Train Groups: [1 1 2 2 3 3 3 4 6 6 7 8 8] 
 Test: [ 90 100 110 120], Test Groups: [5 5 5 5] 
----------------Fold: 2----------------------- 
 Train: [ 10  20  30  40  80  90 100 110 120 130 140 160 170], Train Groups: [1 1 2 2 4 5 5 5 5 6 6 8 8] 
 Test: [ 50  60  70 150], Test Groups: [3 3 3 7] 
----------------Fold: 3----------------------- 
 Train: [ 30  40  50  60  70  90 100 110 120 130 140 150 160 170], Train Groups: [2 2 3 3 3 5 5 5 5 6 6 7 8 8] 
 Test: [10 20 80], Test Groups: [1 1 4] 
----------------Fold: 4----------------------- 
 Train: [ 10  20  50  60  70  80  90 100 110 120 130 140 150], Train Groups: [1 1 3 3 3 4 5 5 5 5 6 6 7] 
 Test: [ 30  40 160 170], Test Groups: [2 2 8 8] 
----------------Fold: 5----------------------- 
 Train: [ 10  20  30  40  50  60  70  80  90 100 110 120 150 160 170], Train Groups: [1 1 2 2 3 3 3 4 5 5 5 5 7 8 8] 
 Test: [130 140], Tes