# K-Folds Validation
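In this notebook, we explore how to make efficient use of a small dataset with **k-folds validation** (also known as k-fold cross-validation). K-folds validation splits the training dataset into k roughly equal subsets, called folds. In each round, one fold is reserved as the validation dataset and the model is trained on the remaining k - 1 folds, so every sample is used for validation exactly once.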
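
To make the fold mechanics concrete, here is a minimal sketch on a toy array of six samples. The names `toy_samples`, `toy_folds`, and `fold_indices` are illustrative and not part of this notebook's pipeline, and shuffling is left off (an assumption made so the fold boundaries are easy to read).

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: six samples identified by their positions 0..5 (illustrative only)
toy_samples = np.arange(6)

# Three folds without shuffling, so each validation fold is a contiguous block
toy_folds = KFold(n_splits=3)

# Collecting the (train, validation) index pairs produced for each fold
fold_indices = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in toy_folds.split(toy_samples)
]

for fold_number, (train_idx, val_idx) in enumerate(fold_indices, start=1):
    print(f'Fold {fold_number}: train={train_idx}, validation={val_idx}')
```

Each sample lands in exactly one validation fold and in the training portion of the other two folds.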

## Project Setup

In [1]:
# Importing the necessary Python libraries
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Getting the Iris dataset from Scikit-Learn
iris = datasets.load_iris()

In [3]:
# Loading the features (X) and the target variable (y) as Pandas DataFrames
X = pd.DataFrame(data = iris['data'], columns = iris['feature_names'])
y = pd.DataFrame(data = iris['target'], columns = ['target'])

## Performing a Typical Split
Before we jump into k-folds validation, let's do a quick refresher on how we typically split a dataset with a single `train_test_split`. We'll contrast this approach with k-folds validation later.

In [4]:
# Performing a train_test_split on the dataset
X_train, X_val, y_train, y_val = train_test_split(X, y)

In [5]:
# Instantiating a RandomForestClassifier model
rfc_model = RandomForestClassifier()

In [6]:
# Fitting the RandomForestClassifier model to the X_train and y_train datasets
rfc_model.fit(X_train, y_train)

RandomForestClassifier()

In [7]:
# Getting inferential predictions for the validation dataset
val_preds = rfc_model.predict(X_val)

In [8]:
# Generating validation metrics by comparing the inferential predictions (val_preds) to the actuals (y_val)
val_accuracy = accuracy_score(y_val, val_preds)
val_confusion_matrix = confusion_matrix(y_val, val_preds)

In [9]:
# Printing out the validation metrics
print(f'Accuracy Score: {val_accuracy}')
print(f'Confusion Matrix: \n{val_confusion_matrix}')

Accuracy Score: 0.9210526315789473
Confusion Matrix: 
[[12  0  0]
 [ 0 11  0]
 [ 0  3 12]]
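
A single split produces one score, and that score depends on which rows happen to land in the validation set. The sketch below is not part of the original notebook: it repeats the split with a few arbitrary seeds (`random_state` values 0 through 4 are an assumption chosen for illustration) so the spread between splits can be inspected. The `demo_` names are hypothetical and chosen to avoid overwriting the notebook's own variables.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Loading Iris again under separate names so the notebook's X and y stay untouched
demo_iris = datasets.load_iris()
demo_X, demo_y = demo_iris['data'], demo_iris['target']

split_accuracies = []
for seed in range(5):
    # Each seed yields a different train/validation partition of the same data
    demo_X_train, demo_X_val, demo_y_train, demo_y_val = train_test_split(
        demo_X, demo_y, random_state=seed
    )

    # Fixing the model seed (an assumption) so only the split changes between runs
    demo_model = RandomForestClassifier(random_state=0)
    demo_model.fit(demo_X_train, demo_y_train)

    demo_preds = demo_model.predict(demo_X_val)
    split_accuracies.append(accuracy_score(demo_y_val, demo_preds))

print(f'Accuracy per split: {split_accuracies}')
```

Any differences between these scores come from the partitioning alone, which is the sensitivity that k-folds validation is meant to average out.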


## Training with K-Folds Validation
Now that we have trained a basic model using a single `train_test_split`, we are ready to train with k-folds validation. Instead of one split, the model is trained and validated once per fold, which yields one set of metrics for each of the 5 folds.

In [10]:
# Instantiating the K-Fold cross validation object with 5 folds
k_folds = KFold(n_splits = 5, shuffle = True, random_state = 42)

In [11]:
# Iterating through each of the folds in K-Folds
for train_index, val_index in k_folds.split(X):
    
    # Splitting the training set from the validation set for this specific fold
    X_train, X_val = X.iloc[train_index, :], X.iloc[val_index, :]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    
    # Instantiating a RandomForestClassifier model
    rfc_model = RandomForestClassifier()
    
    # Fitting the RandomForestClassifier model to the X_train and y_train datasets
    rfc_model.fit(X_train, y_train)
    
    # Getting inferential predictions for the validation dataset
    val_preds = rfc_model.predict(X_val)
    
    # Generating validation metrics by comparing the inferential predictions (val_preds) to the actuals (y_val)
    val_accuracy = accuracy_score(y_val, val_preds)
    val_confusion_matrix = confusion_matrix(y_val, val_preds)
    
    # Printing out the validation metrics
    print(f'Accuracy Score: {val_accuracy}')
    print(f'Confusion Matrix: \n{val_confusion_matrix}')

Accuracy Score: 1.0
Confusion Matrix: 
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
Accuracy Score: 0.9666666666666667
Confusion Matrix: 
[[13  0  0]
 [ 0 10  0]
 [ 0  1  6]]
Accuracy Score: 0.9333333333333333
Confusion Matrix: 
[[12  0  0]
 [ 0  8  2]
 [ 0  0  8]]
Accuracy Score: 0.9333333333333333
Confusion Matrix: 
[[ 8  0  0]
 [ 0  9  1]
 [ 0  1 11]]
Accuracy Score: 0.9666666666666667
Confusion Matrix: 
[[ 7  0  0]
 [ 0 11  0]
 [ 0  1 11]]
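
The per-fold scores above are usually summarized as a mean and a standard deviation. As a minimal sketch, assuming the same 5-fold configuration as this notebook, Scikit-Learn's `cross_val_score` can run the loop and return one accuracy per fold. The `summary_` names and the `random_state` on the classifier are assumptions added for illustration, so the numbers will not necessarily match the output above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Loading Iris under separate names so the notebook's X and y stay untouched
summary_iris = datasets.load_iris()
summary_X, summary_y = summary_iris['data'], summary_iris['target']

# Same fold configuration as the loop above
summary_folds = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score fits a fresh clone of the model on each fold's training portion
# and scores it on that fold's validation portion
fold_scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    summary_X,
    summary_y,
    cv=summary_folds,
    scoring='accuracy',
)

mean_accuracy = fold_scores.mean()
std_accuracy = fold_scores.std()

print(f'Accuracy per fold: {fold_scores}')
print(f'Mean accuracy: {mean_accuracy:.3f} (std: {std_accuracy:.3f})')
```

Reporting the mean together with the spread gives a steadier estimate than any single fold's score.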
