<a href="https://colab.research.google.com/github/ajrianop/Machine-Learning-Techniques/blob/main/02_K_fold_Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Cross Validation**

Cross validation is a technique used in machine learning to assess how well a trained model will generalize to an independent data set. It's particularly helpful when you have a limited amount of data.

In this notebook, we work with one of the most common types of cross-validation: K-fold cross-validation. The idea is to explain the technique step by step and use Python code for better understanding.

**How does it work?**
To apply the K-fold technique, we follow these steps:

* Split the data into K randomly assigned segments (folds).
* Take one segment as the **test data** and the other K-1 segments as the **train data**.
* Train the model on the K-1 training segments and measure its performance on the held-out segment.
* Repeat this so that each of the K segments is used exactly once as the test data.
* To obtain an overall performance score, take the average of the K scores (for example, accuracy for a classifier or R² for a regressor).
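
The steps above can be sketched by hand with NumPy. This is a minimal illustration, not the way scikit-learn implements it internally; the seed `0`, the value `K = 5`, and the names `folds` and `fold_scores` are assumptions chosen only for the example.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
X, y = iris.data, iris.target

K = 5
rng = np.random.default_rng(0)        # fixed seed, chosen only for reproducibility
indices = rng.permutation(len(X))     # randomly order the sample indices
folds = np.array_split(indices, K)    # K segments of (nearly) equal size

fold_scores = []
for i in range(K):
    # Fold i is the test data; the other K-1 folds are the train data
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
    model = svm.SVC(kernel='linear', C=1).fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(fold_scores)            # one accuracy per fold
print(np.mean(fold_scores))   # average of the K scores
```

Each sample is used for testing exactly once, and the final score is the mean of the K per-fold accuracies.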

**What is the idea with this technique?**

Cross-validation helps to provide a more accurate estimate of a model's performance by using multiple splits of the data, reducing the risk of overfitting to a particular training-validation split.

**In practice:** In K-fold cross-validation, we split the data into K subsets (folds). Each fold serves once as the test set, while the remaining K-1 folds form the training set. Averaging the scores over the K rounds gives a more reliable estimate of the model's performance than a single split, and lets us choose the model variation that performs best on average.
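
To make the role of each fold concrete, the following sketch uses scikit-learn's `KFold` on a toy array of 10 samples (the array `X_demo` and the sizes are assumptions for illustration only) and counts how often each sample is used for testing and for training.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, K = 10, 5
X_demo = np.arange(n_samples).reshape(-1, 1)   # toy data, for illustration only

test_counts = np.zeros(n_samples, dtype=int)
train_counts = np.zeros(n_samples, dtype=int)
for train_idx, test_idx in KFold(n_splits=K).split(X_demo):
    print('train:', train_idx, 'test:', test_idx)
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

print(test_counts)    # every sample is tested exactly once
print(train_counts)   # every sample is used for training K-1 times
```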

## **Example:**
Consider the iris flower dataset. We will look only at the performance scores of the models.

In [None]:
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()

In order to assess the performance of Support Vector Classification (SVC) using K-fold cross-validation, we will first employ the conventional train-test split method to train the SVC model and evaluate its performance. Subsequently, we will utilize the K-fold technique to conduct a similar evaluation and compare the results. This comparative analysis will provide insights into the effectiveness of different evaluation approaches for SVC.

***Using the conventional train/test split method***

In [None]:
# Split the complete iris dataset into train/test sets: 70% to train and 30% to test
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

# Using training set to build an SVC model for classification
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

# The performance score with the test data can be seen using model.score(X_test, y_test)
clf.score(X_test, y_test)

0.9777777777777777
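
The score above comes from one particular split (`random_state=0`). As a rough sketch of why a single split can be misleading, the block below repeats the same procedure for a few other seeds; the seeds `0` to `4` and the names `X_tr`, `X_te`, `y_tr`, `y_te` and `split_scores` are assumptions made for this illustration, and the scores may or may not differ between seeds.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

split_scores = []
for seed in range(5):
    # Same 70/30 split as above, but with a different random partition each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed)
    model = svm.SVC(kernel='linear', C=1).fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print(split_scores)
print(np.min(split_scores), np.max(split_scores))
```

Any spread between the minimum and maximum is caused only by which samples happen to land in the test set, which is the dependence that cross-validation averages over.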

***Using the K-fold cross-validation method***

In [None]:
# To score the same SVC configuration with cross validation, we use cross_val_score.
# It clones the estimator and refits it on each fold, so the earlier fit on X_train is not reused.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

# Print the accuracy for each fold
print(scores)

# And the mean accuracy of all 5 folds
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001
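
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds (`StratifiedKFold`, without shuffling), so each fold keeps roughly the same class proportions as the full dataset. The sketch below checks this by passing the splitter explicitly; the names `svc_demo`, `scores_int` and `scores_skf` are assumptions for this example.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
svc_demo = svm.SVC(kernel='linear', C=1)   # unfitted; cross_val_score fits a clone per fold

# cv given as an integer
scores_int = cross_val_score(svc_demo, iris.data, iris.target, cv=5)

# cv given as an explicit stratified splitter with the same number of folds
scores_skf = cross_val_score(svc_demo, iris.data, iris.target,
                             cv=StratifiedKFold(n_splits=5))

print(scores_int)
print(scores_skf)
print(np.allclose(scores_int, scores_skf))
```

To get randomly assigned folds instead, one option is to pass `StratifiedKFold(n_splits=5, shuffle=True, random_state=...)` as `cv`.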


Now, let us look at the performance for different numbers of folds:

In [None]:
def cross_validation(model, number_of_folds):
  """Print the per-fold and mean accuracy of `model` on the iris data using K-fold cross validation."""
  scores = cross_val_score(model, iris.data, iris.target, cv=number_of_folds)
  print(f'Accuracy for each fold\n   {scores}\n')
  # And the mean accuracy of all K folds
  print(f'The mean of all K-folds\n   {scores.mean()}')

In [None]:
model = clf
for number_of_folds in range(3,8):
  print(f'K-fold cross validation for K={number_of_folds} folds')
  cross_validation(model , number_of_folds)
  print('-----------------------\n')

K-fold cross validation for K=3 folds
Accuracy for each fold
   [1.   1.   0.98]

The mean of all K-folds
   0.9933333333333333
-----------------------

K-fold cross validation for K=4 folds
Accuracy for each fold
   [1.         0.97368421 0.97297297 0.97297297]

The mean of all K-folds
   0.9799075391180654
-----------------------

K-fold cross validation for K=5 folds
Accuracy for each fold
   [0.96666667 1.         0.96666667 0.96666667 1.        ]

The mean of all K-folds
   0.9800000000000001
-----------------------

K-fold cross validation for K=6 folds
Accuracy for each fold
   [0.96 1.   0.96 0.96 1.   1.  ]

The mean of all K-folds
   0.98
-----------------------

K-fold cross validation for K=7 folds
Accuracy for each fold
   [0.95454545 1.         0.95454545 0.95238095 0.95238095 1.
 1.        ]

The mean of all K-folds
   0.9734075448361164
-----------------------



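With the linear kernel in SVC, the mean accuracy ranges from about 0.973 (K = 7) to about 0.993 (K = 3), with K = 5 giving 0.980. Note that K is a setting of the evaluation procedure, not of the model, so these differences show how the estimate varies with the split rather than a change in the model itself.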
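
When comparing runs with different K, it can help to look at the spread of the per-fold scores next to the mean. The following is a small sketch under the same setup (linear SVC, iris data); the names `linear_svc` and `summary` are assumptions for this example.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
linear_svc = svm.SVC(kernel='linear', C=1)

summary = {}
for k in range(3, 8):
    fold_scores = cross_val_score(linear_svc, iris.data, iris.target, cv=k)
    # Store the mean and the standard deviation of the K per-fold accuracies
    summary[k] = (fold_scores.mean(), fold_scores.std())
    print(f'K={k}: mean={summary[k][0]:.4f}, std={summary[k][1]:.4f}')
```

If the differences between the means are of the same order as the standard deviations, they are hard to interpret as real differences.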

## **Using a different kernel in SVC to see the performance of the model**

***Using the conventional train/test split method***

Now, we look at the performance of SVC with a polynomial kernel of degree 3 (the default).

In [None]:
# Using training set to build an SVC model with polynomial kernel (degree by default is 3) for classification
clf_poly = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)

# The performance score with the test data can be seen using model.score(X_test, y_test)
clf_poly.score(X_test, y_test)

0.9777777777777777

***Using the K-fold cross-validation method***

In [None]:
# Using the polynomial Kernel
model_poly = clf_poly
for number_of_folds in range(3,8):
  print(f'K-fold cross validation for K={number_of_folds} folds')
  cross_validation(model_poly , number_of_folds)
  print('-----------------------\n')

K-fold cross validation for K=3 folds
Accuracy for each fold
   [0.98 0.94 0.98]

The mean of all K-folds
   0.9666666666666667
-----------------------

K-fold cross validation for K=4 folds
Accuracy for each fold
   [0.97368421 0.94736842 0.97297297 1.        ]

The mean of all K-folds
   0.9735064011379801
-----------------------

K-fold cross validation for K=5 folds
Accuracy for each fold
   [0.96666667 1.         0.96666667 0.96666667 1.        ]

The mean of all K-folds
   0.9800000000000001
-----------------------

K-fold cross validation for K=6 folds
Accuracy for each fold
   [0.96 1.   0.92 0.92 0.96 1.  ]

The mean of all K-folds
   0.96
-----------------------

K-fold cross validation for K=7 folds
Accuracy for each fold
   [0.95454545 1.         0.95454545 0.9047619  0.95238095 1.
 1.        ]

The mean of all K-folds
   0.9666048237476809
-----------------------



### **What happens if we change the degree of the polynomial kernel?**

In [None]:
# Defining the degree
degree = 2

# Using training set to build an SVC model with polynomial kernel (degree set to 2 instead of the default 3) for classification
clf_poly = svm.SVC(kernel='poly', degree = degree, C=1).fit(X_train, y_train)

# The performance score with the test data can be seen using model.score(X_test, y_test)
clf_poly.score(X_test, y_test)

0.9777777777777777

In [None]:
# Rebuild the SVC model with polynomial kernel, using the degree defined above (2 instead of the default 3)
clf_poly = svm.SVC(kernel='poly', degree = degree, C=1).fit(X_train, y_train)

# Using the polynomial kernel of the chosen degree
model_poly = clf_poly
for number_of_folds in range(3,8):
  print(f'K-fold cross validation for K={number_of_folds} folds')
  cross_validation(model_poly , number_of_folds)
  print('-----------------------\n')

K-fold cross validation for K=3 folds
Accuracy for each fold
   [0.98 1.   0.96]

The mean of all K-folds
   0.98
-----------------------

K-fold cross validation for K=4 folds
Accuracy for each fold
   [0.97368421 0.97368421 0.94594595 0.97297297]

The mean of all K-folds
   0.9665718349928876
-----------------------

K-fold cross validation for K=5 folds
Accuracy for each fold
   [0.96666667 1.         1.         0.96666667 1.        ]

The mean of all K-folds
   0.9866666666666667
-----------------------

K-fold cross validation for K=6 folds
Accuracy for each fold
   [0.96 1.   0.96 0.96 1.   1.  ]

The mean of all K-folds
   0.98
-----------------------

K-fold cross validation for K=7 folds
Accuracy for each fold
   [0.95454545 1.         0.95454545 1.         0.9047619  1.
 1.        ]

The mean of all K-folds
   0.9734075448361164
-----------------------



Comparing the three models above (linear kernel, polynomial kernel of degree 3, and polynomial kernel of degree 2) at K = 5 folds, the polynomial kernel of degree 2 achieves the best mean accuracy: 0.9866666666666667, compared to 0.9800000000000001 for both other kernels. The ranking depends on K, however: with K = 3 the linear kernel reaches 0.9933333333333333, so differences this small on a 150-sample dataset should be read with caution.
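
The same comparison can be run in one step with `GridSearchCV`, which cross-validates every candidate on the same folds. This is a sketch assuming the three candidates used in this notebook and K = 5; the names `param_grid` and `search` are choices made for this example.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# The three candidates compared in this notebook
param_grid = [
    {'kernel': ['linear']},
    {'kernel': ['poly'], 'degree': [2, 3]},
]

# Each candidate is evaluated with 5-fold cross validation
search = GridSearchCV(svm.SVC(C=1), param_grid, cv=5)
search.fit(iris.data, iris.target)

for params, mean_score in zip(search.cv_results_['params'],
                              search.cv_results_['mean_test_score']):
    print(params, round(mean_score, 4))

print(search.best_params_, search.best_score_)
```

Because the best candidate is picked using these same cross-validation scores, `best_score_` tends to be slightly optimistic; a separate held-out test set would be needed for an unbiased final estimate.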