<a href="https://www.kaggle.com/code/zarna99/iris-cross-validation?scriptVersionId=106283648" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<h1><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#477482  ;
           font-size:400%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 20px;
              color:white;
               text-align:center">
    <b>
        Cross-Validation
        </b>
</p>
</div></h1>

<img src='https://www.mihaileric.com/static/model-selection-meme-bd4a6a86f615583d1a1bbc497ca4640e-67414.jpeg' width="900" height="600" align="middle">

# **"Deploying a model without validating it on a test dataset would be the greatest act of stupidity in the process of machine learning."**

### Table of Contents

* [1. Introduction](#1)
* [2. Importing necessary Libraries](#2)
* [3. Importing the Dataset](#3)
* [4. Basic information about the dataset](#4)
* [5. Label Encoding](#5)
* [6. Defining the explanatory variables and target variables](#6)
* [7. Creating the model](#7)
* [8. Cross Validation](#8)
     * [8.1 Method 1 : Holding out cross validation](#8.1)
     * [8.2 Method 2 : K-Folds cross validation](#8.2)
     * [8.3 Method 3 : Leave One Out cross validation](#8.3)

<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#749FAD  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Introduction<a class="anchor" id="1"></a>
        </b>
</p>
</div></h2>


In machine learning, we always need to test how stable a model is. We cannot fit a model on the training dataset alone and then deploy it for further analysis.
You need some assurance that the model has captured most of the patterns in the data correctly and is not picking up too much of the noise; in other words, that it is low on both bias and variance.
For this purpose, we reserve a sample of the dataset that is not part of the training data and test the model on that sample before deployment. This whole process is called cross-validation.
In this notebook we shall see three methods of cross-validation:
*   Holding out cross validation
*   k-folds cross validation
*   Leave one out cross validation


<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#749FAD  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Importing necessary libraries<a class="anchor" id="2"></a>
        </b>
</p>
</div></h2>


In [1]:
#importing the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree, preprocessing
from sklearn.metrics import classification_report, accuracy_score

<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#749FAD  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Importing the dataset<a class="anchor" id="3"></a>
        </b>
</p>
</div></h2>



*   The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor).

*   Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

In [2]:
data = pd.read_csv("../input/iris-flower-dataset/IRIS.csv")
data

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#749FAD  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Getting the basic information of the data<a class="anchor" id="4"></a>
        </b>
</p>
</div></h2>

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Looking at the basic information, there are no null values in the dataset.

In [4]:
data["species"].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: species, dtype: int64

The dataset has an equal number of data points for each of the three categories, hence it is a balanced dataset.

<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#749FAD  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Converting the species into labels<a class="anchor" id="5"></a>
        </b>
</p>
</div></h2>

Most models work with numeric values rather than strings, so we shall convert the species names into integer labels. "LabelEncoder" assigns the labels in sorted order of the class names, which gives:
*   0 : setosa
*   1 : versicolor
*   2 : virginica



In [5]:
#converting into labels
label_encoder = preprocessing.LabelEncoder()
data['species']= label_encoder.fit_transform(data['species'])
data["species"].value_counts()

0    50
1    50
2    50
Name: species, dtype: int64
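
To confirm which integer stands for which species, the fitted encoder can be inspected. The following is a minimal sketch, assuming a small hand-made list of species names (`species_names` is illustrative only and is not the notebook's data):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative input only: a few species strings in arbitrary order.
species_names = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species_names)

# classes_ holds the original names in sorted order;
# the position of a name in classes_ is its integer label.
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer labels back to the original strings.
decoded = [str(name) for name in encoder.inverse_transform(encoded)]
print(decoded)
```

Keeping the fitted encoder around is useful later, since `inverse_transform` turns predicted integers back into readable species names.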

<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#749FAD  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Defining the explanatory variables and the target variable<a class="anchor" id="6"></a>
        </b>
</p>
</div></h2>

In [6]:
#defining x and y variables
x=data.iloc[:,0:4]
y=data['species']

<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#749FAD  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Creating the model<a class="anchor" id="7"></a>
        </b>
</p>
</div></h2>

In [7]:
model = DecisionTreeClassifier(criterion = 'gini', max_depth=3)

<h2><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#749FAD  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Cross Validation<a class="anchor" id="8"></a>
        </b>
</p>
</div></h2>



Up to now, we have created the decision tree model. Now it's time to train the model on the dataset. At this point we need to make sure that the model is trained properly, so that it is able to learn the patterns in the data. For this purpose we shall apply cross-validation.

<h3><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#96BFCC  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Method 1 : Holding out Cross Validation<a class="anchor" id="8.1"></a>
        </b>
</p>
</div></h3>




Holding out is the traditional approach to cross-validation, where the data is simply divided into two parts: a training part and a testing part. The split can be 70-30, 60-40, 75-25, 80-20, or even 50-50, depending on the use case. During the training phase, we show only the training part of the dataset to the model, and the model tries to learn the patterns in it. After the training phase, the model is tested on the data points in the testing part. The predictions for those data points are compared with the actual values, which lets us evaluate how well the model deals with unseen data.

<img src='https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/07/15185319/blogs-15-7-2020-02-1024x565.jpg'>

Following is the code for holding out method:

---



We start by splitting the dataset into training and testing parts. For this we use "train_test_split" from the sklearn.model_selection module.

In [8]:
# Splitting data into training and testing data set
# from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=40)

In [9]:
print("x_train: " ,x_train.shape)
print("x_test: " ,x_test.shape)
print("y_train: " ,y_train.shape)
print("y_test: " ,y_test.shape)

x_train:  (120, 4)
x_test:  (30, 4)
y_train:  (120,)
y_test:  (30,)


The train-test split divided the dataset into two parts. Of the 150 data points in total, 120 were assigned to the training set and the remaining 30 were kept for testing.

We shall now fit the model only on the training dataset.

In [10]:
model.fit(x_train,y_train)

DecisionTreeClassifier(max_depth=3)

Using the fitted model, we now make predictions for the test dataset.

In [11]:
#Predicting on test data
preds = model.predict(x_test) 
preds

array([0, 1, 2, 2, 1, 2, 1, 1, 1, 0, 1, 0, 0, 1, 1, 2, 2, 2, 1, 1, 2, 2,
       1, 0, 1, 0, 0, 2, 0, 1])

Comparing the class counts of the predictions with those of the actual labels.

In [12]:
pd.Series(preds).value_counts() # getting the count of each category 

1    13
2     9
0     8
dtype: int64

In [13]:
y_test.value_counts()

1    12
2    10
0     8
Name: species, dtype: int64
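
Class counts alone do not show which individual points were misclassified. A confusion matrix compares predictions and true labels point by point. The sketch below is self-contained and assumes scikit-learn's bundled copy of the Iris data (`load_iris`) in place of the CSV file, and a fixed `random_state` on the tree; both are assumptions, so its numbers may differ from the outputs above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris stands in for the notebook's IRIS.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

# random_state on the tree is an added assumption to make the run repeatable.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, preds, labels=[0, 1, 2])
print(cm)

# Off-diagonal entries are the misclassified test points.
n_errors = int(cm.sum() - cm.trace())
print("misclassified:", n_errors)
```

Each off-diagonal cell tells us which true class was confused with which predicted class, which the two `value_counts()` outputs cannot show.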

Getting the accuracy of the model on the test dataset

In [14]:
accuracy_score(y_test,preds)*100

96.66666666666667

The accuracy of the model is about 96.67% (29 of the 30 test points are classified correctly), which tells us that the model also works quite well on unseen data points.

Advantages and Drawbacks of Holding out Method:

---


*   Advantages:
> One of the major advantages of this method is that it is computationally inexpensive compared to other cross-validation techniques.

*   Drawbacks:
> In the hold-out method, the estimated test error is highly variable (high variance) because it depends entirely on which observations end up in the training set and which in the test set. In addition, only a part of the data is used to train the model (high bias), which is not a good idea when the dataset is small and tends to lead to overestimation of the test error.
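
The dependence on the split can be checked by repeating the hold-out procedure with different random seeds. This is a rough sketch, assuming scikit-learn's bundled Iris data (`load_iris`) instead of the CSV file; the ten seeds and the fixed `random_state` on the tree are arbitrary choices made for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris stands in for the notebook's IRIS.csv.
X, y = load_iris(return_X_y=True)

holdout_scores = []
for seed in range(10):  # ten different random 80-20 splits (arbitrary choice)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(X_train, y_train)
    # score() returns the accuracy on the held-out part.
    holdout_scores.append(clf.score(X_test, y_test))

print("lowest accuracy: ", min(holdout_scores))
print("highest accuracy:", max(holdout_scores))
```

Any gap between the lowest and the highest accuracy comes only from which points landed in the test set, since the model and the data are the same in every run.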


<h3><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#96BFCC  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Method 2 : k-folds Cross Validation<a class="anchor" id="8.2"></a>
        </b>
</p>
</div></h3>






Since there is rarely enough data to train a model, removing a part of it for validation creates a risk of underfitting. By reducing the training data, we risk losing important patterns and trends in the dataset, which in turn increases the error induced by bias. So what we require is a method that provides ample data for training the model and also leaves ample data for validation. K-fold cross-validation does exactly that.

In this resampling technique, the whole dataset is divided into k sets (folds) of almost equal size. In the first iteration, the first set is used as the test set and the model is trained on the remaining k-1 sets. In the second iteration, the second set is used as the test set and the remaining k-1 sets are used to train the model. This continues until every set has served as the test set once. At the end, we average the scores of the k fitted models to get the accuracy score of the model.
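
The way the folds rotate can be seen on a toy example. This sketch uses `KFold` from scikit-learn on ten made-up observations (`samples` is purely illustrative) and prints which indices are used for training and testing in each iteration.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # ten toy observations, numbered 0..9
kf = KFold(n_splits=5)   # no shuffling, so the folds are contiguous blocks

folds = []
for train_idx, test_idx in kf.split(samples):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx, "test:", test_idx)
```

Every index appears in exactly one test fold and in the training part of the other four iterations.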

<img src='https://www.datasciencecentral.com/wp-content/uploads/2021/10/k-fold_cross_validation_en-1.jpg' width="900" height="500">


Following is the code for k-fold method:

---


For k-fold cross-validation we have the "cross_val_score" function from the sklearn.model_selection module.

In [15]:
scores = cross_val_score(model, x, y, cv=5)
scores

array([0.96666667, 0.96666667, 0.93333333, 1.        , 1.        ])

In [16]:
scores.mean()*100

97.33333333333334

The accuracy of the model comes out to be 97.33%, which is slightly higher than with the hold-out method.
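
When an integer is passed as `cv` and the estimator is a classifier, `cross_val_score` uses stratified folds (`StratifiedKFold`, without shuffling). The loop it runs can be written out by hand. The sketch below assumes scikit-learn's bundled Iris data (`load_iris`) instead of the CSV file and a fixed `random_state` on the tree so that both computations are repeatable; its scores may differ from the output above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris stands in for the notebook's IRIS.csv.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)

skf = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    fold_model = clone(clf)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    # Accuracy on the fold that was held out in this iteration.
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
auto_scores = cross_val_score(clf, X, y, cv=5)

print("manual loop:    ", manual_scores)
print("cross_val_score:", auto_scores)
```

Writing the loop out makes it clear that five separate models are fitted and that the reported accuracy is the mean of five per-fold accuracies.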

Advantages and Drawbacks of k-fold method:

---



*   Advantages:
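> The best part about this method is that each data point gets to be in the test set exactly once and in the training set k-1 times. Averaging over the k folds makes the estimate less dependent on any single split than the hold-out estimate (lower variance). As the number of folds k increases, each training set gets closer to the full dataset, so the bias of the estimate also decreases.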

*   Drawbacks:
> The major disadvantage of this method is that the model has to be fitted from scratch k times, which makes it computationally more expensive than the hold-out method.



<h3><div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#96BFCC  ;
           font-size:300%;
           font-family:Sans-serif;
           letter-spacing:0.5px">

<p style="padding: 15px;
              color:white;
               text-align:center">
    <b>
        Method 3 : Leave one out Cross Validation<a class="anchor" id="8.3"></a>
        </b>
</p>
</div></h3>




LOOCV (Leave-One-Out Cross-Validation) is a cross-validation approach in which each observation in turn serves as the validation set and the remaining N-1 observations form the training set. In each iteration, the model is fitted on the N-1 training observations and used to predict the single held-out observation. This process is repeated N times and the average over all iterations is calculated. LOOCV is a special case of k-fold cross-validation in which the number of folds is the same as the number of observations (K = N).

<img src='https://d2mk45aasx86xg.cloudfront.net/image2_11zon_cac3fb4270.webp'>

Following is the code for Leave one out method:

---

For leave-one-out cross-validation we have the "LeaveOneOut" class from the sklearn.model_selection module. The code stays the same as for k-fold, but instead of passing the number of splits to the "cv" parameter, we pass a "LeaveOneOut()" instance.

In [17]:
scores1 = cross_val_score(model, x, y, cv=LeaveOneOut())
print(scores1)

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]


In [18]:
scores1.mean()*100

94.66666666666667

The accuracy of the model comes out to be 94.67% (142 of the 150 observations are classified correctly), which is quite good but lower than that of the previous methods.
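
That leave-one-out is k-fold with K = N can be checked directly. The sketch below assumes scikit-learn's bundled Iris data (`load_iris`) instead of the CSV file and a fixed `random_state` on the tree so that both runs are repeatable; its scores may differ from the output above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_iris stands in for the notebook's IRIS.csv.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)

loo = LeaveOneOut()
n_iterations = loo.get_n_splits(X)  # one iteration per observation

loo_scores = cross_val_score(clf, X, y, cv=loo)
# K-fold with as many folds as observations yields the same one-point test sets.
kfold_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=len(X)))

# Each per-iteration score is 0 or 1, so the mean is the share predicted correctly.
n_wrong = int((loo_scores == 0).sum())
print("iterations:", n_iterations)
print("misclassified:", n_wrong)
print("accuracy:", loo_scores.mean() * 100)
```

Because every test set holds a single observation, each score is either 0 or 1, which is why the printed score array above consists only of those two values.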

Advantages and Drawbacks of Leave one Out method:

---



*   Advantages:
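> This method helps to reduce bias and randomness. Each model is trained on N-1 observations, which is almost the whole dataset, so the estimate has low bias. The split itself involves no randomness, so repeated runs use exactly the same folds.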

*   Drawbacks:
> LOOCV can have high variance because we are averaging the outputs of n models that are fitted on almost identical sets of observations, so their outputs are highly positively correlated with each other. It is also computationally expensive, as the model is fitted n times, once for every observation in the data.