# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Ungraded Additional Notebook: Cross Validation

## Learning Objective

At the end of the experiment, you will be able to:

* Split the data into train and test sets
* Split dataset into k consecutive folds

### Dataset Description

**Penguins dataset**

<center><img src='https://www.gabemednick.com/post/penguin/featured_hu23a2ff6767279debab043a6c8f0a6157_878472_720x0_resize_lanczos_2.png'></center>

<center><img src='https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png' width=450px></center>

The **Penguins dataset** consists of the following 7 columns:

- **species:** *penguin species* (Chinstrap, Adélie, or Gentoo)
- **culmen_length_mm & culmen_depth_mm:** *culmen length and depth (the culmen is the upper ridge of a bird's beak)*
- **flipper_length_mm:** *flipper length*
- **body_mass_g:** *body mass*
- **island:** *island name* (Dream, Torgersen, or Biscoe)
- **sex:** *penguin sex*

In [None]:
#@title Download Data
from IPython.display import clear_output
!wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/penguins.csv
clear_output()
print("Data downloaded successfully!")
!ls | grep '.csv'

## Import required packages

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

In [None]:
# Load the data
df = pd.read_csv('/content/penguins.csv')
df.head(5)

In [None]:
# Get the shape of the dataset
df.shape

In [None]:
df.columns

In [None]:
# Store the features and the target values in two separate variables, x and y
x = df[['island', 'culmen_length_mm', 'culmen_depth_mm','flipper_length_mm', 'body_mass_g', 'sex']]
y = df['species']

In [None]:
# # To be used for K-Fold
# x1 = x
# y1 = y

In [None]:
x.shape, y.shape

In [None]:
# Split the data into train and test sets in the ratio 80:20,
# i.e., 80% of the data is the train set and 20% is the test set.
# stratify=y keeps the species proportions the same in both sets.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state=42, stratify=y)

x_train.shape, x_test.shape, y_train.shape, y_test.shape
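
The effect of `stratify` in the split above can be checked on a small example. The following is a minimal sketch on made-up labels (the class names `A`, `B`, `C`, the array names, and the helper `class_fractions` are illustrative assumptions, not part of the Penguins data); it compares class fractions in the full, train, and test label sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 of class "A", 30 of "B", 10 of "C"
labels = np.array(["A"] * 60 + ["B"] * 30 + ["C"] * 10)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

def class_fractions(values):
    """Return {class: fraction of samples} for an array of labels."""
    classes, counts = np.unique(values, return_counts=True)
    return {str(c): float(n) / len(values) for c, n in zip(classes, counts)}

print("Full :", class_fractions(labels))
print("Train:", class_fractions(l_train))
print("Test :", class_fractions(l_test))
```

With stratification, the three printed dictionaries should show the same class fractions; without it, the test fractions can drift from the full-data fractions, especially for rare classes.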

## Understand the k-fold data split using dummy data

<img src="https://miro.medium.com/v2/resize:fit:1192/1*dldTNMhgjNNeu7d0OmNPCA.png" width=750px>

In [None]:
# Sample dummy data
data = np.array(["yellow", "blue", "pink", "white", "red", "violet", "orange", "green"])

In [None]:
# Set up KFold for 4 splits
kf = KFold(n_splits=4)

# Iterate over the k splits; each split yields the train and test indices
for i, (trainindex, testindex) in enumerate(kf.split(data), start=1):
    print("Split", i, ":")
    print("Training index", trainindex, "Training set is:", data[trainindex])
    print("Testing index", testindex, "Testing set is:", data[testindex])
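
A property worth confirming from the splits above is that every sample lands in the test set exactly once across the k folds, and that train and test indices never overlap within a split. A minimal sketch of that check, using integer stand-ins for the 8 dummy values (the names `samples`, `kf_check`, and `test_counts` are illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(8)  # stand-in for the 8 dummy colours
kf_check = KFold(n_splits=4)

# Count how many times each sample is used for testing
test_counts = np.zeros(len(samples), dtype=int)
for train_idx, test_idx in kf_check.split(samples):
    # Train and test indices are disjoint within a split
    assert len(np.intersect1d(train_idx, test_idx)) == 0
    test_counts[test_idx] += 1

print(test_counts)  # each entry is 1: every sample is tested exactly once
```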

## Implementing the k-fold data split on **Penguins data**

In [None]:
kf = KFold(n_splits=4)
x_data = x.values

# Iterate over the k splits; each split yields positional train and test indices.
# Note: this reuses (and overwrites) the x_train, x_test, y_train, y_test names.
for i, (train_index, test_index) in enumerate(kf.split(x_data), start=1):
    x_train, x_test = x_data[train_index], x_data[test_index]
    # Use .iloc because KFold returns positions, not index labels
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    print("Split", i, ":")
    print("Training labels:", y_train.values)
    print("Testing labels:", y_test.values)
    print("=" * 150)