## Learning Objective

By the end of this experiment, you will be able to:

* Split the data into train and test sets
* Split a dataset into k consecutive folds

### Description

The dataset consists of the following 7 columns:

- **species:** penguin species (Chinstrap, Adélie, or Gentoo)
- **culmen_length_mm & culmen_depth_mm:** length and depth of the culmen (the upper ridge of a bird's beak), in millimetres
- **flipper_length_mm:** flipper length, in millimetres
- **body_mass_g:** body mass, in grams
- **island:** island name (Dream, Torgersen, or Biscoe)
- **sex:** penguin sex

In [1]:
#@title Download Data
!wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/penguins.csv

--2023-04-18 10:25:05--  https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/penguins.csv
Resolving cdn.iiith.talentsprint.com (cdn.iiith.talentsprint.com)... 172.105.52.210
Connecting to cdn.iiith.talentsprint.com (cdn.iiith.talentsprint.com)|172.105.52.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13519 (13K) [application/octet-stream]
Saving to: ‘penguins.csv’


2023-04-18 10:25:07 (274 MB/s) - ‘penguins.csv’ saved [13519/13519]



## Import required packages

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [3]:
# Load the data
df = pd.read_csv('/content/penguins.csv')
df.head(5)

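|   | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---------|--------|------------------|-----------------|-------------------|-------------|-----|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |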


In [4]:
# Get the shape of the dataset
df.shape

(344, 7)

In [5]:
df.columns

Index(['species', 'island', 'culmen_length_mm', 'culmen_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

In [6]:
# Store the features and the target in two separate variables, x and y
x = df[['island', 'culmen_length_mm', 'culmen_depth_mm','flipper_length_mm', 'body_mass_g', 'sex']]
y = df['species']

In [7]:
# # To be used for K-Fold
# x1 = x
# y1 = y

In [8]:
x.shape, y.shape

((344, 6), (344,))

In [9]:
# Split the data into train and test sets in an 80:20 ratio,
# i.e. 80% of the rows form the train set and 20% form the test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [10]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((275, 6), (69, 6), (275,), (69,))
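
The split above is random, so the rows that land in each part change from run to run. As a minimal sketch on made-up data (not the penguins file; `demo_x`, `demo_y` and the class names `A`, `B`, `C` are hypothetical), passing `random_state` makes the split repeatable, and `stratify` keeps the class proportions the same in the train and test sets:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows with three imbalanced classes
rng = np.random.default_rng(0)
demo_x = pd.DataFrame({"feature": rng.normal(size=100)})
demo_y = pd.Series(["A"] * 50 + ["B"] * 30 + ["C"] * 20, name="label")

# random_state fixes the shuffle; stratify preserves the class proportions
dx_train, dx_test, dy_train, dy_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

print(dx_train.shape, dx_test.shape)
print(dy_test.value_counts().to_dict())
```

Whether stratification is appropriate depends on the task. It is mainly useful when some classes are rare and could otherwise be missing from the test set.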

## Understand the k-fold data split

In [11]:
t = np.array(["yellow", "blue", "pink", "white", "red", "violet", "orange", "green"])

In [12]:
# Configure KFold for 4 splits
kf = KFold(n_splits=4)

# kf.split yields one (train indices, test indices) pair per fold
for i, (trainindex, testindex) in enumerate(kf.split(t)):
    print("Round :", i, ":")
    print("Training index", trainindex, "Training set is:", t[trainindex])
    print("Testing index", testindex, "Testing set is:", t[testindex])

Round : 0 :
Training index [2 3 4 5 6 7] Training set is: ['pink' 'white' 'red' 'violet' 'orange' 'green']
Testing index [0 1] Testing set is: ['yellow' 'blue']
Round : 1 :
Training index [0 1 4 5 6 7] Training set is: ['yellow' 'blue' 'red' 'violet' 'orange' 'green']
Testing index [2 3] Testing set is: ['pink' 'white']
Round : 2 :
Training index [0 1 2 3 6 7] Training set is: ['yellow' 'blue' 'pink' 'white' 'orange' 'green']
Testing index [4 5] Testing set is: ['red' 'violet']
Round : 3 :
Training index [0 1 2 3 4 5] Training set is: ['yellow' 'blue' 'pink' 'white' 'red' 'violet']
Testing index [6 7] Testing set is: ['orange' 'green']
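
Because `KFold` without shuffling cuts the data into consecutive blocks, the order of the rows matters. The sketch below uses hypothetical labels sorted by class (`labels` and `features` are made up for illustration) to compare plain consecutive folds with `StratifiedKFold`, which is one possible remedy when the data is ordered by label:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels sorted by class: six samples each of "a", "b" and "c"
labels = np.array(["a"] * 6 + ["b"] * 6 + ["c"] * 6)
features = np.arange(len(labels)).reshape(-1, 1)

# Consecutive folds: each test fold is one contiguous block of rows
plain = KFold(n_splits=3)
plain_test_classes = [
    np.unique(labels[test]).tolist() for _, test in plain.split(features)
]

# Stratified folds: each test fold receives samples from every class
strat = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
strat_test_classes = [
    np.unique(labels[test]).tolist() for _, test in strat.split(features, labels)
]

print("KFold test classes:          ", plain_test_classes)
print("StratifiedKFold test classes:", strat_test_classes)
```

With the sorted labels, every plain `KFold` test fold contains a single class, while every stratified test fold contains all three. Passing `shuffle=True` to `KFold` is another option, although it does not guarantee balanced folds.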


## Implementing the k-fold data split on the penguins data

In [13]:
kf = KFold(n_splits=4)
x_data = x.values

# kf.split yields one (train indices, test indices) pair per fold
for i, (train_index, test_index) in enumerate(kf.split(x_data), start=1):
    x_train, x_test = x_data[train_index], x_data[test_index]
    # Use .iloc so the labels are selected by position, matching the rows of x_data
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("Split :", i, ":")
    print("Distribution of Training labels:", y_train.values)
    print("Distribution of Testing labels:", y_test.values)
    print("--------------------------------------")

Split : 1 :
Distribution of Training labels: ['Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie'
 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie' 'Adelie'
 'Adelie' 'Adelie' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap'
 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap'
 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap'
 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap'
 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap' 'Chinstrap'
 'Chinstrap'