# CSE 5819 Assignment #2

**by: Aayushi Verma (uef24001)**

*due: Wed 9/11/24*

**Part 2 [Programming] (20 pts)** 

**Write a .ipynb code that can run stratified partitioning on Google Colab. This code can load the Seaborn Iris dataset and split the dataset into two halves that have the same proportion of flower samples in each species as that in the full dataset. The iris dataset has three flower species: setosa, versicolor, and virginica. A stratified partition of a dataset divides data into subsets based on some data attributes (here we use the species labels) and ensures that each subset has a similar (or the same) proportion of data points from each category or class. (The dataset has four features: sepal_length, sepal_width, petal_length, petal_width, and one target: species.)**

In [1]:
# importing packages
import seaborn as sns
import pandas as pd
import numpy as np

First we load the dataset using the Seaborn library. 

**Note to TA:** I had trouble loading the dataset from Seaborn (an SSL certificate error), so I downloaded a local CSV copy of the dataset instead. For the Google Colab submission, I will comment out the code that uses the local copy and keep the original Seaborn line, in the hope that it works on your end.

In [2]:
# downloading the dataset from the Seaborn library
# iris = sns.load_dataset('iris')

In [3]:
# loading the iris dataset from my locally-downloaded version
iris = pd.read_csv('iris.csv')
# renaming the columns to match the Seaborn version
iris.rename(columns={
    'SepalLengthCm': 'sepal_length',
    'SepalWidthCm': 'sepal_width',
    'PetalLengthCm': 'petal_length',
    'PetalWidthCm': 'petal_width',
    'Species': 'species'
}, inplace=True)

# dropping the Id column since Seaborn version doesn't have it
iris.drop(columns=['Id'], inplace=True)
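The printed outputs in this notebook show species labels with an `Iris-` prefix, whereas the assignment text names the species as setosa, versicolor, and virginica. The following is a minimal sketch of one way to bring a local CSV fully in line with that naming. It assumes the CSV uses the column names from the rename cell above; the helper name `normalize_iris` and the two-row stand-in frame are hypothetical and only serve as illustration.

```python
import pandas as pd

# Column mapping taken from the rename cell above.
LOCAL_TO_SEABORN = {
    'SepalLengthCm': 'sepal_length',
    'SepalWidthCm': 'sepal_width',
    'PetalLengthCm': 'petal_length',
    'PetalWidthCm': 'petal_width',
    'Species': 'species',
}

def normalize_iris(df):
    """Hypothetical helper: return a copy with Seaborn-style names and labels."""
    out = df.rename(columns=LOCAL_TO_SEABORN)
    # Drop the Id column if present; ignore it otherwise.
    out = out.drop(columns=['Id'], errors='ignore')
    # Strip the 'Iris-' prefix so labels read 'setosa', 'versicolor', 'virginica'.
    out['species'] = out['species'].str.replace('Iris-', '', regex=False)
    return out

# Tiny stand-in frame (not the real dataset) to show the effect.
raw = pd.DataFrame({
    'Id': [1, 2],
    'SepalLengthCm': [5.1, 6.3],
    'SepalWidthCm': [3.5, 3.3],
    'PetalLengthCm': [1.4, 6.0],
    'PetalWidthCm': [0.2, 2.5],
    'Species': ['Iris-setosa', 'Iris-virginica'],
})
clean = normalize_iris(raw)
print(clean)
```

Stripping the prefix is optional for the partitioning itself, since the grouping only needs the labels to be consistent within the dataset.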

In [4]:
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [5]:
# shuffling the dataset to randomize the order
iris = iris.sample(frac=1, random_state=42).reset_index(drop=True)
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 6.1 | 2.8 | 4.7 | 1.2 | Iris-versicolor |
| 1 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa |
| 2 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
| 3 | 6.0 | 2.9 | 4.5 | 1.5 | Iris-versicolor |
| 4 | 6.8 | 2.8 | 4.8 | 1.4 | Iris-versicolor |


In [6]:
# grouping the data by species
species_groups = iris.groupby('species')

Now we perform the stratified partitioning. Stratified partitioning means splitting the dataset randomly into train and test sets while ensuring that each group (in this case, each of the three species) makes up the same proportion of each set as it does of the full dataset. We will use an 80% train / 20% test split, a common choice in ML applications.

In [7]:
# creating lists to store the train and test data
train_set = []
test_set = []

In [8]:
# iterating over each of the three species groups
for species, group in species_groups:
    # setting the number of data points in the train set
    n_samples = len(group)
    n_train = int(n_samples * 0.8)
    
    # splitting the current species group into train/test groups
    train_group = group.iloc[:n_train]
    test_group = group.iloc[n_train:]
    
    # adding the train/test data points into their respective lists
    train_set.append(train_group)
    test_set.append(test_group)

In [9]:
# creating a pandas df for each train/test
train_set = pd.concat(train_set).reset_index(drop=True)
test_set = pd.concat(test_set).reset_index(drop=True)
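The loop above can also be wrapped in a reusable function with the split fraction as a parameter. The sketch below is not the method used in the cells above: instead of shuffling first and slicing each group, it draws a random sample within each class using `DataFrameGroupBy.sample` (available in pandas 1.1 and later). The function name `stratified_split` and the toy data are hypothetical. Since the assignment text asks for two halves, the example uses a fraction of 0.5.

```python
import pandas as pd

def stratified_split(df, label_col, frac=0.8, seed=42):
    """Hypothetical helper: split df into two parts with matching class proportions.

    Assumes df has a unique index, so the sampled rows can be dropped by label.
    """
    # Sample `frac` of the rows inside each class separately.
    first = df.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)
    # Everything that was not sampled goes to the second part.
    second = df.drop(first.index)
    return first.reset_index(drop=True), second.reset_index(drop=True)

# Toy data (not the iris dataset): three classes with 10 rows each.
toy = pd.DataFrame({
    'species': ['a'] * 10 + ['b'] * 10 + ['c'] * 10,
    'x': range(30),
})

half_a, half_b = stratified_split(toy, 'species', frac=0.5)
print(half_a['species'].value_counts().sort_index())
print(half_b['species'].value_counts().sort_index())
```

With the iris data frame from this notebook, the equivalent call would be `stratified_split(iris, 'species', frac=0.5)` for two halves, or `frac=0.8` for the 80/20 split used here.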

In [10]:
train_set.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa |
| 1 | 5.4 | 3.4 | 1.5 | 0.4 | Iris-setosa |
| 2 | 4.8 | 3.0 | 1.4 | 0.1 | Iris-setosa |
| 3 | 5.5 | 3.5 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |


In [11]:
test_set.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 5.1 | 3.8 | 1.6 | 0.2 | Iris-setosa |
| 3 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
| 4 | 5.3 | 3.7 | 1.5 | 0.2 | Iris-setosa |


Now we check the results of the stratified partitioning. First we count the data points per species in the train set, and then we compare them with the counts in the test set.

In [12]:
# checking the number of data points per species group in Train df
train_set['species'].value_counts()

species
Iris-setosa        40
Iris-versicolor    40
Iris-virginica     40
Name: count, dtype: int64

In [13]:
# checking the number of data points per species group in Test df
test_set['species'].value_counts()

species
Iris-setosa        10
Iris-versicolor    10
Iris-virginica     10
Name: count, dtype: int64
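The raw counts above (40 and 10 per species) already suggest that both sets are balanced. A more direct check compares class proportions against the full dataset, for example with `value_counts(normalize=True)`. The sketch below uses stand-in frames built from the counts reported above, assuming 50 rows per species in the full dataset (40 + 10); the helper names `class_proportions` and `same_proportions` are hypothetical.

```python
import pandas as pd

def class_proportions(df, label_col):
    """Hypothetical helper: fraction of rows per class, sorted by class label."""
    return df[label_col].value_counts(normalize=True).sort_index()

def same_proportions(part, full, label_col, tol=1e-9):
    """True if every class has the same share in `part` as in `full` (within tol)."""
    diff = class_proportions(part, label_col) - class_proportions(full, label_col)
    return bool(diff.abs().max() < tol)

labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

def make_frame(n_per_class):
    # Stand-in frame holding only the label column, n_per_class rows per species.
    return pd.DataFrame({'species': [lab for lab in labels for _ in range(n_per_class)]})

full = make_frame(50)   # assumed size of the full dataset per species
train = make_frame(40)  # counts reported for the train set
test = make_frame(10)   # counts reported for the test set

train_matches = same_proportions(train, full, 'species')
test_matches = same_proportions(test, full, 'species')
print(class_proportions(train, 'species'))
print(train_matches, test_matches)
```

In the notebook itself, the same check could be run directly on `iris`, `train_set`, and `test_set` in place of the stand-in frames.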