# SIB - Portfolio of Machine Learning Algorithms

## Exercise 6: Implementing stratified splitting

### 6.1) 
In the "split.py" module of the "model_selection" subpackage, add the "stratified_train_test_split" function.

def stratified_train_test_split(dataset, test_size, random_state):
- arguments:
  - dataset – the Dataset object to split into training and testing data
  - test_size – the proportion of samples assigned to the testing Dataset (e.g., 0.2 for 20%)
  - random_state – seed for generating permutations
- expected output:
  - A tuple containing the stratified train and test Dataset objects
- algorithm:
  - Get unique class labels and their counts
  - Initialize empty lists for train and test indices
  - Loop through unique labels:
    - Calculate the number of test samples for the current class
    - Shuffle and select indices for the current class and add them to the test indices
    - Add the remaining indices to the train indices
  - After the loop, create training and testing datasets
  - Return the training and testing datasets
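
The steps above could be sketched roughly as follows. This is a minimal sketch, not the reference solution: the `Dataset` class here is a hypothetical stand-in that only carries `X`, `y`, `features`, and `label` so the block runs on its own, and the default values for `test_size` and `random_state` are assumptions. The per-class test count is truncated with `int(...)`, and a local `np.random.default_rng` generator is used for shuffling; both are choices made for illustration, since the exercise does not specify them.

```python
import numpy as np


class Dataset:
    """Hypothetical minimal stand-in for the package's Dataset class."""

    def __init__(self, X, y, features=None, label=None):
        self.X = np.asarray(X)
        self.y = np.asarray(y)
        self.features = features
        self.label = label

    def shape(self):
        return self.X.shape


def stratified_train_test_split(dataset, test_size=0.2, random_state=42):
    # Assumption: a local generator seeded with random_state drives the shuffling
    rng = np.random.default_rng(random_state)

    # Get unique class labels and their counts
    labels, counts = np.unique(dataset.y, return_counts=True)

    # Initialize empty lists for train and test indices
    train_idx, test_idx = [], []

    for label, count in zip(labels, counts):
        # Number of test samples for the current class (assumption: truncated)
        n_test = int(count * test_size)

        # Shuffle the indices that belong to the current class
        class_idx = rng.permutation(np.where(dataset.y == label)[0])

        # The first n_test shuffled indices go to test, the rest to train
        test_idx.extend(class_idx[:n_test])
        train_idx.extend(class_idx[n_test:])

    # dtype=int keeps the index arrays valid even when a list is empty
    train_idx = np.array(train_idx, dtype=int)
    test_idx = np.array(test_idx, dtype=int)

    # Create the training and testing datasets
    train = Dataset(dataset.X[train_idx], dataset.y[train_idx],
                    dataset.features, dataset.label)
    test = Dataset(dataset.X[test_idx], dataset.y[test_idx],
                   dataset.features, dataset.label)
    return train, test
```

Because the split is computed class by class, each class contributes the same fraction of its samples to the test set, which is what keeps the class distribution of the original dataset in both parts.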


### 6.2) 
Test the "stratified_train_test_split" function with the iris dataset.

In [1]:
import sys
sys.path.append('C:/Users/dases/Desktop/SI/repositorio/si-2/src')

import numpy as np
from si.io.csv_file import read_csv
from si.model_selection.split import train_test_split, stratified_train_test_split

# Load the iris dataset
iris = read_csv("../datasets/iris/iris.csv", features=True, label=True)


In [4]:
unique_labels, counts = np.unique(iris.y, return_counts=True)
print("Class distribution in the original dataset:")
for label, count in zip(unique_labels, counts):
    print(f"Class {label}: {count} samples")

Class distribution in the original dataset:
Class Iris-setosa: 50 samples
Class Iris-versicolor: 50 samples
Class Iris-virginica: 50 samples


In [6]:
train_dataset, test_dataset = stratified_train_test_split(iris, test_size=0.2, random_state=123)


print("Train set size:", train_dataset.shape()[0])
print("Test set size:", test_dataset.shape()[0])


unique_train, counts_train = np.unique(train_dataset.y, return_counts=True)
unique_test, counts_test = np.unique(test_dataset.y, return_counts=True)
print("Train set class distribution:", dict(zip(unique_train, counts_train)))
print("Test set class distribution:", dict(zip(unique_test, counts_test)))

Train set size: 120
Test set size: 30
Train set class distribution: {'Iris-setosa': np.int64(40), 'Iris-versicolor': np.int64(40), 'Iris-virginica': np.int64(40)}
Test set class distribution: {'Iris-setosa': np.int64(10), 'Iris-versicolor': np.int64(10), 'Iris-virginica': np.int64(10)}
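
The counts above correspond to one third of the samples per class in both the train and the test set, matching the original dataset. One way to make that check explicit is to compare class proportions rather than raw counts. The helper below is a hypothetical illustration (the name `class_proportions` is not part of the package) and works on any array of labels, such as `train_dataset.y` or `test_dataset.y`.

```python
import numpy as np


def class_proportions(y):
    """Return a dict mapping each class label to its fraction of the samples."""
    labels, counts = np.unique(np.asarray(y), return_counts=True)
    total = counts.sum()
    return {str(label): float(count) / float(total)
            for label, count in zip(labels, counts)}
```

Calling it on the labels of the original dataset, the train set, and the test set should give approximately equal proportions for each class when the split is stratified.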
