## Stratified k-Fold CV Lecture 4 Marisol De La Cruz

In [10]:
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Work Period 4 Questions ##

>  1.	You will implement Stratified K-Fold Cross-Validation using a dataset consisting of positive and negative integers. The goal is to split the dataset into 5 folds while maintaining the ratio between positive and negative examples in each fold. You will observe how Stratified K-Fold ensures that the class distribution remains consistent across all folds.
Dataset: The dataset contains the first 40 positive integers and the first 10 negative integers.  
Class 0: Negative integers (first 10 numbers: -10 to -1).  
Class 1: Positive integers (first 40 numbers: 1 to 40).  
a)	Generate the dataset of the 50 numbers and assign class labels and display the head of the data(15).  
b)	Create your folds with StratifiedKFold.  
c)	For each fold display the test set.  
d)	For each test set verify that the ratio of positives to negatives is correct.


a)	Generate the dataset of the 50 numbers and assign class labels and display the head of the data(15). 

In [27]:
# Dataset creation: the first 10 negative integers and the first 40 positive integers
negatives = np.arange(-10, 0)  # Class 0: Negative integers (-10 to -1)
positives = np.arange(1, 41)    # Class 1: Positive integers (1 to 40)


In [28]:
neg_labels = np.zeros(len(negatives), dtype=int)  # Class 0: Negative integers
pos_labels = np.ones(len(positives), dtype=int)   # Class 1: Positive integers

In [34]:
data = np.concatenate((negatives, positives))  # All numbers
labels = np.concatenate((neg_labels, pos_labels))  # All labels

In [35]:
# Name the DataFrame df, since the fold-creation cells below refer to df
df = pd.DataFrame({'Integers': data, 'Labels': labels})
df.head(15)

   Integers  Labels
0       -10       0
1        -9       0
2        -8       0
3        -7       0
4        -6       0
5        -5       0
6        -4       0
7        -3       0
8        -2       0
9        -1       0


b)	Create your folds with StratifiedKFold.  

In [36]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = []
for train_index, test_index in skf.split(df['Integers'], df['Labels']):
    fold = df.iloc[test_index]
    folds.append(fold)

In [37]:
for i in range(len(folds)):
    print(f"Fold {i+1}:")
    print(folds[i])

Fold 1:
    Integers  Labels
1         -9       0
3         -7       0
13         4       1
18         9       1
21        12       1
22        13       1
34        25       1
37        28       1
40        31       1
43        34       1
Fold 2:
    Integers  Labels
5         -5       0
8         -2       0
16         7       1
20        11       1
25        16       1
27        18       1
28        19       1
29        20       1
38        29       1
47        38       1
Fold 3:
    Integers  Labels
2         -8       0
7         -3       0
17         8       1
26        17       1
30        21       1
41        32       1
42        33       1
45        36       1
48        39       1
49        40       1
Fold 4:
    Integers  Labels
4         -6       0
9         -1       0
10         1       1
12         3       1
14         5       1
19        10       1
31        22       1
33        24       1
35        26       1
36        27       1
Fold 5:
    Integers  Labels
0        -10   

c)	For each fold display the test set. 

In [38]:
for i, (train_index, test_index) in enumerate(skf.split(df['Integers'], df['Labels'])):
    # Access the data using the train_index and test_index directly
    train_set = df.iloc[train_index]
    test_set = df.iloc[test_index]

    print(f"Fold {i+1}:")

    # Share of each class in this fold's test set
    class_distribution = test_set['Labels'].value_counts(normalize=True)
    print(class_distribution)

Fold 1:
Labels
1    0.8
0    0.2
Name: proportion, dtype: float64
Fold 2:
Labels
1    0.8
0    0.2
Name: proportion, dtype: float64
Fold 3:
Labels
1    0.8
0    0.2
Name: proportion, dtype: float64
Fold 4:
Labels
1    0.8
0    0.2
Name: proportion, dtype: float64
Fold 5:
Labels
1    0.8
0    0.2
Name: proportion, dtype: float64


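d)	For each test set verify that the ratio of positives to negatives is correct.

The proportions printed in part c) are 0.8 for class 1 and 0.2 for class 0 in every fold, which matches the 40:10 (4:1) ratio of positives to negatives in the full dataset.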
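The ratio can also be checked with explicit counts rather than by reading the printed proportions. The cell below is a minimal sketch, not part of the original submission: it rebuilds the same 50-number dataset so that it runs on its own, and the names `values`, `classes`, `checker` and `fold_counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Rebuild the dataset: 10 negatives (class 0) followed by 40 positives (class 1)
values = np.concatenate((np.arange(-10, 0), np.arange(1, 41)))
classes = (values > 0).astype(int)

# Same splitter settings as in part b)
checker = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold_number, (_, test_idx) in enumerate(
        checker.split(values.reshape(-1, 1), classes), start=1):
    n_pos = int((classes[test_idx] == 1).sum())
    n_neg = int((classes[test_idx] == 0).sum())
    fold_counts.append((n_pos, n_neg))
    print(f"Fold {fold_number}: {n_pos} positives, {n_neg} negatives")
    # The full dataset has 40 positives per 10 negatives, i.e. a 4:1 ratio
    assert n_pos / n_neg == 40 / 10
```

Each test fold should hold 8 positives and 2 negatives, because 40 and 10 both divide evenly by 5.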

In [55]:
# True and predicted values
true_values = {
    'True_1': [10.0, 7.5, -3.0, 15.0, 2.0, 0.0, -5.0, 9.0, -2.0, 8.0],
    'True_2': [5.0, 8.0, 12.0, -6.0, -1.0, 4.0, 10.0, 6.0, 3.0, -3.0],
    'True_3': [-2.0, 0.5, 5.0, -3.0, 7.0, 2.0, -4.0, 0.0, 1.0, 4.0],
    'True_4': [3.0, -4.0, 2.0, 10.0, -8.0, 1.0, -2.0, 3.0, -7.0, 0.0],
}

predicted_values = {
    'Predicted_1': [11.0, 8.5, -2.0, 16.0, 3.0, 1.0, -4.0, 10.0, -1.0, 9.0],
    'Predicted_2': [4.5, 7.5, 11.5, -6.5, -1.5, 3.5, 9.5, 5.5, 2.5, -3.5],
    'Predicted_3': [-1.2, 1.3, 5.8, -2.2, 7.8, 2.8, -3.2, 0.8, 1.8, 4.8],
    'Predicted_4': [1.8, -5.2, 0.8, 8.8, -9.2, -0.2, -3.2, 1.8, -8.2, -1.2],
}

In [56]:
df_true = pd.DataFrame(true_values, columns=['True_1', 'True_2', 'True_3', 'True_4'])
df_pred = pd.DataFrame(predicted_values, columns=['Predicted_1', 'Predicted_2', 'Predicted_3', 'Predicted_4'])

In [57]:
bias = pd.DataFrame({
    'Bias_1': df_pred['Predicted_1'] - df_true['True_1'],
    'Bias_2': df_pred['Predicted_2'] - df_true['True_2'],
    'Bias_3': df_pred['Predicted_3'] - df_true['True_3'],
    'Bias_4': df_pred['Predicted_4'] - df_true['True_4']
})

In [58]:
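# Element-wise bias as a NumPy array (this overwrites the DataFrame named bias from the previous cell)
bias = df_pred.values - df_true.values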

df_bias = pd.DataFrame({
    'True_1': df_true['True_1'],
    'Predicted_1': df_pred['Predicted_1'],
    'Bias_1': bias[:, 0],  # Bias for the first column
    'True_2': df_true['True_2'],
    'Predicted_2': df_pred['Predicted_2'],
    'Bias_2': bias[:, 1],  # Bias for the second column
    'True_3': df_true['True_3'],
    'Predicted_3': df_pred['Predicted_3'],
    'Bias_3': bias[:, 2],  # Bias for the third column
    'True_4': df_true['True_4'],
    'Predicted_4': df_pred['Predicted_4'],
    'Bias_4': bias[:, 3]   # Bias for the fourth column
})

# Display the results
print(df_bias)

   True_1  Predicted_1  Bias_1  True_2  Predicted_2  Bias_2  True_3  \
0    10.0         11.0     1.0     5.0          4.5    -0.5    -2.0   
1     7.5          8.5     1.0     8.0          7.5    -0.5     0.5   
2    -3.0         -2.0     1.0    12.0         11.5    -0.5     5.0   
3    15.0         16.0     1.0    -6.0         -6.5    -0.5    -3.0   
4     2.0          3.0     1.0    -1.0         -1.5    -0.5     7.0   
5     0.0          1.0     1.0     4.0          3.5    -0.5     2.0   
6    -5.0         -4.0     1.0    10.0          9.5    -0.5    -4.0   
7     9.0         10.0     1.0     6.0          5.5    -0.5     0.0   
8    -2.0         -1.0     1.0     3.0          2.5    -0.5     1.0   
9     8.0          9.0     1.0    -3.0         -3.5    -0.5     4.0   

   Predicted_3  Bias_3  True_4  Predicted_4  Bias_4  
0         -1.2     0.8     3.0          1.8    -1.2  
1          1.3     0.8    -4.0         -5.2    -1.2  
2          5.8     0.8     2.0          0.8    -1.2  
3 

> 2.	Bias Calculation (from lecture notes)
Given the dataset with both true and predicted values from the notes:  
a)	For each data point, compute the bias using the formula for bias. Display the results.  
b)	 Compute the Euclidean norm for each bias and display the values.


In [60]:
bias_df = pd.DataFrame({
    'Bias_1': df_pred['Predicted_1'] - df_true['True_1'],
    'Bias_2': df_pred['Predicted_2'] - df_true['True_2'],
    'Bias_3': df_pred['Predicted_3'] - df_true['True_3'],
    'Bias_4': df_pred['Predicted_4'] - df_true['True_4']
})

print("Bias (Predicted - True):\n", bias_df)

# Compute the Euclidean (L2) norm of the bias for each data point (row), across the four bias columns
euclidean_norm = np.linalg.norm(bias_df.values, axis=1)
print("Euclidean Norms of Bias:\n", euclidean_norm)

Bias (Predicted - True):
    Bias_1  Bias_2  Bias_3  Bias_4
0     1.0    -0.5     0.8    -1.2
1     1.0    -0.5     0.8    -1.2
2     1.0    -0.5     0.8    -1.2
3     1.0    -0.5     0.8    -1.2
4     1.0    -0.5     0.8    -1.2
5     1.0    -0.5     0.8    -1.2
6     1.0    -0.5     0.8    -1.2
7     1.0    -0.5     0.8    -1.2
8     1.0    -0.5     0.8    -1.2
9     1.0    -0.5     0.8    -1.2
Euclidean Norms of Bias:
 [1.82482876 1.82482876 1.82482876 1.82482876 1.82482876 1.82482876
 1.82482876 1.82482876 1.82482876 1.82482876]
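Every row has the same norm because every row has the same four bias values. As a worked check (a small sketch, with `bias_row` copied by hand from the printed table), the norm is the square root of the sum of squared components: sqrt(1.0² + 0.5² + 0.8² + 1.2²) = sqrt(3.33) ≈ 1.8248.

```python
import numpy as np

# One row of the bias table above (all ten rows are identical)
bias_row = np.array([1.0, -0.5, 0.8, -1.2])

# Euclidean norm by hand: square each component, sum, take the square root
manual_norm = np.sqrt(np.sum(bias_row ** 2))

# The same quantity via NumPy's built-in norm
library_norm = np.linalg.norm(bias_row)

print(f"manual:  {manual_norm:.8f}")
print(f"library: {library_norm:.8f}")
```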


> 3.	Advantages of Stratified K-Fold Cross-Validation.  
a)	In your own words, explain the advantages of using stratified K-fold cross-validation over standard K-fold cross-validation.  
b)	Discuss how stratification improves the performance of machine learning models, especially when working with imbalanced datasets.  
c)	Why is it important to maintain the same class proportions in each fold, and how does this impact model evaluation?


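With an imbalanced dataset, standard K-Fold can produce folds that contain very few, or even no, examples of the minority class. Stratified K-Fold avoids this: each fold reflects the class distribution of the entire dataset.

Preserving the class distribution ensures that each fold is representative of the entire dataset, so the scores from different folds are comparable and the minority class takes part in every evaluation.

Stratification does not change the model itself. It makes the performance estimate more reliable, which gives a clearer picture of how the model behaves on different segments of the data and supports adjusting the model accordingly.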
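The difference is easy to see on the dataset from Question 1. The following sketch (illustrative names, placeholder features, and no shuffling, which is an assumption chosen to make the contrast obvious) compares the share of positives in each test fold under plain `KFold` and under `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Labels ordered as in Question 1: 10 negatives (0) followed by 40 positives (1)
y = np.array([0] * 10 + [1] * 40)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here


def positive_share_per_fold(splitter):
    """Return the fraction of class-1 samples in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


plain = positive_share_per_fold(KFold(n_splits=5))
stratified = positive_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Without shuffling, plain K-Fold puts all ten negatives in the first test fold and none in the others, while the stratified splitter keeps the positive share at 0.8 in every fold. Shuffling plain K-Fold reduces the problem but does not guarantee equal proportions.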

> 4.	Impact of the Number of Folds (k).  
The number of folds (k) in cross-validation can impact the bias and variance of the model's performance estimate.  
a)	Explain how increasing or decreasing the number of folds affects the bias and variance of the model.  
b)	what would be a good choice for the number of folds (k) for:  
i.	A small dataset?  
ii.	A large dataset?  
iii.	An imbalanced dataset?


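The number of folds (k) in cross-validation can significantly influence both the bias and the variance of the model's performance estimate.

Each model is trained on (k-1)/k of the dataset and tested on the remaining 1/k, so k determines how much data every training set contains.

A higher k means that each training set contains a larger proportion of the dataset, which tends to lower the bias of the estimate. At the same time the test folds become smaller and the training sets overlap more, so the variance of the estimate and the computational cost tend to rise. A lower k has the opposite effect.

As commonly cited rules of thumb: for a small dataset, a larger k (such as 10, or leave-one-out) keeps most of the data for training; for a large dataset, a smaller k (such as 5) is usually sufficient and cheaper; for an imbalanced dataset, use stratified folds with k small enough that every fold still contains minority-class examples.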
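To make the effect of k on the training-set size concrete, the sketch below splits 50 placeholder samples (the size of the Question 1 dataset) with a few values of k and records the train and test sizes of the first fold. The variable names and the choice of k values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 50
X = np.zeros((n_samples, 1))  # placeholder features; only the sizes matter

sizes = {}
for k in (2, 5, 10):
    # Take the first (train, test) split produced for this k
    train_idx, test_idx = next(KFold(n_splits=k).split(X))
    sizes[k] = (len(train_idx), len(test_idx))
    print(f"k={k:2d}: train={len(train_idx)} ({len(train_idx) / n_samples:.0%}), "
          f"test={len(test_idx)}")
```

The training share grows as (k-1)/k, from half of the data at k=2 to 90% at k=10, while each test fold shrinks from 25 samples to 5.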