<a href="https://colab.research.google.com/github/gabitza-tech/ETTI-SummerSchool2025/blob/main/Students_MachineLearning_Intro_ImbalancedClasses.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Imbalanced Dataset Example - Binary Classification

In the real world, data is rarely **clean** or **perfectly balanced**.  
Some of the common challenges we face include:  

- ⚖️ **Imbalanced classes** — certain outcomes are much rarer than others.  
- ❓ **Missing values** — incomplete information across different attributes.  
- 🌍 **Domain shifts** — differences between training and test data distributions.  

---
To illustrate how to handle imbalanced data in Python, let’s explore the **Bank Marketing Dataset**. This publicly available dataset contains information about bank customers, with the target variable indicating whether a client subscribed to a term deposit after receiving a marketing call (“yes” vs. “no”).

In [None]:
# We need to install this library in order to fetch the dataset we need
!pip install ucimlrepo

In [None]:
# Once loaded, this dataset has a format similar to the previous one from scikit-learn.
from ucimlrepo import fetch_ucirepo
import pandas as pd

bank_marketing = fetch_ucirepo(id=222)

# Separate the target labels from the rest of the features
x = bank_marketing.data.features
y = bank_marketing.data.targets

# Show some dataset metadata
print(bank_marketing.metadata)
print(bank_marketing.variables)


# Exercise 1

### Tasks
1. Check dataset size.
2. Check number of features.
3. Check number of classes.
4. Check class distribution.
5. Check for missing data.
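
The checks listed above can be sketched with a few pandas calls. This is a minimal sketch on a tiny toy table, not the exercise solution: the column names `age`, `job` and `y` and all values are made up for illustration and are not taken from the Bank Marketing Dataset.

```python
import pandas as pd

# Toy stand-in data (hypothetical columns and values, NOT the real dataset).
toy_x = pd.DataFrame({
    "age": [25, 40, 31, 58, 47, 36],
    "job": ["admin", None, "technician", "retired", "admin", None],
})
toy_y = pd.DataFrame({"y": ["no", "no", "yes", "no", "no", "no"]})

# Dataset size and number of features: rows and columns of the feature table.
n_samples, n_features = toy_x.shape

# Number of classes and class distribution (absolute and relative).
class_counts = toy_y["y"].value_counts()
class_ratio = toy_y["y"].value_counts(normalize=True)

# Missing data: number of NaN/None entries per feature.
missing_per_column = toy_x.isna().sum()

print("samples:", n_samples, "features:", n_features)
print("number of classes:", class_counts.size)
print(class_counts)
print(class_ratio)
print(missing_per_column)
```

The same calls should work on `x` and `y` from the cell above, assuming they are pandas DataFrames.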


In [None]:
# CODE HERE

# Shape of data and number of features

# Number of classes

# Class Distribution

# Check for missing data

# Exercise 2

For most of these tasks, look at the previous exercise!

### Tasks:

1. Handle missing values. (Can we simply drop the affected samples?)
2. Convert categorical features to numerical ones. Hint: use `OneHotEncoder` or `LabelEncoder`.
3. Split the dataset into train and test sets (70-30 split).
4. Scale your data with any scaler you like.
5. Train a logistic regression model.
6. How long does training take?
7. Check accuracy.
8. Check precision, recall and F1-score.
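
One possible shape for this workflow is sketched below on synthetic data. Everything in it is an assumption made for illustration: the toy columns `balance` and `contact`, the choice to fill missing categories with `"unknown"` instead of dropping rows, the use of `pd.get_dummies` as a simpler stand-in for `OneHotEncoder`, the `StandardScaler`, and the stratified split. Treat it as a sketch to compare against, not as the reference solution.

```python
import time

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Toy stand-in data (hypothetical columns): one numeric, one categorical feature.
contact = rng.choice(["cellular", "telephone"], n).astype(object)
contact[rng.random(n) < 0.1] = np.nan  # inject roughly 10% missing values
toy_x = pd.DataFrame({"balance": rng.normal(1000, 300, n), "contact": contact})
toy_y = np.where(toy_x["balance"] + rng.normal(0, 150, n) > 1300, "yes", "no")

# Missing values: keep the rows and treat "missing" as its own category.
toy_x["contact"] = toy_x["contact"].fillna("unknown")

# Categorical -> numerical: one-hot encode features, integer-encode labels.
x_enc = pd.get_dummies(toy_x, columns=["contact"], dtype=float)
y_enc = LabelEncoder().fit_transform(toy_y)  # "no" -> 0, "yes" -> 1

# 70-30 split; stratify keeps the class ratio similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_enc, y_enc, test_size=0.3, stratify=y_enc, random_state=42
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler()
x_tr_s = scaler.fit_transform(x_tr)
x_te_s = scaler.transform(x_te)

# Train the model and measure how long fitting takes.
start = time.perf_counter()
clf = LogisticRegression(max_iter=500).fit(x_tr_s, y_tr)
train_seconds = time.perf_counter() - start

# Evaluate on the held-out part.
y_hat = clf.predict(x_te_s)
acc = accuracy_score(y_te, y_hat)
f1_macro = f1_score(y_te, y_hat, average="macro")
print(f"train time: {train_seconds:.4f}s  acc: {acc:.3f}  macro F1: {f1_macro:.3f}")
```

Fitting the scaler only on the training part avoids leaking test-set statistics into training.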

In [None]:
# CODE HERE FOR MISSING DATA + FEATURE & LABEL ENCODING
# import necessary libraries for Encoding Features and Labels

# Save encoded features and labels in variables x_encoded and y_encoded
print(x_encoded[:2])
print(set(y_encoded))


In [None]:
# CODE HERE FOR DATA SPLITTING
# Necessary imports

# use the train-test split function from scikit-learn

# DATA SCALING - use the scalers used in previous exercises - save your data as x_train, y_train, x_test, y_test

In [None]:
# CODE HERE FOR TRAINING a Logistic Regression Model
# Necessary imports for training model + evaluation metrics like the report, accuracy score, f1-score

# Train and save predictions in a y_pred variable
y_pred = ...

# Evaluate your trained model - save the results in f1 and acc
print("\nLogistic Regression Classification Report:\n", classification_report(...))
f1 = f1_score(...) # Calculate the macro-averaged F1-score

# Let's plot a confusion matrix

We will use a combination of scikit-learn, seaborn and matplotlib. Let's first show you how you can do that.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Compute confusion matrix
cm = confusion_matrix(...) # YOU HAVE TO FILL THIS PART ;)

# Plot using seaborn heatmap
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()


# Dataset Imbalance Techniques


# 1. Random Undersampling

Since one class holds the **majority** of the samples, a very simple technique is to **randomly remove** samples from that class until both classes contain the same number of samples. We are **REMOVING** training data on purpose, in order to avoid **OVERFITTING** to the majority class.
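
To make the idea concrete, here is a minimal NumPy sketch of random undersampling on toy data. It only illustrates the concept described above; it is not the code of imbalanced-learn's `RandomUnderSampler`, whose internals may differ. The names `x_toy`, `y_toy`, `x_under` and `y_under` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training set: 90 majority samples (label 0) and 10 minority samples (label 1).
x_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 90 + [1] * 10)

classes, counts = np.unique(y_toy, return_counts=True)
n_keep = counts.min()  # size of the smallest class

# For every class, keep n_keep randomly chosen samples (without replacement).
kept_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_toy == c), size=n_keep, replace=False)
    for c in classes
])
x_under, y_under = x_toy[kept_idx], y_toy[kept_idx]

print("before:", np.bincount(y_toy))    # 90 vs 10 samples
print("after: ", np.bincount(y_under))  # 10 vs 10 samples
print("removed samples:", y_toy.size - y_under.size)
```

In this toy case 80 of the 100 samples are discarded, which shows the main cost of the technique: most of the majority-class information is thrown away.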

# Exercise 3

### Tasks

1. What is the new number of samples for each class?
2. How much data did we remove?
3. Train a new logistic regression model - save predictions in `y_pred_rus`
4. How long does training take?
5. Evaluate the model with the new data.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# 1️⃣ Random Undersampling
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(x_train, y_train) # New data

# Train a logistic regression model with the new data - save your predictions in y_pred_rus
...
y_pred_rus = ...

# Evaluate the new model, save f1_rus, acc_rus

f1_rus = ... # calculate macro

# 2. Random Oversampling

Since one class is in the **minority**, another very simple technique is to **randomly repeat** samples from that class until both classes contain the same number of samples. We are **ADDING** training data on purpose, in order to avoid **UNDERFITTING** the minority class.
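
The same idea can be sketched in plain NumPy. Again, this is only a conceptual illustration on toy data under made-up names (`x_toy`, `y_toy`, `x_over`, `y_over`), not the implementation of imbalanced-learn's `RandomOverSampler`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training set: 90 majority samples (label 0) and 10 minority samples (label 1).
x_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 90 + [1] * 10)

classes, counts = np.unique(y_toy, return_counts=True)
n_target = counts.max()  # size of the largest class

idx_parts = []
for c, count in zip(classes, counts):
    class_idx = np.flatnonzero(y_toy == c)
    # Keep every original sample and draw the shortfall with replacement.
    extra = rng.choice(class_idx, size=n_target - count, replace=True)
    idx_parts.append(np.concatenate([class_idx, extra]))

over_idx = np.concatenate(idx_parts)
x_over, y_over = x_toy[over_idx], y_toy[over_idx]

print("before:", np.bincount(y_toy))   # 90 vs 10 samples
print("after: ", np.bincount(y_over))  # 90 vs 90 samples

# No new information is created: the minority class still has 10 distinct rows.
n_distinct_minority = np.unique(x_over[y_over == 1], axis=0).shape[0]
print("distinct minority rows:", n_distinct_minority)
```

Because the added rows are exact copies, oversampling should be applied to the training set only, so that duplicates of a sample never end up in the test set.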

# Exercise 4

### Tasks

1. What is the new number of samples for each class?
2. Train a new logistic regression model - save predictions in `y_pred_ros`
3. How long does training take?
4. Evaluate the model with the new data.

In [None]:
from imblearn.over_sampling import RandomOverSampler


# 2️⃣ Random Oversampling
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(x_train, y_train)

# Write your training code here - save your predictions in y_pred_ros
# ...
y_pred_ros = ...

# Evaluate your model, save f1_ros, acc_ros for later comparisons
f1_ros = ... # calculate macro

# 3. Train the Logistic Regression Model with the class balancing function

Most methods nowadays have a parameter that puts a larger weight on minority classes, so that they count as much as the more frequent classes. This is a simple method: you only need to add `class_weight='balanced'` when initializing `LogisticRegression`.
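
As a rough illustration of what "balanced" means, the scikit-learn documentation describes the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below evaluates that formula by hand on a made-up label vector `y_toy`; it does not call scikit-learn itself.

```python
import numpy as np

# Toy labels: 90 samples of class 0 and 10 samples of class 1.
y_toy = np.array([0] * 90 + [1] * 10)

n_samples = y_toy.size
class_counts = np.bincount(y_toy)
n_classes = class_counts.size

# Weight per class, inversely proportional to the class frequency.
balanced_weights = n_samples / (n_classes * class_counts)
print(balanced_weights)  # roughly 0.556 for class 0 and 5.0 for class 1

# With these weights, both classes contribute the same total weight.
total_weight_per_class = balanced_weights * class_counts
print(total_weight_per_class)
```

Unlike resampling, this leaves the training set untouched: each minority sample simply counts more in the loss.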

# Exercise 5

### Tasks

1. What is the number of samples for each class?
2. Train a new logistic regression model - save predictions in `y_pred_bal`
3. How long does training take?
4. Evaluate the new model.

In [None]:
# CODE HERE
# Balanced Logistic Regression Model training
clf_bal = LogisticRegression(max_iter=500, class_weight='balanced')
...

# Predict and evaluate, save your predictions in y_pred_bal
y_pred_bal = ...

# Evaluate your model here, save f1_bal, acc_bal
...


Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.85      0.91     11977
           1       0.42      0.81      0.55      1587

    accuracy                           0.85     13564
   macro avg       0.69      0.83      0.73     13564
weighted avg       0.91      0.85      0.87     13564

Accuracy: 0.8464317310527868


# Data Imbalance Techniques comparison

Let's plot some confusion matrices for each case, side-by-side. You have minimal *code filling* to do.

# Exercise 6

### Tasks

1. What is the best method?
2. Also print some info about each method's F1-score, accuracy, training time, etc.

### OPTIONAL
3. Plot a graph containing the ROC curves for all 4 cases (unbalanced, undersampling, oversampling, balanced). Give each curve a different color and add a legend. Look up what you need in order to plot such a curve; you can obtain ROC/AUROC/precision/recall values easily with `sklearn.metrics`.
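
As a starting point for the optional task, the sketch below draws ROC curves for two models on synthetic data from `make_classification`. The data, the model names and the variable names are assumptions for illustration only. It uses `predict_proba` scores rather than hard `predict` labels so that the curve has more than one operating point.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90% / 10%), standing in for the real dataset.
x_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0
)

models = {
    "Unweighted": LogisticRegression(max_iter=500),
    "Balanced class weight": LogisticRegression(max_iter=500, class_weight="balanced"),
}

fig, ax = plt.subplots(figsize=(6, 5))
auc_scores = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    scores = model.predict_proba(x_te)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_te, scores)
    auc_scores[name] = roc_auc_score(y_te, scores)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc_scores[name]:.3f})")

# Diagonal reference line: the expected curve of a random classifier.
ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curves")
ax.legend()
fig.tight_layout()
```

For the exercise, the dictionary would hold the four fitted models instead, and the scores would come from the real test set.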

In [None]:
# Prepare data - fill the necessary variables
preds = {
    'Original Logistic': ...,
    'Balanced Class Weight': ...,
    'Random Undersampling': ...,
    'Random Oversampling': ...
}

# Create 1 row, 4 columns plot
fig, axes = plt.subplots(1, 4, figsize=(20,5))

for ax, (title, pred) in zip(axes, preds.items()):
    cm = confusion_matrix(...) # FILL THE NECESSARY VARIABLES
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'], ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

plt.tight_layout()
plt.show()

# CODE HERE
# Print some more info about each method's performance: accuracy, F1-score, etc.
...

In [None]:
# CODE HERE
# PLOT A GRAPH CONTAINING ALL 4 ROC CURVES
...