# Part 6.3: Advanced Topics - Handling Imbalanced Data

Imbalanced datasets are a common problem in classification: the classes are not represented equally. In fraud detection, for example, fraudulent transactions are far rarer than legitimate ones. If the imbalance is not handled, a model can become biased towards the majority class and perform poorly on the minority class, even while its overall accuracy looks high.
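To make the bias concrete, here is a minimal sketch using hypothetical labels (950 legitimate, 50 fraudulent; these are illustrative and not the dataset used in this section) and a degenerate "model" that always predicts the majority class. It shows why accuracy alone can be misleading on imbalanced data.

```python
import numpy as np

# Hypothetical labels for illustration: 950 majority (0) and 50 minority (1) samples.
y_true = np.array([0] * 950 + [1] * 50)

# A degenerate "model" that always predicts the majority class.
y_always_majority = np.zeros_like(y_true)

# Accuracy: fraction of all predictions that are correct.
accuracy = (y_always_majority == y_true).mean()

# Minority recall: fraction of true class-1 samples that were predicted as class 1.
minority_mask = y_true == 1
minority_recall = (y_always_majority[minority_mask] == 1).mean()

print(f"accuracy:        {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

The predictor is right 95% of the time yet never finds a single minority sample, which is why the reports in this section are read per class rather than by accuracy alone.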

In [1]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Create an imbalanced dataset (95% class 0, 5% class 1)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, 
                           n_redundant=0, n_classes=2, n_clusters_per_class=1, 
                           weights=[0.95, 0.05], flip_y=0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### The Problem: A Naive Model
Let's train a standard model and see how it performs.

In [2]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print("Overall accuracy is high, but recall for the minority class (1) is poor: the model misses many of the minority samples.")

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       190
           1       1.00      0.60      0.75        10

    accuracy                           0.98       200
   macro avg       0.99      0.80      0.87       200
weighted avg       0.98      0.98      0.98       200

Overall accuracy is high, but recall for the minority class (1) is poor: the model misses many of the minority samples.


### Strategy 1: Resampling with SMOTE
Resampling techniques modify the training data to create a more balanced class distribution.
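- **Oversampling**: Increases the number of instances in the minority class. **SMOTE** (Synthetic Minority Over-sampling Technique) is a popular method that creates new synthetic samples by interpolating between a minority sample and one of its nearest minority-class neighbors.
- **Undersampling**: Decreases the number of instances in the majority class, at the cost of discarding some training data.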
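The interpolation idea behind SMOTE can be sketched in plain NumPy. This is an illustration only, not the `imbalanced-learn` implementation: the function name `smote_like_oversample`, its parameters, and the sample data are all hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative sketch only; not the imbalanced-learn implementation.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    base = rng.integers(0, n, size=n_new)  # random base sample for each new point
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)

    # Each new point lies on the segment between a base sample and its neighbor.
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical minority-class samples (20 points, 2 features).
rng = np.random.default_rng(42)
X_minority = rng.normal(loc=3.0, scale=1.0, size=(20, 2))

X_synthetic = smote_like_oversample(X_minority, n_new=80)
print(X_synthetic.shape)  # (80, 2)
```

Because every synthetic point is a blend of two existing minority samples, the new points stay inside the region the minority class already occupies instead of being exact duplicates.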

**Note**: This requires the `imbalanced-learn` library (`pip install -U imbalanced-learn`).

In [3]:
# This cell is for demonstration only. The code is commented out because it requires `imbalanced-learn`; uncomment it once the library is installed.

# from imblearn.over_sampling import SMOTE
# from imblearn.pipeline import Pipeline

# smote = SMOTE(random_state=42)
# model_smote = LogisticRegression()

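# # It's best to use SMOTE within an imblearn Pipeline: resampling is then applied only
# # during fit, so test data (and validation folds) are never resampled, which prevents data leakage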
# pipe = Pipeline([('smote', smote), ('model', model_smote)])
# pipe.fit(X_train, y_train)
# y_pred_smote = pipe.predict(X_test)

# print(classification_report(y_test, y_pred_smote))

### Strategy 2: Using Class Weights
A simpler approach is to adjust the model's loss function so that errors on the minority class cost more. Many scikit-learn models accept a `class_weight` parameter; setting `class_weight='balanced'` weights each class inversely proportional to its frequency in the training labels.
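As a rough illustration of what the `'balanced'` setting computes, the sketch below applies the heuristic documented by scikit-learn, `n_samples / (n_classes * np.bincount(y))`, and compares it with the library helper `compute_class_weight`. The label array `y_example` is an assumption: it uses 760 and 40 samples, the class counts that an 80/20 stratified split of a 950/50 dataset would give.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed labels for illustration: 760 samples of class 0 and 40 of class 1.
y_example = np.array([0] * 760 + [1] * 40)

n_samples = len(y_example)
n_classes = len(np.unique(y_example))

# 'balanced' heuristic: n_samples / (n_classes * count of each class).
manual_weights = n_samples / (n_classes * np.bincount(y_example))

# The same weights as computed by scikit-learn's helper.
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_example
)

print(manual_weights)
print(sklearn_weights)
```

With these counts the minority class gets a weight of 10.0 and the majority class about 0.53, so a mistake on a minority sample counts roughly 19 times as much as a mistake on a majority sample.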

In [4]:
model_weighted = LogisticRegression(class_weight='balanced')
model_weighted.fit(X_train, y_train)
y_pred_weighted = model_weighted.predict(X_test)

print(classification_report(y_test, y_pred_weighted))
print("\nRecall for class 1 has improved significantly, at the cost of lower precision for that class.")

              precision    recall  f1-score   support

           0       1.00      0.98      0.99       190
           1       0.77      1.00      0.87        10

    accuracy                           0.98       200
   macro avg       0.88      0.99      0.93       200
weighted avg       0.99      0.98      0.99       200


Recall for class 1 has improved significantly, at the cost of lower precision for that class.
