# Imbalanced data

Imagine a dataset in which 95% of the samples belong to one class (the majority) and only 5% to the other (the minority).
A classifier that always predicts the majority class would still achieve 95% accuracy while having no real predictive power.

This situation is called class imbalance, and it’s common in real-world problems such as fraud detection, disease diagnosis, or rare event prediction.
When one class heavily dominates the other, a model can appear to perform well by focusing only on the majority class while ignoring the minority.

Therefore, accuracy can be a misleading metric for imbalanced data. Instead, it’s better to evaluate models using precision, recall, and the F1-score, which better reflect how well the minority class is identified.
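
To make this concrete, here is a minimal sketch of the "always predict the majority" classifier on a toy label vector. The toy arrays `X_demo` and `y_demo` are made up for illustration and are not the dataset used in the rest of this notebook; `DummyClassifier` simply stands in for a model with no predictive power.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical toy labels: 95 majority samples (0) and 5 minority samples (1)
y_demo = np.array([0] * 95 + [1] * 5)
X_demo = np.zeros((100, 1))  # features are irrelevant for this baseline

# A "classifier" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

acc_demo = accuracy_score(y_demo, y_hat)
rec_demo = recall_score(y_demo, y_hat, pos_label=1)
f1_demo = f1_score(y_demo, y_hat, pos_label=1, zero_division=0)

print(f"accuracy={acc_demo:.2f}, minority recall={rec_demo:.2f}, minority F1={f1_demo:.2f}")
```

Accuracy looks excellent, while recall and F1 for the minority class are zero because not a single minority sample is found.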

In [1]:
from sklearn.datasets import make_classification
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [None]:
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_classes=2,
    weights=[0.95, 0.05], # 5% minority class
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

lr = LogisticRegression(max_iter=2000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Logistic Regression without resampling:")
print(classification_report(y_test, y_pred_lr))

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest without resampling")
print(classification_report(y_test, y_pred_rf))



Logistic Regression without resampling:
              precision    recall  f1-score   support

           0       0.96      0.99      0.97      1181
           1       0.61      0.25      0.35        69

    accuracy                           0.95      1250
   macro avg       0.78      0.62      0.66      1250
weighted avg       0.94      0.95      0.94      1250

Random Forest without resampling
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1181
           1       0.75      0.52      0.62        69

    accuracy                           0.96      1250
   macro avg       0.86      0.76      0.80      1250
weighted avg       0.96      0.96      0.96      1250



The random forest handles the imbalance reasonably well: it finds about half of the minority samples (recall 0.52, F1 0.62).  
Logistic regression, on the other hand, struggles: for the 5% class its recall is only 0.25 and its F1 0.35.  
Let's see if resampling can help.

## Resampling Methods

### Random Oversampling (RandomOverSampler)

This method randomly **duplicates existing minority samples** until the classes are balanced.

**Pros:**
- Simple and fast.
- Keeps all original data.

**Cons:**
- Can lead to **overfitting**, since the model sees identical samples multiple times.
- Doesn’t add new information.
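
The idea can be sketched by hand with NumPy. This is only an illustration of duplicating minority rows on made-up toy data (`X_toy`, `y_toy`), not a description of how any particular library implements it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 95 majority samples (label 0), 5 minority samples (label 1)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 95 + [1] * 5)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority indices with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

# Append the duplicated rows to the original data
X_over = np.vstack([X_toy, X_toy[extra_idx]])
y_over = np.concatenate([y_toy, y_toy[extra_idx]])

print(np.bincount(y_over))  # [95 95]
```

The classes are now balanced, but the minority class still contains only five distinct rows, which is why duplication adds no new information.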

Below we retrain both models after oversampling:

In [None]:
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy='minority', random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

# Logistic Regression on oversampled data
lr.fit(X_train_ros, y_train_ros)
y_pred_lr_ros = lr.predict(X_test)

print(classification_report(y_test, y_pred_lr_ros))

# Random Forest on oversampled data
rf.fit(X_train_ros, y_train_ros)
y_pred_rf_ros = rf.predict(X_test)

print(classification_report(y_test, y_pred_rf_ros))

              precision    recall  f1-score   support

           0       0.98      0.85      0.91      1181
           1       0.22      0.72      0.34        69

    accuracy                           0.85      1250
   macro avg       0.60      0.79      0.63      1250
weighted avg       0.94      0.85      0.88      1250

              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1181
           1       0.69      0.61      0.65        69

    accuracy                           0.96      1250
   macro avg       0.83      0.80      0.81      1250
weighted avg       0.96      0.96      0.96      1250



### Random Undersampling (RandomUnderSampler)

This method randomly **removes majority-class samples** to balance the dataset.

**Pros:**
- Quick to run.
- Reduces the model's bias toward the majority class.

**Cons:**
- **Loses information** from the majority class.
- The model may generalize worse if too much data is dropped.
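
A hand-rolled sketch of the same idea, again on made-up toy data (`X_toy`, `y_toy`) and assuming we sample the majority class without replacement:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 95 majority samples (label 0), 5 minority samples (label 1)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 95 + [1] * 5)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep only as many majority samples as there are minority samples
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep_idx = np.concatenate([kept_majority, minority_idx])

X_under, y_under = X_toy[keep_idx], y_toy[keep_idx]

print(np.bincount(y_under))  # [5 5]
print(f"dropped {len(y_toy) - len(y_under)} of {len(y_toy)} samples")
```

In this toy case 90 of the 100 samples are thrown away, which illustrates how much majority-class information can be lost.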

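Compared with random oversampling, the results below show higher recall for the minority class (0.75 vs. 0.72 for logistic regression, 0.72 vs. 0.61 for the random forest) but lower overall accuracy (0.84 vs. 0.85 and 0.90 vs. 0.96), which fits expectations.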

In [None]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy='majority', random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

# Logistic Regression on undersampled data
lr.fit(X_train_rus, y_train_rus)
y_pred_lr_rus = lr.predict(X_test)

print(classification_report(y_test, y_pred_lr_rus))

# Random Forest on undersampled data
rf.fit(X_train_rus, y_train_rus)
y_pred_rf_rus = rf.predict(X_test)

print(classification_report(y_test, y_pred_rf_rus))

              precision    recall  f1-score   support

           0       0.98      0.85      0.91      1181
           1       0.22      0.75      0.34        69

    accuracy                           0.84      1250
   macro avg       0.60      0.80      0.63      1250
weighted avg       0.94      0.84      0.88      1250

              precision    recall  f1-score   support

           0       0.98      0.91      0.95      1181
           1       0.32      0.72      0.45        69

    accuracy                           0.90      1250
   macro avg       0.65      0.82      0.70      1250
weighted avg       0.95      0.90      0.92      1250



### SMOTE (Synthetic Minority Oversampling Technique)

Instead of duplicating samples, SMOTE creates **synthetic examples** by interpolating between existing minority samples and their nearest neighbors.

**Pros:**
- Adds variety to minority data → less overfitting than plain oversampling.
- Often improves recall and smooths the decision boundary.

**Cons:**
- Can introduce **noisy or unrealistic samples** if classes overlap.
- Doesn’t work well with categorical features.
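
The interpolation step can be sketched as follows. This is a simplified illustration on made-up minority samples (`X_min`), with `k = 5` neighbors chosen as an assumption; the helper `make_synthetic` is hypothetical, and library implementations handle details this sketch ignores.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Hypothetical minority-class samples (20 points, 2 features)
X_min = rng.normal(size=(20, 2))

k = 5  # assumed number of neighbors for this sketch
# Ask for k + 1 neighbors because each point is its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, neighbor_idx = nn.kneighbors(X_min)


def make_synthetic(i):
    """Interpolate between sample i and one of its k nearest minority neighbors."""
    j = rng.choice(neighbor_idx[i, 1:])  # skip column 0 (the point itself)
    lam = rng.random()                   # random position on the segment, in [0, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])


# Create 10 synthetic samples starting from randomly chosen minority points
X_syn = np.array([make_synthetic(i) for i in rng.integers(0, len(X_min), size=10)])
print(X_syn.shape)  # (10, 2)
```

Each synthetic point lies on the line segment between two real minority samples. That is why the new points can land inside the majority region when the classes overlap, and why the approach is not meaningful for categorical features.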

Let’s see how it performs:


In [None]:
from imblearn.over_sampling import SMOTE

# Resample training data using SMOTE
smote = SMOTE(sampling_strategy='minority', random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Logistic Regression on SMOTE data
lr.fit(X_train_smote, y_train_smote)
y_pred_lr_smote = lr.predict(X_test)

print(classification_report(y_test, y_pred_lr_smote))

# Random Forest on SMOTE data
rf.fit(X_train_smote, y_train_smote)
y_pred_rf_smote = rf.predict(X_test)

print(classification_report(y_test, y_pred_rf_smote))

              precision    recall  f1-score   support

           0       0.98      0.86      0.92      1181
           1       0.24      0.74      0.36        69

    accuracy                           0.85      1250
   macro avg       0.61      0.80      0.64      1250
weighted avg       0.94      0.85      0.89      1250

              precision    recall  f1-score   support

           0       0.98      0.97      0.97      1181
           1       0.54      0.62      0.58        69

    accuracy                           0.95      1250
   macro avg       0.76      0.80      0.78      1250
weighted avg       0.95      0.95      0.95      1250



## Algorithm-Level Method: Class Weights

Instead of changing the data, we can adjust the algorithm itself to pay more attention to the minority class.
This is done using class weights, which assign higher penalties to misclassifications of minority samples.
In other words, errors on rare classes "cost" more during training, encouraging the model to learn to recognize them.
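
As a small illustration of what `class_weight='balanced'` means in scikit-learn, the weights follow `n_samples / (n_classes * np.bincount(y))`. The sketch below checks this on made-up toy labels (`y_toy`) rather than on the notebook's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 95 majority samples (0) and 5 minority samples (1)
y_toy = np.array([0] * 95 + [1] * 5)
classes = np.unique(y_toy)

# The "balanced" heuristic computed by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

# The same weights as computed by scikit-learn
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

With a 95/5 split, an error on a minority sample is weighted 19 times as heavily as an error on a majority sample (10.0 vs. about 0.526).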

In [4]:
# Logistic Regression with class weights
lr_bal = LogisticRegression(max_iter=2000, class_weight='balanced', random_state=42)
lr_bal.fit(X_train, y_train)
y_pred_lr_bal = lr_bal.predict(X_test)

print("Logistic Regression with class_weight='balanced':")
print(classification_report(y_test, y_pred_lr_bal))

# Random Forest with class weights
rf_bal = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_bal.fit(X_train, y_train)
y_pred_rf_bal = rf_bal.predict(X_test)

print("Random Forest with class_weight='balanced':")
print(classification_report(y_test, y_pred_rf_bal))


Logistic Regression with class_weight='balanced':
              precision    recall  f1-score   support

           0       0.98      0.85      0.91      1181
           1       0.22      0.72      0.34        69

    accuracy                           0.84      1250
   macro avg       0.60      0.79      0.63      1250
weighted avg       0.94      0.84      0.88      1250

Random Forest with class_weight='balanced':
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1181
           1       0.73      0.52      0.61        69

    accuracy                           0.96      1250
   macro avg       0.85      0.76      0.80      1250
weighted avg       0.96      0.96      0.96      1250



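Logistic regression’s results are almost identical to those from random oversampling, while the random forest’s results are nearly unchanged from the unweighted baseline (minority recall stays at 0.52). This is consistent with the random forest being less sensitive to the imbalance on this dataset.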

## Conclusion

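Resampling sharply raised logistic regression's recall for the minority class, from 0.25 to over 0.7 with every resampling method, but at the cost of precision (0.61 down to 0.22 to 0.24), so its F1 stayed at roughly 0.35.  
The random forest changed less: its minority-class F1 moved from 0.62 to 0.65 with random oversampling and 0.58 with SMOTE, and dropped to 0.45 with undersampling. This suggests that the tree ensemble was already coping with the imbalance relatively well on this dataset.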

In general:
- **Oversampling / SMOTE** increase recall but may reduce precision.
- **Undersampling** increases recall more but often reduces overall accuracy.
- **Class weights** offer a simple algorithm-level alternative, letting the model focus more on minority samples without changing the data.
- For complex models like RandomForest, class weights or tuning the decision threshold may be more efficient than resampling.
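
Tuning the decision threshold, mentioned in the last point, can be sketched as below. The data is a separate, smaller synthetic set (`X_demo`, `y_demo`) and the threshold of 0.3 is an arbitrary value picked for illustration; in practice it would be chosen on validation data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Separate synthetic data for this sketch (not the dataset used above)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Predicted probability of the minority class (label 1)
proba_minority = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    # Predict the minority class whenever its probability reaches the threshold
    y_hat = (proba_minority >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recalls[threshold]:.2f}")
```

Lowering the threshold can only keep or increase minority recall, usually at the cost of precision, and it needs no retraining or resampling.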

Overall, this shows *why and when* resampling techniques are useful, especially for simpler, linear models on strongly imbalanced data.
