# Tutorial 11: Handling Imbalanced Data

## Module 4: Model Development

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Identify and measure class imbalance** in datasets
2. **Apply resampling techniques** including SMOTE and undersampling
3. **Use class-weighted loss functions** to handle imbalance
4. **Implement focal loss** for extreme imbalance
5. **Choose appropriate metrics** for imbalanced datasets
6. **Design evaluation strategies** that account for imbalance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from typing import Tuple, List, Dict
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score,
    precision_recall_curve, average_precision_score, roc_curve
)
from sklearn.utils.class_weight import compute_class_weight

np.random.seed(42)
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
print("Libraries imported successfully!")

In [None]:
# Try to import imbalanced-learn
try:
    from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler, TomekLinks, NearMiss
    from imblearn.combine import SMOTETomek, SMOTEENN
    from imblearn.pipeline import Pipeline as ImbPipeline
    IMBLEARN_AVAILABLE = True
    print("imbalanced-learn available")
except ImportError:
    IMBLEARN_AVAILABLE = False
    print("imbalanced-learn not available - will use custom implementations")

## 1. Understanding Class Imbalance

Class imbalance occurs when one class significantly outnumbers the others.

### Common Scenarios

| Domain | Minority Class | Typical Ratio (minority:majority) |
|--------|---------------|---------------|
| Fraud Detection | Fraud | 1:1000 |
| Disease Diagnosis | Positive cases | 1:100 |
| Spam Detection | Spam | 1:10 |
| Churn Prediction | Churned | 1:20 |
| Click-through Rate | Clicks | 1:100 |
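
The ratios above can be measured directly from a label array. The following is a minimal sketch, assuming integer class labels; the helper name `imbalance_summary` and the toy labels are illustrative and not part of the tutorial's code.

```python
import numpy as np

def imbalance_summary(y):
    """Return class counts, minority fraction, and majority-to-minority ratio."""
    classes, counts = np.unique(y, return_counts=True)
    return {
        "counts": dict(zip(classes.tolist(), counts.tolist())),
        "minority_fraction": float(counts.min() / counts.sum()),
        "imbalance_ratio": float(counts.max() / counts.min()),
    }

# Toy labels (illustrative only): 95 negatives and 5 positives
y_demo = np.array([0] * 95 + [1] * 5)
summary = imbalance_summary(y_demo)
print(summary)
```

Here the minority fraction is 5% and the majority outnumbers the minority 19 to 1, which would be written as a 1:19 ratio in the table's notation.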

In [None]:
def create_imbalanced_dataset(n_samples=10000, imbalance_ratio=0.05, n_features=20, random_state=42):
    """
    Create an imbalanced binary classification dataset.
    
    Args:
        n_samples: Total number of samples
        imbalance_ratio: Fraction of minority class (e.g., 0.05 = 5%)
        n_features: Number of features
        random_state: Random seed
    """
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=10,
        n_redundant=5,
        n_clusters_per_class=2,
        weights=[1 - imbalance_ratio, imbalance_ratio],
        flip_y=0.01,
        random_state=random_state
    )
    return X, y

# Create datasets with different imbalance levels
X_mild, y_mild = create_imbalanced_dataset(imbalance_ratio=0.3)     # Mild: 30%
X_moderate, y_moderate = create_imbalanced_dataset(imbalance_ratio=0.1)  # Moderate: 10%
X_severe, y_severe = create_imbalanced_dataset(imbalance_ratio=0.02)    # Severe: 2%

print("Imbalanced Datasets Created:")
for name, y in [('Mild', y_mild), ('Moderate', y_moderate), ('Severe', y_severe)]:
    counts = Counter(y)
    ratio = counts[1] / counts[0]
    print(f"  {name}: {counts} (ratio 1:{1/ratio:.1f})")

In [None]:
# Visualize class distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, (name, y) in zip(axes, [('Mild (30%)', y_mild), ('Moderate (10%)', y_moderate), ('Severe (2%)', y_severe)]):
    counts = Counter(y)
    ax.bar(['Majority (0)', 'Minority (1)'], [counts[0], counts[1]], color=['steelblue', 'coral'])
    ax.set_ylabel('Count')
    ax.set_title(f'{name} Imbalance')
    for i, (label, count) in enumerate([(0, counts[0]), (1, counts[1])]):
        ax.text(i, count + 100, str(count), ha='center', fontsize=10)

plt.tight_layout()
plt.show()

### 1.1 Problem with Standard Accuracy

In [None]:
# Demonstrate why accuracy is misleading
class MajorityClassifier:
    """Always predicts the majority class."""
    def fit(self, X, y):
        self.majority_class = Counter(y).most_common(1)[0][0]
        return self
    
    def predict(self, X):
        return np.full(len(X), self.majority_class)

# Test on severe imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X_severe, y_severe, test_size=0.2, random_state=42, stratify=y_severe
)

dummy = MajorityClassifier()
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)

print("Majority Class Classifier on Severe Imbalance (2%):")
print(f"  Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred, zero_division=0):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred):.4f}")
print(f"  F1 Score: {f1_score(y_test, y_pred, zero_division=0):.4f}")
print("\n=> High accuracy but useless model!")
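
scikit-learn ships an equivalent baseline, `DummyClassifier`, and a metric that exposes the problem, `balanced_accuracy_score` (the mean of per-class recall). The sketch below uses toy data chosen for illustration, not the tutorial's dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy data (illustrative): 98 negatives, 2 positives; features are irrelevant to a dummy model
X_demo = np.zeros((100, 1))
y_demo = np.array([0] * 98 + [1] * 2)

# Always predict the most frequent class seen during fit
dummy_clf = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = dummy_clf.predict(X_demo)

acc = accuracy_score(y_demo, y_hat)
bal_acc = balanced_accuracy_score(y_demo, y_hat)
print(f"accuracy={acc:.2f}, balanced accuracy={bal_acc:.2f}")
```

Accuracy is 0.98 because 98 of 100 labels are negative, while balanced accuracy is 0.50: recall is 1.0 on the majority class and 0.0 on the minority class.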

### 1.2 Better Metrics for Imbalanced Data

| Metric | Description | When to Use |
|--------|-------------|-------------|
| **Precision** | TP / (TP + FP) | Cost of false positives is high |
| **Recall** | TP / (TP + FN) | Cost of false negatives is high |
| **F1 Score** | Harmonic mean of precision and recall | Balance precision and recall |
| **AUC-ROC** | Area under the ROC curve | Overall ranking ability |
| **Average Precision** | Summary of the PR curve (weighted mean of precision across thresholds) | Focus on minority class |
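
To make the formulas in the table concrete, the sketch below computes precision, recall, and F1 by hand from confusion-matrix counts and compares them with scikit-learn. The ten labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Illustrative labels: 4 positives, 6 negatives
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# For binary labels, ravel() returns counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # 3 / 5
recall = tp / (tp + fn)      # 3 / 4
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
print("matches sklearn:",
      np.isclose(precision, precision_score(y_true_demo, y_pred_demo)),
      np.isclose(recall, recall_score(y_true_demo, y_pred_demo)),
      np.isclose(f1, f1_score(y_true_demo, y_pred_demo)))
```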

In [None]:
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Comprehensive evaluation for imbalanced data."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    results = {
        'Model': model_name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred)
    }
    
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test)[:, 1]
        results['AUC-ROC'] = roc_auc_score(y_test, y_proba)
        results['Avg Precision'] = average_precision_score(y_test, y_proba)
    
    return results

# Prepare data
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Evaluate standard Logistic Regression
results = evaluate_model(
    LogisticRegression(max_iter=1000, random_state=42),
    X_train_s, X_test_s, y_train, y_test,
    'Standard Logistic Regression'
)

print("Standard Logistic Regression on Imbalanced Data:")
for metric, value in results.items():
    if isinstance(value, float):
        print(f"  {metric}: {value:.4f}")

## 2. Resampling Techniques

### Overview

| Technique | Description | Pros | Cons |
|-----------|-------------|------|------|
| **Random Oversampling** | Duplicate minority samples | Simple | Overfitting |
| **SMOTE** | Synthetic minority samples | Better generalization | May create noise |
| **Random Undersampling** | Remove majority samples | Reduces training time | Information loss |
| **Tomek Links** | Remove borderline majority | Cleaner boundary | Limited reduction |
| **Combination** | SMOTE + Undersampling | Balanced approach | More complex |

### 2.1 Random Oversampling

In [None]:
def random_oversample(X, y, target_ratio=1.0):
    """
    Random oversampling of minority class.
    
    Args:
        X: Features
        y: Labels
        target_ratio: Desired ratio of minority to majority (1.0 = balanced)
    """
    counts = Counter(y)
    minority_class = min(counts, key=counts.get)
    majority_class = max(counts, key=counts.get)
    
    # Get indices for each class
    minority_indices = np.where(y == minority_class)[0]
    majority_indices = np.where(y == majority_class)[0]
    
    # Calculate how many samples to generate
    n_target = int(len(majority_indices) * target_ratio)
    n_to_generate = n_target - len(minority_indices)
    
    if n_to_generate <= 0:
        return X, y
    
    # Randomly sample with replacement
    oversample_indices = np.random.choice(minority_indices, n_to_generate, replace=True)
    
    X_new = np.vstack([X, X[oversample_indices]])
    y_new = np.concatenate([y, y[oversample_indices]])
    
    return X_new, y_new

# Apply random oversampling
X_ros, y_ros = random_oversample(X_train, y_train)

print("Random Oversampling:")
print(f"  Before: {Counter(y_train)}")
print(f"  After: {Counter(y_ros)}")

### 2.2 SMOTE (Synthetic Minority Over-sampling Technique)

In [None]:
def simple_smote(X, y, k=5, target_ratio=1.0):
    """
    Simplified SMOTE implementation.
    Creates synthetic samples by interpolating between minority samples and their neighbors.
    """
    from sklearn.neighbors import NearestNeighbors
    
    counts = Counter(y)
    minority_class = min(counts, key=counts.get)
    majority_class = max(counts, key=counts.get)
    
    X_minority = X[y == minority_class]
    X_majority = X[y == majority_class]
    
    # Calculate how many samples to generate
    n_target = int(len(X_majority) * target_ratio)
    n_to_generate = n_target - len(X_minority)
    
    if n_to_generate <= 0:
        return X, y
    
    # Fit nearest neighbors on minority class
    nn = NearestNeighbors(n_neighbors=k+1)
    nn.fit(X_minority)
    
    # Generate synthetic samples
    synthetic_samples = []
    for _ in range(n_to_generate):
        # Pick a random minority sample
        idx = np.random.randint(0, len(X_minority))
        sample = X_minority[idx]
        
        # Find its neighbors
        _, neighbor_indices = nn.kneighbors([sample])
        neighbor_idx = np.random.choice(neighbor_indices[0][1:])  # Exclude itself
        neighbor = X_minority[neighbor_idx]
        
        # Interpolate
        alpha = np.random.random()
        synthetic = sample + alpha * (neighbor - sample)
        synthetic_samples.append(synthetic)
    
    X_synthetic = np.array(synthetic_samples)
    y_synthetic = np.full(len(synthetic_samples), minority_class)
    
    X_new = np.vstack([X, X_synthetic])
    y_new = np.concatenate([y, y_synthetic])
    
    return X_new, y_new

# Apply SMOTE
X_smote, y_smote = simple_smote(X_train, y_train)

print("SMOTE:")
print(f"  Before: {Counter(y_train)}")
print(f"  After: {Counter(y_smote)}")

In [None]:
# Visualize SMOTE effect in 2D
from sklearn.decomposition import PCA

# Reduce to 2D for visualization
pca = PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train)
X_smote_2d = pca.transform(X_smote)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original
axes[0].scatter(X_train_2d[y_train == 0, 0], X_train_2d[y_train == 0, 1], 
                alpha=0.5, label='Majority', c='steelblue', s=20)
axes[0].scatter(X_train_2d[y_train == 1, 0], X_train_2d[y_train == 1, 1], 
                alpha=0.8, label='Minority', c='coral', s=50)
axes[0].set_title(f'Original (n={len(y_train)})')
axes[0].legend()

# After SMOTE
axes[1].scatter(X_smote_2d[y_smote == 0, 0], X_smote_2d[y_smote == 0, 1], 
                alpha=0.5, label='Majority', c='steelblue', s=20)
axes[1].scatter(X_smote_2d[y_smote == 1, 0], X_smote_2d[y_smote == 1, 1], 
                alpha=0.8, label='Minority (incl. synthetic)', c='coral', s=50)
axes[1].set_title(f'After SMOTE (n={len(y_smote)})')
axes[1].legend()

plt.tight_layout()
plt.show()

### 2.3 Undersampling

In [None]:
def random_undersample(X, y, target_ratio=1.0):
    """
    Random undersampling of majority class.
    """
    counts = Counter(y)
    minority_class = min(counts, key=counts.get)
    majority_class = max(counts, key=counts.get)
    
    minority_indices = np.where(y == minority_class)[0]
    majority_indices = np.where(y == majority_class)[0]
    
    # Calculate target majority size
    n_target = int(len(minority_indices) / target_ratio)
    n_target = min(n_target, len(majority_indices))
    
    # Randomly sample majority class
    selected_majority = np.random.choice(majority_indices, n_target, replace=False)
    
    # Combine
    selected_indices = np.concatenate([minority_indices, selected_majority])
    
    return X[selected_indices], y[selected_indices]

# Apply undersampling
X_under, y_under = random_undersample(X_train, y_train)

print("Random Undersampling:")
print(f"  Before: {Counter(y_train)}")
print(f"  After: {Counter(y_under)}")

In [None]:
def tomek_links_undersample(X, y):
    """
    Remove Tomek links (pairs of different class nearest neighbors).
    Cleans the boundary between classes.
    """
    from sklearn.neighbors import NearestNeighbors
    
    nn = NearestNeighbors(n_neighbors=2)
    nn.fit(X)
    
    _, indices = nn.kneighbors(X)
    
    # Identify the majority class once, outside the loop
    counts = Counter(y)
    majority_class = max(counts, key=counts.get)
    
    # Find Tomek links
    tomek_indices = set()
    for i in range(len(X)):
        neighbor = indices[i, 1]  # Nearest neighbor (excluding self)
        
        # A Tomek link is a pair of mutual nearest neighbors with different labels
        if y[i] != y[neighbor] and indices[neighbor, 1] == i:
            # Remove only the majority-class member of the pair
            tomek_indices.add(i if y[i] == majority_class else neighbor)
    
    # Keep non-Tomek samples
    mask = np.ones(len(X), dtype=bool)
    mask[list(tomek_indices)] = False
    
    return X[mask], y[mask], len(tomek_indices)

# Apply Tomek links
X_tomek, y_tomek, n_removed = tomek_links_undersample(X_train, y_train)

print(f"\nTomek Links Undersampling:")
print(f"  Removed {n_removed} Tomek link samples")
print(f"  Before: {Counter(y_train)}")
print(f"  After: {Counter(y_tomek)}")

### 2.4 Compare Resampling Methods

In [None]:
# Compare all resampling methods
resampling_results = []

# Prepare datasets
# Reuse the scaler fitted on the original training data (transform, not fit_transform),
# so every training set is scaled with the same statistics as X_test_s.
datasets = {
    'Original': (X_train_s, y_train),
    'Random Oversampling': (scaler.transform(X_ros), y_ros),
    'SMOTE': (scaler.transform(X_smote), y_smote),
    'Random Undersampling': (scaler.transform(X_under), y_under),
    'Tomek Links': (scaler.transform(X_tomek), y_tomek)
}

for name, (X_resampled, y_resampled) in datasets.items():
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_resampled, y_resampled)
    y_pred = model.predict(X_test_s)
    y_proba = model.predict_proba(X_test_s)[:, 1]
    
    resampling_results.append({
        'Method': name,
        'Train Size': len(y_resampled),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC-ROC': roc_auc_score(y_test, y_proba)
    })

results_df = pd.DataFrame(resampling_results)
print("Resampling Methods Comparison:")
print(results_df.to_string(index=False))

In [None]:
# Visualize comparison
fig, ax = plt.subplots(figsize=(12, 6))

metrics = ['Precision', 'Recall', 'F1', 'AUC-ROC']
x = np.arange(len(results_df))
width = 0.2

for i, metric in enumerate(metrics):
    ax.bar(x + i*width, results_df[metric], width, label=metric)

ax.set_xticks(x + width * 1.5)
ax.set_xticklabels(results_df['Method'], rotation=45, ha='right')
ax.set_ylabel('Score')
ax.set_title('Resampling Methods Performance Comparison')
ax.legend()
ax.set_ylim([0, 1])

plt.tight_layout()
plt.show()
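
When resampling is combined with cross-validation, the resampling step should see only the training portion of each fold; otherwise copies of validation points leak into training. The sketch below shows one way to do this with plain scikit-learn and NumPy. The helper `oversample_minority`, the synthetic data, and the fold count are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def oversample_minority(X, y, rng):
    """Duplicate random minority rows until both classes have equal counts."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Synthetic data for illustration (about 5% minority)
X_cv, y_cv = make_classification(n_samples=2000, n_features=10,
                                 weights=[0.95, 0.05], random_state=0)
rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_ap = []
for train_idx, val_idx in cv.split(X_cv, y_cv):
    # Fit the scaler and resample using the training fold only
    fold_scaler = StandardScaler().fit(X_cv[train_idx])
    X_tr, y_tr = oversample_minority(X_cv[train_idx], y_cv[train_idx], rng)
    clf = LogisticRegression(max_iter=1000).fit(fold_scaler.transform(X_tr), y_tr)

    # The validation fold keeps its original class distribution
    proba = clf.predict_proba(fold_scaler.transform(X_cv[val_idx]))[:, 1]
    fold_ap.append(average_precision_score(y_cv[val_idx], proba))

print("Average precision per fold:", np.round(fold_ap, 3))
```

If imbalanced-learn is installed, its `Pipeline` (imported earlier as `ImbPipeline`) is intended to handle this fold-wise resampling automatically.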

## 3. Class-Weighted Loss Functions

Instead of resampling, we can adjust the loss function to penalize misclassification of the minority class more heavily.

In [None]:
def compute_balanced_weights(y):
    """Compute class weights inversely proportional to class frequencies."""
    classes = np.unique(y)
    weights = compute_class_weight('balanced', classes=classes, y=y)
    return dict(zip(classes, weights))

# Calculate balanced weights
class_weights = compute_balanced_weights(y_train)
print(f"Computed class weights: {class_weights}")
print(f"Minority class is weighted {class_weights[1]/class_weights[0]:.1f}x more")
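
The `'balanced'` option follows the heuristic `n_samples / (n_classes * np.bincount(y))` documented by scikit-learn. The short sketch below reproduces it by hand on toy labels (illustrative data, assuming integer labels starting at 0).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative): 90 negatives, 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_demo)

# 'balanced' heuristic: n_samples / (n_classes * count of each class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))
from_sklearn = compute_class_weight("balanced", classes=classes, y=y_demo)

print("manual:      ", manual)
print("from sklearn:", from_sklearn)
```

With a 90/10 split the weights come out to roughly 0.56 and 5.0, so each minority sample counts nine times as much as a majority sample in the loss.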

In [None]:
# Compare weighted vs unweighted models
models = {
    'Unweighted Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Weighted Logistic Regression': LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42),
    'Unweighted Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Weighted Random Forest': RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
}

weight_results = []
for name, model in models.items():
    result = evaluate_model(model, X_train_s, X_test_s, y_train, y_test, name)
    weight_results.append(result)

weight_df = pd.DataFrame(weight_results)
print("Class Weighting Comparison:")
print(weight_df.to_string(index=False))

In [None]:
# Custom class weights
custom_weights = {0: 1, 1: 10}  # 10x weight for minority class

lr_custom = LogisticRegression(class_weight=custom_weights, max_iter=1000, random_state=42)
lr_custom.fit(X_train_s, y_train)
y_pred_custom = lr_custom.predict(X_test_s)

print("\nCustom Weights (1:10):")
print(f"  Precision: {precision_score(y_test, y_pred_custom):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_custom):.4f}")
print(f"  F1: {f1_score(y_test, y_pred_custom):.4f}")
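
A class-weight dictionary can also be expressed as per-sample weights, which is useful for estimators that accept `sample_weight` in `fit` but have no `class_weight` parameter. The sketch below, on synthetic data chosen for illustration, checks that the two formulations give the same logistic regression coefficients up to solver tolerance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration (about 10% minority)
X_w, y_w = make_classification(n_samples=1000, n_features=8,
                               weights=[0.9, 0.1], random_state=0)

# Option 1: weights specified per class
by_class = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X_w, y_w)

# Option 2: the same weights expanded to one value per sample
per_sample = np.where(y_w == 1, 10.0, 1.0)
by_sample = LogisticRegression(max_iter=1000).fit(X_w, y_w, sample_weight=per_sample)

max_diff = np.max(np.abs(by_class.coef_ - by_sample.coef_))
print(f"Largest coefficient difference: {max_diff:.2e}")
```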

## 4. Focal Loss

Focal loss is designed for extreme class imbalance. It down-weights easy examples and focuses on hard examples.

**Formula**: FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)

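Where:
- p_t is the predicted probability for the true class (p if y = 1, otherwise 1 - p)
- γ (gamma) is the focusing parameter (typically 2); γ = 0 recovers weighted cross-entropy
- α_t is the class balancing weight: α for the positive class and 1 - α for the negative class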
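
A quick numeric check of the modulating factor (1 - p_t)^γ shows how strongly easy examples are down-weighted. This sketch sets α_t to 1 to isolate the focusing term; the three probabilities are arbitrary illustrative values.

```python
import numpy as np

gamma = 2.0
# Probability assigned to the true class: easy, uncertain, and badly wrong examples
p_t = np.array([0.9, 0.6, 0.1])

modulating = (1 - p_t) ** gamma   # focusing term (1 - p_t)^gamma
ce = -np.log(p_t)                 # plain cross-entropy per example
fl = modulating * ce              # focal loss with alpha_t = 1

for p, m, c, f in zip(p_t, modulating, ce, fl):
    print(f"p_t={p:.1f}  factor={m:.2f}  CE={c:.4f}  FL={f:.4f}")
```

The easy example (p_t = 0.9) keeps only 1% of its cross-entropy loss, while the badly misclassified one (p_t = 0.1) keeps 81%.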

In [None]:
def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    """
    Compute focal loss for binary classification.
    
    Args:
        y_true: True labels (0 or 1)
        y_pred: Predicted probabilities
        gamma: Focusing parameter (default: 2)
        alpha: Class balancing weight for positive class
    """
    epsilon = 1e-7
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    # Cross entropy
    ce = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
    
    # Focal weight
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    focal_weight = (1 - p_t) ** gamma
    
    # Alpha weight
    alpha_weight = y_true * alpha + (1 - y_true) * (1 - alpha)
    
    # Focal loss
    fl = alpha_weight * focal_weight * ce
    
    return np.mean(fl)

# Compare CE and Focal Loss
y_true = np.array([1, 1, 0, 0])
y_pred_easy = np.array([0.95, 0.90, 0.10, 0.05])  # Easy predictions
y_pred_hard = np.array([0.60, 0.55, 0.40, 0.45])  # Hard predictions

print("Loss Comparison:")
print(f"\nEasy predictions (confident):")
ce_easy = -np.mean(y_true * np.log(y_pred_easy + 1e-7) + (1 - y_true) * np.log(1 - y_pred_easy + 1e-7))
print(f"  Cross-Entropy: {ce_easy:.4f}")
print(f"  Focal Loss (γ=2): {focal_loss(y_true, y_pred_easy, gamma=2):.4f}")

print(f"\nHard predictions (uncertain):")
ce_hard = -np.mean(y_true * np.log(y_pred_hard + 1e-7) + (1 - y_true) * np.log(1 - y_pred_hard + 1e-7))
print(f"  Cross-Entropy: {ce_hard:.4f}")
print(f"  Focal Loss (γ=2): {focal_loss(y_true, y_pred_hard, gamma=2):.4f}")

In [None]:
# Visualize focal loss vs cross-entropy
p = np.linspace(0.01, 0.99, 100)

# For true class = 1
ce_loss = -np.log(p)
fl_gamma1 = -(1 - p) * np.log(p)
fl_gamma2 = -((1 - p) ** 2) * np.log(p)
fl_gamma5 = -((1 - p) ** 5) * np.log(p)

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(p, ce_loss, 'b-', label='Cross-Entropy', linewidth=2)
ax.plot(p, fl_gamma1, 'r--', label='Focal (γ=1)', linewidth=2)
ax.plot(p, fl_gamma2, 'g--', label='Focal (γ=2)', linewidth=2)
ax.plot(p, fl_gamma5, 'm--', label='Focal (γ=5)', linewidth=2)

ax.set_xlabel('Predicted Probability for True Class')
ax.set_ylabel('Loss')
ax.set_title('Cross-Entropy vs Focal Loss (True Class = 1)')
ax.legend()
ax.set_ylim([0, 5])
ax.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)

plt.tight_layout()
plt.show()

In [None]:
class FocalLossClassifier:
    """
    Simple logistic regression trained with focal loss using gradient descent.
    """
    
    def __init__(self, gamma=2.0, alpha=0.25, lr=0.01, n_iter=1000):
        self.gamma = gamma
        self.alpha = alpha
        self.lr = lr
        self.n_iter = n_iter
        self.weights = None
        self.bias = None
    
    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -250, 250)))
    
    def _focal_loss_gradient(self, y, p):
        """Per-sample gradient of focal loss with respect to the logit z, where p = sigmoid(z)."""
        epsilon = 1e-7
        p = np.clip(p, epsilon, 1 - epsilon)
        
        # Focal loss components
        p_t = y * p + (1 - y) * (1 - p)
        focal_weight = (1 - p_t) ** self.gamma
        alpha_weight = y * self.alpha + (1 - y) * (1 - self.alpha)
        
        # Chain rule: dFL/dz = dFL/dp_t * dp_t/dz, with dp_t/dz = (2y - 1) * p_t * (1 - p_t).
        # With gamma = 0 this reduces to the weighted cross-entropy gradient alpha_t * (p - y).
        sign = 2 * y - 1
        grad = alpha_weight * sign * focal_weight * (
            self.gamma * p_t * np.log(p_t) - (1 - p_t)
        )
        
        return grad
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        for _ in range(self.n_iter):
            z = np.dot(X, self.weights) + self.bias
            p = self._sigmoid(z)
            
            grad = self._focal_loss_gradient(y, p)
            
            self.weights -= self.lr * np.dot(X.T, grad) / n_samples
            self.bias -= self.lr * np.mean(grad)
        
        return self
    
    def predict_proba(self, X):
        z = np.dot(X, self.weights) + self.bias
        p = self._sigmoid(z)
        return np.column_stack([1 - p, p])
    
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)

# Train focal loss classifier
fl_clf = FocalLossClassifier(gamma=2.0, alpha=0.75, lr=0.1, n_iter=1000)
fl_clf.fit(X_train_s, y_train)
y_pred_fl = fl_clf.predict(X_test_s)

print("Focal Loss Classifier:")
print(f"  Precision: {precision_score(y_test, y_pred_fl):.4f}")
print(f"  Recall: {recall_score(y_test, y_pred_fl):.4f}")
print(f"  F1: {f1_score(y_test, y_pred_fl):.4f}")
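
Hand-derived gradients are easy to get wrong, so it is worth comparing them against finite differences before trusting a custom training loop. The sketch below does this for the focal loss as a function of the logit z. The function names and test points are illustrative assumptions, and the analytic expression is one derivation via the chain rule with p = sigmoid(z).

```python
import numpy as np

def focal_loss_per_sample(z, y, gamma=2.0, alpha=0.25):
    """Focal loss for each sample as a function of the logit z."""
    p = 1.0 / (1.0 + np.exp(-z))
    p_t = y * p + (1 - y) * (1 - p)
    alpha_t = y * alpha + (1 - y) * (1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

def focal_grad_wrt_logit(z, y, gamma=2.0, alpha=0.25):
    """Analytic dFL/dz, using dp_t/dz = (2y - 1) * p_t * (1 - p_t)."""
    p = 1.0 / (1.0 + np.exp(-z))
    p_t = y * p + (1 - y) * (1 - p)
    alpha_t = y * alpha + (1 - y) * (1 - alpha)
    sign = 2 * y - 1
    return alpha_t * sign * (1 - p_t) ** gamma * (gamma * p_t * np.log(p_t) - (1 - p_t))

# Arbitrary logits and labels for the check
z = np.array([-2.0, -0.5, 0.3, 1.5])
y = np.array([1, 0, 1, 0])

# Central finite differences approximate the derivative numerically
h = 1e-6
numeric = (focal_loss_per_sample(z + h, y) - focal_loss_per_sample(z - h, y)) / (2 * h)
analytic = focal_grad_wrt_logit(z, y)

max_err = np.max(np.abs(numeric - analytic))
print(f"Largest gradient mismatch: {max_err:.2e}")
```

A mismatch many orders of magnitude above the step size squared would indicate an error in the analytic formula.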

## 5. Threshold Adjustment

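Another approach is to adjust the classification threshold instead of using the default 0.5. With a rare positive class, predicted probabilities are often well below 0.5 even for true positives, so a lower threshold can recover recall.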

In [None]:
# Train a model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_s, y_train)
y_proba = model.predict_proba(X_test_s)[:, 1]

# Evaluate at different thresholds
thresholds = np.arange(0.1, 0.9, 0.1)
threshold_results = []

for thresh in thresholds:
    y_pred = (y_proba >= thresh).astype(int)
    threshold_results.append({
        'Threshold': thresh,
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred)
    })

thresh_df = pd.DataFrame(threshold_results)
print("Effect of Classification Threshold:")
print(thresh_df.to_string(index=False))

In [None]:
# Visualize precision-recall trade-off
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_test, y_proba)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Precision-Recall curve
axes[0].plot(recall_curve, precision_curve, 'b-', linewidth=2)
axes[0].set_xlabel('Recall')
axes[0].set_ylabel('Precision')
axes[0].set_title(f'Precision-Recall Curve (AP = {average_precision_score(y_test, y_proba):.3f})')

# Threshold vs metrics
axes[1].plot(thresh_df['Threshold'], thresh_df['Precision'], 'b-o', label='Precision', linewidth=2)
axes[1].plot(thresh_df['Threshold'], thresh_df['Recall'], 'r-o', label='Recall', linewidth=2)
axes[1].plot(thresh_df['Threshold'], thresh_df['F1'], 'g-o', label='F1', linewidth=2)
axes[1].set_xlabel('Classification Threshold')
axes[1].set_ylabel('Score')
axes[1].set_title('Metrics vs Threshold')
axes[1].legend()
axes[1].axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()

In [None]:
def find_optimal_threshold(y_true, y_proba, metric='f1'):
    """Find the threshold that maximizes the specified metric ('f1', 'precision', or 'recall')."""
    metric_fns = {'f1': f1_score, 'precision': precision_score, 'recall': recall_score}
    if metric not in metric_fns:
        raise ValueError(f"Unknown metric: {metric!r}")
    score_fn = metric_fns[metric]
    
    thresholds = np.arange(0.05, 0.95, 0.01)
    best_thresh, best_score = 0.5, 0
    
    for thresh in thresholds:
        y_pred = (y_proba >= thresh).astype(int)
        # zero_division=0 scores a threshold with no positive predictions as 0
        score = score_fn(y_true, y_pred, zero_division=0)
        
        if score > best_score:
            best_score = score
            best_thresh = thresh
    
    return best_thresh, best_score

# Note: the threshold is selected on the test set here for brevity, so the reported F1 is
# optimistic. In practice, select it on a validation split or with cross-validation.
optimal_thresh, optimal_f1 = find_optimal_threshold(y_test, y_proba, 'f1')
print(f"Optimal threshold for F1: {optimal_thresh:.2f} (F1 = {optimal_f1:.4f})")
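
A threshold chosen on the same data used for the final report tends to look better than it is. One common remedy is a three-way split: fit on the training set, choose the threshold on a validation set, and report once on the test set. The sketch below assumes synthetic data and a simple grid of candidate thresholds; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration (about 5% minority)
X_all, y_all = make_classification(n_samples=5000, n_features=10,
                                   weights=[0.95, 0.05], random_state=0)

# Split into train (60%), validation (20%), and test (20%), stratified by class
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Choose the threshold that maximizes F1 on the validation set only
val_proba = clf.predict_proba(X_val)[:, 1]
candidates = np.linspace(0.05, 0.95, 91)
val_f1 = [f1_score(y_val, (val_proba >= t).astype(int), zero_division=0)
          for t in candidates]
chosen = float(candidates[int(np.argmax(val_f1))])

# Report once on the untouched test set
test_pred = (clf.predict_proba(X_te)[:, 1] >= chosen).astype(int)
test_f1 = f1_score(y_te, test_pred, zero_division=0)
print(f"Chosen threshold: {chosen:.2f}, test F1: {test_f1:.4f}")
```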

## 6. Evaluation Strategies for Imbalanced Data

In [None]:
# Comprehensive evaluation
def comprehensive_evaluation(y_true, y_pred, y_proba=None):
    """Generate comprehensive metrics for imbalanced classification."""
    print("Confusion Matrix:")
    cm = confusion_matrix(y_true, y_pred)
    print(cm)
    
    tn, fp, fn, tp = cm.ravel()
    print(f"\nTrue Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")
    
    print(f"\nMetrics:")
    print(f"  Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print(f"  Balanced Accuracy: {(tp/(tp+fn) + tn/(tn+fp))/2:.4f}")
    print(f"  Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"  Recall (Sensitivity): {recall_score(y_true, y_pred):.4f}")
    print(f"  Specificity: {tn/(tn+fp):.4f}")
    print(f"  F1 Score: {f1_score(y_true, y_pred):.4f}")
    
    if y_proba is not None:
        print(f"  AUC-ROC: {roc_auc_score(y_true, y_proba):.4f}")
        print(f"  Average Precision: {average_precision_score(y_true, y_proba):.4f}")

# Best model evaluation
best_model = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
best_model.fit(X_train_s, y_train)
y_pred_best = best_model.predict(X_test_s)
y_proba_best = best_model.predict_proba(X_test_s)[:, 1]

comprehensive_evaluation(y_test, y_pred_best, y_proba_best)

In [None]:
# Visualize confusion matrix
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Counts
cm = confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix (Counts)')

# Normalized
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_norm, annot=True, fmt='.2%', cmap='Blues', ax=axes[1])
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_title('Confusion Matrix (Normalized)')

plt.tight_layout()
plt.show()

In [None]:
# ROC and PR curves comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Compare models
models_to_compare = {
    'Unweighted LR': LogisticRegression(max_iter=1000, random_state=42),
    'Weighted LR': LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42),
    'SMOTE + LR': LogisticRegression(max_iter=1000, random_state=42)
}

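# Prepare SMOTE data, scaled with statistics from the original training set
# (the same statistics that produced X_test_s)
X_smote_s = StandardScaler().fit(X_train).transform(X_smote)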

for name, model in models_to_compare.items():
    if 'SMOTE' in name:
        model.fit(X_smote_s, y_smote)
    else:
        model.fit(X_train_s, y_train)
    
    y_proba = model.predict_proba(X_test_s)[:, 1]
    
    # ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)
    axes[0].plot(fpr, tpr, label=f'{name} (AUC={auc:.3f})', linewidth=2)
    
    # PR curve
    prec, rec, _ = precision_recall_curve(y_test, y_proba)
    ap = average_precision_score(y_test, y_proba)
    axes[1].plot(rec, prec, label=f'{name} (AP={ap:.3f})', linewidth=2)

axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.5)
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('ROC Curves')
axes[0].legend()

axes[1].axhline(y=Counter(y_test)[1]/len(y_test), color='k', linestyle='--', alpha=0.5, label='Baseline')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curves')
axes[1].legend()

plt.tight_layout()
plt.show()
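
With only a handful of minority samples in a single test split, metric estimates can be noisy. Stratified k-fold cross-validation keeps the class ratio in every fold and gives a spread of scores rather than one number. The sketch below uses `cross_val_score` with the `"average_precision"` scorer on synthetic data; the dataset and fold count are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration (about 3% minority)
X_cv, y_cv = make_classification(n_samples=3000, n_features=10,
                                 weights=[0.97, 0.03], random_state=0)

# Putting the scaler in a pipeline refits it on each training fold
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
ap_scores = cross_val_score(pipe, X_cv, y_cv, cv=cv, scoring="average_precision")

print("Average precision per fold:", np.round(ap_scores, 3))
print(f"Mean: {ap_scores.mean():.3f}, std: {ap_scores.std():.3f}")
```

Reporting the mean together with the standard deviation makes it easier to judge whether a difference between two approaches is larger than fold-to-fold noise.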

## 7. Hands-on Exercise

In [None]:
# Exercise: Complete Imbalanced Classification Pipeline
print("Exercise: Build a Pipeline for Highly Imbalanced Data")
print("=" * 60)

# Create a very imbalanced dataset (1% minority)
X_exercise, y_exercise = create_imbalanced_dataset(n_samples=10000, imbalance_ratio=0.01)

X_train_ex, X_test_ex, y_train_ex, y_test_ex = train_test_split(
    X_exercise, y_exercise, test_size=0.2, random_state=42, stratify=y_exercise
)

scaler_ex = StandardScaler()
X_train_ex_s = scaler_ex.fit_transform(X_train_ex)
X_test_ex_s = scaler_ex.transform(X_test_ex)

print(f"Training set class distribution: {Counter(y_train_ex)}")
print(f"Test set class distribution: {Counter(y_test_ex)}")

In [None]:
# Step 1: Baseline
baseline = LogisticRegression(max_iter=1000, random_state=42)
baseline.fit(X_train_ex_s, y_train_ex)
y_pred_baseline = baseline.predict(X_test_ex_s)

print("Step 1: Baseline Model")
print(f"  F1 Score: {f1_score(y_test_ex, y_pred_baseline):.4f}")
print(f"  Recall: {recall_score(y_test_ex, y_pred_baseline):.4f}")

In [None]:
# Step 2: Apply SMOTE
X_train_smote, y_train_smote = simple_smote(X_train_ex, y_train_ex)
# Reuse the scaler fitted on the original training data so the scaling matches X_test_ex_s
X_train_smote_s = scaler_ex.transform(X_train_smote)

smote_model = LogisticRegression(max_iter=1000, random_state=42)
smote_model.fit(X_train_smote_s, y_train_smote)
y_pred_smote = smote_model.predict(X_test_ex_s)

print("Step 2: With SMOTE")
print(f"  F1 Score: {f1_score(y_test_ex, y_pred_smote):.4f}")
print(f"  Recall: {recall_score(y_test_ex, y_pred_smote):.4f}")

In [None]:
# Step 3: Class weighting
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
weighted_model.fit(X_train_ex_s, y_train_ex)
y_pred_weighted = weighted_model.predict(X_test_ex_s)

print("Step 3: With Class Weighting")
print(f"  F1 Score: {f1_score(y_test_ex, y_pred_weighted):.4f}")
print(f"  Recall: {recall_score(y_test_ex, y_pred_weighted):.4f}")

In [None]:
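# Step 4: Threshold optimization
# Note: tuned on the test set for brevity, which makes the reported F1 optimistic;
# in practice, tune on a validation split.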
y_proba_ex = weighted_model.predict_proba(X_test_ex_s)[:, 1]
opt_thresh, opt_f1 = find_optimal_threshold(y_test_ex, y_proba_ex, 'f1')
y_pred_opt = (y_proba_ex >= opt_thresh).astype(int)

print("Step 4: With Optimized Threshold")
print(f"  Optimal Threshold: {opt_thresh:.2f}")
print(f"  F1 Score: {f1_score(y_test_ex, y_pred_opt):.4f}")
print(f"  Recall: {recall_score(y_test_ex, y_pred_opt):.4f}")

In [None]:
# Compare all approaches
print("\nFinal Comparison:")
print("=" * 50)
approaches = [
    ('Baseline', y_pred_baseline),
    ('SMOTE', y_pred_smote),
    ('Class Weighting', y_pred_weighted),
    ('Optimized Threshold', y_pred_opt)
]

for name, y_pred in approaches:
    print(f"{name}: F1={f1_score(y_test_ex, y_pred):.4f}, Recall={recall_score(y_test_ex, y_pred):.4f}")

## 8. Summary

### Key Takeaways

1. **Standard accuracy is misleading** for imbalanced data
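2. **Use appropriate metrics**: F1, precision, recall, and average precision (area under the PR curve)
3. **Resampling options**: SMOTE for oversampling, random/Tomek for undersampling; resample the training data only
4. **Class weighting** is often simpler than resampling and frequently performs comparably
5. **Focal loss** targets extreme imbalance by down-weighting easy examples
6. **Threshold optimization** can significantly improve results when the threshold is tuned on validation data rather than the test set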

### Decision Guide

| Imbalance Level (minority share) | Recommended Approach |
|-----------------|---------------------|
| Mild (20-40%) | Class weighting |
| Moderate (5-20%) | SMOTE + Class weighting |
| Severe (1-5%) | SMOTE + Focal loss + Threshold tuning |
| Extreme (<1%) | Anomaly detection or specialized methods |