# Daily Blog #67 - Class Imbalance + SMOTE (Synthetic Minority Oversampling Technique)
### July 6, 2025 

### The Problem

In classification, **imbalanced data** means some classes appear much more frequently than others.

Example:

| Class                      | Count  |
| -------------------------- | ------ |
| Legitimate Transaction (0) | 98,000 |
| Fraudulent Transaction (1) | 2,000  |

A model can reach **98% accuracy by predicting 'legit' for every transaction**, yet it catches zero fraud and is completely useless.

### Why It's a Problem

* **Bias toward the majority class**: Most models minimize an average loss over all samples, so the majority class dominates training and minority errors barely register.
* **Accuracy becomes misleading**: the majority class alone determines most of the score.
* Class 1 (minority) is often the one we care about most (fraud, cancer, failure, etc.).


## Fixes

### 1. **Use Better Metrics**

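* **Precision**: Of the samples predicted positive, the fraction that are *actually* positive.
* **Recall**: Of the actual positives, the fraction the model *correctly* identifies.
* **F1 Score**: Harmonic mean of precision and recall.
* **AUC-ROC / PR Curve**: Threshold-independent summaries for binary classification. The PR curve is usually the more informative of the two when positives are rare.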
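To make the accuracy trap concrete, here is a small sketch that computes these metrics from confusion-matrix counts for the "always predict legit" model on the table above. The helper name `summarize` is just for illustration, and treating an undefined precision (0/0) as 0.0 is a convention I'm assuming here.

```python
def summarize(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# "Always legit" on 98,000 legitimate and 2,000 fraudulent transactions:
# every fraud case is a false negative, every legit case a true negative.
always_legit = summarize(tp=0, fp=0, fn=2_000, tn=98_000)
print(always_legit)
# {'accuracy': 0.98, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

Accuracy looks great while recall and F1 are zero, which is exactly why the minority-class metrics matter.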


### 2. **Resampling Techniques**

#### A. **Undersampling**

* Randomly remove majority-class samples until the classes are balanced.
* Risk: You throw away potentially useful data.
* Use when majority data is plentiful and training speed matters.
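A minimal NumPy sketch of random undersampling, assuming a binary problem and toy data with the same 98:2 ratio as the table above. The function name `random_undersample` is hypothetical; in practice a library sampler would do the same job.

```python
import numpy as np


def random_undersample(X, y, majority_label, rng):
    """Drop random majority rows until both classes have the same count."""
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    # Keep as many majority rows as there are minority rows, without repeats
    kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    keep = np.sort(np.concatenate([kept_majority, minority_idx]))
    return X[keep], y[keep]


rng = np.random.default_rng(0)
y = np.array([0] * 980 + [1] * 20)      # toy labels, 98:2 ratio
X = rng.normal(size=(y.size, 3))        # toy features

X_under, y_under = random_undersample(X, y, majority_label=0, rng=rng)
print(np.bincount(y_under))  # [20 20]
```

Note how 960 of the 980 majority rows are simply discarded, which is the data-loss risk mentioned above.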

#### B. **Oversampling (Naive)**

* Duplicate minority class instances.
* Risk: Overfitting to repeated examples.
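Here is a rough NumPy sketch of naive oversampling on toy data (the function name `random_oversample` is hypothetical). It also counts the distinct minority rows afterwards, to show that the classifier still only sees the same handful of real examples.

```python
import numpy as np


def random_oversample(X, y, minority_label, rng):
    """Duplicate random minority rows until both classes have the same count."""
    minority_idx = np.flatnonzero(y == minority_label)
    n_majority = y.size - minority_idx.size
    # Sample with replacement: these rows are exact copies of existing ones
    extra = rng.choice(minority_idx, size=n_majority - minority_idx.size, replace=True)
    keep = np.concatenate([np.arange(y.size), extra])
    return X[keep], y[keep]


rng = np.random.default_rng(0)
y = np.array([0] * 980 + [1] * 20)      # toy labels, 98:2 ratio
X = rng.normal(size=(y.size, 3))        # toy features

X_over, y_over = random_oversample(X, y, minority_label=1, rng=rng)
print(np.bincount(y_over))                              # [980 980]
print(np.unique(X_over[y_over == 1], axis=0).shape[0])  # 20 distinct minority rows
```

The class counts are balanced, but there are still only 20 unique minority points, so the model can memorize them.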

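#### C. **SMOTE**

* Generate synthetic minority samples rather than duplicating real ones; the next section covers it in detail.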

## SMOTE: Synthetic Minority Oversampling Technique

**SMOTE generates synthetic examples** of the minority class by interpolating between real minority samples.

### How It Works (Simplified):

1. Pick a minority class sample.
2. Find its **k nearest minority class neighbors**.
3. Choose one neighbor at random.
4. Create a synthetic point at a random position *on the line segment between the two* in feature space.
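The four steps above can be sketched in plain NumPy. This is a simplified illustration, not the `imblearn` implementation: the function name `smote_sketch` is made up, it uses brute-force Euclidean distances, and it assumes at least two minority samples with purely numeric features.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic points from minority samples X_min."""
    rng = np.random.default_rng() if rng is None else rng
    n = X_min.shape[0]
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Pairwise squared distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]    # step 2: k nearest minority neighbors

    base = rng.integers(0, n, size=n_new)                      # step 1: pick samples
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]   # step 3: random neighbor
    lam = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + lam * (X_min[chosen] - X_min[base])   # step 4: interpolate


rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))   # toy minority samples
synthetic = smote_sketch(X_min, n_new=10, k=5, rng=rng)
print(synthetic.shape)  # (10, 2)
```

Because every synthetic point is a convex combination of two real points, it always lies within the range spanned by the real minority data.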


### When to Use SMOTE?

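* Works best when the features are continuous, since interpolating between values is meaningful there.
* Not suited to categorical features on its own: interpolating category codes produces meaningless values. SMOTENC handles mixed continuous and categorical data.
* Combine with **Tomek Links** or **Edited Nearest Neighbors** to clean noisy samples (SMOTE + Tomek often works well in practice).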
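For intuition on the cleaning step: a Tomek link is a pair of samples from opposite classes that are each other's nearest neighbor, so it marks a spot where the classes touch or overlap. Below is a brute-force sketch for finding such pairs on a toy 1-D dataset; the function name `tomek_links` is hypothetical, and a cleaning step would then drop one or both points of each pair.

```python
import numpy as np


def tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbors with different labels."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)       # ignore self-distance
    nn = dist.argmin(axis=1)             # nearest neighbor of every sample
    links = []
    for i in range(len(X)):
        j = int(nn[i])
        # mutual nearest neighbors, opposite classes, and report each pair once
        if nn[j] == i and y[i] != y[j] and i < j:
            links.append((i, j))
    return links


X = np.array([[0.0], [1.0], [2.1], [2.5], [5.0]])   # toy 1-D features
y = np.array([0, 0, 0, 1, 1])
print(tomek_links(X, y))  # [(2, 3)]
```

Points 2 and 3 sit right on the class boundary, which is the kind of borderline pair that SMOTE + Tomek removes after oversampling.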


### Code Example in Python

```python
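from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Placeholder data so the example runs end to end;
# swap in your own X (features) and y (labels).
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)

# Split first, keeping the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

print("Before SMOTE:", Counter(y_train))

# Oversample the training set only; the test set keeps its real distribution
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print("After SMOTE:", Counter(y_resampled))

clf = RandomForestClassifier(random_state=42)
clf.fit(X_resampled, y_resampled)
y_pred = clf.predict(X_test)

# Per-class precision, recall and F1 on the untouched test set
print(classification_report(y_test, y_pred))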
```

### Pro Tips:

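* **Always apply SMOTE *after* the train-test split**. If you oversample before splitting, synthetic points built from test samples leak into training and inflate your scores.
* Use **pipeline + cross-validation** to avoid leakage: with SMOTE inside an `imblearn` pipeline, it is refit on the training folds only.
* Check **ROC and PR curves**, not just accuracy.
* Try **class weighting** as an alternative to resampling (like `class_weight='balanced'` in `LogisticRegression` or `RandomForestClassifier`).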
