# ⚠️ Problems Caused by Imbalanced Data in Classification Tasks

---

## 📌 What is Imbalanced Data?

**Imbalanced data** occurs when one class (or category) in a dataset has **significantly more samples** than the other(s).  
In such cases, the model tends to favor the majority class, leading to poor performance on the minority class — which is often the class of most interest (e.g., fraud, disease, or failure cases).

---

## ⚖️ Example

Imagine a binary classification dataset for **fraud detection**:
- 9,900 transactions are **legitimate**.
- 100 transactions are **fraudulent**.

The classes are split **99% vs. 1%**.

If a naive model simply predicts *“legitimate”* for every case:
- It reaches **99% accuracy**,  
  yet **0% recall on fraud**, because it never flags a single fraudulent transaction.
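
The numbers in this example can be checked with a few lines of plain Python. This is a minimal sketch: the label lists are constructed here to match the counts above and are not real transaction data.

```python
# Labels matching the example: 0 = legitimate, 1 = fraudulent
y_true = [0] * 9_900 + [1] * 100

# A naive "model" that predicts the majority class for every transaction
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on fraud: share of actual fraud cases that were flagged
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(y_true)

print(f"Accuracy:     {accuracy:.2%}")      # 99.00%
print(f"Fraud recall: {fraud_recall:.2%}")  # 0.00%
```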

---

## 🚧 Challenges with Imbalanced Data

### 1. 🧠 **Bias Toward the Majority Class**
- Most learning algorithms minimize the **average error over all samples**, so the majority class dominates the training objective.
- The model learns patterns that favor the frequent class and largely ignores the rare one.
- As a result, **minority-class samples are frequently misclassified** as the majority class.

**Example:**  
In disease detection, the model predicts “healthy” for almost all cases, missing critical rare disease cases.

---

### 2. 📉 **Misleading Evaluation Metrics**
- Common metrics like **accuracy** can be highly deceptive in imbalanced datasets.  
  For example, a model that predicts all samples as the majority class can still appear to perform well by accuracy.

#### ✅ Better Metrics for Imbalanced Data:
- **Precision:** Fraction of predicted positives that are correct.  
- **Recall (Sensitivity):** Fraction of actual positives correctly identified.  
- **F1-Score:** Harmonic mean of precision and recall.  
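- **AUC-ROC (Area Under the ROC Curve):** Summarizes the trade-off between true positive rate and false positive rate across all decision thresholds; 0.5 corresponds to random ranking and 1.0 to perfect ranking.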
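
One way to build intuition for AUC-ROC is its ranking interpretation: it equals the share of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes that share directly. The function name `pairwise_auc` and the scores are made up for illustration. In practice scikit-learn's `roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC.")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: higher means "more likely positive"
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.7, 0.9]

# 13 of the 15 positive/negative pairs are ordered correctly
print(pairwise_auc(y_true, scores))
```

The double loop is quadratic in the number of samples, so it is only suitable for small illustrations.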

**Example (Accuracy Trap):**

| Class | Actual Count | Model Prediction | Correct Predictions |
|--------|---------------|------------------|------------------------|
| Majority (0) | 950 | All predicted as 0 | ✅ 950 |
| Minority (1) | 50 | All predicted as 0 | ❌ 0 |
| **Total** | 1,000 | | **950 (95% accuracy, but 0% recall for the minority class)** |

---

### 3. 🔍 **Limited Information for Minority Class**
- The **minority class** may have too few samples for the model to learn meaningful patterns.
- The model either **underfits** the minority class or memorizes the few examples it has, and in both cases it **generalizes poorly** to new minority cases.

**Consequences:**
- The model may never learn what distinguishes rare events (like fraud, diseases, or machine faults).
- Synthetic data generation (e.g., **SMOTE**, **ADASYN**) may be needed to create balance.
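
The core idea behind SMOTE-style oversampling is to create new minority points by interpolating between an existing minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea in plain Python, not the implementation used by any library. The function name `smote_like_samples`, the value of `k`, and the sample points are all assumptions made for illustration.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies.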

---

### 4. 📊 **Applications Commonly Affected by Imbalanced Data**
Imbalanced data is common in **high-stakes, real-world applications** where rare events are critical to detect.

| Application | Majority Class | Minority Class |
|--------------|----------------|----------------|
| **Fraud Detection** | Legitimate transactions | Fraudulent transactions |
| **Medical Diagnosis** | Healthy patients | Disease cases |
| **Anomaly Detection** | Normal system behavior | Faults or cyberattacks |
| **Credit Scoring** | Loan repayments | Loan defaults |

These cases require models that **perform well on the minority class**, even if overall accuracy drops.

---

## 🧠 Summary

| Problem | Description | Consequence |
|----------|--------------|--------------|
| **Bias Toward Majority Class** | Model prioritizes frequent classes | Poor detection of rare events |
| **Misleading Metrics** | Accuracy hides poor recall for rare cases | False sense of success |
| **Limited Minority Samples** | Insufficient data to learn patterns | Underfitting of minority class |
| **Real-World Risk** | Errors on rare events are costly | Financial, medical, or operational losses |

---

## ✅ Key Takeaways

- Always **inspect class distributions** before training models.
- Avoid relying solely on **accuracy** — use **precision, recall, F1, ROC-AUC**, or **confusion matrices**.
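- Consider **resampling techniques** (oversampling the minority class, undersampling the majority class), applied to the **training split only** so the test set keeps the real class distribution.
- Use **cost-sensitive learning** (e.g., class weights, or XGBoost with `scale_pos_weight`) or **ensemble methods** designed for imbalance (e.g., Balanced Random Forest).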
- Imbalanced datasets require **special care in model evaluation and feature engineering**.
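
As a starting point for the first takeaway, the class distribution can be inspected with the standard library alone. This is a small sketch: the helper name `class_distribution` and the label list are hypothetical. The negative-to-positive ratio it prints is a commonly used starting value for XGBoost's `scale_pos_weight` in binary problems.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: (count, share)} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}


# Hypothetical labels: 0 = negative (majority), 1 = positive (minority)
y = [0] * 950 + [1] * 50

dist = class_distribution(y)
for cls, (n, share) in dist.items():
    print(f"class {cls}: {n} samples ({share:.1%})")

# Ratio of negatives to positives
neg_pos_ratio = dist[0][0] / dist[1][0]
print("negatives per positive:", neg_pos_ratio)  # 19.0
```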

---

📊 **In short:**  
> Imbalanced data doesn’t just lower model performance — it **misleads evaluation** and hides critical failure risks.  
> Always measure, visualize, and balance your dataset before trusting your model’s results.


# 🧩 Techniques to Handle Imbalanced Data

Handling imbalanced data is crucial to improving model performance and ensuring fair evaluation across classes.  
Below are key **algorithmic and data-level strategies** used to address this challenge.

---

## ⚙️ 1. Algorithmic Solutions

Algorithmic techniques modify the learning process so that the model gives more attention to the **minority class** during training.

---

### 🧮 a. Class Weights

**Idea:** Assign **higher weights** to the minority class during training to penalize misclassification of rare samples more strongly.

- This tells the model that errors on the minority class are **more costly**.
- Many algorithms support **built-in class weighting**: scikit-learn's Logistic Regression, SVM, Decision Tree, and Random Forest accept a `class_weight` parameter, and XGBoost offers `scale_pos_weight` for binary problems.

#### ✅ Benefits:
- Works directly within the algorithm (no need to modify the data).
- Maintains the size and distribution of the original dataset.
- Reduces model bias toward the majority class.

#### ⚠️ When to Use:
- When the dataset is too small for resampling techniques.
- When the imbalance ratio is moderate (e.g., 1:10, 1:20).
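
To make the weighting concrete, the sketch below reproduces the formula behind scikit-learn's `'balanced'` mode, `n_samples / (n_classes * class_count)`, using only the standard library. The helper name `balanced_class_weights` is made up for this illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in sorted(counts.items())}


# 90 majority samples and 10 minority samples
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # class 0 gets about 0.56, class 1 gets 5.0
```

With these weights, one misclassified minority sample costs as much as nine misclassified majority samples, which is exactly the inverse of the 9:1 class ratio.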

#### 💻 Example (Scikit-learn Logistic Regression):

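```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Example: binary labels (0 = majority, 1 = minority)
y_train = np.array([0] * 90 + [1] * 10)

# Placeholder features (assumption): two noisy columns shifted by the label
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 2)) + y_train[:, None]

# Inspect the weights that the 'balanced' setting implies
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
print("Computed class weights:", dict(zip(classes, weights)))

# Apply the same weighting inside the model
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)
```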

# 🧮 Evaluation Metrics for Imbalanced Data

---

## 📌 Why Special Metrics Are Needed

In imbalanced datasets, **accuracy alone can be misleading** because predicting only the majority class can yield high accuracy but poor performance on the minority (positive) class.

For example:
- 95% of samples are class 0 (majority).
- 5% are class 1 (minority).

A model that predicts everything as class 0 would still have **95% accuracy**, yet **0% recall** for class 1 — completely missing the minority class.

To properly evaluate models under imbalance, we use **metrics that consider both false positives and false negatives.**

---

## ⚖️ 1. F1-Score

### 🧠 Definition
The **F1-Score** is the **harmonic mean** of **Precision** and **Recall**, balancing the two.

\[
F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]

Where TP, FP, and FN are the counts of true positives, false positives, and false negatives:
- **Precision (Positive Predictive Value):**
  \[
  \text{Precision} = \frac{TP}{TP + FP}
  \]
  Measures how many predicted positives are actually correct.
- **Recall (Sensitivity):**
  \[
  \text{Recall} = \frac{TP}{TP + FN}
  \]
  Measures how many actual positives were correctly identified.

### ✅ When to Use
- When both **false positives (FP)** and **false negatives (FN)** are costly.
- Particularly useful in **fraud detection**, **medical diagnosis**, and **risk analysis**.

### 💻 Example (Scikit-learn):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 4 actual positives, of which the model finds 3 (TP=3, FN=1, FP=0)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

print("Precision:", precision_score(y_true, y_pred))  # 1.0
print("Recall:", recall_score(y_true, y_pred))        # 0.75
print("F1-Score:", f1_score(y_true, y_pred))          # ~0.857
```


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset
url = "https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv"
df = pd.read_csv(url)


In [3]:
print("Dataset Info:\n ")
print(df.info())
print("\n Class Distribution\n")
print(df["Class"].value_counts())


Dataset Info:
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64

In [4]:
X = df.drop(columns = ['Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
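
The split above is purely random, so with a rare class the minority share in the test set can drift by chance. A stratified split fixes the share per class instead. `train_test_split` supports this through its `stratify` parameter. The sketch below only illustrates the idea in plain Python and is not scikit-learn's implementation. The helper name `stratified_split` and the labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical labels with a 1% minority class
y = [0] * 9_900 + [1] * 100
train_idx, test_idx = stratified_split(y)

print("train:", Counter(y[i] for i in train_idx))  # 7920 vs 80
print("test: ", Counter(y[i] for i in test_idx))   # 1980 vs 20
```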

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

rf_model = RandomForestClassifier(random_state=42, class_weight="balanced")
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)
print("\n Classification Report:\n")
print(classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, rf_model.predict_proba(X_test)[:, 1])
print(roc_auc)



 Classification Report:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.99      0.76      0.86        98

    accuracy                           1.00     56962
   macro avg       0.99      0.88      0.93     56962
weighted avg       1.00      1.00      1.00     56962

0.9478093273747316
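
The `macro avg` and `weighted avg` rows in the report behave very differently under imbalance. The sketch below recomputes both for recall from the rounded per-class values printed in the report, so the results are approximate. It shows why the weighted average looks near perfect even though about a quarter of the fraud cases are missed.

```python
# Per-class recall and support, copied from the report above (rounded values)
recall = {0: 1.00, 1: 0.76}
support = {0: 56864, 1: 98}

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: every class counts in proportion to its support
total = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro avg recall:    {macro_recall:.2f}")     # 0.88
print(f"weighted avg recall: {weighted_recall:.2f}")  # 1.00
```

For imbalanced problems, the per-class row for the minority class and the macro average are usually the more informative numbers.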


In [8]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state = 42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

print("\n Class Distribution after SMOTE: \n")
print(pd.Series(y_resampled.value_counts()))


 Class Distribution after SMOTE: 

Class
0    227451
1    227451
Name: count, dtype: int64


In [None]:
rf_model_smote = RandomForestClassifier(random_state=42, class_weight="balanced")
rf_model_smote.fit(X_resampled, y_resampled)

y_pred_smote = rf_model_smote.predict(X_test)
print("\n Classification Report (SMOTE):\n")
print(classification_report(y_test, y_pred_smote))
# Use a separate variable name so the imported roc_auc_score function is not overwritten
roc_auc_smote = roc_auc_score(y_test, rf_model_smote.predict_proba(X_test)[:, 1])
print(roc_auc_smote)