# **1️⃣ Handling Imbalanced Datasets in Classification Problems ⚖️🤖**

## **💡 Real-Life Analogy: Refereeing Fouls in a Football Match ⚽**

Imagine a football match where **90% of plays are fair** and **10% involve fouls**:
- If the referee **only focuses on fair plays**, they might **miss many real fouls** (low recall ❌).  
- If the referee **calls fouls too often**, they might **give unnecessary penalties** (low precision ❌).  
- **Solution?** Balance decision-making by **adjusting the way fouls are counted!** ✅
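
📌 **A rough sketch of the referee's trade-off:** the toy example below assumes a model that outputs a "foul probability" for each play, and shows how moving the decision threshold trades precision against recall. The scores, labels, and the `precision_recall` helper are made up for illustration.

```python
import numpy as np

# Hypothetical data: 18 fair plays (0) and 2 fouls (1), with made-up model scores.
y_true = np.array([0] * 18 + [1] * 2)
scores = np.array([
    0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22,
    0.25, 0.28, 0.31, 0.33, 0.35, 0.40, 0.45, 0.55, 0.60,  # fair plays
    0.50, 0.85,                                            # real fouls
])

def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are called fouls."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

strict = precision_recall(y_true, scores, threshold=0.70)   # rarely calls a foul
lenient = precision_recall(y_true, scores, threshold=0.30)  # calls fouls often

print("strict  (precision, recall):", strict)
print("lenient (precision, recall):", lenient)
```

The strict referee never gives a wrong penalty but misses half of the fouls; the lenient referee catches every foul but most of the calls are wrong.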

📌 **In machine learning, when one class is much rarer than another (e.g., fraud detection, medical diagnosis, goal-scoring predictions), we must adjust the model to handle this imbalance!**

# **📌 Why Are Imbalanced Datasets a Problem?**

✅ **Issue 1: Accuracy Can Be Misleading**  
- If **99% of games end in a win and 1% in a loss**, a model that **always predicts a win** is **99% accurate** but **completely useless** ❌.  
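
📌 **Quick check of that claim:** a minimal sketch with toy labels (99 wins, 1 loss) and a "model" that always predicts a win. The label encoding (1 = win, 0 = loss) is an assumption made for this example.

```python
import numpy as np

# Toy labels matching the example: 1 = win (99%), 0 = loss (1%).
y_true = np.array([1] * 99 + [0] * 1)
y_pred = np.ones_like(y_true)  # a "model" that always predicts a win

accuracy = float(np.mean(y_pred == y_true))

# Recall for the rare class: share of actual losses that were predicted as losses.
is_loss = y_true == 0
loss_recall = float(np.mean(y_pred[is_loss] == 0))

print(f"accuracy={accuracy:.2f}, recall on losses={loss_recall:.2f}")
```

Accuracy comes out at 0.99 while recall on the rare class is 0.00, which is why recall, precision, or F1 on the minority class is a more honest metric here.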

✅ **Issue 2: Model Ignores the Minority Class**  
- In fraud detection 💳, if 99% of transactions are legitimate, the model may **always predict "not fraud"**.  

✅ **Issue 3: Poor Generalization**  
- Models trained on **imbalanced data** struggle to detect **rare events** in new data.  

📌 **Solution? Use special techniques to balance the dataset and improve model learning!**

# **📌 Techniques to Handle Imbalanced Datasets**

| Method                                    | How It Works                                                 | Best Used For                   |
|-------------------------------------------|--------------------------------------------------------------|---------------------------------|
| **Resampling (Oversampling & Undersampling)** | Adjusts the dataset to balance class distribution.           | Oversampling for **small datasets**, undersampling for **large ones**.  |
| **Class Weight Adjustment**               | Increases the penalty for misclassifying the minority class.   | Works well when the model **supports class weights** and you want to keep the data unchanged. |
| **Synthetic Data (SMOTE, ADASYN)**          | Generates artificial data points for the minority class.       | Works well for **small datasets**.  |
| **Anomaly Detection Methods**             | Treats rare classes as "anomalies" instead of normal classification. | Best for **fraud detection, rare diseases**. |
| **Ensemble Learning (Boosting, Bagging)**   | Combines multiple models to improve minority class detection.  | Works well for **complex imbalances**. |

✅ **Choosing the Right Method Depends on the Dataset & Problem Type!**

## **📊 1️⃣ Resampling (Oversampling & Undersampling)**

📌 **Oversampling:** Increases the number of minority class examples.  
📌 **Undersampling:** Reduces the number of majority class examples.  

✅ **Example: Predicting Football Injuries ⚽**  

| **Injury Status**    | **Before Oversampling** | **After Oversampling** |
|----------------------|-------------------------|------------------------|
| **Not Injured (0)**  | **90 players**          | **90 players**         |
| **Injured (1)**      | **10 players**          | **90 players**         |

📌 **Pros & Cons:**  

| Method           | Pros ✅                           | Cons ❌                           |
|------------------|-----------------------------------|-----------------------------------|
| **Oversampling** | Prevents information loss 📊      | Increases overfitting risk 📈     |
| **Undersampling**| Faster training ⏩                | Can remove important data ❌      |

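✅ **Oversampling suits small datasets; undersampling suits large datasets where discarding some majority examples is affordable.**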
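
📌 **Sketch of both resampling strategies:** the example below reproduces the 90/10 injury split from the table with random features and resamples it using plain NumPy indexing. It is only an illustration of the idea; the feature matrix is random, and libraries such as imbalanced-learn provide ready-made resamplers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data matching the table: 90 not-injured (0) and 10 injured (1) players.
X = rng.normal(size=(100, 3))  # 3 made-up features per player
y = np.array([0] * 90 + [1] * 10)

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Random oversampling: draw minority rows WITH replacement until both classes match.
n_extra = len(majority_idx) - len(minority_idx)
extra = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep only as many majority rows as there are minority rows.
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print("oversampled counts: ", np.bincount(y_over))
print("undersampled counts:", np.bincount(y_under))
```

Oversampling ends with 90 players per class (with repeated injured players), while undersampling ends with 10 per class and throws away 80 healthy players.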

## **📊 2️⃣ Class Weight Adjustment (Weighted Loss Function)**

📌 Modify the loss function so the model **penalizes errors on the minority class more heavily.**  

✅ **Example: Detecting NBA All-Stars 🏀**  
- **99% of NBA players are NOT All-Stars** → An unweighted model can score well while ignoring All-Stars entirely.  
- **Solution?** Increase the weight for "All-Star" misclassification.  
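
📌 **What a weighted loss looks like:** a minimal sketch of a class-weighted binary cross-entropy. The function name `weighted_log_loss`, the weight of 99, and the choice to normalise by the sum of weights are assumptions for illustration; libraries differ in the exact normalisation.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, w_pos=1.0, w_neg=1.0):
    """Binary cross-entropy where each class has its own weight."""
    p = np.clip(p_pred, 1e-12, 1 - 1e-12)  # avoid log(0)
    weights = np.where(y_true == 1, w_pos, w_neg)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.sum(weights * losses) / np.sum(weights))

# 99 non-All-Stars and 1 All-Star; the model predicts a 1% chance for everyone.
y_true = np.array([0] * 99 + [1])
p_pred = np.full(100, 0.01)

plain = weighted_log_loss(y_true, p_pred)                 # every player counts equally
weighted = weighted_log_loss(y_true, p_pred, w_pos=99.0)  # the All-Star counts 99x

print(f"plain loss={plain:.3f}, weighted loss={weighted:.3f}")
```

With equal weights the "ignore All-Stars" prediction looks cheap; once the All-Star is up-weighted, the same prediction is penalised far more heavily, which pushes training toward the minority class.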

📌 **Scikit-Learn Implementation:**

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Create sample imbalanced dataset
# Example: 90% class 0, 10% class 1
X_train = np.random.randn(1000, 4)  # 1000 samples, 4 features
y_train = np.concatenate([np.zeros(900), np.ones(100)])  # 900 zeros, 100 ones

# Train model with balanced class weights
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train, y_train)

✅ **`class_weight='balanced'` weights each class inversely to its frequency, so no manual tuning is needed!**
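
📌 **Where the weights come from:** scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below recomputes it by hand for the 900/100 split used above, purely to show the resulting numbers.

```python
import numpy as np

# Same label split as the training example: 900 zeros, 100 ones.
y_train = np.concatenate([np.zeros(900), np.ones(100)]).astype(int)

counts = np.bincount(y_train)  # samples per class
n_samples, n_classes = len(y_train), len(counts)

# Heuristic documented for class_weight='balanced' in scikit-learn.
balanced_weights = n_samples / (n_classes * counts)

print(dict(enumerate(balanced_weights.round(3).tolist())))
```

The majority class gets a weight of about 0.56 and the minority class 5.0, so each minority mistake counts roughly nine times as much.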

## **📊 3️⃣ Synthetic Data Generation (SMOTE & ADASYN)**

📌 **Creates new artificial samples for the minority class using existing data.**  
✅ **Great for small datasets where simply duplicating minority samples (random oversampling) isn’t enough.**  

✅ **Example: Predicting Football Goals Based on Player Stats ⚽**  
- Only **5% of shots result in goals** → Model struggles to predict them.  
- **SMOTE** generates **synthetic goal-scoring events** to balance training data.  
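
📌 **The core idea behind SMOTE, sketched by hand:** pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. The helper `smote_like_samples` and the shot features below are hypothetical; this is a simplified illustration, not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = X_minority[rng.integers(len(X_minority))]
        # Distances from the chosen point to every minority point.
        dists = np.linalg.norm(X_minority - base, axis=1)
        # Skip position 0 of the sorted order: that is the point itself (distance 0).
        neighbours = np.argsort(dists)[1:k + 1]
        neighbour = X_minority[rng.choice(neighbours)]
        gap = rng.random()  # random position along the segment, in [0, 1)
        synthetic[i] = base + gap * (neighbour - base)
    return synthetic

# Hypothetical minority class: 5 goals described by (distance_m, angle_deg).
goals = np.array([[8.0, 30.0], [10.0, 25.0], [6.0, 40.0], [12.0, 20.0], [9.0, 35.0]])
new_goals = smote_like_samples(goals, n_new=10)

print(new_goals.round(2))
```

Because every synthetic goal lies between two real goals, the new points stay inside the region the minority class already occupies instead of being exact duplicates.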

📌 **SMOTE Implementation:**

from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(random_state=42, sampling_strategy=0.5)  # Target minority/majority ratio of 0.5 (100 -> 450 samples)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Print the class distribution before and after SMOTE
print("Original class distribution:", dict(zip(*np.unique(y_train, return_counts=True))))
print("Resampled class distribution:", dict(zip(*np.unique(y_resampled, return_counts=True))))

Original class distribution: {np.float64(0.0): np.int64(900), np.float64(1.0): np.int64(100)}
Resampled class distribution: {np.float64(0.0): np.int64(900), np.float64(1.0): np.int64(450)}


✅ **A more balanced training set often improves minority-class recall. Apply SMOTE to the training split only, never to the test data!**

## **📊 4️⃣ Anomaly Detection for Extremely Imbalanced Data (Fraud, Injuries, etc.)**

📌 **When the minority class is very rare, treat it as an "anomaly" instead of normal classification.**  

✅ **Example: Detecting Injuries in Football Players ⚽**  
- **Injuries are rare (1% of players).**  
- Instead of a **normal classifier**, use **Anomaly Detection algorithms**.  

📌 **Implementation with Isolation Forest:**

from sklearn.ensemble import IsolationForest

# Split data into train and test sets
X_test = X_train[800:]  # Use last 200 samples as test set
X_train_subset = X_train[:800]  # Use first 800 samples for training

# Train anomaly detection model
anomaly_detector = IsolationForest(contamination=0.01)  # 1% expected injuries
anomaly_detector.fit(X_train_subset)
y_pred = anomaly_detector.predict(X_test)

# Print results (-1 for anomalies, 1 for normal cases)
print("Number of predicted anomalies:", sum(y_pred == -1))
print("Number of predicted normal cases:", sum(y_pred == 1))

Number of predicted anomalies: 2
Number of predicted normal cases: 198


✅ **Best for fraud detection, rare diseases, and sports injuries!**
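
📌 **Scoring an anomaly detector like a classifier:** `IsolationForest.predict` returns `-1` for anomalies and `1` for normal points, so the output has to be mapped onto class labels before computing precision and recall. The sketch below uses hypothetical predictions and ground-truth labels for 10 players.

```python
import numpy as np

# Hypothetical IsolationForest output for 10 players: -1 = anomaly, 1 = normal.
y_pred_raw = np.array([1, 1, -1, 1, 1, 1, -1, 1, 1, 1])
# Hypothetical ground truth for the same players: 1 = injured, 0 = not injured.
y_true = np.array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0])

# Map the detector's convention onto the class labels used elsewhere.
y_pred = np.where(y_pred_raw == -1, 1, 0)

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # injured and flagged
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # healthy but flagged
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # injured but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Counting anomalies alone says nothing about whether the flagged players are the injured ones; comparing against known labels, when they exist, does.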

## **📊 5️⃣ Ensemble Learning (Boosting & Bagging)**

📌 **Combines multiple models to improve classification of the minority class.**  

✅ **Example: Predicting NBA MVPs 🏀**  
- MVPs are **very rare events (~1% of players).**  
- **Boosting methods** improve detection by focusing on hard-to-classify MVPs.  
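
📌 **How "focusing on hard cases" can work:** a minimal sketch of one AdaBoost-style reweighting round on toy labels. This illustrates the classic sample-reweighting idea; gradient boosting libraries such as XGBoost reach a similar effect by fitting each new tree to the remaining errors rather than by explicit reweighting.

```python
import numpy as np

# 10 players, 1 MVP (label 1). A weak learner that always says "not MVP".
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_pred = np.zeros_like(y_true)

weights = np.full(len(y_true), 1 / len(y_true))  # start with equal weights
missed = y_pred != y_true

# AdaBoost-style update: weighted error -> learner score alpha -> new weights.
error = np.sum(weights[missed])
alpha = 0.5 * np.log((1 - error) / error)
weights = weights * np.exp(np.where(missed, alpha, -alpha))
weights /= weights.sum()  # renormalise so the weights sum to 1

print("weight of the missed MVP:", round(float(weights[-1]), 3))
print("weight of each other player:", round(float(weights[0]), 3))
```

After one round the single missed MVP carries half of the total weight, so the next weak learner is strongly pushed to get that example right.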

📌 **Implementation with XGBoost:**

from xgboost import XGBClassifier

model = XGBClassifier(
    tree_method='hist',  # Histogram-based tree construction (fast on CPU)
    enable_categorical=False,  # Default value; this data has no categorical features
    random_state=42  # Optional: for reproducibility
)
model.fit(X_train, y_train)

📌 **Note:** XGBoost needs an OpenMP runtime. If this cell fails with `XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded`, install the runtime first (on macOS: `brew install libomp`) and rerun.


✅ **Boosting concentrates on difficult cases, which can improve minority-class recall, especially when combined with class weighting!**
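
📌 **Making XGBoost imbalance-aware:** XGBoost exposes a `scale_pos_weight` parameter, and its documentation suggests the ratio of negative to positive examples as a typical starting value. The sketch below only computes that value and collects keyword arguments; it does not import XGBoost, and the `xgb_params` name is an assumption for illustration.

```python
import numpy as np

# Same label split as the training example: 900 negatives, 100 positives.
y_train = np.concatenate([np.zeros(900), np.ones(100)])

n_negative = int(np.sum(y_train == 0))
n_positive = int(np.sum(y_train == 1))

# Typical starting value suggested in the XGBoost docs: negatives / positives.
scale_pos_weight = n_negative / n_positive

# Keyword arguments that could be passed as XGBClassifier(**xgb_params).
xgb_params = {
    "tree_method": "hist",
    "scale_pos_weight": scale_pos_weight,
    "random_state": 42,
}

print(xgb_params)
```

Treat the ratio as a starting point and tune it against a validation metric such as recall or F1 on the minority class.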

# **📊 Summary: Choosing the Best Technique**

| Problem Type                            | Best Method                                               |
|-----------------------------------------|-----------------------------------------------------------|
| **Small Dataset (Few Examples)**        | **Oversampling (SMOTE), Class Weight Adjustment**         |
| **Large Dataset (Many Examples)**       | **Undersampling, Ensemble Learning (Boosting)**           |
| **Extreme Imbalance (1% Minority Class)** | **Anomaly Detection, XGBoost, Weighted Loss**             |
| **Medical, Fraud, Rare Events**           | **Anomaly Detection (Isolation Forest, One-Class SVM)**     |

✅ **For sports predictions, injury forecasting, and goal-scoring models, SMOTE or a weighted loss is a good starting point. Compare them on minority-class precision and recall before committing!**

# **🔥 Final Takeaways**

1️⃣ **Imbalanced datasets can cause models to ignore the minority class (e.g., goals, injuries, fraud), even when accuracy looks high.**  
2️⃣ **Resampling (oversampling, undersampling) balances class representation.**  
3️⃣ **Class weighting makes the model "care more" about rare events.**  
4️⃣ **SMOTE & synthetic data help generate more minority class examples.**  
5️⃣ **Ensemble methods (Boosting, XGBoost) improve rare class detection.**