# **1️⃣ Handling Imbalanced Datasets in Classification Problems ⚖️🤖**

## **💡 Real-Life Analogy: Refereeing Fouls in a Football Match ⚽**

Imagine a football match where **90% of plays are fair** and **10% involve fouls**:
- If the referee **only focuses on fair plays**, they might **miss many real fouls** (low recall ❌).  
- If the referee **calls fouls too often**, they might **give unnecessary penalties** (low precision ❌).  
- **Solution?** Balance decision-making by **adjusting the way fouls are counted!** ✅

📌 **In machine learning, when one class is much rarer than another (e.g., fraud detection, medical diagnosis, goal-scoring predictions), we must adjust the model to handle this imbalance!**

# **📌 Why Are Imbalanced Datasets a Problem?**

✅ **Issue 1: Accuracy Can Be Misleading**  
- If **99% of games end in a win and 1% in a loss**, a model that **always predicts a win** is **99% accurate** but **completely useless** ❌.  

✅ **Issue 2: Model Ignores the Minority Class**  
- In fraud detection 💳, if 99% of transactions are legitimate, the model may **always predict "not fraud"**.  

✅ **Issue 3: Poor Generalization**  
- Models trained on **imbalanced data** struggle to detect **rare events** in new data.  

📌 **Solution? Use special techniques to balance the dataset and improve model learning!**
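
📌 **Sketch of Issue 1:** To make the accuracy problem concrete, here is a minimal example using a hypothetical set of 1,000 games with the 99%/1% split described above. The "model" is an assumed stand-in that always predicts a win, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical labels: 990 wins (0) and 10 losses (1), matching the 99%/1% split
y_true = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])

# A "model" that ignores its input and always predicts the majority class (win)
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)

# Recall for the minority class: share of actual losses the model caught
true_positives = np.sum((y_pred == 1) & (y_true == 1))
minority_recall = true_positives / np.sum(y_true == 1)

print(f"Accuracy: {accuracy:.2f}")
print(f"Minority recall: {minority_recall:.2f}")
```

Accuracy comes out at 0.99 while recall on the rare class is 0.00, which is why recall, precision, and F1 on the minority class are more informative than accuracy here.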

# **📌 Techniques to Handle Imbalanced Datasets**

| Method                                    | How It Works                                                 | Best Used For                   |
|-------------------------------------------|--------------------------------------------------------------|---------------------------------|
| **Resampling (Oversampling & Undersampling)** | Adjusts the dataset to balance class distribution.           | Oversampling suits **small datasets**; undersampling suits **large ones**.  |
| **Class Weight Adjustment**               | Increases the penalty for misclassifying the minority class.   | Works well when you want to **keep every original sample**. |
| **Synthetic Data (SMOTE, ADASYN)**          | Generates artificial data points for the minority class.       | Works well for **small datasets**.  |
| **Anomaly Detection Methods**             | Treats rare classes as "anomalies" instead of normal classification. | Best for **fraud detection, rare diseases**. |
| **Ensemble Learning (Boosting, Bagging)**   | Combines multiple models to improve minority class detection.  | Works well for **complex imbalances**. |

✅ **Choosing the Right Method Depends on the Dataset & Problem Type!**

## **📊 1️⃣ Resampling (Oversampling & Undersampling)**

📌 **Oversampling:** Increases the number of minority class examples.  
📌 **Undersampling:** Reduces the number of majority class examples.  

✅ **Example: Predicting Football Injuries ⚽**  

| **Injury Status**    | **Before Oversampling** | **After Oversampling** |
|----------------------|-------------------------|------------------------|
| **Not Injured (0)**  | **90 players**          | **90 players**         |
| **Injured (1)**      | **10 players**          | **90 players**         |

📌 **Pros & Cons:**  

| Method           | Pros ✅                           | Cons ❌                           |
|------------------|-----------------------------------|-----------------------------------|
| **Oversampling** | Prevents information loss 📊      | Increases overfitting risk 📈     |
| **Undersampling**| Faster training ⏩                | Can remove important data ❌      |

✅ **Oversampling suits small datasets; undersampling suits large ones where dropping majority rows is affordable!**
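
📌 **Sketch:** Both resampling strategies can be illustrated with plain NumPy. This is a minimal example, assuming a hypothetical squad of 90 non-injured and 10 injured players with three made-up features each. It shows random resampling only, not a recommended production pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical squad: 90 non-injured (0) and 10 injured (1) players, 3 features each
X = rng.standard_normal((100, 3))
y = np.concatenate([np.zeros(90, dtype=int), np.ones(10, dtype=int)])

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Random oversampling: draw minority rows WITH replacement until the classes match
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep a random subset of majority rows WITHOUT replacement
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept_majority, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print("Oversampled counts: ", np.bincount(y_over))
print("Undersampled counts:", np.bincount(y_under))
```

Resampling should be applied to the training split only, so that the test set keeps the real-world class ratio.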

## **📊 2️⃣ Class Weight Adjustment (Weighted Loss Function)**

📌 Modify the loss function so the model **penalizes errors on the minority class more heavily.**  

✅ **Example: Detecting NBA All-Stars 🏀**  
- **99% of NBA players are NOT All-Stars** → Model will ignore All-Stars.  
- **Solution?** Increase the weight for "All-Star" misclassification.  

📌 **Scikit-Learn Implementation:**

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np

np.random.seed(42)

# Create synthetic dataset representing NBA players:
n_samples = 1000
n_all_stars = int(n_samples * 0.01)  # 1% All-Stars
n_non_all_stars = n_samples - n_all_stars

# Features: 4 random features per player
X = np.random.randn(n_samples, 4)

# Target: 0 for non-All-Stars, 1 for All-Stars
y = np.concatenate([np.zeros(n_non_all_stars), np.ones(n_all_stars)])

# Initialize RandomForestClassifier with balanced class weights
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X, y)

In [2]:
# Evaluate on the training data itself (scores are optimistic; use a held-out test set for a real estimate)
y_pred = model.predict(X)
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       990
         1.0       1.00      1.00      1.00        10

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000



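✅ **`class_weight='balanced'` automatically reweights classes by inverse frequency!**

📌 **Caution:** The perfect scores above come from evaluating on the training data, and the features here are random noise, so they reflect memorization rather than real predictive power. Evaluate on a held-out test set for an honest estimate.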
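
📌 **Sketch:** The scikit-learn documentation describes the `'balanced'` mode as setting each class weight to `n_samples / (n_classes * np.bincount(y))`. The snippet below reproduces that formula by hand for the same 990/10 split, as an illustration of what the setting does rather than a replacement for it.

```python
import numpy as np

# Same class split as the example above: 990 non-All-Stars, 10 All-Stars
y = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])

n_samples = len(y)
class_counts = np.bincount(y)
n_classes = len(class_counts)

# "Balanced" heuristic: weight is inversely proportional to class frequency
class_weights = n_samples / (n_classes * class_counts)

for label, weight in enumerate(class_weights):
    print(f"class {label}: weight = {weight:.3f}")
```

Each All-Star error is weighted roughly 99 times more than a non-All-Star error, so both classes contribute the same total weight to the loss.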

## **📊 3️⃣ Synthetic Data Generation (SMOTE & ADASYN)**

📌 **Creates new artificial samples for the minority class using existing data.**  
✅ **Great for small datasets where simply duplicating minority samples isn’t enough.**  

✅ **Example: Predicting Football Goals Based on Player Stats ⚽**  
- Only **5% of shots result in goals** → Model struggles to predict them.  
- **SMOTE** generates **synthetic goal-scoring events** to balance training data.  

📌 **SMOTE Implementation:**

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Create synthetic dataset representing football shots:
n_samples = 500
n_goals = int(n_samples * 0.05)     # 5% of shots result in goals
n_no_goals = n_samples - n_goals

# Features: For example, three random metrics per shot (e.g., shot power, shot accuracy, distance)
X = np.random.randn(n_samples, 3)

# Target: 1 indicates a goal, 0 indicates no goal
y = np.concatenate([np.zeros(n_no_goals), np.ones(n_goals)])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print class distribution before SMOTE
unique, counts = np.unique(y_train, return_counts=True)
print("Before SMOTE:", dict(zip(unique, counts)))

# Apply SMOTE to the training data to generate synthetic samples for the minority class (goals)
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Print class distribution after SMOTE
unique_res, counts_res = np.unique(y_train_resampled, return_counts=True)
print("After SMOTE:", dict(zip(unique_res, counts_res)))

Before SMOTE: {np.float64(0.0): np.int64(335), np.float64(1.0): np.int64(15)}
After SMOTE: {np.float64(0.0): np.int64(335), np.float64(1.0): np.int64(335)}


In [4]:
# Initialize and train a RandomForestClassifier on the resampled data
model = RandomForestClassifier(random_state=42)
model.fit(X_train_resampled, y_train_resampled)

# Evaluate on the untouched test set (test data is never resampled)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.92      0.86      0.89       140
         1.0       0.00      0.00      0.00        10

    accuracy                           0.80       150
   macro avg       0.46      0.43      0.44       150
weighted avg       0.86      0.80      0.83       150



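📌 **Caution:** In the report above, recall for goals (class 1) is 0.00. The features in this toy dataset are random noise with no link to the label, so SMOTE has nothing real to amplify.

✅ **A more balanced training set helps generalization only when the features carry real signal about the minority class!**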
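
📌 **Sketch:** The core idea behind SMOTE is to place each synthetic point on the line segment between a minority sample and one of its nearest minority neighbors. The function below is a simplified NumPy illustration of that interpolation step. `smote_sketch` and its parameters are hypothetical names, and this is not the `imblearn` implementation.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                      # random starting sample
    picked = neighbors[base, rng.integers(0, k, size=n_new)]   # one of its k neighbors
    gap = rng.random((n_new, 1))                               # interpolation factor in [0, 1)

    # New point lies on the segment between the base sample and the picked neighbor
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

rng = np.random.default_rng(42)
X_goals = rng.standard_normal((15, 3))  # hypothetical: 15 goal shots, 3 features
X_synthetic = smote_sketch(X_goals, n_new=320)
print(X_synthetic.shape)
```

Because every synthetic row is a blend of two real minority rows, SMOTE cannot invent information that the original minority samples do not contain.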

## **📊 4️⃣ Anomaly Detection for Extremely Imbalanced Data (Fraud, Injuries, etc.)**

📌 **When the minority class is very rare, treat it as an "anomaly" instead of normal classification.**  

✅ **Example: Detecting Injuries in Football Players ⚽**  
- **Injuries are rare (1% of players).**  
- Instead of a **normal classifier**, use **Anomaly Detection algorithms**.  

📌 **Implementation with Isolation Forest:**

In [7]:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Create synthetic dataset representing player metrics
n_samples = 1000
n_features = 4  # e.g., speed, agility, reaction time, stamina
X = np.random.randn(n_samples, n_features)

# Inject anomalies representing injuries (1% of players) by shifting their values
n_injuries = int(n_samples * 0.01)
if n_injuries > 0:
    anomaly_indices = np.random.choice(n_samples, n_injuries, replace=False)
    X[anomaly_indices] += np.random.uniform(5, 10, size=(n_injuries, n_features))

# Split the dataset into training and testing sets
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

# Train anomaly detection model
anomaly_detector = IsolationForest(contamination=0.01, random_state=42)  # 1% expected injuries
anomaly_detector.fit(X_train)
y_pred = anomaly_detector.predict(X_test)

# In IsolationForest, normal observations are labeled 1 and anomalies are labeled -1.
n_anomalies = np.sum(y_pred == -1)
n_normals = np.sum(y_pred == 1)
print(f"Number of predicted anomalies (injuries): {n_anomalies}")
print(f"Number of predicted normal cases: {n_normals}")

Number of predicted anomalies (injuries): 1
Number of predicted normal cases: 299


✅ **Best for fraud detection, rare diseases, and sports injuries!**
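
📌 **Sketch:** The cell above counts predicted anomalies but keeps no labels, so it cannot tell whether the right players were flagged. The variation below tracks which rows were shifted and scores the detector on held-out data. It assumes the same kind of synthetic data, and `is_injured` is a hypothetical label array introduced only for evaluation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical player metrics, with 1% of rows shifted to mimic injuries
n_samples, n_features = 1000, 4
X = rng.standard_normal((n_samples, n_features))
is_injured = np.zeros(n_samples, dtype=int)
injured_idx = rng.choice(n_samples, size=10, replace=False)
X[injured_idx] += rng.uniform(5, 10, size=(10, n_features))
is_injured[injured_idx] = 1

# Stratify so the few injured players appear in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, is_injured, test_size=0.3, stratify=is_injured, random_state=42
)

detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(X_train)  # labels are not used for fitting

# score_samples returns lower values for more abnormal points,
# so negate it to get a score where higher means more anomalous
anomaly_score = -detector.score_samples(X_test)
auc = roc_auc_score(y_test, anomaly_score)
print(f"ROC AUC on held-out players: {auc:.3f}")
```

The injected shift is deliberately large, so the detector separates the two groups easily here; real injuries are rarely this obvious.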

## **📊 5️⃣ Ensemble Learning (Boosting & Bagging)**

📌 **Combines multiple models to improve classification of the minority class.**  

✅ **Example: Predicting NBA MVPs 🏀**  
- MVPs are **very rare events (~1% of players).**  
- **Boosting methods** improve detection by focusing on hard-to-classify MVPs.  

📌 **Implementation with XGBoost:**

In [6]:
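from xgboost import XGBClassifier

# Assumption: X_train and y_train are the labeled, pre-SMOTE training split from the
# SMOTE example (In [3]), not the unlabeled X_train from the Isolation Forest cell.
n_neg, n_pos = np.sum(y_train == 0), np.sum(y_train == 1)

model = XGBClassifier(
    tree_method='hist',              # Histogram-based tree construction
    scale_pos_weight=n_neg / n_pos,  # Up-weight the rare positive class
    random_state=42                  # Optional: for reproducibility
)
model.fit(X_train, y_train)

✅ **Boosting focuses on hard-to-classify cases, and `scale_pos_weight` steers that focus toward the minority class to help recall!**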
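
📌 **Sketch:** For readers without XGBoost installed, a similar idea can be tried with scikit-learn's `GradientBoostingClassifier` and per-row weights. This is a minimal example under stated assumptions: the data is hypothetical, uses 5% positives rather than 1% so the test split has enough of them, and shifts the positive rows so there is real signal to learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(42)

# Hypothetical data: 5% positives whose features are shifted by +2
n_samples, n_pos = 1000, 50
X = rng.standard_normal((n_samples, 4))
y = np.zeros(n_samples, dtype=int)
y[:n_pos] = 1
X[:n_pos] += 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Give each training row a weight inversely proportional to its class frequency
weights = compute_sample_weight(class_weight="balanced", y=y_train)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train, sample_weight=weights)

minority_recall = recall_score(y_test, model.predict(X_test))
print(f"Held-out minority recall: {minority_recall:.2f}")
```

Weighting rows this way plays the same role as `scale_pos_weight`: mistakes on the rare class cost more during training.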

# **📊 Summary: Choosing the Best Technique**

| Problem Type                            | Best Method                                               |
|-----------------------------------------|-----------------------------------------------------------|
| **Small Dataset (Few Examples)**        | **Oversampling (SMOTE), Class Weight Adjustment**         |
| **Large Dataset (Many Examples)**       | **Undersampling, Ensemble Learning (Boosting)**           |
| **Extreme Imbalance (1% Minority Class)** | **Anomaly Detection, XGBoost, Weighted Loss**             |
| **Medical, Fraud, Rare Events**           | **Anomaly Detection (Isolation Forest, One-Class SVM)**     |

✅ **For sports predictions, injury forecasting, and goal-scoring models, SMOTE or a weighted loss is a sensible starting point; confirm the choice with minority-class recall and precision on held-out data!**

# **🔥 Final Takeaways**

1️⃣ **Imbalanced datasets cause models to ignore the minority class (e.g., goals, injuries, fraud).**  
2️⃣ **Resampling (oversampling, undersampling) balances class representation.**  
3️⃣ **Class weighting makes the model "care more" about rare events.**  
4️⃣ **SMOTE & synthetic data help generate more minority class examples.**  
5️⃣ **Ensemble methods (Boosting, XGBoost) improve rare class detection.**