This repository contains implementations of Decision Tree and Ensemble Learning algorithms specifically designed for handling imbalanced datasets. The project was developed as part of a Machine Learning course assignment (Spring 2025) and includes custom implementations from scratch without relying on pre-built ML libraries for the core algorithms (e.g., no sklearn for tree logic).
- Implement Hellinger Distance Decision Tree (HDDT) suitable for imbalanced data.
- Develop Bagging with Undersampling ensemble method.
- Implement AdaBoost.M1 with SMOTE for handling class imbalance.
- Evaluate performance using appropriate metrics for imbalanced datasets (Precision, Recall, F-measure, AUC, G-mean).
├── data/
│ ├── Covid19HDDT.csv # Dataset for Part A (Decision Tree)
│ ├── Covid.csv # Dataset for Part B Method I (Bagging)
│ └── abalone.data # Dataset for Part B Method II (AdaBoost + SMOTE)
│
├── result/
│ ├── bagging_results.csv # Results from Bagging experiments
│ ├── method_II_results.csv # Results from AdaBoost + SMOTE experiments
│
├── code.ipynb
└── README.md # This file
Dataset: Covid19HDDT.csv (3-class imbalanced dataset)
Key Features:
- Custom HDDT implementation from scratch.
- Hellinger Distance used as the split criterion (robust to class imbalance).
- Support for both binary and multi-class classification.
- OVO (One-vs-One) and OVA (One-vs-All) strategies implemented for multi-class.
- Tree pruning with configurable max heights
{2, 3, 4, 5}.
Evaluation Metrics:
- Precision, Recall, F-measure (for minority class).
- AUC (Area Under ROC Curve).
- G-mean (Geometric Mean of sensitivity and specificity).
Dataset: Covid.csv (imbalanced binary classification)
Implementation:
- Bootstrap sampling from minority class.
- Random undersampling from majority class to balance each bootstrap sample.
- Base learners: HDDT and Decision Stump.
- Configurable number of estimators:
{11, 31, 51, 101}. - Evaluated over 10 independent runs.
Dataset: Abalone (UCI Machine Learning Repository)
Implementation:
- SMOTE (Synthetic Minority Over-sampling Technique) implemented from scratch to generate synthetic samples.
- AdaBoost.M1 with Decision Stump base learners.
- 5-fold cross-validation repeated 5 times (25 total runs).
- Preprocessing: One-Hot Encoding for categorical features (Sex), Standard Scaling for numerical features.
- Task: Binary classification (≤7 rings vs >7 rings).
Bonus Experiments:
- Varying the number of synthetic samples (
N) to analyze overfitting vs. performance trade-offs. - Visualizing decision boundaries and SMOTE synthetic samples.
| Method | Accuracy | Precision | Recall | F-measure | AUC | G-mean |
|---|---|---|---|---|---|---|
| OVA | 0.988 | 0.971 | 0.945 | 0.957 | 0.969 | 0.968 |
| OVO | 0.986 | 0.942 | 0.872 | 0.905 | 0.933 | 0.931 |
| Two-class | 0.986 | 0.942 | 0.872 | 0.905 | 0.933 | 0.931 |
Observation: One-vs-All (OVA) approach showed slightly better performance than One-vs-One (OVO).
| Base Learner | T | Precision | Recall | F-measure | AUC | G-mean |
|---|---|---|---|---|---|---|
| HDDT | 11 | 0.994 | 0.733 | 0.844 | 0.837 | 0.827 |
| HDDT | 101 | 0.995 | 0.740 | 0.849 | 0.850 | 0.835 |
| Stump | 11 | 0.999 | 0.692 | 0.817 | 0.840 | 0.826 |
| Stump | 101 | 1.000 | 0.692 | 0.818 | 0.842 | 0.830 |
Observation: HDDT base learners generally achieved higher Recall and G-mean compared to Decision Stumps.
| Method | Accuracy | AUROC | AUPR | G-mean |
|---|---|---|---|---|
| AdaBoost + SMOTE | 0.867 | 0.874 | 0.577 | 0.870 |
| Baseline (AdaBoost) | 0.905 | 0.939 | 0.820 | 0.837 |
Observation: SMOTE improved G-mean (better minority class detection) but decreased overall accuracy compared to the baseline.
- More robust to class imbalance than Gini or Entropy.
- Doesn't require smoothing for zero probabilities.
- Naturally handles skewed class distributions.
- Generates synthetic samples in feature space.
- Helps decision boundary learning.
- Reduces bias towards majority class.
- Focuses on misclassified samples.
- Combines weak learners into strong classifier.
- Works well with SMOTE-balanced data.
- UCI Machine Learning Repository - Abalone Dataset
- Hellinger Distance Decision Trees for Imbalanced Data
- SMOTE: Synthetic Minority Over-sampling Technique (Chawla et al., 2002)
- AdaBoost.M1: Freund & Schapire (1996)