# CIS 5450 Project: Difficulty Topics
**Group Members:**
* **Amogh Channashetti**
* **Binoy Patel**
* **Yohan Vergis**


## **Topic 1: Imbalanced Data**
[Hyperlink](https://colab.research.google.com/drive/1yPnYThJtgEq3xZUUYbIjLBBkydeEufLu?authuser=1#scrollTo=JhyzvkwG91FY)
### **Why we used this concept**
Our target variable, `is_hit`, is extremely imbalanced: only ~3.5% of tracks in the dataset qualify as hits, where a hit is defined as popularity >= 70.

The resulting ~28:1 ratio of non-hits to hits pushes an unweighted model toward predicting non-hits. This yields high accuracy but poor performance on detecting hits.
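To make the accuracy trap concrete, here is a minimal sketch on made-up numbers (not our dataset): 1,000 toy tracks, 35 of which meet the popularity >= 70 threshold. The popularity values and the always-predict-non-hit "model" are illustrative assumptions.

```python
# Illustrative toy data only: 965 non-hits (popularity 40) and 35 hits (popularity 80).
popularity = [40] * 965 + [80] * 35

# Label rule from this report: a track is a hit when popularity >= 70.
is_hit = [int(p >= 70) for p in popularity]

# A trivial "model" that always predicts the majority class (non-hit).
predictions = [0] * len(is_hit)

# Accuracy: fraction of tracks labeled correctly.
accuracy = sum(int(p == y) for p, y in zip(predictions, is_hit)) / len(is_hit)

# Recall on hits: fraction of true hits the model actually flags.
hits_found = sum(1 for p, y in zip(predictions, is_hit) if p == 1 and y == 1)
recall = hits_found / sum(is_hit)

# Ratio of non-hits to hits.
imbalance_ratio = (len(is_hit) - sum(is_hit)) / sum(is_hit)

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, ratio={imbalance_ratio:.1f}:1")
```

The trivial model reaches 96.5% accuracy while finding no hits at all, which is why accuracy alone is a misleading metric here.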

The goal of this analysis is to identify the characteristics of hit songs, so we needed to ensure the model could meaningfully learn patterns associated with hits. Handling imbalanced data is critical to avoid biased models and to ensure the classifier is evaluated fairly.

### **How we implemented it**
We applied three imbalance-handling strategies:

1. **Class Weights (Baseline Logistic Regression)**  
   - Applied `class_weight="balanced"` to increase penalty on misclassified hits.

2. **SMOTE Oversampling**  
   - Used `SMOTE()` inside an `ImbPipeline` with scaling and logistic regression.  
   - Used synthesized samples to increase the number of hit-song instances.
3. **Random Undersampling**
   - Reduced the majority class size to match hits more closely.
   - Helps model focus on minority patterns at the cost of losing some data.

### **Results & Interpretation**
The three strategies shifted the precision/recall trade-off in different ways:

- SMOTE increased recall for hits. This shows the model can better identify minority samples.
- Undersampling produced more balanced results, though at the cost of slightly lower precision.
- Class-weighted logistic regression provided a solid baseline, but its precision remained low.
- These results demonstrated that resampling meaningfully changes model behavior. Imbalance handling is essential for interpreting model capability on rare hit songs.

The insights from this analysis shaped our final model selection and highlighted the difficulty of predicting hits based solely on audio features.

## **Topic 2: Ensemble Models**
[Hyperlink](https://colab.research.google.com/drive/1yPnYThJtgEq3xZUUYbIjLBBkydeEufLu?authuser=1#scrollTo=2U7snb1f4cK4&line=8&uniqifier=1)
### **Why we used this concept**
Predicting hit songs involves complex, nonlinear relationships between audio features such as energy, loudness, danceability, and acousticness. Linear models such as logistic regression can't capture these interactions well and showed weak performance in our baseline tests.

Ensemble models, particularly Random Forest and Gradient Boosting, are strong choices for noisy tabular data because they can learn nonlinear structures in the features. Using ensembles allowed us to improve predictive performance, capture richer feature interactions, and compare model behavior across different learning paradigms.

### **How we implemented it**
We trained and evaluated two ensemble classifiers:

1. **Random Forest Classifier**
   - Tuned key hyperparameters: number of trees, tree depth, and sample splits
   - Calculated performance using ROC-AUC, PR-AUC, precision, and recall
   - Extracted feature importances
2. **Gradient Boosting Classifier** (baseline + tuned)
   - Implemented a standard gradient boosting model
   - Improved it later with hyperparameter tuning
   - Evaluated using the same metrics for a fair comparison
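A minimal sketch of this comparison is shown below. It runs on synthetic data, and the specific hyperparameter values are placeholders rather than the ones we tuned. Mapping "sample splits" to `min_samples_split` and estimating PR-AUC with `average_precision_score` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the audio features, with roughly 3.5% positives.
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.965], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    # Placeholder values for number of trees, depth, and sample splits.
    "random_forest": RandomForestClassifier(
        n_estimators=200, max_depth=10, min_samples_split=5, random_state=0
    ),
    # Baseline gradient boosting with library defaults.
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # predicted probability of "hit"
    pred = model.predict(X_test)               # hard labels at the default 0.5 threshold
    scores[name] = {
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred),
    }
    print(name, scores[name])

# Impurity-based feature importances from the fitted forest (they sum to 1).
importances = models["random_forest"].feature_importances_
```

Scoring both models on the same held-out split with the same four metrics is what makes the comparison fair.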


### **Results & Interpretation**


- Random Forest achieved the best overall performance, especially on PR-AUC, indicating strong ranking capability for rare hits.
- Tuned Gradient Boosting improved significantly over its baseline model. This showed the effect of hyperparameter optimization.
- Ensemble methods consistently outperformed logistic regression.

These results suggest that audio-based hit prediction requires nonlinear modeling capacity.



## **Topic 3: Hyperparameter Tuning** (RandomizedSearchCV on Gradient Boosting)
[Hyperlink](https://colab.research.google.com/drive/1yPnYThJtgEq3xZUUYbIjLBBkydeEufLu?authuser=1#scrollTo=wMYAKQy5BBGx&line=5&uniqifier=1)
### **Why we used this concept**
Gradient Boosting is a strong model but highly sensitive to hyperparameters like learning rate, tree depth, and number of estimators. Our baseline model underperformed, indicating that it wasn't making good use of the available features. To achieve better performance and avoid underfitting/overfitting, we used RandomizedSearchCV, which efficiently explores a wide parameter space without the computational cost of grid search.




### **How we implemented it**
We defined a search space including:

- `learning_rate`
- `max_depth`
- `n_estimators`
- `subsample`

RandomizedSearchCV was run with cross-validation using ROC-AUC as the scoring metric. Using the best parameters found, we rebuilt the Gradient Boosting model and tested it on the same evaluation set as the other models.
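The search can be sketched as below. The four parameter names come from our search space, but the ranges, `n_iter`, the number of folds, and the synthetic data are illustrative assumptions, kept small so the example runs quickly.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the training data, with roughly 3.5% positives.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    weights=[0.965], random_state=0,
)

# Assumed ranges; scipy's uniform(loc, scale) samples from [loc, loc + scale].
param_distributions = {
    "learning_rate": uniform(0.01, 0.29),  # 0.01 to 0.30
    "max_depth": randint(2, 6),            # integers 2 through 5
    "n_estimators": randint(50, 151),      # integers 50 through 150
    "subsample": uniform(0.6, 0.4),        # 0.6 to 1.0
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                 # number of sampled configurations
    scoring="roc_auc",        # metric used to rank configurations
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

# With the default refit=True, best_estimator_ is refit on all of X, y.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the hit rate roughly constant across splits, which matters when positives are this rare.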


### **Results & Interpretation**
The tuned model significantly outperformed the baseline Gradient Boosting model. Its higher ROC-AUC and improved PR-AUC are especially valuable for imbalanced hit prediction.

Tuning allowed the model to capture more complex nonlinear interactions between audio features. This reduced the gap with Random Forest and showed that hyperparameter optimization makes a real difference for ensemble classifiers.