<h1 style="color:blue;">Project Overview</h1>

This project builds a model that predicts the probability of winning a bid in an ad-exchange auction. The goal is to beat the **baseline F1 score of 0.503** using the features available in the dataset, which describe various aspects of the auction process.

Given hardware limitations, I aimed to use tools and models that are efficient both in terms of memory and speed:

- **Polars**: Used for loading and preprocessing the data due to its high performance, especially with larger datasets.
- **LightGBM**: Chosen as the model due to its lightweight nature, efficiency, and ability to handle categorical features, making it ideal for this task.

## Project Structure

<pre style="font-size:14px;">
AD-EXCHANGE-AUCTION-PREDICTION/
├── configs/
│   ├── config.yaml
├── data/
│   ├── test_data.csv
│   ├── train_data.csv
├── models/
│   ├── trained_model.pkl
├── notebooks/
│   ├── 01_eda_clean.ipynb
│   ├── 02_train_infer.ipynb
├── results/
├── src/
│   ├── data_preprocessing/
│   │   ├── cleaner.py
│   │   ├── feature_engineering.py
│   ├── utils/
│   │   ├── helpers.py
├── approach_and_final_results.ipynb
├── environment.yml
├── hyperparameter_tuning.py
├── LICENSE
├── main.py
├── README.md
</pre>

This project structure is designed to ensure modularity and readability. While the `main.py` file contains the core logic for training, cross-validation, and inference, separating data cleaning and feature engineering into the `data_preprocessing/` folder makes it easier to manage and modify those processes independently. Additionally, helper functions in `helpers.py` allow for reusable, clean code, improving the overall maintainability of the project.

<h1 style="color:blue;">Initial Data Processing</h1>


The initial data processing focuses on optimizing memory usage, handling missing data, and addressing class imbalance. Key steps include:

- **Polars for Data Loading**: Polars provides fast, memory-efficient data loading and preprocessing.
  - **Sampling Option**: A smaller sample (1M rows) can be loaded for quicker iteration during development.

- **Dropping Non-informative Columns**: Removed columns like `ifa` (too many missing values), `sdk`, `adt`, `dc`, `ssp`, and `os` based on EDA insights.

- **Casting Column Types**: 
  - Floats as `float32` and integers as `int16` to optimize memory usage.
  - Categorical columns converted to categorical data types for efficiency.

- **Filling Missing Values**: Missing values in categorical columns were filled with `"unknown"`.

- **Downsampling Due to Class Imbalance**: The target is heavily imbalanced (8.3% positive), so the negative class was downsampled. The `neg_ratio` parameter sets how many negatives are kept per positive; it is later tuned with Optuna to improve the F1 score on the test set.
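
The downsampling step can be sketched in plain Python. This is a minimal illustration of the idea, not the project's Polars implementation: the helper name `downsample_negatives`, the fixed seed, and the toy label list are assumptions made for the example.

```python
import random


def downsample_negatives(labels, neg_ratio, seed=42):
    """Return sorted row indices that keep every positive and roughly
    `neg_ratio` negatives per positive (hypothetical helper)."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than the data contains.
    n_keep = min(len(neg_idx), int(round(neg_ratio * len(pos_idx))))
    rng = random.Random(seed)  # seeded so the sample is reproducible
    kept_neg = rng.sample(neg_idx, n_keep)
    return sorted(pos_idx + kept_neg)


# Toy data: 83 positives in 1,000 rows, mirroring the 8.3% positive rate.
labels = [1] * 83 + [0] * 917
kept = downsample_negatives(labels, neg_ratio=4)
pos_share = sum(labels[i] for i in kept) / len(kept)
print(len(kept), round(pos_share, 3))
```

With `neg_ratio = 4`, the 83 positives are joined by 332 negatives, so 415 rows remain and the positive share rises to 1 / (1 + 4) = 20%.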

<h1 style="color:blue;">Model Training, Cross-Validation, and Inference</h1>

The model training process was set up using **LightGBM** with simple initial parameters. Key steps included:

- **Training Setup**: LightGBM was selected for its efficiency and performance, with basic model parameters specified in the `config.yaml` file.
  
- **Cross-Validation Setup**: A K-Fold cross-validation approach was used to evaluate the model's performance on multiple folds. The process outputs F1 scores for both training and validation sets across the folds, providing a robust assessment of the model's generalization capabilities.

- **Evaluation on Test Dataset**: After training on the entire dataset, the model was evaluated on the test set. Key outputs include:
  - **F1 Score** on the test data to assess model performance.
  - **Improvement over Benchmark**: The F1 score is compared against a predefined benchmark from the configuration file, and the percentage improvement is calculated.

- **Outputs**:
  - **Metrics**: Cross-validation F1 scores, train/test performance, and improvement over the benchmark.
  - **Feature Importance**: A plot showing the importance of each feature in the model.
  - **Cross-Validation F1 Fold Plots**: A plot showing F1 scores across different validation folds.
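
For reference, the two headline metrics can be written out in a few lines of plain Python. This is a sketch rather than the code in `main.py`: the function names are hypothetical, and the improvement is assumed to be the relative gain over the 0.503 baseline, which reproduces the percentages reported in this README.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class, from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def improvement_pct(f1, benchmark):
    """Relative improvement over the benchmark, in percent."""
    return (f1 - benchmark) / benchmark * 100


BENCHMARK_F1 = 0.503  # baseline F1 stated in the overview
print(round(improvement_pct(0.60681, BENCHMARK_F1), 2))
```

A test F1 of 0.60681 gives (0.60681 - 0.503) / 0.503 ≈ 20.64%.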


## Initial Results

### No Downsampling

**Mean Train F1 Score**: 0.59354  
**Mean Validation F1 Score**: 0.59076  
**Test F1 Score**: 0.60681

#### **Improvement over Benchmark**: 20.64%

<div style="display: flex; justify-content: space-between; width:60%">
    <img src="results/01_no_down_no_fe/cross_validation_scores.png" alt="Cross-Validation Plot" style="width:60%;">
    <img src="results/01_no_down_no_fe/feature_importances.png" alt="Feature Importance Plot" style="width:60%;">
</div>

### Downsampling with an Arbitrary NEG_RATIO = 4

**Mean Train F1 Score**: 0.75363  
**Mean Validation F1 Score**: 0.75002  
**Test F1 Score**: 0.60930

#### **Improvement over Benchmark**: 21.13%

<div style="display: flex; justify-content: space-between; width:60%">
    <img src="results/02_neg_ratio4_no_fe/cross_validation_scores.png" alt="Cross-Validation Plot" style="width:60%;">
    <img src="results/02_neg_ratio4_no_fe/feature_importances.png" alt="Feature Importance Plot" style="width:60%;">
</div>
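
One plausible reading of the gap between the validation F1 (about 0.75) and the test F1 (about 0.61) is that the validation folds are drawn from the downsampled data while the test set keeps the original class balance. That is an assumption about the pipeline, not something the results above confirm. The sketch below shows the mechanism: for a classifier with a fixed true-positive and false-positive rate, precision (and therefore F1) rises with the positive share of the evaluation data. The rates used are illustrative, not measured from the model.

```python
def f1_at_prevalence(prevalence, tpr, fpr):
    """F1 of a classifier with fixed TPR/FPR, evaluated on data in which
    a fraction `prevalence` of the rows is positive."""
    tp = prevalence * tpr          # expected true-positive mass
    fp = (1 - prevalence) * fpr    # expected false-positive mass
    precision = tp / (tp + fp)
    recall = tpr                   # recall does not depend on prevalence
    return 2 * precision * recall / (precision + recall)


TPR, FPR = 0.70, 0.05  # illustrative values only

for name, prevalence in [
    ("original balance (8.3% positive)", 0.083),
    ("NEG_RATIO = 4 (20% positive)", 0.20),
]:
    print(f"{name}: F1 = {f1_at_prevalence(prevalence, TPR, FPR):.3f}")
```

If this reading is right, F1 scores measured on downsampled folds are best compared with each other rather than with the test F1.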

## Feature Engineering

The following features were engineered to enhance model performance:

- **Log Transformation**: Applied log transformation to price-related columns (`flr`, `sellerClearPrice`, `price`) to reduce skewness.
  
- **Categorical Hour**: Converted the `hour` column into a categorical variable (`hour_cat`) and created hour bands (`hour_band`) for better time-based segmentation (e.g., Morning, Afternoon).

- **Language Simplification**: Extracted the general language part from the `lang` column by splitting and retaining the first part, reducing granularity.

- **Screen Size**: Combined the width and height of the device's screen into a new categorical `screen_size` feature.

- **Low-Frequency Categories**: Used the `replace_less_frequent_polars` function from the helpers module to group low-frequency values in categorical columns (`sdkver`, `lang`, `country`, `region`, `screen_size`) into an "other" category.
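
The transformations above can be sketched in plain Python to make the logic concrete. This is not the code in `feature_engineering.py`: the function names are hypothetical, `replace_rare` is only a stand-in for `replace_less_frequent_polars`, and the language separators, hour-band cut-offs, and use of `log1p` are assumptions.

```python
import math
from collections import Counter


def simplify_lang(lang):
    """Keep the general language part, e.g. 'en-US' -> 'en'.
    Assumes '-' or '_' separates language from region."""
    return lang.replace("_", "-").split("-")[0].lower()


def hour_band(hour):
    """Map an hour (0-23) to a coarse band; the cut-offs are illustrative."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"


def replace_rare(values, min_count, other="other"):
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


prices = [0.0, 0.5, 12.0]
log_prices = [math.log1p(p) for p in prices]  # log(1 + x) keeps zero prices finite

width, height = 1080, 1920
screen_size = f"{width}x{height}"  # width and height combined into one category

print(simplify_lang("en-US"), hour_band(9), replace_rare(["a", "a", "b"], 2))
```

Grouping rare values this way keeps the number of distinct categories small, which matters for high-cardinality columns such as `region`.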

---

### Model Performance After Initial Feature Engineering (NEG_RATIO = 4)

**Mean Train F1 Score**: 0.74860  
**Mean Validation F1 Score**: 0.74673  
**Test F1 Score**: 0.60729  

#### **Improvement over Benchmark**: 20.73%

<div style="display: flex; justify-content: space-between; width:60%">
    <img src="results/03_fe1/cross_validation_scores.png" alt="Cross-Validation Plot" style="width:60%;">
    <img src="results/03_fe1/feature_importances.png" alt="Feature Importance Plot" style="width:60%;">
</div>

---

## Additional Feature Engineering: Price Normalization 

In addition to the feature engineering above, a second function normalizes the price columns (`flr`, `sellerClearPrice`, `price`) using the mean and standard deviation computed per `country`, `dsp`, and `hour_cat`.
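
A group-wise z-score can be sketched as follows. This is a plain-Python illustration, not the project's function: the name `normalize_by_group` is hypothetical, the grouping columns are treated as one combined key, and the population standard deviation is used. All three are assumptions.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def normalize_by_group(rows, value_key, group_keys):
    """Return one z-score per row, computed within the row's group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row[value_key])

    # Mean and (population) standard deviation for each group.
    stats = {g: (fmean(v), pstdev(v)) for g, v in groups.items()}

    z_scores = []
    for row in rows:
        mean, std = stats[tuple(row[k] for k in group_keys)]
        # A constant group has std 0; return 0.0 instead of dividing by zero.
        z_scores.append((row[value_key] - mean) / std if std > 0 else 0.0)
    return z_scores


rows = [
    {"country": "US", "dsp": "a", "price": 1.0},
    {"country": "US", "dsp": "a", "price": 3.0},
    {"country": "DE", "dsp": "a", "price": 10.0},
    {"country": "DE", "dsp": "a", "price": 10.0},
]
z = normalize_by_group(rows, "price", ["country", "dsp"])
print(z)
```

In Polars the same idea is usually expressed with a window expression (`over`) rather than explicit loops, which avoids materializing the groups in Python.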

---

### Model Performance After Price Normalization (NEG_RATIO = 4)

**Mean Train F1 Score**: 0.74851  
**Mean Validation F1 Score**: 0.74685  
**Test F1 Score**: 0.61082  

#### **Improvement over Benchmark**: 21.44%

<div style="display: flex; justify-content: space-between; width:60%">
    <img src="results/04_fe2_group/cross_validation_scores.png" alt="Cross-Validation Plot" style="width:60%;">
    <img src="results/04_fe2_group/feature_importances.png" alt="Feature Importance Plot" style="width:60%;">
</div>

<h1 style="color:blue;">Hyperparameter Tuning</h1>

The cross-validation results suggested that the model was underfitting, likely because the initial configuration was too constrained (e.g., only `n_estimators = 300`). To address this, I used **Optuna** to tune the hyperparameters that control model complexity while keeping training and inference time in check. I also tuned `NEG_RATIO` to improve the handling of class imbalance.

### Optuna Setup:
```python
def objective(trial):
    NEG_RATIO = trial.suggest_float('neg_ratio', 2, 6)
    lgb_params = {
        'n_estimators': 500,
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 6),
        'num_leaves': trial.suggest_int('num_leaves', 20, 64),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-4, 1.0, log=True),
    }
    # Training code...
    return test_f1
```
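
One interaction in this search space is worth spelling out: a binary tree of depth `d` has at most `2**d` leaves, so `max_depth` also caps the number of leaves a tree can have. The short sketch below applies that bound to the ranges used above; the helper name is hypothetical.

```python
def max_leaves_for_depth(max_depth):
    """A binary tree of depth d has at most 2**d leaves."""
    return 2 ** max_depth


for max_depth in range(3, 7):        # search range for max_depth
    cap = max_leaves_for_depth(max_depth)
    for num_leaves in (20, 64):      # bounds of the num_leaves search range
        effective = min(num_leaves, cap)
        print(f"max_depth={max_depth} num_leaves={num_leaves} "
              f"-> at most {effective} leaves")
```

For trials with `max_depth` of 3 or 4, the cap (8 or 16 leaves) is below the entire `num_leaves` range, so `num_leaves` cannot change the model in those trials. Separately, LightGBM's parameter documentation notes that row subsampling only takes effect when the bagging frequency (`subsample_freq`, alias `bagging_freq`) is non-zero, so it may be worth confirming that setting alongside `subsample`.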

Running 20 Optuna trials raised the test F1 score from 0.61082 (price normalization, NEG_RATIO = 4) to 0.6195.

**Best Trial:**
```
F1 Score: 0.6195  
Parameters: {'neg_ratio': 4.92, 'learning_rate': 0.056, 'max_depth': 6, 'num_leaves': 35, 'subsample': 0.89, 'lambda_l2': 0.00031}
```
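
As a quick consistency check, the tuned `neg_ratio` can be tied to the class balance printed in the training log. The sketch assumes `neg_ratio` means negatives kept per positive; the function names are hypothetical.

```python
def positive_share(neg_ratio):
    """Share of positives after keeping `neg_ratio` negatives per positive."""
    return 1 / (1 + neg_ratio)


def implied_neg_ratio(neg_share, pos_share):
    """Negatives per positive implied by an observed class distribution."""
    return neg_share / pos_share


# neg_ratio from the best trial; shares from the target distribution in the log.
print(round(positive_share(4.92), 4))
print(round(implied_neg_ratio(0.831062, 0.168938), 2))
```

Both directions agree: a ratio of 4.92 implies about 16.9% positives, and the logged distribution implies a ratio of about 4.92.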

Output of `python main.py` for the final model:

```
Data processed successfully.
--------------------------------------------------
Train data shape: (4747671, 32)
Test data shape: (1500000, 32)
Number of features: 31
--------------------------------------------------
% target distribution in train data after downsampling:
 target
0    0.831062
1    0.168938
Name: proportion, dtype: float64
--------------------------------------------------

 Moving to model training...
No oversampling
Skipping cross-validation.

 Training final model...
Training time: 96.78 seconds
Final model performance on test data:

 **************************************************
F1 on test data: 0.61949
Improvement over Benchmark: 23.16%
**************************************************
Trained model saved at models/trained_model.pkl
Metrics saved to results/metrics_20241021_135229.txt
```


### Final Model Performance After Hyperparameter Tuning

**Mean Train F1 Score**: 0.72010  
**Mean Validation F1 Score**: 0.71821  
**Test F1 Score**: 0.61949  

#### **Improvement over Benchmark:** 23.16%

<div style="display: flex; justify-content: space-between; width:60%">
    <img src="results/05_tuned/cross_validation_scores.png" alt="Cross-Validation Plot" style="width:60%;">
    <img src="results/05_tuned/feature_importances.png" alt="Feature Importance Plot" style="width:60%;">
</div>