# Project Overview

This project builds a predictive model that estimates the probability of winning a bid in an ad-exchange auction. The goal is to surpass the **baseline F1 score of 0.503** using the features available in the dataset, which describe various aspects of the auction process.

Given hardware limitations, I aimed to use tools and models that are efficient both in terms of memory and speed:

- **Polars**: Used for loading and preprocessing the data due to its high performance, especially with larger datasets.
- **LightGBM**: Chosen as the model because it is lightweight, trains quickly, and handles categorical features natively, which suits this task.

# Project Structure

<pre style="font-size:14px;">
AD-EXCHANGE-AUCTION-PREDICTION/
├── configs/
│   └── config.yaml
├── data/
│   ├── test_data.csv
│   └── train_data.csv
├── models/
│   └── trained_model.pkl
├── notebooks/
│   ├── 01_eda_clean.ipynb
│   └── 02_train_infer.ipynb
├── results/
├── src/
│   ├── data_preprocessing/
│   │   ├── cleaner.py
│   │   └── feature_engineering.py
│   └── utils/
│       └── helpers.py
├── approach_and_final_results.ipynb
├── environment.yml
├── hyperparameter_tuning.py
├── LICENSE
├── main.py
└── README.md
</pre>

This structure is designed for modularity and readability. `main.py` contains the core logic for training, cross-validation, and inference, while data cleaning and feature engineering live in the `data_preprocessing/` folder so that they can be managed and modified independently. Helper functions in `helpers.py` keep shared code reusable, which improves the overall maintainability of the project.


This is an example run of `main.py` on a 1M sample of the train data:

<pre style="font-size:14px;">
!python main.py

Data processed successfully.
--------------------------------------------------
Train data shape: (218045, 32)
Test data shape: (1000000, 32)
Number of features: 31
--------------------------------------------------
% target distribution in train data after downsampling:
 target
0    0.869848
1    0.130152
Name: proportion, dtype: float64
--------------------------------------------------

 Moving to model training...
No oversampling
Skipping cross-validation.

 Training final model...
Training time: 7.24 seconds
Final model performance on test data:

 **************************************************
F1 on test data: 0.56926
Improvement over Benchmark: 13.17%
**************************************************
Trained model saved at models/trained_model.pkl
Metrics saved to results/metrics_20241021_095651.txt
Feature importance plot saved to results/feature_importances.png
</pre>
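
Two of the numbers in this log can be reproduced with a few lines of plain Python. The sketch below is illustrative only and is not the code in `main.py`: the function names `downsample_majority`, `f1_score`, and `improvement_pct` are hypothetical, random undersampling of the negative class is an assumption about how the downsampling might work, and the improvement is assumed to be the relative gain over the 0.503 baseline, which does match the 13.17% reported above.

```python
import random


def downsample_majority(labels, pos_share, seed=0):
    """Return sorted row indices that keep every positive (label 1) and a
    random subset of negatives (label 0), so that positives make up roughly
    `pos_share` of the kept rows. Assumed strategy, not the project's own."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives needed for the requested positive share.
    n_neg = min(len(neg), round(len(pos) * (1 - pos_share) / pos_share))
    rng = random.Random(seed)
    return sorted(pos + rng.sample(neg, n_neg))


def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def improvement_pct(score, baseline):
    """Relative improvement of `score` over `baseline`, in percent."""
    return (score - baseline) / baseline * 100


# Toy labels: 100 wins among 10,100 auctions, downsampled to ~13% wins.
labels = [1] * 100 + [0] * 10_000
kept = downsample_majority(labels, pos_share=0.13)
print(f"positive share after downsampling: {100 / len(kept):.3f}")

# Reproduces the "Improvement over Benchmark" line from the log.
print(f"improvement over baseline: {improvement_pct(0.56926, 0.503):.2f}%")
```

Downsampling is applied only to the training rows here; the test set keeps its original class balance, so the F1 score remains comparable with the baseline.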
