competitive-data-science-final-project

Project outline

This file is the summary of my project and should be seen as the main source of documentation. Experiments and EDA were done in Jupyter notebooks inside ./notebooks. Each notebook is prefixed with a number indicating the order in which it should be read. To view them all, run jupyter lab under <project-root>/notebooks. Each title also names the analysis step I was in when I wrote the notebook (EDA, feature engineering, hyperparameter tuning).

After exploring in the notebooks, I ported the logic into scripts and plain Python code inside ./src so the results are easier to manage. I also used GNU Make to define and run the steps needed to generate the final results, reports, etc. All I had to do was run make all after changes and it would automatically detect what needed to be re-run.

Environment setup

This submission makes use of pipenv, so to install the dependencies, just run pipenv install inside the project directory. After that, run pipenv shell and you should be all set.

We also use GNU Make to process the data and generate the final solution.

Also make sure the kaggle CLI is configured, as it is used to fetch the datasets.

Generating the Train Set Samples

This was the biggest challenge since I was sure I needed to exploit the way the test set was generated.

The final solution, which apparently fits the test set quite well, was, in summary, to do the following for each month:

  1. Get the shops that appeared in the previous month
  2. Get the items that appeared in the current month
  3. Generate all possible pairs with the selected shops and items

The exploration can be seen in [notebook 22](./notebooks/22-eda train test round 2.ipynb).
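
A minimal pandas sketch of this procedure (column names follow the competition data; the actual implementation lives in ./src):

```python
import pandas as pd


def month_samples(sales: pd.DataFrame, month: int) -> pd.DataFrame:
    """All (shop, item) pairs for `month`: shops seen in the previous
    month crossed with items seen in the current month."""
    shops = sales.loc[sales["date_block_num"] == month - 1, "shop_id"].unique()
    items = sales.loc[sales["date_block_num"] == month, "item_id"].unique()
    pairs = pd.MultiIndex.from_product(
        [shops, items], names=["shop_id", "item_id"]
    ).to_frame(index=False)
    pairs["date_block_num"] = month
    return pairs


# e.g. build samples for every month that has a previous month:
# sales = pd.read_csv("data/sales_train.csv")
# samples = pd.concat([month_samples(sales, m) for m in range(1, 34)])
```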

Feature sets

Each feature set is generated by a script, and its baseline validation score comes from an XGB model with mostly default parameters evaluated on a time series split over 3 months (31, 32, and 33). Most of the features were explored in one or more notebooks; to see the exploration, check the notebooks with "feature engineering" in the file name. I also included the notebook IDs in the table below.
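
The evaluation behind those baseline scores is roughly the following sketch (assuming each feature set is a DataFrame with a date_block_num column and the item_cnt_month target; the near-default XGBoost settings here are illustrative, not copied from the repo):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error


def baseline_score(df, feature_cols, target_col="item_cnt_month",
                   val_months=(31, 32, 33)):
    """Mean RMSE of a mostly-default XGBoost regressor over a 3-month
    time series split: for each validation month, train on all earlier months."""
    scores = []
    for month in val_months:
        train = df[df["date_block_num"] < month]
        val = df[df["date_block_num"] == month]
        model = xgb.XGBRegressor(n_estimators=100)
        model.fit(train[feature_cols], train[target_col])
        preds = model.predict(val[feature_cols])
        scores.append(np.sqrt(mean_squared_error(val[target_col], preds)))
    return float(np.mean(scores))
```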

Note: the best score is in bold.

| ID | Notebook ID | Description | Baseline validation score |
| -- | -- | -- | -- |
| 000 | 01 | baseline: shop ids, date block num | 1.08053 |
| 001 | | month number, year number | 1.09704 |
| 002 | 04 | lagged item counts | 0.88733 |
| 003 | 02 | item categories metadata | 1.03844 |
| 004 | [09](./notebooks/09-feature-engineering-shop mean encoding.ipynb) | target encoding using item_id, shop_id, category_name, and subcategory_name | 0.83296 |
| 005 | 10 | external indicators: RUB to USD/CNY/EUR conversions, MOEX (Moscow Exchange) index (lagged and same month values) | 1.09703 |
| 006 | | 000 + 001 + 002 + 003 + 004 | 0.82067 |
| 007 | 04, [11](./notebooks/11-feature selection-lagged item cnts.ipynb) | 002 after feature selection | 0.88639 |
| 008 | | 000 + 001 + 003 + 004 + 007 | 0.81741 |
| 009 | [13](./notebooks/13-feature engineering-item prices.ipynb) | Median prices for item, item+shop, category, shop+category | 1.11636 |
| 010 | [14](./notebooks/14-feature engineering-ranks.ipynb) | Lagged ranks for item price and item cnt over item, category, shop+category | 0.91049 |
| 013 | | 008 + 009 + 010 | 0.81455 |
| 014 | | 008 + 009 | 0.81981 |
| 015 | | 008 + 010 | 0.81114 |
| 016 | [16](./notebooks/16-feature engineering-time deltas.ipynb) | deltas for item sales | 0.89988 |
| 017 | | 015 + 016 | 0.81726 |
| 018 | [13](./notebooks/13-feature engineering-item prices.ipynb), [16](./notebooks/16-feature engineering-time deltas.ipynb) | deltas for item prices | 1.21004 |
| 019 | | 017 + 018 | 0.82302 |
| 020 | 15 | revenue and sales / price | 0.88878 |
| 021 | | 019 + 020 | 0.81892 |
| 022 | [18](./notebooks/18-feature engineering-release dates.ipynb) | Release dates for item, item+shop, shop | 1.06677 |
| 023 | | 021 + 022 | 0.81233 |
| 024 | [19](./notebooks/19-feature engineering-last seen.ipynb) | Months since last seen for item, item+shop | 1.04692 |
| 025 | | 023 + 024 | **0.80703** |

By comparing the baseline validation scores I was able to mix and match feature sets until I arrived at the dataset I used for the final solution: 025.

Learning Algorithms

This section contains an explanation of each learning algorithm and the HPO strategy used to configure it.

XGBoost

Source code for building and tuning here.

For HPO I used Optuna with the default TPE sampler and a Hyperband pruner.

After choosing most of the hyperparameters with Optuna, I found the optimal number of estimators (i.e. boosting rounds) by training with early stopping. This probably doesn't yield the best result, but I decided to do it anyway to speed up the HPO.
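
A condensed sketch of that setup, assuming X_train/y_train and X_val/y_val are already split by month (the search space below is illustrative, not the repo's):

```python
import numpy as np
import optuna
import xgboost as xgb

# Assumed to already exist: X_train, y_train (earlier months), X_val, y_val (validation month).
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)


def objective(trial):
    params = {
        "objective": "reg:squarederror",
        "eval_metric": "rmse",
        "max_depth": trial.suggest_int("max_depth", 4, 12),
        "eta": trial.suggest_float("eta", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 300),
    }
    # Report per-round validation RMSE so the Hyperband pruner can kill bad trials early.
    pruning_cb = optuna.integration.XGBoostPruningCallback(trial, "validation-rmse")
    booster = xgb.train(params, dtrain, num_boost_round=300,
                        evals=[(dval, "validation")],
                        callbacks=[pruning_cb], verbose_eval=False)
    return float(booster.eval(dval).split(":")[-1])


study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),  # Optuna's default sampler
    pruner=optuna.pruners.HyperbandPruner(),
)
study.optimize(objective, n_trials=100)

# With the other hyperparameters fixed, choose the number of boosting rounds
# separately via early stopping, as described above.
best_params = {"objective": "reg:squarederror", "eval_metric": "rmse",
               **study.best_params}
final = xgb.train(best_params, dtrain, num_boost_round=5000,
                  evals=[(dval, "validation")],
                  early_stopping_rounds=50, verbose_eval=False)
```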

Linear Regression

Source code for building and tuning here.

This was done using scikit-learn's stochastic gradient descent implementation, aka SGDRegressor.

Since this is also a gradient descent method, I decided to implement my own training loop so I could take advantage of Optuna's Hyperband pruner as well. After finding the optimal hyperparameters, I ran a second pass on the best configuration with an increased number of iterations to find the optimal number of iterations for this configuration.

Just like with XGBoost, tuning the number of iterations separately from the other hyperparameters is suboptimal, but it was a trade-off I was willing to make to speed up the process.

For preprocessing, a standard linear-model preprocessor was put together: one-hot encoding for categorical variables, followed by scaling the variance to 1 (the mean was not centered, to preserve the sparsity of the train set matrix).
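
A sketch of that preprocessor and the hand-rolled, prunable training loop (column names and search ranges below are placeholders, not the ones from the repo):

```python
import numpy as np
import optuna
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column lists; the real ones come from the chosen feature set.
categorical_cols = ["shop_id", "item_category_id", "month"]
numeric_cols = ["item_cnt_lag_1", "item_price_median"]

# One-hot encode categoricals, then scale variance to 1. with_mean=False
# skips centering so the matrix stays sparse.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(with_mean=False), numeric_cols),
])

# Assumed to already exist: X_train, y_train, X_val, y_val.
Xt_train = preprocessor.fit_transform(X_train)
Xt_val = preprocessor.transform(X_val)


def objective(trial):
    model = SGDRegressor(
        alpha=trial.suggest_float("alpha", 1e-6, 1e-1, log=True),
        eta0=trial.suggest_float("eta0", 1e-4, 1e-1, log=True),
    )
    # Hand-rolled epoch loop so each epoch's score can be reported to the pruner.
    for epoch in range(50):
        model.partial_fit(Xt_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_val, model.predict(Xt_val)))
        trial.report(rmse, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return rmse


study = optuna.create_study(direction="minimize",
                            pruner=optuna.pruners.HyperbandPruner())
study.optimize(objective, n_trials=50)
```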

LightGBM

Source code for building and tuning here.

Used with Optuna for HPO. Optuna's LightGBM integration was great since it implements a stepwise algorithm paired with random search. It was still slower than the searches used for XGBoost and SGD, but the results were good.
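
As a rough sketch, the integration is typically used like this (dataset names and the base parameter dictionary below are placeholders):

```python
import lightgbm as lgb
import optuna.integration.lightgbm as olgb

# Assumed to already exist: X_train, y_train, X_val, y_val.
dtrain = lgb.Dataset(X_train, label=y_train)
dval = lgb.Dataset(X_val, label=y_val, reference=dtrain)

params = {"objective": "regression", "metric": "rmse", "verbosity": -1}

# The tuner steps through groups of hyperparameters (leaves, feature fraction,
# bagging, regularization, ...) instead of searching the whole space at once.
tuner = olgb.LightGBMTuner(
    params, dtrain,
    valid_sets=[dval],
    num_boost_round=1000,
    callbacks=[lgb.early_stopping(50)],
)
tuner.run()

print(tuner.best_params)
print(tuner.best_score)
```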

Feed-Forward Neural Network

Source code for building and tuning here.

I really wanted a quick win here, so I just used these rules of thumb to get a decent score. The preprocessor is the same as the one used for SGD.

I didn't do much HPO; I just used early stopping to find a good number of epochs and tuned the initial learning rate of the Adam optimizer to get a nice learning curve.
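
This summary doesn't name the framework, so the snippet below is only an illustration of that recipe using Keras; the layer sizes and learning rate are assumptions, not the repo's values:

```python
import tensorflow as tf

# Assumed to already exist: X_train, y_train, X_val, y_val, preprocessed with
# the same one-hot + variance-scaling pipeline used for SGD.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # tuned by hand
    loss="mse",
)
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[tf.keras.callbacks.EarlyStopping(patience=5,
                                                restore_best_weights=True)],
)
```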

Experiments

We define an experiment as a feature set + an algorithm. Each experiment is evaluated by its validation score and its public LB score (which also serves as our generalization score). Since the number of submissions is limited, I didn't send every result to the competition; I used the validation score to choose which ones to submit.

| ID | Algorithm | Feature set | Validation Score | Public LB score |
| -- | -- | -- | -- | -- |
| 000 | XGB | 000 | 1.04264 | |
| 004 | XGB | 025 | 0.79191 | 0.91197 |
| 006 | LGB | 025 | 0.79357 | 0.91941 |
| 007 | MLP | 025 | 0.84572 | |

Stacking

For stacking I created cross-validation predictions for the last 8 months in the train set using a rolling window of 16 months. For instance, train on months 14 to 30 and generate predictions for month 31, then train on months 15 to 31 and generate predictions for month 32, and so on.

After that I used these out-of-fold predictions to train the second-layer estimator. For the validation score I trained on the first 7 of those months and predicted on the 8th.

| ID | Layer 0 IDs | Meta Estimator | Validation Score | Public LB Score |
| -- | -- | -- | -- | -- |
| 0 | 004, 006, 007 | SGD | 0.77110 | 0.91199 |
| 1 | 004, 006, 007 | XGB (small) | 0.78051 | 0.91361 |
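
A sketch of the out-of-fold scheme described above (the helper name, column names, and the base_models mapping are placeholders; the window bounds follow the worked example of training on months 14-30 to predict month 31):

```python
import pandas as pd

# Assumed: `df` is the final feature set (025) with a `date_block_num` column
# and `item_cnt_month` target; `base_models` maps experiment IDs to unfitted
# estimators with a scikit-learn-style fit/predict interface.
OOF_MONTHS = range(26, 34)  # the last 8 months of the train set


def layer0_predictions(df, feature_cols, base_models, target="item_cnt_month"):
    """Rolling-window out-of-fold predictions for the second layer, following
    the worked example above (train on months 14-30 to predict month 31, ...)."""
    folds = []
    for month in OOF_MONTHS:
        window = df[df["date_block_num"].between(month - 17, month - 1)]
        fold = df[df["date_block_num"] == month]
        preds = {"date_block_num": fold["date_block_num"].to_numpy(),
                 target: fold[target].to_numpy()}
        for name, model in base_models.items():
            model.fit(window[feature_cols], window[target])
            preds[name] = model.predict(fold[feature_cols])
        folds.append(pd.DataFrame(preds))
    return pd.concat(folds, ignore_index=True)


# The meta-estimator is then fit on the first 7 OOF months and validated on
# the 8th, as described above.
```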
