This file is the summary of my project and should be seen as the main source of documentation. Experiments and EDA were done in Jupyter notebooks inside `./notebooks`. Each notebook is prefixed with a number: the order in which they should be read. To view them all, run `jupyter lab` under `<project-root>/notebooks`. The title also contains the analysis step I was in when I wrote the notebook (EDA, feature engineering, hyperparameter tuning).
After exploring in the notebooks, I ported the logic into scripts and plain Python code inside `./src` so it's easier to manage the results. I also used GNU Make to define and run the steps needed to generate the final results, reports, etc. All I had to do was run `make all` after changes and it would automatically detect what needed to be re-run.
This submission uses pipenv, so to install the dependencies, just run `pipenv install` inside the project directory. After that, run `pipenv shell` and you should be all set.
We also use GNU Make to process the data and generate the final solution.
Also make sure the kaggle CLI is configured, as it is used to fetch the datasets.
This was the biggest challenge, since I was sure I needed to exploit the way the test set was generated.
The final solution, which apparently fits the test set pretty well, was, in summary, to do the following for each month:
- Get the shops that appeared in the previous month
- Get the items that appeared in the current month
- Generate all possible pairs with the selected shops and items
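The steps above can be sketched with pandas; the column names (`shop_id`, `item_id`, `date_block_num`) follow the competition's sales file, and the tiny DataFrame here is just a stand-in:

```python
import pandas as pd

def month_pairs(sales: pd.DataFrame, month: int) -> pd.DataFrame:
    """Cartesian product of last month's shops with this month's items."""
    shops = sales.loc[sales["date_block_num"] == month - 1, "shop_id"].unique()
    items = sales.loc[sales["date_block_num"] == month, "item_id"].unique()
    pairs = pd.MultiIndex.from_product(
        [shops, items], names=["shop_id", "item_id"]
    ).to_frame(index=False)
    pairs["date_block_num"] = month
    return pairs

# toy sales log: (shop, item, month)
sales = pd.DataFrame({
    "shop_id":        [1, 2, 1, 3],
    "item_id":        [10, 10, 20, 30],
    "date_block_num": [0, 0, 1, 1],
})
print(month_pairs(sales, 1))  # shops {1, 2} x items {20, 30} -> 4 rows
```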
The exploration can be seen in [notebook 22](<./notebooks/22-eda train test round 2.ipynb>).
Each feature set is generated by a script, and the baseline validation score is generated by an XGB model with mostly default parameters, using a time series split over 3 months (31, 32, and 33). Most of the features were explored in one or more notebooks. To see the exploration, check the notebooks with `feature engineering` in the file name. I also included the notebook IDs in the table below.
Note: the best score is in bold.
ID | Notebook ID | Description | Baseline validation score |
---|---|---|---|
000 | 01 | baseline: shop ids, date block num | 1.08053 |
001 | | month number, year number | 1.09704 |
002 | 04 | lagged item counts | 0.88733 |
003 | 02 | item categories metadata | 1.03844 |
004 | [09](<./notebooks/09-feature-engineering-shop mean encoding.ipynb>) | target encoding using `item_id`, `shop_id`, `category_name`, and `subcategory_name` | 0.83296 |
005 | 10 | external indicators: RUB to USD/CNY/EUR conversions, MOEX (Moscow Exchange) index (lagged and same-month values) | 1.09703 |
006 | | 000 + 001 + 002 + 003 + 004 | 0.82067 |
007 | 04, [11](<./notebooks/11-feature selection-lagged item cnts.ipynb>) | 002 after feature selection | 0.88639 |
008 | | 000 + 001 + 003 + 004 + 007 | 0.81741 |
009 | [13](<./notebooks/13-feature engineering-item prices.ipynb>) | median prices for item, item+shop, category, shop+category | 1.11636 |
010 | [14](<./notebooks/14-feature engineering-ranks.ipynb>) | lagged ranks for item price and item cnt over item, category, shop+category | 0.91049 |
013 | | 008 + 009 + 010 | 0.81455 |
014 | | 008 + 009 | 0.81981 |
015 | | 008 + 010 | 0.81114 |
016 | [16](<./notebooks/16-feature engineering-time deltas.ipynb>) | deltas for item sales | 0.89988 |
017 | | 015 + 016 | 0.81726 |
018 | [13](<./notebooks/13-feature engineering-item prices.ipynb>), [16](<./notebooks/16-feature engineering-time deltas.ipynb>) | deltas for item prices | 1.21004 |
019 | | 017 + 018 | 0.82302 |
020 | 15 | revenue and sales / price | 0.88878 |
021 | | 019 + 020 | 0.81892 |
022 | [18](<./notebooks/18-feature engineering-release dates.ipynb>) | release dates for item, item+shop, shop | 1.06677 |
023 | | 021 + 022 | 0.81233 |
024 | [19](<./notebooks/19-feature engineering-last seen.ipynb>) | months since last seen for item, item+shop | 1.04692 |
025 | | 023 + 024 | **0.80703** |
By comparing the baseline validation scores I was able to mix and match feature sets until I arrived at the dataset used for the final solution: 025.
This section contains an explanation of each learning algorithm and the HPO strategy used to configure them.
### XGBoost

Source code for building and tuning here.
For HPO I used Optuna with the default TPE sampler and a Hyperband pruner.
After choosing most of the hyperparameters with Optuna, I found the optimal number of estimators (boost rounds) by training with early stopping. This probably doesn't yield the best result, but I decided to do it anyway to speed up the HPO.
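The round-selection step can be sketched as follows. This is not the project's actual code: xgboost does this during training via `early_stopping_rounds`, and here scikit-learn's `GradientBoostingRegressor` with `staged_predict` stands in to show the same idea of fixing the tuned hyperparameters and then picking the best round on a validation set:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# synthetic regression data standing in for the feature set
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=400)
X_tr, X_val, y_tr, y_val = X[:300], X[300:], y[:300], y[300:]

# fix the tuned hyperparameters and fit with a generous cap on rounds...
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)

# ...then keep the round with the best validation error
val_errors = [mean_squared_error(y_val, pred) for pred in model.staged_predict(X_val)]
best_rounds = int(np.argmin(val_errors)) + 1
print("optimal number of boosting rounds:", best_rounds)
```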
### SGD

Source code for building and tuning here.
This was done using scikit-learn's stochastic gradient descent implementation, aka `SGDRegressor`.
Since this is also a gradient descent method, I decided to implement my own training loop so I could take advantage of Optuna's Hyperband pruner as well. After finding the best hyperparameters, I ran a second pass on the best configuration with an increased number of iterations to find the optimal number of iterations for this configuration.
Just as with XGBoost, tuning the number of iterations separately from the other hyperparameters is suboptimal, but it was a trade-off I was willing to make to speed up the process.
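A minimal sketch of such a training loop, on synthetic stand-in data. The Optuna calls are left as comments so the snippet stays self-contained; in the real loop each epoch's validation score would be reported to the pruner:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)
X_tr, X_val, y_tr, y_val = X[:400], X[400:], y[:400], y[400:]

model = SGDRegressor(learning_rate="invscaling", eta0=0.01, random_state=0)
best_rmse, best_epoch = float("inf"), 0
for epoch in range(200):
    model.partial_fit(X_tr, y_tr)  # one pass over the training data
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    # With Optuna, this is where you'd call trial.report(rmse, epoch)
    # and raise optuna.TrialPruned() if trial.should_prune().
    if rmse < best_rmse:
        best_rmse, best_epoch = rmse, epoch
print("best epoch:", best_epoch + 1)
```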
For preprocessing, a standard linear-model preprocessor was put together: one-hot encoding for the categorical variables, then scaling the dataset's variance to 1 (the mean was not centered, to preserve the sparseness of the training matrix).
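With scikit-learn this kind of preprocessor looks roughly like the sketch below (the column names are illustrative, not the project's actual feature names); `with_mean=False` is what keeps the matrix sparse:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "shop_id": [1, 2, 1, 3],                  # categorical
    "item_cnt_lag_1": [0.0, 2.0, 1.0, 5.0],   # numeric
})

preprocessor = make_pipeline(
    ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["shop_id"])],
        remainder="passthrough",
        sparse_threshold=1.0,  # keep the output as a sparse matrix
    ),
    # with_mean=False: centering would densify the sparse matrix
    StandardScaler(with_mean=False),
)
X = preprocessor.fit_transform(df)
print(X.shape)  # 3 one-hot columns + 1 scaled numeric column
```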
### LightGBM

Source code for building and tuning here.
Used with Optuna for HPO. Optuna's LGBM integration was great, since it implements the stepwise tuning algorithm paired with random search. It was still slower than the HPO for XGBoost and SGD, but the results were good.
### MLP

Source code for building and tuning here.
I really wanted a quick win here, so I just used rules of thumb to get a decent score. The preprocessor is the same as the one used for SGD.
I didn't do much HPO: I just used early stopping to find a good number of epochs and tuned the initial learning rate of the Adam optimizer to get a nice learning curve.
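The write-up doesn't name the neural-network framework, so as an illustration only, the same recipe (Adam, a tuned initial learning rate, early stopping on epochs) can be expressed with scikit-learn's `MLPRegressor`; the layer sizes and data here are made up:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=600)

mlp = MLPRegressor(
    hidden_layer_sizes=(64, 32),   # made-up architecture
    solver="adam",
    learning_rate_init=1e-3,       # the knob that was actually tuned
    early_stopping=True,           # hold out 10% and stop when val score plateaus
    n_iter_no_change=10,
    max_iter=500,
    random_state=0,
)
mlp.fit(X, y)
print("epochs actually run:", mlp.n_iter_)
```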
We define an experiment as a feature set plus an algorithm. Experiments are evaluated by the validation score and the public LB score (also our generalization score). Since the number of submissions is limited, I didn't send every result to the competition; I used my validation score to choose which ones to submit.
ID | Algorithm | Feature set | Validation Score | Public LB score |
---|---|---|---|---|
000 | XGB | 000 | 1.04264 | |
004 | XGB | 025 | 0.79191 | 0.91197 |
006 | LGB | 025 | 0.79357 | 0.91941 |
007 | MLP | 025 | 0.84572 | |
For stacking I created cross-validation predictions for the last 8 months in the train set using a rolling window of 16 months. For instance, train on months 14 to 30 and generate predictions for month 31, then train on months 15 to 31 and generate predictions for month 32, etc.
After that I used these predictions to train the estimator on the second layer. For the validation score I trained on the first 7 months and predicted on the 8th.
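The rolling-window scheme can be sketched as below. This is a toy version: the `month`, `x`, and `y` columns and the `LinearRegression` level-0 model are stand-ins, and exact window boundaries may differ from the project's:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

WINDOW, N_FOLDS = 16, 8

def oof_predictions(df, model, last_month=33):
    """Rolling-window CV predictions for the last N_FOLDS months."""
    preds = []
    for target in range(last_month - N_FOLDS + 1, last_month + 1):
        train = df[df["month"].between(target - WINDOW, target - 1)]
        test = df[df["month"] == target]
        model.fit(train[["x"]], train["y"])
        preds.append(pd.Series(model.predict(test[["x"]]), index=test.index))
    return pd.concat(preds)

# toy frame: months 0..33, five rows per month, one feature
rng = np.random.default_rng(0)
df = pd.DataFrame({"month": np.repeat(np.arange(34), 5)})
df["x"] = rng.normal(size=len(df))
df["y"] = 2 * df["x"] + 0.1 * rng.normal(size=len(df))

oof = oof_predictions(df, LinearRegression())
print(len(oof))  # one prediction per row in the last 8 months
```

These out-of-fold predictions then become the training features for the second-layer estimator.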
ID | Layer 0 IDs | Meta Estimator | Validation Score | Public LB Score |
---|---|---|---|---|
0 | 004, 006, 007 | SGD | 0.77110 | 0.91199 |
1 | 004, 006, 007 | XGB (small) | 0.78051 | 0.91361 |