This file is the summary of my project and should be seen as the main source of documentation. Experiments and EDA were done in Jupyter notebooks inside `./notebooks`. Each notebook is prefixed with a number: the order in which they should be read. To view them all, run `jupyter lab` under `<project-root>/notebooks`. The title also contains the analysis step I was in when I wrote the notebook (EDA, feature engineering, hyperparameter tuning).
After exploring in the notebooks, I ported the logic into scripts and plain Python code inside `./src` so it's easier to manage the results. I also used GNU Make to define and run the steps needed to generate the final results, reports, etc. All I had to do was run `make all` after changes and it would automatically detect what needed to be re-run.
This submission uses pipenv, so to install the dependencies, just run `pipenv install` inside the project directory. After that, run `pipenv shell` and you should be all set.
We also use GNU Make to process the data and generate the final solution.
Also make sure the kaggle CLI is configured, as it is used to fetch the datasets.
This was the biggest challenge, since I was sure I needed to exploit the way the test set was generated.
The final solution, which apparently fits the test set pretty well, was, in summary, to do the following for each month:
- Get the shops that appeared in the previous month
- Get the items that appeared in the current month
- Generate all possible pairs with the selected shops and items
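The steps above can be sketched with pandas; the column names (`shop_id`, `item_id`, `date_block_num`) follow the competition's sales file, and the tiny DataFrame here is just a stand-in:

```python
import pandas as pd

def month_pairs(sales: pd.DataFrame, month: int) -> pd.DataFrame:
    """Cartesian product of last month's shops with this month's items."""
    shops = sales.loc[sales["date_block_num"] == month - 1, "shop_id"].unique()
    items = sales.loc[sales["date_block_num"] == month, "item_id"].unique()
    pairs = pd.MultiIndex.from_product(
        [shops, items], names=["shop_id", "item_id"]
    ).to_frame(index=False)
    pairs["date_block_num"] = month
    return pairs

# toy sales log: (shop, item, month)
sales = pd.DataFrame({
    "shop_id":        [1, 2, 1, 3],
    "item_id":        [10, 10, 20, 30],
    "date_block_num": [0, 0, 1, 1],
})
print(month_pairs(sales, 1))  # shops {1, 2} x items {20, 30} -> 4 rows
```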
The exploration can be seen in [notebook 22](<./notebooks/22-eda train test round 2.ipynb>).
Each feature set is generated by a script, and the baseline validation score is generated by an XGB model with mostly default parameters, using a time series split over 3 months (31, 32, and 33). Most of the features were explored in one or more notebooks. To see the exploration, check the notebooks with `feature engineering` in the file name. I also included the notebook IDs in the table below.
Note: the best score is in bold.
ID | Notebook ID | Description | Baseline validation score |
---|---|---|---|
000 | 01 | baseline: shop ids, date block num | 1.08053 |
001 | | month number, year number | 1.09704 |
002 | 04 | lagged item counts | 0.88733 |
003 | 02 | item categories metadata | 1.03844 |
004 | [09](<./notebooks/09-feature-engineering-shop mean encoding.ipynb>) | target encoding using `item_id`, `shop_id`, `category_name`, and `subcategory_name` | 0.83296 |
005 | 10 | external indicators: RUB to USD/CNY/EUR conversions, MOEX (Moscow Exchange) index (lagged and same-month values) | 1.09703 |
006 | | 000 + 001 + 002 + 003 + 004 | 0.82067 |
007 | 04, [11](<./notebooks/11-feature selection-lagged item cnts.ipynb>) | 002 after feature selection | 0.88639 |
008 | | 000 + 001 + 003 + 004 + 007 | 0.81741 |
009 | [13](<./notebooks/13-feature engineering-item prices.ipynb>) | median prices for item, item+shop, category, shop+category | 1.11636 |
010 | [14](<./notebooks/14-feature engineering-ranks.ipynb>) | lagged ranks for item price and item cnt over item, category, shop+category | 0.91049 |
013 | | 008 + 009 + 010 | 0.81455 |
014 | | 008 + 009 | 0.81981 |
015 | | 008 + 010 | 0.81114 |
016 | [16](<./notebooks/16-feature engineering-time deltas.ipynb>) | deltas for item sales | 0.89988 |
017 | | 015 + 016 | 0.81726 |
018 | [13](<./notebooks/13-feature engineering-item prices.ipynb>), [16](<./notebooks/16-feature engineering-time deltas.ipynb>) | deltas for item prices | 1.21004 |
019 | | 017 + 018 | 0.82302 |
020 | 15 | revenue and sales / price | 0.88878 |
021 | | 019 + 020 | 0.81892 |
022 | [18](<./notebooks/18-feature engineering-release dates.ipynb>) | release dates for item, item+shop, shop | 1.06677 |
023 | | 021 + 022 | 0.81233 |
024 | [19](<./notebooks/19-feature engineering-last seen.ipynb>) | months since last seen for item, item+shop | 1.04692 |
025 | | 023 + 024 | **0.80703** |
By comparing the baseline validation scores I was able to mix and match feature sets until I arrived at the dataset used for the final solution: 025.
This section contains an explanation of each learning algorithm and the HPO strategy used to configure them.
### XGBoost

Source code for building and tuning here.
For HPO I used Optuna with the default TPE sampler and a Hyperband pruner.
After choosing most of the hyperparameters with Optuna, I found the optimal number of estimators (boost rounds) by training with early stopping. This probably doesn't yield the best result, but I decided to do it anyway to speed up the HPO.
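The round-selection step can be sketched as follows. This is not the project's actual code: xgboost does this during training via `early_stopping_rounds`, and here scikit-learn's `GradientBoostingRegressor` with `staged_predict` stands in to show the same idea of fixing the tuned hyperparameters and then picking the best round on a validation set:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# synthetic regression data standing in for the feature set
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=400)
X_tr, X_val, y_tr, y_val = X[:300], X[300:], y[:300], y[300:]

# fix the tuned hyperparameters and fit with a generous cap on rounds...
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)

# ...then keep the round with the best validation error
val_errors = [mean_squared_error(y_val, pred) for pred in model.staged_predict(X_val)]
best_rounds = int(np.argmin(val_errors)) + 1
print("optimal number of boosting rounds:", best_rounds)
```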
### SGD

Source code for building and tuning here.
This was done using scikit-learn's stochastic gradient descent implementation, aka `SGDRegressor`.
Since this is also a gradient descent method, I decided to implement my own training loop so I could take advantage of Optuna's Hyperband pruner as well. After finding the best hyperparameters, I ran a second pass on the best configuration with an increased number of iterations to find the optimal number of iterations for this configuration.
Just as with XGBoost, tuning the number of iterations separately from the other hyperparameters is suboptimal, but it was a trade-off I was willing to make to speed up the process.
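A minimal sketch of such a training loop, on synthetic stand-in data. The Optuna calls are left as comments so the snippet stays self-contained; in the real loop each epoch's validation score would be reported to the pruner:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=500)
X_tr, X_val, y_tr, y_val = X[:400], X[400:], y[:400], y[400:]

model = SGDRegressor(learning_rate="invscaling", eta0=0.01, random_state=0)
best_rmse, best_epoch = float("inf"), 0
for epoch in range(200):
    model.partial_fit(X_tr, y_tr)  # one pass over the training data
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    # With Optuna, this is where you'd call trial.report(rmse, epoch)
    # and raise optuna.TrialPruned() if trial.should_prune().
    if rmse < best_rmse:
        best_rmse, best_epoch = rmse, epoch
print("best epoch:", best_epoch + 1)
```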
For preprocessing, a standard linear-model preprocessor was put together: one-hot encoding for the categorical variables, then scaling the dataset's variance to 1 (the mean was not centered, to preserve the sparseness of the training matrix).
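With scikit-learn this kind of preprocessor looks roughly like the sketch below (the column names are illustrative, not the project's actual feature names); `with_mean=False` is what keeps the matrix sparse:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "shop_id": [1, 2, 1, 3],                  # categorical
    "item_cnt_lag_1": [0.0, 2.0, 1.0, 5.0],   # numeric
})

preprocessor = make_pipeline(
    ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["shop_id"])],
        remainder="passthrough",
        sparse_threshold=1.0,  # keep the output as a sparse matrix
    ),
    # with_mean=False: centering would densify the sparse matrix
    StandardScaler(with_mean=False),
)
X = preprocessor.fit_transform(df)
print(X.shape)  # 3 one-hot columns + 1 scaled numeric column
```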
### LightGBM

Source code for building and tuning here.
Used with Optuna for HPO. Optuna's LGBM integration was great, since it implements the stepwise tuning algorithm paired with random search. It was still slower than the HPO for XGBoost and SGD, but the results were good.
### MLP

Source code for building and tuning here.
I really wanted a quick win here, so I just used rules of thumb to get a decent score. The preprocessor is the same as the one used for SGD.
I didn't do much HPO: I just used early stopping to find a good number of epochs and tuned the initial learning rate of the Adam optimizer to get a nice learning curve.
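The write-up doesn't name the neural-network framework, so as an illustration only, the same recipe (Adam, a tuned initial learning rate, early stopping on epochs) can be expressed with scikit-learn's `MLPRegressor`; the layer sizes and data here are made up:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=600)

mlp = MLPRegressor(
    hidden_layer_sizes=(64, 32),   # made-up architecture
    solver="adam",
    learning_rate_init=1e-3,       # the knob that was actually tuned
    early_stopping=True,           # hold out 10% and stop when val score plateaus
    n_iter_no_change=10,
    max_iter=500,
    random_state=0,
)
mlp.fit(X, y)
print("epochs actually run:", mlp.n_iter_)
```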
We define an experiment as a feature set plus an algorithm. Experiments are evaluated by the validation score and the public LB score (also our generalization score). Since the number of submissions is limited, I didn't send every result to the competition; I used my validation score to choose which ones to submit.
ID | Algorithm | Feature set | Validation Score | Public LB score |
---|---|---|---|---|
000 | XGB | 000 | 1.04264 | |
004 | XGB | 025 | 0.79191 | 0.91197 |
006 | LGB | 025 | 0.79357 | 0.91941 |
007 | MLP | 025 | 0.84572 | |
For stacking I created cross-validation predictions for the last 8 months in the train set using a rolling window of 16 months. For instance, train on months 14 to 30 and generate predictions for month 31, then train on months 15 to 31 and generate predictions for month 32, etc.
After that I used these predictions to train the estimator on the second layer. For the validation score I trained on the first 7 months and predicted on the 8th.
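The rolling-window scheme can be sketched as below. This is a toy version: the `month`, `x`, and `y` columns and the `LinearRegression` level-0 model are stand-ins, and exact window boundaries may differ from the project's:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

WINDOW, N_FOLDS = 16, 8

def oof_predictions(df, model, last_month=33):
    """Rolling-window CV predictions for the last N_FOLDS months."""
    preds = []
    for target in range(last_month - N_FOLDS + 1, last_month + 1):
        train = df[df["month"].between(target - WINDOW, target - 1)]
        test = df[df["month"] == target]
        model.fit(train[["x"]], train["y"])
        preds.append(pd.Series(model.predict(test[["x"]]), index=test.index))
    return pd.concat(preds)

# toy frame: months 0..33, five rows per month, one feature
rng = np.random.default_rng(0)
df = pd.DataFrame({"month": np.repeat(np.arange(34), 5)})
df["x"] = rng.normal(size=len(df))
df["y"] = 2 * df["x"] + 0.1 * rng.normal(size=len(df))

oof = oof_predictions(df, LinearRegression())
print(len(oof))  # one prediction per row in the last 8 months
```

These out-of-fold predictions then become the training features for the second-layer estimator.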
ID | Layer 0 IDs | Meta Estimator | Validation Score | Public LB Score |
---|---|---|---|---|
0 | 004, 006, 007 | SGD | 0.77110 | 0.91199 |
1 | 004, 006, 007 | XGB (small) | 0.78051 | 0.91361 |