In [24]:
%load_ext autoreload
%autoreload 2

import os
import sys

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import scipy
import altair as alt
from altair import datum
from sklearn.model_selection import cross_validate
from tqdm.auto import tqdm, trange

from src.model import tscv
from src.model.metrics import corrected_rmse, corrected_rmse_score

%run constants.py

%matplotlib inline
print("Versions:")
print("  Python: %s" % sys.version)
for module in [pd, np, sns, sklearn, alt]:
    print("  %s: %s" %(module.__name__, module.__version__))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Versions:
  Python: 3.8.2 (default, Jul 16 2020, 14:00:26) 
[GCC 9.3.0]
  pandas: 1.1.0
  numpy: 1.19.1
  seaborn: 0.10.1
  sklearn: 0.23.2
  altair: 4.1.0


# Feature Engineering

I think we have enough information to start feature engineering now.

The first step, in my opinion, is to define a validation pipeline, which is described in the next section.

## Validation pipeline

We already have our metric, which I implemented in `src.model.metrics.corrected_rmse`.

Now we need to decide how our training set should be split to validate a model. Since the problem is about forecasting, I chose a time-series split. This means I'll train on the rows where `date_block_num < k` and predict for the rows where `date_block_num = k`, for `k in [31, 32, 33]`.
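The split described above can be sketched as follows. This is only an illustration of the idea on a toy array: `expanding_window_split` is a hypothetical helper written for this example, not the actual `tscv.split` from `src.model`, whose implementation may differ.

```python
import numpy as np


def expanding_window_split(date_block_num, test_months):
    """Build (train_idx, test_idx) pairs, one per test month.

    Hypothetical helper for illustration; not the project's `tscv.split`.
    """
    blocks = np.asarray(date_block_num)
    splits = []
    for k in test_months:
        train_idx = np.flatnonzero(blocks < k)   # every row before month k
        test_idx = np.flatnonzero(blocks == k)   # only the rows of month k
        splits.append((train_idx, test_idx))
    return splits


# Toy month column, already sorted by month like the real train set appears to be.
toy_blocks = np.array([29, 30, 30, 31, 31, 32, 33, 33])
toy_splits = expanding_window_split(toy_blocks, test_months=[31, 32, 33])
for train_idx, test_idx in toy_splits:
    print(train_idx, test_idx)
```

Each fold trains on all earlier months, so the training window grows with `k` and no fold ever sees data from the month it predicts or later.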

I've prepared a train set, which is basically `sales_train.csv` aggregated by month, without the first 20 months. Let's load it to get started.

In [25]:
train_set = pd.read_parquet(os.path.join(PROCESSED_DATA_DIR, 'train-set.parquet'))
train_set.describe()

| | date_block_num | item_id | shop_id | item_cnt_month |
|---|---|---|---|---|
| count | 1609124.0 | 1609124.0 | 1609124.0 | 1609124.0 |
| mean | 14.66479 | 10680.99 | 32.80585 | 2.022806 |
| std | 9.542322 | 6238.883 | 16.53701 | 2.577964 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 6.0 | 5045.0 | 21.0 | 1.0 |
| 50% | 14.0 | 10497.0 | 31.0 | 1.0 |
| 75% | 23.0 | 16060.0 | 47.0 | 2.0 |
| max | 33.0 | 22169.0 | 59.0 | 20.0 |


To use as much of the dataset as possible, I'll use all of it for cross-validation (CV) and treat the public leaderboard (LB) score as the generalization score.

In [26]:
tscv.split(train_set['date_block_num'].values)

[(array([      0,       1,       2, ..., 1514426, 1514427, 1514428]),
  array([1514429, 1514430, 1514431, ..., 1547912, 1547913, 1547914])),
 (array([      0,       1,       2, ..., 1547912, 1547913, 1547914]),
  array([1547915, 1547916, 1547917, ..., 1577590, 1577591, 1577592])),
 (array([      0,       1,       2, ..., 1577590, 1577591, 1577592]),
  array([1577593, 1577594, 1577595, ..., 1609121, 1609122, 1609123]))]

With that, we can use scikit-learn to evaluate a regressor. Let's prepare our matrices and try a random forest, just as an exercise.

In [27]:
cv_splits = tscv.split(train_set['date_block_num'].values)
X_train, y_train = train_set.drop(columns='item_cnt_month').values, train_set['item_cnt_month'].values

We also need to remember to clip the outputs. For that I'll use a wrapper I wrote. Every estimator should be wrapped with it so that its predictions are clipped automatically.
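For readers who don't have the project source at hand, a wrapper of this kind could look roughly like the sketch below. It is not the actual `ClippedOutputRegressor`. The class name `ClippingRegressorSketch` and the `lower`/`upper` parameters are made up for illustration, and the `[0, 20]` range is an assumption based on the minimum and maximum of `item_cnt_month` in the train set. Only the `regressor` parameter name is taken from the estimator's printed representation.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import LinearRegression


class ClippingRegressorSketch(RegressorMixin, BaseEstimator):
    """Hypothetical stand-in for a clipped-output wrapper (assumed range [0, 20])."""

    def __init__(self, regressor, lower=0.0, upper=20.0):
        self.regressor = regressor
        self.lower = lower
        self.upper = upper

    def fit(self, X, y):
        # Fit a clone so the estimator passed to the constructor stays unfitted.
        self.regressor_ = clone(self.regressor).fit(X, y)
        return self

    def predict(self, X):
        # Clip the raw predictions into [lower, upper].
        return np.clip(self.regressor_.predict(X), self.lower, self.upper)


# Toy data whose targets run from -10 to 35, i.e. outside the clipping range.
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 5.0 * X_demo.ravel() - 10.0
clipped_pred = ClippingRegressorSketch(LinearRegression()).fit(X_demo, y_demo).predict(X_demo)
print(clipped_pred.min(), clipped_pred.max())
```

Because the clipping happens inside `predict`, any scorer used by `cross_validate` sees the clipped values without extra code at the call site.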

In [28]:
from sklearn.ensemble import RandomForestRegressor
from src.model import ClippedOutputRegressor

reg = ClippedOutputRegressor(RandomForestRegressor(n_estimators=30, n_jobs=-1, verbose=1))

In [29]:
scores = cross_validate(reg, X=X_train, y=y_train,
                        scoring=corrected_rmse_score, verbose=1, n_jobs=-1, 
                        cv=cv_splits, return_train_score=True)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  2.0min finished


{'fit_time': array([109.11260271, 110.04270244, 111.97539425]),
 'score_time': array([0.15496159, 0.41551447, 0.51796556]),
 'test_score': array([-0.83050309, -1.06552692, -1.12445434]),
 'train_score': array([-0.30845238, -0.30658002, -0.3071537 ])}

In [30]:
scores['test_score'].mean()

-1.0068281133517571

We can check whether our validation split is good by comparing the CV score with our generalization score. Since we're using the public LB, let's fit the model on the whole train set, create a submission, and send it.

The test set we generate predictions for is a subset of the full test set. The predictions for that subset are then passed to a function that builds the final submission dataset.
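One plausible way to expand subset predictions into a full submission is sketched below. This is a guess at the general shape, not the real `submission_from_subset`: the function name `submission_from_subset_sketch`, the join on `shop_id` and `item_id`, and the `fill_value=0.0` default for pairs missing from the subset are all assumptions made for the example.

```python
import pandas as pd


def submission_from_subset_sketch(subset_pred, full_test, fill_value=0.0):
    """Hypothetical helper; the project's `submission_from_subset` may differ."""
    merged = full_test.merge(
        subset_pred[["shop_id", "item_id", "item_cnt_month"]],
        on=["shop_id", "item_id"],
        how="left",  # keep every row of the full test set
    )
    # Assumption: pairs without a prediction receive a constant value.
    merged["item_cnt_month"] = merged["item_cnt_month"].fillna(fill_value)
    return merged[["ID", "item_cnt_month"]]


# Toy frames with the same key columns as the competition's test.csv.
full_test_demo = pd.DataFrame(
    {"ID": [0, 1, 2], "shop_id": [5, 5, 6], "item_id": [10, 11, 10]}
)
subset_demo = pd.DataFrame(
    {"shop_id": [5, 6], "item_id": [10, 10], "item_cnt_month": [1.5, 0.25]}
)
submission_demo = submission_from_subset_sketch(subset_demo, full_test_demo)
print(submission_demo)
```

The left join guarantees one output row per `ID`, which is what the submission file needs regardless of how many pairs the subset covers.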

In [31]:
import zipfile
with zipfile.ZipFile(os.path.join(RAW_DATA_DIR, 'competitive-data-science-predict-future-sales.zip'), 'r') as datasets_file:
    test_set = pd.read_csv(datasets_file.open('test.csv'))

test_subset = pd.read_parquet(os.path.join(PROCESSED_DATA_DIR, 'test-subset.parquet'))

In [32]:
reg.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   42.3s finished


ClippedOutputRegressor(regressor=RandomForestRegressor(n_estimators=30,
                                                       n_jobs=-1, verbose=1))

In [33]:
X_test = test_subset.values
y_pred = reg.predict(X_test)
test_subset['item_cnt_month'] = y_pred

[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  30 out of  30 | elapsed:    0.0s finished


In [34]:
from src.submission import submission_from_subset

submission = submission_from_subset(test_subset, test_set)
submission.to_csv(os.path.join(TMP_DIR, 'rf-exercise-submission.csv'), index=False)

In [35]:
%%bash
kaggle c submit -f ${TMP_DIR}/rf-exercise-submission.csv -m 'testing CV score using a RF' competitive-data-science-predict-future-sales

Successfully submitted to Predict Future Sales

100%|██████████| 2.46M/2.46M [01:32<00:00, 28.0kB/s]


The score on the public LB is 1.19355, which is a bit farther from our mean CV score (about 1.007) than the holdout score was. The fold equivalent to the holdout set is the last entry of `test_score` in the scores dictionary (about 1.124), and it is closer to the LB score. Since we're not trying to build a model that is robust to temporal factors and are only predicting a single month, we should probably focus on the months closest to the test month, or on the same month in previous years.

To check this, let's run the CV using only the months that fall on the same calendar month as the test set.

In [36]:
test_months = [i for i in range(1, 34) if i % 12 == 34 % 12]
test_months

[10, 22]

In [37]:
cv_splits = tscv.split(train_set['date_block_num'].values, n=None, 
                       test_months=test_months)

In [38]:
scores = cross_validate(reg, X=X_train, y=y_train,
                        scoring=corrected_rmse_score, verbose=1, n_jobs=-1, 
                        cv=cv_splits, return_train_score=True)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   46.7s finished


{'fit_time': array([27.06822729, 42.65053535]),
 'score_time': array([0.41080236, 0.15328503]),
 'test_score': array([-1.23908863, -1.20861024]),
 'train_score': array([-0.31294194, -0.31232313])}

In [40]:
np.mean(scores['test_score'])

-1.2238494360890864

That's a lot closer. The only issue is that month 10 is too close to the beginning of the training set, so I'll probably want to use only month 22; otherwise our training windows will be too short. Either way, I won't change the validation now, but it's good to keep this in mind.
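To put numbers on the comparison, the snippet below computes the absolute gap between the public LB score and each CV estimate reported in this notebook. The scores are copied from the outputs above with their sign flipped, since `cross_validate` reports them as negative values. The variable names are just for this example.

```python
# Scores copied from the notebook outputs, as positive errors.
lb_score = 1.19355
cv_estimates = {
    "last_three_mean": 1.0068281133517571,   # mean over months 31, 32, 33
    "last_fold": 1.12445434,                 # month 33 only
    "same_month_mean": 1.2238494360890864,   # mean over months 10 and 22
}

# Absolute distance between each CV estimate and the public LB score.
gaps = {name: abs(lb_score - value) for name, value in cv_estimates.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.4f}")
```

The gaps come out to roughly 0.03 for the same-month folds, 0.07 for the last fold alone, and 0.19 for the mean over the last three months.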