In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import scipy
import altair as alt
from altair import datum
from sklearn.model_selection import cross_validate
from tqdm.auto import tqdm, trange

from src.model import tscv

%run constants.py

%matplotlib inline
print("Versions:")
print("  Python: %s" % sys.version)
for module in [pd, np, sns, sklearn, alt]:
    print("  %s: %s" %(module.__name__, module.__version__))

Versions:
  Python: 3.8.2 (default, Jul 16 2020, 14:00:26) 
[GCC 9.3.0]
  pandas: 1.1.0
  numpy: 1.19.1
  seaborn: 0.10.1
  sklearn: 0.23.2
  altair: 4.1.0


# Feature Engineering

I think we have enough information to start with feature engineering now.

The first step, in my opinion, is to define a validation pipeline, which is described in the next section.

## Validation pipeline

We already have our metric defined as the RMSE.
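
As a reminder, RMSE is the square root of the mean squared difference between predictions and targets. A minimal NumPy sketch follows; the function name `rmse` is only illustrative and is not something defined in `src`:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the residuals, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy example: residuals of 3 and 4 give sqrt((9 + 16) / 2).
example = rmse([0.0, 0.0], [3.0, 4.0])
```

Because the errors are squared before averaging, a few large misses weigh much more than many small ones.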

Now, we need to decide how our training set should be split to validate a model. Since the problem is about forecasting, I chose a time-series split. This means I'll train on the rows where `date_block_num < k` and predict for the rows where `date_block_num == k`, for `k in [31, 32, 33]`.
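
The `tscv` helper comes from `src.model` and its code isn't shown in this notebook. As a rough sketch of the scheme just described, assuming an expanding window and using the hypothetical name `expanding_window_split`, the index pairs could be built like this:

```python
import numpy as np


def expanding_window_split(date_block_num, test_blocks=(31, 32, 33)):
    """Return a list of (train_idx, test_idx) pairs, one per held-out month.

    Hypothetical helper, not the actual `src.model.tscv` implementation.
    """
    date_block_num = np.asarray(date_block_num)
    splits = []
    for k in test_blocks:
        # Train on every month strictly before k, validate on month k.
        train_idx = np.flatnonzero(date_block_num < k)
        test_idx = np.flatnonzero(date_block_num == k)
        splits.append((train_idx, test_idx))
    return splits


# Toy data: months 10..33 with two rows each, sorted by month.
blocks = np.repeat(np.arange(10, 34), 2)
splits = expanding_window_split(blocks)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

A list of index pairs like this can be passed directly as the `cv` argument of scikit-learn's `cross_validate`.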

I've prepared a train set, which is basically `sales_train.csv` grouped by month and without the first 10 months (`date_block_num` starts at 10). Let's load it to get started.

In [2]:
train_set = pd.read_parquet(os.path.join(PROCESSED_DATA_DIR, 'train-set.parquet'))
train_set.describe()

| | item_id | shop_id | date_block_num | item_cnt_month |
|---|---|---|---|---|
| count | 5140800.0 | 5140800.0 | 5140800.0 | 5140800.0 |
| mean | 11019.4 | 31.64286 | 21.5 | 0.2199702 |
| std | 6252.631 | 17.56189 | 6.922187 | 1.113889 |
| min | 30.0 | 2.0 | 10.0 | 0.0 |
| 25% | 5381.5 | 16.0 | 15.75 | 0.0 |
| 50% | 11203.0 | 34.5 | 21.5 | 0.0 |
| 75% | 16071.5 | 47.0 | 27.25 | 0.0 |
| max | 22167.0 | 59.0 | 33.0 | 20.0 |


To use more of the dataset, I'll use everything for CV and treat the public LB score as the generalization score.

In [3]:
tscv.split(train_set['date_block_num'].values)

[(array([      0,       1,       2, ..., 4498197, 4498198, 4498199]),
  array([4498200, 4498201, 4498202, ..., 4712397, 4712398, 4712399])),
 (array([      0,       1,       2, ..., 4712397, 4712398, 4712399]),
  array([4712400, 4712401, 4712402, ..., 4926597, 4926598, 4926599])),
 (array([      0,       1,       2, ..., 4926597, 4926598, 4926599]),
  array([4926600, 4926601, 4926602, ..., 5140797, 5140798, 5140799]))]

With that, we can use scikit-learn to evaluate a regressor. Let's prepare our matrices and try a random forest just as an exercise.

In [4]:
cv_splits = tscv.split(train_set['date_block_num'].values)
X_train = train_set.drop(columns='item_cnt_month').values
y_train = train_set['item_cnt_month'].values

We also need to remember to clip the predictions to the target range (0 to 20 in this train set). For that I'll use a wrapper I wrote. Every estimator should be wrapped with it so that its output is clipped automatically.

In [5]:
from sklearn.ensemble import RandomForestRegressor
from src.model import ClippedOutputRegressor

reg = ClippedOutputRegressor(RandomForestRegressor(n_estimators=30, n_jobs=-1, verbose=1))
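
`ClippedOutputRegressor` lives in `src.model` and its code isn't shown here. A minimal sketch of what such a wrapper could look like follows. The class name `ClippedRegressorSketch`, the `min_value`/`max_value` parameters, and the 0 to 20 bounds are assumptions (the bounds are taken from the target range in the `describe()` output), not the actual implementation:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.linear_model import LinearRegression


class ClippedRegressorSketch(BaseEstimator, RegressorMixin):
    """Hypothetical wrapper that clips the predictions of any regressor."""

    def __init__(self, regressor, min_value=0.0, max_value=20.0):
        # Only store the parameters, so that get_params/clone keep working.
        self.regressor = regressor
        self.min_value = min_value
        self.max_value = max_value

    def fit(self, X, y):
        # Fit a fresh copy so the estimator passed in is left untouched.
        self.regressor_ = clone(self.regressor)
        self.regressor_.fit(X, y)
        return self

    def predict(self, X):
        raw = self.regressor_.predict(X)
        return np.clip(raw, self.min_value, self.max_value)


# Toy check: a linear fit of y = 10x extrapolates outside [0, 20].
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0.0, 10.0, 20.0, 30.0])
model = ClippedRegressorSketch(LinearRegression()).fit(X_demo, y_demo)
pred = model.predict(np.array([[-1.0], [1.0], [5.0]]))
```

Subclassing `BaseEstimator` is what lets `cross_validate` clone the wrapper for each fold, and clipping inside `predict` means the scorer always sees clipped values.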

In [6]:
scores = cross_validate(reg, X=X_train, y=y_train,
                        scoring='neg_root_mean_squared_error', verbose=1, n_jobs=-1, 
                        cv=cv_splits, return_train_score=True)
scores

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  2.9min finished


{'fit_time': array([157.37074184, 159.43186998, 167.41473937]),
 'score_time': array([0.61956573, 0.73969293, 0.31278968]),
 'test_score': array([-0.81331843, -0.97432712, -0.99426806]),
 'train_score': array([-0.26957893, -0.2706521 , -0.27173339])}

In [7]:
scores['test_score'].mean()

-0.9273045367247977
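
scikit-learn negates error metrics so that higher is always better, which is why these scores are negative. Flipping the sign gives plain RMSE values, and putting the train and validation folds side by side makes the gap between them visible. A small sketch using the numbers printed above (the variable names are only illustrative):

```python
import numpy as np

# Values copied from the cross_validate output above.
test_score = np.array([-0.81331843, -0.97432712, -0.99426806])
train_score = np.array([-0.26957893, -0.2706521, -0.27173339])

# 'neg_root_mean_squared_error' returns -RMSE, so negate to get RMSE.
test_rmse = -test_score
train_rmse = -train_score

# Per-fold gap between validation and train error.
gap = test_rmse - train_rmse

print("mean validation RMSE:", test_rmse.mean())
print("mean train RMSE:", train_rmse.mean())
print("per-fold gap:", gap)
```

The train RMSE sits around 0.27 while the validation RMSE ranges from about 0.81 to 0.99, which suggests this untuned forest overfits heavily, as expected for an exercise model with only three raw ID-like features.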

We can check whether our validation split is reliable by comparing the CV score with our generalization score. Since we're using the public LB, let's fit the model on the whole train set, create a submission, and send it.

In [8]:
test_set = pd.read_parquet(os.path.join(PROCESSED_DATA_DIR, 'test-set.parquet'))

In [9]:
reg.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.0min finished


ClippedOutputRegressor(regressor=RandomForestRegressor(n_estimators=30,
                                                       n_jobs=-1, verbose=1))

In [10]:
X_test = test_set[['item_id', 'shop_id', 'date_block_num']].values
y_pred = reg.predict(X_test)
test_set['item_cnt_month'] = y_pred

[Parallel(n_jobs=12)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=12)]: Done  30 out of  30 | elapsed:    0.2s finished


In [11]:
test_set[['ID', 'item_cnt_month']].to_csv(os.path.join(TMP_DIR, 'rf-exercise-submission.csv'), index=False)

In [12]:
%%bash
kaggle c submit -f ${TMP_DIR}/rf-exercise-submission.csv -m 'testing CV score using a RF' competitive-data-science-predict-future-sales

403 - Your team has used its submission allowance (5 of 5). This resets at midnight UTC (12 hours from now).


100%|██████████| 3.11M/3.11M [00:09<00:00, 339kB/s] 


CalledProcessError: Command 'b"kaggle c submit -f ${TMP_DIR}/rf-exercise-submission.csv -m 'testing CV score using a RF' competitive-data-science-predict-future-sales\n"' returned non-zero exit status 1.

The score on the public LB is ~1.09, which is noticeably worse than our mean CV RMSE of ~0.93.

In [13]:
!echo $TMP_DIR

/tmp/tmpeid5tq20
