In [28]:
%load_ext autoreload
%autoreload 2

import os
import sys

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import scipy
import altair as alt
from altair import datum
from tqdm.auto import tqdm, trange

from src.model import tscv

%run constants.py

%matplotlib inline
print("Versions:")
print("  Python: %s" % sys.version)
for module in [pd, np, sns, sklearn, alt]:
    print("  %s: %s" %(module.__name__, module.__version__))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Versions:
  Python: 3.8.2 (default, Jul 16 2020, 14:00:26) 
[GCC 9.3.0]
  pandas: 1.1.0
  numpy: 1.19.1
  seaborn: 0.10.1
  sklearn: 0.23.2
  altair: 4.1.0


# Feature Engineering

I think we have enough information to start feature engineering now.

The first step, in my opinion, is to define a validation pipeline, which is described in the next section.

## Validation pipeline

First of all, we need our metric, which should be the root mean squared error (RMSE): `sklearn.metrics.mean_squared_error(squared=False)`.
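As a quick sanity check on the metric, here is a minimal NumPy sketch of RMSE. The helper name `rmse` and the toy values are illustrative and not part of the project code. On the scikit-learn version printed above (0.23.2), it should agree with `mean_squared_error(y_true, y_pred, squared=False)`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values for illustration only (not taken from the dataset).
y_true = [1.0, 2.0, 0.0, 3.0]
y_pred = [1.0, 1.0, 1.0, 1.0]
score = rmse(y_true, y_pred)
print(score)
```

Because the errors are squared before averaging, a few large misses dominate the score, which matters for sales counts with occasional spikes.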

Now, we need to decide how our training set should be split to validate a model. Since the problem is about forecasting, I chose to do a time-series split for this. This means I'll train on the rows where `date_block_num < k` and predict for the rows where `date_block_num == k`, for `k in [31, 32, 33]`.
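The split rule above can be sketched in a few lines. This is only an illustration under stated assumptions: `month_split` is a hypothetical helper, not the project's `tscv.split`, and it takes the month labels directly rather than the full array.

```python
import numpy as np


def month_split(date_block_num, test_months=(31, 32, 33)):
    """Return (train_idx, test_idx) pairs: train on months < k, test on month k."""
    months = np.asarray(date_block_num)
    folds = []
    for k in test_months:
        train_idx = np.flatnonzero(months < k)   # everything strictly before month k
        test_idx = np.flatnonzero(months == k)   # only month k
        folds.append((train_idx, test_idx))
    return folds


# Toy month labels for illustration only.
toy_months = np.array([29, 30, 30, 31, 31, 32, 33, 33])
folds = month_split(toy_months)
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

A list of index pairs like this can be passed as the `cv` argument of scikit-learn's cross-validation helpers, since `cv` accepts an iterable of (train, test) index arrays.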

In [12]:
train_set = pd.read_parquet(os.path.join(PROCESSED_DATA_DIR, 'train-set-base.parquet'))
train_set.head()

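|   | date_block_num | item_id | shop_id | item_cnt |
|---|----------------|---------|---------|----------|
| 0 | 0              | 33      | 2       | 1.0      |
| 1 | 0              | 317     | 2       | 1.0      |
| 2 | 0              | 438     | 2       | 1.0      |
| 3 | 0              | 471     | 2       | 2.0      |
| 4 | 0              | 481     | 2       | 1.0      |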


To use more of the dataset, I'll use all of it for cross-validation and treat the public leaderboard (LB) score as the holdout check. This should give a reasonable indication of whether the model generalizes.

In [53]:
tscv.split(train_set.values)

[(array([      0,       1,       2, ..., 1506157, 1506158, 1506159]),
  array([1506160, 1506161, 1506162, ..., 1539643, 1539644, 1539645])),
 (array([      0,       1,       2, ..., 1539643, 1539644, 1539645]),
  array([1539646, 1539647, 1539648, ..., 1569321, 1569322, 1569323])),
 (array([      0,       1,       2, ..., 1569321, 1569322, 1569323]),
  array([1569324, 1569325, 1569326, ..., 1600852, 1600853, 1600854]))]

With that, we can use scikit-learn to evaluate a regressor. Let's try a random forest just as an exercise.

In [58]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_squared_error

reg = RandomForestRegressor(n_estimators=30, n_jobs=-1, verbose=1)
cross_validate(reg, X=train_set.drop(columns='item_cnt').values, y=train_set['item_cnt'].values, 
               scoring='neg_root_mean_squared_error', verbose=1, n_jobs=-1, cv=tscv.split(train_set.values))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   46.6s finished


{'fit_time': array([45.01807499, 46.02572703, 45.87729359]),
 'score_time': array([0.16727304, 0.03178692, 0.03285241]),
 'test_score': array([-1.74696769, -2.22734749, -2.34067508])}
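A note on reading these numbers: `neg_root_mean_squared_error` reports the negated RMSE so that higher is better, so the sign has to be flipped to get the error back. The sketch below does that for the scores printed above. It assumes the folds are ordered as `k = 31, 32, 33`, which matches the growing train indices shown earlier but is not confirmed by the output itself.

```python
import numpy as np

# Scores copied from the cross_validate output above.
test_score = np.array([-1.74696769, -2.22734749, -2.34067508])

# Flip the sign to recover RMSE per fold.
rmse_per_fold = -test_score
mean_rmse = rmse_per_fold.mean()

# Assumed fold order: validation months 31, 32, 33.
for k, value in zip((31, 32, 33), rmse_per_fold):
    print("month %d: RMSE %.4f" % (k, value))
print("mean RMSE: %.4f" % mean_rmse)
```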

Oof, it takes a while (about 45 seconds to fit each fold). Let's reduce the number of months used for training.

In [60]:
cross_validate(reg, X=train_set.drop(columns='item_cnt').values, y=train_set['item_cnt'].values, 
               scoring='neg_root_mean_squared_error', verbose=1, n_jobs=-1, cv=tscv.split(train_set.values, window=15))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   18.9s finished


{'fit_time': array([18.46477222, 18.5503037 , 18.15812111]),
 'score_time': array([0.02659321, 0.02440453, 0.15769124]),
 'test_score': array([-1.74741024, -2.2119326 , -2.35210333])}
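One way a `window` argument like the one above could work is to keep only the most recent months before each validation month. The sketch below shows that idea. `windowed_month_split` is a hypothetical helper, and whether the project's `tscv.split` interprets `window` in exactly this way is an assumption.

```python
import numpy as np


def windowed_month_split(date_block_num, test_months=(31, 32, 33), window=None):
    """Train on the `window` months before k (all earlier months if None), test on k."""
    months = np.asarray(date_block_num)
    folds = []
    for k in test_months:
        train_mask = months < k
        if window is not None:
            # Keep only months k - window, ..., k - 1.
            train_mask &= months >= k - window
        folds.append((np.flatnonzero(train_mask), np.flatnonzero(months == k)))
    return folds


# Toy data for illustration only: one row per month, 0 through 33.
toy_months = np.arange(34)
folds = windowed_month_split(toy_months, window=15)
for train_idx, test_idx in folds:
    print(train_idx.min(), train_idx.max(), test_idx)
```

In the two runs above, the scores barely change while the fit time drops from roughly 45 to roughly 18 seconds per fold, which makes the shorter window a cheaper setup for iterating on features.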