In [2]:
import os
import sys

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import scipy
import altair as alt
from altair import datum
from tqdm.auto import tqdm, trange

from src.model import tscv

%run constants.py

%matplotlib inline
print("Versions:")
print("  Python: %s" % sys.version)
for module in [pd, np, sns, sklearn, alt]:
    print("  %s: %s" %(module.__name__, module.__version__))

Versions:
  Python: 3.8.2 (default, Jul 16 2020, 14:00:26) 
[GCC 9.3.0]
  pandas: 1.1.0
  numpy: 1.19.1
  seaborn: 0.10.1
  sklearn: 0.23.2
  altair: 4.1.0


# Feature Engineering

I think we have enough information to start with feature engineering now.

The first step, in my opinion, is to define a validation pipeline, which is described in the next section.

## Validation pipeline

First of all, we need our metric, which should be the root mean squared error (RMSE), i.e. `sklearn.metrics.mean_squared_error(squared=False)`.
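To make the metric concrete, here is a minimal sketch of the quantity that `mean_squared_error(..., squared=False)` returns: the square root of the mean squared difference. The helper name `rmse` and the sample values are made up for illustration and are not part of the project code.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values, for illustration only.
y_true = [3, 0, 2, 7]
y_pred = [2.5, 0, 2, 8]
score = rmse(y_true, y_pred)
print("RMSE: %.4f" % score)
```

Because the errors are squared before averaging, a few large misses weigh more on the score than many small ones.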

Now, we need to decide how the training set should be split to validate a model. Since the problem is about forecasting, I chose a time-series split. This means I'll train on the rows where `date_block_num < k` and predict on the rows where `date_block_num == k`, for `k in [31, 32, 33]`.
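The splitting rule above can be sketched as follows. This is only an illustration of the idea, not the actual `src.model.tscv` implementation (which is not shown here); the function name `expanding_window_split`, its `val_blocks` parameter, and the toy array are hypothetical.

```python
import numpy as np


def expanding_window_split(date_block_num, val_blocks=(31, 32, 33)):
    """Return one (train_idx, val_idx) pair of positional indices per block k.

    Train on every row with date_block_num < k, validate on rows == k.
    """
    blocks = np.asarray(date_block_num)
    folds = []
    for k in val_blocks:
        train_idx = np.flatnonzero(blocks < k)   # all earlier months
        val_idx = np.flatnonzero(blocks == k)    # the month being predicted
        folds.append((train_idx, val_idx))
    return folds


# Toy month labels, for illustration only.
toy_blocks = np.array([29, 30, 30, 31, 31, 32, 33, 33])
folds = expanding_window_split(toy_blocks)
for train_idx, val_idx in folds:
    print(len(train_idx), "train rows ->", len(val_idx), "validation rows")
```

Each fold's training window grows to include the previous validation month, so no fold ever trains on data from the month it predicts or later.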

In [3]:
train_set = pd.read_parquet(os.path.join(PROCESSED_DATA_DIR, 'train-set-base.parquet'))
tscv.split(train_set)

[[Int64Index([      0,       1,       2,       3,       4,       5,       6,
                    7,       8,       9,
              ...
              1506150, 1506151, 1506152, 1506153, 1506154, 1506155, 1506156,
              1506157, 1506158, 1506159],
             dtype='int64', length=1506160),
  Int64Index([1506160, 1506161, 1506162, 1506163, 1506164, 1506165, 1506166,
              1506167, 1506168, 1506169,
              ...
              1539636, 1539637, 1539638, 1539639, 1539640, 1539641, 1539642,
              1539643, 1539644, 1539645],
             dtype='int64', length=33486)],
 [Int64Index([      0,       1,       2,       3,       4,       5,       6,
                    7,       8,       9,
              ...
              1539636, 1539637, 1539638, 1539639, 1539640, 1539641, 1539642,
              1539643, 1539644, 1539645],
             dtype='int64', length=1539646),
  Int64Index([1539646, 1539647, 1539648, 1539649, 1539650, 1539651, 1539652,
              1539653, 1

To use more of the dataset, I'll use the entire training set for CV and treat the public leaderboard (LB) score as the holdout set. This should be enough to check that the model generalizes.