## Purpose
This notebook reads the raw data and merges and organizes it for the dataloader.
The sales output is expected to be a dataframe with one row per timestep and one column per time series id to be predicted.
Thus, the sales dataframe should be 1941x30490 for item-by-item organization.
If aggregations are added, the shape should be 1941x42840.

To distinguish the time series, we need to save another dataframe as a header that maps each time series id to its `state_id, store_id, cat_id, dept_id, item_id`.


This type of organization suits channel-independent algorithms, which treat each time series as an independent channel.
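
The layout described above can be sketched on a toy table. The ids and values below are made up for illustration, and `ts` and `header` are hypothetical names rather than objects defined later in this notebook.

```python
import pandas as pd

# Toy stand-in for the raw sales file: 2 series, 3 days (values are made up).
raw = pd.DataFrame({
    "id": ["A_CA_1", "B_TX_1"],
    "item_id": ["A", "B"],
    "dept_id": ["D1", "D2"],
    "cat_id": ["C1", "C2"],
    "store_id": ["CA_1", "TX_1"],
    "state_id": ["CA", "TX"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 1],
})
id_cols = ["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"]

# Wide layout: one row per timestep, one column (channel) per series id.
ts = raw.set_index("id").drop(columns=id_cols[1:]).T

# Header: maps each series id back to its hierarchy levels.
header = raw[id_cols].set_index("id")

print(ts.shape)                          # (timesteps, series)
print(header.loc["B_TX_1", "state_id"])  # state of one channel
```

A channel-independent model can then read a single column of `ts` and look up its static attributes in `header` by the same id.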

In [8]:
import logging
import pathlib
import joblib
import os
from tqdm import tqdm
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


In [6]:
!pip install git+https://github.com/YangyangFu/transformer-time-series@main

/bin/bash: /home/yyf/miniconda3/envs/tts-tf/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Collecting git+https://github.com/YangyangFu/transformer-time-series@main
  Cloning https://github.com/YangyangFu/transformer-time-series (to revision main) to /tmp/pip-req-build-4n72acqg
  Running command git clone --filter=blob:none --quiet https://github.com/YangyangFu/transformer-time-series /tmp/pip-req-build-4n72acqg
  Resolved https://github.com/YangyangFu/transformer-time-series to commit c7b930944f10ccc1f08284b080c4dbf61b356aab
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: tsl
  Building wheel for tsl (setup.py) ... done
  Created wheel for tsl: filename=tsl-1.0-py3-none-any.whl size=53908 sha256=d4679f069b571db62547f85473b643e165ead74473c899aca1abba318c7c6390
  Stored in directory: /tmp/pip-ephem-wheel-cache-ib9tox11/wheels/d1/ce/7e/249bc100ae340cfeae1d7b80e6f993ac1923811606147a4941
Successfully buil

In [9]:

from tsl.utils.utils import reduce_mem_usage

In [None]:
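def dump(df, name, path):
    """Reduce the memory usage of `df` and write it to `<path>/<name>.joblib`."""
    df = reduce_mem_usage(df)
    save_dir = pathlib.Path(path)
    # Create the target directory (and any parents) if it does not exist yet.
    save_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(df, save_dir / f'{name}.joblib', compress=True)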

dump_dir = os.path.join('./data', 'individual')
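
As a rough check of the storage format, the sketch below writes a small dataframe the same way `dump` does and reads it back with `joblib.load`. It skips `reduce_mem_usage` (which comes from `tsl`) and uses a temporary directory instead of `./data/individual`; the toy dataframe is an assumption for illustration.

```python
import pathlib
import tempfile

import joblib
import pandas as pd

# Toy dataframe standing in for one of the saved tables (values are made up).
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    # Same directory handling and file naming as dump(), in a throwaway location.
    save_dir = pathlib.Path(tmp) / "individual"
    save_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(df, save_dir / "ts.joblib", compress=True)

    # A dataloader would read the table back like this.
    restored = joblib.load(save_dir / "ts.joblib")

print(restored.equals(df))
```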

In [None]:
data_dir = "./raw-data/m5-forecasting-uncertainty/"

sales_file = "sales_train_evaluation.csv"
calendar_file = "calendar.csv"
price_file = "sell_prices.csv"


In [None]:
sales = pd.read_csv(os.path.join(data_dir, sales_file))
calendar = pd.read_csv(os.path.join(data_dir, calendar_file), parse_dates=["date"])
price = pd.read_csv(os.path.join(data_dir, price_file))

In [None]:
print(sales.columns)
print(calendar.columns)
print(price.columns)
print(sales.shape, calendar.shape, price.shape)

## Global time features and time series

In [None]:
id_cols = ['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']


In [None]:
s = sales.drop(id_cols, axis=1)
s.index = sales['id']
s = s.transpose()
s = s.reset_index()
s = s.rename(columns={'index': 'd'})
s.head()

In [None]:
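# The outer merge keeps calendar days that have no sales yet (they become NaN).
s = s.merge(calendar[['d', 'date']], how='outer')
s = s.drop(['d'], axis=1)
s.index = s['date']
s = s.drop(['date'], axis=1)
# An outer merge may reorder rows by the string key ('d_10' before 'd_2'),
# so restore chronological order.
s = s.sort_index()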

In [None]:
s.shape, s.columns

In [None]:
s.head()
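
The merge with the calendar can be sketched on a toy example. It assumes, as in this dataset, that the calendar covers more days than the sales table; the series name and values are made up. Because pandas documents that an outer merge sorts the join keys lexicographically, the sketch sorts by date afterwards as a cheap safeguard.

```python
import pandas as pd

# Toy sales in wide form, keyed by the day label 'd' (values are made up).
s = pd.DataFrame({"d": ["d_1", "d_2", "d_10"], "series_a": [5, 0, 7]})

# Toy calendar with one extra day that has no sales yet.
calendar = pd.DataFrame({
    "d": ["d_1", "d_2", "d_10", "d_11"],
    "date": pd.to_datetime(["2011-01-29", "2011-01-30", "2011-02-07", "2011-02-08"]),
})

# The outer merge keeps the extra calendar day; its sales become NaN.
merged = s.merge(calendar, on="d", how="outer")

# Index by date and make sure rows are in chronological order.
merged = merged.set_index("date").drop(columns="d").sort_index()

print(merged)
```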

In [None]:
dump(s, 'ts', dump_dir)

In [None]:
del s

## Local features

### Time-invariant local features
Time-invariant local features refer to the features that are dependent on the time series itself but not on time, such as `item_id, store_id`, etc.

In [None]:
items = sales[id_cols]
items.head()

The identifiers need to be converted to integers using label encoding.
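
A minimal sketch of what label encoding does, on a made-up two-column table. Encoding a copy keeps the original string labels available, and the fitted encoders can map integer codes back to labels; the name `encoded` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy identifier table (values are made up).
items = pd.DataFrame({
    "id": ["B_TX_1", "A_CA_1", "A_TX_1"],
    "state_id": ["TX", "CA", "TX"],
})

encoded = items.copy()  # leave the string labels untouched
label_encoders = {}
for col in encoded.columns:
    encoder = LabelEncoder()
    encoded[col] = encoder.fit_transform(encoded[col])
    label_encoders[col] = encoder

# Classes are sorted before codes are assigned, so 'CA' -> 0 and 'TX' -> 1.
print(encoded["state_id"].tolist())

# The fitted encoder maps integer codes back to the original labels.
print(label_encoders["state_id"].inverse_transform([0, 1]).tolist())
```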

In [None]:
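# Encode a copy so that `items` keeps the original string labels,
# which later cells use as column names and merge keys.
items_encoded = items.copy()
label_encoders = {}
for col in items_encoded.columns:
    encoder = LabelEncoder()
    items_encoded[col] = encoder.fit_transform(items_encoded[col])
    label_encoders[col] = encoder

items_encoded.head()

In [None]:
dump(items_encoded, 'local_invariant', dump_dir)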

### Time-variant local features

Time-variant local features include `SNAP`, the item sell price, etc.

SNAP is a local feature defined per state, so it differs between time series.

The targets are each item in each store in each state; thus, each item should have its own SNAP feature based on the state where it is sold.

Here I create a SNAP table with one column per time series, so that during training we can grab the SNAP values for the corresponding target.
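
The idea can be sketched on a toy example: each series picks up the SNAP column of its own state. The dates, ids, and flag values below are made up, and `snap_by_state` is a hypothetical name.

```python
import pandas as pd

# Toy state-level SNAP flags, indexed by date (values are made up).
snap_by_state = pd.DataFrame(
    {"CA": [1, 0, 0], "TX": [0, 1, 0]},
    index=pd.to_datetime(["2011-02-01", "2011-02-02", "2011-02-03"]),
)

# Toy header: which state each series belongs to.
items = pd.DataFrame({
    "id": ["A_CA_1", "A_TX_1", "B_CA_1"],
    "state_id": ["CA", "TX", "CA"],
})

# Pick, for every series, the column of its state, then label it with the series id.
snap = snap_by_state[items["state_id"].tolist()].copy()
snap.columns = items["id"].tolist()

print(snap)
```

Selecting all columns in one indexing step avoids adding tens of thousands of columns one at a time.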

In [None]:
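snap = calendar[['date', 'snap_CA', 'snap_TX', 'snap_WI']]
snap = snap.rename(columns={'snap_CA': 'CA',
                            'snap_TX': 'TX',
                            'snap_WI': 'WI'})
# For each series, copy the SNAP column of its state and label it with the series id.
# Selecting all columns at once avoids inserting 30490 columns one by one.
per_series = snap[items['state_id'].tolist()]
per_series.columns = items['id'].tolist()
snap = pd.concat([snap, per_series], axis=1)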


In [None]:
snap.index = snap['date']

In [None]:
snap = snap.drop(['CA', 'TX', 'WI', 'date'], axis=1)

In [None]:
snap.shape, snap.columns

In [None]:
snap.head()

In [None]:
dump(snap, 'local_variant_snap', dump_dir)

In [None]:
del snap

The price of each item should serve as a numerical time-variant local feature.

The price is recorded on a weekly basis, while sales are recorded on a daily basis.
Thus, we need to expand the weekly price into a daily price.
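
A minimal sketch of the weekly-to-daily expansion on toy data. For brevity it assumes the price table already carries a series `id` column (the real table is keyed by `store_id` and `item_id`), and all prices are made up.

```python
import pandas as pd

# Toy weekly prices; 'wm_yr_wk' is the week key (values are made up).
price = pd.DataFrame({
    "id": ["A_CA_1", "A_CA_1", "B_CA_1"],
    "wm_yr_wk": [11101, 11102, 11102],
    "sell_price": [2.0, 2.5, 9.0],
})

# Toy calendar: two days in each week.
calendar = pd.DataFrame({
    "wm_yr_wk": [11101, 11101, 11102, 11102],
    "date": pd.to_datetime(["2011-01-29", "2011-01-30", "2011-02-05", "2011-02-06"]),
})

# Joining on the week key repeats each weekly price for every day of that week.
daily = price.merge(calendar, on="wm_yr_wk")

# Pivot to the wide layout: one row per date, one column per series.
daily = daily.pivot(index="date", columns="id", values="sell_price")

print(daily)
```

Series without a price record in a given week end up as `NaN` for those days.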

In [None]:
price.head(), price.shape

In [None]:
# Some items only go on sale after a specific date.
# We derive a release date per item and store to capture this effect.
releases = price.groupby(['store_id','item_id'])['wm_yr_wk'].min().reset_index()
releases.columns = ['store_id','item_id','wm_yr_wk']
weekday = calendar.groupby('wm_yr_wk')['date'].min().reset_index()
releases = releases.merge(weekday)
releases.columns = ['store_id','item_id','release_week', 'release_date']
releases.drop('release_week', axis=1, inplace=True)
releases.head()

In [None]:
price.columns
pr = price.merge(releases)
pr.columns, pr.head()

In [None]:
pr = pr.merge(calendar[['wm_yr_wk', 'd', 'date']])
pr = pr.merge(items)
pr.columns, pr.head()

In [None]:
#pr['released'] = (pr['date'] >= pr['release_date']).astype(int)
pr = pr.drop(['wm_yr_wk', 'item_id', 'store_id', 'sell_price', 'dept_id', 'cat_id','state_id'], axis=1)
pr = pr.pivot(index='date', columns='id', values='release_date')

In [None]:
pr.head()

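There are some `NaT` values in the columns. They appear to come from dates on which an item has no price record (for example, before its release), because the pivot only fills the (date, id) pairs that exist in the price table. This still needs to be checked.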

In [None]:
pr.shape, pr.columns

In [None]:
pr = pr[items['id']]

In [None]:
# 1 where the release date is on or before the row's date; NaT entries compare as False, giving 0.
pr = pr.apply(lambda x: x <= pr.index, axis=0)
pr = pr.astype(int)

In [None]:
pr.head(), pr.shape

In [None]:
dump(pr, 'local_variant_release', dump_dir)
del pr

In [None]:
price.head()

In [None]:
price = price.merge(calendar[['wm_yr_wk', 'd', 'date']])

In [None]:
price.head(), price.shape

In [None]:
price = price.merge(items)

In [None]:
price.head(), price.shape

In [None]:
price = price.pivot(index='date', columns='id', values='sell_price')

In [None]:
price.head(), price.shape

In [None]:
price = price[items['id']]
price.head()

In [None]:
price = price.fillna(value=0)
dump(price, 'local_variant_price', dump_dir)

In [None]:
del price


Note that some items are not on sale during parts of the given time period, so their prices are `NaN`.

As noted, there are many zeros/NaNs in the price, which are mainly due to an item being out of stock or not yet released. However, we don't have information to tell whether a zero/NaN price is due to the item being out of stock.

Based on the original price data, some items only have price information after a specific date. We set that date as the release date.
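
The release-date rule can be sketched on a toy daily price table. This is an illustrative formulation, not the exact code used in the cells above: it takes the first date with a price as the release date and marks every date from then on as released. All names and values are made up.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2011-01-29", periods=4, freq="D")

# Toy daily prices; NaN means no price record on that date (values are made up).
daily_price = pd.DataFrame(
    {"A_CA_1": [2.0, 2.0, 2.5, 2.5], "B_CA_1": [np.nan, np.nan, 9.0, 9.0]},
    index=dates,
)

# Release date per series: the first date that carries a price.
release_date = {col: daily_price[col].first_valid_index() for col in daily_price}

# Release indicator: 0 before the first observed price, 1 from then on.
released = daily_price.notna().astype(int).cummax()

print(release_date)
print(released)
```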

The NaNs for the testing period `d1942-d1969` are due to the fact that the testing data was not publicly accessible during the competition.