**<div style="font-size:200%">Create gluonts DataSet</div>**

This notebook demonstrates how to

1. create a gluonts dataset that consists of a train split, a test split, and metadata.
2. add label encodings as custom metadata to the gluonts dataset.
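
As a rough sketch of the second point, label encodings can be kept as plain `{column: {label: index}}` dictionaries, stored as a JSON file alongside the dataset metadata, and inverted later to decode indices back to labels. The helper name `encode_labels` and the toy labels are illustrative assumptions; only the standard library is used.

```python
import json
import tempfile
from pathlib import Path

def encode_labels(labels):
    """Map each distinct label to an integer index, in first-seen order."""
    return {label: i for i, label in enumerate(dict.fromkeys(labels))}

# Toy labels, for illustration only.
encodings = {"cat": encode_labels(["fruit", "dairy", "fruit", "bakery"])}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cat.json"
    # Store the encodings as custom metadata.
    path.write_text(json.dumps(encodings))
    # Reload them, as a downstream consumer would.
    reloaded = json.loads(path.read_text())

# Invert each mapping to decode indices back to labels.
decoders = {col: {i: label for label, i in idx.items()} for col, idx in reloaded.items()}
print(decoders["cat"][1])
```

Keeping the mapping next to the dataset means a model's `feat_static_cat` indices can still be traced back to category names after training.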

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

from IPython.display import Markdown

import pandas as pd
from time import gmtime, strftime
import warnings
from pathlib import Path
import json

from gluonts.dataset.common import ListDataset, save_datasets, TrainDatasets, MetaData, CategoricalFeatureInfo, load_datasets

#from smallmatter.common.pogr import fill_dt_all
from smallmatter.ds import DFBuilder

# Global Config

In [None]:
# Choose weekly or daily.
#freq, fcast_length = "D", 14
freq, fcast_length = "W", 12

dataset_name = f'gluonts-20200528-cat-{freq.lower()}'

# Input
pogr_csv = ['../../data/raw/Consolidated_PO_GR_AT14.csv',
            '../../data/raw/Consolidated_PO_GR_AT16.csv']

# Dates from consolidate_pogr to use.
min_date = '2017-01-01'
max_date = '2020-05-06'

# Output
bucket = 'OUTPUT_BUCKET'
prefix = 'OUTPUT_PREFIX'

%set_env BUCKET=$bucket
%set_env PREFIX=$prefix
%set_env DATASET_NAME=$dataset_name

# Download .csv timeseries

In [None]:
!curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/00409/Daily_Demand_Forecasting_Orders.csv

# Load Timeseries

Our example dataset is from the [UCI Daily Demand Forecasting Orders Data Set](https://archive.ics.uci.edu/ml/datasets/Daily+Demand+Forecasting+Orders).

In [None]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00409/Daily_Demand_Forecasting_Orders.csv

## Filter timeseries

In [None]:
ts = (po_gr[['goods_receipt_date', 'std_quantity', 'frequency', 'category']]
          .reset_index(drop=True)
          .rename({'goods_receipt_date': 'x', 'std_quantity': 'y', 'frequency': 'freq', 'category': 'cat'}, axis=1)
     )
ts = ts[(ts['x'] >= min_date) & (ts['x'] <= max_date)]

# Track statistics
stats = DFBuilder(columns=['stat', 'value'])
stats += ['cat', ts['cat'].unique().shape[0]]
display(stats.df)

# IMPORTANT: consolidate to daily level
ts = ts.groupby(by=['x', 'cat'], as_index=False)['y'].sum()

## Fill dates

For each category, fill in the missing dates from the category's earliest GR (goods receipt) date through `max_date`.
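
`fill_dt_all` comes from a module whose source is not shown in this notebook. As a rough idea of what such a step might do, the sketch below zero-fills missing days per category and optionally resamples to a coarser frequency. The helper name `fill_dates_sketch`, the zero-fill choice, and the toy data are all assumptions, not the actual implementation.

```python
import pandas as pd

def fill_dates_sketch(df, max_date, freq="D"):
    """Hypothetical stand-in for fill_dt_all (assumes zero-fill of missing days).

    Expects columns 'x' (datetime), 'y' (numeric), 'cat', with at most one
    row per (x, cat), i.e. already consolidated to daily level.
    """
    filled = []
    for cat, g in df.groupby("cat"):
        # Daily index from this category's earliest date through max_date.
        idx = pd.date_range(g["x"].min(), max_date, freq="D")
        s = g.set_index("x")["y"].reindex(idx, fill_value=0)
        if freq != "D":
            # Aggregate daily values to the requested frequency.
            s = s.resample(freq).sum()
        out = s.rename("y").rename_axis("x").reset_index()
        out["cat"] = cat
        filled.append(out)
    return pd.concat(filled, ignore_index=True)

# Toy data, for illustration only.
toy = pd.DataFrame({
    "x": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-02"]),
    "y": [1.0, 2.0, 5.0],
    "cat": ["a", "a", "b"],
})
daily = fill_dates_sketch(toy, "2020-01-04")
print(daily)
```

Each category starts at its own earliest date, so a category that first appears late is not padded with leading zeros.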

In [None]:
%%time
from nvtsbe_planning.common.pogr import fill_dt_all
ts_filled = fill_dt_all(ts, ts_id=['cat'], dates=("min", max_date, "D"), freq=freq)

# Make sure each timeseries starts at its original earliest date, instead of 2017-01-01.
ts_filled.groupby(by='cat', as_index=False)['x'].min()

# Generate gluonts TRAIN dataset

Implementation notes:

- `df2gluonts()` is written in the spirit of bias-for-action. However, in the spirit of invent-and-simplify and insist-on-highest-standard, we should port the `csv2deepar` submodule from 1P to gluonts.

- We keep all gluonts timeseries in memory. This is not memory-optimal, but it is simple to implement.

- We use `gluonts.dataset.common.save_datasets()`, but note that this function writes only to the local filesystem, so a follow-up step is needed to upload to S3.
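
If memory becomes a concern, one alternative to the in-memory approach is to stream one series at a time into a JSON Lines file (one JSON object per line), which appears to be the layout of the `data.json` files inspected later in this notebook. A minimal sketch using only the standard library follows; `iter_entries` and `write_jsonl` are hypothetical helpers, not part of gluonts, and the sketch does not write any metadata.

```python
import json
import tempfile
from pathlib import Path

def iter_entries(series_by_id, fcast_len=0):
    """Yield one dataset entry at a time instead of building a full list."""
    for item_id, (start, values) in series_by_id.items():
        # Hold out the final fcast_len points when building a train split.
        target = values[:-fcast_len] if fcast_len > 0 else values
        yield {"start": start, "target": list(target), "item_id": item_id}

def write_jsonl(entries, path):
    """Write entries as JSON Lines and return the number of entries written."""
    count = 0
    with open(path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
            count += 1
    return count

# Toy data, for illustration only: item_id -> (start date, values).
toy = {"a": ("2020-01-05", [1, 2, 3, 4]), "b": ("2020-01-12", [5, 6, 7])}

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "data.json"
    n_written = write_jsonl(iter_entries(toy, fcast_len=2), out)
    first = json.loads(out.read_text().splitlines()[0])

print(n_written, first)
```

Because `iter_entries` is a generator, only one entry is held in memory at any time.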

## TRAIN: in-memory gluonts data

In [None]:
def encode_cat(cats):
    # FIXME: pulled to nvtsbe_planning.common
    # Map each category value to an integer index, in order of appearance.
    return {c: i for i, c in enumerate(cats)}

def df2gluonts(df, cat_idx, fcast_len=12, freq='W', ts_id=('cat', 'cc'), static_cat=('cat', 'cc'), item_id_fn=None):
    # FIXME:
    # - This hack is bias-for-action.
    # - For invent-and-simplify + insist-on-highest-standard: fix module csv2deepar, which checks for missing ts, etc.
    ts_id, static_cat = list(ts_id), list(static_cat)
    data_iter = []

    # Build the payload: one entry per timeseries.
    # Assumes the rows within each group are already sorted by 'x'.
    for item_id, dfg in df.groupby(ts_id, as_index=False):
        # Depending on the pandas version, a single-column group key is a scalar or a 1-tuple.
        if not isinstance(item_id, tuple):
            item_id = (item_id,)

        if fcast_len > 0:  # Typically for training: hold out the final fcast_len points.
            target = dfg['y'][:-fcast_len]
        else:
            target = dfg['y']

        feat_static_cat = []
        for col in static_cat:
            # A static feature must be constant within a timeseries.
            assert dfg[col].nunique() == 1
            cat_value = dfg[col].iloc[0]
            feat_static_cat.append(cat_idx[col][cat_value])  # Encoded cat values to appear in feat_static_cat.

        if item_id_fn is None:
            item_id = '|'.join(str(i) for i in item_id)  # NOTE: sm-gluonts entrypoint will plot '|' as '\n'.
        else:
            item_id = item_id_fn(*item_id)

        data_iter.append({
            'start': dfg.iloc[0]['x'],
            'target': target,
            'feat_static_cat': feat_static_cat,
            'item_id': item_id
        })

    data = ListDataset(data_iter, freq=freq)
    return data

In [None]:
cat_inverted_idx = {'cat': encode_cat(ts_filled['cat'].unique())}

# Drop the final fcast_length points from the train data.
train_data = df2gluonts(ts_filled, cat_inverted_idx, fcast_len=fcast_length, freq=freq, ts_id=['cat'], static_cat=['cat'])

# Test data includes the final fcast_length points, which serve as ground truth.
test_data = df2gluonts(ts_filled, cat_inverted_idx, fcast_len=0, freq=freq, ts_id=['cat'], static_cat=['cat'])

In [None]:
# Track statistics
(stats
     + ['train_ts_count', len(train_data.list_data)]
     + ['test_ts_count', len(test_data.list_data)]
)

stats.df

## TRAIN: write to local fs, then s3

In [None]:
gluonts_datasets = TrainDatasets(
    metadata=MetaData(
                freq=freq,
                target={'name': 'gr'},
                feat_static_cat=[
                    CategoricalFeatureInfo(name=k, cardinality=len(v)+1)   # Add 'unknown'.
                    for k,v in cat_inverted_idx.items()
                ],
                prediction_length = fcast_length
    ),
    train=train_data,
    test=test_data
)

# Setting `overwrite=True` means rm -fr path_str, mkdir path_str, then write individual files.
local_path=f'../../data/processed/{dataset_name}'
save_datasets(
    dataset=gluonts_datasets,
    path_str=local_path,
    overwrite=True
)

# Save also our indexes
with open(Path(local_path) / 'metadata' / 'cat.json', 'w') as f:
    json.dump(cat_inverted_idx, f)
    
%set_env LOCAL_PATH=$local_path
!cat $LOCAL_PATH/metadata/metadata.json | head -1 | jq
!cat $LOCAL_PATH/train/data.json | head -1 | jq
!cat $LOCAL_PATH/test/data.json | head -1 | jq
!cat $LOCAL_PATH/metadata/cat.json | jq

Verify that we can re-read the output files.
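
A stricter check than displaying the objects is to compare split lengths. Because the test split keeps the final `fcast_length` points that the train split drops, each test series should be exactly `fcast_length` longer than its train counterpart. The sketch below works on plain dict entries with `item_id` and `target` keys. The helper name `check_split` is hypothetical, and passing it the entries yielded by the train and test datasets is an assumption about how one would use it here.

```python
def check_split(train_entries, test_entries, fcast_len):
    """Return item_ids whose test series is not fcast_len longer than train."""
    train_len = {e["item_id"]: len(e["target"]) for e in train_entries}
    bad = []
    for e in test_entries:
        expected = train_len.get(e["item_id"])
        # Flag series missing from train, or with an unexpected length difference.
        if expected is None or len(e["target"]) - expected != fcast_len:
            bad.append(e["item_id"])
    return bad

# Toy entries, for illustration only.
train = [{"item_id": "a", "target": [1, 2, 3]}, {"item_id": "b", "target": [4]}]
test = [{"item_id": "a", "target": [1, 2, 3, 9, 9]}, {"item_id": "b", "target": [4, 9]}]
mismatched = check_split(train, test, fcast_len=2)
print(mismatched)
```

An empty result means every test series extends its train series by the forecast horizon.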

In [None]:
import os
reloaded_dataset = load_datasets(
                        metadata=os.path.join(local_path, "metadata"),
                        train=os.path.join(local_path, "train"),
                        test=os.path.join(local_path, "test")
                   )
display(
    reloaded_dataset.metadata,
    reloaded_dataset.train,
    reloaded_dataset.test
)

Everything looks OK. Now, let's upload to S3.

In [None]:
!aws s3 cp --recursive $LOCAL_PATH s3://$BUCKET/$PREFIX/$DATASET_NAME/

# Summary

Recap on stats.

In [None]:
stats.df