Notes on the TFT training process over this dataset:

Per (DeepAR), we use 500k samples taken between 2014-01-01 and 2014-09-01, using the first 90% for training and the last 10% as a validation set. Testing is done over the 7 days immediately following the training set, as described in (DeepAR, TRMF). Given the large differences in magnitude between trajectories, we also apply z-score normalization separately to each entity for real-valued inputs.
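A minimal sketch of per-entity z-score normalization, assuming a long-format frame with the `id` and `power_usage` columns used later in this notebook. The toy values, the helper name `zscore_per_entity`, and the choice of population standard deviation (`ddof=0`) are illustrative assumptions, not taken from the TFT or DeepAR code.

```python
import numpy as np
import pandas as pd

# Toy frame with two entities of very different magnitude (hypothetical values).
toy = pd.DataFrame({
    "id": ["MT_001"] * 4 + ["MT_002"] * 4,
    "power_usage": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0, 300.0, 400.0],
})


def zscore_per_entity(frame, value_col="power_usage", id_col="id"):
    """Return a copy of `frame` with `value_col` standardized within each entity."""
    out = frame.copy()
    grouped = out.groupby(id_col)[value_col]
    # Per-entity statistics, broadcast back to the original row order.
    mean = grouped.transform("mean")
    std = grouped.transform(lambda s: s.std(ddof=0))
    # Guard against constant series, whose standard deviation is zero.
    std = std.replace(0.0, 1.0)
    out[value_col + "_z"] = (out[value_col] - mean) / std
    return out


normalized = zscore_per_entity(toy)
print(normalized)
```

After normalization both toy entities carry the same standardized values, which is the point of normalizing each trajectory separately. In a real pipeline the statistics would probably be computed on the training portion only; that choice is not stated in the notes above.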

In line with previous work, we consider the electricity usage, day-of-week, hour-of-day and a time index – i.e. the number of time steps from the first observation – as real-valued inputs, and treat the entity identifier as a categorical variable.
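The inputs listed above could be derived from a timestamp column roughly as follows. This is a sketch: the origin `2011-01-01` is an assumption inferred from the `hours_from_start` and `days_from_start` values shown in the table further down, and the integer codes for the entity identifier are just one way to prepare a categorical input.

```python
import pandas as pd

# Assumed origin of the time index (inferred from the table below, not confirmed).
origin = pd.Timestamp("2011-01-01")

# Three hypothetical hourly timestamps at the start of the 250-day interval.
frame = pd.DataFrame({
    "date": pd.date_range("2014-01-01", periods=3, freq=pd.Timedelta(hours=1)),
    "id": ["MT_001", "MT_002", "MT_001"],
})

# Real-valued inputs: time index, hour-of-day, day-of-week (Monday = 0).
frame["hours_from_start"] = (frame["date"] - origin) / pd.Timedelta(hours=1)
frame["hour_of_day"] = frame["date"].dt.hour
frame["day_of_week"] = frame["date"].dt.dayofweek

# Categorical input: map each entity identifier to an integer code.
frame["id_code"] = frame["id"].astype("category").cat.codes

print(frame)
```

With this origin, the first row gets `hours_from_start = 26304.0` and `day_of_week = 2`, matching the first row of the `MT_001` table below.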

The interval used by DeepAR and TFT is 250 days, that is, 6000 hourly readings per series and therefore about 2.22 million hourly readings in total. Only 500K samples are randomly selected for the training process.
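The arithmetic behind these figures can be checked directly. The series count of 370 is an assumption (it is the value implied by 2.22 million / 6000), and the sampling step is only a hypothetical illustration: it treats every reading as a candidate sample and ignores window boundaries, which a real pipeline would need to handle.

```python
import numpy as np

days = 250
hours_per_day = 24
n_series = 370  # assumption: implied by 2.22 million / 6000

readings_per_series = days * hours_per_day           # 6000
total_readings = readings_per_series * n_series      # 2,220,000
n_samples = 500_000
sampled_fraction = n_samples / total_readings        # roughly 0.225

# Hypothetical sampling: draw distinct flat indices, then map each one
# back to a (series, hour) pair.
rng = np.random.default_rng(seed=0)
sample_idx = rng.choice(total_readings, size=n_samples, replace=False)
series_idx, hour_idx = np.divmod(sample_idx, readings_per_series)

print(readings_per_series, total_readings, round(sampled_fraction, 3))
```

Under these assumptions, the 500K samples cover a bit less than a quarter of the available readings.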

However, from DeepAR documentation: "Note that in addition to the context_length the model also takes into account the values of the time series at typical seasonal windows e.g. for hourly data the model will look at the value of the series 24h ago, one week ago one month ago etc. So it is not necessary to make the context_length span an entire month if you expect monthly seasonalities in your hourly data."
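To make the idea of seasonal lags concrete, the sketch below builds 24-hour and one-week lag columns with `pandas.Series.shift`. It illustrates the concept only; it is not how DeepAR or TFT construct lags internally, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering 8 days; each value equals its own
# hour index, which makes the lagged values easy to verify.
series = pd.Series(np.arange(24 * 8, dtype=float), name="power_usage")

# Lags expressed in hourly steps: one day and one week.
lags = {"lag_24h": 24, "lag_1w": 24 * 7}

features = pd.DataFrame({"power_usage": series})
for name, steps in lags.items():
    # shift(steps) aligns each row with the value observed `steps` hours earlier.
    features[name] = series.shift(steps)

# Rows before the longest lag contain NaN and would be dropped or masked.
usable = features.dropna()
print(usable.head())
```

With lag columns like these, a short context window still gives the model access to values from a day or a week earlier, which is the point made in the quoted note.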

From DeepAR: 

ToDo:

how do DeepAR and TFT produce lags? Do they sub-sample lags?

apply z-score normalization separately to power usage

build BSCTRFM positional encodings

build absolute positional encoding with hours_from_start and/or days_from_start

build an embedding layer from customer_id    
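As a starting point for the absolute positional encoding item, here is a sketch of the standard sinusoidal encoding used in Transformer models, computed with NumPy from a vector of positions such as `hours_from_start`. Whether BSCTRFM should use this exact scheme is not settled in these notes; the function name, `d_model`, and `max_period` are illustrative assumptions.

```python
import numpy as np


def sinusoidal_encoding(positions, d_model=8, max_period=10000.0):
    """Return an array of shape (len(positions), d_model); d_model must be even."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    pos = np.asarray(positions, dtype=np.float64)[:, None]        # (T, 1)
    i = np.arange(d_model // 2, dtype=np.float64)[None, :]        # (1, d_model / 2)
    # One frequency per sine/cosine pair, decreasing geometrically.
    angles = pos / np.power(max_period, 2.0 * i / d_model)
    enc = np.empty((pos.shape[0], d_model), dtype=np.float64)
    enc[:, 0::2] = np.sin(angles)   # even channels
    enc[:, 1::2] = np.cos(angles)   # odd channels
    return enc


# Hypothetical usage with positions counted from the start of a window.
encoding = sinusoidal_encoding(np.arange(3), d_model=8)
print(encoding.shape)
```

Absolute indices such as `hours_from_start` (around 26304 and up in this dataset) can be passed in directly, although offsetting them to start at zero per window may be easier to reason about.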

### Make time series from electricity dataset for forecasting models SLDB

In [1]:
import os
import numpy as np
import pandas as pd
import time
import json
import joblib

In [2]:
import tensorflow as tf

In [3]:
tf.__version__

'2.4.1'

In [4]:
from bokeh.plotting import figure, show, output_file, save
from bokeh.io import output_notebook
from bokeh.palettes import d3
output_notebook()

### get the time series from the electricity dataset

In [5]:
! ls -l /home/developer/gcp/cbidmltsf/datasets/electricity

total 1375176
-rw-rw-r-- 1 developer developer 208129432 ago  9 10:38 hourly_electricity.csv
-rw-rw-r-- 1 developer developer 227696887 ago 10 12:25 hourly_electricity.pkl
-rw-rw-r-- 1 developer developer 710998915 ago  9 09:56 LD2011_2014.txt
-rw-rw-r-- 1 developer developer 261335609 ago  9 09:56 LD2011_2014.txt.zip


In [6]:
# start with the 250-day dataset used by TFT

In [7]:
df = pd.read_pickle('/home/developer/gcp/cbidmltsf/datasets/electricity/hourly_electricity.pkl')

In [8]:
columns = ['power_usage', 'id',
           'date', 'hours_from_start', 'days_from_start',
           'hour', 'day', 'day_of_week', 'month']

In [26]:
# customer ids to analyze and process
start_id, end_id = 1, 1

In [32]:
# preprocess time series and persist SLDB
# start with only one customer

In [33]:
customer_ids = ['MT_{:03d}'.format(i) for i in range(start_id, end_id + 1)]
customer_ids

['MT_001']

In [34]:
# a dictionary to manage time series by customer id
data = dict()

In [35]:
# pass individual customer data to dictionary values, make a copy from original dataframe
for customer_id in customer_ids:
    data[customer_id] = df[df['id'] == customer_id][columns].copy()
    
    # rename columns
    data[customer_id] = data[customer_id].rename(columns={"hour": "hour_of_day",
                                                          "day": "day_of_month",
                                                          "month": "month_of_year"})

In [36]:
plots = dict()

In [37]:
label = 'customers'

plots[label] = figure(
    x_axis_type='datetime',
    # y_range=(0., ceil_kw),
    plot_width=960,
    plot_height=400,
    title='Electricity consumption for {}.'.format(label))

plots[label].grid.grid_line_alpha=0.3

plots[label].xaxis.axis_label = 'Date'
plots[label].yaxis.axis_label = 'Active Power [kW]'

for index, customer_id in enumerate(customer_ids):
    plots[label].line(data[customer_id].date,
                      data[customer_id].power_usage,
                      color=d3['Category10'][10][index],
                      legend_label=customer_id)

# uncomment the following two lines to save plot
# output_file('/home/developer/gcp/cbidmltsf/datasets/cfe/{}_H_kw.html'.format(label))
# save(plots[label])

# uncomment the following line to display plot
show(plots[label])

### pre-process the time series in LCD20112014 dataset for BSCTRFM architecture

In [43]:
data['MT_001']

Unnamed: 0,power_usage,id,date,hours_from_start,days_from_start,hour_of_day,day_of_month,day_of_week,month_of_year
17544,2.538071,MT_001,2014-01-01 00:00:00,26304.0,1096,0,1,2,1
17545,2.855330,MT_001,2014-01-01 01:00:00,26305.0,1096,1,1,2,1
17546,2.855330,MT_001,2014-01-01 02:00:00,26306.0,1096,2,1,2,1
17547,2.855330,MT_001,2014-01-01 03:00:00,26307.0,1096,3,1,2,1
17548,2.538071,MT_001,2014-01-01 04:00:00,26308.0,1096,4,1,2,1
...,...,...,...,...,...,...,...,...,...
23539,16.497462,MT_001,2014-09-07 19:00:00,32299.0,1345,19,7,6,9
23540,3.172589,MT_001,2014-09-07 20:00:00,32300.0,1345,20,7,6,9
23541,8.565990,MT_001,2014-09-07 21:00:00,32301.0,1345,21,7,6,9
23542,16.497462,MT_001,2014-09-07 22:00:00,32302.0,1345,22,7,6,9
