# Generation of training, validation and test set for the GTS model

In this notebook, we create the training, validation and test set for the GTS model. 

We have 57 time series with 24 monthly observations each. The train-validation-test split assigns the first 15 observations (09-2019 to 11-2020) to the training set, the following 3 observations (12-2020 to 02-2021) to the validation set and the last 6 observations (03-2021 to 08-2021) to the test set.
In our application, we only forecast the next month from the current month, i.e., the horizon equals 1.

For each of the three sets, both an x set and a y set are created. The x set contains the time series data used for forecasting, while the y set contains the time series data to be forecast. For example, the train x set should contain the data from 09-2019 to 10-2020, which is used to forecast the data from 10-2019 to 11-2020 (the end of the training set), so that we always forecast the next month from the current month. Similarly, the validation x set should contain the data for all 57 regions from 11-2020 to 01-2021 to forecast the data from 12-2020 to 02-2021, and the test x set should contain the data from 02-2021 to 07-2021 to forecast the data from 03-2021 to 08-2021.
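
The pairing described above can be sketched with month labels only. This is a minimal illustration, not part of the original pipeline: the variable names are hypothetical and no sales values are involved. It builds the 23 (input month, target month) pairs for horizon 1 and cuts them into 14, 3 and 6 pairs.

```python
# 24 month labels from 09-2019 to 08-2021 (labels only, no sales values)
months = [f"{(8 + i) % 12 + 1:02d}-{2019 + (8 + i) // 12}" for i in range(24)]

# Horizon 1: every month except the last is an input, its successor is the target
pairs = list(zip(months[:-1], months[1:]))

# 14 / 3 / 6 pairs, so that the targets end at 11-2020, 02-2021 and 08-2021
train, val, test = pairs[:14], pairs[14:17], pairs[17:]

for name, part in [("train", train), ("val", val), ("test", test)]:
    print(f"{name}: x {part[0][0]} to {part[-1][0]}, y {part[0][1]} to {part[-1][1]}")
```

The printed ranges should match the ones stated in the text, e.g. train x from 09-2019 to 10-2020 and train y from 10-2019 to 11-2020.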

From the original GTS repository on GitHub, we reuse code from the following file, with some parts changed:
* [`generate_training_data.py`](https://github.com/chaoshangcs/GTS/blob/main/scripts/generate_training_data.py): the functions `generate_graph_seq2seq_io_data` and `generate_train_val_test`.

In [1]:
import argparse
import numpy as np
import os
import pandas as pd

## Define helper functions

In [2]:
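def generate_graph_seq2seq_io_data(df, x_offsets, y_offsets, add_time_in_day=False, add_day_in_week=False, scaler=None):
    """Build input/target windows from a (time steps x nodes) DataFrame.

    Returns x with shape (num_windows, len(x_offsets), num_nodes, num_features)
    and y with shape (num_windows, len(y_offsets), num_nodes, num_features).
    """
    num_samples, num_nodes = df.shape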
    data = np.expand_dims(df.values, axis=-1)
    data_list = [data]
    if add_time_in_day:
        time_ind = (df.index.values - df.index.values.astype("datetime64[D]")) / np.timedelta64(1, "D")
        time_in_day = np.tile(time_ind, [1, num_nodes, 1]).transpose((2, 1, 0))
        data_list.append(time_in_day)
    if add_day_in_week:
        day_in_week = np.zeros(shape=(num_samples, num_nodes, 7))
        day_in_week[np.arange(num_samples), :, df.index.dayofweek] = 1
        data_list.append(day_in_week)

    data = np.concatenate(data_list, axis=-1)
    x, y = [], []
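    # t is the index of the "current" time step; t + offset must stay inside the data
    min_t = abs(min(x_offsets))
    max_t = abs(num_samples - abs(max(y_offsets)))  # exclusive upper bound for t
    for t in range(min_t, max_t):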
        x_t = data[t + x_offsets, ...]
        y_t = data[t + y_offsets, ...]
        x.append(x_t)
        y.append(y_t)
    x = np.stack(x, axis=0)
    y = np.stack(y, axis=0)
    return x, y
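
To make the role of the offsets concrete, the following sketch applies the same indexing idea to a toy array. It is a simplified stand-in for the helper above, not a copy of it: the toy values and shapes are assumptions chosen for illustration, and the optional time features are left out.

```python
import numpy as np

# Toy data: 5 time steps, 2 nodes, 1 feature (hypothetical values)
data = np.arange(10, dtype=float).reshape(5, 2, 1)
x_offsets = np.array([0])  # input: current step
y_offsets = np.array([1])  # target: next step

min_t = abs(min(x_offsets))
max_t = data.shape[0] - abs(max(y_offsets))  # last step has no target

# One window per valid t; fancy indexing keeps the offset axis
x = np.stack([data[t + x_offsets] for t in range(min_t, max_t)], axis=0)
y = np.stack([data[t + y_offsets] for t in range(min_t, max_t)], axis=0)

print(x.shape, y.shape)  # (4, 1, 2, 1) (4, 1, 2, 1)
```

With 5 time steps and horizon 1 we get 4 windows, just as 24 months yield 23 windows in this notebook. Each target window equals the following input window, i.e. `y[:-1]` equals `x[1:]`.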

In [3]:
def generate_train_val_test(sales_df_filename, output_dir):
    df = pd.read_hdf(sales_df_filename)
   
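    # Horizon 1: the input is the current month (offset 0) ...
    x_offsets = np.sort(
        np.concatenate((np.arange(0, 1, 1),))
    )

    # ... and the target is the next month (offset 1).
    y_offsets = np.sort(np.arange(1, 2, 1))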
    x, y = generate_graph_seq2seq_io_data(
        df,
        x_offsets=x_offsets,
        y_offsets=y_offsets,
        add_time_in_day=False,
        add_day_in_week=False,
    )

    print("x shape: ", x.shape, ", y shape: ", y.shape)
    num_samples = x.shape[0]
    # Proportions follow the 15/3/6 split of the 24 months; with 23 samples this gives 14/3/6
    num_test = round(num_samples * 6/24)
    num_train = round(num_samples * 15/24)
    num_val = num_samples - num_test - num_train

    # train
    x_train, y_train = x[:num_train], y[:num_train]
    # val
    x_val, y_val = (
        x[num_train: num_train + num_val],
        y[num_train: num_train + num_val],
    )
    # test
    x_test, y_test = x[-num_test:], y[-num_test:]

    for cat in ["train", "val", "test"]:
        _x, _y = locals()["x_" + cat], locals()["y_" + cat]
        print(cat, "x: ", _x.shape, "y:", _y.shape)
        np.savez_compressed(
            os.path.join(output_dir, "%s.npz" % cat),
            x=_x,
            y=_y,
            x_offsets=x_offsets.reshape(list(x_offsets.shape) + [1]),
            y_offsets=y_offsets.reshape(list(y_offsets.shape) + [1]),
        )

## Generation of data sets

We now generate the training, validation and test data, including both the respective x and y set.

In [4]:
# Create SALES folder
route0 = "./data/SALES"

# makedirs also creates ./data if it is missing and does nothing if the folder already exists
os.makedirs(route0, exist_ok=True)

In [5]:
print("Generating training, validation and test data")
generate_train_val_test(output_dir = 'data/SALES', sales_df_filename = 'data/sales.h5')

Generating training, validation and test data
x shape:  (23, 1, 57, 1) , y shape:  (23, 1, 57, 1)
train x:  (14, 1, 57, 1) y: (14, 1, 57, 1)
val x:  (3, 1, 57, 1) y: (3, 1, 57, 1)
test x:  (6, 1, 57, 1) y: (6, 1, 57, 1)


We see that both x and y only contain data for 23 months: x, the time series data used for forecasting, covers 09-2019 to 07-2021, while y, the time series data to be forecast, covers 10-2019 to 08-2021. For example, the data from 09-2019 (the beginning of the time series) is used to forecast the data from 10-2019.

The x and y data are then subdivided into training, validation and test sets.

The training sets contain the beginning of the time series and therefore only contain data for 14 months: train x, the data used for forecasting, contains data from 09-2019 to 10-2020 to forecast the data in train y from 10-2019 to 11-2020.
The validation sets contain data for 3 months: val x contains data from 11-2020 to 01-2021 to forecast the data in val y from 12-2020 to 02-2021. The test sets contain data for 6 months: test x contains data from 02-2021 to 07-2021 to forecast the data in test y from 03-2021 to 08-2021.
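
The set sizes follow from the rounding in `generate_train_val_test`. The short sketch below repeats that arithmetic for 23 samples and checks that the three slices cover every sample exactly once. The index lists are only illustrative stand-ins for the real arrays.

```python
num_samples = 23  # 24 months minus one, because the horizon equals 1

num_test = round(num_samples * 6 / 24)    # round(5.75) = 6
num_train = round(num_samples * 15 / 24)  # round(14.375) = 14
num_val = num_samples - num_test - num_train

# Same slicing pattern as in the notebook, applied to plain indices
idx = list(range(num_samples))
train_idx = idx[:num_train]
val_idx = idx[num_train:num_train + num_val]
test_idx = idx[-num_test:]

print(num_train, num_val, num_test)  # 14 3 6
print(train_idx + val_idx + test_idx == idx)  # True: no gaps, no overlap
```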

## Data check

We now conduct a data check to make sure all sets contain the data that they should contain.

The [.npz file format](https://numpy.org/doc/stable/reference/generated/numpy.savez.html) is a zipped archive of files named after the variables they contain. Each file in the archive contains one variable in .npy format; for a description of the .npy format, see `numpy.lib.format`. Archives written with `np.savez` are not compressed, but since we wrote our files with `np.savez_compressed`, they are.
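
As a small illustration of the format, the following sketch writes a toy archive to a temporary folder and reads it back. The file name and the toy arrays are assumptions made for this example and are unrelated to the sales data.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.npz")  # hypothetical file name
    # Keyword names become the names of the files inside the archive
    np.savez_compressed(path, x=np.zeros((3, 1, 2, 1)), y=np.ones((3, 1, 2, 1)))

    # np.load returns a lazy, dict-like archive object; arrays are read on access
    with np.load(path) as archive:
        names = archive.files
        shapes = {name: archive[name].shape for name in names}

print(names)   # ['x', 'y']
print(shapes)
```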

In [6]:
# read in npz files
category = 'train'
train = np.load(os.path.join('./data/SALES', category + '.npz'))
category = 'val'
val = np.load(os.path.join('./data/SALES', category + '.npz'))
category = 'test'
test = np.load(os.path.join('./data/SALES', category + '.npz'))

In [7]:
sales = pd.read_hdf('./data/sales.h5')

date_dict = {0: '09-2019', 1: '10-2019', 2: '11-2019', 3: '12-2019', 4: '01-2020', 5: '02-2020', 6: '03-2020',
             7: '04-2020', 8: '05-2020', 9: '06-2020', 10: '07-2020', 11: '08-2020', 12: '09-2020', 13: '10-2020',
             14: '11-2020', 15: '12-2020', 16: '01-2021', 17: '02-2021', 18: '03-2021', 19: '04-2021', 20: '05-2021',
             21: '06-2021', 22: '07-2021', 23: '08-2021'}

sales['time'] = sales.index.map(date_dict)

sales

territory,Blekinge,Blekinge ONCO,Dalarna,Dalarna ONCO,Gävleborg-Gävle,Gävleborg-Gävle ONCO,Halland-Halmstad,Halland-Halmstad ONCO,Halland-Varberg-Falkenberg,Jämtland,...,Västra Götaland-Lidköping,Västra Götaland-Skövde,Västra Götaland-SÄS ONCO,Västra Götaland-Uddevalla,Örebro-Örebro,Örebro-Örebro ONCO,Östergötland-Linköping,Östergötland-Linköping ONCO,Östergötland-Norrköping,time
0,61.0,288.0,30.0,326.0,10.0,198.0,30.0,96.0,30.0,0.0,...,91.0,20.0,1319.0,30.0,91.0,294.0,0.0,548.0,0.0,09-2019
1,122.0,526.0,51.0,325.0,0.0,466.0,51.0,96.0,30.0,0.0,...,61.0,122.0,1137.0,51.0,30.0,576.0,0.0,630.0,0.0,10-2019
2,30.0,123.0,51.0,537.0,30.0,133.0,51.0,288.0,30.0,0.0,...,30.0,51.0,1452.0,61.0,152.0,585.0,0.0,619.0,0.0,11-2019
3,122.0,336.0,51.0,154.0,30.0,304.0,20.0,154.0,30.0,0.0,...,91.0,61.0,1352.0,30.0,152.0,560.0,0.0,490.0,0.0,12-2019
4,103.0,127.0,42.0,75.0,101.0,372.0,106.0,270.0,52.0,42.0,...,99.0,109.0,1315.0,30.0,135.0,607.0,30.0,633.0,0.0,01-2020
5,108.0,199.0,0.0,127.0,87.0,246.0,64.0,87.0,21.0,50.0,...,5.0,50.0,982.0,21.0,64.0,335.0,50.0,358.0,0.0,02-2020
6,177.0,352.0,7.0,258.0,73.0,340.0,78.0,232.0,0.0,78.0,...,21.0,28.0,1081.0,85.0,120.0,990.0,14.0,503.0,0.0,03-2020
7,113.0,143.0,0.0,55.0,14.0,549.0,52.0,270.0,43.0,71.0,...,57.0,0.0,1320.0,64.0,50.0,419.0,28.0,653.0,42.0,04-2020
8,128.0,108.0,7.0,52.0,30.0,593.0,28.0,157.0,0.0,21.0,...,43.0,50.0,1513.0,43.0,106.0,294.0,21.0,465.0,50.0,05-2020
9,57.0,260.0,14.0,127.0,0.0,956.0,57.0,288.0,64.0,64.0,...,43.0,21.0,1356.0,64.0,191.0,532.0,0.0,380.0,35.0,06-2020


### Training set

The train.npz is a zipped archive of files named after the variables they contain. We now look at which files train.npz contains.

In [8]:
train.files

['x', 'y', 'x_offsets', 'y_offsets']

We see that the train.npz file contains 4 files. Let us look at them in detail.

In [9]:
train['x']

array([[[[  61.],
         [ 288.],
         [  30.],
         [ 326.],
         [  10.],
         [ 198.],
         [  30.],
         [  96.],
         [  30.],
         [   0.],
         [ 151.],
         [ 803.],
         [ 304.],
         [  30.],
         [  61.],
         [ 101.],
         [ 207.],
         [  30.],
         [ 172.],
         [  19.],
         [   0.],
         [ 404.],
         [2076.],
         [ 274.],
         [  30.],
         [ 182.],
         [1277.],
         [2365.],
         [   0.],
         [   0.],
         [  10.],
         [ 764.],
         [ 527.],
         [ 210.],
         [  20.],
         [ 886.],
         [   0.],
         [  91.],
         [ 177.],
         [  91.],
         [ 458.],
         [  30.],
         [   0.],
         [ 479.],
         [   0.],
         [   0.],
         [  41.],
         [1982.],
         [  91.],
         [  20.],
         [1319.],
         [  30.],
         [  91.],
         [ 294.],
         [   0.],
         [

In [10]:
train['y']

array([[[[ 122.],
         [ 526.],
         [  51.],
         [ 325.],
         [   0.],
         [ 466.],
         [  51.],
         [  96.],
         [  30.],
         [   0.],
         [ 254.],
         [ 311.],
         [ 507.],
         [ 111.],
         [  30.],
         [ 213.],
         [ 163.],
         [  61.],
         [ 213.],
         [ 114.],
         [   0.],
         [ 171.],
         [1943.],
         [ 142.],
         [  30.],
         [ 274.],
         [2026.],
         [2141.],
         [   0.],
         [   0.],
         [  10.],
         [ 625.],
         [ 689.],
         [ 363.],
         [  30.],
         [ 813.],
         [   0.],
         [  81.],
         [ 297.],
         [ 122.],
         [ 231.],
         [  30.],
         [  30.],
         [ 516.],
         [  30.],
         [   0.],
         [  41.],
         [2594.],
         [  61.],
         [ 122.],
         [1137.],
         [  51.],
         [  30.],
         [ 576.],
         [   0.],
         [

In [11]:
train['x_offsets']

array([[0]])

In [12]:
train['y_offsets']

array([[1]])

A comparison with the sales data above confirms that `x` contains the sales data from 09-2019 to 10-2020 for all 57 time series, while `y` contains the sales data from 10-2019 to 11-2020 for all 57 time series. In this way, the sales data in `x` can be used to forecast the sales data in `y`. This is also reflected in `x_offsets` and `y_offsets`: this month's sales values (offset 0 in `x_offsets`) are used to forecast next month's sales values (offset 1 in `y_offsets`).
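
The visual comparison could also be automated. The sketch below is one possible way to do so, under stated assumptions: `check_split` is a hypothetical helper that is not part of the GTS repository, and the random array only stands in for the sales table so that the example runs on its own.

```python
import numpy as np

def check_split(values, x, y, start):
    """Return True if x holds rows start, start+1, ... of values and y the rows one step later."""
    n = x.shape[0]
    x_ok = np.array_equal(x[:, 0, :, 0], values[start:start + n])
    y_ok = np.array_equal(y[:, 0, :, 0], values[start + 1:start + 1 + n])
    return x_ok and y_ok

# Toy stand-in for the sales table: 24 months x 57 series (random, not the real data)
rng = np.random.default_rng(0)
values = rng.integers(0, 100, size=(24, 57)).astype(float)

# Horizon-1 windows with the same layout as the saved arrays: (samples, 1, nodes, 1)
x_all = values[:-1, None, :, None]
y_all = values[1:, None, :, None]

print(check_split(values, x_all[:14], y_all[:14], start=0))      # train
print(check_split(values, x_all[14:17], y_all[14:17], start=14))  # val
print(check_split(values, x_all[17:], y_all[17:], start=17))      # test
```

In this notebook, one would presumably pass the numeric part of `sales` (without the added `time` column) as `values`, together with `train['x']` and `train['y']` and `start=0`. The start rows 14 and 17 would apply to the validation and test sets.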

### Validation set

The val.npz is a zipped archive of files named after the variables they contain. We now look at which files val.npz contains.

In [13]:
val.files

['x', 'y', 'x_offsets', 'y_offsets']

We see that the val.npz file contains 4 files. Let us look at them in detail.

In [14]:
val['x']

array([[[[ 120.],
         [ 229.],
         [  35.],
         [ 185.],
         [  21.],
         [ 850.],
         [  85.],
         [ 146.],
         [  57.],
         [  14.],
         [ 268.],
         [ 316.],
         [  43.],
         [  43.],
         [  31.],
         [  99.],
         [ 311.],
         [  50.],
         [ 113.],
         [  56.],
         [   0.],
         [ 411.],
         [ 900.],
         [ 198.],
         [ 149.],
         [ 793.],
         [ 792.],
         [3037.],
         [   0.],
         [   0.],
         [  64.],
         [ 647.],
         [ 241.],
         [ 601.],
         [  28.],
         [ 936.],
         [  28.],
         [  28.],
         [ 584.],
         [  85.],
         [ 369.],
         [  64.],
         [  43.],
         [ 560.],
         [   0.],
         [  57.],
         [   0.],
         [2184.],
         [  21.],
         [  35.],
         [1031.],
         [  85.],
         [  85.],
         [ 309.],
         [  80.],
         [

In [15]:
val['y']

array([[[[ 142.],
         [ 149.],
         [  42.],
         [  90.],
         [  21.],
         [ 381.],
         [ 120.],
         [ 161.],
         [  21.],
         [  21.],
         [  38.],
         [ 328.],
         [ 149.],
         [ 206.],
         [  31.],
         [ 113.],
         [ 290.],
         [  28.],
         [ 170.],
         [ 108.],
         [   0.],
         [ 436.],
         [ 932.],
         [ 298.],
         [  85.],
         [ 489.],
         [ 660.],
         [2269.],
         [   0.],
         [   0.],
         [  21.],
         [ 303.],
         [ 241.],
         [ 592.],
         [  50.],
         [ 318.],
         [  43.],
         [  50.],
         [ 681.],
         [  64.],
         [ 173.],
         [  64.],
         [  21.],
         [ 561.],
         [   0.],
         [  28.],
         [   0.],
         [2016.],
         [   7.],
         [  21.],
         [1294.],
         [ 170.],
         [  43.],
         [ 185.],
         [  66.],
         [

In [16]:
val['x_offsets']

array([[0]])

In [17]:
val['y_offsets']

array([[1]])

A comparison with the sales data above confirms that `x` contains the sales data from 11-2020 to 01-2021 for all 57 time series, while `y` contains the sales data from 12-2020 to 02-2021 for all 57 time series. In this way, the sales data in `x` can be used to forecast the sales data in `y`. This is also reflected in `x_offsets` and `y_offsets`: this month's sales values (offset 0 in `x_offsets`) are used to forecast next month's sales values (offset 1 in `y_offsets`).

### Test set

The test.npz is a zipped archive of files named after the variables they contain. We now look at which files test.npz contains.

In [18]:
test.files

['x', 'y', 'x_offsets', 'y_offsets']

We see that the test.npz file contains 4 files. Let us look at them in detail.

In [19]:
test['x']

array([[[[ 106.],
         [  90.],
         [   7.],
         [  96.],
         [  43.],
         [ 221.],
         [  21.],
         [ 108.],
         [  64.],
         [  35.],
         [ 127.],
         [ 295.],
         [ 156.],
         [  85.],
         [  43.],
         [ 113.],
         [ 240.],
         [   7.],
         [ 106.],
         [ 108.],
         [   0.],
         [ 143.],
         [ 954.],
         [ 191.],
         [  43.],
         [ 510.],
         [ 669.],
         [2579.],
         [   0.],
         [   0.],
         [   7.],
         [ 537.],
         [ 326.],
         [ 497.],
         [   0.],
         [ 520.],
         [  21.],
         [  71.],
         [ 333.],
         [  64.],
         [ 113.],
         [  85.],
         [   0.],
         [ 196.],
         [   0.],
         [  14.],
         [   0.],
         [1420.],
         [   0.],
         [  21.],
         [1038.],
         [  85.],
         [  64.],
         [ 112.],
         [  84.],
         [

In [20]:
test['y']

array([[[[ 149.],
         [ 185.],
         [  50.],
         [ 339.],
         [  64.],
         [ 400.],
         [  99.],
         [ 161.],
         [  43.],
         [  43.],
         [ 180.],
         [ 368.],
         [ 206.],
         [ 113.],
         [  43.],
         [ 113.],
         [ 168.],
         [  43.],
         [ 142.],
         [ 108.],
         [   0.],
         [ 319.],
         [1729.],
         [ 291.],
         [  85.],
         [ 858.],
         [ 631.],
         [2815.],
         [   0.],
         [   0.],
         [  14.],
         [ 420.],
         [ 255.],
         [ 763.],
         [  50.],
         [ 443.],
         [  57.],
         [  64.],
         [ 808.],
         [  21.],
         [ 286.],
         [ 128.],
         [   0.],
         [ 317.],
         [   0.],
         [  28.],
         [  21.],
         [1909.],
         [  28.],
         [  28.],
         [1630.],
         [ 149.],
         [ 128.],
         [ 274.],
         [  43.],
         [

In [21]:
test['x_offsets']

array([[0]])

In [22]:
test['y_offsets']

array([[1]])

A comparison with the sales data above confirms that `x` contains the sales data from 02-2021 to 07-2021 for all 57 time series, while `y` contains the sales data from 03-2021 to 08-2021 for all 57 time series. In this way, the sales data in `x` can be used to forecast the sales data in `y`. This is also reflected in `x_offsets` and `y_offsets`: this month's sales values (offset 0 in `x_offsets`) are used to forecast next month's sales values (offset 1 in `y_offsets`).

### Result

Overall, we have seen that the split into training, validation and test set has been made correctly.