# Lab: Porting LSTNet to Amazon SageMaker

In this lab the LSTNet model is ported so that it can be trained using Amazon SageMaker.

The lab assumes that an LSTNet model has already been developed. In the same directory as this notebook there are several modules containing supporting classes:

1. lstnet.py - The declaration of the model and its forward function. The model consists of a convolutional layer, dropout, a GRU, a skip GRU, a fully connected layer, and the parallel autoregressive component.
2. timeseriesdataset.py - Classes for loading the data: TimeSeriesData and TimeSeriesDataset. TimeSeriesDataset is a subclass of gluon.data.Dataset and implements the __getitem__ function, which returns a time series for the given index. These classes are used to load an input file and to generate successive examples with a specified window and horizon. The window is the length of the time series used as input for a prediction, and the horizon is the number of time steps between the end of the window and the time being predicted.
3. lstnet_sagemaker.py - This module implements the train() function, which is used as the entry point for training the model on a server. It is called by Amazon SageMaker on each host in the training cluster.
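
The relationship between window and horizon can be illustrated with a minimal sketch. This is not the code in timeseriesdataset.py, which is not shown here; the function name make_examples is hypothetical, and it assumes the target is the row exactly horizon steps after the last row of the window.

```python
import numpy as np

def make_examples(series, window, horizon):
    """Yield (input_window, target) pairs from an array of shape (time, features).

    Assumption: the target is the row `horizon` steps after the final row
    of the input window.
    """
    num_steps = len(series)
    for start in range(num_steps - window - horizon + 1):
        end = start + window               # one past the last input row
        target_index = end - 1 + horizon   # `horizon` steps after the window ends
        yield series[start:end], series[target_index]

# Toy series: 10 time steps for 2 customers
toy = np.arange(20, dtype=np.float32).reshape(10, 2)
examples = list(make_examples(toy, window=4, horizon=2))
first_x, first_y = examples[0]
print(len(examples), first_x.shape, first_y)
```

With 10 time steps, a window of 4 and a horizon of 2, only 5 examples can be generated, because each example needs window + horizon rows.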

In [1]:
import boto3
import os
import pandas as pd
import numpy as np
from sagemaker.mxnet import MXNet
from sagemaker import get_execution_role

#### Environment Variables

Configure the following variables for your environment:

1. bucket - The bucket name to be used to store the training data and model artifacts.
2. prefix - The folder name which is used inside the bucket.

In [2]:
bucket = 'eduthie-sagemaker-1'
prefix = 'lstnet'

role = get_execution_role()

data_dir = '../../data'
data_file_path = os.path.join(data_dir,'electricity.txt')

test_bucket_prefix = '/test/'
single_host_train_bucket_prefix = '/train/single_host/'

#### Load the Data

The first step is to load the electricity dataset from a file. The dataset itself is included in the data directory of this GitHub repo.

* The data is normalised so each reading is between 0 and 1. This is done by dividing each column by the maximum value of the column. A column is an electricity consumption time series for a single customer.

There are 321 time series of electricity consumption with 26,304 time periods in each. 

In [3]:
df = pd.read_csv(data_file_path,header=None)
max_columns = df.max().astype(np.float64)
df = df/max_columns # normalize
print(df.shape)

(26304, 321)
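
To see what the column-wise division does, here is a small sketch on a toy DataFrame. The values are made up for illustration; the names toy and normalised are not part of the lab code.

```python
import pandas as pd

# Toy data: 3 time steps for 2 customers with very different scales
toy = pd.DataFrame([[1.0, 10.0],
                    [2.0, 20.0],
                    [4.0, 40.0]])

column_max = toy.max()         # one maximum per column (customer)
normalised = toy / column_max  # pandas aligns the maxima with the columns

print(normalised)
```

Each column is scaled independently, so both customers end up with a largest reading of 1 regardless of their original scale.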


#### Split Training and Test and Upload to S3

The first 80% of the time series is used for training and the last 20% is used as a test set.

These datasets are written to CSV files and then uploaded to S3 to be used in training.

In [4]:
train_frac = 0.8

num_time_steps = len(df)
split_index = int(num_time_steps*train_frac)
train = df[0:split_index]
print('Training size {}'.format(len(train)))
test = df[split_index:]
print('Test size {}'.format(len(test)))

test_file_path = os.path.join(data_dir,'test.csv')
test.to_csv(test_file_path,header=None,index=False)
train_file_path = os.path.join(data_dir,'train.csv')
train.to_csv(train_file_path,header=None,index=False)

client = boto3.client('s3')
client.upload_file(test_file_path, bucket, prefix + test_bucket_prefix + 'test.csv')
client.upload_file(train_file_path, bucket, prefix + single_host_train_bucket_prefix + 'train.csv')

Training size 21043
Test size 5261
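
The split sizes and the S3 object keys produced by the cell above can be checked without touching S3. The sketch below only repeats the arithmetic and string concatenation; the bucket name my-example-bucket is a placeholder, not a real bucket.

```python
# Split arithmetic: int() truncates 26304 * 0.8 = 21043.2 down to 21043
num_time_steps = 26304
train_frac = 0.8
split_index = int(num_time_steps * train_frac)
print(split_index, num_time_steps - split_index)

# Key composition: the prefixes already carry leading and trailing slashes
bucket = 'my-example-bucket'  # placeholder bucket name
prefix = 'lstnet'
test_bucket_prefix = '/test/'
single_host_train_bucket_prefix = '/train/single_host/'

test_key = prefix + test_bucket_prefix + 'test.csv'
train_key = prefix + single_host_train_bucket_prefix + 'train.csv'
train_channel = 's3://{}/{}{}'.format(bucket, prefix, single_host_train_bucket_prefix)
print(test_key)
print(train_key)
print(train_channel)
```

Printing the keys before uploading is a cheap way to catch a doubled or missing slash.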


#### Test Locally

To make sure there are no obvious bugs in the code, the train() function is first called in the notebook. It is run for 1 epoch to verify that it executes correctly. There are also some basic unit tests included in the directory. A notebook instance with a GPU is required to execute the following steps; otherwise, skip ahead to deploying with SageMaker.

The key parameters to the train() function in this case are:

* hyperparameters - The Amazon SageMaker hyperparameters dictionary. A dict of string to string.
* channel_input_dirs - A dict of string-to-string maps from the Amazon SageMaker algorithm input channel name to the directory containing files for that input channel. Note, if the Amazon SageMaker training job is run in PIPE mode, this dictionary will be empty.
* output_data_dir - The Amazon SageMaker output data directory. After the function returns, data written to this directory is made available in the Amazon SageMaker training job output location.
* num_gpus - The number of GPU devices available on the host this script is being executed on.
* num_cpus - The number of CPU devices available on the host this script is being executed on.
* hosts - A list of hostnames in the Amazon SageMaker training job cluster.
* current_host - This host's name. It will exist in the hosts list.
* kwargs - Other keyword args.
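
As a rough sketch of how an entry point with these parameters might begin, the skeleton below casts the hyperparameters to numeric types and picks a device. It is not the train() function in lstnet_sagemaker.py, which is not shown here. The helper parse_hyperparameters and the returned summary dict are hypothetical, and the casting assumes values may arrive as strings.

```python
def parse_hyperparameters(hyperparameters):
    """Cast hyperparameter values to numeric types (assumes they may be strings)."""
    int_keys = ['conv_hid', 'gru_hid', 'skip_gru_hid', 'skip', 'ar_window',
                'window', 'horizon', 'batch_size', 'epochs']
    float_keys = ['learning_rate', 'clip_gradient']
    parsed = {key: int(hyperparameters[key]) for key in int_keys}
    parsed.update({key: float(hyperparameters[key]) for key in float_keys})
    return parsed

def train(hyperparameters, channel_input_dirs, output_data_dir,
          num_gpus, num_cpus, hosts, current_host, **kwargs):
    """Hypothetical skeleton: resolves the inputs but does not train anything."""
    params = parse_hyperparameters(hyperparameters)
    device = 'gpu' if num_gpus > 0 else 'cpu'
    # A real implementation would load the data, build the model and train here.
    return {
        'params': params,
        'train_dir': channel_input_dirs['train'],
        'test_dir': channel_input_dirs['test'],
        'device': device,
        'is_first_host': current_host == hosts[0],
    }

summary = train(
    hyperparameters={'conv_hid': '10', 'gru_hid': '10', 'skip_gru_hid': '2',
                     'skip': '5', 'ar_window': '6', 'window': '168',
                     'horizon': '24', 'learning_rate': '0.01',
                     'clip_gradient': '10.', 'batch_size': '128', 'epochs': '1'},
    channel_input_dirs={'train': '../../data', 'test': '../../data'},
    output_data_dir='../../data/output',
    num_gpus=0, num_cpus=1, hosts=['alg-1'], current_host='alg-1',
    model_dir=None)  # extra keyword arguments are absorbed by **kwargs
print(summary['device'], summary['params']['window'])
```

Accepting **kwargs keeps the function callable even when the caller passes arguments it does not use.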

In [5]:
from lstnet_sagemaker import train
hyperparameters = {
    'conv_hid' : 10,
    'gru_hid' : 10,
    'skip_gru_hid' : 2,
    'skip' : 5,
    'ar_window' : 6,
    'window' : 24*7,
    'horizon' : 24,
    'learning_rate' : 0.01,
    'clip_gradient' : 10.,
    'batch_size' : 128,
    'epochs' : 1
}
channel_input_dirs = {'train':data_dir,'test':data_dir}
train(
    hyperparameters = hyperparameters,
    input_data_config = None,
    channel_input_dirs = channel_input_dirs,
    output_data_dir = os.path.join(data_dir,'output'),
    model_dir = None,
    num_gpus = 1,
    num_cpus = 1,
    hosts = ['alg-1'],
    current_host = 'alg-1')

Train file path ../../data/test.csv
Test file path ../../data/test.csv
Loading file ../../data/test.csv
Loading file ../../data/test.csv
Is it a file True
Data length 5261
Loading file ../../data/test.csv
Loading file ../../data/test.csv
Is it a file True
Data length 5261
Running on [gpu(0)]
Hosts ['alg-1']
Current Host alg-1
kvstore device
Training Start
Epoch 0: rmse 0.31743179211536277 time 4.5512 s
Final rmse 0.23415213010202235
Total training time: 8.440000057220459
Training End


#### Choose Hyperparameters

Automatic model tuning using Bayesian optimisation is covered in a later session. For now, a set of reasonable hyperparameters is chosen by hand. Can you tweak these to make the network converge faster to a lower RMSE?

In [6]:
hyperparameters = {
    'conv_hid' : 100,
    'gru_hid' : 100,
    'skip_gru_hid' : 5,
    'skip' : 24,
    'ar_window' : 24,
    'window' : 24*7,
    'horizon' : 24,
    'learning_rate' : 0.001,
    'clip_gradient' : 10.,
    'batch_size' : 64,
    'epochs' : 100
}

#### Trigger the Training Job Using the SageMaker Python API

The final step is to trigger the training job using the high-level Python API. A lower-level API is also available for more detailed control of the parameters.

First an estimator is created with sagemaker.mxnet.MXNet. The inputs are:

* entry_point='lstnet_sagemaker.py' - The module used to run the training by calling the train() function
* source_dir='.' - An optional directory containing code which is copied onto the SageMaker training hosts and made available to the training script.
* role=role - The IAM role which is given to the training hosts giving them privileges such as access to the S3 bucket.
* output_path='s3://{}/{}/output'.format(bucket, prefix) - The S3 location in which to store artifacts such as the model parameters.
* train_instance_count=1 - The number of hosts used for training. Using a number > 1 will start a cluster. To take advantage of this, the training data should be sharded. See the next lab.
* train_instance_type='ml.p3.2xlarge' - The EC2 instance type to be used for training hosts. In this case the latest generation p3 is chosen, with an NVIDIA Tesla V100 GPU.
* hyperparameters=hyperparameters - The hyperparameter dictionary made available to the train() function in the entry point script.

Then the fit() method of the estimator is called. The parameters are:

* inputs - A dictionary containing the URLs in S3 of the 'train' data directory and the 'test' data directory.
* wait - This is specified as False so the fit() method returns immediately after the training job is created. Go to the SageMaker console to monitor the progress of the job. Set wait to True to block and see the progress of the training job output in the notebook.


In [7]:
lstnet1 = MXNet(entry_point='lstnet_sagemaker.py',
    source_dir='.',
    role=role,
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters=hyperparameters)
lstnet1.fit(inputs={'train': 's3://{}/{}{}'.format(bucket, prefix, single_host_train_bucket_prefix),
    'test': 's3://{}/{}{}'.format(bucket, prefix, test_bucket_prefix)},wait=False)

INFO:sagemaker:Creating training-job with name: sagemaker-mxnet-2018-07-17-10-14-05-563
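
Because fit() was called with wait=False, the notebook does not block. Besides the console, the job status can be polled from code. The sketch below is one possible approach: it uses the boto3 SageMaker client's describe_training_job call and reads the TrainingJobStatus field. The helper wait_for_training_job and the FakeSageMakerClient stand-in are hypothetical; the stand-in is there only so the example runs without AWS credentials.

```python
import time

def wait_for_training_job(sagemaker_client, job_name, poll_seconds=30, max_polls=1000):
    """Poll a training job until it reaches a terminal status and return that status."""
    for _ in range(max_polls):
        description = sagemaker_client.describe_training_job(TrainingJobName=job_name)
        status = description['TrainingJobStatus']
        if status in ('Completed', 'Failed', 'Stopped'):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('Training job {} did not finish'.format(job_name))

class FakeSageMakerClient:
    """Stand-in for boto3.client('sagemaker'), used only for a dry run."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_training_job(self, TrainingJobName):
        return {'TrainingJobName': TrainingJobName,
                'TrainingJobStatus': next(self._statuses)}

fake_client = FakeSageMakerClient(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_training_job(fake_client, 'example-job', poll_seconds=0)
print(final_status)
```

Against a real account, pass boto3.client('sagemaker') and the job name printed in the log line above in place of the stand-in.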


Congratulations. You have successfully ported LSTNet to Amazon SageMaker. The next step is to modify it to run using multiple GPUs and multiple hosts to train faster.

Next lab: [LSTNet Multi GPU and Distributed Training](../lstnet_multi_gpu_distributed/lstnet_multi_gpu_distributed.ipynb)