# Lab: LSTNet Multi GPU and Distributed Training

In this lab, the LSTNet model, which has already been ported to Amazon SageMaker, is modified to run on multiple GPUs and with distributed training.

#### Pre-requisites
It is assumed you are familiar with multi-GPU and distributed training using MXNet and SageMaker as covered in the previous labs ['Training on multiple GPUs with gluon'](../multiple_gpus_gluon/multiple_gpus_gluon.ipynb) and ['Distributed Training with Amazon SageMaker and gluon'](distributed_training_gluon/distributed_training_gluon.ipynb). It is also assumed that you have just ported your LSTNet code to SageMaker in the ['Porting LSTNet to Amazon SageMaker'](porting_lstnet_to_sagemaker/porting_lstnet_to_sagemaker.ipynb) lab.

#### Overview

There are three main steps required to scale the training using multiple GPUs and multiple hosts:

1. Pass the appropriate kvstore parameter to the Gluon trainer. This specifies how parameters are synchronised across devices and hosts. In this case 'dist_device_sync' is used, which relies on a parameter server to coordinate multiple hosts and performs the gradient updates on the GPUs when possible.

2. Shard the training dataset. To perform distributed cluster training, the training dataset is split into shards with at least 1 shard per host. In this case it is split into 5 shards as 5 hosts will be used. Each host trains using only a portion of the dataset. The sharded training data is stored in Amazon S3.

3. Split each batch into portions and copy each portion onto its own GPU. In this case 4 GPUs will be used. Each GPU trains on only a portion of each batch. The gradients are summed over all GPUs at the end of the batch and all GPUs (and all hosts, when combined with distributed training) are updated. These updates are performed on the GPU when possible. The training script must perform the splitting itself; Gluon automatically manages the synchronisation and updates.
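
Step 3 can be illustrated without a GPU. The sketch below is a plain-Python stand-in for the kind of slicing that `gluon.utils.split_and_load` performs on real arrays. `split_batch` is a hypothetical helper that is not part of the lab code, and it assumes the batch size is divisible by the number of GPUs.

```python
def split_batch(batch, num_gpus):
    """Split a batch (any sliceable sequence) into num_gpus equal portions."""
    if len(batch) % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    portion = len(batch) // num_gpus
    # Portion i holds the i-th contiguous slice of the batch.
    return [batch[i * portion:(i + 1) * portion] for i in range(num_gpus)]


batch = list(range(128))          # stand-in for 128 training samples
portions = split_batch(batch, 4)  # one portion per GPU
print([len(p) for p in portions])
```

In the real training loop each portion would be copied to a different GPU context before the forward pass.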

In [1]:
import boto3
import os
import pandas as pd
import numpy as np
from sagemaker.mxnet import MXNet
from sagemaker import get_execution_role

#### Environment Variables

Configure the following variables for your environment:

1. bucket - The bucket name to be used to store the training data and model artifacts.
2. prefix - The folder name which is used inside the bucket.

In [2]:
bucket = 'eduthie-sagemaker-1'
prefix = 'lstnet'

role = get_execution_role()

data_dir = '../../data'
data_file_path = os.path.join(data_dir,'electricity.txt')

test_bucket_prefix = '/test/'
single_host_train_bucket_prefix = '/train/single_host/'
multiple_host_train_bucket_prefix = '/train/multiple_host/'

#### Load the Data

The first step is to load the electricity dataset from a file. The dataset itself is included in this GitHub repo in the data directory.

* The data is normalised so each reading is between 0 and 1. This is done by dividing each column by the maximum value of the column. A column is an electricity consumption time series for a single customer.

There are 321 time series of electricity consumption with 26,304 time periods in each. 

In [16]:
df = pd.read_csv(data_file_path,header=None)
max_columns = df.max().astype(np.float64)
df = df/max_columns # normalize
print(df.shape)

(26304, 321)


#### Split Training and Test, Shard the Training Data, and Upload to S3

The first 80% of the time series is used for training and the last 20% is used as a test set.

The training set is sharded sequentially into 5 parts, one for each host in the cluster.

These datasets are written to a csv file and then uploaded to Amazon S3 to be used in training.

In [19]:
splits = 5
train_frac = 0.8

num_time_steps = len(df)
split_index = int(num_time_steps*train_frac)
train = df[0:split_index]
print('Training size {}'.format(len(train)))
test = df[split_index:]
print('Test size {}'.format(len(test)))

train_sets = []
train_len = len(train)
train_size = train_len / splits  # fractional shard size; boundaries are truncated to ints below
for i in range(0,splits):
    start = int(i*train_size)
    end = int((i+1)*train_size)
    print('start {}'.format(start))
    print('end {}'.format(end))
    if end < (train_len-1):
        train_sets.append(train[start:end])
    else:
        train_sets.append(train[start:])


test_file_path = os.path.join(data_dir,'test.csv')
test.to_csv(test_file_path,header=None,index=False)
train_file_path = os.path.join(data_dir,'train.csv')
train.to_csv(train_file_path,header=None,index=False)

client = boto3.client('s3')

for i in range(0,splits):
    file_path = os.path.join(data_dir,'train_{}.csv'.format(i))
    print('Uploading file: {} with {} rows'.format(file_path,len(train_sets[i])))
    train_sets[i].to_csv(file_path,header=None,index=False)
    s3_path = prefix + '{}train_{}.csv'.format(multiple_host_train_bucket_prefix,i)
    print('Uploading to {}'.format(s3_path))
    client.upload_file(file_path, bucket, s3_path)

client.upload_file(test_file_path, bucket, prefix + '{}test.csv'.format(test_bucket_prefix))
client.upload_file(train_file_path, bucket, prefix + '{}train.csv'.format(single_host_train_bucket_prefix))

Training size 21043
Test size 5261
start 0
end 4208
start 4208
end 8417
start 8417
end 12625
start 12625
end 16834
start 16834
end 21043
Uploading file: ../../data/train_0.csv with 4208 rows
Uploading to lstnet/train/multiple_host/train_0.csv
Uploading file: ../../data/train_1.csv with 4209 rows
Uploading to lstnet/train/multiple_host/train_1.csv
Uploading file: ../../data/train_2.csv with 4208 rows
Uploading to lstnet/train/multiple_host/train_2.csv
Uploading file: ../../data/train_3.csv with 4209 rows
Uploading to lstnet/train/multiple_host/train_3.csv
Uploading file: ../../data/train_4.csv with 4209 rows
Uploading to lstnet/train/multiple_host/train_4.csv
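
The shard boundaries printed above can be reproduced with integer arithmetic alone, which makes it easy to check that the shards cover every training row exactly once. This is a minimal sketch; `shard_bounds` is a hypothetical helper, not part of the lab code.

```python
def shard_bounds(num_rows, num_shards):
    """Return (start, end) row ranges covering num_rows with no gaps or overlaps."""
    return [(i * num_rows // num_shards, (i + 1) * num_rows // num_shards)
            for i in range(num_shards)]


bounds = shard_bounds(21043, 5)  # 21043 training rows, 5 hosts
sizes = [end - start for start, end in bounds]
print(bounds)
print(sizes)
```

The sizes differ by at most one row, so no host receives noticeably more data than another.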


#### Modifications to lstnet_sagemaker.py

There are two main changes to the module:
    
1. Choose the kvstore as 'dist_device_sync' when multiple GPUs and hosts are available.
2. Split each batch into one part per GPU and copy each part to a separate GPU before training.
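
The first change can be sketched as a small selection function. Only two of its branches are grounded in this lab: 'dist_device_sync' for multiple hosts with GPUs, and 'device' for a single host with a GPU (as printed by the local test run). The CPU-only branches and the name `choose_kvstore` are assumptions made for illustration; the actual logic in lstnet_sagemaker.py may differ.

```python
def choose_kvstore(num_gpus, num_hosts):
    """Pick an MXNet kvstore type name from the available hardware."""
    if num_hosts > 1:
        # Distributed training: aggregate on GPUs when there are any (assumed CPU fallback).
        return 'dist_device_sync' if num_gpus > 0 else 'dist_sync'
    # Single host: aggregate on GPUs when there are any (assumed CPU fallback).
    return 'device' if num_gpus > 0 else 'local'


print(choose_kvstore(num_gpus=4, num_hosts=5))
print(choose_kvstore(num_gpus=1, num_hosts=1))
```

The returned string would then be passed as the `kvstore` argument when constructing the Gluon trainer.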

In [1]:
!cat lstnet_sagemaker.py

import mxnet as mx
from mxnet import nd, gluon, autograd, kv
import numpy as np
from mxnet.gluon import nn, rnn
import os
from lstnet import LSTNet
from timeseriesdataset import TimeSeriesData, TimeSeriesDataset
import re
import time

def get_first_file_path_in_dir(input_dir):
    """
    Returns the path of the first file found in the directory path provided,
    or None if no file is found

    :param input_dir: directory path to search
    :return: path of the first file found, or None
    :rtype: str
    """
    for root, dirs, files in os.walk(input_dir):
        for filename in files:
            # Join with root so files found in sub-directories resolve correctly
            return os.path.join(root, filename)
    return None

def get_file_path(input_dir,current_host,hosts):
    file_path = None
    if len(hosts) <= 1:
        file_path = get_first_file_path_in_dir(input_dir)
    else:
        numbers_in_host_name = re.findall(r'\d+', current_host)
        index = int(numbers_in_host_name[0]) - 1
        file_path = '{}/train_{}.csv'.format(input_dir
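
The host-to-shard mapping used by `get_file_path` can be tried in isolation. The sketch below assumes that host names follow the `alg-N` pattern seen elsewhere in this lab and that shard files use the `train_{i}.csv` naming from the upload step. The function name `shard_file_for_host` and the example directory are hypothetical.

```python
import re


def shard_file_for_host(input_dir, current_host):
    """Map a host name such as 'alg-3' to its zero-based shard file."""
    numbers_in_host_name = re.findall(r'\d+', current_host)
    index = int(numbers_in_host_name[0]) - 1  # hosts count from 1, shards from 0
    return '{}/train_{}.csv'.format(input_dir, index)


# Illustrative directory; the real value comes from channel_input_dirs['train'].
print(shard_file_for_host('/opt/ml/input/data/train', 'alg-3'))
```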

#### Test Locally

To make sure there are no obvious bugs in the code, the train() function is called in the notebook. This is done with 1 epoch to verify that it executes correctly. There are also some basic unit tests included in the directory. A notebook instance with a GPU is required to execute the following steps. Otherwise, skip ahead to deploying with SageMaker.

The key parameters to the train() function in this case are:

        - hyperparameters: The Amazon SageMaker Hyperparameters dictionary. A dict
            of string to string.
        - channel_input_dirs: A dict of string-to-string maps from the
            Amazon SageMaker algorithm input channel name to the directory containing
            files for that input channel. Note, if the Amazon SageMaker training job
            is run in PIPE mode, this dictionary will be empty.
        - output_data_dir:
            The Amazon SageMaker output data directory. After the function returns, data written to this
            directory is made available in the Amazon SageMaker training job
            output location.
        - num_gpus: The number of GPU devices available on the host this script
            is being executed on.
        - num_cpus: The number of CPU devices available on the host this script
            is being executed on.
        - hosts: A list of hostnames in the Amazon SageMaker training job cluster.
        - current_host: This host's name. It will exist in the hosts list.
        - kwargs: Other keyword args.

In [20]:
from lstnet_sagemaker import train
hyperparameters = {
    'conv_hid' : 10,
    'gru_hid' : 10,
    'skip_gru_hid' : 2,
    'skip' : 5,
    'ar_window' : 6,
    'window' : 24*7,
    'horizon' : 24,
    'learning_rate' : 0.01,
    'clip_gradient' : 10.,
    'batch_size' : 128,
    'epochs' : 1
}
channel_input_dirs = {'train':data_dir,'test':data_dir}
train(
    hyperparameters = hyperparameters,
    input_data_config = None,
    channel_input_dirs = channel_input_dirs,
    output_data_dir = os.path.join(data_dir,'output'),
    model_dir = None,
    num_gpus = 1,
    num_cpus = 1,
    hosts = ['alg-1'],
    current_host = 'alg-1')

Train file path ../../data/train_1.csv
Test file path ../../data/train_1.csv
Loading file ../../data/train_1.csv
Loading file ../../data/train_1.csv
Is it a file True
Data length 4209
Loading file ../../data/train_1.csv
Loading file ../../data/train_1.csv
Is it a file True
Data length 4209
Running on [gpu(0)]
Hosts ['alg-1']
Current Host alg-1
kvstore device
Training Start
Epoch 0: rmse 0.3383692430873071
Final rmse 0.4321038925103963
Total training time: 64.28396916389465
Training End


#### Choose Hyperparameters

One parameter to watch when switching to multi-GPU training is the batch size. Because each batch is split across 4 GPUs, it is common to increase the batch size so that each GPU still receives a reasonably large portion.

In [3]:
hyperparameters = {
    'conv_hid' : 100,
    'gru_hid' : 100,
    'skip_gru_hid' : 5,
    'skip' : 24,
    'ar_window' : 24,
    'window' : 24*7,
    'horizon' : 24,
    'learning_rate' : 0.0001,
    'clip_gradient' : 10.,
    'batch_size' : 512,
    'epochs' : 100
}
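
The arithmetic behind this choice can be made explicit. The sketch below assumes that every host draws `batch_size` samples per step and that the synchronous kvstore combines gradients from all hosts before each update. `batch_breakdown` is a hypothetical helper, not part of the lab code.

```python
def batch_breakdown(batch_size, num_gpus, num_hosts):
    """Return (samples per GPU per step, samples contributing to each synchronised update)."""
    per_gpu = batch_size // num_gpus
    per_update = batch_size * num_hosts
    return per_gpu, per_update


per_gpu, per_update = batch_breakdown(batch_size=512, num_gpus=4, num_hosts=5)
print(per_gpu, per_update)
```

With a batch size of 512 each of the 4 GPUs processes 128 samples per step, which matches the batch size used for the single-GPU local test.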

#### Trigger the Training Job Using the SageMaker Python API

The final step is to trigger the training job using the high-level Python API. A lower-level API is also available for more detailed control of the parameters.

The main differences to note from the single-host, single-GPU case are that train_instance_count is set to 5 to run on a 5-node cluster, and that the train_instance_type ml.p3.8xlarge, which has 4 GPUs, is chosen. SageMaker automatically bootstraps an MXNet cluster with 5 nodes.

First an estimator is created with sagemaker.mxnet.MXNet. The inputs are:

* entry_point='lstnet_sagemaker.py' - The module used to run the training by calling the train() function
* source_dir='.' - An optional directory containing code which is copied onto the SageMaker training hosts and made available to the training script.
* role=role - The IAM role which is given to the training hosts giving them privileges such as access to the S3 bucket.
* output_path='s3://{}/{}/output'.format(bucket, prefix) - The S3 bucket to store artifacts such as the model parameters.
* train_instance_count=5 - The number of hosts used for training. Using a number > 1 will start a cluster. To take advantage of this the training data is sharded.
* train_instance_type='ml.p3.8xlarge' - The EC2 instance type to be used for training hosts. In this case the latest generation p3 is chosen with 4 NVIDIA Tesla V100 GPUs.
* hyperparameters=hyperparameters - The hyperparameter dictionary made available to the train() function in the entry point script.

Then the fit() method of the estimator is called. The parameters are:

* inputs - A dictionary containing the URLs in S3 of the 'train' data directory and the 'test' data directory.
* wait - This is specified as False so the fit() method returns immediately after the training job is created. Go to the SageMaker console to monitor the progress of the job. Set wait to True to block and see the progress of the training job output in the notebook.

Two different versions are run to compare training speeds:

1. 5 hosts with 4 GPUs each (ml.p3.8xlarge)
2. 1 host with 1 GPU (ml.p3.2xlarge)

Experiment with more combinations to improve performance. Can you find the most efficient combination of batch size and hardware for this network?

In [5]:
lstnet1 = MXNet(entry_point='lstnet_sagemaker.py',
    source_dir='.',
    role=role,
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    train_instance_count=5,
    train_instance_type='ml.p3.8xlarge',
    hyperparameters=hyperparameters)
lstnet1.fit(inputs={'train': 's3://{}/{}{}'.format(bucket, prefix, multiple_host_train_bucket_prefix),
    'test': 's3://{}/{}{}'.format(bucket, prefix, test_bucket_prefix)},wait=False)

INFO:sagemaker:Creating training-job with name: sagemaker-mxnet-2018-07-19-12-00-25-501


In [7]:
lstnet3 = MXNet(entry_point='lstnet_sagemaker.py',
    source_dir='.',
    role=role,
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters=hyperparameters)
lstnet3.fit(inputs={'train': 's3://{}/{}{}'.format(bucket, prefix, single_host_train_bucket_prefix),
    'test': 's3://{}/{}{}'.format(bucket, prefix, test_bucket_prefix)},wait=False)

INFO:sagemaker:Creating training-job with name: sagemaker-mxnet-2018-07-19-12-03-13-796
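
Once both jobs have finished, their training times can be compared programmatically instead of through the console. The sketch below reads the `TrainingJobStatus`, `TrainingStartTime` and `TrainingEndTime` fields returned by the boto3 SageMaker client's `describe_training_job` call. `training_seconds` is a hypothetical helper, and `FakeSageMakerClient` with its made-up timestamps exists only so the example runs offline; it does not reflect real results from this lab.

```python
from datetime import datetime, timedelta


def training_seconds(sm_client, job_name):
    """Return the training duration in seconds, or None if the job has not completed."""
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    if desc['TrainingJobStatus'] != 'Completed':
        return None
    return (desc['TrainingEndTime'] - desc['TrainingStartTime']).total_seconds()


class FakeSageMakerClient:
    """Offline stand-in for boto3.client('sagemaker'); timestamps are invented."""

    def describe_training_job(self, TrainingJobName):
        start = datetime(2018, 7, 19, 12, 0, 0)
        return {'TrainingJobStatus': 'Completed',
                'TrainingStartTime': start,
                'TrainingEndTime': start + timedelta(minutes=30)}


print(training_seconds(FakeSageMakerClient(), 'example-job'))
```

In the notebook, the fake client would be replaced by `boto3.client('sagemaker')` and the job name by one of the names logged when the jobs were created.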
