# Scalable TensorFlow models with Pipe mode and distributed training

New data scientists and machine learning engineers have a treasure trove of examples available on the internet to help them get started. These examples typically use small public datasets and demonstrate common use cases and approaches. The data can be downloaded quickly to a training instance, and training typically completes in minutes. However, many customers have large-scale datasets that make the simple approach of downloading the full dataset prohibitive. Imagine your training algorithm waiting for a download of 100GB of images for a travel web site, 100TB of video, or even 10PB of monitoring data from heart patients worldwide. Likewise, large-scale training jobs can take days to run on a single instance.

Amazon SageMaker provides Pipe mode and distributed training for TensorFlow developers for exactly this purpose. Pipe mode lets you establish a channel to your dataset and feed your training algorithm batches of that data incrementally. Your training can start quickly, and because the data is streamed rather than downloaded up front, the dataset does not have to fit on the training instance's storage. Distributed training lets you reduce training job durations by adding more training instances to parallelize the training. These scaling capabilities work well with SageMaker's TensorFlow container, allowing data scientists to bring their own TensorFlow scripts without having to do the heavier lifting of building Docker containers or standing up machine learning clusters.

While there are several examples available on the use of Pipe mode, they do not cover every scenario and use case. This notebook provides an end-to-end example of TensorFlow training with large datasets and projected long training durations. It includes use of the following:

1. **Script mode** using SageMaker's TensorFlow container and a custom TensorFlow neural network.
2. **Pipe mode** to incrementally stream data to the training algorithm. 
3. Data stored in **TFRecords** format.
4. Multiple data channels each containing **multiple files**.
5. Data **sharding by S3 keys**.
6. Training across **multiple training instances** using SageMaker's built-in **parameter server**, plus proper handling of **saving the model artifacts** from only the master node.
7. Definition of **SageMaker training metrics** to support experimentation, visualization, and hyperparameter tuning.
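
Item 6 above mentions saving model artifacts from only the master node. The following is a minimal sketch of such a check, not the code from `scripts/train.py`. It assumes the training container exposes the `SM_HOSTS` and `SM_CURRENT_HOST` environment variables and that the first host in the list acts as the master. The helper name `is_master_node` is hypothetical.

```python
import json
import os


def is_master_node(environ=os.environ):
    """Return True when the current host is the first host in the cluster.

    Assumption: SM_HOSTS holds a JSON list of host names and
    SM_CURRENT_HOST holds this container's host name.
    """
    hosts = json.loads(environ.get('SM_HOSTS', '["algo-1"]'))
    current_host = environ.get('SM_CURRENT_HOST', hosts[0])
    return current_host == hosts[0]


# Simulated three-node cluster, seen from two different hosts.
cluster = '["algo-1", "algo-2", "algo-3"]'
print(is_master_node({'SM_HOSTS': cluster, 'SM_CURRENT_HOST': 'algo-1'}))
print(is_master_node({'SM_HOSTS': cluster, 'SM_CURRENT_HOST': 'algo-2'}))
```

A training script could guard its model export with a check like this so that only one instance writes the final artifacts.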

## Simple synthetic classification dataset

For this example, we use a simple numeric dataset for binary classification. Because the focus of this notebook is on quickly and easily demonstrating Pipe mode, our synthetic dataset has a configurable number of features and samples. Feel free to scale it up to see the approach in action on large datasets. To get started, we create a synthetic dataset and split it into train, test, and validation sets.

Note that for showcasing distributed training, we use a larger dataset. You may need to use an ml.t3.2xlarge notebook instance to have sufficient memory.

In [None]:
import os
import shutil

This notebook works in a few different modes. If you want to run through it quickly to get a feel for the code and what it takes to use Pipe mode, leave `MODE` set to `fast-demo`. If you would like to see how multiple training instances can speed up your training job, run the notebook twice:

1. Once using `one-slow-node` to see how long a job takes with a single node and a modest amount of data. This serves as a baseline.
2. Then in `speedy-cluster` mode to run the exact same scenario with more training instances. You should see a significant reduction in training job duration.

In [None]:
train_instance_type = 'ml.c5.xlarge' 
serve_instance_type = 'ml.m4.xlarge'

In [None]:
MODE = 'fast-demo' # 'fast-demo' or 'one-slow-node' or 'speedy-cluster'

if MODE == 'fast-demo':  ## 3-minute training job
    TRAIN_INSTANCE_COUNT = 1
    NUM_SAMPLES  = 10000
    NUM_FEATURES = 50
    NUM_FILES    = 6
    NUM_EPOCHS   = 200
    BASE_JOB_NAME = 'tf-pipe-fast-demo'
else:
    NUM_SAMPLES  = 90000
    NUM_FEATURES = 5000
    NUM_FILES    = 21 # NOTE: For an even split of data across nodes, make this a multiple of TRAIN_INSTANCE_COUNT.
    NUM_EPOCHS   = 5000

    if MODE == 'one-slow-node':   ## 80-minute training job
        TRAIN_INSTANCE_COUNT = 1
        BASE_JOB_NAME = 'tf-pipe-one-slow-node'
    else:
        TRAIN_INSTANCE_COUNT = 3  ## 30-minute training job
        BASE_JOB_NAME = 'tf-pipe-speedy-cluster-' + str(TRAIN_INSTANCE_COUNT)

# NOTE: For an even split of data across nodes, make NUM_FILES a multiple of TRAIN_INSTANCE_COUNT.

BATCH_SIZE   = 64
INPUT_MODE   = 'Pipe' # Can try it with 'File' mode as well

Our training script uses the number of features to define the input shape for a simple TensorFlow neural network. Here we use a `sed` command to ensure the input shape is consistent between the training script and the notebook that generates the dataset and the training files.

In [None]:
!sed 's/NUM_FEATURES = /NUM_FEATURES = {NUM_FEATURES} \#/' scripts/train.py > scripts/tmp.py
!mv scripts/tmp.py scripts/train.py
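
To make the effect of the `sed` substitution concrete, here is a small Python sketch of the same text replacement applied to a single line. It assumes the script contains a line of the form `NUM_FEATURES = <value>`. The starting value `20` and the helper name `pin_num_features` are hypothetical.

```python
def pin_num_features(source_line, num_features):
    """Mimic the sed expression on one line of the training script.

    The new value is written first, and the value that was already
    in the script ends up behind a comment marker.
    """
    return source_line.replace('NUM_FEATURES = ',
                               'NUM_FEATURES = {} #'.format(num_features), 1)


# Hypothetical original line; the real default in train.py may differ.
print(pin_num_features('NUM_FEATURES = 20', 50))
```

Because the previous value is only commented out, rerunning the substitution keeps working, with each earlier value left behind as a trailing comment.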

In [None]:
!pygmentize scripts/train.py

Generate the dataset and split it across train, test, and val.

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

X1, Y1 = make_classification(n_samples=NUM_SAMPLES, n_features=NUM_FEATURES, n_redundant=0, 
                             n_informative=1, n_classes=2, n_clusters_per_class=1, 
                             shuffle=True, class_sep=2.0)

# split data into train and test sets
seed = 7
val_size  = 0.20
test_size = 0.10

# Give 70% to train
X_train, X_test, y_train, y_test = \
    train_test_split(X1, Y1, test_size=(test_size + val_size), random_state=seed)
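# Of the remaining 30%, give 2/3 to validation and 1/3 to test.
# train_test_split returns the larger (1 - test_size) portion first,
# so the validation set must be unpacked before the test set.
X_val, X_test, y_val, y_test     = \
    train_test_split(X_test, y_test, test_size=(test_size / (test_size + val_size)), 
                     random_state=seed)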

print('Train shape: {}, Test shape: {}, Val shape: {}'.format(X_train.shape, 
                                                              X_test.shape, X_val.shape))
print('Train target: {}, Test target: {}, Val target: {}'.format(y_train.shape, 
                                                                 y_test.shape, y_val.shape))
print('\nSample observation: {}\nSample target: {}'.format(X_test[0], y_test[0]))
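
The two-stage split is intended to produce a 70/20/10 division. The sketch below works through that arithmetic with integers only. It is an approximation: `train_test_split` applies its own rounding, so actual counts may differ by a sample. The helper name `expected_split` is hypothetical.

```python
from fractions import Fraction


def expected_split(num_samples, val_pct=20, test_pct=10):
    """Return the intended (train, val, test) sample counts."""
    val = num_samples * val_pct // 100
    test = num_samples * test_pct // 100
    train = num_samples - val - test
    return train, val, test


# Share of the 30% holdout that the second split hands to the test set.
second_stage_test_share = Fraction(10, 10 + 20)
print(second_stage_test_share)

for n in (10000, 90000):
    print(n, expected_split(n))
```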

Here we capture how many samples are in each channel. We pass these counts to the training script so it can define the number of steps per epoch.

In [None]:
num_train_samples = X_train.shape[0]
num_val_samples   = X_val.shape[0]
num_test_samples  = X_test.shape[0]
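
As an illustration of how a sample count can turn into steps per epoch, the sketch below rounds up so that a final partial batch still counts as a step. This is an assumption about the calculation; the training script may round down or divide the work across instances instead.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


# Example counts for a 10,000-sample dataset split 70/20/10, batch size 64.
print(steps_per_epoch(7000, 64))
print(steps_per_epoch(2000, 64))
```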

## Saving data to TFRecord files

Pipe mode supports RecordIO, TFRecord, and TextLine. Here we use TFRecord format, and for each of train, test, and val, we generate a set of files so we can see how Pipe mode deals with sets of files. We divide the dataset into a configurable number of slices and save each slice to a separate file. With a massive dataset, dividing the data into separate files makes it easier to feed the data to your training algorithm and makes it possible to train across a cluster of machines to reduce training time.

In [None]:
import tensorflow as tf
from sagemaker.tensorflow import TensorFlow

In [None]:
def convert_to_tfr(x, y, out_file):
    # Write each (features, label) pair as one serialized tf.train.Example.
    with tf.python_io.TFRecordWriter(out_file) as record_writer:
        num_samples = len(x)
        for i in range(num_samples):
            example = tf.train.Example()
            example.features.feature['features'].float_list.value.extend(x[i])
            example.features.feature['label'].int64_list.value.append(int(y[i]))
            record_writer.write(example.SerializeToString())
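
Part of what makes TFRecord files convenient to stream is that each record is length-prefixed, so a reader can pull one record at a time without loading the whole file. The following pure-Python sketch illustrates that framing: a 64-bit little-endian length, a 32-bit checksum of the length, the payload, and a 32-bit checksum of the payload. It is an illustration only. The helper names are hypothetical, the checksums are zero-filled placeholders that are never verified, and TensorFlow would reject files written this way.

```python
import io
import struct


def frame(payload):
    """Frame a payload like a TFRecord, with placeholder (zero) checksums."""
    return struct.pack('<QI', len(payload), 0) + payload + struct.pack('<I', 0)


def iter_records(stream):
    """Yield raw record payloads from a binary stream, one at a time."""
    while True:
        header = stream.read(12)  # uint64 length + uint32 checksum of length
        if len(header) < 12:
            return  # end of stream
        (length,) = struct.unpack('<Q', header[:8])
        payload = stream.read(length)
        stream.read(4)  # uint32 checksum of the payload, skipped here
        yield payload


# Two small records written back to back, then read incrementally.
stream = io.BytesIO(frame(b'first') + frame(b'second record'))
for record in iter_records(stream):
    print(len(record), record)
```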

Remove old data directories and files if they exist. Recreate a data folder with subfolders for each of the three channels.

In [None]:
shutil.rmtree('data', ignore_errors=True)
os.makedirs('data/train')
os.makedirs('data/test')
os.makedirs('data/val')

Save each of the datasets into their own folder of files based on the configurable number of files. The data will be split as evenly as possible across that number of files in each channel.

In [None]:
def save_to_n_files(x, y, n_files, channel):
    _split_x = np.array_split(x, n_files)
    _split_y = np.array_split(y, n_files)
    for i in range(n_files):
        convert_to_tfr(_split_x[i], _split_y[i], 
                       './data/{}/{}{}.tfrecords'.format(channel, channel, i))
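
When the number of samples is not a multiple of `n_files`, `np.array_split` gives the first few slices one extra element each. The pure-Python sketch below reproduces just the slice sizes so you can predict how many records land in each file. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(num_items, n_files):
    """Slice sizes that np.array_split produces for num_items elements."""
    base, extra = divmod(num_items, n_files)
    # The first `extra` slices each receive one additional element.
    return [base + 1 if i < extra else base for i in range(n_files)]


sizes = split_sizes(7000, 6)
print(sizes, sum(sizes))
```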

In [None]:
%%time
save_to_n_files(X_train, y_train, NUM_FILES, 'train')
save_to_n_files(X_test,  y_test,  NUM_FILES, 'test')
save_to_n_files(X_val,   y_val,   NUM_FILES, 'val')

Free up memory as these portions of the synthetic dataset are no longer needed in this notebook.

In [None]:
del X_train
del y_train
del X_val
del y_val

## Upload the input data to S3
Upload the entire data folder hierarchy to S3. For channels configured with `Pipe` mode, the data will be piped to the training job as the training algorithm progresses. If using `File` mode, the entire set of files for each data channel will be downloaded to the training instance at the start of the job.

In [None]:
import boto3
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
data_prefix = 'data/DEMO-hello-pipe-mode'

In [None]:
%%time
# clear out any old data
s3 = boto3.resource('s3')
s3_bucket = s3.Bucket(bucket)
s3_bucket.objects.filter(Prefix=data_prefix + '/').delete()

# upload the entire set of data for all three channels
inputs = sagemaker_session.upload_data(path='data', key_prefix=data_prefix)
print('Data was uploaded to s3 at: {}'.format(inputs))

## Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job. Note that we provide metric definitions to track validation accuracy and validation loss. Each definition pairs a metric name with a regular expression that is applied to the training job's log output. With these definitions in place, the metrics are charted on the training job detail page in the console. Likewise, we can navigate to the CloudWatch algorithm metrics for the job for further analysis.

In [None]:
from sagemaker.tensorflow import TensorFlow

hyperparameters = {'epochs'    : NUM_EPOCHS, 'batch_size': BATCH_SIZE,
                   'num_train_samples': num_train_samples,
                   'num_val_samples'  : num_val_samples,
                   'num_test_samples' : num_test_samples}

estimator = TensorFlow(entry_point='train.py',
                            source_dir='scripts',
                            input_mode=INPUT_MODE,
                            train_instance_type=train_instance_type,
                            train_instance_count=TRAIN_INSTANCE_COUNT,
                            distributions={'parameter_server': {'enabled': True}},
                            metric_definitions=[
                               {'Name': 'validation:acc',  'Regex': '- val_acc: (.*?$)'},
                               {'Name': 'validation:loss', 'Regex': '- val_loss: (.*?) '}],
                            hyperparameters=hyperparameters,
                            role=sagemaker.get_execution_role(),
                            framework_version='1.12',
                            py_version='py3',
                            base_job_name=BASE_JOB_NAME)
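
To check what the metric regular expressions capture, you can try them locally against a sample log line. The sketch below uses Python's `re` module as a stand-in for the matching SageMaker performs. The log line is a hypothetical Keras-style progress line, so the real output of the training script may look different.

```python
import re

metric_definitions = [
    {'Name': 'validation:acc',  'Regex': '- val_acc: (.*?$)'},
    {'Name': 'validation:loss', 'Regex': '- val_loss: (.*?) '},
]

# Hypothetical end-of-epoch log line; values are made up for illustration.
log_line = ('110/110 [==============================] - 1s 9ms/step'
            ' - loss: 0.1042 - acc: 0.9630 - val_loss: 0.0987 - val_acc: 0.9710')


def extract_metrics(line):
    """Apply each metric regex to one log line and collect the captures."""
    found = {}
    for definition in metric_definitions:
        match = re.search(definition['Regex'], line)
        if match:
            found[definition['Name']] = float(match.group(1))
    return found


print(extract_metrics(log_line))
```

Note that the `val_loss` pattern relies on a trailing space after the value, so it only matches when another field follows it on the same line.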

Sharding the data by S3 key makes distributed training more efficient. The alternative data distribution mechanism is full replication, in which all the data files are sent to every training instance. With S3 sharding, each training instance receives only a subset of the data files, which shortens the time to training completion.

In [None]:
%%time

DISTRIBUTION_MODE = 'ShardedByS3Key' # 'FullyReplicated'

train_input = sagemaker.s3_input(s3_data=inputs+'/train', 
                                 distribution=DISTRIBUTION_MODE)
test_input  = sagemaker.s3_input(s3_data=inputs+'/test', 
                                 distribution=DISTRIBUTION_MODE)
val_input   = sagemaker.s3_input(s3_data=inputs+'/val', 
                                 distribution=DISTRIBUTION_MODE)

remote_inputs = {'train': train_input, 'val': val_input, 'test': test_input}

estimator.fit(remote_inputs, wait=True)
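
The following sketch illustrates why a file count that is a multiple of the instance count gives an even split under S3 key sharding. The round-robin assignment is an assumption made for illustration; the actual assignment of keys to instances is decided by SageMaker. The helper name `shard_keys` is hypothetical.

```python
def shard_keys(keys, instance_count):
    """Deal keys out round-robin, producing one shard per training instance."""
    return [keys[i::instance_count] for i in range(instance_count)]


# 21 files across 3 instances divides evenly; 20 files does not.
even_keys = ['train/train{}.tfrecords'.format(i) for i in range(21)]
uneven_keys = even_keys[:20]

print([len(shard) for shard in shard_keys(even_keys, 3)])
print([len(shard) for shard in shard_keys(uneven_keys, 3)])
```

With an uneven split, some instances see less data per epoch than others, which is why the configuration cell suggests keeping `NUM_FILES` a multiple of `TRAIN_INSTANCE_COUNT`.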

## Deploy and make predictions

In [None]:
%%time
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type=serve_instance_type)

Make a handful of predictions to ensure the model is being served properly and is making accurate predictions. The endpoint should yield similar accuracy to that reported at the end of the training job, as it evaluates the model using the same test dataset.

In [None]:
total_to_test = 100 # or to use the whole test set, set this to: len(X_test)
num_accurate  = 0

for i in range(total_to_test):
    result = predictor.predict(X_test[i])
    predicted_prob = result['predictions'][0][0]
    predicted_label = round(predicted_prob)
    if y_test[i] == predicted_label:
        num_accurate += 1
        print('PASS. Actual: {:.0f}, Prob: {:.4f}'.format(y_test[i], predicted_prob))
    else:
        print('FAIL. Actual: {:.0f}, Prob: {:.4f}'.format(y_test[i], predicted_prob))
print('Acc: {:.2%}'.format(num_accurate/total_to_test))
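
One detail worth knowing when converting probabilities to labels: Python 3's built-in `round` rounds ties to the nearest even integer, so `round(0.5)` is `0`. An explicit threshold makes the decision boundary unambiguous. A minimal sketch, with the hypothetical helper `to_label`:

```python
def to_label(prob, threshold=0.5):
    """Map a predicted probability to a binary class label."""
    return 1 if prob >= threshold else 0


# Compare built-in rounding with an explicit threshold at the boundary.
for p in (0.49, 0.5, 0.51):
    print(p, round(p), to_label(p))
```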

## Clean up

Remove the SageMaker-hosted endpoint to avoid additional billing.

In [None]:
sagemaker_session.delete_endpoint(predictor.endpoint)

Remove all the generated data from the notebook instance folders and from S3.

In [None]:
shutil.rmtree('data', ignore_errors=True)
s3 = boto3.resource('s3')
s3_bucket = s3.Bucket(bucket)
resp = s3_bucket.objects.filter(Prefix=data_prefix + '/').delete()

Free up memory for a subsequent run.

In [None]:
del X_test
del y_test