# TensorFlow script mode training and serving

Script mode is a training script format for TensorFlow that lets you execute any TensorFlow training script in SageMaker with minimal modification. The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native TensorFlow support sets up training-related environment variables and executes your training script. In this tutorial, we use the SageMaker Python SDK to launch a training job and deploy the trained model.

Script mode supports training with a Python script, a Python module, or a shell script. In this example, we use a Python script to train a classification model on the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), and we show how to train on SageMaker using both TensorFlow 1.x and TensorFlow 2.0 scripts with the SageMaker Python SDK. In addition, this notebook demonstrates how to perform real-time inference with the [SageMaker TensorFlow Serving container](https://github.com/aws/sagemaker-tensorflow-serving-container). The TensorFlow Serving container is the default inference method for script mode. For full documentation on the TensorFlow Serving container, see the [deployment guide](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst).

This lab goes through 3 parts:

1. Training the Model
2. Deploying and Evaluating the Trained Model
3. Hyperparameter Optimization

# Part 1: Training the Model

## Set up the environment

Let's start by setting up the environment:

In [1]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
region = sagemaker_session.boto_session.region_name

### Training Data

The MNIST dataset has been loaded to the public S3 buckets ``sagemaker-sample-data-<REGION>`` under the prefix ``tensorflow/mnist``. There are four ``.npy`` files under this prefix:
* ``train_data.npy``
* ``eval_data.npy``
* ``train_labels.npy``
* ``eval_labels.npy``

In [2]:
training_data_uri = 's3://sagemaker-sample-data-{}/tensorflow/mnist'.format(region)

## Construct a script for distributed training

This tutorial's training script was adapted from TensorFlow's official [CNN MNIST example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/layers/cnn_mnist.py). We have modified it to handle the ``model_dir`` parameter passed in by SageMaker. This is an S3 path which can be used for data sharing during distributed training and checkpointing and/or model persistence. We have also added an argument-parsing function to handle processing training-related variables.

At the end of the training job we have added a step to export the trained model to the path stored in the environment variable ``SM_MODEL_DIR``, which always points to ``/opt/ml/model``. This is critical because SageMaker uploads all the model artifacts in this folder to S3 at the end of training.
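
The argument handling described above might look roughly like the following. This is a minimal sketch, not the contents of `mnist.py`: the function name `parse_args` and the `--sm-model-dir` and `--train` flag names are assumptions made for illustration, while `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are the environment variables referred to in this tutorial.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse script arguments, falling back to SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    # S3 path passed in by SageMaker as --model_dir (checkpoints, shared state).
    parser.add_argument('--model_dir', type=str, default=None)
    # Hypothetical flag: local export directory whose contents are uploaded to S3.
    parser.add_argument('--sm-model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    # Hypothetical flag: local directory holding the 'training' channel data.
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING'))
    # parse_known_args ignores unexpected arguments instead of failing on them.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == '__main__':
    args = parse_args(['--model_dir', 's3://example-bucket/example-job'])
    print(args.model_dir, args.sm_model_dir, args.train)
```

Reading defaults from the environment keeps the script runnable outside SageMaker, where those variables are unset and the flags can be passed explicitly.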

Here is the entire script, for both TF 1.x and TF 2.0:

In [3]:
# Tensorflow 1.x script
!pygmentize 'mnist.py'

# Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
"""Convolutional Neural Network Estimator for MNIST, built with tf.layers."""

from __future__ import absolute_import
from __future__

In [4]:
# TensorFlow 2.0 script
!pygmentize 'mnist-2.py'

# Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.import tensorflow as tf

import tensorflow as tf
import argparse
import os
import

## Create a training job using the `TensorFlow` estimator

The `sagemaker.tensorflow.TensorFlow` estimator handles locating the script mode container, uploading your script to an S3 location, and creating a SageMaker training job. Let's call out a couple of important parameters here:

* `py_version` is set to `'py3'` to indicate that we are using script mode since legacy mode supports only Python 2. Though Python 2 will be deprecated soon, you can use script mode with Python 2 by setting `py_version` to `'py2'` and `script_mode` to `True`.

* `distributions` is used to configure the distributed training setup. It's required only if you are doing distributed training either across a cluster of instances or across multiple GPUs. Here we are using parameter servers as the distributed training schema. SageMaker training jobs run on homogeneous clusters. To make parameter server more performant in the SageMaker setup, we run a parameter server on every instance in the cluster, so there is no need to specify the number of parameter servers to launch. Script mode also supports distributed training with [Horovod](https://github.com/horovod/horovod). You can find the full documentation on how to configure `distributions` [here](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#distributed-training). 



In [11]:
from sagemaker.tensorflow import TensorFlow

mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             train_instance_count=2,
                             train_instance_type='ml.c5.2xlarge',
                             framework_version='1.14',
                             py_version='py3',
                             distributions={'parameter_server': {'enabled': True}})

You can also create an estimator to train with a TensorFlow 2.0 script. The only things you need to change are the script name and ``framework_version``.

We'll include metric extraction from the CloudWatch logs of the training job. The TF 2.0 script was adapted to log train and eval loss and accuracies, and we'll set the expressions up to capture them all. 

This will be used in part 3 of the lab for hyperparameter optimization.

In [12]:
metric_definitions = [{'Name': 'train_loss',
                       'Regex': 'train_loss: ([0-9\\.]+)'},
                      {'Name': 'train_acc',
                       'Regex': 'train_accuracy: ([0-9\\.]+)'},
                      {'Name': 'eval_loss',
                       'Regex': 'Evaluation loss: ([0-9\\.]+)'},
                      {'Name': 'eval_acc',
                       'Regex': 'Evaluation accuracy: ([0-9\\.]+)'},
                     ]
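
To see how such metric regexes pull numbers out of log output, here is a small sketch that applies two of the expressions to sample lines with Python's `re` module. The log lines and the helper `extract_metrics` are hypothetical; the TF 2.0 script's real output may be formatted differently, and this only imitates the idea of taking the value from the capture group.

```python
import re

# Two of the regular expressions from metric_definitions.
metric_definitions = [
    {'Name': 'train_loss', 'Regex': 'train_loss: ([0-9\\.]+)'},
    {'Name': 'eval_acc', 'Regex': 'Evaluation accuracy: ([0-9\\.]+)'},
]

# Hypothetical log lines; the real script's output format may differ.
sample_log = [
    'Epoch 1 - train_loss: 0.2531 - train_accuracy: 0.9214',
    'Evaluation accuracy: 0.9702',
]


def extract_metrics(lines, definitions):
    """Return (name, value) pairs for every regex match in the given lines."""
    found = []
    for line in lines:
        for definition in definitions:
            match = re.search(definition['Regex'], line)
            if match:
                # The first capture group holds the numeric value.
                found.append((definition['Name'], float(match.group(1))))
    return found


print(extract_metrics(sample_log, metric_definitions))
```

Testing the expressions locally like this is a cheap way to catch a regex that never matches before a training job is launched.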

In [13]:
mnist_estimator2 = TensorFlow(entry_point='mnist-2.py',
                              role=role,
                              train_instance_count=1,
                              train_instance_type='ml.c5.2xlarge',
                              framework_version='2.0.0',
                              py_version='py3',
                              distributions={'parameter_server': {'enabled': True}},
                              metric_definitions=metric_definitions
                             )

### Calling ``fit``

To start a training job, we call `estimator.fit(training_data_uri)`.

An S3 location is used here as the input. `fit` creates a default channel named `'training'`, which points to this S3 location. In the training script we can then access the training data from the location stored in `SM_CHANNEL_TRAINING`. `fit` accepts a couple other types of input as well. See the API doc [here](https://sagemaker.readthedocs.io/en/stable/estimators.html#sagemaker.estimator.EstimatorBase.fit) for details.

When training starts, the TensorFlow container executes `mnist.py`, passing `hyperparameters` and `model_dir` from the estimator as script arguments. Because we didn't define either in this example, no hyperparameters are passed, and `model_dir` defaults to `s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>`, so the script execution is as follows:
```bash
python mnist.py --model_dir s3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>
```
When training is complete, the training job will upload the saved model for TensorFlow serving.
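
As a rough illustration of how estimator settings turn into script arguments, the sketch below assembles a command line from a hyperparameter dictionary. The helper `build_script_args` is hypothetical and is not part of the SageMaker SDK or container; the exact argument order used by the real container is an assumption.

```python
def build_script_args(entry_point, model_dir, hyperparameters=None):
    """Assemble an illustrative command line: hyperparameters become --name value pairs."""
    command = ['python', entry_point]
    for name, value in sorted((hyperparameters or {}).items()):
        command += ['--{}'.format(name), str(value)]
    command += ['--model_dir', model_dir]
    return command


model_dir = 's3://<DEFAULT_BUCKET>/<TRAINING_JOB_NAME>'
# No hyperparameters, as in this example.
print(' '.join(build_script_args('mnist.py', model_dir)))
# With a hyperparameter such as the learning rate tuned later in this lab.
print(' '.join(build_script_args('mnist.py', model_dir, {'learning_rate': 0.01})))
```

This is why the training script needs an argument parser: any hyperparameter set on the estimator arrives as a command-line flag.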

Training should take about **10 minutes**.

In [14]:
mnist_estimator.fit(training_data_uri, wait=False)

Call `fit` to train a model with the TensorFlow 2.0 script.

In [15]:
mnist_estimator2.fit(training_data_uri, wait=False)

In [16]:
from time import sleep
while (mnist_estimator.latest_training_job.describe()['TrainingJobStatus'] == 'InProgress' or
       mnist_estimator2.latest_training_job.describe()['TrainingJobStatus'] == 'InProgress'):
    print('Training in progress...')
    sleep(30)
print("Training finished. Status:\n"
      f"\tTF 1: {mnist_estimator.latest_training_job.describe()['TrainingJobStatus']}\n"
      f"\tTF 2: {mnist_estimator2.latest_training_job.describe()['TrainingJobStatus']}")

Training in progress...
Training finished. Status:
	TF 1: Completed
	TF 2: Completed
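
The loop above stops as soon as neither job reports `InProgress`, which also happens for a job that is still `Stopping`. A slightly more defensive variant waits for a terminal status and gives up after a timeout. This is only a sketch: `wait_for_jobs` and its parameters are hypothetical, and the status source is replaced by canned responses so the example runs without AWS access.

```python
import time

# Training job statuses that mean the job is no longer running.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_jobs(get_statuses, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_statuses() until every status is terminal or the timeout elapses."""
    waited = 0
    while True:
        statuses = get_statuses()
        if all(status in TERMINAL_STATUSES for status in statuses.values()):
            return statuses
        if waited >= timeout_seconds:
            raise TimeoutError('Jobs still running: {}'.format(statuses))
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for calls such as latest_training_job.describe()['TrainingJobStatus'].
fake_responses = iter([
    {'TF 1': 'InProgress', 'TF 2': 'InProgress'},
    {'TF 1': 'Completed', 'TF 2': 'Stopping'},
    {'TF 1': 'Completed', 'TF 2': 'Stopped'},
])
final = wait_for_jobs(lambda: next(fake_responses), sleep=lambda seconds: None)
print(final)
```

In the notebook, `get_statuses` could be a small function that calls `describe()` on each estimator's latest training job and returns the two statuses in a dictionary.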


# Part 2: Deploy the trained model to an endpoint

The `deploy()` method creates a SageMaker model, which is then deployed to an endpoint to serve prediction requests in real time. We will use the TensorFlow Serving container for the endpoint, because we trained with script mode. This serving container runs an implementation of a web server that is compatible with the SageMaker hosting protocol. The SageMaker documentation on using your own inference code explains how SageMaker runs inference containers.

The two cells below deploy the TF 1.x and TF 2.0 models as service endpoints. Execute both cells; deployment should take about 10 minutes.

In [None]:
predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge', wait=False)

Deploy the trained TensorFlow 2.0 model to an endpoint.

In [None]:
predictor2 = mnist_estimator2.deploy(initial_instance_count=1, instance_type='ml.p3.2xlarge', wait=True)

## Invoke the endpoint

Let's download the test data and use that as input for inference.

In [None]:
import numpy as np

!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/eval_data.npy test_data.npy
!aws --region {region} s3 cp s3://sagemaker-sample-data-{region}/tensorflow/mnist/eval_labels.npy test_labels.npy

test_data = np.load('test_data.npy')
test_labels = np.load('test_labels.npy')

The formats of the input and the output data correspond directly to the request and response formats of the `Predict` method in the [TensorFlow Serving REST API](https://www.tensorflow.org/serving/api_rest). SageMaker's TensorFlow Serving endpoints can also accept additional input formats that are not part of the TensorFlow REST API, including the simplified JSON format, line-delimited JSON objects ("jsons" or "jsonlines"), and CSV data.

In this example we are using a `numpy` array as input, which will be serialized into the simplified JSON format. In addition, TensorFlow Serving can process multiple items at once, as you can see in the following code. You can find the complete documentation on how to make predictions against a TensorFlow Serving SageMaker endpoint [here](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint).
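
To make the difference between the two JSON shapes concrete, the sketch below serializes the same small batch both ways using only the standard library. The helper names are hypothetical and the two-value rows are stand-ins; real MNIST inputs hold 28 x 28 = 784 pixel values each.

```python
import json


def to_simplified_json(batch):
    """Serialize a batch (nested lists) as bare JSON, without an 'instances' wrapper."""
    return json.dumps(batch)


def to_predict_request(batch):
    """Serialize the same batch in the TensorFlow Serving Predict row format."""
    return json.dumps({'instances': batch})


# Two tiny stand-in inputs instead of real images.
batch = [[0.0, 0.5], [1.0, 0.25]]
print(to_simplified_json(batch))
print(to_predict_request(batch))
```

With a `numpy` array, calling `.tolist()` first yields the nested lists that `json.dumps` can handle.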

In [None]:
predictions = predictor.predict(test_data[:50])
errors = []
for i in range(0, 50):
    prediction = predictions['predictions'][i]['classes']
    label = test_labels[i]
    if (prediction != label):
        errors.append(i)
    print(f'{i}: prediction is {prediction}, label is {label}, matched: {prediction == label}')

So, the model made a few errors. Those were captured in the `errors` list, which we'll use to manually inspect what the problem could be.

Examine the prediction result from the TensorFlow 2.0 model. The TF 2.0 model returns only the probabilities for each class, so we add a quick post-processing step to determine the most probable class.

In [None]:
predictions2 = predictor2.predict(test_data[:50])
predictions2['classes'] = [np.argmax(x) for x in predictions2['predictions']]
errors2 = []
for i in range(0, 50):
    prediction = predictions2['classes'][i]
    label = test_labels[i]
    if (prediction != label):
        errors2.append(i)
    print('prediction is {}, label is {}, matched: {}'.format(prediction, label, prediction == label))

## Analyze Prediction errors

We have collected the errors for both predictors, and some simple code can help us analyze them.
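
Before looking at images, a quick numeric summary can already be informative. The following is a minimal sketch with made-up predictions and labels; `summarize_errors` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
from collections import Counter


def summarize_errors(predictions, labels):
    """Return error indices, the error rate, and counts of (label, prediction) pairs."""
    error_indices = [i for i, (p, l) in enumerate(zip(predictions, labels)) if p != l]
    confusions = Counter((labels[i], predictions[i]) for i in error_indices)
    return error_indices, len(error_indices) / len(labels), confusions


# Hypothetical values standing in for model output and test_labels.
sample_predictions = [7, 2, 1, 0, 9, 1]
sample_labels = [7, 2, 1, 0, 4, 1]
sample_errors, error_rate, confusions = summarize_errors(sample_predictions, sample_labels)
print(sample_errors, round(error_rate, 3), confusions.most_common(1))
```

The `(label, prediction)` counts show which digits get confused with each other, which helps decide which images are worth inspecting first.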

We'll define a simple function to inspect MNIST images, in case our model makes prediction mistakes.

In [None]:
from PIL import Image

def plot(data):
    data = data.reshape((28, 28))
    gray_range = data.max() - data.min()
    img_data = (((data - data.min()) / gray_range) * 255.).astype(np.uint8)
    img = Image.fromarray(img_data)
    return(img)    

Then we use that function with the error indices to see what the problem could be. The code below shows the first error for each predictor.

In [None]:
error_imgs = [plot(test_data[i]) for i in errors]
error_imgs[0] if (len(errors) > 0) else None    

In [None]:
# The error indices refer to the test set, so index into test_data.
error_imgs2 = [plot(test_data[i]) for i in errors2]
error_imgs2[0] if (len(errors2) > 0) else None    

## Delete the endpoints

Let's delete the endpoints we just created to prevent incurring any extra costs. We won't need them for the hyperparameter tuning.

In [None]:
sagemaker.Session().delete_endpoint(predictor.endpoint)

Delete the TensorFlow 2.0 endpoint as well.

In [None]:
sagemaker.Session().delete_endpoint(predictor2.endpoint)

# Part 3: Hyperparameter Tuning
*Note, with the default setting below, the hyperparameter tuning job can take about 40 minutes to complete.*

Now we will set up the hyperparameter tuning job using the SageMaker Python SDK, following these steps:
* Reuse the TF 2.0 estimator we defined before. Any estimator can be used, whether pretrained or not.
* Define the ranges of the hyperparameters we plan to tune. In this example, we are tuning `learning_rate`.
* Define the objective metric for the tuning job to optimize.
* Create a hyperparameter tuner with the settings above, as well as the tuning resource configuration.

In [None]:
import boto3
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

With our estimator we can specify the hyperparameters we'd like to tune and their possible values.  We have three different types of hyperparameters.
- Categorical parameters need to take one value from a discrete set.  We define this by passing the list of possible values to `CategoricalParameter(list)`
- Continuous parameters can take any real number value between the minimum and maximum value, defined by `ContinuousParameter(min, max)`
- Integer parameters can take any integer value between the minimum and maximum value, defined by `IntegerParameter(min, max)`

*Note, if possible, it's almost always best to specify a value as the least restrictive type.  For example, tuning learning rate as a continuous value between 0.01 and 0.2 is likely to yield a better result than tuning as a categorical parameter with values 0.01, 0.1, 0.15, or 0.2.*

We'll also specify the objective metric that we'd like to tune and its definition. We use `eval_loss` as the objective metric and set `objective_type` to `'Minimize'`, so that hyperparameter tuning seeks to minimize the objective metric when searching for the best hyperparameter setting. By default, `objective_type` is set to `'Maximize'`.

In [None]:
hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.001, 0.2)}
objective_metric_name = 'eval_loss'
objective_type = 'Minimize'

Now, we'll create a `HyperparameterTuner` object, to which we pass:
- The TensorFlow estimator we created above
- Our hyperparameter ranges
- Objective metric name and definition
- Tuning resource configurations, such as the total number of training jobs to run and how many training jobs can run in parallel.

In [None]:
tuner = HyperparameterTuner(estimator=mnist_estimator2, 
                            objective_metric_name=objective_metric_name,
                            hyperparameter_ranges=hyperparameter_ranges,
                            metric_definitions=metric_definitions,
                            max_jobs=8,
                            max_parallel_jobs=2,
                            objective_type=objective_type)
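
As a back-of-the-envelope check on how these resource settings translate into wall-clock time, the sketch below assumes that jobs run in equal-length waves of `max_parallel_jobs`. That is a simplification, and `estimate_tuning_minutes` is a hypothetical helper; real jobs vary in duration and start-up time.

```python
import math


def estimate_tuning_minutes(max_jobs, max_parallel_jobs, minutes_per_job):
    """Estimate wall-clock time assuming jobs run in equal-length waves."""
    waves = math.ceil(max_jobs / max_parallel_jobs)
    return waves * minutes_per_job


# 8 jobs, 2 at a time, roughly 10 minutes per training job (figures from this lab).
print(estimate_tuning_minutes(max_jobs=8, max_parallel_jobs=2, minutes_per_job=10))
```

Under these assumptions the estimate comes to four waves of about 10 minutes each, in line with the roughly 40 minutes quoted at the start of this part. Raising `max_parallel_jobs` shortens the run but gives the tuner fewer completed results to learn from between waves.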

## Launch hyperparameter tuning job
And finally, we can start our hyperparameter tuning job by calling `.fit()` and passing in the S3 path to our train and test dataset.

After the hyperparameter tuning job is created, you can describe it to see its progress in the next step, or go to SageMaker console->Jobs to check its progress.

In [None]:
tuner.fit(training_data_uri)

## Analyzing the Hyperparameter Tuning Results

Let's inspect what's going on inside the tuning job.

In [None]:
analytics = tuner.analytics()

In [None]:
tuning = analytics.dataframe(force_refresh=True).set_index('TrainingJobName').sort_index()
while tuning[tuning.TrainingJobStatus == 'Completed'].shape[0] == 0:
    print('Waiting for some job to finish...')
    sleep(30)
    tuning = analytics.dataframe(force_refresh=True).set_index('TrainingJobName').sort_index()
tuning

In [None]:
%matplotlib inline
points = tuning.dropna()[['learning_rate', 'FinalObjectiveValue']]
ax = points.plot.scatter('learning_rate', 'FinalObjectiveValue', figsize=(15, 8))
for k, v in points.iterrows():
    ax.annotate(k[32:35], v)
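
Besides plotting, it is often useful to pick out the best job directly. The sketch below does this on plain dictionaries shaped like rows of the tuning dataframe; the records and the helper `best_training_job` are hypothetical, and the values are made up for illustration.

```python
def best_training_job(records, objective_type='Maximize'):
    """Pick the completed record with the best FinalObjectiveValue."""
    completed = [r for r in records if r['TrainingJobStatus'] == 'Completed']
    if not completed:
        return None
    pick = min if objective_type == 'Minimize' else max
    return pick(completed, key=lambda r: r['FinalObjectiveValue'])


# Hypothetical rows shaped like the tuning dataframe; the values are made up.
sample_records = [
    {'TrainingJobName': 'job-001', 'TrainingJobStatus': 'Completed',
     'learning_rate': 0.150, 'FinalObjectiveValue': 0.31},
    {'TrainingJobName': 'job-002', 'TrainingJobStatus': 'Completed',
     'learning_rate': 0.012, 'FinalObjectiveValue': 0.08},
    {'TrainingJobName': 'job-003', 'TrainingJobStatus': 'InProgress',
     'learning_rate': 0.050, 'FinalObjectiveValue': None},
]
best = best_training_job(sample_records, objective_type='Minimize')
print(best['TrainingJobName'], best['learning_rate'])
```

On the real dataframe, sorting the completed rows by `FinalObjectiveValue` gives the same ranking; the direction must match the `objective_type` used for the tuner.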