# TensorFlow BYOM: Train with Custom Training Script, Compile with Neo, and Deploy on SageMaker

This notebook is comparable in functionality to the [TensorFlow MNIST distributed training notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_distributed_mnist.ipynb). We perform the same classification task, but this time we compile the trained model with the Neo API backend to optimize it for our choice of hardware. Finally, we set up a real-time hosted endpoint in SageMaker for the compiled model using the Neo Deep Learning Runtime.

### Set up the environment

In [2]:
%cd /root/amazon-sagemaker-examples/aws_sagemaker_studio/sagemaker_neo_compilation_jobs/tensorflow_distributed_mnist

[Errno 2] No such file or directory: '/root/amazon-sagemaker-examples/aws_sagemaker_studio/sagemaker_neo_compilation_jobs/tensorflow_distributed_mnist'
/opt/ml/processing/input


In [3]:
import sys

!{sys.executable} -m pip install tensorflow==1.15.4

Collecting tensorflow==1.15.4
  Downloading tensorflow-1.15.4-cp37-cp37m-manylinux2010_x86_64.whl (110.5 MB)
Collecting grpcio>=1.8.6
  Downloading grpcio-1.38.0-cp37-cp37m-manylinux2014_x86_64.whl (4.2 MB)
Collecting keras-applications>=1.0.8
  Downloading Keras_Applications-1.0.8-py3-none-any.whl (50 kB)
Collecting absl-py>=0.7.0
  Downloading absl_py-0.12.0-py3-none-any.whl (129 kB)
Collecting astor>=0.6.0
  Downloading astor-0.8.1-py2.py3-none-any.whl (27 kB)
Collecting termcolor>=1.1.0
  Downloading termcolor-1.1.0.tar.gz (3.9 kB)
Collecting opt-einsum>=2.3.2
  Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)

In [1]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

### Download the MNIST dataset

In [2]:
import utils
from tensorflow.contrib.learn.python.learn.datasets import mnist
import tensorflow as tf

data_sets = mnist.read_data_sets("data", dtype=tf.uint8, reshape=False, validation_size=5000)

utils.convert_to(data_sets.train, "train", "data")
utils.convert_to(data_sets.validation, "validation", "data")
utils.convert_to(data_sets.test, "test", "data")


Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use urllib or similar directly.
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting data/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Writing data/train.tfrecords

Writing data/validation.tfrecord

### Upload the data
We use the ```sagemaker.Session.upload_data``` function to upload our datasets to an S3 location. The return value, ```inputs```, identifies that location. We will use it later when we start the training job.

In [5]:
default_bucket = sagemaker_session.default_bucket()
inputs = sagemaker_session.upload_data(bucket=default_bucket, path="data", key_prefix="data/DEMO-mnist")

In [8]:
print(inputs)

s3://sagemaker-us-west-2-688520471316/data/DEMO-mnist
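The printed URI appears to be the bucket name and the `key_prefix` joined under the `s3://` scheme. The sketch below reproduces that composition locally so you can predict where a directory upload will land. The helper name `expected_s3_uri` and the bucket `example-bucket` are hypothetical and used for illustration only; this is not part of the SageMaker SDK.

```python
def expected_s3_uri(bucket, key_prefix):
    """Compose the S3 URI a directory upload to bucket/key_prefix is expected to produce.

    Hypothetical helper for illustration; it does not call S3 or SageMaker.
    """
    return "s3://{}/{}".format(bucket, key_prefix.strip("/"))


# "example-bucket" is a placeholder, not a real bucket.
uri = expected_s3_uri("example-bucket", "data/DEMO-mnist")
print(uri)
```

Comparing this value with the `inputs` string returned by `upload_data` is a quick sanity check before starting a training job.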


# Construct a script for distributed training 
Here is the full code for the network model:

In [7]:
!cat 'mnist.py'

import os

import tensorflow as tf
from tensorflow.python.estimator.model_fn import ModeKeys as Modes

INPUT_TENSOR_NAME = "inputs"
SIGNATURE_NAME = "predictions"

LEARNING_RATE = 0.001


def model_fn(features, labels, mode, params):
    # Input Layer
    input_layer = tf.reshape(features[INPUT_TENSOR_NAME], [-1, 28, 28, 1])

    # Convolutional Layer #1
    conv1 = tf.layers.conv2d(
        inputs=input_layer, filters=32, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
    )

    # Pooling Layer #1
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

    # Convolutional Layer #2 and Pooling Layer #2
    conv2 = tf.layers.conv2d(
        inputs=pool1, filters=64, kernel_size=[5, 5], padding="same", activation=tf.nn.relu
    )
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)

    # Dense Layer
    pool2_flat = tf.reshape(pool2, [-1, 7 * 7 * 64])
    dense = tf.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.

The script here is an adaptation of the [TensorFlow MNIST example](https://github.com/tensorflow/models/tree/master/official/mnist). It provides a ```model_fn(features, labels, mode, params)```, which is used for training, evaluation, and inference. See the [TensorFlow MNIST distributed training notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_distributed_mnist.ipynb) for more details about the training script.

At the end of the training script, there are two additional functions, to be used with Neo Deep Learning Runtime:
* `neo_preprocess(payload, content_type)`: Function that takes in the payload and Content-Type of each incoming request and returns a NumPy array
* `neo_postprocess(result)`: Function that takes the prediction results produced by the Deep Learning Runtime and returns the response body
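The listing of `mnist.py` above is cut off before these two functions, so the following is only a minimal sketch of what such hooks might look like, not the script's actual code. It assumes that requests carry a NumPy array serialized with `np.save`, that the content type is `application/vnd+python.numpy+binary`, that the postprocessing step applies a softmax, and that the function returns a content type alongside the response body. All four points are assumptions.

```python
import io
import json

import numpy as np

# Assumed content type for requests carrying a serialized NumPy array.
NUMPY_CONTENT_TYPE = "application/vnd+python.numpy+binary"


def neo_preprocess(payload, content_type):
    """Decode a request body written by np.save into a NumPy array."""
    if content_type != NUMPY_CONTENT_TYPE:
        raise RuntimeError("Unsupported content type: {}".format(content_type))
    return np.load(io.BytesIO(payload))


def neo_postprocess(result):
    """Convert raw scores for one sample into a JSON list of softmax probabilities."""
    scores = np.squeeze(np.asarray(result, dtype=np.float64))
    exp_scores = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    probabilities = exp_scores / np.sum(exp_scores)
    # Returning (body, content_type) is an assumption of this sketch.
    return json.dumps(probabilities.tolist()), "application/json"


# Local check with made-up data: one flattened 28x28 image and three fake scores.
buffer = io.BytesIO()
np.save(buffer, np.zeros((1, 784), dtype=np.float32))
decoded = neo_preprocess(buffer.getvalue(), NUMPY_CONTENT_TYPE)
body, response_type = neo_postprocess([[1.0, 2.0, 3.0]])
```

Running the hooks locally like this makes it easier to catch shape or content-type mismatches before the model is compiled and deployed.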

## Create a training job using the sagemaker.TensorFlow estimator

In [11]:
from sagemaker.tensorflow import TensorFlow

mnist_estimator = TensorFlow(
    entry_point="mnist.py",
    role=role,
    framework_version="1.15.4",
    py_version="py3",
    training_steps=10,
    evaluation_steps=1,
    model_dir='/opt/ml/model',
    output_path="s3://" + default_bucket + '/' + 'output/DEMO-mnist',
    instance_count=2,
    instance_type="ml.c5.2xlarge",
)

mnist_estimator.fit(inputs)

2021-06-01 19:52:53 Starting - Starting the training job...
2021-06-01 19:53:20 Starting - Launching requested ML instancesProfilerReport-1622577172: InProgress
......
2021-06-01 19:54:20 Starting - Preparing the instances for training.........
2021-06-01 19:55:52 Downloading - Downloading input data
2021-06-01 19:55:52 Training - Downloading the training image...
2021-06-01 19:56:23 Uploading - Uploading generated training model
2021-06-01 19:56:23 Completed - Training job completed
2021-06-01 19:56:11,217 sagemaker-training-toolkit INFO     Imported framework sagemaker_tensorflow_container.training
2021-06-01 19:56:11,224 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-01 19:56:11,615 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-01 19:56:11,630 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2021-06-01 19:56:11,645 s

The **```fit```** method creates a training job on two **ml.c5.2xlarge** instances, matching the `instance_count` and `instance_type` passed to the estimator. The job logs show the instances performing training and evaluation and incrementing the number of **training steps**.

At the end of training, the training job generates a saved model for TF Serving.

# Deploy the trained model to prepare for predictions (the old way)

The ```deploy()``` method creates an endpoint that serves prediction requests in real time.

In [12]:
mnist_predictor = mnist_estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-us-west-2-688520471316/output/DEMO-mnist/tensorflow-training-2021-06-01-19-52-52-782/output/model.tar.gz.
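The `ClientError` above names the exact S3 URI where `CreateModel` expected to find the model artifact. When diagnosing such an error, it can help to split that URI into a bucket and an object key so the location can be inspected. The sketch below does this with the standard library only. The helper name `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split an s3://bucket/key URI into a (bucket, key) tuple."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {}".format(s3_uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI with the same layout as the one in the error message.
bucket, key = split_s3_uri(
    "s3://example-bucket/output/DEMO-mnist/example-job/output/model.tar.gz"
)
print(bucket)
print(key)
```

The resulting bucket and key can then be passed to an S3 client, for example boto3's `head_object`, to confirm whether the artifact exists at that location.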

## Invoking the endpoint

In [None]:
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

for i in range(10):
    data = mnist.test.images[i].tolist()
    tensor_proto = tf.make_tensor_proto(
        values=np.asarray(data), shape=[1, len(data)], dtype=tf.float32
    )
    predict_response = mnist_predictor.predict(tensor_proto)

    print("========================================")
    label = np.argmax(mnist.test.labels[i])
    print("label is {}".format(label))
    prediction = np.argmax(predict_response["outputs"]["probabilities"]["float_val"])
    print("prediction is {}".format(prediction))
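To see how the prediction is extracted without a live endpoint, the sketch below applies the same indexing to a hand-written dictionary. The nesting (`outputs` → `probabilities` → `float_val`) is taken from the loop above, but the probability values are made up, and a real response may contain additional fields.

```python
def argmax(values):
    """Return the index of the largest element in a non-empty sequence."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical response; the values are invented for illustration.
mock_response = {
    "outputs": {
        "probabilities": {
            "float_val": [0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
        }
    }
}

prediction = argmax(mock_response["outputs"]["probabilities"]["float_val"])
print("prediction is {}".format(prediction))
```

Each position in `float_val` corresponds to one digit class, so the index of the largest probability is the predicted digit.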

## Deleting the endpoint

In [None]:
mnist_predictor.delete_endpoint()

# Deploy the trained model using Neo

Now the model is ready to be compiled by Neo and optimized for our hardware of choice. We use the estimator's ``compile_model`` method to do this. For this example, our target hardware is ``'ml_c5'``. You can change this to another supported target if you prefer.

## Compiling the model
The ``input_shape`` defines the model's input tensor, and ``output_path`` is where the compiled model will be stored in S3. **Important: if the following command results in a permission error, scroll up and locate the execution role returned by ``get_execution_role()``. The role must have access to the S3 bucket specified in ``output_path``.**

In [None]:
output_path = "/".join(mnist_estimator.output_path.split("/")[:-1])
optimized_estimator = mnist_estimator.compile_model(
    target_instance_family="ml_c5",
    input_shape={"data": [1, 784]},  # Batch size 1, one flattened 28x28 image (784 values).
    output_path=output_path,
    framework="tensorflow",
    framework_version="1.15.4",
)
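The `output_path` expression in the cell above drops the last path segment of the estimator's output path, so the compiled model is written one level above the training output. The sketch below shows the effect on a placeholder URI (`example-bucket` is hypothetical) and checks where the second dimension of `input_shape` comes from.

```python
# Placeholder standing in for mnist_estimator.output_path.
estimator_output_path = "s3://example-bucket/output/DEMO-mnist"

# Same expression as in the compilation cell: remove the final path segment.
compiled_output_path = "/".join(estimator_output_path.split("/")[:-1])
print(compiled_output_path)

# A 28x28 grayscale image flattened into a single row has 28 * 28 values.
flattened_size = 28 * 28
input_shape = {"data": [1, flattened_size]}
print(input_shape)
```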

## Deploying the compiled model

In [None]:
optimized_predictor = optimized_estimator.deploy(
    initial_instance_count=1, instance_type="ml.c5.4xlarge"
)

In [None]:
import io

import numpy as np


def numpy_bytes_serializer(data):
    # Serialize a NumPy array to the binary .npy format and return the raw bytes
    f = io.BytesIO()
    np.save(f, data)
    f.seek(0)
    return f.read()


optimized_predictor.content_type = "application/vnd+python.numpy+binary"
optimized_predictor.serializer = numpy_bytes_serializer

## Invoking the endpoint

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
from IPython import display
import PIL.Image
import io

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

for i in range(10):
    data = mnist.test.images[i]
    # Display image
    im = PIL.Image.fromarray(data.reshape((28, 28)) * 255).convert("L")
    display.display(im)
    # Invoke endpoint with image
    predict_response = optimized_predictor.predict(data)

    print("========================================")
    label = np.argmax(mnist.test.labels[i])
    print("label is {}".format(label))
    prediction = predict_response
    print("prediction is {}".format(np.argmax(prediction)))

## Deleting the endpoint

In [None]:
optimized_predictor.delete_endpoint()