# Using Amazon Elastic Inference with a pre-trained TensorFlow Serving model on SageMaker

This notebook demonstrates how to enable and use Amazon Elastic Inference with our predefined SageMaker TensorFlow Serving containers.

Amazon Elastic Inference (EI) is a resource you can attach to your Amazon EC2 instances to accelerate your deep learning (DL) inference workloads. EI allows you to add inference acceleration to an Amazon SageMaker hosted endpoint or Jupyter notebook for a fraction of the cost of using a full GPU instance. For more information please visit: https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html

This notebook's main objective is to show how to create an endpoint, backed by an Elastic Inference accelerator, to serve our pre-trained TensorFlow Serving model for predictions. Because it offers better performance per dollar, Amazon Elastic Inference can be useful for those who want GPU-accelerated inference at a lower cost.

## Setup 

We'll begin with some necessary imports, and get an Amazon SageMaker session to help perform certain tasks, as well as an IAM role with the necessary permissions.

In [None]:
%matplotlib inline
import numpy as np
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')

print('inputs:\n{}'.format(inputs))

## Use the TensorFlow model pretrained using Horovod

Horovod is an open source distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. It is an alternative to the more "traditional" parameter server method of performing distributed training. Horovod can be more performant than parameter servers in large, GPU-based clusters where large models are trained. In Amazon SageMaker, Horovod is only available with TensorFlow version 1.12 or newer.

We can now plot training curves for the Horovod training run. We first define a plotting helper. To retrieve the history, we then download the model archive locally and extract it to gain access to the history data structure, which we can simply load as JSON:

In [None]:
import matplotlib.pyplot as plt

def plot_training_curves(history): 
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharex=True)
    ax = axes[0]
    ax.plot(history['acc'], label='train')
    ax.plot(history['val_acc'], label='validation')
    ax.set(
        title='model accuracy',
        ylabel='accuracy',
        xlabel='epoch')
    ax.legend()

    ax = axes[1]
    ax.plot(history['loss'], label='train')
    ax.plot(history['val_loss'], label='validation')
    ax.set(
        title='model loss',
        ylabel='loss',
        xlabel='epoch')
    ax.legend()
    fig.tight_layout()

In [None]:
import json 

!aws s3 cp {inputs} ./hvd_model/model.tar.gz
!tar -xzf ./hvd_model/model.tar.gz -C ./hvd_model

with open('./hvd_model/hvd_history.p', "r") as f:
    hvd_history = json.load(f)
    
plot_training_curves(hvd_history)

## Model Deployment with Amazon Elastic Inference

Amazon SageMaker supports both real time inference and batch inference. In this notebook, we will focus on setting up an Amazon SageMaker hosted endpoint for real time inference with TensorFlow Serving (TFS).  Additionally, we will discuss why and how to use Amazon Elastic Inference with the hosted endpoint.

### Deploying the Model

When considering the overall cost of a machine learning workload, inference often is the largest part, up to 90% of the total.  If a GPU instance type is used for real time inference, it typically is not fully utilized because, unlike training, real time inference usually does not involve continuously sending large batches of data to the model.  Elastic Inference provides GPU acceleration suited for inference, allowing you to add just the right amount of inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.

The `deploy` method of the Model object creates an endpoint which serves prediction requests in near real time. To use Elastic Inference with the SageMaker TFS container, simply provide an `accelerator_type` parameter, which determines the type of accelerator that is attached to your endpoint. Refer to the **Inference Acceleration** section of the [instance types chart](https://aws.amazon.com/sagemaker/pricing/instance-types) for a listing of the supported types of accelerators.

Here we'll use a general purpose CPU compute instance type along with an Elastic Inference accelerator:  together they are much cheaper than the smallest P3 GPU instance type.

In [None]:
from sagemaker.tensorflow.serving import Model

tf_model = Model(model_data=inputs, role=role, framework_version='1.14')

predictor = tf_model.deploy(initial_instance_count=1,
                            instance_type='ml.m5.xlarge')

In [None]:
# Note: this deployment fails with the error
# "The accelerator for production variant AllTraffic did not pass the ping health check."
predictor = tf_model.deploy(initial_instance_count=1,
                            instance_type='ml.m5.4xlarge',
                            accelerator_type='ml.eia1.medium')

By using a general purpose CPU instance with an Elastic Inference accelerator instead of a GPU instance, substantial cost savings are achieved. As of Q4 2019, On-Demand pricing for those resources is \\$0.269 per hour (ml.m5.xlarge), plus \\$0.182 per hour (ml.eia1.medium), for a total of \\$0.451 per hour. The total cost compared to the pricing of the smallest P3 family (NVIDIA Volta V100) GPU instance is as follows:

- Elastic Inference solution: \\$0.451 per hour
- GPU instance ml.p3.2xlarge: \\$4.284 per hour

To summarize, the Elastic Inference solution cost is about 10% of the cost of using a full P3 family GPU instance. 
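
The comparison above can be reproduced with a few lines of arithmetic. This is only a sketch that reuses the Q4 2019 On-Demand prices quoted in the text; the variable names are illustrative, and current prices should be checked before drawing conclusions.

```python
# On-Demand hourly prices quoted in the text (as of Q4 2019, in USD).
CPU_INSTANCE_PRICE = 0.269    # ml.m5.xlarge
EI_ACCELERATOR_PRICE = 0.182  # ml.eia1.medium
GPU_INSTANCE_PRICE = 4.284    # ml.p3.2xlarge

# Total hourly cost of the CPU instance plus the attached accelerator.
ei_total = round(CPU_INSTANCE_PRICE + EI_ACCELERATOR_PRICE, 3)

# Fraction of the GPU instance price that the Elastic Inference setup costs.
cost_ratio = ei_total / GPU_INSTANCE_PRICE

print("Elastic Inference solution: {:.3f} USD per hour".format(ei_total))
print("Share of the ml.p3.2xlarge price: {:.1%}".format(cost_ratio))
```

The ratio comes out slightly above one tenth, which is where the "about 10%" figure comes from.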

### Labels and Sample Data

Now that we have a Predictor object wrapping a real time Amazon SageMaker hosted endpoint, we'll define the label names and look at a sample of 10 images, one from each class.

In [None]:
from IPython.display import Image, display

os.system("aws s3 cp s3://sagemaker-workshop-pdx/cifar-10-module/sample-img ./sample-img --recursive --quiet")

labels = ['airplane','automobile','bird','cat','deer','dog','frog','horse','ship','truck']

images = []
for entry in os.scandir('sample-img'):
    if entry.is_file() and entry.name.endswith("png"):
        images.append('sample-img/' + entry.name)

for image in images:
    display(Image(image))

### Pre/post-processing Script

The TFS container in Amazon SageMaker by default uses the TFS REST API to serve prediction requests. This requires the input data to be converted to JSON format.  One way to do this is to create a Docker container to do the conversion, then create an overall Amazon SageMaker model that links the conversion container to the TFS container with the model. This is known as an Amazon SageMaker Inference Pipeline, as demonstrated in another [sample notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_batch_transform/working_with_tfrecords).  

However, as a more convenient alternative for many use cases, the Amazon SageMaker TFS container provides a data pre/post-processing script feature that allows you to simply supply a data transformation script. Using such a script, there is no need to build containers or work directly with Docker. In its simplest form, the script need only (1) implement an `input_handler` and `output_handler` interface, as shown in the code below, (2) be named `inference.py`, and (3) be placed in a `/code` directory.
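
To make the handler interface concrete, here is a minimal sketch of what such a script might look like. It is not the script downloaded in this notebook: it assumes the `input_handler(data, context)` and `output_handler(response, context)` signatures used by the SageMaker TFS container, and the CSV branch and the stand-in objects in the smoke test are hypothetical, chosen only so the example runs without an endpoint.

```python
import io
import json
from types import SimpleNamespace


def input_handler(data, context):
    """Turn the raw request body into a TFS REST API request body (JSON)."""
    if context.request_content_type == "application/json":
        # Already JSON: forward the body unchanged.
        return data.read().decode("utf-8")
    if context.request_content_type == "text/csv":
        # Hypothetical CSV support: one row of comma-separated floats.
        values = [float(x) for x in data.read().decode("utf-8").split(",")]
        return json.dumps({"instances": [values]})
    raise ValueError(
        "Unsupported content type: {}".format(context.request_content_type))


def output_handler(response, context):
    """Pass the TFS response through unchanged, using the requested accept type."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    return response.content, context.accept_header


# Local smoke test with stand-in objects; the container supplies the real ones.
demo_context = SimpleNamespace(request_content_type="text/csv",
                               accept_header="application/json")
demo_request = input_handler(io.BytesIO(b"1.0,2.5"), demo_context)
demo_response = SimpleNamespace(status_code=200,
                                content=b'{"predictions": [[0.1]]}')
demo_output = output_handler(demo_response, demo_context)
print(demo_request)
```

Keeping both handlers as plain functions makes them easy to exercise locally like this before packaging the script with the model.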

In [None]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/tf-distribution-options/code/inference.py
#!cat ./inference.py

On the input preprocessing side, the code takes an image read from Amazon S3 and converts it to the required TFS REST API input format.  On the output postprocessing side, the script simply passes through the predictions in the standard TFS format without modifying them. Alternatively, we could have just returned a class label for the class with the highest score, or performed other postprocessing that would be helpful to the application consuming the predictions. 
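
As an illustration of the alternative mentioned above, the following sketch shows an `output_handler` that returns the label of the highest-scoring class instead of the raw scores. It assumes the TFS response body has the form `{"predictions": [[score, ...], ...]}` with one score per class in the order of the notebook's label list; the `top_labels` helper, the `"labels"` output key, and the stand-in response object are hypothetical.

```python
import json
from types import SimpleNamespace

# Class names in the order used by this notebook.
LABELS = ['airplane', 'automobile', 'bird', 'cat', 'deer',
          'dog', 'frog', 'horse', 'ship', 'truck']


def top_labels(tfs_response_body):
    """Return the highest-scoring label for each row of predictions."""
    predictions = json.loads(tfs_response_body)["predictions"]
    return [LABELS[max(range(len(row)), key=row.__getitem__)]
            for row in predictions]


def output_handler(response, context):
    """Replace the raw class scores with the predicted label names."""
    if response.status_code != 200:
        raise ValueError(response.content.decode("utf-8"))
    body = json.dumps({"labels": top_labels(response.content)})
    return body, "application/json"


# Local check with a stand-in response whose highest score is at index 3.
demo_scores = [[0.01, 0.02, 0.05, 0.80, 0.02, 0.04, 0.02, 0.02, 0.01, 0.01]]
demo_response = SimpleNamespace(
    status_code=200,
    content=json.dumps({"predictions": demo_scores}).encode("utf-8"))
demo_body, demo_content_type = output_handler(demo_response, None)
print(demo_body)
```

With a handler like this, the client would no longer need to map scores to labels itself, at the cost of hiding the per-class scores from the caller.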

### Requirements.txt

Besides an `inference.py` script implementing the handler interface, it may also be necessary to supply a `requirements.txt` file to ensure any necessary dependencies are installed in the container along with the script. For this script, in addition to the Python standard library, we used the Pillow and NumPy libraries.

In [None]:
!wget -q https://raw.githubusercontent.com/aws-samples/amazon-sagemaker-script-mode/master/tf-distribution-options/code/requirements.txt
!cat ./requirements.txt

### Make Predictions

Next we'll set up the Predictor object created by the `deploy` method call above. Since we are using a preprocessing script, we need to specify the Predictor's content type as `application/x-image` and override the default (JSON) serializer. We can now get predictions about the sample data displayed above simply by providing the raw .png image bytes to the Predictor.  

In [None]:
predictor.content_type = 'application/x-image'
predictor.serializer = None

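def get_prediction(file_path):
    """Send the raw image bytes at file_path to the endpoint and return the predicted label."""
    with open(file_path, "rb") as image:
        payload = bytearray(image.read())
    # The response holds one row of class scores per image; pick the highest-scoring class.
    scores = predictor.predict(payload)['predictions']
    return labels[np.argmax(scores, axis=1)[0]]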

In [None]:
import time
start = time.time()
predictions = [get_prediction(image) for image in images]
end = time.time()
print(end - start)
print(predictions)

## Cleanup

To avoid incurring charges due to a stray endpoint, delete the Amazon SageMaker endpoint if you no longer need it:

In [None]:
sagemaker_session.delete_endpoint(predictor.endpoint)