# Serving a TensorFlow Model as a REST Endpoint with TensorFlow Serving and SageMaker

We need to understand the application and business context to choose between real-time and batch predictions. Are we trying to optimize for latency or throughput? Does the application require our models to scale automatically throughout the day to handle cyclic traffic requirements? Do we plan to compare models in production through A/B tests?

If our application requires low latency, we should deploy the model as a real-time API that returns predictions for individual requests over HTTPS. We can deploy, scale, and compare our model prediction servers with SageMaker Endpoints.

## Interesting Reads
* [**How Roblox Scaled BERT to Serve 1+ Billion Daily Requests on CPUs**](https://blog.roblox.com/2020/05/scaled-bert-serve-1-billion-daily-requests-cpus/)

<img src="img/sagemaker-architecture.png" width="80%" align="left">

In [1]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [2]:
%store -r training_job_name

In [3]:
try:
    training_job_name
    print('[OK]')
except NameError:
    print('+++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the previous TRAIN section before you continue.')
    print('+++++++++++++++++++++++++++++++')

[OK]


In [4]:
print(training_job_name)

tensorflow-training-2020-12-30-07-28-23-054


# Copy the Model to the Notebook

In [5]:
!aws s3 cp s3://$bucket/$training_job_name/output/model.tar.gz ./model.tar.gz

download: s3://sagemaker-us-east-1-835319576252/tensorflow-training-2020-12-30-07-28-23-054/output/model.tar.gz to ./model.tar.gz


In [6]:
!mkdir -p ./model/
!tar -xvzf ./model.tar.gz -C ./model/

transformers/
transformers/fine-tuned/
transformers/fine-tuned/tf_model.h5
transformers/fine-tuned/config.json
tensorboard/
metrics/
metrics/confusion_matrix.png
tensorflow/
tensorflow/saved_model/
tensorflow/saved_model/0/
tensorflow/saved_model/0/variables/
tensorflow/saved_model/0/variables/variables.index
tensorflow/saved_model/0/variables/variables.data-00000-of-00001
tensorflow/saved_model/0/assets/
tensorflow/saved_model/0/saved_model.pb
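
As a hedged alternative to the shell commands above, the archive layout can also be checked from Python with the standard-library `tarfile` module. The helper name `find_saved_model_dirs` is hypothetical, and the demo archive below is a tiny in-memory stand-in that only mimics two of the paths listed above; it is not the real `model.tar.gz`.

```python
import io
import posixpath
import tarfile


def find_saved_model_dirs(tar):
    """Return sorted directories in an open tarfile that contain a saved_model.pb file."""
    dirs = set()
    for member in tar.getmembers():
        if member.isfile() and posixpath.basename(member.name) == "saved_model.pb":
            dirs.add(posixpath.dirname(member.name))
    return sorted(dirs)


def _demo_archive():
    """Build a small in-memory archive that mimics the listing above (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name in ("tensorflow/saved_model/0/saved_model.pb",
                     "transformers/fine-tuned/config.json"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


with tarfile.open(fileobj=_demo_archive(), mode="r:gz") as demo_tar:
    saved_model_dirs = find_saved_model_dirs(demo_tar)

print(saved_model_dirs)
```

Against the real download, the same helper could be called on `tarfile.open('./model.tar.gz', 'r:gz')`. The numeric directory (`0` here) is the version folder that TensorFlow Serving looks for.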


In [7]:
!saved_model_cli show --all --dir ./model/tensorflow/saved_model/0/

2020-12-30 08:17:15.872153: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib:/usr/local/cuda-10.0/efa/lib:/opt/amazon/efa/lib:/opt/amazon/efa/lib64:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:
2020-12-30 08:17:15.872241: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib:/usr/local/cuda-10.0/ef

# Show `inference.py`

In [10]:
# !pygmentize ./model/code/inference.py

# Deploy the Model
This will create a default `EndpointConfig` with a single model.  

The next notebook will demonstrate how to perform more advanced `EndpointConfig` strategies to support canary rollouts and A/B testing.

_Note:  If not using a US-based region, you may need to adapt the container image to your current region using the following table:_

https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
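
To make the "default `EndpointConfig` with a single model" more concrete, here is a minimal sketch of the request that such a config roughly corresponds to in the low-level boto3 API. The helper `build_endpoint_config` and the demo names are hypothetical; the variant name `AllTraffic` is our assumption about what the SageMaker Python SDK uses by default.

```python
def build_endpoint_config(endpoint_config_name, model_name,
                          instance_type="ml.m5.4xlarge", instance_count=1):
    """Build keyword arguments for boto3's SageMaker create_endpoint_config call."""
    return {
        "EndpointConfigName": endpoint_config_name,
        # A single production variant receives 100% of the traffic.
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",  # assumed default variant name
                "ModelName": model_name,
                "InitialInstanceCount": instance_count,
                "InstanceType": instance_type,
                "InitialVariantWeight": 1.0,
            }
        ],
    }


config = build_endpoint_config("demo-config", "demo-model")
print(config)
```

With AWS credentials in place, the dictionary could be passed as `sm.create_endpoint_config(**config)`. Canary rollouts and A/B tests add more entries to `ProductionVariants` with different weights.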

In [11]:
import time

timestamp = int(time.time())

tensorflow_model_name = '{}-{}-{}'.format(training_job_name, 'tf', timestamp)

print(tensorflow_model_name)

tensorflow-training-2020-12-30-07-28-23-054-tf-1609316254


In [12]:
from sagemaker.tensorflow.model import TensorFlowModel

tensorflow_model = TensorFlowModel(name=tensorflow_model_name,
                                   model_data='s3://{}/{}/output/model.tar.gz'.format(bucket, training_job_name),
                                   role=role,                
                                   framework_version='2.1.0')

In [13]:
tensorflow_endpoint_name = '{}-{}-{}'.format(training_job_name, 'tf', timestamp)

print(tensorflow_endpoint_name)

tensorflow-training-2020-12-30-07-28-23-054-tf-1609316254


In [14]:
# deploy() returns a predictor, so we keep it separate from the model object
tensorflow_predictor = tensorflow_model.deploy(endpoint_name=tensorflow_endpoint_name,
                                               initial_instance_count=1, # Should use >=2 for high(er) availability
                                               instance_type='ml.m5.4xlarge', # requires enough disk space for tensorflow, transformers, and bert downloads
                                               wait=False)

update_endpoint is a no-op in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [15]:
tensorflow_endpoint_arn = sm.describe_endpoint(EndpointName=tensorflow_endpoint_name)['EndpointArn']
print(tensorflow_endpoint_arn)

arn:aws:sagemaker:us-east-1:835319576252:endpoint/tensorflow-training-2020-12-30-07-28-23-054-tf-1609316254


In [16]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">SageMaker REST Endpoint</a></b>'.format(region, tensorflow_endpoint_name)))


# _Wait Until the Endpoint is Deployed_

In [17]:
%%time

waiter = sm.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=tensorflow_endpoint_name)

CPU times: user 135 ms, sys: 17 ms, total: 152 ms
Wall time: 6min 31s
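
The boto3 waiter above hides the polling loop. As a rough sketch of what waiting for an endpoint involves, the loop below polls `describe_endpoint` until the status is `InService` or `Failed`. The function `wait_for_endpoint` and the `FakeSageMakerClient` class are hypothetical; the fake client only exists so the example runs without AWS access.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, max_attempts=120,
                      sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService, fails, or we give up."""
    status = None
    for _ in range(max_attempts):
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint {} failed to deploy".format(endpoint_name))
        sleep(poll_seconds)
    raise TimeoutError("Endpoint {} still {} after {} attempts".format(
        endpoint_name, status, max_attempts))


class FakeSageMakerClient:
    """Stand-in for the boto3 client that replays a fixed list of statuses (demo only)."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


fake_client = FakeSageMakerClient(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(fake_client, "demo-endpoint", sleep=lambda seconds: None)
print(final_status)
```

In the notebook, the real `sm` client and `tensorflow_endpoint_name` would take the place of the fake client and demo name.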


# _Wait Until the ^^ Endpoint ^^ is Deployed_

# Show the Experiment Tracking Lineage

In [18]:
from sagemaker.lineage.visualizer import LineageTableVisualizer

lineage_table_viz = LineageTableVisualizer(sess)
lineage_table_viz_df = lineage_table_viz.show(endpoint_arn=tensorflow_endpoint_arn)
lineage_table_viz_df

Unnamed: 0,Name/Source,Direction,Type,Association Type,Lineage Type
0,tensorflow-training-2020-12-30-07-28-23-054-tf...,Input,ModelDeployment,AssociatedWith,action


# Test the Deployed Model

TODO:  Create a custom serializer modeled on the SageMaker Python SDK serializers:  https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/serializers.py

In [37]:
# class RequestHandler(object):
#   import json
  
#   def __init__(self, tokenizer, max_seq_length):
#     self.tokenizer = tokenizer
#     self.max_seq_length = max_seq_length

#   def __call__(self, instances):
#     transformed_instances = []

#     for instance in instances:
#       encode_plus_tokens = self.tokenizer.encode_plus(instance,
#                             pad_to_max_length=True,
#                             max_length=self.max_seq_length)

#       input_ids = encode_plus_tokens['input_ids']
#       input_mask = encode_plus_tokens['attention_mask']
#       segment_ids = [0] * self.max_seq_length

#       transformed_instance = {"input_ids": input_ids, 
#                   "input_mask": input_mask, 
#                   "segment_ids": segment_ids}

#       transformed_instances.append(transformed_instance)

#     transformed_data = {"instances": transformed_instances}

#     return json.dumps(transformed_data)


In [None]:
import json

import numpy as np
from sagemaker.serializers import SimpleBaseSerializer

class DistilBertSerializer(SimpleBaseSerializer):
    """Serialize data to DistilBERT format."""

    def serialize(self, data):
        """Serialize data of various formats to a JSON formatted string.
        Args:
            data (object): Data to be serialized.
        Returns:
            str: The data serialized as a JSON string.
        """
        if isinstance(data, dict):
            return json.dumps(
                {
                    key: value.tolist() if isinstance(value, np.ndarray) else value
                    for key, value in data.items()
                }
            )

        if hasattr(data, "read"):
            return data.read()

        if isinstance(data, np.ndarray):
            return json.dumps(data.tolist())

        return json.dumps(data)

In [43]:
# class ResponseHandler(object):
#   import json
#   import tensorflow as tf
    
#   def __init__(self, classes):
#     self.classes = classes
    
  
#   def __call__(self, response, accept_header):
#     import tensorflow as tf

#     response_body = response.read().decode('utf-8')

#     response_json = json.loads(response_body)

#     log_probabilities = response_json["predictions"]

#     predicted_classes = []

#     # Convert log_probabilities => softmax (all probabilities add up to 1) => argmax (final prediction)
#     for log_probability in log_probabilities:
#       softmax = tf.nn.softmax(log_probability)  
#       predicted_class_idx = tf.argmax(softmax, axis=-1, output_type=tf.int32)
#       predicted_class = self.classes[predicted_class_idx]
#       predicted_classes.append(predicted_class)

#     return predicted_classes
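
The softmax-then-argmax step described in the commented-out `ResponseHandler` can be sketched without TensorFlow. This is a minimal illustration that assumes the response body has the `{"predictions": [...]}` shape read by the handler above; the function names and the sample scores are made up for the example.

```python
import json
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predictions_to_classes(response_body, classes):
    """Map each row of scores in the response to its most likely class label."""
    rows = json.loads(response_body)["predictions"]
    predicted = []
    for row in rows:
        probabilities = softmax(row)
        best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
        predicted.append(classes[best_index])
    return predicted


# Illustrative scores only; these are not outputs of the deployed model.
sample_body = json.dumps({"predictions": [[-2.0, -1.0, 0.0, 1.0, 3.0],
                                          [2.5, 0.1, -1.0, -3.0, -4.0]]})
star_ratings = predictions_to_classes(sample_body, classes=[1, 2, 3, 4, 5])
print(star_ratings)
```

Because softmax preserves ordering, taking the argmax of the raw scores would give the same class; the softmax is only needed when we also want the probabilities.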


In [None]:
import codecs
import json

from sagemaker.deserializers import SimpleBaseDeserializer

class DistilBertDeserializer(SimpleBaseDeserializer):
    """Deserialize DistilBert data from an inference endpoint into a Python object."""

    def __init__(self, accept="application/json"):
        """Initialize a ``DistilBertDeserializer`` instance.
        Args:
            accept (union[str, tuple[str]]): The MIME type (or tuple of allowable MIME types) that
                is expected from the inference endpoint (default: "application/json").
        """
        super().__init__(accept=accept)

    def deserialize(self, stream, content_type):
        """Deserialize JSON data from an inference endpoint into a Python object.
        Args:
            stream (botocore.response.StreamingBody): Data to be deserialized.
            content_type (str): The MIME type of the data.
        Returns:
            object: The JSON-formatted data deserialized into a Python object.
        """
        try:
            return json.load(codecs.getreader("utf-8")(stream))
        finally:
            stream.close()
            

In [44]:
import json
from sagemaker.tensorflow.model import TensorFlowPredictor
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

request_handler = RequestHandler(tokenizer=tokenizer,
                                 max_seq_length=64)

response_handler = ResponseHandler(classes=[1, 2, 3, 4, 5])

predictor = TensorFlowPredictor(endpoint_name=tensorflow_endpoint_name,
                                sagemaker_session=sess,
                                serializer=request_handler,
                                deserializer=response_handler,
#                                content_type='application/json',
                                model_name='saved_model',
                                model_version=0)

In [45]:
# import json
# from sagemaker.tensorflow.model import TensorFlowPredictor

# predictor = TensorFlowPredictor(endpoint_name=tensorflow_endpoint_name,
#                                 sagemaker_session=sess,
#                                 model_name='saved_model',
#                                 model_version=0)

### Waiting for the Endpoint to be ready to Serve Predictions

In [46]:
# import time

# time.sleep(30)

# Predict the `star_rating` with Ad Hoc `review_body` Samples

In [47]:
reviews = ["This is great!"]

predicted_classes = predictor.predict(reviews)

for predicted_class, review in zip(predicted_classes, reviews):
    print('[Predicted Star Rating: {}]'.format(predicted_class), review)

AttributeError: 'RequestHandler' object has no attribute 'serialize'
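
The `AttributeError` above indicates that the predictor calls a `serialize` method on whatever object we pass as `serializer`, so a callable handler is no longer enough. Below is a minimal, duck-typed sketch of a serializer that builds the same `{"instances": [...]}` payload as the commented-out `RequestHandler`. The class name `TokenizingSerializer`, the `encode` parameter, and `toy_encode` are hypothetical; `toy_encode` is not DistilBERT tokenization, and padding with `0` is an assumption.

```python
import json


class TokenizingSerializer:
    """Sketch of a serializer exposing CONTENT_TYPE and serialize()."""

    CONTENT_TYPE = "application/json"

    def __init__(self, encode, max_seq_length):
        # `encode` is any callable mapping a string to a list of token ids (assumption).
        self.encode = encode
        self.max_seq_length = max_seq_length

    def serialize(self, instances):
        """Turn a list of strings into a JSON payload of fixed-length features."""
        transformed = []
        for text in instances:
            ids = list(self.encode(text))[: self.max_seq_length]
            padding = [0] * (self.max_seq_length - len(ids))
            transformed.append({
                "input_ids": ids + padding,
                "input_mask": [1] * len(ids) + padding,
                "segment_ids": [0] * self.max_seq_length,
            })
        return json.dumps({"instances": transformed})


def toy_encode(text):
    """Placeholder tokenizer for the demo: maps each word to its length."""
    return [len(word) for word in text.split()]


serializer = TokenizingSerializer(encode=toy_encode, max_seq_length=8)
payload = json.loads(serializer.serialize(["This is great!"]))
print(payload)
```

In the notebook, `encode` could wrap the real `tokenizer`, and subclassing one of the SDK's base serializer classes is probably the more robust route, since the exact attributes the predictor reads may differ between SDK versions.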

# Predict the `star_rating` with `review_body` Samples from our TSVs

In [None]:
import csv

df_reviews = pd.read_csv('./data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz', 
                         delimiter='\t', 
                         quoting=csv.QUOTE_NONE,
                         compression='gzip')
df_sample_reviews = df_reviews[['review_body', 'star_rating']].sample(n=100)
df_sample_reviews = df_sample_reviews.reset_index()
df_sample_reviews.shape

In [None]:
import pandas as pd

def predict(review_body):
    # The predictor expects a list of reviews, so wrap the single review and unwrap the result
    return predictor.predict([review_body])[0]

df_sample_reviews['predicted_class'] = df_sample_reviews['review_body'].map(predict)
df_sample_reviews.head(5)
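
Mapping over the column issues one HTTPS request per review. A hedged alternative is to group reviews into small batches so each request carries several instances. The helpers `chunked` and `predict_in_batches` are hypothetical, and `fake_predict` merely stands in for `predictor.predict` so the sketch runs offline.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(predict_fn, texts, batch_size=10):
    """Call predict_fn once per batch and flatten the results in input order."""
    results = []
    for batch in chunked(list(texts), batch_size):
        results.extend(predict_fn(batch))
    return results


def fake_predict(batch):
    """Stand-in for predictor.predict: returns the length of each review (demo only)."""
    return [len(text) for text in batch]


demo_reviews = ["good", "bad", "okay", "excellent", "meh"]
demo_predictions = predict_in_batches(fake_predict, demo_reviews, batch_size=2)
print(demo_predictions)
```

With a working predictor that returns one class per input, the column could be filled with `predict_in_batches(predictor.predict, df_sample_reviews['review_body'].tolist())`. The batch size has to stay within the endpoint's request payload limit.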

# Save for Next Notebook(s)

In [None]:
%store tensorflow_model_name

In [None]:
%store tensorflow_endpoint_name

In [None]:
%store tensorflow_endpoint_arn

In [None]:
%store

# Delete Endpoint
To save cost, we should delete the endpoint.

In [None]:
# sm.delete_endpoint(
#      EndpointName=tensorflow_endpoint_name
# )
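
Deleting the endpoint stops the instance charges but leaves its endpoint config behind. The sketch below shows one way to remove both; `delete_endpoint_resources` and `RecordingClient` are hypothetical names, and the recording client replaces the real `sm` client so nothing is actually deleted when the example runs.

```python
def delete_endpoint_resources(client, endpoint_name, delete_config=False):
    """Delete the endpoint and, optionally, the endpoint config it points to."""
    # Look up the config name before the endpoint disappears.
    config_name = client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    client.delete_endpoint(EndpointName=endpoint_name)
    deleted = ["endpoint:" + endpoint_name]
    if delete_config:
        client.delete_endpoint_config(EndpointConfigName=config_name)
        deleted.append("endpoint-config:" + config_name)
    return deleted


class RecordingClient:
    """Stand-in for the boto3 SageMaker client that only records calls (demo only)."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointConfigName": EndpointName + "-config"}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))


demo_client = RecordingClient()
deleted = delete_endpoint_resources(demo_client, "demo-endpoint", delete_config=True)
print(deleted)
```

If later notebooks reuse `tensorflow_endpoint_name`, the cleanup should wait until they have finished.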

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();