# Serving a TensorFlow Model as a REST Endpoint with TensorFlow Serving and SageMaker

We need to understand the application and business context to choose between real-time and batch predictions. Are we trying to optimize for latency or throughput? Does the application require our models to scale automatically throughout the day to handle cyclic traffic requirements? Do we plan to compare models in production through A/B tests?

If our application requires low latency, we should deploy the model as a real-time API that serves individual prediction requests over HTTPS. With SageMaker Endpoints, we can deploy, scale, and compare our model prediction servers.

<img src="img/sagemaker-architecture.png" width="80%" align="left">

In [1]:
!pip install -q sagemaker==2.9.2

In [2]:
import boto3
import sagemaker
import pandas as pd

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

In [3]:
%store -r recommender_training_job_name

In [4]:
try:
    recommender_training_job_name
    print('[OK]')
except NameError:
    print('+++++++++++++++++++++++++++++++')
    print('[ERROR] Please run the notebooks in the previous TRAIN section before you continue.')
    print('+++++++++++++++++++++++++++++++')

[OK]


In [5]:
print(recommender_training_job_name)

tensorflow-training-201129-2302-005-aec2a92a


# Copy the Model from S3

In [6]:
!aws s3 cp s3://$bucket/$recommender_training_job_name/output/model.tar.gz ./model.tar.gz

download: s3://sagemaker-us-east-1-835319576252/tensorflow-training-201129-2302-005-aec2a92a/output/model.tar.gz to ./model.tar.gz


In [7]:
!mkdir -p ./deployed_model/
!tar -xvzf ./model.tar.gz -C ./deployed_model/

tensorflow/
tensorflow/saved_model/
tensorflow/saved_model/0/
tensorflow/saved_model/0/assets/
tensorflow/saved_model/0/variables/
tensorflow/saved_model/0/variables/variables.index
tensorflow/saved_model/0/variables/variables.data-00000-of-00001
tensorflow/saved_model/0/saved_model.pb
tensorboard/
tensorboard/train/
tensorboard/train/plugins/
tensorboard/train/plugins/profile/
tensorboard/train/plugins/profile/2020_11_29_23_19_30/
tensorboard/train/plugins/profile/2020_11_29_23_19_30/ip-10-2-212-34.ec2.internal.tensorflow_stats.pb
tensorboard/train/plugins/profile/2020_11_29_23_19_30/ip-10-2-212-34.ec2.internal.input_pipeline.pb
tensorboard/train/plugins/profile/2020_11_29_23_19_30/ip-10-2-212-34.ec2.internal.overview_page.pb
tensorboard/train/plugins/profile/2020_11_29_23_19_30/ip-10-2-212-34.ec2.internal.kernel_stats.pb
tensorboard/train/plugins/profile/2020_11_29_23_19_30/ip-10-2-212-34.ec2.internal.memory_profile.json.gz
tensorboard/train/plugins/profile/2020_11_29_23_19_30/ip-10-
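
The same unpack-and-inspect step can be sketched in Python with the standard library. This is a minimal sketch and not part of the original notebook: `extract_model` and `find_saved_model_dirs` are hypothetical helper names, and the archive is assumed to follow the layout in the listing above, with `saved_model.pb` inside a version directory.

```python
import os
import tarfile


def extract_model(archive_path, target_dir):
    """Unpack a gzip-compressed model archive into target_dir (created if missing)."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(path=target_dir)


def find_saved_model_dirs(root_dir):
    """Return the sorted directories under root_dir that contain a saved_model.pb file."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if "saved_model.pb" in filenames:
            matches.append(dirpath)
    return sorted(matches)
```

Calling `extract_model('./model.tar.gz', './deployed_model/')` followed by `find_saved_model_dirs('./deployed_model/')` should point at the directory that `saved_model_cli` expects in its `--dir` argument.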

In [8]:
!saved_model_cli show --all --dir ./deployed_model/tensorflow/saved_model/0/

2020-11-30 00:01:51.224430: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-11-30 00:01:51.224474: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input_1'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: serving_default_in

In [9]:
user_id = "42"

In [10]:
!saved_model_cli run --input_exprs 'input_1=np.array(["$user_id"])' --tag_set serve --signature_def serving_default --dir ./deployed_model/tensorflow/saved_model/0/

2020-11-30 00:01:57.402673: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-11-30 00:01:57.402719: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-11-30 00:02:00.084853: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-11-30 00:02:00.084909: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2020-11-30 00:02:00.084952: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395): /proc/driver/nvidia/version does not exist
2020-11-30 00:02:00

# Show `inference.py`

In [11]:
!pygmentize ./model/code/inference.py

import json
import subprocess
import sys


def input_handler(data, context):
    transformed_instances = []

    for instance in data:
        instance_str = instance.decode('utf-8')
        transformed_instances.append(instance_str)

    print(transformed_instances)

    transformed_data = {"instances": transformed_instances}
    print(transformed_data)

    transformed_data_json = json.dumps(transformed_data)
    print(transformed_data_json)

    return transformed_data_json


def output_handler(response, context):
    response_json = response.json()
    print('response_j
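
The core of `input_handler` is a small transformation: decode each incoming byte string and wrap the results in the `{"instances": [...]}` JSON body used by the TensorFlow Serving REST predict API. The sketch below isolates that step so it can be tried without a serving container. `to_tfs_request` is a hypothetical name, and a plain list of byte strings stands in for the request body that the container passes to the handler.

```python
import json


def to_tfs_request(raw_instances):
    """Decode byte strings and wrap them in a TensorFlow Serving 'instances' payload."""
    instances = [raw.decode("utf-8") for raw in raw_instances]
    return json.dumps({"instances": instances})


# Stand-in for the request body: one UTF-8 encoded user ID per element.
print(to_tfs_request([b"42", b"7"]))
```

For the two sample IDs, this prints `{"instances": ["42", "7"]}`.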

# Deploy the Model
Deploying the model creates a default `EndpointConfig` with a single model.

The next notebook demonstrates more advanced `EndpointConfig` strategies that support canary rollouts and A/B testing.

_Note:  If not using a US-based region, you may need to adapt the container image to your current region using the following table:_

https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
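
To make the "default `EndpointConfig` with a single model" concrete, the sketch below builds the arguments for the boto3 `create_endpoint_config` call with exactly one production variant that receives all traffic. It only illustrates the shape of the configuration and is not the code the SageMaker SDK runs internally. `build_endpoint_config`, `my-config`, and `my-model` are hypothetical names, and the variant name `AllTraffic` is an assumption.

```python
def build_endpoint_config(config_name, model_name, instance_type, instance_count=1):
    """Build keyword arguments for the boto3 SageMaker create_endpoint_config call."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                # A single variant: every request is routed to this one model.
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InitialInstanceCount": instance_count,
                "InstanceType": instance_type,
                "InitialVariantWeight": 1.0,
            }
        ],
    }


# Placeholder names; nothing is sent to AWS here.
config = build_endpoint_config("my-config", "my-model", "ml.m5.4xlarge")
print(config["ProductionVariants"][0]["VariantName"])
```

The resulting dictionary could be passed as `sm.create_endpoint_config(**config)`. An A/B test would add a second entry to `ProductionVariants` and split the weights between the two.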

In [12]:
import time

timestamp = int(time.time())

In [13]:
recommender_tensorflow_endpoint_name = '{}-{}-{}'.format(recommender_training_job_name, 'tf', timestamp)

print(recommender_tensorflow_endpoint_name)

tensorflow-training-201129-2302-005-aec2a92a-tf-1606694522


In [14]:
from sagemaker.tensorflow.model import TensorFlowModel

tensorflow_model = TensorFlowModel(name=recommender_tensorflow_endpoint_name,
                                   model_data='s3://{}/{}/output/model.tar.gz'.format(bucket, recommender_training_job_name),
                                   role=role,                
                                   framework_version='2.1.0')

In [15]:
tensorflow_endpoint = tensorflow_model.deploy(endpoint_name=recommender_tensorflow_endpoint_name,
                                              initial_instance_count=1, # Should use >=2 for high(er) availability 
                                              instance_type='ml.m5.4xlarge', # requires enough disk space for tensorflow, transformers, and bert downloads
                                              wait=False)

In [16]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/endpoints/{}">SageMaker REST Endpoint</a></b>'.format(region, recommender_tensorflow_endpoint_name)))


# _Wait Until the Endpoint is Deployed_

In [17]:
%%time

waiter = sm.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=recommender_tensorflow_endpoint_name)

CPU times: user 190 ms, sys: 15.7 ms, total: 206 ms
Wall time: 6min 31s
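
The waiter hides a simple polling loop. As a rough sketch of what that loop does, the function below calls `describe_endpoint` repeatedly until the status becomes `InService` or `Failed`. `wait_for_endpoint` is a hypothetical helper, and the polling interval and attempt limit are illustrative values rather than the waiter's actual settings.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_attempts=60):
    """Poll describe_endpoint until the endpoint is InService, has Failed, or attempts run out."""
    for _ in range(max_attempts):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError("Endpoint {} failed to deploy".format(endpoint_name))
        # Still in a transitional state such as Creating; wait and retry.
        time.sleep(poll_seconds)
    raise TimeoutError(
        "Endpoint {} not in service after {} polls".format(endpoint_name, max_attempts)
    )
```

With the client from earlier in this notebook, the call would be `wait_for_endpoint(sm, recommender_tensorflow_endpoint_name)`.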



# Test the Deployed Model

In [18]:
import json
from sagemaker.tensorflow.model import TensorFlowPredictor

predictor = TensorFlowPredictor(endpoint_name=recommender_tensorflow_endpoint_name,
                                sagemaker_session=sess,
                                model_name='saved_model',
                                model_version=0)

# Predict Recommendations for an Ad Hoc `user_id`

In [19]:
from pprint import pprint

user_id = "42"

recommendations = predictor.predict([user_id])

pprint(recommendations)

pprint(recommendations['predictions'][0]['output_2'])

{'predictions': [{'output_1': [0.0224526152,
                               0.0222037733,
                               0.0183034446,
                               0.0179341771,
                               0.017752476,
                               0.0173165798,
                               0.0167744718,
                               0.016768666,
                               0.016750684,
                               0.0165527295],
                  'output_2': ['Half Baked (1998)',
                               'Underneath, The (1995)',
                               'Baton Rouge (1988)',
                               'Leading Man, The (1996)',
                               'Firm, The (1993)',
                               'Sex, Lies, and Videotape (1989)',
                               'Ransom (1996)',
                               "Wooden Man's Bride, The (Wu Kui) (1994)",
                               'Cable Guy, The (1996)',
                               'Four 
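
`predictor.predict` wraps an HTTPS request to the endpoint. For applications that do not use the SageMaker Python SDK, the same call can be sketched with the low-level `sagemaker-runtime` client from boto3. This is a minimal sketch under stated assumptions: `invoke_recommender` and `top_titles` are hypothetical helpers, the request body is assumed to be the JSON list that the predictor sends, and the `model_name` and `model_version` settings used above are left out, so the container's default model is assumed.

```python
import json


def invoke_recommender(runtime_client, endpoint_name, user_id):
    """Send one user ID to the endpoint and return the parsed JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps([user_id]),
    )
    # The response body is a stream of bytes containing JSON.
    return json.loads(response["Body"].read().decode("utf-8"))


def top_titles(prediction_response, k=3):
    """Return the first k titles from a response shaped like the output above."""
    return prediction_response["predictions"][0]["output_2"][:k]
```

A client for this would be created with `boto3.client('sagemaker-runtime', region_name=region)` and passed in as `runtime_client`, together with `recommender_tensorflow_endpoint_name` and a user ID such as `"42"`.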

# Save for Next Notebook(s)

In [20]:
%store recommender_tensorflow_endpoint_name 

Stored 'recommender_tensorflow_endpoint_name' (str)


In [21]:
%store

Stored variables and their in-db values:
blazingtext_test_s3_uri                               -> 's3://sagemaker-us-east-1-835319576252/data/amazon
blazingtext_train_s3_uri                              -> 's3://sagemaker-us-east-1-835319576252/data/amazon
blazingtext_validation_s3_uri                         -> 's3://sagemaker-us-east-1-835319576252/data/amazon
ingest_create_athena_db_passed                        -> True
ingest_create_athena_table_parquet_passed             -> True
ingest_create_athena_table_tsv_passed                 -> True
raw_input_data_s3_uri                                 -> 's3://sagemaker-us-east-1-835319576252/DLAI/amazon
recommender_multitask_training_job_name               -> 'tensorflow-training-201129-2249-002-19a0db08'
recommender_tensorflow_endpoint_name                  -> 'tensorflow-training-201129-2302-005-aec2a92a-tf-1
recommender_training_job_name                         -> 'tensorflow-training-201129-2302-005-aec2a92a'
s3_private_path_tsv      

# Delete Endpoint
To save cost, we should delete the endpoint.

In [22]:
# sm.delete_endpoint(
#      EndpointName=recommender_tensorflow_endpoint_name
# )
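
Deleting the endpoint alone leaves its endpoint config and model resources in the account. A fuller cleanup could look like the sketch below, which looks up the config and model names before deleting anything. `delete_endpoint_resources` is a hypothetical helper. It should run only after the later notebooks that reuse the endpoint are finished.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models that config references."""
    # Look up the related resource names first, while the endpoint still exists.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return config_name, model_names
```

With the client from the start of this notebook, the call would be `delete_endpoint_resources(sm, recommender_tensorflow_endpoint_name)`.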