# SageMaker Real-time Dynamic Batching Inference with TorchServe

This notebook demonstrates dynamic batching on SageMaker with [TorchServe](https://github.com/pytorch/serve/) as the model server. It covers the following:
1. Batch inference using a Deep Learning Container (DLC), i.e. SageMaker's default backend container, through the SageMaker Python SDK in script mode.
2. Specifying TorchServe inference parameters through environment variables.
3. Optionally, using a custom container with a TorchServe config file baked into the image.

**Imports**

In [1]:
import base64
import json
import os
import time

import boto3
import matplotlib.pyplot as plt
import numpy as np
import sagemaker
from PIL import Image

**Initiate session and retrieve region, account details**

In [2]:
sm_sess = sagemaker.Session()
role = sagemaker.get_execution_role()

In [3]:
sess = boto3.Session()
region = sess.region_name
account = boto3.client("sts").get_caller_identity().get("Account")

**Prepare model**

In [4]:
bucket = sm_sess.default_bucket()
prefix = "ts-dynamic-batching"
model_file_name = "BERTSeqClassification"

!wget https://torchserve.s3.amazonaws.com/tar_gz_files/BERTSeqClassification.tar.gz
!aws s3 cp BERTSeqClassification.tar.gz s3://{bucket}/{prefix}/models/

f"s3://{bucket}/{prefix}/models/"

--2021-11-10 17:05:02--  https://torchserve.s3.amazonaws.com/tar_gz_files/BERTSeqClassification.tar.gz
Resolving torchserve.s3.amazonaws.com (torchserve.s3.amazonaws.com)... 52.216.179.99
Connecting to torchserve.s3.amazonaws.com (torchserve.s3.amazonaws.com)|52.216.179.99|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 405264576 (386M) [application/x-tar]
Saving to: ‘BERTSeqClassification.tar.gz.2’


2021-11-10 17:05:28 (14.9 MB/s) - ‘BERTSeqClassification.tar.gz.2’ saved [405264576/405264576]

upload: ./BERTSeqClassification.tar.gz to s3://sagemaker-us-west-2-850464037171/ts-dynamic-batching/models/BERTSeqClassification.tar.gz


's3://sagemaker-us-west-2-850464037171/ts-dynamic-batching/models/'

In [5]:
model_artifact = f"s3://{bucket}/{prefix}/models/{model_file_name}.tar.gz"

In [6]:
model_name = "hf-dynamic-torchserve-sagemaker"

## Using AWS Deep Learning Container
`Note: See end of notebook for using a custom container`

In [7]:
# We'll use a PyTorch inference DLC image that ships with sagemaker-pytorch-inference-toolkit v2.0.6.
# This version includes support for the TorchServe environment variables used below.
# The image is looked up in the session's region (retrieved above) rather than a hard-coded one.
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    py_version="py38",
    image_scope="inference",
    version="1.9",
    instance_type="ml.c5.4xlarge",
)

#### Create SageMaker model, deploy and predict

In [8]:
from sagemaker.pytorch.model import PyTorchModel

# Batch up to 3 requests together; wait up to 100000 ms (100 s) for a batch to fill.
env_variables_dict = {"SAGEMAKER_TS_BATCH_SIZE": "3", "SAGEMAKER_TS_MAX_BATCH_DELAY": "100000"}

pytorch_model = PyTorchModel(
    model_data=model_artifact,
    role=role,
    image_uri=image_uri,
    source_dir="code",
    framework_version="1.9",
    entry_point="inference.py",
    env=env_variables_dict,
)

In [9]:
# Change the instance type as necessary, or use 'local' to run in SageMaker local mode
instance_type = "ml.c5.9xlarge"

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.BytesDeserializer(),
)

------!

## Predictions

#### The following prediction call can time out on certain instance types: a lone request waits for the batch to fill, and the maximum batch delay exceeds SageMaker's 60-second invocation limit

In [10]:
import time

start = time.time()
result = predictor.predict(
    "{Bloomberg has decided to publish a new report on global economic situation.}"
)
print("TIME:", time.time() - start)
print("ENDPOINT RESULT:", result)

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from primary with message "Your invocation timed out while waiting for a response from container primary. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logEventViewer:group=/aws/sagemaker/Endpoints/pytorch-inference-2021-11-10-17-06-07-915 in account 850464037171 for more information.
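
The timeout above follows from the batching settings: with a batch size of 3 and a maximum batch delay of 100 seconds, a single request sits in the queue until the delay expires. The sketch below is a simplified model of that rule, not TorchServe's actual implementation. The function name `simulate_batches` and the constant `SAGEMAKER_TIMEOUT_S` are hypothetical and used only for illustration; it assumes a batch is dispatched as soon as it is full or its delay has elapsed, whichever comes first.

```python
from typing import List, Tuple

SAGEMAKER_TIMEOUT_S = 60.0  # invocation limit mentioned in this notebook


def simulate_batches(
    arrivals: List[float], batch_size: int, max_delay_s: float
) -> List[Tuple[float, List[int]]]:
    """Return (dispatch_time, request_indices) for each batch.

    Simplified model (an assumption, not TorchServe code): a batch opens when
    its first request arrives and is dispatched once it holds `batch_size`
    requests or `max_delay_s` seconds after it opened, whichever comes first.
    """
    batches: List[Tuple[float, List[int]]] = []
    order = sorted(range(len(arrivals)), key=lambda i: arrivals[i])
    current: List[int] = []
    deadline = 0.0
    for i in order:
        t = arrivals[i]
        if current and t > deadline:
            # The open batch timed out before this request arrived.
            batches.append((deadline, current))
            current = []
        if not current:
            deadline = t + max_delay_s
        current.append(i)
        if len(current) == batch_size:
            # Batch is full: dispatch immediately.
            batches.append((t, current))
            current = []
    if current:
        batches.append((deadline, current))
    return batches


# One request alone waits the full 100 s, longer than the 60 s limit.
single = simulate_batches([0.0], batch_size=3, max_delay_s=100.0)
print(single, single[0][0] > SAGEMAKER_TIMEOUT_S)

# Three near-simultaneous requests fill the batch and go out right away.
print(simulate_batches([0.0, 0.1, 0.2], batch_size=3, max_delay_s=100.0))
```

Under this model, sending three requests at about the same time is what lets the batch dispatch well inside the invocation limit.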

#### By spawning a pool of 3 processes we're able to simulate requests from multiple clients and verify inference results

In [None]:
import multiprocessing


def invoke(endpoint_name):
    predictor = sagemaker.predictor.Predictor(
        endpoint_name,
        sm_sess,
        serializer=sagemaker.serializers.JSONSerializer(),
        deserializer=sagemaker.deserializers.BytesDeserializer(),
    )
    return predictor.predict(
        "{Bloomberg has decided to publish a new report on global economic situation.}"
    )


endpoint_name = predictor.endpoint_name
pool = multiprocessing.Pool(3)
results = pool.map(invoke, 3 * [endpoint_name])
pool.close()
pool.join()
print(results)
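
Because each invocation mostly waits on the network, a thread pool is a possible alternative to the process pool above. The following is a minimal sketch under that assumption; `invoke_concurrently` is a hypothetical helper, and it takes any callable with the same shape as the `invoke` function defined in the previous cell.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def invoke_concurrently(
    invoke_fn: Callable[[str], Any], endpoint_name: str, n_requests: int = 3
) -> List[Any]:
    """Call invoke_fn(endpoint_name) n_requests times from parallel threads."""
    with ThreadPoolExecutor(max_workers=n_requests) as executor:
        # executor.map returns results in the order the inputs were submitted.
        return list(executor.map(invoke_fn, [endpoint_name] * n_requests))


# Hypothetical usage with the notebook's own function and endpoint:
# results = invoke_concurrently(invoke, predictor.endpoint_name, n_requests=3)
```

With `n_requests` equal to the configured batch size, the three calls should arrive close enough together to fill one batch.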

In [None]:
# delete_endpoint() takes no endpoint name; it deletes the predictor's own endpoint.
predictor.delete_endpoint()

## Using a custom container

#### Details (Also see ./docker/)
* Prebaked config.properties file included
  * Batch delay of 1.66 minutes (longer than SageMaker's 60-second timeout)
  * Batch size of 3
* Note: the image needs to be built and pushed only once.

In [None]:
%%sh

container_name=custom-dynamic-torchserve
account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${container_name}"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${container_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${container_name}" > /dev/null
fi

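# Log in to ECR. `aws ecr get-login` was removed in AWS CLI v2, so pipe the
# password from `get-login-password` into `docker login` instead.
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"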

# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build  -t ${container_name} docker/
docker tag ${container_name} ${fullname}

docker push ${fullname}
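
Once the image is in ECR, it could be deployed from a Python cell in much the same way as the DLC-based model. The sketch below is one possible approach, not the notebook's confirmed method: `ecr_image_uri` and `deploy_custom_container` are hypothetical helpers, and it assumes the image built from `./docker/` follows the SageMaker hosting contract and serves the model passed as `model_data`. The `latest` tag matches what `docker tag` and `docker push` produce when no tag is given.

```python
def ecr_image_uri(account: str, region: str, container_name: str, tag: str = "latest") -> str:
    """Build the ECR image URI used by the shell cell above, with an explicit tag."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{container_name}:{tag}"


def deploy_custom_container(image_uri, model_data, role, instance_type="ml.c5.9xlarge"):
    """Deploy a generic SageMaker Model backed by the custom image.

    Assumption: the container already includes config.properties, so no
    TorchServe environment variables are passed here.
    """
    import sagemaker  # imported lazily so the URI helper works without the SDK
    from sagemaker.model import Model

    model = Model(
        image_uri=image_uri,
        model_data=model_data,
        role=role,
        predictor_cls=sagemaker.predictor.Predictor,  # makes deploy() return a Predictor
    )
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        serializer=sagemaker.serializers.JSONSerializer(),
        deserializer=sagemaker.deserializers.BytesDeserializer(),
    )


# Hypothetical usage with variables defined earlier in this notebook:
# custom_uri = ecr_image_uri(account, region, "custom-dynamic-torchserve")
# custom_predictor = deploy_custom_container(custom_uri, model_artifact, role)
```

The returned predictor can then be exercised with the same concurrent requests used for the DLC endpoint, and deleted afterwards with `delete_endpoint()`.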

## Conclusion

This exercise covered the basics of batch inference with TorchServe on Amazon SageMaker. Inference requests from different processes or users can be batched together, and the model then processes them as a single batch of inputs. There are two ways to set this up: use SageMaker's default DLC as the base environment and supply an inference.py script with the model, or build a custom container for more involved workflows.