### Amazon SageMaker Asynchronous Inference with Hugging Face Model
_**A new near real-time Inference option for generating machine learning model predictions**_

**Table of Contents**

* [Background](#background)
* [Notebook Scope](#scope)
* [Overview and sample end to end flow](#overview)
* [Section 1 - Setup](#setup) 
    * [Create Model](#createmodel)
    * [Create EndpointConfig](#endpointconfig)
    * [Create Endpoint](#create-endpoint)
* [Section 2 - Using the Endpoint](#endpoint) 
    * [Uploading the Request Payload](#upload)
    * [Invoke Endpoint](#invoke-endpoint)
    * [Check Output Location](#check-output) 
* [Section 3 - Clean up](#clean)

### Background <a id='background'></a>  
Amazon SageMaker Asynchronous Inference is a new capability in SageMaker that queues incoming requests and processes them asynchronously. SageMaker currently offers two inference options for customers to deploy machine learning models: 1) a real-time option for low-latency workloads, and 2) batch transform, an offline option to process inference requests on batches of data available upfront. Real-time inference is suited for workloads with payload sizes of less than 6 MB that require inference requests to be processed within 60 seconds. Batch transform is suitable for offline inference on batches of data.

Asynchronous inference is a new inference option for near real-time inference needs. Requests can take up to 15 minutes to process and have payload sizes of up to 1 GB. Asynchronous inference is suitable for workloads that have relaxed latency requirements rather than sub-second ones. For example, you might need to process an inference on a large image of several MBs within 5 minutes. In addition, asynchronous inference endpoints let you control costs by scaling the endpoint's instance count down to zero when it is idle, so you only pay when your endpoints are processing requests.

### Notebook scope <a id='scope'></a>  
This notebook introduces the SageMaker Asynchronous Inference capability with Hugging Face models. It covers the steps required to create an asynchronous inference endpoint and test it with some sample requests.

### Overview <a id='overview'></a>
Asynchronous inference endpoints have many similarities (and some key differences) compared to real-time endpoints. The process to create asynchronous endpoints is similar to real-time endpoints: you create a model, an endpoint configuration, and then an endpoint. However, some configuration parameters are specific to asynchronous inference endpoints, which we will explore below.

Invocation of asynchronous endpoints differs from real-time endpoints. Rather than passing the request payload inline with the request, you upload the payload to Amazon S3 and pass an Amazon S3 URI as part of the request. Upon receiving the request, SageMaker provides you with a token with the output location where the result will be placed once processed. Internally, SageMaker maintains a queue with these requests and processes them. During endpoint creation, you can optionally specify an Amazon SNS topic to receive success or error notifications. Once you receive the notification that your inference request has been successfully processed, you can access the result in the output Amazon S3 location.

---
## 1. Setup <a id='setup'></a>

First, make sure an up-to-date version of the SageMaker Python SDK is installed so that the latest SageMaker features are available:

In [None]:
!python -m pip install --upgrade pip --quiet
!pip install -U awscli --quiet
!pip install --upgrade sagemaker --quiet

Import the required Python libraries:

In [None]:
from time import gmtime, strftime
from sagemaker import image_uris
import sagemaker
import logging
import boto3
import json

In [None]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

In [None]:
logger.info(f"Using SageMaker version: {sagemaker.__version__}")

In [None]:
region = sagemaker.Session().boto_region_name
role = sagemaker.get_execution_role()
boto3.setup_default_session(region_name=region)
boto_session = boto3.Session(region_name=region)
sm_session = sagemaker.session.Session()
sagemaker_client = boto_session.client("sagemaker")
sm_runtime = boto_session.client("sagemaker-runtime")
s3_bucket = sm_session.default_bucket()
current_timestamp = strftime("%m-%d-%H-%M", gmtime())
logger.info(f"Region = {region}")
logger.info(f"Role = {role}")

Specify your IAM role. Go to the AWS IAM console (https://console.aws.amazon.com/iam/home) and add the following policies to your IAM role:
* `AmazonSageMakerFullAccess`
* Amazon S3 access: Apply this to get and put objects in your Amazon S3 bucket. Replace `<bucket_name>` with the name of your Amazon S3 bucket:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::<bucket_name>",
                "arn:aws:s3:::<bucket_name>/*"
            ]
        }
    ]
}
```

* (Optional) Amazon SNS access: Add `sns:Publish` on the topics you define. Apply this if you plan to use Amazon SNS to receive notifications.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sns:Publish"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sns:us-east-2:123456789012:MyTopic"
        }
    ]
}
```

* (Optional) AWS KMS access: Add decrypt and encrypt permissions if your Amazon S3 bucket is encrypted.
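
As a rough illustration of the Amazon S3 policy listed above, the sketch below builds the policy document in Python for a given bucket. The function name `build_s3_access_policy` and the bucket name `my-example-bucket` are hypothetical. One assumption worth flagging: the sketch scopes `s3:ListBucket` to the bucket ARN, because that action applies to buckets rather than objects, while the object-level actions stay on `<bucket_name>/*`.

```python
import json


def build_s3_access_policy(bucket_name: str) -> dict:
    """Return an IAM policy document granting the S3 access listed above.

    Hypothetical helper for illustration; it only builds a dictionary and
    does not call AWS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions apply to keys inside the bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:AbortMultipartUpload",
                ],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
            {
                # s3:ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
        ],
    }


# "my-example-bucket" is a placeholder bucket name.
policy = build_s3_access_policy("my-example-bucket")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console as an inline policy for the role.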

Specify your SageMaker IAM role (`role`) and Amazon S3 bucket. You can optionally use the default SageMaker session IAM role and Amazon S3 bucket. Make sure the role you use has the necessary permissions for SageMaker, Amazon S3, and optionally Amazon SNS.

### 1.1 Create Model <a id='createmodel'></a>
Specify the location of the pre-trained model artifact stored in Amazon S3. This example uses a pre-trained Hugging Face model (https://huggingface.co/finiteautomata/beto-sentiment-analysis) packaged as `sentimentanalysis.tar.gz`. The full Amazon S3 URI is stored in the string variable `MODEL_DATA_URL`.

In [None]:
MODEL_DATA_URL = "s3://asyncendpointexperiment/sentimentanalysis.tar.gz"

Specify a primary container. For the primary container, you specify the Docker image that contains the inference code, the model artifacts (from prior training), and a custom environment map that the inference code uses when you deploy the model for predictions. In this example, we retrieve the appropriate container image by specifying the framework and its version; here, that is the container image for the Hugging Face framework. For details on the right container image for your use case, see https://github.com/awsdocs/amazon-sagemaker-developer-guide/blob/master/doc_source/ and look in the ECR folder for your region of interest.

In [None]:
ecr_image = image_uris.retrieve(
    framework="huggingface",
    region=region,
    version="4.6.1",
    image_scope="inference",
    base_framework_version="pytorch1.7.1",
    py_version="py36",
    container_version="ubuntu18.04",
    instance_type="ml.m5.xlarge",
)
ecr_image

In [None]:
model_name = "beto-sentiment-analysis-async"

Create a model by specifying the `ModelName`, the `ExecutionRoleArn` (the ARN of the IAM role that Amazon SageMaker can assume to access model artifacts and Docker images for deployment), and the `PrimaryContainer`.

In [None]:
response = sagemaker_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": ecr_image,
        "ModelDataUrl": MODEL_DATA_URL,
        "Environment": {
            "HF_MODEL_ID": "finiteautomata/beto-sentiment-analysis",
            "HF_TASK": "text-classification",
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_REGION": region,
        },
    },
)
model_arn = response["ModelArn"]

logger.info(f"Created Model: {model_arn}")

In [None]:
endpoint_config_name = model_name

### 1.2 Create EndpointConfig <a id='endpointconfig'></a>

Once you have a model, create an endpoint configuration with CreateEndpointConfig. Amazon SageMaker hosting services use this configuration to deploy models. In the configuration, you identify one or more models, created with the CreateModel API, and the resources that you want Amazon SageMaker to provision for them. Specify the `AsyncInferenceConfig` object and provide an output Amazon S3 location in `OutputConfig`. You can optionally specify Amazon SNS topics on which to send notifications about prediction results.

In [None]:
response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant-1",
            "ModelName": model_name,
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": f"s3://{s3_bucket}/output",
            # Optionally specify Amazon SNS topics
            # "NotificationConfig": {
            #   "SuccessTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
            #   "ErrorTopic": "arn:aws:sns:us-east-2:123456789012:MyTopic",
            # }
        },
    },
)
endpoint_config_arn = response["EndpointConfigArn"]
logger.info(f"Created EndpointConfig: {endpoint_config_arn}")
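
The `AsyncInferenceConfig` structure can also be assembled programmatically, which makes the optional parts easier to toggle. The following is a minimal sketch, assuming a hypothetical helper named `build_async_inference_config`; the bucket name is a placeholder. In addition to the SNS topics shown in the commented-out lines above, it can set `ClientConfig.MaxConcurrentInvocationsPerInstance`, an optional field of the CreateEndpointConfig API that the cell above does not use.

```python
from typing import Optional


def build_async_inference_config(
    s3_output_path: str,
    success_topic_arn: Optional[str] = None,
    error_topic_arn: Optional[str] = None,
    max_concurrent_invocations: Optional[int] = None,
) -> dict:
    """Build the AsyncInferenceConfig dictionary for create_endpoint_config.

    Hypothetical helper for illustration; optional keys are only added when
    a value is supplied.
    """
    output_config = {"S3OutputPath": s3_output_path}

    # SNS notifications are optional; include only the topics that were given.
    notification_config = {}
    if success_topic_arn:
        notification_config["SuccessTopic"] = success_topic_arn
    if error_topic_arn:
        notification_config["ErrorTopic"] = error_topic_arn
    if notification_config:
        output_config["NotificationConfig"] = notification_config

    config = {"OutputConfig": output_config}

    # Optionally cap how many requests each instance processes concurrently.
    if max_concurrent_invocations is not None:
        config["ClientConfig"] = {
            "MaxConcurrentInvocationsPerInstance": max_concurrent_invocations
        }
    return config


# Placeholder bucket; the topic ARN is the example ARN used in this notebook.
example_config = build_async_inference_config(
    "s3://my-example-bucket/output",
    success_topic_arn="arn:aws:sns:us-east-2:123456789012:MyTopic",
    error_topic_arn="arn:aws:sns:us-east-2:123456789012:MyTopic",
    max_concurrent_invocations=4,
)
print(example_config)
```

The returned dictionary can be passed as the `AsyncInferenceConfig` argument of `sagemaker_client.create_endpoint_config`.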

### 1.3 Create Endpoint <a id='create-endpoint'></a>

Once you have your model and endpoint configuration, use the CreateEndpoint API to create your endpoint. The endpoint name must be unique within an AWS Region in your AWS account.

In [None]:
endpoint_name = "HuggingFaceAsyncEndpoint"
response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
endpoint_arn = response["EndpointArn"]
logger.info(f"Created Endpoint: {endpoint_arn}")
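
Endpoint creation takes several minutes, and the endpoint cannot serve requests until its status is `InService`. The sketch below polls the DescribeEndpoint API until that happens. The helper name `wait_for_endpoint` and its default polling interval and timeout are assumptions chosen for illustration; the client is passed in so the function can be used with the `sagemaker_client` created earlier.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll DescribeEndpoint until the endpoint is InService.

    Hypothetical helper: `client` is expected to be a boto3 SageMaker client.
    Raises RuntimeError if creation fails and TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s"
            )
        # Still creating (or updating); wait before asking again.
        time.sleep(poll_seconds)
```

In this notebook the call would look like `wait_for_endpoint(sagemaker_client, "HuggingFaceAsyncEndpoint")`. boto3 also ships a built-in waiter, `sagemaker_client.get_waiter("endpoint_in_service")`, which can be used instead of a hand-written loop.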

--- 
## 2. Using the Endpoint <a id='endpoint'></a>

### 2.1 Uploading the Request Payload <a id='upload'></a>

The sample `input.json` below is placed in the input location:

{"inputs": ["I like you. I love you","This is sad","am so happy that i want to cry","async endpoints are awesome"]}

In [None]:
input_s3_location = f"s3://{s3_bucket}/input/input.json"
print(input_s3_location)
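
The notebook assumes that `input.json` already exists at the input location. One way to put it there is sketched below, assuming a hypothetical helper named `upload_json_payload` that wraps the S3 `put_object` call. The S3 client is passed in as an argument; the payload is the sample shown above.

```python
import json


def upload_json_payload(s3_client, bucket, key, payload):
    """Serialize `payload` as JSON, upload it to S3, and return its S3 URI.

    Hypothetical helper: `s3_client` is expected to be a boto3 S3 client.
    """
    body = json.dumps(payload).encode("utf-8")
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
    )
    return f"s3://{bucket}/{key}"


# The sample request shown above.
sample_payload = {
    "inputs": [
        "I like you. I love you",
        "This is sad",
        "am so happy that i want to cry",
        "async endpoints are awesome",
    ]
}
```

With the variables defined in the setup section, the upload would be `upload_json_payload(boto_session.client("s3"), s3_bucket, "input/input.json", sample_payload)`, which returns the same URI as `input_s3_location`.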

### 2.2 Invoke Endpoint <a id='invoke-endpoint'></a>

Get inferences from the model hosted at your asynchronous endpoint with InvokeEndpointAsync. Specify the location of your inference data in the `InputLocation` field and the name of your endpoint in `EndpointName`. The response payload contains the output Amazon S3 location where the result will be placed.

In [None]:
response = sm_runtime.invoke_endpoint_async(
    EndpointName="HuggingFaceAsyncEndpoint",
    InputLocation=input_s3_location,
    ContentType="application/json",
)
# The response includes the S3 URI where the result will be written
output_location = response["OutputLocation"]
logger.info(f"Output location: {output_location}")

### 2.3 Check Output Location <a id='check-output'></a>

Check the output location to see if the inference has been processed.

Sample inference output, processed and placed in the output location:

[{"label":"POS","score":0.9982852339744568},{"label":"NEG","score":0.9333241581916809},{"label":"POS","score":0.595783531665802},{"label":"NEU","score":0.9964613318443298}]
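
Because the request is processed asynchronously, the output object may not exist yet when you first look for it. The sketch below polls Amazon S3 until the object appears and then parses it as JSON. The helper names `split_s3_uri` and `wait_for_output`, along with the polling interval and attempt limit, are assumptions for illustration. It also assumes the role may list the bucket, so that a missing object surfaces as `NoSuchKey` rather than an access error.

```python
import json
import time
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(s3_client, output_location, poll_seconds=5, max_attempts=60):
    """Poll S3 until the inference result exists, then return it parsed as JSON.

    Hypothetical helper: `s3_client` is expected to be a boto3 S3 client and
    `output_location` the S3 URI returned by invoke_endpoint_async.
    """
    bucket, key = split_s3_uri(output_location)
    for _ in range(max_attempts):
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
        except s3_client.exceptions.NoSuchKey:
            # Result not written yet; wait and try again.
            time.sleep(poll_seconds)
            continue
        return json.loads(obj["Body"].read())
    raise TimeoutError(f"No output at {output_location} after {max_attempts} attempts")
```

A call such as `wait_for_output(boto_session.client("s3"), response["OutputLocation"])` would return the list of label and score dictionaries shown above once the request has been processed. For production use, the SNS notifications described earlier avoid polling altogether.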

## 3. Summary & Clean up <a id='clean'></a>

To summarize, in this notebook we learned how to use the SageMaker Asynchronous Inference capability with pre-trained Hugging Face models.

If you enabled auto-scaling for your endpoint, ensure you deregister the endpoint as a scalable target before deleting the endpoint. To do this, run the following:

In [None]:
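# Application Auto Scaling client (only needed if auto scaling was enabled)
client = boto_session.client("application-autoscaling")
# Resource ID format: endpoint/<endpoint name>/variant/<variant name>
resource_id = "endpoint/HuggingFaceAsyncEndpoint/variant/variant-1"
response = client.deregister_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
)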

Remember to delete your endpoint after use, as you will be charged for the instances used in this demo.
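
A minimal clean-up sketch follows, assuming a hypothetical helper named `delete_async_resources`. It removes the three resources created in Section 1 in reverse order of creation, using the SageMaker client passed in as an argument.

```python
def delete_async_resources(sagemaker_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its configuration, and the model, in that order.

    Hypothetical helper: `sagemaker_client` is expected to be a boto3
    SageMaker client. Returns the names that were submitted for deletion.
    """
    # Deleting the endpoint stops the instances and therefore the charges.
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    # The endpoint configuration and the model are separate resources.
    sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sagemaker_client.delete_model(ModelName=model_name)
    return [endpoint_name, endpoint_config_name, model_name]
```

For the resources created in this notebook, the call would be `delete_async_resources(sagemaker_client, "HuggingFaceAsyncEndpoint", endpoint_config_name, model_name)`.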

You may also want to delete any other resources you might have created such as SNS topics, S3 objects, etc.