# Benchmarking AWS Graviton Performance on SageMaker with PyTorch

## Introduction

Amazon SageMaker real-time endpoints allow you to host ML applications at scale. In this notebook, we provide a design pattern for benchmarking AWS Graviton instance performance so you can choose the right deployment configuration for your application.

## Environment Setup
This notebook assumes you are already running on Amazon SageMaker and have access to an S3 bucket from your SageMaker environment. If you are not, see the [getting started documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html).
In the next steps, you import standard methods and libraries as well as set variables that will be used in this notebook. The get_execution_role function retrieves the AWS Identity and Access Management (IAM) role you created at the time of creating your notebook instance.


In [None]:
# Install or upgrade the libraries used in this notebook
!pip install --upgrade pip awscli botocore boto3 sagemaker torch transformers --quiet
!mkdir -p ./models
!mkdir -p ./example-payloads
!mkdir -p ./tarballs

from sagemaker import get_execution_role, Session
from sagemaker.model import Model
import boto3
import time

region = boto3.Session().region_name
role = get_execution_role()
sm_client = boto3.client("sagemaker", region_name=region)
sagemaker_session = Session()



## Preparing Model For Benchmarking
Amazon SageMaker Inference Recommender (IR) automates performance benchmarking across instance types. It helps you find the real-time inference endpoint configuration that delivers the best performance at the lowest cost for a given ML model. To benchmark a model with Inference Recommender, we need a model and an example payload to test it with, and SageMaker expects both to be stored in S3. This example uses the [twitter-roberta-base-sentiment-latest](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest) model from Hugging Face to perform sentiment analysis on the example payload. The following sections download the model, package the model and the payload as tarballs, and upload both to S3.

### Download Model Using HuggingFace Pipeline

In [None]:
from transformers import pipeline
saved_model_path = "./models/twitter-roberta-base-sentiment-latest"

pipe = pipeline(model="cardiffnlp/twitter-roberta-base-sentiment-latest")
pipe.save_pretrained(saved_model_path)

!mkdir -p {saved_model_path}/code

### Create Inference Script For SageMaker Inference Recommender Job
We need a script that SageMaker calls to run inference. This code is packaged with the model and is a prerequisite for running an Inference Recommender job.

In [None]:
%%writefile {saved_model_path}/code/inference.py

import json
from transformers import pipeline

REQUEST_CONTENT_TYPE = "application/x-text"
STR_DECODE_CODE = "utf-8"
RESULT_CLASS = "sentiment"
RESULT_SCORE = "score"


def model_fn(model_dir):
    sentiment_analysis = pipeline(
        "sentiment-analysis",
        model=model_dir,
        tokenizer=model_dir,
        return_all_scores=True
    )
    return sentiment_analysis


def input_fn(request_body, request_content_type):
    if request_content_type == REQUEST_CONTENT_TYPE:
        input_data = request_body.decode(STR_DECODE_CODE)
        return input_data
    raise ValueError('{{"error": "unsupported content type {}"}}'.format(request_content_type or "unknown"))


def predict_fn(input_data, model):
    return model(input_data)


def output_fn(prediction, accept):
    class_label = None
    score = -1
    for _pred in prediction[0]:
        if _pred["score"] > score:
            score = _pred["score"]
            class_label = _pred["label"]
    return json.dumps({RESULT_CLASS: class_label, RESULT_SCORE: score})
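
To see how these four handlers fit together, the sketch below chains them in the order the SageMaker inference toolkit is expected to call them for each request: `input_fn`, then `predict_fn`, then `output_fn`, with `model_fn` run once at startup. It is a local smoke test under stated assumptions: `fake_pipeline` and `handle_request` are hypothetical names, the scores are made-up placeholder values, and nothing is downloaded.

```python
import json

REQUEST_CONTENT_TYPE = "application/x-text"


def fake_pipeline(text):
    # Hypothetical stand-in for the Hugging Face pipeline. It returns one list
    # of label/score dicts per input, mimicking return_all_scores=True.
    # The scores are made-up placeholder values.
    return [[
        {"label": "negative", "score": 0.80},
        {"label": "neutral", "score": 0.15},
        {"label": "positive", "score": 0.05},
    ]]


def input_fn(request_body, request_content_type):
    # Deserialize the raw request bytes into a string.
    if request_content_type == REQUEST_CONTENT_TYPE:
        return request_body.decode("utf-8")
    raise ValueError("unsupported content type {}".format(request_content_type))


def predict_fn(input_data, model):
    return model(input_data)


def output_fn(prediction, accept):
    # Keep only the highest-scoring label for the first (and only) input.
    best = max(prediction[0], key=lambda p: p["score"])
    return json.dumps({"sentiment": best["label"], "score": best["score"]})


def handle_request(body, content_type, model):
    # Per-request order: deserialize -> predict -> serialize.
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, "application/json")


response = handle_request(b"Tired of this winter weather.", REQUEST_CONTENT_TYPE, fake_pipeline)
print(response)
```

Running a check like this before packaging can catch content-type or serialization mistakes that would otherwise only surface once the benchmark job starts.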

In [None]:
%%writefile {saved_model_path}/code/requirements.txt
transformers

### Create Example Payload
We need an example payload to test our model with. We'll do this by creating a text file with the following data:

In [None]:
# Text to run the sentiment analysis against
# model_payload_path is the local path where you want to store the payload txt file
model_payload = "The sky is awfully cloudy today and I am quite tired of this winter weather."
model_payload_path = "./example-payloads/sentiment-analysis-payload.txt"

# Write the payload to the text file
with open(model_payload_path, "w") as outfile:
    outfile.write(model_payload)

### Create Model and Payload Tarballs
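Amazon SageMaker requires the model to be packaged as a .tar.gz archive containing the model files and the inference code. Inference Recommender likewise expects the sample payload as a .tar.gz archive.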

In [None]:
model_tarball_path = "./tarballs/twitter-roberta-base-sentiment-latest.tar.gz"
model_payload_tarball_path = "./tarballs/twitter-roberta-base-sentiment-payload.tar.gz"

!tar -cvpzf {model_tarball_path} -C {saved_model_path} .
!tar -cvpzf {model_payload_tarball_path} -C ./example-payloads ./sentiment-analysis-payload.txt
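
Before uploading, it can be worth confirming that the archive has the expected layout, with `code/inference.py` at the archive root next to the model files. The sketch below is one way to do that with Python's standard `tarfile` module. The helper names `list_archive_members` and `has_inference_script` are hypothetical, and the demo builds a throwaway archive with placeholder files so it runs without the real model.

```python
import os
import tarfile
import tempfile


def list_archive_members(tarball_path):
    # Return normalized member paths (no leading "./") for a .tar.gz archive.
    with tarfile.open(tarball_path, "r:gz") as tar:
        return sorted(os.path.normpath(m.name) for m in tar.getmembers())


def has_inference_script(tarball_path, script="code/inference.py"):
    # True if the archive holds the inference script at the expected location.
    return os.path.normpath(script) in list_archive_members(tarball_path)


# Demo on a throwaway directory so the sketch runs without the real model.
with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, "model")
    os.makedirs(os.path.join(model_dir, "code"))
    for name in ("config.json", os.path.join("code", "inference.py")):
        with open(os.path.join(model_dir, name), "w") as f:
            f.write("placeholder")

    demo_tarball = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(demo_tarball, "w:gz") as tar:
        # arcname="." mirrors `tar -C <dir> .`: files sit at the archive root.
        tar.add(model_dir, arcname=".")

    demo_members = list_archive_members(demo_tarball)
    demo_ok = has_inference_script(demo_tarball)

print(demo_members, demo_ok)
```

Against the real artifact, `has_inference_script(model_tarball_path)` would perform the same check.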

### Upload Model and Payload to S3
Now that our tarballs are ready, let's upload them to S3 using the SageMaker Python SDK:

In [None]:
# model package tarball (model artifact + inference code)
model_url = sagemaker_session.upload_data(path=model_tarball_path, key_prefix="model")
sample_payload_url = sagemaker_session.upload_data(path=model_payload_tarball_path, key_prefix="payload")
print("model uploaded to: {} and the sample payload to {}".format(model_url, sample_payload_url))

### Setup Job Details
We're almost ready to create our Inference Recommender job. We just need to specify a few more details to let Inference Recommender know which framework and Deep Learning Container (DLC) we're running.

In [None]:
model_name = "pytorch-bert-sa"
ml_domain = "NATURAL_LANGUAGE_PROCESSING"
ml_task = "FILL_MASK"
framework = "PYTORCH"
supported_content_types = ["application/x-text"]
supported_response_types = ["application/json"]
framework_version = "2.0.0"
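# The image URI is region-specific (this one points at us-east-1); adjust it if your notebook runs in another region.
container_url = "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.0-cpu-py310-ubuntu20.04-sagemaker"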
benchmark_instance_types = ["ml.c7g.4xlarge"]

# For the best throughput, we recommend setting
# SAGEMAKER_MODEL_SERVER_WORKERS to the number of vCPUs on the instance being evaluated.
# To avoid oversubscribing the threads, we recommend setting
# OMP_NUM_THREADS to 1 so that each model server worker gets one thread.
model_container_environment_variables = {
    'OMP_NUM_THREADS': '1',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '16',
    'SAGEMAKER_NGINX_PROXY_READ_TIMEOUT_SECONDS': '600',
    'DNNL_DEFAULT_FPMATH_MODE': 'BF16',
    'DNNL_VERBOSE': '1'
}
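
The worker and thread settings follow a simple rule: one model server worker per vCPU, with each worker limited to one OpenMP thread. When benchmarking several instance sizes, that rule can be captured in a small helper. The sketch below is illustrative only: `build_container_env` is a hypothetical name, and the caller supplies the vCPU count rather than looking it up.

```python
def build_container_env(vcpu_count, omp_threads_per_worker=1):
    # Choose the worker count so that workers * threads per worker
    # does not exceed the number of vCPUs.
    if vcpu_count < 1 or omp_threads_per_worker < 1:
        raise ValueError("vcpu_count and omp_threads_per_worker must be >= 1")
    workers = max(1, vcpu_count // omp_threads_per_worker)
    # Container environment values are passed as strings.
    return {
        "OMP_NUM_THREADS": str(omp_threads_per_worker),
        "SAGEMAKER_MODEL_SERVER_WORKERS": str(workers),
    }


# ml.c7g.4xlarge has 16 vCPUs, which reproduces the values used above.
env_16 = build_container_env(16)
print(env_16)
```

The returned dictionary could be merged with the remaining settings, for example `{**build_container_env(16), 'DNNL_DEFAULT_FPMATH_MODE': 'BF16'}`.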

In [None]:
def create_and_benchmark_model(
    model_name, container_url, model_url, execution_role, sample_payload_url,
    model_container_environment_variables, supported_content_types,
    supported_response_types, benchmark_instance_types, framework,
    framework_version, ml_domain, ml_task, sagemaker_session,
):
    # Note: supported_response_types, framework_version, ml_domain and ml_task
    # are accepted here but are not passed on to right_size() below.
    model_package_name = model_name + str(round(time.time()))
    job_name = model_name + "-ir-job-" + str(round(time.time()))

    benchmark_model = Model(
        image_uri=container_url,
        model_data=model_url,
        role=execution_role,
        env=model_container_environment_variables,
        name=model_package_name,
        sagemaker_session=sagemaker_session
    )

    benchmark_model.right_size(sample_payload_url, supported_content_types,
                               benchmark_instance_types, job_name, framework)
    return job_name

### Launch Inference Recommender Job


In [None]:
job_name = create_and_benchmark_model(model_name, container_url, model_url, role, sample_payload_url, model_container_environment_variables, supported_content_types,
                           supported_response_types, benchmark_instance_types, framework, framework_version, ml_domain, ml_task, sagemaker_session)
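
If you need to check on a job independently of the call above, for example after a kernel restart, you can poll its status through the SageMaker client. The sketch below assumes the `Status` field returned by `describe_inference_recommendations_job` and treats `COMPLETED`, `FAILED`, and `STOPPED` as terminal. `wait_for_ir_job` and `FakeSageMakerClient` are hypothetical names; the fake client exists only so the example runs without AWS credentials.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_ir_job(client, job_name, poll_seconds=60, max_polls=120, sleep=time.sleep):
    # Poll describe_inference_recommendations_job until the job reaches a
    # terminal status, then return that status.
    for _ in range(max_polls):
        description = client.describe_inference_recommendations_job(JobName=job_name)
        status = description["Status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("job {} did not finish after {} polls".format(job_name, max_polls))


class FakeSageMakerClient:
    # Hypothetical stand-in that replays a fixed sequence of statuses.
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_inference_recommendations_job(self, JobName):
        return {"JobName": JobName, "Status": next(self._statuses)}


fake_client = FakeSageMakerClient(["PENDING", "IN_PROGRESS", "COMPLETED"])
final_status = wait_for_ir_job(fake_client, "demo-job", sleep=lambda seconds: None)
print(final_status)
```

With the real client, `wait_for_ir_job(sm_client, job_name)` would poll the job launched in the cell above.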


### Get Inference Recommender Results
The following code pulls the relevant cost metrics from the results of the SageMaker Inference Recommender job.


In [None]:
import pandas as pd

def get_ir_job_results(job_name, instance_type):
    # Fetch the job description and read the first recommendation
    response = sm_client.describe_inference_recommendations_job(JobName=job_name)
    recommendation = response['InferenceRecommendations'][0]
    metrics = recommendation['Metrics']
    initial_instance_count = recommendation['EndpointConfiguration']['InitialInstanceCount']
    cost_per_hour = metrics['CostPerHour']
    cost_per_inference = metrics['CostPerInference']
    cost_per_million_inferences = cost_per_inference * 1000000

    # Build a one-row table; the dictionary order sets the column order
    data_frame = pd.DataFrame({
        'InstanceType': [instance_type],
        'InitialInstanceCount': [initial_instance_count],
        'CostPerInference': [cost_per_inference],
        'CostPerHour': [cost_per_hour],
        'CostPerMillionInferences': [cost_per_million_inferences]
    })

    pd.set_option("display.max_colwidth", 400)
    print(data_frame)

In [None]:
get_ir_job_results(job_name, benchmark_instance_types[0])
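
`get_ir_job_results` reads only the first recommendation. When a job benchmarks more than one instance type, the response can contain several entries, and it helps to rank them by cost. The sketch below flattens every entry into a row and sorts by cost per inference. `summarize_recommendations` is a hypothetical helper, and `sample_response` contains made-up placeholder numbers, not real benchmark results.

```python
def summarize_recommendations(response):
    # Flatten each recommendation into a row, cheapest per inference first.
    rows = []
    for rec in response["InferenceRecommendations"]:
        metrics = rec["Metrics"]
        endpoint = rec["EndpointConfiguration"]
        rows.append({
            "InstanceType": endpoint["InstanceType"],
            "InitialInstanceCount": endpoint["InitialInstanceCount"],
            "CostPerHour": metrics["CostPerHour"],
            "CostPerInference": metrics["CostPerInference"],
            "CostPerMillionInferences": metrics["CostPerInference"] * 1_000_000,
        })
    return sorted(rows, key=lambda row: row["CostPerInference"])


# Placeholder response with made-up numbers, NOT real benchmark results.
sample_response = {
    "InferenceRecommendations": [
        {
            "Metrics": {"CostPerHour": 1.0, "CostPerInference": 4e-06},
            "EndpointConfiguration": {
                "InstanceType": "ml.c7g.4xlarge",
                "InitialInstanceCount": 1,
            },
        },
        {
            "Metrics": {"CostPerHour": 2.0, "CostPerInference": 3e-06},
            "EndpointConfiguration": {
                "InstanceType": "ml.c7g.8xlarge",
                "InitialInstanceCount": 1,
            },
        },
    ]
}

for row in summarize_recommendations(sample_response):
    print(row)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(...)` if you prefer a table.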