# Guide to Use Inference Recommender 

Updated: July 23, 2022

This notebook walks through running SageMaker Inference Recommender on a registered model to help choose an instance type for endpoint deployment. With the current settings, running the whole notebook end to end takes about 40 minutes. The steps are:

* create a model package group
* register a model package version in that group with the settings Inference Recommender needs
    - registering a model package version is a prerequisite, because some of the configuration Inference Recommender relies on is supplied when the model is registered
* use Inference Recommender to help choose the instance type for the deployment

Resources reference: 

[AWS sagemaker notebook example](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-inference-recommender)

## Introduction to Inference Recommender

SageMaker Inference Recommender is a new capability of SageMaker that reduces the time required to get machine learning (ML) models in production by automating performance benchmarking and load testing models across SageMaker ML instances. You can use Inference Recommender to deploy your model to a real-time inference endpoint that delivers the best performance at the lowest cost.

In [28]:
import os
from sagemaker import get_execution_role, Session, session
import boto3

region = boto3.Session().region_name

role = get_execution_role()

sm_client = boto3.client('sagemaker', region_name=region)

sagemaker_session = Session()

import sagemaker
print(sagemaker.__version__)

2.94.0


## Create Model Package Group

This section only needs to be executed once, to create the initial registry.

In [24]:
model_package_group_name = "inference-recommender-model-registry"

#### Run the cell below to create a new model package group

In [25]:
model_package_group_name = "inference-recommender-model-registry"
model_package_group_description = "testing for inference recommender"

model_package_group_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "ModelPackageGroupDescription": model_package_group_description,
}
create_model_package_group_response = sm_client.create_model_package_group(
    **model_package_group_input_dict
)

# the response of create_model_package_group holds the ARN of the group, not of a model version
model_package_group_arn = create_model_package_group_response["ModelPackageGroupArn"]
print("ModelPackageGroup ARN : {}".format(model_package_group_arn))
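
Because the group only needs to be created once, re-running the cell above on an existing group is easy to do by accident. The sketch below is one possible guard, not part of the original notebook: it looks the group up with `list_model_package_groups` first and only calls `create_model_package_group` when no exact name match is found. The helper names `find_model_package_group_arn` and `ensure_model_package_group` are made up for illustration.

```python
def find_model_package_group_arn(sm_client, group_name):
    """Return the ARN of the group named `group_name`, or None if it does not exist."""
    kwargs = {"NameContains": group_name}
    while True:
        page = sm_client.list_model_package_groups(**kwargs)
        for summary in page.get("ModelPackageGroupSummaryList", []):
            # NameContains is a substring filter, so require an exact match here.
            if summary["ModelPackageGroupName"] == group_name:
                return summary["ModelPackageGroupArn"]
        token = page.get("NextToken")
        if not token:
            return None
        kwargs["NextToken"] = token


def ensure_model_package_group(sm_client, group_name, description):
    """Create the group only if it is missing; return its ARN either way."""
    existing_arn = find_model_package_group_arn(sm_client, group_name)
    if existing_arn is not None:
        return existing_arn
    response = sm_client.create_model_package_group(
        ModelPackageGroupName=group_name,
        ModelPackageGroupDescription=description,
    )
    return response["ModelPackageGroupArn"]
```

With the variables defined in this notebook, a call would look like `ensure_model_package_group(sm_client, model_package_group_name, model_package_group_description)`.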

## Prepare model URL for later model registry

SageMaker models need to be packaged in .tar.gz files. When your SageMaker Endpoint is provisioned, the files in the archive will be extracted and put in /opt/ml/model/ on the Endpoint.

To bring your own deep learning model, SageMaker expects a single archive file in .tar.gz format containing the model file (for example, a `*.pb` file in TensorFlow SavedModel format, or a `*.pth` file for PyTorch) and the inference script (`*.py`). For example, this command creates such an archive: `!tar -cvpzf model.tar.gz ./model ./code`


In this case, the `model.tar.gz` file was generated when deploying the model in `inference_testing.ipynb`. It is a packaged file containing `model.pth` and `code/inference.py` (the same files used to deploy the endpoint). There is no need to change `inference.py`. If debugging is needed, setting the log level to debug can be very helpful.
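
Before registering the archive, it can help to confirm that it really has the layout described above. The following is a minimal sketch using only the standard library; the function name `missing_model_archive_members` is hypothetical, and the default list of required members simply mirrors the `model.pth` and `code/inference.py` files mentioned in this section.

```python
import tarfile


def missing_model_archive_members(archive_path, required=("model.pth", "code/inference.py")):
    """Return the required member paths that are absent from the .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Normalise names such as "./model.pth" to "model.pth" before comparing.
        names = {
            name[2:] if name.startswith("./") else name
            for name in archive.getnames()
        }
    return [path for path in required if path not in names]
```

An empty list means both files were found; anything else names what is missing, for example `missing_model_archive_members("model.tar.gz")` on a local copy of the archive.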

The S3 URI below was found by navigating in the SageMaker console to the endpoint, then to its model, and copying the S3 URI of the model artifact.

In [26]:
model_url = "s3://sagemaker-us-east-2-*******/model/model.tar.gz"

## Prepare a payload tar gz URL 

We need to create an archive that contains individual files that Inference Recommender can send to your Endpoint. Inference Recommender will randomly sample files from this archive so make sure it contains a similar distribution of payloads you'd expect in production. Note that your inference code must be able to read in the file formats from the sample payload.

Follow the steps below to create the payload S3 URL:

1. Create a folder `sample-payload` in the local directory on SageMaker and put the test images there.
    - <mark>Note: compressing the image folder with the same terminal command on a local computer and uploading the result to the S3 bucket did not work. The job failed with the error "INVALID_INPUT : 1. Inspect model request and try again."</mark>

2. Run the command below to create the tar.gz file:

 !cd ./sample-payload/ && tar czvf ../payload.tar.gz * 

3. Upload the tar.gz file to the S3 bucket for later use.
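
As an alternative to the shell command in step 2, the archive can also be built from Python. This is only a sketch under a few assumptions: the helper name `build_payload_archive` is made up, it packs regular files only (sub-directories and hidden entries such as `.ipynb_checkpoints` are skipped), and it places every file at the root of the archive, which is what running `tar` from inside the folder does.

```python
import os
import tarfile


def build_payload_archive(payload_dir, archive_path):
    """Pack every regular, non-hidden file in `payload_dir` at the root of a .tar.gz archive."""
    added = []
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in sorted(os.listdir(payload_dir)):
            full_path = os.path.join(payload_dir, name)
            # Skip sub-directories and hidden entries such as .ipynb_checkpoints.
            if name.startswith(".") or not os.path.isfile(full_path):
                continue
            # arcname=name stores the file without the folder prefix.
            archive.add(full_path, arcname=name)
            added.append(name)
    return added
```

For the folder used in this notebook, the call would be `build_payload_archive("sample-payload", "payload.tar.gz")`; the returned list shows which files went into the archive.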

In [None]:
!cd ./sample-payload/ && tar czvf ../payload.tar.gz *

In [6]:
# upload the sample payload archive to the session's default bucket under the "test" prefix
payload_data_url = sagemaker_session.upload_data(path="payload_aid_images.tar.gz", key_prefix="test")
print("payload uploaded to: {}".format(payload_data_url))

In [3]:
# sample_payload_url = "s3://"+bucket+"/"+file_name_saved_in_s3  
sample_payload_url= payload_data_url
sample_payload_url

# Register model in Model Registry

In order to use Inference Recommender, you must have a versioned model in SageMaker Model Registry. To register a model in the Model Registry, you must have a model artifact packaged in a tarball and an inference container image. Registering a model includes the following steps:

### Define ML model details for the model package

Inference Recommender uses metadata about your ML model to recommend the best instance types and endpoint configurations for deployment. You can provide as much or as little information as you'd like but the more information you provide, the better your recommendations will be.

ML Frameworks: TENSORFLOW, PYTORCH, XGBOOST, SAGEMAKER-SCIKIT-LEARN

ML Domains: COMPUTER_VISION, NATURAL_LANGUAGE_PROCESSING, MACHINE_LEARNING

Example ML Tasks: CLASSIFICATION, REGRESSION, IMAGE_CLASSIFICATION, OBJECT_DETECTION, SEGMENTATION, FILL_MASK, TEXT_CLASSIFICATION, TEXT_GENERATION, OTHER

Note: Select the task that is the closest match to your model. Choose OTHER if none applies.

In this step, you'll register the pretrained model that was packaged in the prior steps as a new version in SageMaker Model Registry. First, you'll configure the model package version by identifying which model package group the new version should be registered in and setting its initial approval status. You'll also identify the domain and task for your model. These values are set in the cell below, where ml_domain = 'COMPUTER_VISION' and ml_task = 'IMAGE_SEGMENTATION'.

Note: ModelApprovalStatus is a configuration parameter that can be used in conjunction with SageMaker Projects to trigger an automated deployment pipeline.


In [11]:
framework = "pytorch"  # required for inference recommender job ;   "tensorflow" is another option 
framework_version = "1.8.0" #pytorch_framework_version
model_name = "deeplearning-model" 

instance_type = "ml.c4.xlarge" 

# ML model details
ml_domain = "COMPUTER_VISION"  # required for inference recommender job
ml_task = "IMAGE_SEGMENTATION"

model_package_description = "{} {} inference recommender".format(framework, model_name)

model_approval_status = "PendingManualApproval"

# define the parameters for creating model package 
create_model_package_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "Domain": ml_domain.upper(),
    "Task": ml_task.upper(),
    "SamplePayloadUrl": sample_payload_url,
    "ModelPackageDescription": model_package_description,
    "ModelApprovalStatus": model_approval_status,
}

### Container image URL

In [12]:
import sagemaker
# reuse the framework settings defined above so the image matches the registered model
image_uri = sagemaker.image_uris.retrieve(
    framework=framework,
    region=region,
    version=framework_version,
    py_version="py36",
    image_scope="inference",
    instance_type=instance_type,
)

print(image_uri)

763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-inference:1.8.0-cpu-py36


### Set up inference specification

You'll now set up the inference specification for your model version. This contains information on how the model should be hosted.

Inference Recommender expects a single input MIME type for sending requests. Learn more about [common inference data formats](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html) on SageMaker. This MIME type will be sent in the Content-Type header when invoking your endpoint.
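
To make the role of the MIME type concrete, here is a small sketch of how one payload file could be sent to a real-time endpoint with that value in the Content-Type header. It is not part of the original notebook: the helper name `invoke_with_sample` is hypothetical, and the client is passed in so the function does not depend on any particular endpoint.

```python
def invoke_with_sample(runtime_client, endpoint_name, payload_path,
                       content_type="application/x-image"):
    """Send one payload file to a real-time endpoint and return the raw response body."""
    with open(payload_path, "rb") as payload_file:
        body = payload_file.read()
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,  # should be one of the model package's SupportedContentTypes
        Body=body,
    )
    # The response body is a stream; read it fully before returning.
    return response["Body"].read()
```

In this notebook the client would be created with `boto3.client("sagemaker-runtime", region_name=region)`; the endpoint name and the image path are whatever you deployed and sampled, so they are left as parameters here.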

In [35]:
input_mime_types = ["application/x-image"]

# user provides desired instance for testing comparison; if not provided, inference recommender will automatically tune 
supported_realtime_inference_types = ["ml.c4.xlarge", "ml.c5.xlarge", "ml.m5.xlarge", "ml.c5d.large", "ml.m5.large", "ml.inf1.xlarge"] 

In [13]:
# data_input_configuration = '{"input_1":[1,224,224,3]}' --# this is for Optional: Model optimization using SageMaker Neo

In [14]:
# specify more configurations for model package inference

modelpackage_inference_specification = {
    "InferenceSpecification": {
        "Containers": [
            {
                "Image": image_uri,
                "ModelDataUrl": model_url,
                "Framework": framework.upper(),  # required
                "FrameworkVersion": framework_version,
                "NearestModelName": model_name,
                # "ModelInput": {"DataInputConfig": data_input_configuration},
            }
        ],
        "SupportedContentTypes": input_mime_types,  # required, must be non-null (here: application/x-image)
        "SupportedResponseMIMETypes": [],
        "SupportedRealtimeInferenceInstanceTypes": supported_realtime_inference_types,  # optional
    }
}

In [15]:
modelpackage_inference_specification["InferenceSpecification"]["Containers"][0][
    "ModelDataUrl"
]

's3://sagemaker-us-east-2-488955376385/credential-segmenter-ny-test/model.tar.gz'

### Update the create_model_package_input_dict

In [16]:
create_model_package_input_dict.update(modelpackage_inference_specification)

In [17]:
create_model_package_input_dict

{'ModelPackageGroupName': 'inference-recommender-model-registry',
 'Domain': 'COMPUTER_VISION',
 'Task': 'IMAGE_SEGMENTATION',
 'SamplePayloadUrl': 's3://sagemaker-us-east-2-488955376385/test/payload_aid_images.tar.gz',
 'ModelPackageDescription': 'pytorch credentialSegmenter inference recommender',
 'ModelApprovalStatus': 'PendingManualApproval',
 'InferenceSpecification': {'Containers': [{'Image': '763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-inference:1.8.0-cpu-py36',
    'ModelDataUrl': 's3://sagemaker-us-east-2-488955376385/credential-segmenter-ny-test/model.tar.gz',
    'Framework': 'PYTORCH',
    'FrameworkVersion': '1.8.0',
    'NearestModelName': 'credentialSegmenter'}],
  'SupportedContentTypes': ['application/x-image'],
  'SupportedResponseMIMETypes': [],
  'SupportedRealtimeInferenceInstanceTypes': ['ml.c5d.large',
   'ml.inf1.xlarge',
   'ml.m5.xlarge']}}

### Register the model package 

In [18]:
create_model_package_response = sm_client.create_model_package(**create_model_package_input_dict)
model_package_arn = create_model_package_response["ModelPackageArn"]

print('ModelPackage Version ARN : {}'.format(model_package_arn))

ModelPackage Version ARN : arn:aws:sagemaker:us-east-2:488955376385:model-package/inference-recommender-model-registry/33
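
If the notebook kernel is restarted after registration, the version ARN printed above is lost. One way to recover it, sketched here as an assumption rather than as part of the original workflow, is to ask the registry for the most recently created version in the group with `list_model_packages`. The helper name `latest_model_package_arn` is made up.

```python
def latest_model_package_arn(sm_client, group_name):
    """Return the ARN of the most recently created version in the group, or None."""
    response = sm_client.list_model_packages(
        ModelPackageGroupName=group_name,
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response.get("ModelPackageSummaryList", [])
    return summaries[0]["ModelPackageArn"] if summaries else None
```

Calling `latest_model_package_arn(sm_client, model_package_group_name)` returns a value that can be passed as `ModelPackageVersionArn` in the jobs below.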


# Create a SageMaker Inference Recommender Default Job

Now with your model in Model Registry, you can kick off a 'Default' job to get instance recommendations. This only requires your ModelPackageVersionArn and comes back with recommendations within an hour.

The output is a list of instance type recommendations with associated environment variables, cost, throughput and latency metrics.

In [19]:
import time
default_job_name = model_name + "-instance-" + str(round(time.time()))
job_description = "{} {}".format(framework, model_name)
job_type = "Default"

rv = sm_client.create_inference_recommendations_job(
    JobName=default_job_name,
    JobDescription=job_description,  # optional
    JobType=job_type,
    RoleArn=role,
    InputConfig={"ModelPackageVersionArn": model_package_arn},
)


print("job_name:", default_job_name)
print("job_description:", job_description)

job_name: credentialSegmenter-instance-1655912131
job_description: pytorch credentialSegmenter


In [20]:
# the job was already created in the previous cell; creating it again
# with the same JobName would fail, so only print the stored response
print(rv)

{'JobArn': 'arn:aws:sagemaker:us-east-2:488955376385:inference-recommendations-job/credentialsegmenter-instance-1655912131', 'ResponseMetadata': {'RequestId': 'a75518b6-927a-49ee-ada6-bc257e672db4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'a75518b6-927a-49ee-ada6-bc257e672db4', 'content-type': 'application/x-amz-json-1.1', 'content-length': '123', 'date': 'Wed, 22 Jun 2022 15:35:31 GMT'}, 'RetryAttempts': 0}}


# Check the Instance Recommendation Results

- Method 1: view the results in the SageMaker console under Inference Recommender by clicking the corresponding job name; you can also deploy an endpoint on the chosen instance directly from the console.

- Method 2: run the code below to extract the results in this notebook.

Each inference recommendation includes `InstanceType`, `InitialInstanceCount`, `EnvironmentParameters` which are tuned environment variable parameters for better performance. We also include performance and cost metrics such as `MaxInvocations`, `ModelLatency`, `CostPerHour` and `CostPerInference`. We believe these metrics will help you narrow down to a specific endpoint configuration that suits your use case. 

Example:   

If your motivation is overall price-performance with an emphasis on throughput, then you should focus on `CostPerInference` metrics  
If your motivation is a balance between latency and throughput, then you should focus on `ModelLatency` / `MaxInvocations` metrics

| Metric | Description |
| --- | --- |
| ModelLatency | The interval of time taken by a model to respond as viewed from SageMaker. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. <br /> Units: Milliseconds |
| MaxInvocations | The maximum number of InvokeEndpoint requests sent to an endpoint per minute. <br /> Units: None |
| CostPerHour | The estimated cost per hour for your real-time endpoint. <br /> Units: US Dollars |
| CostPerInference | The estimated cost per inference for your real-time endpoint. <br /> Units: US Dollars |
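
The two motivations in the example above can be turned into simple selection rules. The sketch below is illustrative only: the rows copy the Default job results reported later in this notebook, and the helper names `cheapest_per_inference` and `fastest_within_throughput`, as well as the tie-breaking rule, are assumptions rather than anything Inference Recommender prescribes.

```python
# Metrics copied from the Default job results reported later in this notebook.
recommendations = [
    {"InstanceType": "ml.m5.large", "CostPerHour": 0.115, "CostPerInference": 4e-06,
     "MaxInvocations": 487, "ModelLatency": 604},
    {"InstanceType": "ml.c5d.large", "CostPerHour": 0.115, "CostPerInference": 4e-06,
     "MaxInvocations": 532, "ModelLatency": 371},
    {"InstanceType": "ml.c4.xlarge", "CostPerHour": 0.239, "CostPerInference": 7e-06,
     "MaxInvocations": 609, "ModelLatency": 309},
    {"InstanceType": "ml.c5.xlarge", "CostPerHour": 0.204, "CostPerInference": 4e-06,
     "MaxInvocations": 757, "ModelLatency": 389},
]


def cheapest_per_inference(rows):
    """Pick the lowest CostPerInference; ties go to the higher MaxInvocations."""
    return min(rows, key=lambda r: (r["CostPerInference"], -r["MaxInvocations"]))


def fastest_within_throughput(rows, min_invocations):
    """Pick the lowest ModelLatency among rows sustaining `min_invocations` per minute."""
    eligible = [r for r in rows if r["MaxInvocations"] >= min_invocations]
    return min(eligible, key=lambda r: r["ModelLatency"]) if eligible else None


print(cheapest_per_inference(recommendations)["InstanceType"])          # ml.c5.xlarge
print(fastest_within_throughput(recommendations, 500)["InstanceType"])  # ml.c4.xlarge
```

Three instance types tie on `CostPerInference` in this data, which is why the first rule needs a tie-breaker; choosing throughput for that is a judgment call, and cost per hour would be an equally reasonable one.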

In [21]:
import pprint
import pandas as pd

finished = False
while not finished:
    inference_recommender_job = sm_client.describe_inference_recommendations_job(
        JobName=str(default_job_name)
    )
    if inference_recommender_job["Status"] in ["COMPLETED", "STOPPED", "FAILED"]:
        finished = True
    else:
        print("In progress")
        time.sleep(300)

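# report the terminal status; a STOPPED job did not run to completion
if inference_recommender_job["Status"] == "FAILED":
    print("Inference recommender job failed ")
    print("Failed Reason: {}".format(inference_recommender_job["FailureReason"]))
elif inference_recommender_job["Status"] == "STOPPED":
    print("Inference recommender job was stopped")
else:
    print("Inference recommender job completed")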

In progress
In progress
In progress
In progress
In progress
Inference recommender job completed
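
The polling loop above runs until the job reaches a terminal status, however long that takes. A variant with an upper bound on the waiting time could look like the sketch below. The function name `wait_for_recommendations_job`, the three-hour default timeout, and the injectable `sleep` argument (handy for dry runs) are all assumptions made for illustration.

```python
import time

TERMINAL_STATUSES = ("COMPLETED", "STOPPED", "FAILED")


def wait_for_recommendations_job(sm_client, job_name, poll_seconds=300,
                                 timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll the job until it reaches a terminal status or the timeout elapses."""
    waited = 0
    while True:
        job = sm_client.describe_inference_recommendations_job(JobName=job_name)
        if job["Status"] in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(
                "{} still {} after {} s".format(job_name, job["Status"], waited)
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

The returned dictionary is the same `describe_inference_recommendations_job` response used in the cells of this notebook, so `wait_for_recommendations_job(sm_client, default_job_name)` can stand in for the loop.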


## Detailed results

In [150]:
data = [
    {**x["EndpointConfiguration"], **x["ModelConfiguration"], **x["Metrics"]}
    for x in inference_recommender_job["InferenceRecommendations"]
]
df = pd.DataFrame(data)
df.drop("VariantName", inplace=True, axis=1)
pd.set_option("max_colwidth", 400)
df.head()

# supported_realtime_inference_types = ["ml.c4.xlarge", "ml.c5.xlarge", "ml.m5.xlarge", "ml.c5d.large", "ml.m5.large", "ml.inf1.xlarge"] 

|  | EndpointName | InstanceType | InitialInstanceCount | EnvironmentParameters | CostPerHour | CostPerInference | MaxInvocations | ModelLatency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sm-epc-9ac63c3e-d15e-4518-8b37-96f95fc41e53 | ml.m5.large | 1 | [] | 0.115 | 4e-06 | 487 | 604 |
| 1 | sm-epc-232d401e-ba3e-4154-8706-2f447843c7d4 | ml.c5d.large | 1 | [] | 0.115 | 4e-06 | 532 | 371 |
| 2 | sm-epc-f1d0b548-ec89-47e2-9252-a794b94764c2 | ml.c4.xlarge | 1 | [] | 0.239 | 7e-06 | 609 | 309 |
| 3 | sm-epc-0d3d57a6-59ec-4300-b4f3-5a6c200c8a4c | ml.c5.xlarge | 1 | [] | 0.204 | 4e-06 | 757 | 389 |


## Custom Load Test

With an 'Advanced' job, you can provide your production requirements, select instance types, tune environment variables and perform more extensive load tests. This typically takes 2 hours depending on your traffic pattern and number of instance types.

The output is a list of endpoint configuration recommendations (instance type, instance count, environment variables) with associated cost, throughput and latency metrics.

In the example below, we tune the endpoint against the environment variable OMP_NUM_THREADS with values [1, 2, 4] and require the P95 model latency to stay under 500 ms. The goal is to find which value of OMP_NUM_THREADS provides the best performance.

For some context, many Python numerical libraries, including PyTorch, use OpenMP to implement multithreading within a process. The default value of OMP_NUM_THREADS is typically equal to the number of CPU cores. However, when running on top of Simultaneous Multi-Threading (SMT), such as Intel's Hyper-Threading, a process might oversubscribe a core by spawning twice as many threads as there are physical CPU cores. In certain cases, a Python binary might end up spawning up to four times as many threads as there are physical cores. Therefore, if worker threads have oversubscribed the available cores, an ideal setting for this parameter is 1 or half the number of CPU cores on an SMT-enabled CPU.
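
The candidate values for the sweep can be derived from the CPU count rather than typed by hand. The helper below is a hypothetical sketch (the name `candidate_omp_num_threads` is not from the notebook): it proposes 1, half, and all of the logical CPUs, as strings, because categorical parameter values are passed as strings in the job request below. Assuming an instance with 4 logical CPUs, it reproduces the values used in this notebook.

```python
def candidate_omp_num_threads(logical_cpus):
    """Return string values for 1, half, and all logical CPUs, without duplicates."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    values = {1, max(1, logical_cpus // 2), logical_cpus}
    return [str(value) for value in sorted(values)]


print(candidate_omp_num_threads(4))  # ['1', '2', '4']
```

On the machine running the notebook, `os.cpu_count()` gives the local logical CPU count, but the number that matters is that of the instance type being tested, so it is safer to pass it explicitly.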

In [132]:
instance_type = "ml.c5.xlarge"

In [129]:
import uuid

role = get_execution_role()
# a time-based UUID gives the advanced job a unique name
advanced_job = uuid.uuid1()
advanced_response = sm_client.create_inference_recommendations_job(
    JobName=str(advanced_job),
    JobDescription="",
    JobType="Advanced",
    RoleArn=role,
    InputConfig={
        "ModelPackageVersionArn": model_package_arn,
        "JobDurationInSeconds": 7200,
        "EndpointConfigurations": [
            {
                "InstanceType": instance_type,
                "EnvironmentParameterRanges": {
                    "CategoricalParameterRanges": [
                        {"Name": "OMP_NUM_THREADS", "Value": ["1", "2", "4"]}
                    ]
                },
            }
        ],
        "ResourceLimit": {"MaxNumberOfTests": 3, "MaxParallelOfTests": 1},
        "TrafficPattern": {
            "TrafficType": "PHASES",
            "Phases": [{"InitialNumberOfUsers": 1, "SpawnRate": 1, "DurationInSeconds": 120}],
        },
    },
    StoppingConditions={
        "MaxInvocations": 1000,
        "ModelLatencyThresholds": [{"Percentile": "P95", "ValueInMilliseconds": 500}],
    },
)

print(advanced_response)

{'JobArn': 'arn:aws:sagemaker:us-east-2:488955376385:inference-recommendations-job/bab991c6-f1a8-11ec-b976-c1f8852c95fc', 'ResponseMetadata': {'RequestId': 'b5a59c11-acdb-478c-a1fe-e3a9606202c6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'b5a59c11-acdb-478c-a1fe-e3a9606202c6', 'content-type': 'application/x-amz-json-1.1', 'content-length': '120', 'date': 'Tue, 21 Jun 2022 21:25:53 GMT'}, 'RetryAttempts': 0}}


## Custom Load Test Results

In [130]:
finished = False
while not finished:
    inference_recommender_job = sm_client.describe_inference_recommendations_job(
        JobName=str(advanced_job)
    )
    if inference_recommender_job["Status"] in ["COMPLETED", "STOPPED", "FAILED"]:
        finished = True
    else:
        print("In progress")
        time.sleep(300)

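# report the terminal status; a STOPPED job did not run to completion
if inference_recommender_job["Status"] == "FAILED":
    print("Inference recommender job failed ")
    print("Failed Reason: {}".format(inference_recommender_job["FailureReason"]))
elif inference_recommender_job["Status"] == "STOPPED":
    print("Inference recommender job was stopped")
else:
    print("Inference recommender job completed")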

In progress
In progress
In progress
In progress
In progress
In progress
In progress
Inference recommender job completed


## Detailed results

In [131]:
data = [
    {**x["EndpointConfiguration"], **x["ModelConfiguration"], **x["Metrics"]}
    for x in inference_recommender_job["InferenceRecommendations"]
]
df = pd.DataFrame(data)
df.drop("VariantName", inplace=True, axis=1)
pd.set_option("max_colwidth", 400)
df.head()

|  | EndpointName | InstanceType | InitialInstanceCount | EnvironmentParameters | CostPerHour | CostPerInference | MaxInvocations | ModelLatency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sm-epc-42aee4fd-5240-416c-8548-bd0ff6150b14 | ml.c5.xlarge | 4 | [{'Key': 'OMP_NUM_THREADS', 'ValueType': 'string', 'Value': '1'}] | 0.816 | 1e-05 | 1320 | 119 |
| 1 | sm-epc-d50b9303-c12d-4927-b548-08f4efac1a31 | ml.c5.xlarge | 2 | [{'Key': 'OMP_NUM_THREADS', 'ValueType': 'string', 'Value': '2'}] | 0.408 | 6e-06 | 1132 | 94 |
| 2 | sm-epc-0e2dedac-fc1c-468d-b5f7-e1b5ced3e2ea | ml.c5.xlarge | 2 | [{'Key': 'OMP_NUM_THREADS', 'ValueType': 'string', 'Value': '4'}] | 0.408 | 6e-06 | 1098 | 97 |
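
The `EnvironmentParameters` column holds a list of key/value records, which is awkward to sort or compare. The sketch below, with made-up helper names `env_parameters_to_dict` and `summarize_by_latency`, flattens that list and orders the results above by model latency. The rows are copied from those results; nothing here is measured anew.

```python
# Rows copied from the Advanced job results above (EndpointName omitted for brevity).
advanced_rows = [
    {"InitialInstanceCount": 4, "CostPerHour": 0.816, "MaxInvocations": 1320, "ModelLatency": 119,
     "EnvironmentParameters": [{"Key": "OMP_NUM_THREADS", "ValueType": "string", "Value": "1"}]},
    {"InitialInstanceCount": 2, "CostPerHour": 0.408, "MaxInvocations": 1132, "ModelLatency": 94,
     "EnvironmentParameters": [{"Key": "OMP_NUM_THREADS", "ValueType": "string", "Value": "2"}]},
    {"InitialInstanceCount": 2, "CostPerHour": 0.408, "MaxInvocations": 1098, "ModelLatency": 97,
     "EnvironmentParameters": [{"Key": "OMP_NUM_THREADS", "ValueType": "string", "Value": "4"}]},
]


def env_parameters_to_dict(environment_parameters):
    """Turn [{'Key': k, 'Value': v, ...}, ...] into {k: v, ...}."""
    return {param["Key"]: param["Value"] for param in environment_parameters}


def summarize_by_latency(rows):
    """Return one compact record per row, sorted by ascending ModelLatency."""
    summary = []
    for row in rows:
        env = env_parameters_to_dict(row["EnvironmentParameters"])
        summary.append({
            "OMP_NUM_THREADS": env.get("OMP_NUM_THREADS"),
            "ModelLatency": row["ModelLatency"],
            "CostPerHour": row["CostPerHour"],
        })
    return sorted(summary, key=lambda record: record["ModelLatency"])


for record in summarize_by_latency(advanced_rows):
    print(record)
```

In this run the lowest latency came from `OMP_NUM_THREADS` set to 2, at half the hourly cost of the setting of 1, which also needed four instances instead of two.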
