# SageMaker Inference Recommender for Real-Time Endpoints & Benchmark tests

In this notebook, we demonstrate how to use SageMaker Inference Recommender and benchmark tests with your SageMaker Real-Time endpoints.

## Table of Contents
- Setup
- SageMaker Inference Recommender - Instant Instance Recommendation
- SageMaker Inference Recommender - Load Testing and Benchmarking Real-Time Endpoints
    - Setting up variables
    - Defining Sample Payload
    - Registering Model in the Model Registry
    - Defining Model Packages
- Create an Inference Recommender Default Job

<div class="alert alert-block alert-warning">
<b>Note:</b> This notebook requires variable values / data from the <b>rtb_xgboost_Sagemaker_realtime_endpoint_deploy_invoke</b> notebook. Please execute the code in the <b>rtb_xgboost_Sagemaker_realtime_endpoint_deploy_invoke</b> notebook before proceeding.
</div>

## Setup

For testing purposes only, we recommend configuring your notebook role with <b>SageMaker Full Access</b>.

In [None]:
import boto3
import sagemaker
import time
import os

client = boto3.client(service_name="sagemaker")

boto_session = boto3.session.Session()
region = boto_session.region_name
print(region)

sagemaker_session = sagemaker.Session()

role = sagemaker.get_execution_role()
print(role)

Retrieve the variables saved in the <b>rtb_xgboost_Sagemaker_realtime_endpoint_deploy_invoke</b> notebook.

In [None]:
%store -r image_uri
%store -r model_url
%store -r model_name

In [None]:
try:
    image_uri
    model_url
    model_name
except NameError:
    print("*****************************************************************************")
    print("[ERROR] PLEASE RE-RUN THE SageMaker Real-Time Inference NOTEBOOK ************")
    print("[ERROR] THIS NOTEBOOK WILL NOT RUN PROPERLY. ********************************")
    print("*****************************************************************************")

Print the variables saved in the <b>rtb_xgboost_Sagemaker_realtime_endpoint_deploy_invoke</b> notebook.

In [None]:
print('Image uri : {}'.format(image_uri))
print('Model url: {}'.format(model_url))
print('Model name: {}'.format(model_name))

## SageMaker Inference Recommender - Instant Instance Recommendation

SageMaker's Inference Recommender simplifies the process of selecting the optimal instance type for deploying your machine learning model. It performs preliminary analysis on your model and provides a list of the top five recommended instance types on the model details page. You can access the list of prospective instances programmatically through the DescribeModel API, the SageMaker Python SDK, or directly from the SageMaker console.

The following code block demonstrates how to get instant deployment recommendations from the DescribeModel API.

In [None]:
describe_model_response = client.describe_model(ModelName=model_name)
deployment_recommendation = describe_model_response.get("DeploymentRecommendation")

We can visualize these recommendations by using the code block below:

In [None]:
import pandas as pd

pd.set_option("display.max_colwidth", None)
df = pd.DataFrame(
    deployment_recommendation.get("RealTimeInferenceRecommendations"),
    columns=["RecommendationId", "InstanceType", "Environment"],
)
display(df)

You can use the recommendations above as a starting point for load testing and benchmarking your model.

In [None]:
recommended_instance_types = df['InstanceType'].values.tolist()
print('Recommended instance types : {}'.format(recommended_instance_types))

While these initial recommendations serve as a starting point, running additional instance recommendation jobs is advisable for more accurate and comprehensive results. 
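Instant recommendations may not be available yet when `DescribeModel` is called, in which case the `DeploymentRecommendation` field can be missing or incomplete. The following is a minimal sketch, assuming the response shape used above, of extracting the instance types defensively. The helper name `extract_instance_types` and the sample response are hypothetical and only serve as illustration.

```python
def extract_instance_types(describe_model_response):
    """Return recommended instance types, or an empty list if none are available yet."""
    recommendation = describe_model_response.get("DeploymentRecommendation") or {}
    if recommendation.get("RecommendationStatus") != "COMPLETED":
        return []
    return [
        item["InstanceType"]
        for item in recommendation.get("RealTimeInferenceRecommendations", [])
    ]


# Hypothetical response fragment, shaped like the DescribeModel output used above.
sample_response = {
    "ModelName": "example-model",
    "DeploymentRecommendation": {
        "RecommendationStatus": "COMPLETED",
        "RealTimeInferenceRecommendations": [
            {"RecommendationId": "example/1", "InstanceType": "ml.c5.large", "Environment": {}},
            {"RecommendationId": "example/2", "InstanceType": "ml.m5.xlarge", "Environment": {}},
        ],
    },
}

print(extract_instance_types(sample_response))
print(extract_instance_types({"ModelName": "no-recommendation-yet"}))
```

With a real response, you could call `extract_instance_types(describe_model_response)` and fall back to a manual choice of instance types when the list is empty.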

## SageMaker Inference Recommender - Load Testing and Benchmarking Real-Time Endpoints

### Setting up variables

Inference Recommender uses metadata about your ML model to recommend the best instance types and endpoint configurations for deployment. You can provide as much or as little information as you like, but the more information you provide, the better your recommendations will be.

ML Frameworks: TENSORFLOW, PYTORCH, XGBOOST, SAGEMAKER-SCIKIT-LEARN

ML Domains: COMPUTER_VISION, NATURAL_LANGUAGE_PROCESSING, MACHINE_LEARNING

Example ML Tasks: CLASSIFICATION, REGRESSION, IMAGE_CLASSIFICATION, OBJECT_DETECTION, SEGMENTATION, MASK_FILL, TEXT_CLASSIFICATION, TEXT_GENERATION, OTHER

We define variables as below:

In [None]:
ml_domain = "MACHINE_LEARNING"
ml_task = "CLASSIFICATION"

framework = "XGBOOST"
framework_version = "1.0.1"

We need to create an archive that contains a sample payload that Inference Recommender can send to your SageMaker endpoints.

Here we add only a single CSV file with one example. For your own use cases, we recommend adding a variety of samples that are representative of your payloads.

### Defining Sample Payload

In [None]:
data_input_configuration = b"2,0,0.0,7.0,3.0,20.0,2"

In [None]:
payload_location = "./sample-payload/"

if not os.path.exists(payload_location):
    os.makedirs(payload_location)
    print("Directory ", payload_location, " Created ")
else:
    print("Directory ", payload_location, " already exists")

In [None]:
payload_archive_location = payload_location + "xgb_payload.tar.gz"
payload_file_location = payload_location + "sample.csv"

with open(payload_file_location, "wb") as file:
    file.write(data_input_configuration)
    file.write(b"\n")

Next, we create a tarball containing the sample payload.

In [None]:
!tar -cvzf {payload_archive_location} {payload_file_location}
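As an alternative to the shell command, the archive can also be built with Python's standard `tarfile` module, which makes it easier to package several representative samples. The following is a minimal sketch: the helper name `build_payload_archive`, the file names, and the second sample row are made up for illustration, and the files are stored at the archive root rather than under the `sample-payload` path.

```python
import os
import tarfile
import tempfile


def build_payload_archive(samples, archive_path):
    """Write each sample to its own CSV file and pack the files into a gzip tarball."""
    with tempfile.TemporaryDirectory() as staging_dir:
        with tarfile.open(archive_path, "w:gz") as archive:
            for index, sample in enumerate(samples):
                file_name = "sample_{}.csv".format(index)
                file_path = os.path.join(staging_dir, file_name)
                with open(file_path, "wb") as sample_file:
                    sample_file.write(sample + b"\n")
                # arcname stores the file at the archive root, without the staging path
                archive.add(file_path, arcname=file_name)
    return archive_path


# The first row is the sample used above; the second is a made-up row for illustration.
example_samples = [b"2,0,0.0,7.0,3.0,20.0,2", b"1,1,0.0,3.0,5.0,12.0,4"]
example_archive = os.path.join(tempfile.mkdtemp(), "xgb_payload.tar.gz")
build_payload_archive(example_samples, example_archive)

with tarfile.open(example_archive, "r:gz") as archive:
    print(archive.getnames())
```

The resulting archive could then be uploaded to S3 in the same way as the one created by the `tar` command.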

We upload the packaged payload archive (xgb_payload.tar.gz) created above to S3. The S3 location is used as input to our Inference Recommender job later in this notebook.

In [None]:
sample_payload_url = sagemaker_session.upload_data(
    path=payload_archive_location, key_prefix="xgb_payload"
)

print("Sample Payload S3 URL: " + sample_payload_url)

### Registering Model in the Model Registry

In order to use Inference Recommender, you must have a versioned model in SageMaker Model Registry. To register a model in the Model Registry, you must have a model artifact packaged in a tarball and an inference container image. Registering a model includes the following steps:

- <b>Create Model Group:</b> This is a one-time task per machine learning use case. A Model Group contains one or more versions of your packaged model.
- <b>Register Model Version/Package:</b> This task is performed for each new packaged model version.

In the following code block, we create a model group.

In [None]:
from time import gmtime, strftime

model_package_group_name = "xgb-bid-filtering-rtb-model-package-group-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
model_package_group_input_dict = {
    "ModelPackageGroupName": model_package_group_name,
    "ModelPackageGroupDescription": "RTB traffic filtering model group",
}

create_model_package_group_response = client.create_model_package_group(**model_package_group_input_dict)
print('ModelPackageGroup Arn : {}'.format(create_model_package_group_response['ModelPackageGroupArn']))

### Defining Model Packages

For the inference container image we use the URI of the deep learning container (DLC) provided by Amazon (defined in the previous notebook). 

Next, register the pretrained model that was packaged in the prior steps as a new version in SageMaker Model Registry. First, configure the model package (version): specify the model package group in which the new model should be registered and its initial approval status. Also specify the domain and task for your model. These values were set earlier in the notebook as ml_domain = 'MACHINE_LEARNING' and ml_task = 'CLASSIFICATION'.

> Note: ModelApprovalStatus is a configuration parameter that can be used in conjunction with SageMaker Projects to trigger automated deployment pipeline.

If you specify a non-empty list of instance types in `SupportedRealtimeInferenceInstanceTypes`, Inference Recommender limits its recommendations to that set of instances.

As an example, the code below includes the list of initially recommended instance types, commented out. Uncomment that line to restrict the recommendations to those instance types.

In [None]:
# Specify MIME type for the model (type of data model will accept and/or return)
mime_types = ["text/csv"]

# Specify the model inference specification
modelpackage_inference_specification = {
    "InferenceSpecification": {
        "Containers": [
            {
                "Image": image_uri,
                "ModelDataUrl": model_url,
            }
        ],
        "SupportedContentTypes": mime_types,
        "SupportedResponseMIMETypes": mime_types,
        # "SupportedRealtimeInferenceInstanceTypes": recommended_instance_types
    }
}

create_model_package_input_dict = {
    "ModelPackageGroupName" : model_package_group_name,
    "ModelPackageDescription" : "RTB traffic filtering model",
    "Domain": ml_domain.upper(),
    "Task": ml_task.upper(),
    "SamplePayloadUrl": sample_payload_url,    
    "ModelApprovalStatus" : "PendingManualApproval"
}
create_model_package_input_dict.update(modelpackage_inference_specification)

create_model_package_response = client.create_model_package(**create_model_package_input_dict)
model_package_arn = create_model_package_response["ModelPackageArn"]
print('ModelPackage Version ARN : {}'.format(model_package_arn))
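To make the optional instance-type restriction explicit, the request could be assembled by a small helper that adds `SupportedRealtimeInferenceInstanceTypes` only when a non-empty list is passed. This is a sketch under stated assumptions: the function name `build_model_package_request` and all argument values shown are placeholders, not values from this notebook.

```python
def build_model_package_request(
    group_name, image_uri, model_url, sample_payload_url, domain, task, instance_types=None
):
    """Assemble CreateModelPackage arguments; instance_types is optional."""
    inference_specification = {
        "Containers": [{"Image": image_uri, "ModelDataUrl": model_url}],
        "SupportedContentTypes": ["text/csv"],
        "SupportedResponseMIMETypes": ["text/csv"],
    }
    if instance_types:
        # Restrict recommendations to these instance types only when a list is supplied
        inference_specification["SupportedRealtimeInferenceInstanceTypes"] = list(instance_types)
    return {
        "ModelPackageGroupName": group_name,
        "Domain": domain.upper(),
        "Task": task.upper(),
        "SamplePayloadUrl": sample_payload_url,
        "ModelApprovalStatus": "PendingManualApproval",
        "InferenceSpecification": inference_specification,
    }


# Placeholder values for illustration only; substitute the variables defined in this notebook.
example_request = build_model_package_request(
    group_name="example-model-package-group",
    image_uri="example-image-uri",
    model_url="s3://example-bucket/model.tar.gz",
    sample_payload_url="s3://example-bucket/xgb_payload/xgb_payload.tar.gz",
    domain="machine_learning",
    task="classification",
    instance_types=["ml.c5.large", "ml.m5.xlarge"],
)
print(example_request["InferenceSpecification"]["SupportedRealtimeInferenceInstanceTypes"])
```

The returned dictionary could be passed to `client.create_model_package(**example_request)` once real values are supplied.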

## Create an Inference Recommender Default Job

Now with your model in Model Registry, you can start a 'Default' job to get instance recommendations. This only requires your ModelPackageVersionArn and comes back with recommendations within 15-20 minutes.

The output is a list of instance type recommendations with associated environment variables, cost, throughput and latency metrics.

In [None]:
from time import gmtime, strftime

# Create a low-level SageMaker service client.
aws_region = region
sagemaker_client = boto3.client('sagemaker', region_name=aws_region) 

# model_package_arn was set above when the model was registered with Model Registry

# Provide a unique job name for SageMaker Inference Recommender job
job_name = "xgb-bid-filtering-real-time-benchmark-test" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Inference Recommender job type. Set to Default to get an initial recommendation
job_type = 'Default'

# Provide an IAM Role that gives SageMaker Inference Recommender permission to 
# access AWS services
role_arn = role
                                    
# Name of an existing endpoint that you want to benchmark (not used by the Default job below)
endpoint_name = 'xgb-bid-filtering-realtime-ep2024-03-04-21-19-41'

sagemaker_client.create_inference_recommendations_job(
    JobName = job_name,
    JobType = job_type,
    RoleArn = role_arn,
    InputConfig = {
        'ModelPackageVersionArn': model_package_arn,
    }
)

The following code waits for the job to finish.

In [None]:
import pprint
import pandas as pd

finished = False
while not finished:
    inference_recommender_job = client.describe_inference_recommendations_job(JobName=job_name)
    if inference_recommender_job["Status"] in ["COMPLETED", "STOPPED", "FAILED"]:
        finished = True
    else:
        print("In progress")
        time.sleep(60)

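status = inference_recommender_job["Status"]
if status == "FAILED":
    print("Inference recommender job failed")
    # FailureReason is the field returned by DescribeInferenceRecommendationsJob
    print("Failure reason: {}".format(inference_recommender_job.get("FailureReason")))
else:
    print("Inference recommender job finished with status: {}".format(status))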
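The polling loop above runs until the job reaches a terminal status, with no upper bound on the wait. The following is a minimal sketch of the same idea with a timeout. The helper name `wait_for_job` and the fake describe function are hypothetical; with a real client you would pass `client.describe_inference_recommendations_job` as `describe_job`.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "STOPPED", "FAILED"}


def wait_for_job(describe_job, job_name, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_job until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = describe_job(JobName=job_name)
        if job["Status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("Job {} did not finish in {} seconds".format(job_name, timeout_seconds))
        sleep(poll_seconds)


# Stand-in for the real describe call, so the sketch runs without AWS access.
_fake_statuses = iter(["PENDING", "IN_PROGRESS", "COMPLETED"])


def fake_describe_job(JobName):
    return {"JobName": JobName, "Status": next(_fake_statuses)}


result = wait_for_job(fake_describe_job, "example-job", poll_seconds=0, sleep=lambda seconds: None)
print(result["Status"])
```

Injecting the describe and sleep functions keeps the helper testable offline; the timeout value is a judgment call and should reflect how long your jobs usually take.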

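The job description returned by DescribeInferenceRecommendationsJob includes the list of recommendations under `InferenceRecommendations`. The code below flattens that list into a table. To see the list of subtasks for an Inference Recommender job, you can provide the JobName to the ListInferenceRecommendationsJobSteps API.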

In [None]:
data = [
    {**x["EndpointConfiguration"], **x["ModelConfiguration"], **x["Metrics"]}
    for x in inference_recommender_job["InferenceRecommendations"]
]
df = pd.DataFrame(data)
# Drop the variant name column if present; it is not needed for the comparison
df = df.drop(columns=["VariantName"], errors="ignore")
pd.set_option("display.max_colwidth", 300)
df.head(20)
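The ListInferenceRecommendationsJobSteps API mentioned above can return its results in pages. The following is a minimal sketch of collecting all steps, assuming the response exposes `Steps` and an optional `NextToken`. The helper name `list_job_steps` and the fake client are hypothetical and only stand in for a real `boto3` SageMaker client.

```python
def list_job_steps(sagemaker_client, job_name):
    """Collect all job steps, following NextToken until no further page is returned."""
    steps = []
    request = {"JobName": job_name}
    while True:
        response = sagemaker_client.list_inference_recommendations_job_steps(**request)
        steps.extend(response.get("Steps", []))
        next_token = response.get("NextToken")
        if not next_token:
            return steps
        request["NextToken"] = next_token


class FakeSageMakerClient:
    """Stand-in that returns two hypothetical pages of steps."""

    def list_inference_recommendations_job_steps(self, JobName, NextToken=None):
        if NextToken is None:
            first = {"StepType": "BENCHMARK", "JobName": JobName, "Status": "COMPLETED"}
            return {"Steps": [first], "NextToken": "page-2"}
        second = {"StepType": "BENCHMARK", "JobName": JobName, "Status": "FAILED"}
        return {"Steps": [second]}


steps = list_job_steps(FakeSageMakerClient(), "example-job")
print([step["Status"] for step in steps])
```

With the real client from this notebook, the call would be `list_job_steps(client, job_name)`.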

Each inference recommendation includes:
- `EndpointName` - Name of endpoint used by Inference Recommender to run the job
- `ServerlessConfig` - Configuration/tests for three serverless endpoints with various memory configurations
- `EnvironmentParameters` - Suggested tuned parameters for better performance. To take advantage of these optimizations you can include these parameters as Environment variables when creating your endpoints 
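To apply the suggested parameters, the `EnvironmentParameters` list can be converted into the key/value mapping that an endpoint's container environment expects. This is a minimal sketch, assuming each entry carries `Key` and `Value` fields. The helper name `environment_from_parameters` and the sample parameter names and values are hypothetical.

```python
def environment_from_parameters(environment_parameters):
    """Convert a list of {Key, Value, ...} entries into a flat environment dictionary."""
    return {parameter["Key"]: parameter["Value"] for parameter in environment_parameters}


# Hypothetical entries shaped like a recommendation's EnvironmentParameters list.
sample_parameters = [
    {"Key": "EXAMPLE_NUM_WORKERS", "ValueType": "string", "Value": "2"},
    {"Key": "EXAMPLE_THREADS_PER_WORKER", "ValueType": "string", "Value": "4"},
]

environment = environment_from_parameters(sample_parameters)
print(environment)
```

The resulting dictionary could be supplied as the container's environment variables when you create the model for your endpoint.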


The output also includes performance and cost metrics such as:
- `CostPerHour` - Cost of running the endpoint for an hour (US Dollars)
- `CostPerInference` - Cost of one inference request (US Dollars)
- `MaxInvocations` - The maximum number of invocations sent to the endpoint per minute
- `ModelLatency` - Model latency registered during the stress test (in milliseconds)
- `InstanceType` - Instance type used for the test
- `InitialInstanceCount` - The number of instances initialized for each test
- `CpuUtilization` - The expected CPU utilization at maximum invocations per minute for the endpoint instance

You can read more about interpreting the results here: https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-interpret-results.html

These metrics can help you narrow down to a specific endpoint configuration that suits your use case.

Example:

If your motivation is overall price-performance with an emphasis on throughput, then you should focus on `CostPerInference` metrics.

If your motivation is a balance between latency and throughput, then you should focus on `ModelLatency` / `MaxInvocations` metrics.
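One way to combine these criteria is to pick the cheapest configuration per inference among those that meet a latency target. The following is a minimal sketch under stated assumptions: the helper name `cheapest_within_latency` is hypothetical, and the rows and numbers are made up to resemble the flattened recommendation table, not results from an actual job.

```python
def cheapest_within_latency(recommendations, max_latency_ms):
    """Return the recommendation with the lowest CostPerInference that meets the latency limit."""
    eligible = [row for row in recommendations if row["ModelLatency"] <= max_latency_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row["CostPerInference"])


# Made-up rows for illustration only.
sample_rows = [
    {"InstanceType": "ml.c5.large", "CostPerInference": 1.2e-06, "ModelLatency": 12, "MaxInvocations": 1500},
    {"InstanceType": "ml.m5.xlarge", "CostPerInference": 2.0e-06, "ModelLatency": 6, "MaxInvocations": 2400},
    {"InstanceType": "ml.c5.2xlarge", "CostPerInference": 9.0e-07, "ModelLatency": 25, "MaxInvocations": 5000},
]

best = cheapest_within_latency(sample_rows, max_latency_ms=15)
print(best["InstanceType"])
```

With the DataFrame built earlier, `df.to_dict("records")` would provide rows in this shape; the latency limit is an assumption you should set from your own requirements.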