# Benchmark RoBERTa base model using Amazon SageMaker Multi-model endpoints (MME) with GPU support

Amazon SageMaker multi-model endpoints with GPU work using **NVIDIA Triton Inference Server**.

NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and **provides high inference performance**. Triton supports all major training and inference frameworks.
It offers **dynamic batching, concurrent execution, post-training quantization, and optimal model configuration** to achieve high-performance inference.

In this notebook, we run a benchmark test for a popular NLP model, RoBERTa base, using MME on GPU. **We will evaluate performance metrics such as inference latency, throughput, and the optimum model count per instance.**

This notebook was tested on the `PyTorch 1.12 Python 3.8 CPU Optimized` kernel on SageMaker Studio. An instance with at least 8 vCPU cores, such as an `ml.c5.2xlarge`, is recommended to run the load test. A smaller instance may be used by reducing the scale of the load test. The configuration provided here can simulate up to 200 concurrent workers.

## Set up the environment

Install the dependencies required to package the model and run inference using Triton server.

In [None]:
%pip install timm -Uqq
%pip install transformers -Uqq
%pip install locust -Uqq
%pip install boto3 -Uqq
%pip install sagemaker -Uqq
%pip install matplotlib -Uqq
%pip install Jinja2 -Uqq
%pip install ipywidgets -Uqq

In [None]:
%%capture
import IPython

IPython.Application.instance().kernel.do_shutdown(True)  # has to restart kernel so changes are used

In [None]:
%env TOKENIZERS_PARALLELISM=False

In [None]:
import sagemaker
from sagemaker import get_execution_role
import torch
from pathlib import Path

import boto3
import json
import time
import datetime as dt
import warnings

from utils import model_utils

region = boto3.Session().region_name
role = get_execution_role()
sess = sagemaker.Session()
account = sess.account_id()
bucket = sess.default_bucket()

model_name = "roberta-base" 
prefix = 'mme-roberta-base-benchmark'
use_case = "nlp"
max_seq_len = 128 # sequence length

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")

## Generate Pretrained Models

We download a pretrained model, export it to TorchScript, and package it for Triton.
Helper functions have been created for each of these steps and are imported from the local `utils.model_utils` module.

#### Get a model and tokenizer from HuggingFace Hub


In [None]:
tokenizer, model = model_utils.get_model_from_hf_hub(model_name)
model.eval()
print(f"loaded model {model_name} with {model_utils.count_parameters(model)} parameters")
example_input = tokenizer("This is a sample", padding="max_length", max_length=max_seq_len, return_tensors="pt")

#### JIT-compile the model and save the TorchScript file

In [None]:
pytorch_model_path = Path(f"triton-serve-pt/{model_name}/1")
pytorch_model_path.mkdir(parents=True, exist_ok=True)
pt_model_path = model_utils.export_pt_jit(model, list(example_input.values()), pytorch_model_path) #export jit compiled model to specified directory

#### Create and package a model artifact 

The package contains the TorchScript file and a model configuration (`config.pbtxt`) for Triton serving.
You can configure the inputs and outputs for Triton according to your model.
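
To make the shape of such a configuration concrete, here is a minimal sketch that renders a `config.pbtxt`-style text from lists of input and output descriptions. The function `render_triton_config` is hypothetical and is not the notebook's `generate_triton_config` helper, whose output may differ; the `max_batch_size` value and the sample tensor names are assumptions chosen for illustration.

```python
def render_triton_config(platform, inputs, outputs, max_batch_size=32):
    """Render a minimal Triton config.pbtxt as text (hypothetical helper)."""

    def render_tensors(tensors):
        # Each tensor dict carries a name, a Triton data type, and dims as text.
        blocks = []
        for t in tensors:
            blocks.append(
                "  {\n"
                f'    name: "{t["name"]}"\n'
                f'    data_type: {t["data_type"]}\n'
                f'    dims: {t["dims"]}\n'
                "  }"
            )
        return ",\n".join(blocks)

    return (
        f'platform: "{platform}"\n'
        f"max_batch_size: {max_batch_size}\n"  # assumed value, tune per model
        f"input [\n{render_tensors(inputs)}\n]\n"
        f"output [\n{render_tensors(outputs)}\n]\n"
    )


# Sample descriptions: sequence length 128 and hidden size 768 (RoBERTa base).
sample_inputs = [
    {"name": name, "data_type": "TYPE_INT32", "dims": "[128]"}
    for name in ("input_ids", "attention_mask")
]
sample_outputs = [
    {"name": "last_hidden_state", "data_type": "TYPE_FP32", "dims": "[128, 768]"}
]

config_text = render_triton_config("pytorch_libtorch", sample_inputs, sample_outputs)
print(config_text)
```

Because `max_batch_size` is greater than zero in this sketch, the `dims` entries describe a single item and omit the batch dimension.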

In [None]:
triton_inputs = [
    {"name": input_name, "data_type": "TYPE_INT32", "dims": f"[{max_seq_len}]"} 
        for input_name in example_input
]
triton_outputs = [
    {
        "name": "last_hidden_state",
        "data_type": "TYPE_FP32",
        "dims": f"[{max_seq_len}, {model.config.hidden_size}]",
    }
]

In [None]:
triton_config_path = model_utils.generate_triton_config(platform="pt", triton_inputs=triton_inputs, triton_outputs=triton_outputs, save_path=pytorch_model_path)
triton_config_path

In [None]:
model_artifact_path = model_utils.package_triton_model(model_name, pt_model_path, triton_config_path)

In [None]:
mme_path = f"s3://{bucket}/{prefix}/{model_name}/"
initial_model_path = sess.upload_data(model_artifact_path.as_posix(), bucket=bucket, key_prefix=f"{prefix}{model_name}")

In [None]:
initial_model_path

In [None]:
mme_path

#### Clear the multi-model endpoint S3 path and verify that it contains no models

In [None]:
!aws s3 rm --recursive {mme_path}

In [None]:
!aws s3 ls {mme_path}

## Create a SageMaker Multi-Model Endpoint for PyTorch Model

In [None]:
from utils.endpoint_utils import create_endpoint, delete_endpoint, get_instance_utilization, run_load_test
from utils.model_utils import get_triton_image_uri

mme_triton_image_uri = get_triton_image_uri(region)
print(mme_triton_image_uri)
instance_type = 'ml.g4dn.xlarge'

In [None]:
container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": mme_path,
    "Mode": "MultiModel"
}

We'll deploy the endpoint using a helper function.

In [None]:
sm_model_name, endpoint_config_name, endpoint_name = create_endpoint(sm_client, model_name, role, container, instance_type, "pt")

Next, we'll upload a Python model that we can use to query the instance utilization in real time.

In [None]:
!tar czvf metrics.tar.gz server_metrics/
!aws s3 cp metrics.tar.gz {mme_path}

In [None]:
!aws s3 ls {mme_path}

In [None]:
get_instance_utilization(runtime_sm_client, endpoint_name) #invoke once to load the python model in memory

## Load PyTorch Models into Endpoint

In this section we determine the maximum number of model copies that the endpoint can load into GPU memory while staying under a specified utilization threshold.
- When a model is invoked for the first time, SageMaker loads it into GPU memory.
- We first invoke the model with a sample payload, which results in it being loaded into memory.
- We then make copies of the model on S3 and invoke each copy until we reach the specified GPU memory threshold, which we set at 90% of available memory.
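
The stopping rule can be sketched in isolation as follows. This is a simplified simulation, not the endpoint logic: it assumes every copy has the same fixed memory footprint, and the numbers are made up for illustration. The function names are hypothetical.

```python
def projected_utilization(used_memory, total_memory, models_loaded):
    """Estimate GPU memory utilization after loading one more model copy."""
    avg_per_model = used_memory / models_loaded
    return (used_memory + avg_per_model) / total_memory


def count_models_that_fit(per_model_memory, total_memory, threshold=0.9):
    """Simulate the loading loop with a fixed, made-up per-model footprint."""
    models_loaded = 0
    used_memory = 0.0
    while True:
        used_memory += per_model_memory  # stands in for invoking a new copy
        models_loaded += 1
        # Stop as soon as one more copy would reach or cross the threshold.
        if projected_utilization(used_memory, total_memory, models_loaded) >= threshold:
            return models_loaded


# Illustrative numbers only (MiB): 1000 per model on a 15000 MiB device.
print(count_models_that_fit(per_model_memory=1000, total_memory=15000))
```

Checking the projected utilization rather than the current one means the loop stops before a load that would push the GPU past the threshold.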

In [None]:
payload = {
    "inputs":
        [{"name": name, "shape": list(data.size()), "datatype": "INT32", "data": data.tolist()} for name, data in example_input.items()]
}
payload['inputs'][0]['shape']

In [None]:
models_loaded = 0
memory_utilization_threshold = 0.9
memory_utilization_history = []
while True:
    # make a copy of the model
    !aws s3 cp {initial_model_path} {mme_path}{model_name}-v{models_loaded}.tar.gz
    
    # make an inference request to load the model into memory
    response = runtime_sm_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/octet-stream",
            Body=json.dumps(payload),
            TargetModel=f"{model_name}-v{models_loaded}.tar.gz", 
        )
    
    models_loaded+=1
    
    #get instance metrics
    instance_metrics = get_instance_utilization(runtime_sm_client, endpoint_name)
    model_avg_mem_consumption = instance_metrics["gpu_used_memory"] / models_loaded
    
    # get an estimate of the gpu memory util once next model is loaded
    next_gpu_mem_util = (instance_metrics["gpu_used_memory"] + model_avg_mem_consumption) / instance_metrics["gpu_total_memory"]
    
    memory_utilization = instance_metrics["gpu_memory_utilization"]
    memory_utilization_history.append(memory_utilization)
    
    # terminate loop if the memory consumption is exceeded once next model is loaded
    if next_gpu_mem_util >= memory_utilization_threshold:
        print(f"This instance is able to load {models_loaded} models with {memory_utilization:.2%} of gpu memory consumed")
        break
        
    print(f"loaded {models_loaded} models with memory utilization of {memory_utilization:.2%}")

In [None]:
!aws s3 ls {mme_path}

In [None]:
models_loaded

## Benchmark PyTorch Model using Locust

`locust_benchmark_sm.py` is provided in the `locust` folder.

<div class="alert alert-info"> <strong> Note: </strong>
The load test is run with up to 200 simulated workers. This may not be suitable for larger models with long response times. You can modify the <code>StagesShape</code> class in the <code>locust/locust_benchmark_sm.py</code> file to adjust the traffic pattern and the number of concurrent workers.
</div>
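
As a rough illustration of how a staged traffic pattern can be expressed, the sketch below looks up the active stage for a given elapsed time. In Locust, this kind of lookup typically lives in the `tick()` method of a `LoadTestShape` subclass, which returns a `(user_count, spawn_rate)` tuple or `None` to stop the test. The sketch uses a plain function so that it runs without Locust installed. The stage values and the function name are assumptions and do not reflect the actual contents of `StagesShape`.

```python
def users_for_run_time(run_time, stages):
    """Return (users, spawn_rate) for the stage active at `run_time`, or None when done."""
    for stage in stages:
        # "duration" is the elapsed time (seconds) at which the stage ends.
        if run_time < stage["duration"]:
            return stage["users"], stage["spawn_rate"]
    return None  # past the last stage: the load test should stop


# Hypothetical ramp: 20 users, then 100, then 200 concurrent workers.
example_stages = [
    {"duration": 30, "users": 20, "spawn_rate": 5},
    {"duration": 60, "users": 100, "spawn_rate": 20},
    {"duration": 90, "users": 200, "spawn_rate": 20},
]

for t in (10, 45, 75, 90):
    print(t, users_for_run_time(t, example_stages))
```

Lowering the `users` values or lengthening the stages is one way to adapt such a pattern to models with longer response times.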

In [None]:
locust_result_path = Path("results") / model_name
locust_result_path.mkdir(parents=True,exist_ok=True)

In [None]:
%%time
output_path = locust_result_path / f"{instance_type}_pt_{models_loaded}"  # capture the instance type, engine, and models loaded in file name
run_load_test(endpoint_name, use_case, model_name, models_loaded, output_path, print_stdout=True, n_procs=6, sample_payload=json.dumps(payload))

In [None]:
# import some utilities to analyze the results of the load test
from utils.viz_utils import get_summary_results, generate_summary_plots, generate_metrics_summary

In [None]:
%matplotlib inline

In [None]:
load_test_summary = get_summary_results(locust_result_path)

In [None]:
generate_summary_plots(load_test_summary)

## Benchmark model invocation times

In [None]:
def invoke_models_sequentially(models_loaded, full_models_loop_counter):
    """Invoke every model copy in order, repeating the full pass `full_models_loop_counter` times."""
    total_time = 0
    for _ in range(full_models_loop_counter):
        total_loop_time = 0
        for counter in range(models_loaded):
            st = time.time()
            target_model = f"{model_name}-v{counter}.tar.gz"
            print(f"invoking model {target_model}")
            response = runtime_sm_client.invoke_endpoint(
                EndpointName=endpoint_name,
                ContentType="application/octet-stream",
                Body=json.dumps(payload),
                TargetModel=target_model,
            )
            # read and parse the body so the timing covers the full response
            response = json.loads(response["Body"].read().decode("utf8"))
            output = response["outputs"][0]["data"]
            et = time.time()
            elapsed_time = et - st
            total_loop_time += elapsed_time
            total_time += elapsed_time
            print('Execution time:', elapsed_time, 'seconds')
        print(f'\n***Invoked models: {models_loaded}, Total loop time: {total_loop_time}, Average time per invocation: {total_loop_time/models_loaded}\n')

    print('\n------------------------\n')
    print(f'Number of invocations: {full_models_loop_counter*models_loaded}')
    print(f'Total time: {total_time}')
    print(f'Average time: {total_time/(full_models_loop_counter*models_loaded)}')

invoke_models_sequentially(models_loaded, 5)

## Benchmark invocation times with 2x the models

We add more model copies so that the total is twice the number of models currently loaded.
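
Since the endpoint was filled close to its GPU memory threshold, doubling the number of copies means they may not all fit in memory at once, and some invocations may have to wait for a model to be loaded again. The sketch below illustrates why a sequential pass is a hard case. It assumes, purely for illustration, a least-recently-used eviction policy and a fixed capacity counted in models; the actual SageMaker behavior may differ. The function name is hypothetical.

```python
from collections import OrderedDict


def count_cold_invocations(num_models, capacity, passes):
    """Count invocations that hit a model not currently in memory (assumed LRU eviction)."""
    loaded = OrderedDict()  # model id -> True, ordered from least to most recently used
    cold = 0
    for _ in range(passes):
        for model_id in range(num_models):
            if model_id in loaded:
                loaded.move_to_end(model_id)  # mark as most recently used
            else:
                cold += 1
                if len(loaded) >= capacity:
                    loaded.popitem(last=False)  # evict the least recently used model
                loaded[model_id] = True
    return cold


# Made-up sizes: memory holds 6 models, 5 sequential passes over all copies.
print(count_cold_invocations(num_models=6, capacity=6, passes=5))
print(count_cold_invocations(num_models=12, capacity=6, passes=5))
```

Under these assumptions, cycling through twice as many models as fit in memory makes every invocation a cold one, so the measured times would be dominated by model loading rather than inference.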

In [None]:
target_models_to_load = models_loaded*2
target_models_to_load

In [None]:
while models_loaded < target_models_to_load:
    # make a copy of the model
    !aws s3 cp {initial_model_path} {mme_path}{model_name}-v{models_loaded}.tar.gz
    models_loaded+=1

In [None]:
!aws s3 ls {mme_path}

In [None]:
invoke_models_sequentially(models_loaded, 5)

## Clean Up PyTorch Endpoint

In [None]:
delete_endpoint(sm_client, sm_model_name, endpoint_config_name, endpoint_name)
! aws s3 rm --recursive {mme_path}