# Deploy a pretrained PyTorch BERT model from HuggingFace on Amazon SageMaker with Neuron container with custom installations and inference.py script. 




## Overview

This follows the documentation on [PyTorch BERT model](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/src/examples/pytorch/byoc_sm_bert_tutorial/sagemaker_container_neuron.html). Although the inference and code looks the same there has been considerable changes made on the docker file to adapt to the current env variables and also the dependencies 

## Install Dependencies:

This tutorial requires the following pip packages:

- torch-neuron
- neuron-cc[tensorflow]
- transformers

In [2]:
!pip install --upgrade -q pip

In [4]:
%env TOKENIZERS_PARALLELISM=True #Supresses tokenizer warnings making errors easier to detect
!pip install --upgrade -q --no-cache-dir torch-neuron neuron-cc[tensorflow] torchvision torch --extra-index-url=https://pip.repos.neuron.amazonaws.com
!pip install --upgrade -q --no-cache-dir transformers



## Compile the model into an AWS Neuron optimized TorchScript

In [5]:
import torch
import torch_neuron

from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

In [6]:
# Build tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc", return_dict=False)

# Setup some example inputs
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

max_length=128
paraphrase = tokenizer.encode_plus(sequence_0, sequence_2, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")
not_paraphrase = tokenizer.encode_plus(sequence_0, sequence_1, max_length=max_length, padding='max_length', truncation=True, return_tensors="pt")

# Run the original PyTorch model on compilation exaple
paraphrase_classification_logits = model(**paraphrase)[0]

# Convert example inputs to a format that is compatible with TorchScript tracing
example_inputs_paraphrase = paraphrase['input_ids'], paraphrase['attention_mask'], paraphrase['token_type_ids']
example_inputs_not_paraphrase = not_paraphrase['input_ids'], not_paraphrase['attention_mask'], not_paraphrase['token_type_ids']

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/433 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]

In [7]:
%%time
# Run torch.neuron.trace to generate a TorchScript that is optimized by AWS Neuron
# This step may need 3-5 min
model_neuron = torch.neuron.trace(model, example_inputs_paraphrase, verbose=1, compiler_workdir='./compilation_artifacts')

INFO:Neuron:There are 3 ops of 1 different types in the TorchScript that are not compiled by neuron-cc: aten::embedding, (For more information see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/compiler/neuron-cc/neuron-cc-ops/neuron-cc-ops-pytorch.html)
INFO:Neuron:Number of arithmetic operators (pre-compilation) before = 565, fused = 548, percent fused = 96.99%


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


INFO:Neuron:Compiling function _NeuronGraph$695 with neuron-cc
INFO:Neuron:Compiling with command line: '/home/ec2-user/anaconda3/envs/pytorch_p310/bin/neuron-cc compile /home/ec2-user/SageMaker/Sagemaker-BYOC/compilation_artifacts/60/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /home/ec2-user/SageMaker/Sagemaker-BYOC/compilation_artifacts/60/graph_def.neff --io-config {"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Linear_5/aten_linear/Add:0"]} --verbose 1'
09/26/2023 05:07:39 PM INFO 20697 [root]: /home/ec2-user/anaconda3/envs/pytorch_p310/bin/neuron-cc compile /home/ec2-user/SageMaker/Sagemaker-BYOC/compilation_artifacts/60/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /home/ec2-user/SageMaker/Sagemaker-BYOC/compilation_artifacts/60/graph_def.neff --io-config '{"inputs": {"0:0": [[1, 128, 768], "float32"], "1:0": [[1, 1, 1, 128], "float32"]}, "outputs": ["Linear_5/aten_linear/Ad

09/26/2023 05:08:51 PM INFO [WalrusDriver.0]: max_allowed_parallelism=16
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: Running walrus pass: unroll
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: Input to unroll: modules=1 functions=1 allocs=1081 blocks=1 instructions=216
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: INFO (Unroll) Start unrolling at Tue Sep 26 17:08:52 2023
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: INFO (Unroll) DONE unrolling Tue Sep 26 17:08:52 2023
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: Instruction count after Unroll: 
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: Total count: 20452
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: Matmult: 13401
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: TensorScalarPtr: 1837
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: TensorCopy: 1530
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: TensorTensor: 1165
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: Activation: 959
09/26/2023 05:08:52 PM INFO [WalrusDriver.0]: Load: 513
09/26/2

Analyzing dependencies of sg00/Block1
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************


09/26/2023 05:08:54 PM INFO [WalrusDriver.0]: ru_maxrss:  3698mb (delta=0mb)
09/26/2023 05:08:54 PM INFO [WalrusDriver.0]: Walrus pass: anti_dependency_analyzer succeeded!
09/26/2023 05:08:54 PM INFO [WalrusDriver.0]: Output has 1 module(s), 1 function(s), 6138 memory location(s), 1 block(s), and 20482 instruction(s).
09/26/2023 05:08:54 PM INFO [WalrusDriver.0]: Running walrus pass: post_sched
09/26/2023 05:08:54 PM INFO [WalrusDriver.0]: Input to post_sched: modules=1 functions=1 allocs=6138 blocks=1 instructions=20482
09/26/2023 05:08:54 PM INFO [TheScheduler.0]: Start PosT ScheD 2 inferentia Tue Sep 26 17:08:54 2023
09/26/2023 05:08:55 PM INFO [TheScheduler.0]: Done  PosT ScheD Tue Sep 26 17:08:55 2023
09/26/2023 05:08:55 PM INFO [WalrusDriver.0]: ru_maxrss:  3698mb (delta=0mb)
09/26/2023 05:08:55 PM INFO [WalrusDriver.0]: Walrus pass: post_sched succeeded!
09/26/2023 05:08:55 PM INFO [WalrusDriver.0]: Output has 1 module(s), 1 function(s), 6138 memory location(s), 1 block(s), and 

09/26/2023 05:09:03 PM INFO 20697 [job.WalrusDriver.3]: IR signature: a108c70e6df278b5caf2b85790c7260858f3acf0333f4d04cf73480e97cb7f62 for sg00/walrus_bir.out.json
09/26/2023 05:09:03 PM INFO 20697 [job.WalrusDriver.3]: Job finished
09/26/2023 05:09:03 PM INFO 20697 [pipeline.compile.0]: Finished job job.WalrusDriver.3 with state 0
09/26/2023 05:09:03 PM INFO 20697 [pipeline.compile.0]: Starting job job.Backend.3 state state 0
09/26/2023 05:09:03 PM INFO 20697 [job.Backend.3]: Replay this job by calling: /home/ec2-user/anaconda3/envs/pytorch_p310/bin/neuron-cc compile --framework TENSORFLOW --state '{"model": ["/home/ec2-user/SageMaker/Sagemaker-BYOC/compilation_artifacts/60/graph_def.pb"], "tensormap": "tensor_map.json", "bir": "walrus_bir.out.json", "state_dir": "/home/ec2-user/SageMaker/Sagemaker-BYOC/compilation_artifacts/60/sg00", "state_id": "sg00"}' --pipeline Backend --enable-experimental-bir-backend
09/26/2023 05:09:03 PM INFO 20697 [job.Backend.3]: IR signature: de149e46e988b

CPU times: user 34.9 s, sys: 4.09 s, total: 39 s
Wall time: 2min 25s


You may inspect **model_neuron.graph** to see which part is running on CPU versus running on the accelerator. All native **aten** operators in the graph will be running on CPU.

In [8]:
# See  which part is running on CPU versus running on the accelerator.
print(model_neuron.graph)

graph(%self.1 : __torch__.torch_neuron.runtime.___torch_mangle_446.AwsNeuronGraphModule,
      %7 : Long(1, 128, strides=[128, 1], requires_grad=0, device=cpu),
      %tensor.1 : Long(1, 128, strides=[128, 1], requires_grad=0, device=cpu),
      %9 : Long(1, 128, strides=[128, 1], requires_grad=0, device=cpu)):
  %_NeuronGraph#60 : __torch__.torch_neuron.decorators.NeuronModuleV2 = prim::GetAttr[name="_NeuronGraph#60"](%self.1)
  %16 : int = prim::Constant[value=0]() # /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch_neuron/native_ops/aten.py:29:0
  %17 : int = prim::Constant[value=0]() # /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch_neuron/native_ops/aten.py:29:0
  %18 : int = prim::Constant[value=9223372036854775807]() # /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch_neuron/native_ops/aten.py:29:0
  %19 : int = prim::Constant[value=1]() # /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.

Save the compiled model, so it can be packaged and sent to S3.

In [9]:
# Save the TorchScript for later use
model_neuron.save('neuron_compiled_model.pt')

### Package the pre-trained model and upload it to S3

To make the model available for the SageMaker deployment, you will TAR the serialized graph and upload it to the default Amazon S3 bucket for your SageMaker session. 

In [10]:
# Now you'll create a model.tar.gz file to be used by SageMaker endpoint
!tar -czvf model.tar.gz neuron_compiled_model.pt

neuron_compiled_model.pt


In [11]:
import boto3
import time
from sagemaker.utils import name_from_base
import sagemaker

In [12]:
# upload model to S3
role = sagemaker.get_execution_role()
sess=sagemaker.Session()
region=sess.boto_region_name
bucket=sess.default_bucket()
sm_client=boto3.client('sagemaker')

In [None]:
model_key = '{}/model/model.tar.gz'.format('inf1_compiled_model')
model_path = 's3://{}/{}'.format(bucket, model_key)
boto3.resource('s3').Bucket(bucket).upload_file('model.tar.gz', model_key)
print("Uploaded model to S3:")
print(model_path)

## Build and Push the container

The following shell code shows how to build the container image using docker build and push the container image to ECR using docker push.
The Dockerfile in this example is available in the ***container*** folder.
Here's an example of the Dockerfile:



In [15]:
!cat container/Dockerfile

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuron:1.13.1-neuron-py310-sdk2.13.2-ubuntu20.04

# Install packages 
RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         python3-pip \
         python3-setuptools \
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/bin/python3 /usr/bin/python
RUN ln -s /usr/bin/pip3 /usr/bin/pip

RUN pip --no-cache-dir install transformers flask gunicorn
# CMD ["/usr/local/bin/entrypoint.sh"]

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

Before running the next cell, make sure your SageMaker IAM role has access to ECR. If not, you can attache the role `AmazonEC2ContainerRegistryPowerUser` to your IAM role ARN, which allows you to upload image layers to ECR.  

It takes 5 minutes to build docker images and upload image to ECR

In [None]:
%%sh

# The name of our algorithm
algorithm_name=neuron-inference-py36

cd container

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null


# Get the login command from ECR in order to pull down the SageMaker PyTorch image
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build  -t ${algorithm_name} . --build-arg REGION=${region}
docker tag ${algorithm_name} ${fullname}

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com
docker push ${fullname}

## Deploy Container and run inference based on the pretrained model

To deploy a pretrained PyTorch model, you'll need to use the PyTorch estimator object to create a PyTorchModel object and set a different entry_point.

You'll use the PyTorchModel object to deploy a PyTorchPredictor. This creates a SageMaker Endpoint -- a hosted prediction service that we can use to perform inference.

In [24]:
import sys

!{sys.executable} -m pip install Transformers



In [None]:
import os
import boto3
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

bucket = sess.default_bucket()
prefix = "inf1_compiled_model/model"

# Get container name in ECR
client=boto3.client('sts')
account=client.get_caller_identity()['Account']

my_session=boto3.session.Session()
region=my_session.region_name

algorithm_name="neuron-inference-py36"
ecr_image='{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account, region, algorithm_name)
print(ecr_image)

An implementation of *model_fn* is required for inference script.
We are going to implement our own **model_fn** and **predict_fn** for Hugging Face Bert, and use default implementations of **input_fn** and **output_fn** defined in sagemaker-pytorch-containers.

In this example, the inference script is put in ***code*** folder. Run the next cell to see it:


In [31]:
!pygmentize code/inference.py

[34mimport[39;49;00m [04m[36mos[39;49;00m[37m[39;49;00m
[34mimport[39;49;00m [04m[36mjson[39;49;00m[37m[39;49;00m
[34mimport[39;49;00m [04m[36mtensorflow[39;49;00m  [37m# to workaround a protobuf version conflict issue[39;49;00m[37m[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[37m[39;49;00m
[34mimport[39;49;00m [04m[36mtorch[39;49;00m[04m[36m.[39;49;00m[04m[36mneuron[39;49;00m[37m[39;49;00m
[34mfrom[39;49;00m [04m[36mtransformers[39;49;00m [34mimport[39;49;00m AutoTokenizer, AutoModelForSequenceClassification, AutoConfig[37m[39;49;00m
[37m[39;49;00m
JSON_CONTENT_TYPE = [33m'[39;49;00m[33mapplication/json[39;49;00m[33m'[39;49;00m[37m[39;49;00m
[37m[39;49;00m
[37m[39;49;00m
[34mdef[39;49;00m [32mmodel_fn[39;49;00m(model_dir):[37m[39;49;00m
    tokenizer_init = AutoTokenizer.from_pretrained([33m"[39;49;00m[33mbert-base-cased-finetuned-mrpc[39;49;00m[33m"[39;49;00m)[37m[39;49;00m
    model_file =os.pa

Path of compiled pretrained model in S3:

In [32]:
key = os.path.join(prefix, "model.tar.gz")
pretrained_model_data = "s3://{}/{}".format(bucket, key)
print(pretrained_model_data)

s3://sagemaker-us-west-2-265645771569/inf1_compiled_model/model/model.tar.gz


The model object is defined by using the SageMaker Python SDK's PyTorchModel and pass in the model from the estimator and the entry_point. The endpoint's entry point for inference is defined by model_fn as seen in the previous code block that prints out **inference.py**. The model_fn function will load the model and required tokenizer.

Note, **image_uri** must be user's own ECR images.

In [33]:
from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=pretrained_model_data,
    role=role,
    source_dir="code",
    framework_version="1.7.1",
    entry_point="inference.py",
    image_uri=ecr_image
)

# Let SageMaker know that we've already compiled the model via neuron-cc
pytorch_model._is_compiled_model = True

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint.

Here you will deploy the model to a single **ml.inf1.2xlarge** instance.
It may take 6-10 min to deploy.

In [34]:
%%time

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.inf1.2xlarge")

----------------!CPU times: user 10.9 s, sys: 1.22 s, total: 12.1 s
Wall time: 8min 44s


In [35]:
print(predictor.endpoint_name)

neuron-inference-py36-ml-inf1-2023-09-26-17-32-08-134


Since in the input_fn we declared that the incoming requests are json-encoded, we need to use a json serializer, to encode the incoming data into a json string. Also, we declared the return content type to be json string, we Need to use a json deserializer to parse the response.

In [36]:
predictor.serializer = sagemaker.serializers.JSONSerializer()
predictor.deserializer = sagemaker.deserializers.JSONDeserializer()

Using a list of sentences, now SageMaker endpoint is invoked to get predictions.

In [37]:
%%time
result = predictor.predict(
    [
        "Never allow the same bug to bite you twice.",
        "The best part of Amazon SageMaker is that it makes machine learning easy.",
    ]
)
print(result)

BERT says that "Never allow the same bug to bite you twice." and "The best part of Amazon SageMaker is that it makes machine learning easy." are not paraphrase
CPU times: user 14.4 ms, sys: 4.4 ms, total: 18.8 ms
Wall time: 162 ms


In [38]:
%%time
result = predictor.predict(
    [
        "The company HuggingFace is based in New York City",
        "HuggingFace's headquarters are situated in Manhattan",
    ]
)
print(result)

BERT says that "The company HuggingFace is based in New York City" and "HuggingFace's headquarters are situated in Manhattan" are paraphrase
CPU times: user 3.36 ms, sys: 255 µs, total: 3.62 ms
Wall time: 28.1 ms


## Benchmarking your endpoint

The following cells create a load test for your endpoint. You first define some helper functions: `inference_latency` runs the endpoint request, collects cliend side latency and any errors, `random_sentence` builds random to be sent to the endpoint.  

In [39]:
import numpy as np 
import datetime
import math
import time
import boto3   
import matplotlib.pyplot as plt
from joblib import Parallel, delayed
import numpy as np
from tqdm import tqdm
import random

Matplotlib is building the font cache; this may take a moment.


In [40]:
def inference_latency(model,*inputs):
    """
    infetence_time is a simple method to return the latency of a model inference.

        Parameters:
            model: torch model onbject loaded using torch.jit.load
            inputs: model() args

        Returns:
            latency in seconds
    """
    error = False
    start = time.time()
    try:
        results = model(*inputs)
    except:
        error = True
        results = []
    return {'latency':time.time() - start, 'error': error, 'result': results}

In [41]:
def random_sentence():
    
    s_nouns = ["A dude", "My mom", "The king", "Some guy", "A cat with rabies", "A sloth", "Your homie", "This cool guy my gardener met yesterday", "Superman"]
    p_nouns = ["These dudes", "Both of my moms", "All the kings of the world", "Some guys", "All of a cattery's cats", "The multitude of sloths living under your bed", "Your homies", "Like, these, like, all these people", "Supermen"]
    s_verbs = ["eats", "kicks", "gives", "treats", "meets with", "creates", "hacks", "configures", "spies on", "retards", "meows on", "flees from", "tries to automate", "explodes"]
    p_verbs = ["eat", "kick", "give", "treat", "meet with", "create", "hack", "configure", "spy on", "retard", "meow on", "flee from", "try to automate", "explode"]
    infinitives = ["to make a pie.", "for no apparent reason.", "because the sky is green.", "for a disease.", "to be able to make toast explode.", "to know more about archeology."]
    
    return (random.choice(s_nouns) + ' ' + random.choice(s_verbs) + ' ' + random.choice(s_nouns).lower() or random.choice(p_nouns).lower() + ' ' + random.choice(infinitives))

print([random_sentence(), random_sentence()])

['A sloth spies on a cat with rabies', 'My mom spies on this cool guy my gardener met yesterday']


The following cell creates `number_of_clients` concurrent threads to run `number_of_runs` requests. Once completed, a `boto3` CloudWatch client will query for the server side latency metrics for comparison.   

In [42]:
# Defining Auxiliary variables
number_of_clients = 2
number_of_runs = 1000
t = tqdm(range(number_of_runs),position=0, leave=True)

# Starting parallel clients
cw_start = datetime.datetime.utcnow()

results = Parallel(n_jobs=number_of_clients,prefer="threads")(delayed(inference_latency)(predictor.predict,[random_sentence(), random_sentence()]) for mod in t)
avg_throughput = t.total/t.format_dict['elapsed']

cw_end = datetime.datetime.utcnow() 

# Computing metrics and print
latencies = [res['latency'] for res in results]
errors = [res['error'] for res in results]
error_p = sum(errors)/len(errors) *100
p50 = np.quantile(latencies[-1000:],0.50) * 1000
p90 = np.quantile(latencies[-1000:],0.95) * 1000
p95 = np.quantile(latencies[-1000:],0.99) * 1000

print(f'Avg Throughput: :{avg_throughput:.1f}\n')
print(f'50th Percentile Latency:{p50:.1f} ms')
print(f'90th Percentile Latency:{p90:.1f} ms')
print(f'95th Percentile Latency:{p95:.1f} ms\n')
print(f'Errors percentage: {error_p:.1f} %\n')

# Querying CloudWatch
print('Getting Cloudwatch:')
cloudwatch = boto3.client('cloudwatch')
statistics=['SampleCount', 'Average', 'Minimum', 'Maximum']
extended=['p50', 'p90', 'p95', 'p100']

# Give 5 minute buffer to end
cw_end += datetime.timedelta(minutes=5)

# Period must be 1, 5, 10, 30, or multiple of 60
# Calculate closest multiple of 60 to the total elapsed time
factor = math.ceil((cw_end - cw_start).total_seconds() / 60)
period = factor * 60
print('Time elapsed: {} seconds'.format((cw_end - cw_start).total_seconds()))
print('Using period of {} seconds\n'.format(period))

cloudwatch_ready = False
# Keep polling CloudWatch metrics until datapoints are available
while not cloudwatch_ready:
  time.sleep(30)
  print('Waiting 30 seconds ...')
  # Must use default units of microseconds
  model_latency_metrics = cloudwatch.get_metric_statistics(MetricName='ModelLatency',
                                             Dimensions=[{'Name': 'EndpointName',
                                                          'Value': predictor.endpoint_name},
                                                         {'Name': 'VariantName',
                                                          'Value': "AllTraffic"}],
                                             Namespace="AWS/SageMaker",
                                             StartTime=cw_start,
                                             EndTime=cw_end,
                                             Period=period,
                                             Statistics=statistics,
                                             ExtendedStatistics=extended
                                             )
  # Should be 1000
  if len(model_latency_metrics['Datapoints']) > 0:
    print('{} latency datapoints ready'.format(model_latency_metrics['Datapoints'][0]['SampleCount']))
    side_avg = model_latency_metrics['Datapoints'][0]['Average'] / number_of_runs
    side_p50 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p50'] / number_of_runs
    side_p90 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p90'] / number_of_runs
    side_p95 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p95'] / number_of_runs
    side_p100 = model_latency_metrics['Datapoints'][0]['ExtendedStatistics']['p100'] / number_of_runs
    
    print(f'50th Percentile Latency:{side_p50:.1f} ms')
    print(f'90th Percentile Latency:{side_p90:.1f} ms')
    print(f'95th Percentile Latency:{side_p95:.1f} ms\n')

    cloudwatch_ready = True




100%|██████████| 1000/1000 [00:10<00:00, 98.21it/s]


Avg Throughput: :97.9

50th Percentile Latency:19.6 ms
90th Percentile Latency:22.2 ms
95th Percentile Latency:26.0 ms

Errors percentage: 0.0 %

Getting Cloudwatch:
Time elapsed: 310.21425 seconds
Using period of 360 seconds

Waiting 30 seconds ...
919.0 latency datapoints ready
50th Percentile Latency:13.5 ms
90th Percentile Latency:15.0 ms
95th Percentile Latency:15.8 ms



### Cleanup
Endpoints should be deleted when no longer in use, to avoid costs.

In [43]:
predictor.delete_endpoint(predictor.endpoint)

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
