
# Triton on SageMaker: serving a traced RoBERTa model with the PyTorch backend

This notebook shows how to take a RoBERTa model, create a traced (TorchScript) model from it, and serve it with the PyTorch backend for Triton.

Amazon SageMaker is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.

Now, NVIDIA Triton Inference Server can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.

This notebook was tested on SageMaker Studio with ml.g4dn.xlarge, which comes with 1 GPU, and with ml.m5.large, which is a CPU-only instance.

### Contents

* Introduction to NVIDIA Triton Server
* Set up the environment
* Basic: RoBERTa model
  * PyTorch: JIT trace the model and create a scripted model
  * PyTorch: Test the JIT traced model
  * PyTorch: Package the model files and upload them to S3
  * PyTorch: Create the SageMaker endpoint
  * PyTorch: Run inference
  * PyTorch: Inspect the predictions
  * PyTorch: Terminate the endpoint and clean up artifacts


### Introduction to NVIDIA Triton Server

NVIDIA Triton Inference Server was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.

Some key features of Triton are:

* Support for Multiple frameworks: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats.
* Model pipelines: Triton model ensemble represents a pipeline of one or more models or pre/post-processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.
* Concurrent model execution: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.
* Dynamic batching: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.
* Diverse CPUs and GPUs: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.
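
To make the dynamic batching idea concrete, here is a toy sketch in plain Python. It only illustrates grouping pending requests, in arrival order, into batches of at most `max_batch_size`. The function name `form_batches` is hypothetical, and this is not Triton's actual scheduler, which also takes queue delay and preferred batch sizes into account.

```python
from typing import List, Sequence


def form_batches(pending: Sequence[str], max_batch_size: int) -> List[List[str]]:
    """Group pending request IDs, in arrival order, into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [
        list(pending[i : i + max_batch_size])
        for i in range(0, len(pending), max_batch_size)
    ]


# Five hypothetical requests waiting in the queue
requests = ["req-0", "req-1", "req-2", "req-3", "req-4"]
batches = form_batches(requests, max_batch_size=2)
print(batches)
```

Each inner list stands for one forward pass of the model; the client still sees one response per request.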

Note: This initial release of NVIDIA Triton on SageMaker will only support a single model. Future releases will have multi-model support. A minimal config.pbtxt configuration file is required in the model artifacts. This release doesn't support inferring the model config automatically.

### Set up the environment

Install the dependencies required to package the model and run inference with Triton server.

Also define the IAM role that gives SageMaker access to the model artifacts and the NVIDIA Triton ECR image.

The purpose of this notebook is to show how to take a PyTorch model (here, RoBERTa), create a traced TorchScript model from it, and serve it with Triton using the PyTorch backend.

The other option is to use the Python backend, but in that case we lose some of the performance gains that come from compiling the model to a native format.




In [None]:
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

In [15]:
!pip install transformers[torch]



In [3]:
!pip install nvidia-pyindex
!pip install tritonclient[http]

!pip install -qU pip awscli boto3 sagemaker transformers


### Prepare the RoBERTa base model for Triton



In [71]:
!mkdir -p triton-serve-pt/roberta-traced
!mkdir -p triton-serve-pt/roberta-traced/1


!cd triton-serve-pt/roberta-traced/1 && rm -rf ".ipynb_checkpoints"
!cd triton-serve-pt/roberta-traced && rm -rf ".ipynb_checkpoints"
!cd triton-serve-pt && rm -rf ".ipynb_checkpoints"


In [78]:
!ls -alrt triton-serve-pt/roberta-traced/1

total 487364
drwxrwxr-x 2 ec2-user ec2-user      4096 Feb 21 16:22 .
-rw-rw-r-- 1 ec2-user ec2-user 499050915 Feb 21 16:41 model.pt
drwxrwxr-x 3 ec2-user ec2-user      4096 Feb 21 18:03 ..


In [16]:
%%writefile triton-serve-pt/roberta-traced/config.pbtxt
platform: "pytorch_libtorch"
max_batch_size: 32
input [
  {
    name: "INPUT__0"
    data_type: TYPE_INT32
    dims: [512]
  },
  {
    name: "INPUT__1"
    data_type: TYPE_INT32
    dims: [512]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [512, 768]
  },
  {
    name: "1634__1"
    data_type: TYPE_FP32
    dims: [768]
  }
]
instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
  preferred_batch_size: 32
}

Writing triton-serve-pt/roberta-traced/config.pbtxt
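
Because `max_batch_size` is greater than zero, the `dims` in this config describe a single item and the batch dimension is implicit, so a request tensor for `INPUT__0` or `INPUT__1` has shape `[batch, 512]`. The sketch below checks a request shape against those values. The helper names and the two constants are illustrative assumptions that mirror the config above; they are not part of any Triton API.

```python
from typing import List, Sequence

MAX_BATCH_SIZE = 32  # mirrors max_batch_size in config.pbtxt
INPUT_DIMS = [512]  # mirrors dims of INPUT__0 and INPUT__1


def expected_request_shape(batch: int, dims: Sequence[int]) -> List[int]:
    """Full tensor shape sent to the server: batch dimension first, then the per-item dims."""
    return [batch, *dims]


def check_request_shape(
    shape: Sequence[int],
    dims: Sequence[int] = INPUT_DIMS,
    max_batch_size: int = MAX_BATCH_SIZE,
) -> bool:
    """Return True if shape is [batch, *dims] with 1 <= batch <= max_batch_size."""
    if len(shape) != len(dims) + 1:
        return False
    batch, *item = shape
    return 1 <= batch <= max_batch_size and list(item) == list(dims)


print(expected_request_shape(1, INPUT_DIMS))
print(check_request_shape([1, 512]))
```

A shape such as `[512]` (missing the batch dimension) or `[33, 512]` (batch larger than `max_batch_size`) fails this check.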


### Run for Triton server

**Note**: Amazon SageMaker expects the model tarball to have a top-level directory named after the model. If `config.pbtxt` sets a `name` field, it must match this directory name. Below is the model directory structure used in this notebook:

```
roberta-traced
├── 1
│   └── model.pt
└── config.pbtxt
```
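
As a rough sketch of how this layout can be packaged and checked from Python instead of the shell, the example below builds a tarball with a single top-level directory and then lists its top-level entries. The helper names `package_model` and `top_level_entries` are hypothetical, and the demo runs on a throwaway directory with placeholder files rather than the real `model.pt`.

```python
import os
import tarfile
import tempfile


def package_model(model_dir: str, tar_path: str) -> None:
    """Create a gzip tarball whose single top-level entry is the model directory."""
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(model_dir, arcname=os.path.basename(os.path.normpath(model_dir)))


def top_level_entries(tar_path: str) -> set:
    """Return the set of top-level names inside the tarball."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return {name.split("/")[0] for name in tar.getnames()}


# Demo on a temporary directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "roberta-traced")
    os.makedirs(os.path.join(model_dir, "1"))
    for rel in ("config.pbtxt", os.path.join("1", "model.pt")):
        with open(os.path.join(model_dir, rel), "w") as f:
            f.write("placeholder")
    tar_path = os.path.join(tmp, "roberta-traced-v1.tar.gz")
    package_model(model_dir, tar_path)
    entries = top_level_entries(tar_path)
    print(entries)
```

If the set contains anything other than the model directory name, the archive was probably created from the wrong working directory.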

**Use the same tokenizer to generate test inputs as the one used when tracing the model (here, the RoBERTa tokenizer).**

### Create the RoBERTa model in TorchScript mode (`.pt` model)
Load the pretrained model and set the `torchscript=True` flag.

In [3]:
from transformers import AutoModel, AutoTokenizer

import torch

### Run a simple test for RoBERTa base

We run multiple tests:

* First, tokenize and then detokenize to make sure the values match.
* Then, run predictions with the original model.
* Then, run predictions with the traced model.
* Finally, check that the two sets of outputs match.
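
The final check can be sketched as an element-wise comparison with a tolerance. The version below is a minimal illustration on nested Python lists (for example, tensors converted with `.tolist()`); the helper names and the `1e-4` tolerance are assumptions, and with PyTorch tensors `torch.allclose` would be the more direct choice.

```python
def max_abs_diff(a, b) -> float:
    """Largest element-wise absolute difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if not isinstance(b, (list, tuple)) or len(a) != len(b):
            raise ValueError("shape mismatch between the two outputs")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def outputs_match(a, b, atol: float = 1e-4) -> bool:
    """True if every pair of corresponding elements differs by at most atol."""
    return max_abs_diff(a, b) <= atol


# Small made-up values standing in for model outputs
original = [[0.10, 0.20], [0.30, 0.40]]
traced = [[0.10, 0.20001], [0.30, 0.40]]
print(outputs_match(original, traced))
```

A tolerance is used rather than exact equality because the traced graph may reorder floating-point operations.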

### Prepare some dummy inputs for tracing

In [17]:
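# Tokenizing input text. The example text is BERT-style: RoBERTa uses <s>, </s> and <mask>,
# so "[CLS]" and "[SEP]" are split into sub-tokens and "[MASK]" maps to the unknown token.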
tokenizer = AutoTokenizer.from_pretrained("roberta-large")

text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
print(f"BERT:Tokenized:Text={tokenized_text}:::")

# Masking one of the input tokens
masked_index = 8
tokenized_text[masked_index] = "[MASK]"
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(f"BERT:indexed_tokens:={indexed_tokens}::")

# Segment IDs (carried over from the BERT example; these tensors are not used for the trace below)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Creating a dummy input
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

BERT:Tokenized:Text=['[', 'CL', 'S', ']', 'ĠWho', 'Ġwas', 'ĠJim', 'ĠH', 'enson', 'Ġ?', 'Ġ[', 'SE', 'P', ']', 'ĠJim', 'ĠH', 'enson', 'Ġwas', 'Ġa', 'Ġpupp', 'ete', 'er', 'Ġ[', 'SE', 'P', ']']:::
BERT:indexed_tokens:=[10975, 7454, 104, 742, 3394, 21, 2488, 289, 3, 17487, 646, 3388, 510, 742, 2488, 289, 13919, 21, 10, 32986, 9306, 254, 646, 3388, 510, 742]::


In [7]:
dummy_inputs["input_ids"], dummy_inputs["attention_mask"]

(tensor([[ 101, 5672, 2033, 2011, 2151, 3793, 2017, 1005, 1040, 2066, 1012,  102]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))

In [20]:
# RoBERTa: load the pretrained model and trace it
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

# AutoModel loads the base encoder without a task-specific head
tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # or roberta-large
model = AutoModel.from_pretrained("roberta-base", torchscript=True)  # or roberta-large
model = model.eval()  # disable dropout before tracing

seq_len = 512

# Example inputs for tracing. Note that padding=True pads only to the longest
# sequence in the batch, so this example is shorter than seq_len.
text = "Replace me by any text you'd like."
dummy_inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=seq_len,
    padding=True,
    truncation=True,
)
print(dummy_inputs.keys())

# Create the trace on CPU; inputs are passed positionally as (input_ids, attention_mask)
traced_model = torch.jit.trace(
    model, [dummy_inputs["input_ids"], dummy_inputs["attention_mask"]]
)

# Save the traced model as version 1 in the Triton model repository layout
torch.jit.save(traced_model, "./triton-serve-pt/roberta-traced/1/model.pt")

print("Saved {}".format(traced_model))

Using cuda device


Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


dict_keys(['input_ids', 'attention_mask'])
Saved RobertaModel(
  original_name=RobertaModel
  (embeddings): RobertaEmbeddings(
    original_name=RobertaEmbeddings
    (word_embeddings): Embedding(original_name=Embedding)
    (position_embeddings): Embedding(original_name=Embedding)
    (token_type_embeddings): Embedding(original_name=Embedding)
    (LayerNorm): LayerNorm(original_name=LayerNorm)
    (dropout): Dropout(original_name=Dropout)
  )
  (encoder): RobertaEncoder(
    original_name=RobertaEncoder
    (layer): ModuleList(
      original_name=ModuleList
      (0): RobertaLayer(
        original_name=RobertaLayer
        (attention): RobertaAttention(
          original_name=RobertaAttention
          (self): RobertaSelfAttention(
            original_name=RobertaSelfAttention
            (query): Linear(original_name=Linear)
            (key): Linear(original_name=Linear)
            (value): Linear(original_name=Linear)
            (dropout): Dropout(original_name=Dropout)
    

In [21]:
model

RobertaModel(
  (embeddings): RobertaEmbeddings(
    (word_embeddings): Embedding(50265, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (token_type_embeddings): Embedding(1, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): RobertaEncoder(
    (layer): ModuleList(
      (0): RobertaLayer(
        (attention): RobertaAttention(
          (self): RobertaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): RobertaSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Drop

#### Test various tokenizer encoding methods

In [22]:
tokenizer(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    padding="max_length",
    max_length=64,
)

{'input_ids': [0, 565, 3961, 261, 96, 23861, 30472, 1639, 10, 3613, 8, 3543, 4047, 8663, 11162, 2472, 29854, 13, 258, 39076, 8, 37658, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [23]:
encoded_tokens = tokenizer.encode_plus(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    add_special_tokens=True,  # add RoBERTa's <s> and </s> tokens
    max_length=512,
    padding="max_length",  # pad to max_length
    truncation=True,  # truncate longer inputs explicitly
)
# encoded_tokens


In [26]:
[encoded_input["input_ids"], encoded_input["attention_mask"]]

[tensor([[  101, 13012,  2669, 28937,  8241,  3640,  1037,  6112,  1998,  3341,
           1999,  7512,  2368,  6129,  5576, 23569, 27605,  5422,  2005,  2119,
          17368,  2015,  1998, 14246,  2271,  1012,   102]]),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1]])]

### Test the HuggingFace and then the scripted model locally

In [30]:
import torch

encoded_input = tokenizer(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    return_tensors="pt",
    add_special_tokens=True,  # add RoBERTa's <s> and </s> tokens
    max_length=512,
    padding="max_length",  # pad to max_length
    truncation=True,  # truncate longer inputs
)
# Run the original (untraced) model
unscripted_output = model(
    **encoded_input,
    return_dict=True,
    output_attentions=False,
    output_hidden_states=False,
)

# Shape of the last hidden state: (batch, sequence length, hidden size)
unscripted_output[0].shape

torch.Size([1, 512, 768])

#### Now test the traced model, which returns a tuple of tensors

In [33]:
import torch

encoded_input = tokenizer(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    return_tensors="pt",
    add_special_tokens=True,  # add RoBERTa's <s> and </s> tokens
    max_length=512,
    padding="max_length",  # pad to max_length
    truncation=True,  # truncate longer inputs
)
# The traced model takes the input IDs and the attention mask as positional arguments
traced_output = traced_model(
    encoded_input["input_ids"], encoded_input["attention_mask"]
)

# First output: last hidden state; second output: pooled output
print(traced_output[0].shape)
print(traced_output[1].shape)

torch.Size([1, 512, 768])
torch.Size([1, 768])


In [35]:
unscripted_output[0]

tensor([[[-0.0381,  0.1347, -0.0798,  ..., -0.0984, -0.0382, -0.0533],
         [-0.0080, -0.0298,  0.0424,  ...,  0.1234,  0.0331,  0.1715],
         [-0.1365,  0.1032,  0.0484,  ..., -0.0449, -0.1060, -0.0534],
         ...,
         [ 0.0220,  0.1821, -0.0217,  ..., -0.1113, -0.0644, -0.0289],
         [ 0.0220,  0.1821, -0.0217,  ..., -0.1113, -0.0644, -0.0289],
         [ 0.0220,  0.1821, -0.0217,  ..., -0.1113, -0.0644, -0.0289]]],
       grad_fn=<NativeLayerNormBackward0>)

In [36]:
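# The model outputs are float embeddings, not token IDs, so they cannot be decoded
# back to text; the call below fails with the TypeError shown.
tokenizer.batch_decode(unscripted_output[1])[0]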

TypeError: argument 'ids': 'float' object cannot be interpreted as an integer

### Package the traced model and `config.pbtxt` into a tarball and upload it



In [37]:
tar_file_name = "roberta-traced-v1.tar.gz"

In [38]:
!cd triton-serve-pt && tar --exclude=".git" --exclude=".gitattributes" --exclude="model.tar.gz" --exclude="*.bin" --exclude "*.tar" --exclude "*.ipynb_checkpoints"  -zcvf {tar_file_name} roberta-traced

roberta-traced/
roberta-traced/config.pbtxt
roberta-traced/1/
roberta-traced/1/model.pt


**Upload the model tarball to S3**

In [39]:
import sagemaker
from sagemaker import get_execution_role, Session, image_uris
from sagemaker.utils import name_from_base
import boto3

region = boto3.Session().region_name
session = sagemaker.Session()
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")

In [41]:
s3_model_path_triton = sagemaker.s3.S3Uploader().upload(
    local_path=f"./triton-serve-pt/{tar_file_name}",
    desired_s3_uri="s3://sagemaker-us-east-1-425576326687/mme-roberta-benchmark/roberta-large",
    sagemaker_session=session,
)
s3_mme_model_path = (
    "s3://sagemaker-us-east-1-425576326687/mme-roberta-benchmark/roberta-large/"
)
print(s3_model_path_triton)
print(s3_mme_model_path)

s3://sagemaker-us-east-1-425576326687/mme-roberta-benchmark/roberta-large/roberta-traced-v1.tar.gz
s3://sagemaker-us-east-1-425576326687/mme-roberta-benchmark/roberta-large/


#### Deploy the model on a Triton endpoint

**Triton Image download and sagemaker variables**

In [43]:
from sagemaker import get_execution_role, Session, image_uris
import boto3
from sagemaker.utils import name_from_base

region = boto3.Session().region_name
role = get_execution_role()
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = (
    "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.10-py3".format(
        account_id=account_id_map[region], region=region, base=base
    )
)
print(triton_image_uri)

785573368785.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:22.10-py3


**Model creation**

In [75]:
endpoint_name = name_from_base(f"roberta-base-")
print(endpoint_name)

container_p5 = {
    "Image": triton_image_uri,
    "ModelDataUrl": s3_mme_model_path,
    "Mode": "MultiModel",
    "Environment": {
        #'SAGEMAKER_PROGRAM' : 'inference.py',
        #'SAGEMAKER_SUBMIT_DIRECTORY' : 'code',
        #'SAGEMAKER_TRITON_DEFAULT_MODEL_NAME': 'bert-uc',
        # "SAGEMAKER_TRITON_BATCH_SIZE": "16",
        "SAGEMAKER_TRITON_MAX_BATCH_DELAY": "1000",
        "SAGEMAKER_TRITON_SHM_DEFAULT_BYTE_SIZE": "16777216000",  # "16777216000",
        "SAGEMAKER_TRITON_SHM_GROWTH_BYTE_SIZE": "104857600",
    },
}
create_model_response = sm_client.create_model(
    ModelName=endpoint_name, ExecutionRoleArn=role, PrimaryContainer=container_p5
)
print(create_model_response)

roberta-base--2023-02-21-18-14-38-943
{'ModelArn': 'arn:aws:sagemaker:us-east-1:425576326687:model/roberta-base--2023-02-21-18-14-38-943', 'ResponseMetadata': {'RequestId': 'f4c48a74-64b0-4dfa-8a19-fea9d8d27007', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'f4c48a74-64b0-4dfa-8a19-fea9d8d27007', 'content-type': 'application/x-amz-json-1.1', 'content-length': '99', 'date': 'Tue, 21 Feb 2023 18:14:39 GMT'}, 'RetryAttempts': 0}}


**Endpoint config**

In [76]:
# Sampling percentage. Choose an integer value between 0 and 100
initial_sampling_percentage = 10

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.8xlarge",  # "ml.g4dn.xlarge", "ml.g4dn.4xlarge"
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": endpoint_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint Config Arn: arn:aws:sagemaker:us-east-1:425576326687:endpoint-config/roberta-base--2023-02-21-18-14-38-943


**Endpoint**

In [77]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-east-1:425576326687:endpoint/roberta-base--2023-02-21-18-14-38-943


In [79]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("SINGLE:Model:endpoint:Triton:Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Single:model:triton:Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Single:model:triton:Status: " + status)

SINGLE:Model:endpoint:Triton:Status: Creating
Single:model:triton:Status: Creating
Single:model:triton:Status: Creating
Single:model:triton:Status: Creating
Single:model:triton:Status: InService
Arn: arn:aws:sagemaker:us-east-1:425576326687:endpoint/roberta-base--2023-02-21-18-14-38-943
Single:model:triton:Status: InService
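
The loop above keeps polling for as long as the status is `Creating`. A slightly more defensive variant, sketched below, adds a timeout and raises if the endpoint ends in any status other than `InService`. The function `wait_for_endpoint` and its parameters are hypothetical; the `describe` callable is injected so the sketch can be run with a stub instead of a real AWS call. boto3 also provides an `endpoint_in_service` waiter on the SageMaker client that may serve the same purpose.

```python
import time
from typing import Callable, Dict


def wait_for_endpoint(
    describe: Callable[[], Dict],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the endpoint leaves 'Creating'; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            raise RuntimeError(f"Endpoint ended in status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError("Endpoint was still 'Creating' at the deadline")
        sleep(poll_seconds)


# Stubbed demo: the status sequence is made up for illustration
_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(
    lambda: {"EndpointStatus": next(_statuses)}, sleep=lambda seconds: None
)
print(final_status)

# With a real client, the call would look like:
# wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))
```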


**Now invoke the endpoint**

* The first option is JSON.
* The second option is the binary format with a JSON header.

In [84]:
import tritonclient.http as httpclient
import numpy as np
from tritonclient.utils import np_to_triton_dtype


def tokenize_text(text, enc, max_length=512):
    """Tokenize text with the given tokenizer, padded to max_length."""
    print(f"Tokenize:text:why??::max_length={max_length}::Tokenizer={enc}")
    encoded_text = enc(text, padding="max_length", max_length=max_length)
    return encoded_text["input_ids"], encoded_text["attention_mask"]


def prepare_tensor(name, input_d):
    """Wrap a NumPy array in a Triton InferInput with matching shape and dtype."""
    tensor = httpclient.InferInput(
        name, input_d.shape, np_to_triton_dtype(input_d.dtype)
    )
    tensor.set_data_from_numpy(input_d)
    return tensor


def prepare_roberta_2_inputs(input0, attention_0):
    """Build a binary request body from the input IDs and the attention mask."""
    # Convert PyTorch tensors (or lists) to int32 NumPy arrays, as declared in config.pbtxt
    input0_data = np.array(input0, dtype=np.int32)
    input_attention_data = np.array(attention_0, dtype=np.int32)

    # Input and output names must match config.pbtxt
    inputs = [
        prepare_tensor("INPUT__0", input0_data),
        prepare_tensor("INPUT__1", input_attention_data),
    ]

    outputs = [
        httpclient.InferRequestedOutput("OUTPUT__0", binary_data=True),
        httpclient.InferRequestedOutput("1634__1", binary_data=True),
    ]
    (
        request_body,
        header_length,
    ) = httpclient.InferenceServerClient.generate_request_body(inputs, outputs=outputs)
    return request_body, header_length


def get_decoded_text(tensors_tokens, enc):
    """Decode a batch of token IDs back to text with the given tokenizer."""
    return enc.batch_decode(tensors_tokens)[0]

**Run the JSON invocation**

In [86]:
%%time

import json

max_seq_length = 512
text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
print(
    f"Leverage the Tokenizer={tokenizer}::max_seq_length={max_seq_length}:: create above when creating the model "
)

input_ids, attention_mask = tokenize_text(
    text_triton, tokenizer, max_length=max_seq_length
)

payload = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, max_seq_length],
            "datatype": "INT32",
            "data": input_ids,
        },
        {
            "name": "INPUT__1",
            "shape": [1, max_seq_length],
            "datatype": "INT32",
            "data": attention_mask,
        },
    ]
}

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="roberta-traced-v1.tar.gz",
)

output = json.loads(response["Body"].read().decode("utf8"))

print(output.keys())

Leverage the Tokenizer=RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})::max_seq_length=512:: create above when creating the model 
Tokenize:text:why??::max_length=512::Tokenizer=RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})
dict_keys(['model_name', 'model_version', 'outputs'])
CPU times: user 182 ms, sys: 21.1 ms, total: 203 ms
Wall time: 7.24 s
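
The same JSON payload is built again in the stress-test section, so a small helper can avoid the duplication. The sketch below is one possible shape for it: `build_json_payload` is a hypothetical name, it assumes a single sequence per request (batch size 1), and the input names mirror `config.pbtxt`. The token IDs in the demo are a shortened made-up example.

```python
import json
from typing import Dict, List


def build_json_payload(input_ids: List[int], attention_mask: List[int]) -> Dict:
    """Build the JSON inference request for one sequence (batch size 1)."""
    if len(input_ids) != len(attention_mask):
        raise ValueError("input_ids and attention_mask must have the same length")
    shape = [1, len(input_ids)]
    return {
        "inputs": [
            {"name": "INPUT__0", "shape": shape, "datatype": "INT32", "data": input_ids},
            {"name": "INPUT__1", "shape": shape, "datatype": "INT32", "data": attention_mask},
        ]
    }


# Shortened example; real inputs in this notebook are padded to 512 tokens
example_payload = build_json_payload([0, 565, 2], [1, 1, 1])
body = json.dumps(example_payload)
print(body)
```

The resulting `body` string is what would be passed as `Body` to `invoke_endpoint`.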


In [87]:
output["outputs"][0]["data"]

[-0.038062676787376404,
 0.13473781943321228,
 -0.0798252746462822,
 -0.07597241550683975,
 0.07719936966896057,
 -0.05807043984532356,
 -0.060758646577596664,
 -0.008639145642518997,
 0.050245169550180435,
 -0.06938910484313965,
 -0.035817358642816544,
 -0.012312027625739574,
 0.027301158756017685,
 -0.029667401686310768,
 0.04819143936038017,
 -0.006481475196778774,
 -0.07737714797258377,
 -0.02181418053805828,
 -0.017108870670199394,
 -0.05121561512351036,
 -0.09797798097133636,
 0.05809696018695831,
 -0.036770690232515335,
 0.10161531716585159,
 -0.021414315328001976,
 0.08224329352378845,
 0.13604256510734558,
 0.1126348003745079,
 -0.009445725940167904,
 0.014347495511174202,
 0.004373329691588879,
 -0.06697972118854523,
 0.06279001384973526,
 -0.004792243242263794,
 0.030412670224905014,
 0.0675869733095169,
 -0.012920498847961426,
 -0.01566997542977333,
 -0.02670985646545887,
 -0.015797285363078117,
 0.018117519095540047,
 0.2294064313173294,
 0.0520874485373497,
 -0.0252434276

In [None]:
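# The output is a flat list of float embeddings, not token IDs, so it cannot be decoded to text.
# Restore the tensor shape reported in the response instead.
last_hidden_state = np.array(output["outputs"][0]["data"], dtype=np.float32).reshape(
    output["outputs"][0]["shape"]
)
last_hidden_state.shape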

**Invoke using the Binary Format**

In [90]:
encoded_input = tokenizer(
    "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs.",
    return_tensors="pt",
    add_special_tokens=True,  # add RoBERTa's <s> and </s> tokens
    max_length=512,
    padding="max_length",  # pad to max_length
    truncation=True,  # truncate longer inputs
)
# The traced model expects both the input IDs and the attention mask
input_ids = encoded_input["input_ids"]
attention_mask = encoded_input["attention_mask"]

triton_request_body, triton_header_length = prepare_roberta_2_inputs(
    input_ids, attention_mask
)

In [95]:
response_binary = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        triton_header_length
    ),
    Body=triton_request_body,
    TargetModel=f"{tar_file_name}",
)
print(response_binary)

# Parse the JSON header size from the content type of the binary response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response_binary["ContentType"][len(header_length_prefix) :]

try:
    # Read the response body; header_length separates the JSON header from the binary data
    result = httpclient.InferenceServerClient.parse_response_body(
        response_binary["Body"].read(), header_length=int(header_length_str)
    )
    output0_data = result.as_numpy("1634__1")
    output1_data = result.as_numpy("OUTPUT__0")
    print(output0_data)
    print(output1_data)
except Exception as e:
    print(f"Error in parsing response -- {e}")

{'ResponseMetadata': {'RequestId': '62cb3f1a-f045-43a0-8e32-3337dd447d8d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '62cb3f1a-f045-43a0-8e32-3337dd447d8d', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 21 Feb 2023 18:35:01 GMT', 'content-type': 'application/vnd.sagemaker-triton.binary+json;json-header-size=274', 'content-length': '1576210'}, 'RetryAttempts': 0}, 'ContentType': 'application/vnd.sagemaker-triton.binary+json;json-header-size=274', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7f97e0c2ddd0>}
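
In the binary format, the response `ContentType` carries the length of the JSON header that precedes the raw tensor bytes, and that length is needed to split the body correctly. The sketch below isolates that parsing step; `parse_json_header_size` is a hypothetical helper, and the sample string is the content type printed in the response above.

```python
HEADER_PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def parse_json_header_size(content_type: str) -> int:
    """Extract the JSON header length from a SageMaker Triton binary content type."""
    if not content_type.startswith(HEADER_PREFIX):
        raise ValueError(f"Unexpected content type: {content_type}")
    return int(content_type[len(HEADER_PREFIX) :])


sample = "application/vnd.sagemaker-triton.binary+json;json-header-size=274"
header_size = parse_json_header_size(sample)
print(header_size)
```

The returned integer can then be passed as `header_length` when parsing the response body, using the content type of the binary response itself rather than that of an earlier JSON call.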


### Stress test the endpoint

In [61]:
model_name = "roberta-base"
print(s3_model_path_triton)
print(s3_mme_model_path)
print(model_name)

s3://sagemaker-us-east-1-425576326687/mme-roberta-benchmark/roberta-large/roberta-traced-v1.tar.gz
s3://sagemaker-us-east-1-425576326687/mme-roberta-benchmark/roberta-large/
roberta-base


In [89]:
text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
print(
    f"Leverage the Tokenizer={tokenizer}::max_seq_length={max_seq_length}:: create above when creating the model "
)

input_ids, attention_mask = tokenize_text(
    text_triton, tokenizer, max_length=max_seq_length
)

payload = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, max_seq_length],
            "datatype": "INT32",
            "data": input_ids,
        },
        {
            "name": "INPUT__1",
            "shape": [1, max_seq_length],
            "datatype": "INT32",
            "data": attention_mask,
        },
    ]
}

Leverage the Tokenizer=RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})::max_seq_length=512:: create above when creating the model 
Tokenize:text:why??::max_length=512::Tokenizer=RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})


In [None]:
models_loaded = 0
memory_utilization_threshold = 0.9  # kept for memory tracking; not populated in this loop
memory_utilization_history = []
max_models_test = 10
mme_prefix = s3_mme_model_path.rstrip("/")  # avoid a double slash in the S3 key
while models_loaded < max_models_test:
    # make a copy of the model under a new name
    !aws s3 cp {s3_model_path_triton} {mme_prefix}/{model_name}-v{models_loaded}.tar.gz

    # make an inference request to load the model into memory
    response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",
        Body=json.dumps(payload),
        TargetModel=f"{model_name}-v{models_loaded}.tar.gz",
    )

    models_loaded += 1

    # memory utilization is not measured here, so only the model count is reported
    print(f"loaded {models_loaded} models")

### Clean up

In [74]:
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_name)
sm_client.delete_model(ModelName=endpoint_name)

{'ResponseMetadata': {'RequestId': '02dbb082-eb76-417e-bd6a-9ed782b95f18',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '02dbb082-eb76-417e-bd6a-9ed782b95f18',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Tue, 21 Feb 2023 18:13:34 GMT'},
  'RetryAttempts': 0}}