# Multiple Ensembles with GPU models using Amazon SageMaker in MME mode

In this notebook, we re-use a couple of examples listed under the parent folder ../ensemble/ and deploy them using MME (multi-model endpoint) mode. To keep the example self-contained and for clarity, the relevant parts of those notebooks are repeated here.

#### A. TF+DALI Ensemble

In this ensemble, the DALI pipeline pre-processes the input using CPU. The output of this model is fed into the TF Inception model, which runs on GPU.

#### B. TRT+Python Ensemble

In this ensemble, a TRT model (BERT) and the post-process Python model run on GPU, whereas the pre-process model runs on CPU.

#### In both examples, one or more GPU models are executed on the same host, and each example is an ensemble of multiple models working together as a pipeline that behaves like a single model

## 1. Setup

In [2]:
!pip install -qU pip awscli boto3 sagemaker --quiet
!pip install nvidia-pyindex --quiet
!pip install tritonclient[http] --quiet

In [3]:
# Note: The step below installs the CUDA build of NVIDIA DALI. You need to execute this notebook on a GPU-based instance.
!pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com, https://pypi.ngc.nvidia.com, https://developer.download.nvidia.com/compute/redist


In [4]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import nvidia.dali as dali
import nvidia.dali.types as types

In [5]:
# SageMaker variables
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
role = get_execution_role()

# Other Variables
instance_type = "ml.g4dn.4xlarge"
sm_model_name = "triton-tf-dali-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
endpoint_config_name = "triton-tf-dali-ensemble-" + time.strftime(
    "%Y-%m-%d-%H-%M-%S", time.gmtime()
)
endpoint_name = "triton-tf-dali-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())



## 2. TF+DALI Ensemble

In [6]:
!mkdir -p model_repository/inception_graphdef/1
!mkdir -p model_repository/dali/1
!mkdir -p model_repository/ensemble_dali_inception/1

In [7]:
!wget -O /tmp/inception_v3_2016_08_28_frozen.pb.tar.gz \
     https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz

--2023-05-12 15:09:52--  https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.14.240, 142.250.69.208, 142.250.217.80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.14.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 88668554 (85M) [application/gzip]
Saving to: ‘/tmp/inception_v3_2016_08_28_frozen.pb.tar.gz’


2023-05-12 15:09:54 (46.7 MB/s) - ‘/tmp/inception_v3_2016_08_28_frozen.pb.tar.gz’ saved [88668554/88668554]



In [8]:
!(cd /tmp && tar xzf inception_v3_2016_08_28_frozen.pb.tar.gz)
!mv /tmp/inception_v3_2016_08_28_frozen.pb model_repository/inception_graphdef/1/model.graphdef

Write model config for ensemble

In [9]:
%%writefile model_repository/ensemble_dali_inception/config.pbtxt
name: "ensemble_dali_inception"
platform: "ensemble"
max_batch_size: 256
input [
  {
    name: "INPUT"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 1001 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali"
      model_version: -1
      input_map {
        key: "DALI_INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "inception_graphdef"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "InceptionV3/Predictions/Softmax"
        value: "OUTPUT"
      }
    }
  ]
}

Writing model_repository/ensemble_dali_inception/config.pbtxt
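The `input_map` and `output_map` entries are what connect the steps: the key is the tensor name inside the composing model, and the value is the ensemble-level tensor name. The sketch below mirrors the config above as plain Python data and checks that every tensor a step consumes has been produced earlier. `check_ensemble_wiring` and `STEPS` are hypothetical names used only for illustration, and the check assumes the steps are listed in data-flow order, which Triton itself does not require.

```python
# Plain-Python mirror of the ensemble_scheduling block above (illustrative only).
STEPS = [
    {
        "model_name": "dali",
        "input_map": {"DALI_INPUT_0": "INPUT"},
        "output_map": {"DALI_OUTPUT_0": "preprocessed_image"},
    },
    {
        "model_name": "inception_graphdef",
        "input_map": {"input": "preprocessed_image"},
        "output_map": {"InceptionV3/Predictions/Softmax": "OUTPUT"},
    },
]


def check_ensemble_wiring(ensemble_inputs, ensemble_outputs, steps):
    """Return a list of wiring problems; an empty list means every tensor is connected."""
    available = set(ensemble_inputs)  # ensemble-level tensors produced so far
    problems = []
    for step in steps:
        for tensor in step["input_map"].values():
            if tensor not in available:
                problems.append(
                    f"{step['model_name']}: input tensor '{tensor}' is never produced"
                )
        available.update(step["output_map"].values())
    for tensor in ensemble_outputs:
        if tensor not in available:
            problems.append(f"ensemble output '{tensor}' is never produced")
    return problems


print(check_ensemble_wiring(["INPUT"], ["OUTPUT"], STEPS))
```

A misspelled intermediate tensor name (for example in the second step's `input_map`) would show up as a non-empty problem list.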


Model config for DALI backend

In [10]:
%%writefile model_repository/dali/config.pbtxt
name: "dali"
backend: "dali"
max_batch_size: 256
input [
  {
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 299, 299, 3 ]
  }
]
parameters: [
  {
    key: "num_threads"
    value: { string_value: "12" }
  }
]

Writing model_repository/dali/config.pbtxt


Model config for inception, using GPU

In [11]:
%%writefile model_repository/inception_graphdef/config.pbtxt
name: "inception_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 256
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
    label_filename: "inception_labels.txt"
  }
]
instance_group [
    {
      kind: KIND_GPU
    }
]

Writing model_repository/inception_graphdef/config.pbtxt


Download inception_labels.txt

In [12]:
!aws s3 cp s3://sagemaker-sample-files/datasets/labels/inception_labels.txt model_repository/inception_graphdef/inception_labels.txt

download: s3://sagemaker-sample-files/datasets/labels/inception_labels.txt to model_repository/inception_graphdef/inception_labels.txt


Create DALI Pipeline

In [13]:
@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
    """Create a pipeline that decodes encoded images, resizes them to 299x299, and normalizes them."""
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT_0")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=299, resize_y=299)  # resize to the 299x299 input size declared in the model configs
    images = dali.fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="HWC",
        crop=(299, 299),  # crop to 299x299 (a no-op after the resize above)
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],  # per-channel mean on the 0-255 scale
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],  # per-channel standard deviation on the 0-255 scale
    )
    return images

pipe().serialize(filename="model_repository/dali/1/model.dali")

b'\x08\x01\x10\x03*P\n\x0eExternalSource\x1a\x15\n\x0cDALI_INPUT_0\x12\x03cpu\x18\x00"\x17\n\x06device\x12\x06string*\x03cpu@\x00*\x0cDALI_INPUT_00\x00*\x94\x01\n\x0fdecoders__Image\x12\x15\n\x0cDALI_INPUT_0\x12\x03cpu\x18\x00\x1a\x12\n\t__Image_1\x12\x03gpu\x18\x00"\x18\n\x0boutput_type\x12\x05int64 \x00@\x00"\x19\n\x06device\x12\x06string*\x05mixed@\x00"\x14\n\x08preserve\x12\x04bool0\x00@\x00*\t__Image_10\x01*\xa2\x01\n\x06Resize\x12\x12\n\t__Image_1\x12\x03gpu\x18\x00\x1a\x13\n\n__Resize_2\x12\x03gpu\x18\x00"\x18\n\x08resize_y\x12\x05float\x1d\x00\x80\x95C@\x00"\x18\n\x08resize_x\x12\x05float\x1d\x00\x80\x95C@\x00"\x17\n\x06device\x12\x06string*\x03gpu@\x00"\x14\n\x08preserve\x12\x04bool0\x00@\x00*\n__Resize_20\x02*\xd4\x03\n\x13CropMirrorNormalize\x12\x13\n\n__Resize_2\x12\x03gpu\x18\x00\x1a \n\x17__CropMirrorNormalize_3\x12\x03gpu\x18\x00"_\n\x03std\x12\x05float:\x19\n\telement 0\x12\x05float\x1d{\x94iB@\x00:\x19\n\telement 1\x12\x05float\x1d\xe1zdB@\x00:\x19\n\telement 2\x12\x05
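The `mean` and `std` arguments passed to `crop_mirror_normalize` are expressed on the 0-255 pixel scale, and the operator computes `(pixel - mean) / std` per channel. As a rough illustration of that arithmetic, here is a plain-Python sketch; `normalize_pixel` is a hypothetical helper and not part of DALI.

```python
# Per-channel constants copied from the pipeline above (0-255 scale).
MEAN = [0.485 * 255, 0.456 * 255, 0.406 * 255]
STD = [0.229 * 255, 0.224 * 255, 0.225 * 255]


def normalize_pixel(rgb):
    """Return (value - mean) / std for each channel of a single RGB pixel."""
    return [(value - mean) / std for value, mean, std in zip(rgb, MEAN, STD)]


# A black pixel and a white pixel mark the two ends of the output range.
print(normalize_pixel([0, 0, 0]))
print(normalize_pixel([255, 255, 255]))
```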

Upload model artifacts to S3

In [15]:
!tar -cvzf model_tf_dali.tar.gz -C model_repository .
model_uri = sagemaker_session.upload_data(
    path="model_tf_dali.tar.gz", key_prefix="triton-mme-gpu-ensemble"
)
print("S3 model uri: {}".format(model_uri))

./
./ensemble_dali_inception/
./ensemble_dali_inception/1/
./ensemble_dali_inception/config.pbtxt
./inception_graphdef/
./inception_graphdef/inception_labels.txt
./inception_graphdef/1/
./inception_graphdef/1/model.graphdef
./inception_graphdef/config.pbtxt
./dali/
./dali/1/
./dali/1/model.dali
./dali/config.pbtxt
S3 model uri: s3://sagemaker-us-west-2-850464037171/triton-mme-gpu-ensemble/model_tf_dali.tar.gz
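Before uploading, it can be useful to confirm that every top-level directory in the archive is a model directory with its own `config.pbtxt`. The following is a minimal sketch using only the standard library; `models_in_archive` is a hypothetical helper name.

```python
import tarfile


def models_in_archive(tar_path):
    """Map each top-level directory in the archive to whether it contains config.pbtxt."""
    models = {}
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Drop the leading "." produced by `tar -C <dir> .`
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if not parts:
                continue
            if len(parts) == 1 and not member.isdir():
                continue  # stray file at the top level, not a model directory
            models.setdefault(parts[0], False)
            if parts[1:] == ["config.pbtxt"]:
                models[parts[0]] = True
    return models
```

Calling `models_in_archive("model_tf_dali.tar.gz")` would be expected to list the three model directories shown in the tar output above; any entry mapped to `False` is missing its config.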


## 3. TRT + Python Ensemble

For this example, we download a pretrained model from the transformers library and convert it to a TensorRT plan. The remaining models (pre-process and post-process), along with the `config.pbtxt` files for all models, are included in the folder `ensemble_hf`.

In [16]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"

In [17]:
! docker run --gpus=all --rm -it -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash generate_model_trt.sh $model_id

Unable to find image 'nvcr.io/nvidia/pytorch:22.10-py3' locally
22.10-py3: Pulling from nvidia/pytorch

In [18]:
! mkdir -p ensemble_hf/bert-trt/1 && mv workspace/model.plan ensemble_hf/bert-trt/1/model.plan && rm -rf workspace/model.onnx workspace/core*

Create a custom python conda environment with required dependencies installed

In [24]:
!bash conda_dependencies.sh
!cp processing_env.tar.gz ensemble_hf/postprocess/ && cp processing_env.tar.gz ensemble_hf/preprocess/
!rm processing_env.tar.gz

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.8.4
  latest version: 23.3.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/ubuntu/anaconda3/envs/processing_env

  added / updated specs:
    - python=3.8


The following NEW packages will be INSTALLED:

  _libgcc_mutex      conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
  _openmp_mutex      conda-forge/linux-64::_openmp_mutex-4.5-2_gnu
  bzip2              conda-forge/linux-64::bzip2-1.0.8-h7f98852_4
  ca-certificates    conda-forge/linux-64::ca-certificates-2023.5.7-hbcca054_0
  ld_impl_linux-64   conda-forge/linux-64::ld_impl_linux-64-2.40-h41732ed_0
  libffi             conda-forge/linux-64::libffi-3.4.2-h7f98852_5
  libgcc-ng          conda-forge/linux-64::libgcc-ng-12.2.0-h65d4601_19
  libgomp            conda-forge/linux-64::libgomp-12.2.0-h65d4601_19
  libnsl             conda-forge/l

Upload model artifacts to S3

In [25]:
!tar -C ensemble_hf/ -czf model_trt_python.tar.gz .
model_uri = sagemaker_session.upload_data(
    path="model_trt_python.tar.gz", key_prefix="triton-mme-gpu-ensemble"
)

print("S3 model uri: {}".format(model_uri))

S3 model uri: s3://sagemaker-us-west-2-850464037171/triton-mme-gpu-ensemble/model_trt_python.tar.gz


## 4. Run ensembles on SageMaker MME GPU instance

In [26]:
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

In [27]:
region = boto3.Session().region_name
if region not in account_id_map:
    raise ValueError(f"Unsupported region: {region}")

In [28]:
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.03-py3".format(
    account_id=account_id_map[region], region=region, base=base
)

In [29]:
# Both archives were uploaded to the session's default bucket under this prefix
models_s3_location = f"s3://{sagemaker_session.default_bucket()}/triton-mme-gpu-ensemble/"
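In MME mode, `ModelDataUrl` is an S3 prefix rather than a single archive, and each `invoke_endpoint` call selects an archive through `TargetModel`, given as a path relative to that prefix. A small sketch of that relationship follows; `target_model_for` is a hypothetical helper and the bucket name is a placeholder.

```python
def target_model_for(model_uri, models_prefix):
    """Return the archive path relative to the MME prefix, suitable for TargetModel."""
    if not models_prefix.endswith("/"):
        models_prefix += "/"
    if not model_uri.startswith(models_prefix):
        raise ValueError(f"{model_uri} is not under {models_prefix}")
    return model_uri[len(models_prefix):]


# Placeholder bucket name, for illustration only.
print(
    target_model_for(
        "s3://example-bucket/triton-mme-gpu-ensemble/model_tf_dali.tar.gz",
        "s3://example-bucket/triton-mme-gpu-ensemble/",
    )
)
```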

In [30]:
container = {
    "Image": triton_image_uri,
    "ModelDataUrl": models_s3_location,
    "Mode": "MultiModel",
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble_dali_inception"},
}

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

model_arn = create_model_response["ModelArn"]

print(f"Model Arn: {model_arn}")

Model Arn: arn:aws:sagemaker:us-west-2:850464037171:model/triton-tf-dali-ensemble-2023-05-12-15-09-44


In [31]:
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

endpoint_config_arn = create_endpoint_config_response["EndpointConfigArn"]

print(f"Endpoint Config Arn: {endpoint_config_arn}")

Endpoint Config Arn: arn:aws:sagemaker:us-west-2:850464037171:endpoint-config/triton-tf-dali-ensemble-2023-05-12-15-09-44


In [32]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

endpoint_arn = create_endpoint_response["EndpointArn"]

print(f"Endpoint Arn: {endpoint_arn}")

Endpoint Arn: arn:aws:sagemaker:us-west-2:850464037171:endpoint/triton-tf-dali-ensemble-2023-05-12-15-09-44


In [33]:
rv = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = rv["EndpointStatus"]
print(f"Endpoint Creation Status: {status}")

while status == "Creating":
    time.sleep(60)
    rv = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = rv["EndpointStatus"]
    print(f"Endpoint Creation Status: {status}")

endpoint_arn = rv["EndpointArn"]

print(f"Endpoint Arn: {endpoint_arn}")
print(f"Endpoint Status: {status}")

Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: Creating
Endpoint Creation Status: InService
Endpoint Arn: arn:aws:sagemaker:us-west-2:850464037171:endpoint/triton-tf-dali-ensemble-2023-05-12-15-09-44
Endpoint Status: InService
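The loop above keeps polling for as long as the status is `Creating`. A variant with an upper bound on the waiting time could look like the sketch below; `wait_for_endpoint` and its parameters are hypothetical, and the `describe` callable is passed in so the logic can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll describe(EndpointName=...) until the status leaves 'Creating' or the timeout expires."""
    waited = 0
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            return status  # e.g. "InService" or "Failed"
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still 'Creating' after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook it would be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`, and the caller should still check that the returned status is `InService`. boto3 also provides an `endpoint_in_service` waiter via `sm_client.get_waiter`, which may be a simpler option.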


## 5. Create inference payload and send requests to respective models

### 5.1. TF + DALI Ensemble

In [34]:
sample_img_fname = "shiba_inu_dog.jpg"

import numpy as np

s3_client = boto3.client("s3")
s3_client.download_file(
    "sagemaker-sample-files", "datasets/image/pets/shiba_inu_dog.jpg", sample_img_fname
)

def load_image(img_path):
    """
    Load an image file as a 1-D uint8 array of its encoded bytes.
    The DALI backend decodes the image itself, so the file is sent undecoded.
    """
    with open(img_path, "rb") as f:
        img = f.read()
    return np.frombuffer(img, dtype=np.uint8)
    
rv = load_image(sample_img_fname)
print(f"Shape of image {rv.shape}")

rv2 = np.expand_dims(rv, 0)
print(f"Shape of expanded image array {rv2.shape}")

payload = {
    "inputs": [
        {
            "name": "INPUT",
            "shape": rv2.shape,
            "datatype": "UINT8",
            "data": rv2.tolist(),
        }
    ]
}

Shape of image (576464,)
Shape of expanded image array (1, 576464)
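The JSON payload above spells out every byte of the image as a decimal number, which inflates the request considerably. As a sketch of the binary alternative, the helper below builds a body in the same layout that section 5.2 produces with `generate_request_body`: a JSON header followed by the raw tensor bytes. `build_binary_request` is a hypothetical helper, and it is an assumption that the endpoint accepts this layout for the `INPUT` tensor of this ensemble.

```python
import json


def build_binary_request(name, data, shape):
    """Build a binary+JSON request body for a single UINT8 input tensor.

    Returns (body, json_header_length).
    """
    header = {
        "inputs": [
            {
                "name": name,
                "shape": list(shape),
                "datatype": "UINT8",
                # Number of raw bytes that follow the JSON header for this tensor
                "parameters": {"binary_data_size": len(data)},
            }
        ]
    }
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return header_bytes + bytes(data), len(header_bytes)


body, header_length = build_binary_request("INPUT", b"\x01\x02\x03", (1, 3))
print(header_length, len(body))
```

The body would then be sent with `ContentType` set to `application/vnd.sagemaker-triton.binary+json;json-header-size=<header_length>`, as in section 5.2. For the sample image, `data` would be the raw file bytes and `shape` would be `(1, len(data))`.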


In [36]:
# Run inference

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload), TargetModel="model_tf_dali.tar.gz"
)

print(json.loads(response["Body"].read().decode("utf8")))

{'model_name': '90991b942ab1f22f70f83663aabd8601', 'model_version': '1', 'parameters': {'sequence_id': 0, 'sequence_start': False, 'sequence_end': False}, 'outputs': [{'name': 'OUTPUT', 'datatype': 'FP32', 'shape': [1, 1001], 'data': [0.00028219446539878845, 0.00017344841035082936, 0.00041578078526072204, 0.000421064265538007, 0.0003004077880177647, 0.0006650298018939793, 0.0002860884997062385, 0.00031344176386483014, 0.00028921524062752724, 0.00031432663672603667, 8.46984694362618e-05, 0.0002322912769159302, 0.0003042622411157936, 0.0004389912646729499, 0.0002556558174546808, 0.0003667531709652394, 0.0004309485375415534, 0.0002556503168307245, 0.00022000847093295306, 0.00032059571822173893, 0.0002231193648185581, 0.00022734171943739057, 0.00030433526262640953, 0.000208033510716632, 0.00035471213050186634, 0.00010685255256248638, 0.00017449464940000325, 0.00017357268370687962, 0.00019732954388018698, 0.00023334324941970408, 0.00027035782113671303, 0.0002661184989847243, 0.0003673713654
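The `data` field holds one probability per class (1001 values for a single image). To turn that into readable predictions, the highest-scoring indices can be looked up in the label file. The sketch below assumes a batch of one image and a label list with one entry per class index; `top_k` is a hypothetical helper.

```python
import heapq


def top_k(response, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    scores = response["outputs"][0]["data"]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(labels[i], scores[i]) for i in best]


# Tiny made-up response, for illustration only.
example = {"outputs": [{"data": [0.1, 0.7, 0.2]}]}
print(top_k(example, ["cat", "dog", "fox"], k=2))
```

For the real response, parse the body once into a variable (the stream returned in `response["Body"]` can only be read once) and load `labels` from `inception_labels.txt`, assuming it lists one label per line in class-index order.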

### 5.2. TRT + Python Ensemble

In [38]:
import tritonclient.http as http_client

In [39]:
text_inputs = ["Sentence 1", "Sentence 2"]

inputs = []
inputs.append(http_client.InferInput("INPUT0", [len(text_inputs), 1], "BYTES"))

batch_request = [[text_inputs[i]] for i in range(len(text_inputs))]

input0_real = np.array(batch_request, dtype=np.object_)

inputs[0].set_data_from_numpy(input0_real, binary_data=True)

len(input0_real)

outputs = []
outputs.append(http_client.InferRequestedOutput("finaloutput"))

request_body, header_length = http_client.InferenceServerClient.generate_request_body(
    inputs, outputs=outputs
)

print(request_body)



b'{"inputs":[{"name":"INPUT0","shape":[2,1],"datatype":"BYTES","parameters":{"binary_data_size":28}}],"outputs":[{"name":"finaloutput","parameters":{"binary_data":true}}]}\n\x00\x00\x00Sentence 1\n\x00\x00\x00Sentence 2'
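The printed request body shows how `BYTES` tensors are laid out after the JSON header: each element is a 4-byte little-endian length followed by the raw bytes (`\n\x00\x00\x00` is the length 10). The following standard-library sketch reproduces that layout; `encode_bytes_tensor` and `decode_bytes_tensor` are hypothetical helpers written for illustration, not tritonclient functions.

```python
import struct


def encode_bytes_tensor(strings):
    """Serialize strings as BYTES elements: 4-byte little-endian length, then the bytes."""
    out = bytearray()
    for text in strings:
        raw = text.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw
    return bytes(out)


def decode_bytes_tensor(blob):
    """Inverse of encode_bytes_tensor: split a length-prefixed blob back into strings."""
    items, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        items.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    return items


blob = encode_bytes_tensor(["Sentence 1", "Sentence 2"])
print(len(blob))  # 28, matching binary_data_size in the header above
print(decode_bytes_tensor(blob))
```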


In [40]:
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
    TargetModel="model_trt_python.tar.gz"
)

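# The response body is binary+JSON (a JSON header followed by raw tensor bytes),
# so calling json.loads on the whole body fails:
# a = json.loads(response["Body"].read().decode("utf8"))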

header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]

# Read response body
result = http_client.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(header_length_str)
)

outputs_data = result.as_numpy("finaloutput")

for idx, output in enumerate(outputs_data):
    print(text_inputs[idx])
    print(output)


# TODO: 
Add binary payload example

Add explanations for each step

Add any cautions, e.g.:

    -> Larger ensembles are not recommended for smaller instance types, because framework backends manage memory differently
    -> Each ensemble is treated as a single model in SageMaker, i.e. the hierarchy of models is NOT flat
    -> Model names may be re-used across ensembles; however, each ensemble must have its own copy of the model with the duplicated name
    
Add conclusion

Add terminate resources section
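As a starting point for the resource termination step listed above, the sketch below deletes the endpoint, its configuration, and the model, in that order. `delete_endpoint_resources` is a hypothetical helper; the client is passed in so the function can be exercised with a stand-in object. It does not remove the archives uploaded to S3.

```python
def delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its endpoint configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)
```

With the variables defined in this notebook, the call would be `delete_endpoint_resources(sm_client, endpoint_name, endpoint_config_name, sm_model_name)`.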