# Deploy Custom Llama 3 8B to SageMaker Endpoint with Benchmark Across GPU Instances

This notebook shows how to deploy a Llama 3 8B Instruct model to an Amazon SageMaker real-time endpoint and benchmark it across instance types. Instead of using SageMaker JumpStart, it deploys model weights stored locally, so you can use it to deploy your own fine-tuned model. If you do not have any model weights stored locally, the notebook includes an optional step that downloads the original weights from Hugging Face before deploying them.

The model is deployed to the ml.g5, ml.g6, and ml.g6e instance families on SageMaker real-time endpoints. This notebook runs 5 experiments:
- Deploying the model to a single A10G GPU instance with lower vCPU and RAM
- Deploying the model to a single A10G GPU instance with higher vCPU and RAM
- Deploying the model to a multi A10G GPU instance (with tensor parallelism)
- Deploying the model to a single L4 GPU instance
- Deploying the model to a single L40S GPU instance

This notebook focuses on deploying models to SageMaker real-time endpoints with DJL Serving LMI containers and vLLM. Other methods exist, including TensorRT, TGI, and NeuronX for Inferentia, which are not covered in this notebook.

For each experiment, this notebook runs a performance test with [llmeter](https://github.com/awslabs/llmeter/blob/main/llmeter/endpoints/sagemaker.py). At the end, the notebook compares performance across the experiments, including performance per dollar of compute cost.

## 0. Preparation

**Install required libraries**

In [None]:
!pip install huggingface_hub sagemaker boto3
!pip install llmeter

In [None]:
!pip install -U boto3
!pip install -U sagemaker

**[Optional] Install git-lfs to download large files**

This is only needed if you need to download the original model weights from HuggingFace

**If you already have your tuned LLM weights** there is **NO NEED** to install this

In [None]:
!sudo apt-get update
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs
!git lfs install

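**Stop: restart the kernel before continuing** so that the upgraded `boto3` and `sagemaker` packages are loaded. The next two cells print the versions in use.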

In [None]:
import boto3
boto3.__version__

In [None]:
import sagemaker
sagemaker.__version__

**Import libraries and initialize variables**

In [None]:
import os
import json
import glob
from pathlib import Path
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker.huggingface import HuggingFaceModel
from sagemaker import image_uris, serializers, Predictor
from huggingface_hub import snapshot_download
from huggingface_hub import notebook_login
import tarfile
from datetime import datetime
from llmeter.endpoints.base import InvocationResponse, Endpoint
from llmeter.endpoints import SageMakerEndpoint
from llmeter.experiments import LoadTest
from llmeter.runner import Runner

In [None]:
region_name = "us-west-2"

The next cell requires a manual step: replace the `HF_TOKEN` placeholder with your own Hugging Face token.

In [None]:
# Configuration
HF_TOKEN = "PLACEHOLDER"  # Replace with your HF token or get from environment variable (more secure)
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket_name = sess.default_bucket()
prefix = "llama3-8b-instruct"

print(f"SageMaker role: {role}")
print(f"S3 bucket: {bucket_name}")
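The comment in the configuration cell suggests reading the token from an environment variable instead of hardcoding it. A minimal sketch of that idea follows; the helper name `resolve_hf_token` and the environment variable name `HF_TOKEN` are assumptions for illustration, not something this notebook defines elsewhere.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Mapping[str, str] = os.environ,
                     var_name: str = "HF_TOKEN") -> Optional[str]:
    """Return the token stored in `var_name`, or None if unset or still a placeholder."""
    token = env.get(var_name, "").strip()
    if not token or token == "PLACEHOLDER":
        return None
    return token


# Hypothetical usage: prefer the environment variable over a hardcoded string.
# HF_TOKEN = resolve_hf_token() or "PLACEHOLDER"
```

Keeping the token out of the notebook source avoids committing it by accident when the notebook is shared.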

**Define sample input for inference test**

In [None]:
texts = [open(f).read() for f in glob.glob('./sample_payloads/transcript_*.txt')]

In [None]:
text = texts[0]
text

In [None]:
input_estimated_number_of_tokens = len(text) / 4

print(f"There are ~ {input_estimated_number_of_tokens} tokens in the input")
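The estimate above divides the character count by 4, a common rule of thumb for English text rather than an exact tokenizer count. As a small sketch, the same heuristic can be wrapped in a helper and applied to every sample; `estimate_tokens` and `sample_texts` are hypothetical names, and the strings stand in for the transcript files.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by an assumed chars-per-token ratio."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


sample_texts = ["a" * 400, "b" * 1001]  # stand-ins for the transcript files
estimates = [estimate_tokens(t) for t in sample_texts]
print(estimates)  # [100, 251]
```

For exact counts you would tokenize with the model's own tokenizer; this heuristic is only meant to give an order of magnitude for the input size.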

## 1. [Optional] Download original model weights

**Important**: This whole section fetches the original LLM weights from Hugging Face. If you already have your own weights, for example because you have already fine-tuned the model, you can skip this section and instead point the `model_snapshot_path` variable to the folder where your model weights reside. For example: `model_snapshot_path = "./tuned_llm"`

In [None]:
notebook_login()

**Download model**

In [None]:
# Download model from Hugging Face
print("Downloading Llama 3 8B Instruct model...")
local_model_path = Path("./llama3-8b-instruct")

snapshot_download(
    repo_id=MODEL_NAME,
    cache_dir=local_model_path
)

print(f"Model downloaded to: {local_model_path}")

In [None]:
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]

In [None]:
# Use this when you want to manually point this to your tuned LLM weight folder

# model_snapshot_path = "some-folder-path"
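When pointing `model_snapshot_path` at your own folder, a quick sanity check before uploading can save a failed deployment. The sketch below is an assumption about what a Hugging Face style model folder usually contains (a `config.json` plus `.safetensors` or `.bin` weight files); the helper name `check_model_folder` is hypothetical, and your tuned model may need additional files such as tokenizer assets.

```python
from pathlib import Path


def check_model_folder(folder, required=("config.json",),
                       weight_suffixes=(".safetensors", ".bin")):
    """Return a list of problems found in a local model folder (empty list = looks OK)."""
    folder = Path(folder)
    if not folder.is_dir():
        return [f"{folder} is not a directory"]
    # Files that must be present by name (assumed minimum for a Hugging Face model)
    problems = [f"missing {name}" for name in required if not (folder / name).exists()]
    # At least one weight file with a known suffix should exist at the top level
    has_weights = any(p.suffix in weight_suffixes for p in folder.iterdir() if p.is_file())
    if not has_weights:
        problems.append("no weight files (*.safetensors or *.bin) found")
    return problems


# Hypothetical usage:
# print(check_model_folder(model_snapshot_path))
```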

## 2. Upload model to S3

In [None]:
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
s3_model_prefix = f"custom-llama3-8b-instruct/{timestamp}/model"

print(f"Uploading to S3: {s3_model_prefix}")

!aws s3 cp --recursive {model_snapshot_path} s3://{bucket_name}/{s3_model_prefix}
print("Upload completed")

In [None]:
s3_uri = f"s3://{bucket_name}/{s3_model_prefix}/"

## 3. Experiment 1: Deploy to single A10G GPU & lower vCPU + RAM

**Using ml.g5.xlarge** with 1 GPU accelerator (24 GB VRAM), 4 vCPU and 16 GB RAM

**Price per hour (on-demand)** in the Oregon region (us-west-2) at the time of writing is around $1.408

In [None]:
exp_1 = {
    "instance_type": "ml.g5.xlarge",
    "vram": 24,
    "vcpu": 4,
    "ram": 16,
    "hourly_compute_price_in_sin": 1.408
}
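As a rough check on whether the model is a candidate for a given GPU, the weight memory can be estimated from the parameter count. The sketch below assumes about 8 billion parameters stored in 16-bit precision (2 bytes each) and reserves an arbitrary 20% of VRAM as headroom for the KV cache and activations; the function names and the headroom value are illustrative assumptions, not measured requirements.

```python
def weights_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bytes_per_param


def fits_on_instance(n_params_billion, vram_per_gpu_gb, tensor_parallel_degree=1,
                     bytes_per_param=2, headroom_fraction=0.2):
    """True if the weights, sharded evenly across GPUs, leave `headroom_fraction` of each GPU free."""
    per_gpu = weights_memory_gb(n_params_billion, bytes_per_param) / tensor_parallel_degree
    return per_gpu <= vram_per_gpu_gb * (1 - headroom_fraction)


print(weights_memory_gb(8))        # about 16 GB of weights in 16-bit precision
print(fits_on_instance(8, 24))     # a single 24 GB GPU
print(fits_on_instance(8, 24, 4))  # four 24 GB GPUs with tensor parallelism
```

This covers weights only. Actual usage also depends on batch size and sequence length, which is what the load tests in this notebook exercise.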

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant options match the instance type you intend to deploy to. Different instance types may have a different number of accelerators and a different amount of VRAM per accelerator.

In [None]:
exp_1_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
open('serving.properties', 'w').write(exp_1_serving_properties_content)
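Each experiment in this notebook writes the same properties, with `option.tensor_parallel_degree` as the only value that changes. A minimal sketch of a helper that derives it from the instance type is shown below; `GPUS_PER_INSTANCE` and `build_serving_properties` are hypothetical names, and the GPU counts are taken from the instance descriptions in this notebook.

```python
# GPU counts taken from the instance descriptions in this notebook.
GPUS_PER_INSTANCE = {
    "ml.g5.xlarge": 1,
    "ml.g5.8xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g6.xlarge": 1,
    "ml.g6e.xlarge": 1,
}


def build_serving_properties(model_s3_uri: str, instance_type: str,
                             max_rolling_batch_size: int = 32) -> str:
    """Render serving.properties with tensor_parallel_degree set to the instance's GPU count."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise KeyError(f"Unknown instance type: {instance_type}")
    lines = [
        "engine=Python",
        "option.entryPoint=djl_python.huggingface",
        f"option.model_id={model_s3_uri}",
        f"option.max_rolling_batch_size={max_rolling_batch_size}",
        f"option.tensor_parallel_degree={GPUS_PER_INSTANCE[instance_type]}",
        "option.rolling_batch=auto",
        "option.enable_mixed_precision_accumulation=true",
        "option.model_loading_timeout=1000",
    ]
    return "\n".join(lines)


# Hypothetical usage:
# content = build_serving_properties(s3_uri, "ml.g5.12xlarge")
```

Deriving the value from a lookup table keeps the five experiment configurations from drifting apart when one of them is edited.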

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_1_model
mv serving.properties exp_1_model/
tar czvf exp_1_model.tar.gz exp_1_model/
rm -rf exp_1_model

In [None]:
exp_1_s3_code_prefix = "llama3-8b-exp-1/code"
exp_1_code_artifact = sess.upload_data("exp_1_model.tar.gz", bucket_name, exp_1_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {exp_1_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_1_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here allows SageMaker to assume it (Trust Policy) and allows necessary permissions (e.g. accessing model location in S3, downloading container image from ECR, publishing logs and metrics to CloudWatch)
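For reference, the trust policy mentioned above is the part of the IAM role that lets the SageMaker service assume it. A sketch of that policy document, written as a Python dict so it can be printed or compared against your role, is shown below. The variable name is illustrative, and the permissions policies (S3, ECR, CloudWatch) are separate and not shown here.

```python
import json

# Trust policy that lets the SageMaker service assume the execution role.
sagemaker_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(sagemaker_trust_policy, indent=2))
```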

In [None]:
exp_1_model = Model(image_uri=exp_1_image_uri, model_data=exp_1_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_1_endpoint_name= f"llama3-8b-exp-1-{timestamp}"
exp_1_instance_type = exp_1['instance_type']

exp_1_model.deploy(
    instance_type=exp_1_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_1_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_1_input_data = {
    "inputs": text,
    "parameters": {"max_new_tokens": 512, "do_sample": True}
}

# predictor class
exp_1_predictor = sagemaker.Predictor(
    endpoint_name=exp_1_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_1_response = exp_1_predictor.predict(exp_1_input_data)
exp_1_response_data = json.loads(exp_1_response)
print(exp_1_response_data)

**Test performance with LLMeter**

In [None]:
exp_1_sagemaker_endpoint = SageMakerEndpoint(
    exp_1_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_1_payloads = [exp_1_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_1_load_test = LoadTest(
    endpoint=exp_1_sagemaker_endpoint,
    payload=exp_1_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_1/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_1_load_test_results = await exp_1_load_test.run()

In [None]:
exp_1_figures = exp_1_load_test_results.plot_results()

## 4. Experiment 2: Deploy to single A10G GPU & higher vCPU + RAM

**Using ml.g5.8xlarge** with 1 GPU accelerator (24 GB VRAM), 32 vCPU and 128 GB RAM

**Price per hour (on-demand)** in the Oregon region (us-west-2) at the time of writing is around $3.06

In [None]:
exp_2 = {
    "instance_type": "ml.g5.8xlarge",
    "vram": 24,
    "vcpu": 32,
    "ram": 128,
    "hourly_compute_price_in_sin": 3.06
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant options match the instance type you intend to deploy to. Different instance types may have a different number of accelerators and a different amount of VRAM per accelerator.

In [None]:
exp_2_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
open('serving.properties', 'w').write(exp_2_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_2_model
mv serving.properties exp_2_model/
tar czvf exp_2_model.tar.gz exp_2_model/
rm -rf exp_2_model

In [None]:
exp_2_s3_code_prefix = "llama3-8b-exp-2/code"
exp_2_code_artifact = sess.upload_data("exp_2_model.tar.gz", bucket_name, exp_2_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {exp_2_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_2_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here allows SageMaker to assume it (Trust Policy) and allows necessary permissions (e.g. accessing model location in S3, downloading container image from ECR, publishing logs and metrics to CloudWatch)

In [None]:
exp_2_model = Model(image_uri=exp_2_image_uri, model_data=exp_2_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_2_endpoint_name= f"llama3-8b-exp-2-{timestamp}"
exp_2_instance_type = exp_2['instance_type']

exp_2_model.deploy(
    instance_type=exp_2_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_2_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_2_input_data = {
    "inputs": text,
    "parameters": {"max_new_tokens": 512, "do_sample": True}
}

# predictor class
exp_2_predictor = sagemaker.Predictor(
    endpoint_name=exp_2_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_2_response = exp_2_predictor.predict(exp_2_input_data)
exp_2_response_data = json.loads(exp_2_response)
print(exp_2_response_data)

**Test performance with LLMeter**

In [None]:
exp_2_sagemaker_endpoint = SageMakerEndpoint(
    exp_2_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_2_payloads = [exp_2_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_2_load_test = LoadTest(
    endpoint=exp_2_sagemaker_endpoint,
    payload=exp_2_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_2/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_2_load_test_results = await exp_2_load_test.run()

In [None]:
exp_2_figures = exp_2_load_test_results.plot_results()

## 5. Experiment 3: Deploy to multi A10G GPU

**Using ml.g5.12xlarge** with 4 GPU accelerators (96 GB total VRAM, 24 GB each), 48 vCPU and 192 GB RAM

**Price per hour (on-demand)** in the Oregon region (us-west-2) at the time of writing is around $7.09

In [None]:
exp_3 = {
    "instance_type": "ml.g5.12xlarge",
    "vram": 96,
    "vcpu": 48,
    "ram": 192,
    "hourly_compute_price_in_sin": 7.09
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant options match the instance type you intend to deploy to. Different instance types may have a different number of accelerators and a different amount of VRAM per accelerator.

In [None]:
exp_3_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=4
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
open('serving.properties', 'w').write(exp_3_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_3_model
mv serving.properties exp_3_model/
tar czvf exp_3_model.tar.gz exp_3_model/
rm -rf exp_3_model

In [None]:
exp_3_s3_code_prefix = "llama3-8b-exp-3/code"
exp_3_code_artifact = sess.upload_data("exp_3_model.tar.gz", bucket_name, exp_3_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {exp_3_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_3_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here allows SageMaker to assume it (Trust Policy) and allows necessary permissions (e.g. accessing model location in S3, downloading container image from ECR, publishing logs and metrics to CloudWatch)

In [None]:
exp_3_model = Model(image_uri=exp_3_image_uri, model_data=exp_3_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_3_endpoint_name= f"llama3-8b-exp-3-{timestamp}"
exp_3_instance_type = exp_3['instance_type']

exp_3_model.deploy(
    instance_type=exp_3_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_3_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_3_input_data = {
    "inputs": text,
    "parameters": {"max_new_tokens": 512, "do_sample": True}
}

# predictor class
exp_3_predictor = sagemaker.Predictor(
    endpoint_name=exp_3_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_3_response = exp_3_predictor.predict(exp_3_input_data)
exp_3_response_data = json.loads(exp_3_response)
print(exp_3_response_data)

**Test performance with LLMeter**

In [None]:
exp_3_sagemaker_endpoint = SageMakerEndpoint(
    exp_3_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_3_payloads = [exp_3_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_3_load_test = LoadTest(
    endpoint=exp_3_sagemaker_endpoint,
    payload=exp_3_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_3/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_3_load_test_results = await exp_3_load_test.run()

In [None]:
exp_3_figures = exp_3_load_test_results.plot_results()

## 6. Experiment 4: Deploy to single L4 GPU with lower vCPU and RAM

**Using ml.g6.xlarge** with 1 GPU accelerator (24 GB total VRAM), 4 vCPU and 16 GB RAM

**Price per hour (on-demand)** in the Oregon region (us-west-2) at the time of writing is around $1.1267

In [None]:
exp_4 = {
    "instance_type": "ml.g6.xlarge",
    "vram": 24,
    "vcpu": 4,
    "ram": 16,
    "hourly_compute_price_in_sin": 1.1267
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant options match the instance type you intend to deploy to. Different instance types may have a different number of accelerators and a different amount of VRAM per accelerator.

In [None]:
exp_4_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
open('serving.properties', 'w').write(exp_4_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_4_model
mv serving.properties exp_4_model/
tar czvf exp_4_model.tar.gz exp_4_model/
rm -rf exp_4_model

In [None]:
exp_4_s3_code_prefix = "llama3-8b-exp-4/code"
exp_4_code_artifact = sess.upload_data("exp_4_model.tar.gz", bucket_name, exp_4_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {exp_4_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_4_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here allows SageMaker to assume it (Trust Policy) and allows necessary permissions (e.g. accessing model location in S3, downloading container image from ECR, publishing logs and metrics to CloudWatch)

In [None]:
exp_4_model = Model(image_uri=exp_4_image_uri, model_data=exp_4_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_4_endpoint_name= f"llama3-8b-exp-4-{timestamp}"
exp_4_instance_type = exp_4['instance_type']

exp_4_model.deploy(
    instance_type=exp_4_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_4_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_4_input_data = {
    "inputs": text,
    "parameters": {"max_new_tokens": 512, "do_sample": True}
}

# predictor class
exp_4_predictor = sagemaker.Predictor(
    endpoint_name=exp_4_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_4_response = exp_4_predictor.predict(exp_4_input_data)
exp_4_response_data = json.loads(exp_4_response)
print(exp_4_response_data)

**Test performance with LLMeter**

In [None]:
exp_4_sagemaker_endpoint = SageMakerEndpoint(
    exp_4_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_4_payloads = [exp_4_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_4_load_test = LoadTest(
    endpoint=exp_4_sagemaker_endpoint,
    payload=exp_4_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_4/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_4_load_test_results = await exp_4_load_test.run()

In [None]:
exp_4_figures = exp_4_load_test_results.plot_results()

## 7. Experiment 5: Deploy to single L40S GPU with lower vCPU and RAM

**Using ml.g6e.xlarge** with 1 GPU accelerator (48 GB total VRAM), 4 vCPU and 32 GB RAM

**Price per hour (on-demand)** in the Oregon region (us-west-2) at the time of writing is around $2.6054

In [None]:
exp_5 = {
    "instance_type": "ml.g6e.xlarge",
    "vram": 48,
    "vcpu": 4,
    "ram": 32,
    "hourly_compute_price_in_sin": 2.6054
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant options match the instance type you intend to deploy to. Different instance types may have a different number of accelerators and a different amount of VRAM per accelerator.

In [None]:
exp_5_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
open('serving.properties', 'w').write(exp_5_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_5_model
mv serving.properties exp_5_model/
tar czvf exp_5_model.tar.gz exp_5_model/
rm -rf exp_5_model

In [None]:
exp_5_s3_code_prefix = "llama3-8b-exp-5/code"
exp_5_code_artifact = sess.upload_data("exp_5_model.tar.gz", bucket_name, exp_5_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {exp_5_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_5_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here allows SageMaker to assume it (Trust Policy) and allows necessary permissions (e.g. accessing model location in S3, downloading container image from ECR, publishing logs and metrics to CloudWatch)

In [None]:
exp_5_model = Model(image_uri=exp_5_image_uri, model_data=exp_5_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_5_endpoint_name= f"llama3-8b-exp-5-{timestamp}"
exp_5_instance_type = exp_5['instance_type']

exp_5_model.deploy(
    instance_type=exp_5_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_5_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_5_input_data = {
    "inputs": text,
    "parameters": {"max_new_tokens": 512, "do_sample": True}
}

# predictor class
exp_5_predictor = sagemaker.Predictor(
    endpoint_name=exp_5_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_5_response = exp_5_predictor.predict(exp_5_input_data)
exp_5_response_data = json.loads(exp_5_response)
print(exp_5_response_data)

**Test performance with LLMeter**

In [None]:
exp_5_sagemaker_endpoint = SageMakerEndpoint(
    exp_5_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_5_payloads = [exp_5_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_5_load_test = LoadTest(
    endpoint=exp_5_sagemaker_endpoint,
    payload=exp_5_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_5/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_5_load_test_results = await exp_5_load_test.run()

In [None]:
exp_5_figures = exp_5_load_test_results.plot_results()

## 8. Performance and cost analysis

In [None]:
# Load stats.json for each experiment and client count
for exp_number, exp in enumerate([exp_1, exp_2, exp_3, exp_4, exp_5], start=1):
    exp['load_test_output'] = {}
    # Pick the run folder whose name sorts last (assumed to be the most recent run)
    latest_run = max(Path(f"outputs/{prefix}_exp_{exp_number}/load_test").glob("*"), key=lambda x: x.name)
    for i in [1, 5, 10]:
        with open(latest_run / f"{i:05d}-clients/stats.json") as f:
            exp['load_test_output'][i] = json.load(f)


In [None]:
import pandas as pd

experiments = {'exp_1': exp_1, 'exp_2': exp_2, 'exp_3': exp_3, 'exp_4': exp_4, 'exp_5': exp_5}

data = []
for exp_name, exp_data in experiments.items():
    row = [exp_name, exp_data['instance_type'], exp_data['vram'], exp_data['vcpu'], exp_data['ram'], exp_data['hourly_compute_price_in_sin']]
    for client in [1, 5, 10]:
        if client in exp_data['load_test_output']:
            lt = exp_data['load_test_output'][client]
            rpm = lt['requests_per_minute']
            input_tpm = lt['average_input_tokens_per_minute']
            output_tpm = lt['average_output_tokens_per_minute']
            total_tpm = input_tpm + output_tpm
            rpm_per_dollar = rpm / exp_data['hourly_compute_price_in_sin'] * 60
            tpm_per_dollar = total_tpm / exp_data['hourly_compute_price_in_sin'] * 60
            row.extend([rpm, input_tpm, output_tpm, total_tpm, rpm_per_dollar, tpm_per_dollar])
        else:
            row.extend([None] * 6)
    data.append(row)

basic_cols = ['Experiment', 'Instance Type', 'VRAM (GB)', 'vCPU', 'RAM (GB)', 'Hourly Price ($)']
metric_cols = ['Requests/min', 'Input Tokens/min', 'Output Tokens/min', 'Total Tokens/min', 'Requests/$', 'Total Tokens/$']
columns = pd.MultiIndex.from_tuples(
    [(col, '') for col in basic_cols] + 
    [(f'Client {c}', metric) for c in [1, 5, 10] for metric in metric_cols]
)

df = pd.DataFrame(data, columns=columns)
styled_df = df.style.format({col: '{:,.2f}' for col in df.select_dtypes(include='number').columns})

def color_clients(s):
    colors = [''] * len(s)
    for i, client in enumerate([1, 5, 10]):
        start_col = 6 + i * 6  # client metrics start right after the 6 basic columns (indices 0-5)
        end_col = start_col + 6
        color = ['background-color: #e6f3ff', 'background-color: #ffe6e6', 'background-color: #e6ffe6'][i]
        for j in range(start_col, end_col):
            if j < len(colors):
                colors[j] = color
    return colors

styled_df = styled_df.apply(color_clients, axis=1)
styled_df
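The per-dollar columns in the table come from a unit conversion: a per-minute rate times 60 gives a per-hour rate, and dividing by the hourly price gives units per dollar. The sketch below works through that arithmetic with made-up numbers; `per_dollar` is a hypothetical helper and the example values are not measured results.

```python
def per_dollar(rate_per_minute: float, hourly_price_usd: float) -> float:
    """Convert a per-minute rate into units per dollar: (rate * 60 min/h) / ($/h)."""
    if hourly_price_usd <= 0:
        raise ValueError("hourly_price_usd must be positive")
    return rate_per_minute * 60 / hourly_price_usd


# Hypothetical numbers, not measured results.
example_tokens_per_minute = 30_000
example_hourly_price = 1.5
tokens_per_dollar = per_dollar(example_tokens_per_minute, example_hourly_price)
usd_per_million_tokens = 1_000_000 / tokens_per_dollar
print(tokens_per_dollar, usd_per_million_tokens)
```

Inverting the result, as in `usd_per_million_tokens`, gives the cost per million tokens, which some readers find easier to compare with hosted API pricing.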

## 9. Cleanup

To clean up, delete the SageMaker endpoints that this notebook deployed, for example from the AWS Console. You can also delete the associated endpoint configurations.
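The same cleanup can be scripted. The sketch below assumes a boto3 SageMaker client is passed in and uses its `describe_endpoint`, `delete_endpoint`, and `delete_endpoint_config` calls; the helper name `delete_endpoint_and_config` is hypothetical, and it does not delete the SageMaker Model objects or the S3 artifacts.

```python
def delete_endpoint_and_config(sm_client, endpoint_name: str) -> str:
    """Delete an endpoint and its endpoint configuration; return the config name."""
    # Look up the config name first, while the endpoint can still be described.
    config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Hypothetical usage inside this notebook:
# import boto3
# sm_client = boto3.client("sagemaker", region_name=region_name)
# for name in [exp_1_endpoint_name, exp_2_endpoint_name, exp_3_endpoint_name,
#              exp_4_endpoint_name, exp_5_endpoint_name]:
#     delete_endpoint_and_config(sm_client, name)
```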