# Deploy Custom Llama 3 8B to SageMaker Endpoint with Benchmark Across GPU Instances

This notebook shows how to deploy a Llama 3 8B Instruct model to an Amazon SageMaker real-time endpoint and benchmark it across instance types. Instead of using SageMaker JumpStart, this notebook deploys local model weights, so you can use it to deploy a fine-tuned model whose weights are stored locally. If you do not have any model weights stored locally, the notebook includes an optional step that downloads the original model weights from Hugging Face to local storage before deploying them.

The model is deployed to the ml.g5, ml.g6, and ml.g6e instance families on SageMaker real-time endpoints. This notebook has 5 experiments:
- Deploying model into a single A10G GPU instance with lower vCPU and RAM
- Deploying model into a single A10G GPU instance with higher vCPU and RAM
- Deploying model into a multi A10G GPU instance (with tensor parallelism)
- Deploying model into a single L4 GPU instance
- Deploying model into a single L40S GPU instance

This notebook focuses on deploying models to a SageMaker real-time endpoint with DJL Serving and the LMI (Large Model Inference) container, using vLLM. Other methods exist, including TensorRT, TGI, and NeuronX for AWS Inferentia, but they are not covered in this notebook.

For each experiment, this notebook runs a performance test with [llmeter](https://github.com/awslabs/llmeter/blob/main/llmeter/endpoints/sagemaker.py). At the end, the notebook compares performance across the experiments, along with performance per dollar.
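One way to make "performance per dollar" concrete is to convert an instance's hourly price and its sustained throughput into a cost per million generated tokens. The sketch below is illustrative only: the function name `cost_per_million_tokens` and the throughput value are hypothetical and are not llmeter output.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Hypothetical example: an instance priced at $1.408/hour sustaining 400 tokens/s
example_cost = cost_per_million_tokens(1.408, 400.0)
print(f"~${example_cost:.3f} per million output tokens")
```

A lower value means better performance per dollar. The throughput should be measured at the same concurrency level for every instance being compared.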

## 0. Preparation

**Install required libraries**

In [None]:
!pip install huggingface_hub sagemaker boto3
!pip install llmeter

In [None]:
!pip install -U boto3
!pip install -U sagemaker

**[Optional] Install git-lfs to download large files**

This is only needed if you need to download the original model weights from HuggingFace

**If you already have your tuned LLM weights** there is **NO NEED** to install this

In [None]:
!sudo apt-get update
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs
!git lfs install

**Stop! Restart the kernel before continuing, so that the upgraded libraries are loaded**

In [None]:
import boto3
boto3.__version__

In [None]:
import sagemaker
sagemaker.__version__

**Import libraries and initialize variables**

In [None]:
import json
from pathlib import Path

import sagemaker
from sagemaker.model import Model
from sagemaker.huggingface import HuggingFaceModel
from sagemaker import image_uris, serializers, Predictor
from huggingface_hub import snapshot_download
from huggingface_hub import notebook_login
from datetime import datetime
from llmeter.endpoints.base import InvocationResponse, Endpoint
from llmeter.endpoints import SageMakerEndpoint
from llmeter.experiments import LoadTest
from llmeter.runner import Runner

In [None]:
region_name = "us-west-2" # Change to the intended AWS region

In [None]:
# Configuration
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
local_machine = False
try:
    sess = sagemaker.Session()
    role = sagemaker.get_execution_role()
    bucket_name = sess.default_bucket()
except Exception as e:
    local_machine = True

if local_machine:
    role = "PLACEHOLDER" # Replace with the role ARN of an IAM role with SageMaker principal in the trust policy
    bucket_name = "PLACEHOLDER" # Replace with the S3 bucket name that can be used for this notebook

    try:
        assert role != "PLACEHOLDER"
        assert bucket_name != "PLACEHOLDER"
    except AssertionError as e:
        print("Please specify the `role` and `bucket_name`")
        raise e
        
prefix = "llama3-8b-instruct"

print(f"SageMaker role: {role}")
print(f"S3 bucket: {bucket_name}")

**Define sample input for inference test**

In [None]:
texts = [
"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
# Call transcript:
Customer: Hi, I'm calling because my flight was canceled and I need to get to Seattle for a wedding tomorrow.
Agent: I'm so sorry to hear about your flight cancellation. I know how stressful that must be, especially with a wedding to attend. Let me see what options we have for you. Can you give me your confirmation number?
Customer: Yes, it's ABC123DEF. The original flight was supposed to leave at 6 AM tomorrow.
Agent: Thank you. I see your booking here. Unfortunately, that flight was canceled due to severe weather conditions at the departure airport. Let me check what alternatives we have... I see we have a flight leaving at 2:30 PM tomorrow that would get you to Seattle at 4:45 PM.
Customer: That might be cutting it close. The wedding ceremony starts at 6 PM. What about other airlines?
Agent: Let me check partner airlines... I found a seat on SpaceWings leaving at 11:15 AM, arriving at 1:30 PM. That would give you more time. There would be no additional charge since it's due to our cancellation.
Customer: That sounds much better. Can you book that for me?
Agent: Absolutely. I'm booking you on SpaceWings flight 847 departing at 11:15 AM. You'll need to check in with SpaceWings, but I'm sending all your details to them now.
Customer: What about my checked bag? I already dropped it off yesterday.
Agent: Good question. Since your original flight was canceled, your bag is still in our system. I can have it transferred to the SpaceWings flight, or if you prefer, you can pick it up and recheck it with SpaceWings.
Customer: I'd rather not risk it getting lost in transfer. Can I pick it up this evening?
Agent: Yes, you can pick it up at baggage services until 10 PM tonight. I'm updating your bag status now so they'll have it ready for you.
Customer: Perfect. Will I get a confirmation for the new flight?
Agent: Yes, you'll receive confirmation emails from both us and SpaceWings within the next 30 minutes. I'm also issuing you a $200 travel voucher for the inconvenience.
Customer: Wow, I wasn't expecting that. Thank you so much. You've really saved the day. Actually, while I have you on the line, I'm wondering about a few other things related to my travel.
Agent: Of course! I'm happy to help with any other travel-related questions you have.
Customer: Well, this wedding is actually part of a longer trip. I was planning to stay in Seattle for a few extra days after the wedding. Will this flight change affect my return flight?
Agent: Let me check your return reservation... I see you have a return flight on Sunday at 7:30 PM. Since we're only changing your outbound flight, your return flight remains unchanged. However, given the disruption, would you like me to check if there are any better return options available?
Customer: Actually, yes. I was thinking of maybe staying an extra day or two if possible. What would be the change fee for that?
Agent: Since your original outbound flight was canceled due to weather, I can waive the change fee for your return flight as well. Let me see what's available... I have flights on Monday at 2:15 PM or 8:45 PM, and Tuesday at 10:30 AM or 6:20 PM.
Customer: The Tuesday 10:30 AM flight sounds perfect. That would give me Monday to explore the city a bit more.
Agent: Excellent choice! I'm changing your return to Tuesday, April 23rd, flight 1247 departing at 10:30 AM, arriving at 12:45 PM. No change fee, and you'll receive a new confirmation shortly.
Customer: This is turning out better than I hoped! Now, about hotels - I had booked two nights originally, but now I'll need four nights. Do you have any partnerships with hotels in Seattle?
Agent: We do! We partner with several major hotel chains and can often get discounted rates for our passengers, especially when there's been a flight disruption. What area of Seattle were you planning to stay in?
Customer: I was hoping to stay downtown, near the Central Shopping District area. The wedding is at a venue in Riverside Quarter.
Agent: Perfect. Let me check our partner hotels in that area... I have availability at the Grand Plaza Downtown, the Metropolitan Hotel, and the Business Suites. The Grand Plaza has rooms for $189 per night, the Metropolitan for $210, and the Business Suites for $165.
Customer: The Business Suites sounds good, especially at that price. Can you book that for me?
Agent: I can certainly help you get started with that booking. I'll transfer you to our travel services department after we finish with your flight arrangements. They can handle the hotel booking and apply any additional discounts.
Customer: That would be great. Now, I'm also wondering about ground transportation. What's the best way to get from O'Hare to downtown?
Agent: You have several options. The Metro Blue Line train is the most economical at about $5 and takes 45 minutes to downtown. Taxis typically cost $40-50 and take 30-45 minutes depending on traffic. Rideshare services like RideEasy or QuickLift are usually $25-40. There's also the Airport Express shuttle for about $30.
Customer: I think I'll go with the train since I'm not in a huge hurry. Is it easy to navigate?
Agent: Yes, it's very straightforward. The Blue Line runs directly from Metropolitan Airport to downtown, and there are clear signs throughout the airport. You can buy tickets at the station or use a contactless payment method.
Customer: Perfect. Now, about the wedding gift I was bringing - it's in my checked bag. Since I'm picking up my bag tonight and rechecking it tomorrow, will that be a problem?
Agent: Not at all. When you pick up your bag tonight, you'll get it back completely. Tomorrow, you'll just check it in again with SpaceWings as if it were a normal departure. Just make sure to arrive at the airport at least 90 minutes early for domestic flights.
Customer: Good to know. Speaking of the SpaceWings flight, will I earn miles on that flight even though it's not with your airline?
Agent: That's a great question. Since this is a partner airline booking due to our cancellation, you should earn miles in your frequent flyer account with us. I'll make sure that's noted in your booking. Do you have your frequent flyer number handy?
Customer: Yes, it's FF789456123.
Agent: Perfect. I've added that to both your outbound SpaceWings flight and your return flight with us. You'll earn full miles for both segments.
Customer: Excellent. Now, I'm curious about the weather situation that caused the cancellation. Is this going to be an ongoing issue?
Agent: The severe weather system that caused today's cancellations is expected to move through by tonight. Tomorrow's weather forecast looks much better, so I don't anticipate further disruptions. However, I always recommend checking your flight status before heading to the airport.
Customer: That's reassuring. What's the best way to check flight status?
Agent: You can check on our website, mobile app, or by calling our automated flight information line. I'll also make sure you're signed up for text and email alerts for both your flights.
Customer: That would be helpful. What phone number do you have on file for me?
Agent: I have 555-123-4567. Is that still current?
Customer: Yes, that's correct. And for email, you should have joe.smith.travel@email.com.
Agent: Perfect, that matches what I have. You're all set for notifications.
Customer: Great. Now, about the $200 travel voucher you mentioned - how does that work?
Agent: The voucher will be issued as a credit to your original form of payment within 5-7 business days. You can also use it for future bookings on our website or by calling our reservations line. It's valid for one year from the issue date.
Customer: Can I use it for someone else's ticket, like if I want to book a flight for my spouse?
Agent: Yes, absolutely. The voucher is tied to your account, but you can use it to purchase tickets for anyone. Just make sure to log into your account when booking or provide your confirmation number when calling.
Customer: That's very flexible, thank you. Now, I'm wondering about travel insurance. Given what happened today, should I consider it for future trips?
Agent: Travel insurance can definitely provide peace of mind, especially for important events like weddings. It can cover trip cancellations, delays, medical emergencies, and lost luggage. We offer several options through our travel partners.
Customer: What would something like that cost for a typical domestic trip?
Agent: For domestic travel, basic trip protection usually runs about 4-6% of your total trip cost. So for a $500 trip, you'd be looking at around $20-30. More comprehensive coverage with higher limits would be slightly more.
Customer: That seems reasonable for important trips. Can I add it to my current booking?
Agent: Unfortunately, travel insurance typically needs to be purchased within 14-21 days of your initial booking to cover pre-existing conditions and cancellations. But I can have our travel services team discuss options for your future bookings.
Customer: That makes sense. I'll keep that in mind for next time. Now, about dining in Seattle - do you have any recommendations?
Agent: While I'm not a local expert, I know Seattle is famous for deep-dish pizza, Italian beef sandwiches, and great steakhouses. The concierge at your hotel will have excellent local recommendations based on your preferences and budget.
Customer: Good point. I'll ask them when I check in. Now, let me make sure I have all the details correct for tomorrow. I need to pick up my bag tonight before 10 PM, then check in with United tomorrow for the 11:15 AM flight?
Agent: That's exactly right. Pick up your bag at baggage services tonight - they'll have it ready for you. Tomorrow, arrive at the airport by 9:45 AM for your 11:15 AM United flight to Seattle, arriving at 1:30 PM.
Customer: And my return flight is now Tuesday at 10:30 AM instead of Sunday evening?
Agent: Correct. Tuesday, April 23rd, flight 1247 departing Seattle at 10:30 AM, arriving at your home airport at 12:45 PM.
Customer: Perfect. And you're transferring me to travel services for the hotel booking?
Agent: Yes, I'll transfer you in just a moment. They'll help you book the Embassy Suites for four nights and can assist with any other travel arrangements you might need.
Customer: This has been incredibly helpful. I was so stressed when I called, and now I feel like this might actually work out even better than my original plan.
Agent: I'm so glad we could turn this situation around for you! Sometimes these disruptions end up creating opportunities for better experiences. I hope you have a wonderful time at the wedding and enjoy your extended stay in Seattle.
Customer: Thank you so much. You've gone above and beyond to help me today. Actually, before we finish, I have a few more questions about my frequent flyer status and some other travel-related things.
Agent: Of course! I'm happy to help with any other questions you have. We have a bit more time before I transfer you to travel services.
Customer: Great! First, I'm curious about my current frequent flyer status. I travel quite a bit for work, and I'm wondering if I'm close to reaching the next tier.
Agent: Let me check your account... You currently have Elite Silver status, and you've earned 47,500 miles this year. To reach Elite Gold, you need 50,000 miles, so you're very close! This trip will actually put you over the threshold.
Customer: That's exciting! What additional benefits come with Elite Gold status?
Agent: Elite Gold members get priority boarding, free seat upgrades when available, two free checked bags instead of one, access to our partner lounges, and a 50% bonus on miles earned. You also get priority customer service and complimentary same-day flight changes.
Customer: The lounge access sounds great. Are there lounges at most major airports?
Agent: We have our own lounges at 15 major airports, plus you'll have access to our partner network which includes over 200 lounges worldwide. You can use the lounge finder on our app to see what's available at any airport.
Customer: Perfect. Now, about the complimentary same-day flight changes - how does that work exactly?
Agent: As an Elite Gold member, if there's availability on an earlier or later flight on the same day, you can change to it at no charge. You can do this online, through the app, or by calling our Elite customer service line.
Customer: That's incredibly useful for business travel. Speaking of which, I'm planning several work trips over the next few months. Is there a way to get better deals on frequent routes?
Agent: Absolutely! We have a corporate travel program that might be perfect for you. It offers discounted rates for frequent travelers and businesses, plus additional perks like flexible booking policies and dedicated account management.
Customer: That sounds interesting. What would I need to qualify for that?
Agent: For individual frequent travelers, you typically need to book at least 12 flights per year or spend $5,000 annually on airfare. Based on your account, you're already well above those thresholds.
Customer: Great! Can you sign me up for that program?
Agent: I can get the process started. I'll have our corporate travel team reach out to you within the next week to set up your account and explain all the benefits. They'll also help you optimize your travel patterns for maximum savings.
Customer: Excellent. Now, I'm also curious about international travel benefits. I mentioned earlier that I might do some international business travel - what should I know about that?
Agent: As an Elite Gold member, you'll have access to our international partner airlines, which means you can earn and redeem miles on flights with over 25 airlines worldwide. You'll also get priority treatment on partner airlines.
Customer: What about international lounge access?
Agent: Your Elite Gold status gives you access to our partner lounges internationally, including the Global Sky Alliance network. That's over 1,000 lounges in more than 160 countries.
Customer: That's impressive. What about international flight changes and cancellations?
Agent: The same flexible policies apply internationally. You get complimentary same-day changes when available, and reduced fees for other changes. Plus, if there are weather or operational delays, we'll automatically rebook you on the next available flight at no charge.
Customer: Very helpful. Now, about earning miles - are there ways to earn miles without flying?
Agent: Definitely! We have partnerships with hotels, car rental companies, and credit card companies. You can also earn miles through our online shopping portal, dining program, and various retail partners.
Customer: Tell me more about the credit card program.
Agent: We offer several co-branded credit cards. The basic card earns 2 miles per dollar on our flights and 1 mile per dollar on everything else. The premium card earns 3 miles per dollar on our flights, 2 miles on travel and dining, and includes an annual companion pass.
Customer: What's a companion pass?
Agent: Once per year, you can book a flight for yourself and bring a companion for just the cost of taxes and fees - usually around $50-100 depending on the destination. It's valid on any flight, including international.
Customer: That could save me a lot of money when traveling with my spouse. What's the annual fee for the premium card?
Agent: The premium card has a $150 annual fee, but it also comes with a free checked bag on every flight, priority boarding, and 40,000 bonus miles when you sign up. For frequent travelers, it usually pays for itself quickly.
Customer: I'm definitely interested. Can you help me apply for that?
Agent: I can get you started with a pre-qualification, but you'll need to complete the full application with our credit card partner. I can have them send you information and a priority application link.
Customer: That would be great. Now, about my upcoming extended stay in the city - are there any special programs for longer stays?
Agent: For stays of 4+ nights, we often have extended stay discounts available. Also, since you're now Elite Gold, you might be eligible for complimentary room upgrades at partner hotels, depending on availability.
Customer: The Business Suites we discussed earlier - are they one of your partner hotels?
Agent: Yes, they are! As an Elite Gold member, you'll earn hotel points for your stay, and you might get upgraded to a suite or receive other perks like free breakfast or late checkout.
Customer: This keeps getting better! What about ground transportation in the city? Any partnerships there?
Agent: We have partnerships with several car rental companies where you can earn miles for rentals. We also have a rideshare partnership where you earn 2 miles per dollar spent on rides.
Customer: Perfect. Now, I'm thinking about the wedding gift situation again. If my bag gets delayed or lost, what's the process for compensation?
Agent: If your bag is delayed more than 12 hours, we provide reimbursement for essential items up to $200 per day. If it's lost permanently, we cover up to $3,500 for the contents, though I recommend keeping receipts for valuable items.
Customer: Good to know. The wedding gift is actually quite valuable - it's a custom piece of jewelry worth about $2,000. Should I carry that on instead?
Agent: For something that valuable and irreplaceable, I'd definitely recommend carrying it on. You can pack it in a small carry-on bag or even in your personal item. That way you'll have it with you the entire time.
Customer: That's smart thinking. I'll repack tonight when I pick up my checked bag. Now, about the wedding itself - if something happens and I need to change my return flight again, what's the process?
Agent: As an Elite Gold member, you can make changes online or through the app up to 2 hours before departure. For same-day changes, there's no fee. For changes to different dates, the fee is waived for Elite Gold, you just pay any fare difference.
Customer: That's very flexible. What if there's a family emergency or something unexpected?
Agent: We have a compassionate care policy for emergencies. If you need to change or cancel due to a family emergency, medical issue, or other qualifying event, we waive all fees and often provide full refunds even on non-refundable tickets.
Customer: That's reassuring. Now, about my extended stay - I'm thinking of doing some sightseeing. Are there any travel packages or tours you recommend?
Agent: We don't book tours directly, but our travel services team can connect you with local tour operators and attractions. Many offer discounts to our Elite members. The hotel concierge will also be a great resource for local activities.
Customer: Perfect. What about restaurant reservations? Any assistance with that?
Agent: Our Elite Gold concierge service can help with restaurant reservations, especially at popular or hard-to-book places. It's a complimentary service, and they have relationships with restaurants in major cities.
Customer: I had no idea you offered concierge services! What else can they help with?
Agent: They can assist with event tickets, transportation arrangements, special occasion planning, and even personal shopping. It's available 24/7 for Elite Gold and Platinum members.
Customer: This Elite Gold status is sounding better and better. When exactly will that kick in?
Agent: Your Elite Gold status will be active as soon as this trip posts to your account, which usually happens within 24-48 hours of completing your flights. You'll receive a welcome email with all your new benefits.
Customer: Excellent. Now, one last question about the city itself - I know you mentioned the weather caused the original cancellation. What's the weather forecast looking like for my extended stay?
Agent: The weather system that caused today's issues is moving out tonight. The forecast for the rest of the week looks great - sunny skies, temperatures in the mid-70s. Perfect weather for sightseeing and the wedding.
Customer: That's wonderful news. I think that covers all my questions. You've been absolutely incredible - turning what started as a stressful situation into something I'm actually excited about.
Agent: I'm so thrilled that we could not only solve your immediate problem but also enhance your entire travel experience. You're going to love the Elite Gold benefits, and I think this extended trip is going to be fantastic.
Customer: I couldn't agree more. Before you transfer me to travel services, will I be able to reach you again if I need anything?
Agent: I'm adding my direct extension to your account notes, and you'll also have access to our Elite Gold customer service line, which has shorter wait times and more experienced agents.
Customer: Perfect. Thank you again for everything - the flight changes, the hotel assistance, the frequent flyer upgrade, the credit card information, and all the travel tips. This has been the most comprehensive and helpful customer service experience I've ever had.
Agent: It's been my absolute pleasure helping you today. I hope you have a wonderful time at the wedding, enjoy your extended stay in the city, and I look forward to serving you as one of our Elite Gold members. Let me transfer you to travel services now to finalize your hotel booking.
Customer: Thank you so much. You've made my day!
Agent: You've made mine too! Have a fantastic trip!

# Instruction
Read the call transcript above and extract the following information, in JSON. Output ONLY JSON like below then STOP. Do not make more call transcript.
{
    'summary': 'string',
    'sentiment': 'string',
    'agent_has_pending_todo_task': 'true/false',
    'case_closed': 'true/false'
}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
]

In [None]:
text = texts[0]
text

In [None]:
# Rough heuristic: about 4 characters per token for English text
input_estimated_number_of_tokens = len(text) / 4

print(f"There are ~ {input_estimated_number_of_tokens} tokens in the input")

## 1. [Optional] Download original model weights

**Important**: This whole section fetches the original LLM weights from Hugging Face. If you already have your own weights, for example because you have fine-tuned the model, you can skip this section and instead point the `model_snapshot_path` variable to the folder where your model weights reside. For example: `model_snapshot_path = "./tuned_llm"`
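If you point `model_snapshot_path` at your own folder, a quick sanity check before uploading can save a failed deployment. The sketch below assumes the usual Hugging Face layout (a `config.json` next to `*.safetensors` or `*.bin` weight files). The helper name `check_model_folder` is hypothetical and not part of any library.

```python
from pathlib import Path


def check_model_folder(folder: str) -> Path:
    """Return the folder as a Path if it looks like a Hugging Face model directory."""
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"{path} is not a directory")
    if not (path / "config.json").is_file():
        raise FileNotFoundError(f"{path} has no config.json")
    # Assumption: weights are stored as *.safetensors or *.bin files
    has_weights = any(path.glob("*.safetensors")) or any(path.glob("*.bin"))
    if not has_weights:
        raise FileNotFoundError(f"{path} has no *.safetensors or *.bin weight files")
    return path


# Usage (hypothetical folder name):
# model_snapshot_path = check_model_folder("./tuned_llm")
```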

In [None]:
notebook_login()

**Download model**

In [None]:
# Download model from Hugging Face
print("Downloading Llama 3 8B Instruct model...")
local_model_path = Path("./llama3-8b-instruct")

snapshot_download(
    repo_id=MODEL_NAME,
    cache_dir=local_model_path
)

print(f"Model downloaded to: {local_model_path}")

In [None]:
model_snapshot_path = list(local_model_path.glob("**/snapshots/*"))[0]

In [None]:
# Use this when you want to manually point this to your tuned LLM weight folder

# model_snapshot_path = "some-folder-path"

## 2. Upload model to S3

In [None]:
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
s3_model_prefix = f"custom-llama3-8b-instruct/{timestamp}/model"

print(f"Uploading to S3: {s3_model_prefix}")

!aws s3 cp --recursive {model_snapshot_path} s3://{bucket_name}/{s3_model_prefix}
print("Upload completed")

In [None]:
s3_uri = f"s3://{bucket_name}/{s3_model_prefix}/"

## 3. Experiment 1: Deploy to single A10G GPU & lower vCPU + RAM

**Using ml.g5.xlarge** with 1 GPU accelerator (24 GB VRAM), 4 vCPU and 16 GB RAM

**Price per hour (on-demand)** in the Oregon region at the time of writing is around $1.408

In [None]:
exp_1 = {
    "instance_type": "ml.g5.xlarge",
    "vram": 24,
    "vcpu": 4,
    "ram": 16,
    "hourly_compute_price_in_sin": 1.408
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant settings match the instance type that you intend to deploy to. Different instance types may have a different number of accelerators and a different amount of VRAM per accelerator.
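As a rough guide for choosing `tensor_parallel_degree`, you can check whether the model weights alone fit in the combined VRAM of the GPUs used. The sketch below assumes 16-bit weights (2 bytes per parameter) and 1 GB = 1e9 bytes, and it ignores the KV cache and activations, which need additional headroom. The helper names are hypothetical.

```python
def estimate_weight_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * bytes_per_param


def weights_fit(num_params_billion: float, vram_gb_per_gpu: float,
                tensor_parallel_degree: int) -> bool:
    """True if the weights alone fit in the combined VRAM of the GPUs used."""
    total_vram_gb = vram_gb_per_gpu * tensor_parallel_degree
    return estimate_weight_gb(num_params_billion) < total_vram_gb


# Llama 3 8B in 16-bit precision on a single 24 GB GPU
print(f"Weights: ~{estimate_weight_gb(8)} GB")
print(f"Fits on one 24 GB GPU: {weights_fit(8, 24, 1)}")
```

With roughly 16 GB of weights, a single 24 GB GPU leaves only a few GB for the KV cache, which is one reason long prompts and large batch sizes may behave differently across instance types.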

In [None]:
exp_1_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
# Use a context manager so the file is closed before it is packaged
with open("serving.properties", "w") as f:
    f.write(exp_1_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_1_model
mv serving.properties exp_1_model/
tar czvf exp_1_model.tar.gz exp_1_model/
rm -rf exp_1_model

In [None]:
exp_1_s3_code_prefix = "llama3-8b-exp-1/code"
exp_1_code_artifact = sess.upload_data("exp_1_model.tar.gz", bucket_name, exp_1_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to ---> {exp_1_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_1_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here can be assumed by SageMaker (trust policy) and grants the necessary permissions (e.g. accessing the model location in S3, downloading the container image from ECR, publishing logs and metrics to CloudWatch)
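For reference, a trust policy that lets the SageMaker service assume a role typically looks like the document below. This is a minimal sketch of the trust relationship only. The permissions for S3, ECR, and CloudWatch are attached separately and are not shown here.

```python
import json

# Trust policy allowing the SageMaker service principal to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```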

In [None]:
exp_1_model = Model(image_uri=exp_1_image_uri, model_data=exp_1_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_1_endpoint_name= f"llama3-8b-exp-1-{timestamp}"
exp_1_instance_type = exp_1['instance_type']

exp_1_model.deploy(
    instance_type=exp_1_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_1_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_1_input_data = {
    "inputs": text,
    "parameters": {"max_new_tokens": 512, "do_sample": True}
}

# predictor class
exp_1_predictor = sagemaker.Predictor(
    endpoint_name=exp_1_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_1_response = exp_1_predictor.predict(exp_1_input_data)
exp_1_response_data = json.loads(exp_1_response)
print(exp_1_response_data)

**Test performance with LLMeter**

In [None]:
exp_1_sagemaker_endpoint = SageMakerEndpoint(
    exp_1_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_1_payloads = [exp_1_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_1_load_test = LoadTest(
    endpoint=exp_1_sagemaker_endpoint,
    payload=exp_1_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_1/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_1_load_test_results = await exp_1_load_test.run()

In [None]:
exp_1_figures = exp_1_load_test_results.plot_results()

## 4. Experiment 2: Deploy to single A10G GPU & higher vCPU + RAM

**Using ml.g5.8xlarge** with 1 GPU accelerator (24 GB VRAM), 32 vCPU and 128 GB RAM

**Price per hour (on-demand)** in the Oregon region at the time of writing is around $3.06
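Because this experiment uses the same GPU as Experiment 1, the larger instance only improves performance per dollar if its throughput rises by more than the price ratio. The short calculation below uses the two on-demand prices quoted in this notebook. The variable names are illustrative.

```python
price_g5_xlarge = 1.408   # USD per hour, from Experiment 1
price_g5_8xlarge = 3.06   # USD per hour, this experiment

# Throughput multiple needed for ml.g5.8xlarge to match the cost per token of ml.g5.xlarge
break_even_speedup = price_g5_8xlarge / price_g5_xlarge
print(f"Break-even throughput multiple: ~{break_even_speedup:.2f}x")
```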

In [None]:
exp_2 = {
    "instance_type": "ml.g5.8xlarge",
    "vram": 24,
    "vcpu": 32,
    "ram": 128,
    "hourly_compute_price_in_sin": 3.06
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant settings match the instance type that you intend to deploy to. Different instance types may have a different number of accelerators and a different amount of VRAM per accelerator.

In [None]:
exp_2_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
# Use a context manager so the file is closed before it is packaged
with open("serving.properties", "w") as f:
    f.write(exp_2_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_2_model
mv serving.properties exp_2_model/
tar czvf exp_2_model.tar.gz exp_2_model/
rm -rf exp_2_model
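
Where the `%%sh` cell magic is not available, the same archive can be built from Python. This is a minimal sketch using the standard-library `tarfile` module; `package_serving_properties` is a hypothetical helper, and the demo writes into a temporary directory rather than the notebook's working directory.

```python
import tarfile
import tempfile
from pathlib import Path


def package_serving_properties(properties_text: str, folder_name: str, out_dir: Path) -> Path:
    """Write <folder_name>/serving.properties into <out_dir>/<folder_name>.tar.gz."""
    archive_path = out_dir / f"{folder_name}.tar.gz"
    with tempfile.TemporaryDirectory() as tmp:
        props_file = Path(tmp) / "serving.properties"
        props_file.write_text(properties_text)
        with tarfile.open(archive_path, "w:gz") as tar:
            # The file lands at <folder_name>/serving.properties, as in the shell cell.
            tar.add(props_file, arcname=f"{folder_name}/serving.properties")
    return archive_path


# Demo in a throwaway directory so nothing in the working directory is touched.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_archive = package_serving_properties("engine=Python\n", "exp_2_model", Path(demo_dir))
    with tarfile.open(demo_archive) as tar:
        archive_members = tar.getnames()
print(archive_members)
```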

In [None]:
exp_2_s3_code_prefix = "llama3-8b-exp-2/code"
exp_2_code_artifact = sess.upload_data("exp_2_model.tar.gz", bucket_name, exp_2_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to ---> {exp_2_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_2_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here can be assumed by SageMaker (trust policy) and grants the necessary permissions (e.g. reading the model location in S3, pulling the container image from ECR, and publishing logs and metrics to CloudWatch).

In [None]:
exp_2_model = Model(image_uri=exp_2_image_uri, model_data=exp_2_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_2_endpoint_name= f"llama3-8b-exp-2-{timestamp}"
exp_2_instance_type = exp_2['instance_type']

exp_2_model.deploy(
    instance_type=exp_2_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_2_endpoint_name,
    container_startup_health_check_timeout=1000
)
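
Endpoint names are subject to the `EndpointName` constraints of the CreateEndpoint API (at most 63 characters, alphanumerics and hyphens). The sketch below validates a timestamped name before deployment; `make_endpoint_name` is a hypothetical helper, and the fixed date is only there to make the example reproducible.

```python
import re
from datetime import datetime

# Pattern and length limit as documented for CreateEndpoint's EndpointName.
ENDPOINT_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
MAX_ENDPOINT_NAME_LENGTH = 63


def make_endpoint_name(prefix: str, now: datetime) -> str:
    """Append a timestamp to the prefix and validate the result."""
    name = f"{prefix}-{now.strftime('%Y%m%d%H%M%S')}"
    if len(name) > MAX_ENDPOINT_NAME_LENGTH or not ENDPOINT_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


example_name = make_endpoint_name("llama3-8b-exp-2", datetime(2025, 1, 31, 9, 30, 0))
print(example_name)  # llama3-8b-exp-2-20250131093000
```

Failing early on a bad name is cheaper than waiting for the deployment call to reject it.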

**Do inference test**

In [None]:
# payload
exp_2_input_data = {
    "inputs": text, 
    "parameters": {"max_new_tokens":512, "do_sample":"true"}
}

# predictor class
exp_2_predictor = sagemaker.Predictor(
    endpoint_name=exp_2_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_2_response = exp_2_predictor.predict(exp_2_input_data)
exp_2_response_data = json.loads(exp_2_response)
print(exp_2_response_data)
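
To pull out just the completion rather than printing the whole body, a small parser can help. The sketch assumes the response is a JSON object with a top-level `generated_text` field, which matches the `generated_text_jmespath` used for LLMeter in this notebook; `extract_generated_text` and the sample bytes are made up for illustration.

```python
import json


def extract_generated_text(raw_response: bytes) -> str:
    """Decode the endpoint's JSON body and return its generated_text field."""
    body = json.loads(raw_response)
    if "generated_text" not in body:
        raise KeyError(f"No 'generated_text' in response keys: {sorted(body)}")
    return body["generated_text"]


# Made-up response bytes, standing in for what predictor.predict(...) returns.
fake_response = b'{"generated_text": "Hello from the endpoint"}'
print(extract_generated_text(fake_response))
```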

**Test performance with LLMeter**

In [None]:
exp_2_sagemaker_endpoint = SageMakerEndpoint(
    exp_2_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_2_payloads = [exp_2_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_2_load_test = LoadTest(
    endpoint=exp_2_sagemaker_endpoint,
    payload=exp_2_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_2/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)
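
Read literally, the two `min_requests_*` settings suggest a floor per client and a floor per run. The arithmetic below is a sketch under that reading only; it is not taken from LLMeter's source, and `requests_per_client` is a hypothetical helper.

```python
import math


def requests_per_client(n_clients: int, min_requests_per_run: int, min_requests_per_client: int) -> int:
    """Smallest per-client request count that satisfies both floors (assumed semantics)."""
    return max(min_requests_per_client, math.ceil(min_requests_per_run / n_clients))


request_plan = {}
for n_clients in [1, 5, 10]:
    per_client = requests_per_client(n_clients, min_requests_per_run=30, min_requests_per_client=10)
    request_plan[n_clients] = (per_client, per_client * n_clients)
    print(f"{n_clients} clients: {per_client} requests each, {per_client * n_clients} in total")
```

Under this assumption the 10-client run issues more than three times as many requests as the 1-client run, which is worth keeping in mind when estimating how long a load test will take.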

In [None]:
exp_2_load_test_results = await exp_2_load_test.run()

In [None]:
exp_2_figures = exp_2_load_test_results.plot_results()

## 5. Experiment 3: Deploy to multi A10G GPU

**Using ml.g5.12xlarge** with 4 GPU accelerators (96 GB total VRAM), 48 vCPU and 192 GB RAM

**Price per hour (on-demand)** in the Oregon region at the time of writing: around $7.09

In [None]:
exp_3 = {
    "instance_type": "ml.g5.12xlarge",
    "vram": 96,
    "vcpu": 48,
    "ram": 192,
    "hourly_compute_price_in_sin": 7.09
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant settings match the instance type you intend to deploy to. Instance types differ in the number of accelerators and in the VRAM per accelerator.
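With several GPUs, the degree also has to suit the model. Tensor-parallel runtimes such as vLLM generally expect the number of attention heads to divide evenly across the shards. The sketch below checks this for Llama 3 8B, whose published config lists 32 attention heads; `valid_tensor_parallel_degrees` is a hypothetical helper and does not consider other constraints such as memory.

```python
# Llama 3 8B attention head count from its published config.json.
NUM_ATTENTION_HEADS = 32


def valid_tensor_parallel_degrees(num_heads: int, gpu_count: int) -> list[int]:
    """Degrees from 1..gpu_count that split the attention heads evenly."""
    return [d for d in range(1, gpu_count + 1) if num_heads % d == 0]


# ml.g5.12xlarge has 4 GPUs.
tp_options = valid_tensor_parallel_degrees(NUM_ATTENTION_HEADS, gpu_count=4)
print(tp_options)  # [1, 2, 4]
```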

In [None]:
exp_3_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=4
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
# Use a context manager so the file is flushed and closed before it is packaged
with open('serving.properties', 'w') as f:
    f.write(exp_3_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_3_model
mv serving.properties exp_3_model/
tar czvf exp_3_model.tar.gz exp_3_model/
rm -rf exp_3_model

In [None]:
exp_3_s3_code_prefix = "llama3-8b-exp-3/code"
exp_3_code_artifact = sess.upload_data("exp_3_model.tar.gz", bucket_name, exp_3_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to ---> {exp_3_code_artifact}")
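
For a single uploaded file, the printed URI is expected to be the bucket, the key prefix, and the file name joined together. The sketch below reconstructs that URI so it can be compared with the printed value; `expected_artifact_uri` is a hypothetical helper, `my-bucket` is a placeholder, and the URI layout is an assumption about how `upload_data` behaves for single files.

```python
import posixpath


def expected_artifact_uri(bucket: str, key_prefix: str, local_path: str) -> str:
    """S3 URI where a single uploaded file is expected to land."""
    file_name = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{file_name}"


# "my-bucket" is a placeholder; the prefix and file name match this experiment.
example_uri = expected_artifact_uri("my-bucket", "llama3-8b-exp-3/code", "exp_3_model.tar.gz")
print(example_uri)
```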

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_3_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here can be assumed by SageMaker (trust policy) and grants the necessary permissions (e.g. reading the model location in S3, pulling the container image from ECR, and publishing logs and metrics to CloudWatch).

In [None]:
exp_3_model = Model(image_uri=exp_3_image_uri, model_data=exp_3_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_3_endpoint_name= f"llama3-8b-exp-3-{timestamp}"
exp_3_instance_type = exp_3['instance_type']

exp_3_model.deploy(
    instance_type=exp_3_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_3_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_3_input_data = {
    "inputs": text, 
    "parameters": {"max_new_tokens":512, "do_sample":"true"}
}

# predictor class
exp_3_predictor = sagemaker.Predictor(
    endpoint_name=exp_3_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_3_response = exp_3_predictor.predict(exp_3_input_data)
exp_3_response_data = json.loads(exp_3_response)
print(exp_3_response_data)

**Test performance with LLMeter**

In [None]:
exp_3_sagemaker_endpoint = SageMakerEndpoint(
    exp_3_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_3_payloads = [exp_3_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_3_load_test = LoadTest(
    endpoint=exp_3_sagemaker_endpoint,
    payload=exp_3_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_3/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_3_load_test_results = await exp_3_load_test.run()

In [None]:
exp_3_figures = exp_3_load_test_results.plot_results()

## 6. Experiment 4: Deploy to single L4 GPU with lower vCPU and RAM

**Using ml.g6.xlarge** with 1 GPU accelerator (24 GB total VRAM), 4 vCPU and 16 GB RAM

**Price per hour (on-demand)** in the Oregon region at the time of writing: around $1.1267

In [None]:
exp_4 = {
    "instance_type": "ml.g6.xlarge",
    "vram": 24,
    "vcpu": 4,
    "ram": 16,
    "hourly_compute_price_in_sin": 1.1267
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant settings match the instance type you intend to deploy to. Instance types differ in the number of accelerators and in the VRAM per accelerator.
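A quick back-of-the-envelope check shows why a single 24 GB GPU is enough for this model but leaves limited headroom. The sketch assumes roughly 8.03 billion parameters stored in 16-bit precision; `weight_memory_gb` is a hypothetical helper, and runtime overhead is not modelled.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: about 8.03 billion parameters stored in 16-bit precision.
weights_gb = weight_memory_gb(8.03e9)
headroom_gb = 24 - weights_gb  # ml.g6.xlarge has one 24 GB L4
print(f"weights ~{weights_gb:.1f} GB, ~{headroom_gb:.1f} GB left for KV cache and overhead")
```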

In [None]:
exp_4_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
# Use a context manager so the file is flushed and closed before it is packaged
with open('serving.properties', 'w') as f:
    f.write(exp_4_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_4_model
mv serving.properties exp_4_model/
tar czvf exp_4_model.tar.gz exp_4_model/
rm -rf exp_4_model

In [None]:
exp_4_s3_code_prefix = "llama3-8b-exp-4/code"
exp_4_code_artifact = sess.upload_data("exp_4_model.tar.gz", bucket_name, exp_4_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to ---> {exp_4_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_4_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here can be assumed by SageMaker (trust policy) and grants the necessary permissions (e.g. reading the model location in S3, pulling the container image from ECR, and publishing logs and metrics to CloudWatch).

In [None]:
exp_4_model = Model(image_uri=exp_4_image_uri, model_data=exp_4_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_4_endpoint_name= f"llama3-8b-exp-4-{timestamp}"
exp_4_instance_type = exp_4['instance_type']

exp_4_model.deploy(
    instance_type=exp_4_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_4_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_4_input_data = {
    "inputs": text, 
    "parameters": {"max_new_tokens":512, "do_sample":"true"}
}

# predictor class
exp_4_predictor = sagemaker.Predictor(
    endpoint_name=exp_4_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_4_response = exp_4_predictor.predict(exp_4_input_data)
exp_4_response_data = json.loads(exp_4_response)
print(exp_4_response_data)

**Test performance with LLMeter**

In [None]:
exp_4_sagemaker_endpoint = SageMakerEndpoint(
    exp_4_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_4_payloads = [exp_4_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_4_load_test = LoadTest(
    endpoint=exp_4_sagemaker_endpoint,
    payload=exp_4_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_4/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_4_load_test_results = await exp_4_load_test.run()

In [None]:
exp_4_figures = exp_4_load_test_results.plot_results()

## 7. Experiment 5: Deploy to single L40S GPU with lower vCPU and RAM

**Using ml.g6e.xlarge** with 1 GPU accelerator (48 GB total VRAM), 4 vCPU and 32 GB RAM

**Price per hour (on-demand)** in the Oregon region at the time of writing: around $2.6054

In [None]:
exp_5 = {
    "instance_type": "ml.g6e.xlarge",
    "vram": 48,
    "vcpu": 4,
    "ram": 32,
    "hourly_compute_price_in_sin": 2.6054
}

**Write and upload serving.properties for DJL configuration**

**Important**: Make sure `tensor_parallel_degree` and the other relevant settings match the instance type you intend to deploy to. Instance types differ in the number of accelerators and in the VRAM per accelerator.
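The extra VRAM per accelerator mostly benefits the KV cache, which bounds how many tokens can be held in flight. The estimate below uses the Llama 3 8B architecture values from its published config (32 layers, 8 KV heads, head dimension 128) and assumes a 16-bit cache; the 32 GB figure is a rough assumption (48 GB minus about 16 GB of weights) that ignores runtime overhead.

```python
# Llama 3 8B architecture values from its published config.json.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit KV cache (assumption)

# One key vector and one value vector per KV head, per layer, per token.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_cache_token_capacity(free_vram_gb: float) -> int:
    """Tokens whose KV cache fits in the given free VRAM (GB = 10^9 bytes)."""
    return int(free_vram_gb * 1e9 // kv_bytes_per_token)


print(kv_bytes_per_token)  # 131072 bytes, i.e. 128 KiB per token
print(kv_cache_token_capacity(32.0))
```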

In [None]:
exp_5_serving_properties_content= f"""engine=Python
option.entryPoint=djl_python.huggingface
option.model_id={s3_uri}
option.max_rolling_batch_size=32
option.tensor_parallel_degree=1
option.rolling_batch=auto
option.enable_mixed_precision_accumulation=true
option.model_loading_timeout=1000"""
# Use a context manager so the file is flushed and closed before it is packaged
with open('serving.properties', 'w') as f:
    f.write(exp_5_serving_properties_content)

**Package the serving.properties into a tar.gz file and then upload to S3**

In [None]:
%%sh
mkdir exp_5_model
mv serving.properties exp_5_model/
tar czvf exp_5_model.tar.gz exp_5_model/
rm -rf exp_5_model

In [None]:
exp_5_s3_code_prefix = "llama3-8b-exp-5/code"
exp_5_code_artifact = sess.upload_data("exp_5_model.tar.gz", bucket_name, exp_5_s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to ---> {exp_5_code_artifact}")

**Set inference container image**

Use SageMaker LMI with DJL and vLLM

In [None]:
exp_5_image_uri = f"763104351884.dkr.ecr.{sess.boto_session.region_name}.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126"

**Create SageMaker Model object**

**Important**: Make sure that the IAM role passed here can be assumed by SageMaker (trust policy) and grants the necessary permissions (e.g. reading the model location in S3, pulling the container image from ECR, and publishing logs and metrics to CloudWatch).

In [None]:
exp_5_model = Model(image_uri=exp_5_image_uri, model_data=exp_5_code_artifact, role=role)

**Deploy the model**

In [None]:
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
exp_5_endpoint_name= f"llama3-8b-exp-5-{timestamp}"
exp_5_instance_type = exp_5['instance_type']

exp_5_model.deploy(
    instance_type=exp_5_instance_type,
    initial_instance_count=1,
    endpoint_name=exp_5_endpoint_name,
    container_startup_health_check_timeout=1000
)

**Do inference test**

In [None]:
# payload
exp_5_input_data = {
    "inputs": text, 
    "parameters": {"max_new_tokens":512, "do_sample":"true"}
}

# predictor class
exp_5_predictor = sagemaker.Predictor(
    endpoint_name=exp_5_endpoint_name,
    serializer=serializers.JSONSerializer()
)

# post request
exp_5_response = exp_5_predictor.predict(exp_5_input_data)
exp_5_response_data = json.loads(exp_5_response)
print(exp_5_response_data)

**Test performance with LLMeter**

In [None]:
exp_5_sagemaker_endpoint = SageMakerEndpoint(
    exp_5_endpoint_name,
    model_id=MODEL_NAME,
    generated_text_jmespath="generated_text",
)

In [None]:
exp_5_payloads = [exp_5_sagemaker_endpoint.create_payload(t) for t in texts]

In [None]:
exp_5_load_test = LoadTest(
    endpoint=exp_5_sagemaker_endpoint,
    payload=exp_5_payloads,
    sequence_of_clients=[1, 5, 10],
    output_path=f"outputs/{prefix}_exp_5/load_test",
    min_requests_per_run=30,
    min_requests_per_client=10
)

In [None]:
exp_5_load_test_results = await exp_5_load_test.run()

In [None]:
exp_5_figures = exp_5_load_test_results.plot_results()

## 8. Performance and cost analysis

In [None]:
# Load stats.json from the most recent run folder of each experiment (largest folder name)
for exp_number, exp in enumerate([exp_1, exp_2, exp_3, exp_4, exp_5], start=1):
    exp['load_test_output'] = {}
    run_dir = max(Path(f"outputs/{prefix}_exp_{exp_number}/load_test").glob("*"), key=lambda x: x.name)
    for i in [1, 5, 10]:
        with open(run_dir / f"{i:05d}-clients/stats.json") as f:
            exp['load_test_output'][i] = json.load(f)
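
Selecting the run folder by the largest name only yields the latest run if the folder names sort chronologically. The sketch below isolates that step and fails with a clear error when no run exists; `latest_run_dir` is a hypothetical helper, and the timestamp-like folder names are made up for the demo.

```python
import tempfile
from pathlib import Path


def latest_run_dir(load_test_dir: Path) -> Path:
    """Return the child folder whose name sorts last (latest, if names are timestamps)."""
    candidates = [p for p in load_test_dir.glob("*") if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"No load-test runs found under {load_test_dir}")
    return max(candidates, key=lambda p: p.name)


# Demo with made-up, timestamp-like folder names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for run_name in ["20250101-090000", "20250102-081500", "20241231-235959"]:
        (Path(tmp) / run_name).mkdir()
    latest_name = latest_run_dir(Path(tmp)).name
print(latest_name)  # 20250102-081500
```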


In [None]:
import pandas as pd

experiments = {'exp_1': exp_1, 'exp_2': exp_2, 'exp_3': exp_3, 'exp_4': exp_4, 'exp_5': exp_5}

data = []
for exp_name, exp_data in experiments.items():
    row = [exp_name, exp_data['instance_type'], exp_data['vram'], exp_data['vcpu'], exp_data['ram'], exp_data['hourly_compute_price_in_sin']]
    for client in [1, 5, 10]:
        if client in exp_data['load_test_output']:
            lt = exp_data['load_test_output'][client]
            rpm = lt['requests_per_minute']
            input_tpm = lt['average_input_tokens_per_minute']
            output_tpm = lt['average_output_tokens_per_minute']
            total_tpm = input_tpm + output_tpm
            # Per-minute rate x 60 = per-hour rate; dividing by the hourly price gives units per dollar
            rpm_per_dollar = rpm * 60 / exp_data['hourly_compute_price_in_sin']
            tpm_per_dollar = total_tpm * 60 / exp_data['hourly_compute_price_in_sin']
            row.extend([rpm, input_tpm, output_tpm, total_tpm, rpm_per_dollar, tpm_per_dollar])
        else:
            row.extend([None] * 6)
    data.append(row)

basic_cols = ['Experiment', 'Instance Type', 'VRAM (GB)', 'vCPU', 'RAM (GB)', 'Hourly Price ($)']
metric_cols = ['Requests/min', 'Input Tokens/min', 'Output Tokens/min', 'Total Tokens/min', 'Requests/$', 'Total Tokens/$']
columns = pd.MultiIndex.from_tuples(
    [(col, '') for col in basic_cols] + 
    [(f'Client {c}', metric) for c in [1, 5, 10] for metric in metric_cols]
)

df = pd.DataFrame(data, columns=columns)
styled_df = df.style.format({col: '{:,.2f}' for col in df.select_dtypes(include='number').columns})

def color_clients(s):
    # Shade each client's block of metric columns, which starts right after the basic columns
    colors = [''] * len(s)
    for i, client in enumerate([1, 5, 10]):
        start_col = len(basic_cols) + i * len(metric_cols)
        end_col = start_col + len(metric_cols)
        color = ['background-color: #e6f3ff', 'background-color: #ffe6e6', 'background-color: #e6ffe6'][i]
        for j in range(start_col, end_col):
            if j < len(colors):
                colors[j] = color
    return colors

styled_df = styled_df.apply(color_clients, axis=1)
styled_df
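
The per-dollar columns convert a per-minute rate into an hourly rate and divide by the hourly price. The worked example below spells that out; `per_dollar` is a hypothetical helper, the throughput figure is made up, and the price is the ml.g5.8xlarge price quoted in Experiment 2.

```python
def per_dollar(rate_per_minute: float, hourly_price: float) -> float:
    """Units processed per dollar: (units per hour) / (dollars per hour)."""
    return rate_per_minute * 60 / hourly_price


# Made-up throughput of 51 requests/min at $3.06 per hour.
example_requests_per_dollar = per_dollar(51.0, 3.06)
print(round(example_requests_per_dollar, 2))  # about 1000 requests per dollar
```

Because the price is in the denominator, a cheaper instance with lower throughput can still come out ahead on this metric.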

## 9. Cleanup

To avoid ongoing charges, delete the SageMaker AI endpoints that this notebook deployed, for example from the AWS Console. You can also delete the corresponding endpoint configurations.
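
The same cleanup can be scripted. The sketch below takes a SageMaker client and uses the `describe_endpoint`, `delete_endpoint`, and `delete_endpoint_config` operations; `delete_endpoint_and_config` is a hypothetical helper, and the usage is left commented out because it needs AWS credentials and removes real resources.

```python
def delete_endpoint_and_config(sm_client, endpoint_name: str) -> str:
    """Delete an endpoint and the endpoint configuration it points to."""
    # Look up the config name first; it cannot be read once the endpoint is gone.
    config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


# Intended usage (requires AWS credentials, so it is left commented out):
# import boto3
# sm_client = boto3.client("sagemaker")
# for name in [exp_1_endpoint_name, exp_2_endpoint_name, exp_3_endpoint_name,
#              exp_4_endpoint_name, exp_5_endpoint_name]:
#     delete_endpoint_and_config(sm_client, name)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.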