# Deploy a LLaMA-3.1-8B-Instruct Model and Quantized version of the same model using SageMaker Endpoints and SageMaker Large Model Inference (LMI) Container with the SageMaker Python SDK

In [1]:
%pip install sagemaker==2.229 --upgrade --quiet --no-warn-conflicts

Note: you may need to restart the kernel to use updated packages.


In [2]:
import boto3, sagemaker

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


Baseline SageMaker setup

In [3]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker Studio environment

sm_client = boto3.client('sagemaker')

print(f"SageMaker version: {sagemaker.__version__}")

SageMaker version: 2.229.0


## Large Model Inference (LMI) Containers

In this example you will deploy your model using [SageMaker's Large Model Inference (LMI) Containers](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html).

LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers, you can leverage high-performance open-source inference libraries such as vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on Amazon SageMaker endpoints. Each container bundles a model server with open-source inference libraries to deliver an all-in-one LLM serving solution.

The LMI container supports a variety of different backends, outlined in the table below. 

The model for this example can be deployed using the LMI-Dist backend, which corresponds to the `djl-lmi` container image.

| Backend | SageMaker DLC | Example URI |
| --- | --- | --- |
| vLLM | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| lmi-dist | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| hf-accelerate | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| tensorrt-llm | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124 |
| transformers-neuronx | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1 |

In [None]:
LMI_VERSION = "0.29.0"
LMI_FRAMEWORK = 'djl-lmi'

lmidist_image = sagemaker.image_uris.retrieve(framework=LMI_FRAMEWORK, region=region, version=LMI_VERSION)

print(f"Inference Image: {lmidist_image}")

Next you will need to specify configuration of the LMI container to allow the model artifact to be downloaded, and provide optimized parameters to allow the model to run on the chosen instance size/type.

There are 2 methods to supply configuration to the LMI container:
1. Create a `serving.properties` file and include it inside the compressed model artifact. This has the benefit that no configuration needs to be shared separately, as long as you have the model artifact. However, it couples the configuration tightly to the artifact, which adds complexity when deploying the same model on different instance types.
2. Provide a set of Environment Variables to the SageMaker Model object. This provides flexibility by storing the LMI configuration information inside the SageMaker Model configuration step.

In this example, you will leverage Environment Variables to configure the LMI container.
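
The two methods carry the same settings under different names. Judging from the keys used in this notebook (`option.rolling_batch` in `serving.properties`, `OPTION_ROLLING_BATCH` as an environment variable), an `option.<name>` entry corresponds to `OPTION_<NAME>`. The helper below is a hypothetical sketch of that naming rule; it is not part of the LMI container or the SageMaker SDK, and mapping `option.model_id` to `HF_MODEL_ID` is an assumption based on how `HF_MODEL_ID` is used in this notebook.

```python
def properties_to_env(properties_text: str) -> dict:
    """Translate serving.properties text into an environment-variable dict.

    Hypothetical helper: illustrates the option.<name> -> OPTION_<NAME> rule only.
    """
    env = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "option.model_id":
            # Assumption: the model location is passed as HF_MODEL_ID.
            env["HF_MODEL_ID"] = value
        elif key.startswith("option."):
            env["OPTION_" + key[len("option."):].upper()] = value
        # Keys without the "option." prefix are ignored in this sketch.
    return env


example = """
option.model_id=s3://YOUR_S3_URI
option.rolling_batch=lmi-dist
option.max_model_len=4096
"""
print(properties_to_env(example))
```

The resulting dictionary has the shape expected by the `env` argument of `sagemaker.Model`.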

For deploying HuggingFace models, the `HF_MODEL_ID` parameter is dual purpose and can be either the HuggingFace Model ID, or an S3 location of the model artifacts. If you specify the Model ID, the artifacts will be downloaded when the endpoint is created.

Specific optimizations for this smaller instance size are limiting the `max_model_len` parameter to 20,000 (down from LLaMA-3.1-8B's 128k default) and reducing the `gpu_memory_utilization` to 0.5 to help prevent CUDA OOM errors.
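
To see why capping `max_model_len` matters, a rough estimate of the KV-cache size per sequence is useful. The sketch below assumes the published Llama-3.1-8B architecture (32 layers, 8 key/value heads, head dimension 128) and 2-byte (fp16) cache entries. These figures are assumptions that do not come from this notebook, and the memory the serving backend actually allocates will differ.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV-cache bytes for one sequence of `num_tokens` tokens."""
    # Factor 2 accounts for storing both keys and values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


for label, tokens in [("max_model_len = 20,000", 20_000),
                      ("full 128k context", 131_072)]:
    print(f"{label}: {kv_cache_bytes(tokens) / GIB:.2f} GiB per sequence")
# max_model_len = 20,000: 2.44 GiB per sequence
# full 128k context: 16.00 GiB per sequence
```

Under these assumptions a single full-length sequence would need about 16 GiB of cache on its own, which illustrates why a shorter maximum length is easier to fit on a smaller GPU.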

### Model License Information
Meta Llama-3.1 is a gated model. To use it, you have to agree to the license agreement and request access before the model can be used in this notebook.

In [None]:
# Base model location (needs to be replaced by an API call to the model registry)
base_model_location = "s3://YOUR_S3_URI"

# Adapter(s) location (needs to be replaced by an API call to the model registry, with optional uncompress)
adapter_model_location = "s3://YOUR_S3_URI"

In [None]:
!rm -rf lora-multi-adapter
!mkdir -p lora-multi-adapter/adapters

In [None]:
%%writefile lora-multi-adapter/serving.properties
option.model_id={base_model_location}
option.max_rolling_batch_size=16
option.rolling_batch=lmi-dist
option.max_rolling_batch_prefill_tokens=4096
option.max_model_len=4096
option.enable_lora=true
option.gpu_memory_utilization=0.8
option.max_lora_rank=64
option.max_cpu_loras=4
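
Note that `%%writefile` is expected to write the cell body verbatim, so a placeholder such as `{base_model_location}` would reach the file unexpanded rather than being replaced by the variable's value. One alternative is to render the file from Python. The following is a minimal sketch under that assumption: the helper name `write_serving_properties` is hypothetical, and a temporary directory stands in for `lora-multi-adapter` so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def write_serving_properties(target_dir, options):
    """Write `options` as option.<key>=<value> lines to serving.properties."""
    target = Path(target_dir) / "serving.properties"
    target.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"option.{key}={value}" for key, value in options.items()]
    target.write_text("\n".join(lines) + "\n")
    return target


base_model_location = "s3://YOUR_S3_URI"  # placeholder, as in the cell above

with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_serving_properties(tmp_dir, {
        "model_id": base_model_location,  # the variable's value is substituted here
        "rolling_batch": "lmi-dist",
        "max_rolling_batch_size": 16,
        "max_model_len": 4096,
        "enable_lora": "true",
    })
    rendered = path.read_text()

print(rendered)
```

In the notebook you would pass `"lora-multi-adapter"` as `target_dir` and include the remaining options from the cell above.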

In [None]:
s3_code_prefix = "workshop/multi-lora/Llama-2-7b-fp16"
s3_adapters_location = sess.upload_data("lora-multi-adapter", bucket, s3_code_prefix)
print(s3_adapters_location)

Helper function to test latency of the deployed model

In [None]:
import time
import numpy as np

def run_perf_test(llm, num_iterations, prompt):
    results = []
    for i in range(0, num_iterations):
        start = time.time()
        res = llm.predict({"inputs": prompt})
        results.append((time.time() - start) * 1000)
    
    print("\nPrediction latency: \n")
    print("P95: " + str(np.percentile(results, 95)) + " ms")
    print("P90: " + str(np.percentile(results, 90)) + " ms")
    print("Average: " + str(np.average(results)) + " ms")
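
Before pointing the helper at a live endpoint, it can help to check the percentile arithmetic offline on a fixed list of samples. The sketch below is an illustration only: `latency_summary` is a hypothetical function, and it uses the standard-library `statistics.quantiles` with the inclusive method (linear interpolation between data points) instead of NumPy so that it runs without extra dependencies.

```python
import statistics


def latency_summary(samples_ms):
    """Return P90, P95, and average for a list of latencies in milliseconds."""
    # quantiles(..., n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p90": cuts[89],
        "p95": cuts[94],
        "average": statistics.fmean(samples_ms),
    }


summary = latency_summary([10, 20, 30, 40, 50])
for name, value in summary.items():
    print(f"{name}: {value:.1f} ms")
```

With only ten iterations per run, the high percentiles are dominated by the one or two slowest requests, so treat them as indicative rather than precise.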

## LMI-Dist (recommended framework for FM deployments on Amazon SageMaker)

In [None]:
instance_type = "ml.g5.2xlarge"
container = lmidist_image
model_name = sagemaker.utils.name_from_base("llama31-8b-lmidist")
endpoint_name = model_name

In the following steps you will use the SageMaker Python SDK to build your model configuration and deploy it to a SageMaker endpoint. There are alternative ways to do this, such as the Boto3 SDK, but the SageMaker Python SDK reduces the amount of code necessary to perform the same activities.
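
For comparison, the same deployment through Boto3 takes three separate calls: `create_model`, `create_endpoint_config`, and `create_endpoint`. The function below is a minimal sketch, not the approach used in this notebook. The name `deploy_with_boto3`, the variant name `AllTraffic`, and the choice to reuse one name for all three resources are illustrative assumptions.

```python
def deploy_with_boto3(sm_client, name, image_uri, role_arn, env, instance_type,
                      startup_timeout_s=900):
    """Create a model, an endpoint configuration, and an endpoint.

    `sm_client` is expected to be a boto3 SageMaker client,
    for example boto3.client("sagemaker").
    """
    # 1. Register the container image and its environment as a model.
    sm_client.create_model(
        ModelName=name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": image_uri, "Environment": env},
    )
    # 2. Describe the hardware the model should run on.
    sm_client.create_endpoint_config(
        EndpointConfigName=name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
        }],
    )
    # 3. Start the endpoint; this call returns before the endpoint is ready.
    sm_client.create_endpoint(EndpointName=name, EndpointConfigName=name)
    return name
```

The SageMaker Python SDK's `Model.deploy` performs the equivalent of these steps in a single call, which is why the rest of this notebook uses it.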

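If your model is registered in the SageMaker Model Registry, you also need to update the model package with the correct inference parameters. The cell below is left commented out as a template.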

In [None]:
#response = sm_client.update_model_package(
#    ModelPackageArn=latest_model_package_arn,
#    InferenceSpecification={
#        'Containers': [
#            {
#                'Image': inference_image_uri,
#                'ModelDataUrl': model_data_uri,
#                # Environment variables belong to the container definition
#                'Environment': lmidist_config
#            },
#        ],
#        'SupportedTransformInstanceTypes': ['ml.g5.12xlarge', 'ml.p3.8xlarge'],
#        'SupportedRealtimeInferenceInstanceTypes': ['ml.g5.2xlarge'],
#        'SupportedContentTypes': ['application/json'],
#        'SupportedResponseMIMETypes': ['application/json']
#    }
#)
#model_package_arn = response["ModelPackageArn"]
#print(f"Updated registered model's inference spec: {model_package_arn}")

In [None]:
lmi_model = sagemaker.Model(
    image_uri = lmidist_image,
    model_data = {
        'S3DataSource': {
            'S3Uri': adapter_model_location + "/",
            'S3DataType': "S3Prefix",
            'CompressionType': "None"
        }
    },
    role = role,
    name = model_name
)

Now that you have a model object ready, you will use the SageMaker Python SDK to create a SageMaker managed endpoint. The SDK eliminates some of the intermediate steps, such as creating an endpoint configuration.

***Note: creating a new endpoint can take between 8 and 15 minutes.***

In [None]:
lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = instance_type,
    container_startup_health_check_timeout = 900,
    endpoint_name = endpoint_name,
)
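
`deploy` waits for the endpoint by default. If the kernel restarts during that wait, or the endpoint was created elsewhere, you can check readiness yourself by polling `describe_endpoint`. The helper below is a hypothetical sketch with assumed polling and timeout values; Boto3 also ships an `endpoint_in_service` waiter that covers the common case.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)


# Usage (requires AWS credentials and an existing endpoint):
# wait_for_endpoint(sm_client, endpoint_name)
```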

With your endpoint successfully deployed, you will want to test it to ensure that it is fully functional.

In [None]:
llm = sagemaker.Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

payload = {
    "inputs": "¿Qué es Amazon SageMaker?",
    "adapters": ["es"],
    "parameters": {
        "max_new_tokens": 128,
        "top_p": 0.9,
        "temperature": 0.6,
    },
}

res = llm.predict(payload)
print("\n---\n",res["generated_text"], "\n---\n")

Next, collect latency numbers for the base model. You will compare them with the latency of the quantized model later.

In [None]:
#
# Calculate runtime performance
#
prompt = "What is Amazon SageMaker?"  # no adapter is specified in the perf-test payload
run_perf_test(llm = llm, num_iterations = 10, prompt = prompt)

**Do NOT forget to delete unused endpoints to avoid unnecessary charges to your account.**

In [None]:
llm.delete_model()
llm.delete_endpoint()

## (Optional) Deploy quantized (GPTQ) Llama-3.1-8B using LMI container with lmi-dist framework

### In the next few cells we deploy a quantized version of the same model

In [None]:
lmidist_quantized_config = {
    "HF_MODEL_ID": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16",
    #"HF_MODEL_ID": "YOUR_S3URI",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_OUTPUT_FORMATTER": "json",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "64",
    "OPTION_MAX_ROLLING_BATCH_PREFILL_TOKENS": "8192",
    "OPTION_MAX_MODEL_LEN": "8192",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
    "OPTION_QUANTIZE": "gptq",
}

In [None]:
instance_type = "ml.g5.2xlarge"
container = lmidist_image
config = lmidist_quantized_config
model_name = sagemaker.utils.name_from_base("llama31-8b-q-lmidist")
endpoint_name = model_name

In [None]:
lmi_model = sagemaker.Model(
    image_uri = container,
    env = config,
    role = role,
    name = model_name
)

lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = instance_type,
    container_startup_health_check_timeout = 900,
    endpoint_name = endpoint_name,
)

In [None]:
llm = sagemaker.Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

prompt = "What is Amazon SageMaker?"

res = llm.predict({"inputs": prompt})
print("\n---\n",res["generated_text"], "\n---\n")

In [None]:
# 
# Calculate runtime performance
# 
run_perf_test(llm = llm, num_iterations = 10, prompt = prompt)

In [None]:
llm.delete_model()
llm.delete_endpoint()

## (Optional) Deploy Llama-3.1-8B using TensorRT-LLM backend with Just-In-Time (JIT) compilation

**Please note that deploying with the TensorRT-LLM backend and JIT compilation requires a larger instance (g5.8xlarge at minimum) because the compilation process needs additional memory. You can still deploy a pre-compiled Llama-3.1-8B on g5.2xlarge.**

**This optional activity is provided for educational purposes. It can be run in an AWS account that has access to the required instance types (such as your personal or organization account).**

***!!! Do NOT run this during workshop if you don't have access to at least g5.8xlarge !!!***

In [None]:
LMI_VERSION = "0.29.0"
LMI_FRAMEWORK = 'djl-tensorrtllm'

tensorrtllm_image = sagemaker.image_uris.retrieve(framework=LMI_FRAMEWORK, region=region, version=LMI_VERSION)

print(f"Inference Image: {tensorrtllm_image}")

In [None]:
#
# TensorRT-LLM 0.11.0 (included in LMI 0.29.0) does NOT support Llama-3.1,
# so this configuration uses Llama-3 instead.
#
tensorrtllm_config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
    "HF_TOKEN": "YOUR_HF_TOKEN",
    #"HF_MODEL_ID": "YOUR_S3_URI",
    "OPTION_ROLLING_BATCH": "trtllm",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_MAX_NUM_TOKENS": "4096",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
}

In [None]:
instance_type = "ml.g5.8xlarge"
container = tensorrtllm_image
config = tensorrtllm_config
model_name = sagemaker.utils.name_from_base("llama31-8b-trtllm")
endpoint_name = model_name

In [None]:
lmi_model = sagemaker.Model(
    image_uri = container,
    env = config,
    role = role,
    name = model_name
)

lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = instance_type,
    container_startup_health_check_timeout = 900,
    endpoint_name = endpoint_name,
)

In [None]:
llm = sagemaker.Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

prompt = "What is Amazon SageMaker?"

res = llm.predict({"inputs": prompt})
print("\n---\n",res["generated_text"], "\n---\n")

In [None]:
# 
# Calculate runtime performance
# 
run_perf_test(llm = llm, num_iterations = 10, prompt = prompt)

In [None]:
llm.delete_model()
llm.delete_endpoint()

## (Optional) Deploy Llama-3.1-8B on Inferentia2 with Just-In-Time (JIT) compilation

**Please note that deploying on Inferentia2 with JIT compilation requires a larger instance (inf2.8xlarge at minimum) because the compilation process needs additional memory.**

**This optional activity is provided for educational purposes. It can be run in an AWS account that has access to the required instance types (such as your personal or organization account).**

***!!! Do NOT run this during workshop if you don't have access to at least inf2.8xlarge !!!***

In [None]:
LMI_VERSION = "0.29.0"
LMI_FRAMEWORK = 'djl-neuronx'

neuronx_image = sagemaker.image_uris.retrieve(framework=LMI_FRAMEWORK, region=region, version=LMI_VERSION)

print(f"Inference Image: {neuronx_image}")

In [None]:
#
neuronx_config = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "HF_TOKEN": "YOUR_HF_TOKEN",
    #"HF_MODEL_ID": "YOUR_S3_URI",
    "OPTION_ROLLING_BATCH": "auto",
    "OPTION_TENSOR_PARALLEL_DEGREE": "2",
    "OPTION_N_POSITIONS": "4096",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
    "OPTION_DTYPE": "fp16",
}

In [None]:
instance_type = "ml.inf2.8xlarge"
container = neuronx_image
config = neuronx_config
model_name = sagemaker.utils.name_from_base("llama31-8b-neuronx")
endpoint_name = model_name

In [None]:
lmi_model = sagemaker.Model(
    image_uri = container,
    env = config,
    role = role,
    name = model_name
)

lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = instance_type,
    container_startup_health_check_timeout = 900,
    endpoint_name = endpoint_name,
)

In [None]:
llm = sagemaker.Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

prompt = "What is Amazon SageMaker?"

res = llm.predict({"inputs": prompt})
print("\n---\n",res["generated_text"], "\n---\n")

In [None]:
# 
# Calculate runtime performance
# 
run_perf_test(llm = llm, num_iterations = 10, prompt = prompt)

In [None]:
llm.delete_model()
llm.delete_endpoint()