# Deploy Llama 3 on Amazon SageMaker with TensorRT-LLM

---

[Llama 3](https://llama.meta.com/llama3/) is a family of models pretrained on over 15 trillion tokens, a training dataset 7x larger than that used for Llama 2, with an 8k context length. The models excel at text summarization, text classification, sentiment analysis, nuanced reasoning, language modeling, dialogue systems, code generation, and following instructions. To learn more about Llama 3 models, click [here](https://llama.meta.com/llama3/).

SageMaker now offers a [TensorRT-LLM container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) that lets you use SageMaker's managed serving capabilities and takes care of the undifferentiated heavy lifting.

In this notebook, we combine the strengths of two powerful tools: [DJL](https://docs.djl.ai/) (Deep Java Library) for the serving framework and [TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/) for distributed large language model inference on NVIDIA GPUs. DJLServing, a high-performance universal model serving solution powered by DJL, handles the overall serving architecture.

In our setup, TensorRT-LLM handles the core LLM inference tasks, leveraging its optimizations to achieve high performance and low latency. DJLServing manages the broader serving infrastructure, handling incoming requests, load balancing, and coordinating with TensorRT-LLM for efficient inference.

This combination allows us to deploy the `Llama 3 8B` model across the GPUs of the `ml.g5.12xlarge` instance with efficient resource utilization. TensorRT-LLM's optimizations in memory management and request handling enable us to serve this large model with improved throughput compared to traditional serving methods. To learn more about DJL, DJLServing, and TensorRT-LLM, you can refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/).

---

## Requirements

1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose `ml.t3.medium`.
2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b>

- For <a href="https://aws.amazon.com/sagemaker/studio/" target="_blank">Amazon SageMaker Studio</a>, select Kernel "<span style="color:green;">Python 3 (ipykernel)</span>".

- For <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" target="_blank">Amazon SageMaker Studio Classic</a>, select Image "<span style="color:green;">Base Python 3.0</span>" and Kernel "<span style="color:green;">Python 3</span>".

</div>

To run this notebook, install the following dependencies:

In [4]:
!pip install boto3==1.34.132 -qU --force-reinstall --quiet --no-warn-conflicts
!pip install sagemaker==2.224.2 -qU --force-reinstall --quiet --no-warn-conflicts

---

### Import libraries

In [1]:
import boto3
import json
import sagemaker

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
print(sagemaker.__version__)

2.224.2


### Initialize parameters

In [3]:
# execution role for the endpoint
role = sagemaker.get_execution_role()

# sagemaker session for interacting with different AWS APIs
sess = sagemaker.session.Session()

# Region
region_name = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {region_name}")

sagemaker role arn: arn:aws:iam::570598552974:role/txt2sql-SageMakerExecutionRole-PAgMr5TND4x0
sagemaker session region: us-east-1


### Image URI of the DJL Container

LMI DLCs offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI allows you to apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, by just requiring the model ID and optional model parameters. 

See the available Large Model Inference DLCs [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers), and find more details [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/announcements/deepspeed-deprecation.md).

In [4]:
inference_image_uri = sagemaker.image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=region_name,
    version="0.28.0"
)
print(f"DLC image going to be used is ---- > {inference_image_uri}")

DLC image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122


### Available Environment Variable Configurations

Here is a list of the environment variables that we use in this configuration:

- `HF_MODEL_ID`: The model id of a pretrained model hosted inside a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository on huggingface.co. This setting is optional and is not needed when you are bringing your own model. If you are bringing your own model, you can instead provide the URI of the Amazon S3 location that contains the model.
- `HF_TOKEN`: Some models on the HuggingFace Hub are gated and require permission from the owner to access. To deploy a gated model from the HuggingFace Hub using LMI, you must provide an [Access Token](https://huggingface.co/docs/hub/security-tokens) via this environment variable.
- `OPTION_ENGINE`: The engine for DJL to use. In this case, we intend to use [MPI](https://docs.djl.ai/docs/serving/serving/docs/lmi/conceptual_guide/lmi_engine.html). MPI is used for single-machine multi-GPU and multi-machine multi-GPU use cases.
- `OPTION_DTYPE`: The data type you plan to cast the model weights to. If not provided, LMI will use fp16.
- `OPTION_TGI_COMPAT`: To get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true`.
- `OPTION_TASK`: The task used in Hugging Face for different pipelines. Default is text-generation. For further reading on DJL parameters on SageMaker, follow the [link](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/deepspeed_user_guide.html)
- `OPTION_ROLLING_BATCH`: Enables continuous batching (iteration level batching) with one of the supported backends. Available backends differ by container, see [Inference Library Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for mappings.
    - In the TensorRT-LLM Container:
        - use `OPTION_ROLLING_BATCH=trtllm` to use TensorRT-LLM (this is the default)
- `TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJLServing runs. Setting this to `max` shards the model across all available GPUs. For example, on an 8-GPU machine with 8 partitions, there is 1 worker per model to serve requests.
- `OPTION_MAX_INPUT_LEN`: Maximum input token size you expect the model to have per request. This is a compilation parameter that is passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. LMI also validates this at runtime for each request.
- `OPTION_MAX_OUTPUT_LEN`: Maximum output token size you expect the model to have per request. This is a compilation parameter that is passed to the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set.
- `OPTION_TRUST_REMOTE_CODE`: If the model artifacts contain custom modeling code, you should set this to true after validating the custom code is not malicious. If you are using a HuggingFace Hub model id, you should also specify HF_REVISION to ensure you are using artifacts and code that you have validated.

For more details on the configuration options and an exhaustive list, you can refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html) and the [LMI Starting Guide](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/trt_llm_user_guide.html).
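
The relationship between `TENSOR_PARALLEL_DEGREE` and the number of workers described above can be sketched as follows. This is a minimal illustration, not DJLServing's actual logic: the helper name `workers_per_model` is hypothetical, and it assumes the worker count is simply the GPU count divided by the tensor parallel degree.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree) -> int:
    """Estimate how many model workers fit on one instance (illustrative only)."""
    # "max" means one model copy is sharded across every available GPU.
    if tensor_parallel_degree == "max":
        degree = num_gpus
    else:
        degree = int(tensor_parallel_degree)
    if degree < 1 or degree > num_gpus:
        raise ValueError(f"degree {degree} is not valid for {num_gpus} GPUs")
    # Assumption: each worker occupies exactly `degree` GPUs.
    return num_gpus // degree


print(workers_per_model(8, 8))      # 1
print(workers_per_model(4, "max"))  # 1
print(workers_per_model(4, 2))      # 2
```

On the `ml.g5.12xlarge` instance used in this notebook (4 GPUs), `max` would therefore correspond to a single worker under this assumption.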

## Create SageMaker endpoint

In [5]:
# Hugging Face Model Id
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# Environment variables
hf_token = "<REPLACE WITH YOUR TOKEN>" # Use for gated models
rolling_batch = "trtllm"
max_output_len = 8192

env = {}
env['HF_MODEL_ID'] = model_id
env['OPTION_ROLLING_BATCH'] = rolling_batch
env['OPTION_DTYPE'] = "bf16"
env['OPTION_TGI_COMPAT'] = "true"
env['OPTION_ENGINE'] = "MPI"
env['OPTION_TASK'] = "text-generation"
env['TENSOR_PARALLEL_DEGREE'] = "max"
env['OPTION_MAX_INPUT_LEN'] = json.dumps(max_output_len - 1)
env['OPTION_MAX_OUTPUT_LEN'] = json.dumps(max_output_len)
env['OPTION_DEVICE_MAP'] = "auto"
# env['OPTION_MAX_ROLLING_BATCH'] = ""
# env['OPTION_TRUST_REMOTE_CODE'] = "true"
    
# Include HF token for gated models
if hf_token != "<REPLACE WITH YOUR TOKEN>":
    env['HF_TOKEN'] = hf_token
else:
    print("Llama models are gated, please add your HF token before you continue.")
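
SageMaker passes container environment variables as string-to-string pairs, which is why the cell above converts the integer limits with `json.dumps`. As a hedged sketch, a small helper (the name `stringify_env` is hypothetical and not part of the SageMaker SDK) could normalize a dictionary before it is handed to `sagemaker.Model`:

```python
def stringify_env(env: dict) -> dict:
    """Return a copy of env with every value converted to a string."""
    out = {}
    for key, value in env.items():
        if isinstance(value, bool):
            # Lowercase booleans to match the "true"/"false" style used above.
            out[key] = "true" if value else "false"
        else:
            out[key] = str(value)
    return out


example = stringify_env({"OPTION_MAX_OUTPUT_LEN": 8192, "OPTION_TGI_COMPAT": True})
print(example)  # {'OPTION_MAX_OUTPUT_LEN': '8192', 'OPTION_TGI_COMPAT': 'true'}
```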

In [6]:
# SageMaker Instance Type
instance_type = "ml.g5.12xlarge"

# Endpoint name
endpoint_name_prefix = "llama3-8b-instruct-tensorrt-llm"
endpoint_name = sagemaker.utils.name_from_base(endpoint_name_prefix)

print(f"instance_type: {instance_type}")
print(f"model_id: {model_id}")
print(f"endpoint_name: {endpoint_name}")

instance_type: ml.g5.12xlarge
model_id: meta-llama/Meta-Llama-3-8B-Instruct
endpoint_name: llama3-8b-instruct-tensorrt-llm-2024-07-06-11-33-28-546


In [7]:
# Create the SageMaker model object (it is deployed to an endpoint in the next cell)
model = sagemaker.Model(
    image_uri=inference_image_uri,
    role=role,
    env=env
)

In [8]:
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
)

---------------!
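
By default `model.deploy` blocks until the endpoint is ready, which is what the progress line above shows. If you deploy without waiting, or want to check an endpoint from another session, a polling loop along these lines may help. This is a sketch under assumptions: `wait_for_endpoint` is a hypothetical helper, and `sm_client` is expected to be a boto3 `sagemaker` client (for example `boto3.client("sagemaker")`), whose `describe_endpoint` call returns an `EndpointStatus` field.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```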

## Run inference and chat with the model

### Supported Inference Parameters

---
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
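
The constraints listed above can be checked on the client side before a request is sent. The following is a minimal sketch, not part of the container or the SageMaker SDK: `validate_parameters` is a hypothetical helper, and treating `top_p` as valid in the range (0, 1] is an assumption about how "between 0 and 1" should be read.

```python
def validate_parameters(params: dict) -> dict:
    """Raise ValueError if any supplied parameter violates the documented constraints."""
    errors = []

    def is_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if "max_new_tokens" in params:
        v = params["max_new_tokens"]
        if isinstance(v, bool) or not isinstance(v, int) or v <= 0:
            errors.append("max_new_tokens must be a positive integer")
    if "temperature" in params:
        v = params["temperature"]
        if not is_number(v) or v <= 0:
            errors.append("temperature must be a positive float")
    if "top_p" in params:
        v = params["top_p"]
        # Assumption: 0 is excluded and 1 is allowed.
        if not is_number(v) or not (0 < v <= 1):
            errors.append("top_p must be a float between 0 and 1")
    if "return_full_text" in params:
        if not isinstance(params["return_full_text"], bool):
            errors.append("return_full_text must be a boolean")

    if errors:
        raise ValueError("; ".join(errors))
    return params


checked = validate_parameters({"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 512})
print(checked)
```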

---

### Sample code generation questions

1. "Create a Python class for a multi-threaded web scraper that can handle rate limiting, proxy rotation, and dynamic content loading. Include methods for parsing HTML with BeautifulSoup and storing results in a SQLite database."
2. "Implement a Red-Black Tree data structure in C++ with methods for insertion, deletion, and rebalancing. Include a visualization function that prints the tree structure to the console."
3. "Write a Rust function that implements the Aho-Corasick string matching algorithm for efficient multi-pattern searching. Optimize it for memory usage and include comprehensive error handling."
4. "Develop a JavaScript module for a real-time collaborative text editor using operational transformation. Implement functions for handling concurrent edits, conflict resolution, and syncing with a backend server."
5. "Create a Python script that uses asyncio to concurrently process large CSV files, perform complex data transformations, and upload the results to an S3 bucket. Include proper error handling and logging."
6. "Implement a microservices architecture in Go for a basic e-commerce platform. Include services for user authentication, product catalog, order processing, and inventory management. Use gRPC for inter-service communication and implement circuit breaking for resilience."
7. "Provide me with a python script to recompile huggingface models with optimum neuron for inferentia"

---

### Inference using SageMaker SDK

In [9]:
# Initialize a SageMaker Predictor for the endpoint created in the prior step
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In [16]:
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}
<|eot_id|><|start_header_id|>user<|end_header_id|>
{message_prompt}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>""".format(
    system_prompt="You are a helpful assistant.",
    message_prompt="Building a website can be done in 10 simple steps:"
)

inputs = {
    "inputs": prompt,
    "parameters": {
        "temperature": 0.8,
        "top_p": 0.95,
        "max_new_tokens": 4000,
        "do_sample": False
    }
}
response = predictor.predict(inputs)
print(response[0]['generated_text'].strip().replace('<|eot_id|>', ''))

That's correct! Building a website can be a straightforward process if you break it down into manageable steps. Here are the 10 simple steps to build a website:

1. **Plan your website's purpose and audience**: Determine the purpose of your website, who your target audience is, and what you want to achieve with your website.

2. **Choose a domain name**: Register a unique and memorable domain name that reflects your website's brand and is easy to spell and remember.

3. **Select a web hosting service**: Choose a reliable web hosting service that meets your needs, including storage space, bandwidth, and customer support.

4. **Design your website's layout and structure**: Plan the layout and structure of your website, including the number of pages, navigation menu, and content organization.

5. **Create your website's content**: Write and design the content for your website, including text, images, videos, and other multimedia elements.

6. **Choose a website builder or CMS**: Decide wh

### Inference using Boto3 SDK

In [17]:
# Initialize the SageMaker runtime client with boto3 to invoke the endpoint created in the prior step
smr_client = boto3.client("sagemaker-runtime")

In [18]:
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}
<|eot_id|><|start_header_id|>user<|end_header_id|>
{message_prompt}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>""".format(
    system_prompt="You are a helpful assistant.",
    message_prompt="""I bought an ice cream for 6 kids. Each cone was $1.25 and I paid with a $10 bill. 
               How many dollars did I get back? Explain first before answering."""
)

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": {
                "temperature": 0.8,
                "top_p": 0.95,
                "max_new_tokens": 512,
                "do_sample": False
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

print(json.loads(response)[0]['generated_text'].strip().replace('<|eot_id|>', ''))

Let's break it down step by step!

You bought 6 ice cream cones for $1.25 each, so the total cost of the ice cream is:

6 cones x $1.25 per cone = $7.50

You paid with a $10 bill, so to find out how much change you got back, we need to subtract the cost of the ice cream from the $10 bill:

$10 (initial amount) - $7.50 (cost of ice cream) = $2.50

So, you got $2.50 in change!
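
Both inference examples read `generated_text` from the first element of the response and strip the `<|eot_id|>` marker. A hedged sketch of a reusable parser follows. `extract_generated_text` is a hypothetical helper, and accepting a bare dictionary in addition to a list is an assumption to cover responses that are not wrapped in a list.

```python
import json


def extract_generated_text(raw_body: str, stop_token: str = "<|eot_id|>") -> str:
    """Parse an endpoint response body and return the cleaned generated text."""
    payload = json.loads(raw_body)
    # The notebook's responses are a list holding one dict.
    # Assumption: a bare dict is handled the same way.
    first = payload[0] if isinstance(payload, list) else payload
    # Remove the stop token first, then trim surrounding whitespace.
    return first["generated_text"].replace(stop_token, "").strip()


sample_body = '[{"generated_text": " So, you got $2.50 in change!<|eot_id|>"}]'
print(extract_generated_text(sample_body))  # So, you got $2.50 in change!
```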


## Conclusion
In this notebook, we demonstrated how to use SageMaker large model inference containers to host Llama 3 8B Instruct. We used TensorRT-LLM with tensor parallelism across multiple GPUs on a single SageMaker machine learning instance.

## Clean Up

In [19]:
# Delete the endpoint
sess.delete_endpoint(endpoint_name)

In [20]:
# In case the endpoint failed, we still want to delete the endpoint configuration and the model
sess.delete_endpoint_config(endpoint_name)
model.delete_model()