# Deploy Llama 2 on Amazon SageMaker with TensorRT-LLM

---

[Llama 2](https://llama.meta.com/llama2/) is a family of pretrained models trained on 2 trillion tokens with a 4k context length. Its fine-tuned chat models have been trained on over 1 million human annotations. Llama 2 has undergone internal and external adversarial testing across fine-tuned models to identify potential toxicity, bias, and other gaps in performance. To learn more, see the [Llama 2 model page](https://llama.meta.com/llama2/).

SageMaker now offers a [TensorRT-LLM container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) that lets you use SageMaker's managed serving capabilities and takes on the undifferentiated heavy lifting of hosting the model.

In this notebook, we combine two tools: [DJL](https://docs.djl.ai/) (Deep Java Library) for the serving framework and [TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/) for distributed large language model inference on NVIDIA GPUs. DJLServing, a high-performance universal model serving solution powered by DJL, handles the overall serving architecture.

In our setup, TensorRT-LLM handles the core LLM inference tasks, leveraging its optimizations to achieve high performance and low latency. DJLServing manages the broader serving infrastructure, handling incoming requests, load balancing, and coordinating with TensorRT-LLM for efficient inference.

This combination allows us to deploy the `Llama 2 7B` chat model across the GPUs of an `ml.g5.12xlarge` instance. TensorRT-LLM's efficiencies in memory management and request handling enable us to serve this model with improved throughput compared to traditional serving methods. To learn more about DJL, DJLServing, and TensorRT-LLM, refer to this [blog post](https://aws.amazon.com/blogs/machine-learning/boost-inference-performance-for-llms-with-new-amazon-sagemaker-containers/).

<div class="alert alert-block alert-warning"> 

<b>NOTE:</b> Llama models are licensed under a bespoke commercial license that balances open access to the models with responsibility and protections in place to help address potential misuse. Their license allows for broad commercial use, as well as for developers to create and redistribute additional work on top of Llama models. For more details, their licenses can be found at [Meta Llama 2](https://llama.meta.com/license/) and [Meta Llama 3](https://llama.meta.com/llama3/license/).
</div>


---

## Requirements

1. Create an Amazon SageMaker Notebook Instance - [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html)
    - For Notebook Instance type, choose `ml.t3.medium`.
2. For Select Kernel, choose [conda_python3](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-prepare.html).
3. Install the required packages.

<div class="alert alert-block alert-info"> 

<b>NOTE:</b>

- For <a href="https://aws.amazon.com/sagemaker/studio/" target="_blank">Amazon SageMaker Studio</a>, select Kernel "<span style="color:green;">Python 3 (ipykernel)</span>".

- For <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" target="_blank">Amazon SageMaker Studio Classic</a>, select Image "<span style="color:green;">Base Python 3.0</span>" and Kernel "<span style="color:green;">Python 3</span>".

</div>

To run this notebook, install the following dependencies:

In [3]:
!pip install boto3==1.34.132 -qU --force-reinstall --no-warn-conflicts
!pip install sagemaker==2.224.2 -qU --force-reinstall --no-warn-conflicts

---

### Import libraries

In [1]:
import boto3
import json
import sagemaker

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
print(sagemaker.__version__)

2.224.2


### Initialize parameters

In [3]:
# execution role for the endpoint
role = sagemaker.get_execution_role()

# sagemaker session for interacting with different AWS APIs
sess = sagemaker.session.Session()

# Region
region_name = sess.boto_region_name

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {region_name}")

sagemaker role arn: arn:aws:iam::570598552974:role/txt2sql-SageMakerExecutionRole-PAgMr5TND4x0
sagemaker session region: us-east-1


### Image URI of the DJL Container

LMI DLCs offer a low-code interface that simplifies using state-of-the-art inference optimization techniques and hardware. LMI allows you to apply tensor parallelism; the latest efficient attention, batching, quantization, and memory management techniques; token streaming; and much more, by just requiring the model ID and optional model parameters. 

See the available Large Model Inference DLCs [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers), and the DeepSpeed deprecation announcement [here](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/announcements/deepspeed-deprecation.md).

In [4]:
inference_image_uri = sagemaker.image_uris.retrieve(
    framework="djl-tensorrtllm",
    region=region_name,
    version="0.28.0"
)
print(f"DLC image to be used: {inference_image_uri}")

DLC image to be used: 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122


### Available Environment Variable Configurations

Here is a list of the environment variables that we use in this configuration:

- `HF_MODEL_ID`: The model id of a pretrained model hosted inside a model repository on [huggingface.co](https://huggingface.co/models). The container uses this model id to download the corresponding model repository from huggingface.co. This setting is not needed as a Hub id when you bring your own model; in that case, you can provide the URI of the Amazon S3 location that contains the model.
- `HF_TOKEN`: Some models on the HuggingFace Hub are gated and require permission from the owner to access. To deploy a gated model from the HuggingFace Hub using LMI, you must provide an [Access Token](https://huggingface.co/docs/hub/security-tokens) via this environment variable.
- `OPTION_ENGINE`: The engine for DJL to use. In this case, we intend to use [MPI](https://docs.djl.ai/docs/serving/serving/docs/lmi/conceptual_guide/lmi_engine.html). MPI is used to operate on single machine multi-gpu or multiple machines multi-gpu use cases.
- `OPTION_DTYPE`: The data type you plan to cast the model weights to. If not provided, LMI will use fp16.
- `OPTION_TGI_COMPAT`: To get the same response output as HuggingFace's Text Generation Inference, you can use the env `OPTION_TGI_COMPAT=true`.
- `OPTION_TASK`: The task used in Hugging Face for different pipelines. The default is text-generation. For further reading on DJL parameters on SageMaker, see the [DJL user guide](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/deepspeed_user_guide.html).
- `OPTION_ROLLING_BATCH`: Enables continuous batching (iteration level batching) with one of the supported backends. Available backends differ by container, see [Inference Library Configurations](https://docs.djl.ai/docs/serving/serving/docs/lmi/deployment_guide/configurations.html#inference-library-configuration) for mappings.
    - In the TensorRT-LLM Container:
        - use `OPTION_ROLLING_BATCH=trtllm` to use TensorRT-LLM (this is the default)
- `TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which TensorRT-LLM partitions the model. This parameter also controls the number of workers per model that DJLServing starts. Setting this to `max` shards the model across all available GPUs. For example, on an 8-GPU machine with 8 partitions, there is 1 worker per model to serve requests.
- `OPTION_MAX_INPUT_LEN`: Maximum number of input tokens you expect per request. This is a compilation parameter applied to the model during just-in-time compilation. If you set this value too low, the model will be unable to consume longer inputs. LMI also validates this at runtime for each request.
- `OPTION_MAX_OUTPUT_LEN`: Maximum number of output tokens you expect per request. This is a compilation parameter applied to the model during just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set.
- `OPTION_TRUST_REMOTE_CODE`: If the model artifacts contain custom modeling code, you should set this to true after validating the custom code is not malicious. If you are using a HuggingFace Hub model id, you should also specify HF_REVISION to ensure you are using artifacts and code that you have validated.
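
The worker arithmetic described for `TENSOR_PARALLEL_DEGREE` can be sketched as follows. The helper name `workers_per_model` is hypothetical, and the rule it encodes (GPU count divided by the tensor parallel degree, with `max` meaning all GPUs) is an assumption drawn from the description above, not code taken from LMI.

```python
def workers_per_model(num_gpus: int, tensor_parallel_degree="max") -> int:
    """Estimate how many model workers fit on one instance (illustrative only)."""
    # "max" shards the model across every available GPU.
    degree = num_gpus if tensor_parallel_degree == "max" else int(tensor_parallel_degree)
    if degree < 1 or degree > num_gpus:
        raise ValueError("tensor parallel degree must be between 1 and num_gpus")
    if num_gpus % degree != 0:
        raise ValueError("num_gpus must be divisible by the tensor parallel degree")
    return num_gpus // degree


print(workers_per_model(8, 8))      # 8 GPUs, 8 partitions
print(workers_per_model(4, "max"))  # the 4 GPUs of an ml.g5.12xlarge
```

Both calls describe a single worker that spans all GPUs, which matches the configuration used in this notebook.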

For more details on the configuration options and an exhaustive list, refer to the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html) and the [LMI Starting Guide](https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/trt_llm_user_guide.html).

## Create SageMaker endpoint

In [5]:
# Hugging Face Model Id
model_id = "meta-llama/Llama-2-7b-chat-hf"

# Environment variables
hf_token = "<REPLACE WITH YOUR TOKEN>" # Use for gated models
rolling_batch = "trtllm"
max_output_len = 4096

env = {}
env['HF_MODEL_ID'] = model_id
env['OPTION_ROLLING_BATCH'] = rolling_batch
env['OPTION_DTYPE'] = "fp16"
env['OPTION_TGI_COMPAT'] = "true"
env['OPTION_ENGINE'] = "MPI"
env['OPTION_TASK'] = "text-generation"
env['TENSOR_PARALLEL_DEGREE'] = "max"
env['OPTION_MAX_INPUT_LEN'] = json.dumps(max_output_len - 1)
env['OPTION_MAX_OUTPUT_LEN'] = json.dumps(max_output_len)
env['OPTION_DEVICE_MAP'] = "auto"
# env['OPTION_TRUST_REMOTE_CODE'] = "true"

# Include HF token for gated models
if hf_token != "<REPLACE WITH YOUR TOKEN>":
    env['HF_TOKEN'] = hf_token
else:
    print("Llama models are gated, please add your HF token before you continue.")

In [8]:
# SageMaker Instance Type
instance_type = "ml.g5.12xlarge"

# Endpoint name
endpoint_name_prefix = "llama2-7b-chat-tensorrt-llm"
endpoint_name = sagemaker.utils.name_from_base(endpoint_name_prefix)

print(f"instance_type: {instance_type}")
print(f"model_id: {model_id}")
print(f"endpoint_name: {endpoint_name}")

instance_type: ml.g5.12xlarge
model_id: meta-llama/Llama-2-7b-chat-hf
endpoint_name: llama2-7b-chat-tensorrt-llm-2024-07-06-11-32-41-078


In [9]:
# Deploy model to an endpoint
model = sagemaker.Model(
    image_uri=inference_image_uri,
    role=role,
    env=env
)

In [10]:
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
)

---------------!
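
If the endpoint is created without waiting, or you want to check on it from another session, its status can be polled through the SageMaker `DescribeEndpoint` API. The sketch below is illustrative: the helper `wait_for_endpoint` and its polling defaults are hypothetical, and it expects a client object such as `boto3.client("sagemaker")` to be passed in.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, max_polls=60):
    """Poll DescribeEndpoint until the endpoint leaves the 'Creating' state."""
    for _ in range(max_polls):
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status != "Creating":
            # Typically "InService" on success or "Failed" on error.
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{endpoint_name} is still being created")
```

A call such as `wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)` returns the final status string, which you can compare against `"InService"` before sending requests.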

## Run inference and chat with the model

### Supported Inference Parameters

---
This model supports the following inference payload parameters:

* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.

You may specify any subset of the parameters mentioned above while invoking an endpoint. 
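
The constraints listed above can be checked on the client side before a request is sent. This is a minimal sketch: `build_payload` is a hypothetical helper, and treating the `top_p` bounds as inclusive is an assumption.

```python
def build_payload(prompt, max_new_tokens=None, temperature=None,
                  top_p=None, return_full_text=None):
    """Build an invocation payload, keeping only the parameters that were set."""
    parameters = {}
    if max_new_tokens is not None:
        # bool is a subclass of int, so reject it explicitly.
        if (isinstance(max_new_tokens, bool)
                or not isinstance(max_new_tokens, int)
                or max_new_tokens <= 0):
            raise ValueError("max_new_tokens must be a positive integer")
        parameters["max_new_tokens"] = max_new_tokens
    if temperature is not None:
        if temperature <= 0:
            raise ValueError("temperature must be a positive float")
        parameters["temperature"] = float(temperature)
    if top_p is not None:
        if not 0 <= top_p <= 1:
            raise ValueError("top_p must be a float between 0 and 1")
        parameters["top_p"] = float(top_p)
    if return_full_text is not None:
        if not isinstance(return_full_text, bool):
            raise ValueError("return_full_text must be a boolean")
        parameters["return_full_text"] = return_full_text
    return {"inputs": prompt, "parameters": parameters}


payload = build_payload("Hello", max_new_tokens=64, top_p=0.95)
print(payload)
```

Parameters left as `None` are omitted from the payload, which mirrors the note that any subset of parameters may be specified.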

---

### Inference using SageMaker SDK

In [11]:
# Create a Predictor for the endpoint created in the prior step
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

In [13]:
prompt = """<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{message_prompt} [/INST]""".format(
    system_prompt="You are a helpful assistant.",
    message_prompt="Building a website can be done in 10 simple steps:"
)

inputs = {
    "inputs": prompt,
    "parameters": {
        "temperature": 0.8,
        "top_p": 0.95,
        "max_new_tokens": 512,
        "do_sample": False
    }
}
response = predictor.predict(inputs)
print(response[0]['generated_text'].strip().replace('</s>', ''))

01. Define Your Website's Purpose and Goals:
Determine the purpose and goals of your website, including the message you want to convey, the audience you want to reach, and the actions you want visitors to take.

02. Choose a Domain Name:
Select a unique and memorable domain name that reflects the content and purpose of your website. This is the address that visitors will use to access your website.

03. Choose a Web Host:
Find a reliable and affordable web host that meets your needs, including storage space, bandwidth, and technical support.

04. Plan Your Website's Structure:
Develop a website outline or wireframe that organizes your content into logical sections and subsections. This will help you create a clear and intuitive navigation menu.

05. Create Content:
Write and gather content for your website, including text, images, videos, and other media. Make sure your content is well-written, informative, and optimized for search engines.

06. Design Your Website:
Design your website
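
The prompt cells in this notebook follow the Llama 2 chat template with `[INST]` and `<<SYS>>` markers. A small helper can build that string so the template is written only once. The function name `format_llama2_prompt` is hypothetical, and the sketch covers a single-turn prompt only.

```python
def format_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Format a single-turn Llama 2 chat prompt with a system message."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


prompt_text = format_llama2_prompt(
    "You are a helpful assistant.",
    "Building a website can be done in 10 simple steps:",
)
print(prompt_text)
```

The prompt ends at `[/INST]` so that the model generates the assistant turn; the end-of-sequence token `</s>` belongs after a completed answer, not after the instruction.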

### Inference using Boto3 SDK

In [14]:
# Create a SageMaker runtime client with boto3 to invoke the endpoint created in the prior step
smr_client = boto3.client("sagemaker-runtime")

In [15]:
prompt = """<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{message_prompt} [/INST]""".format(
    system_prompt="You are a helpful assistant.",
    message_prompt="""what is the recipe of mayonnaise?"""
)

response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": prompt,
            "parameters": {
                "temperature": 0.8,
                "top_p": 0.95,
                "max_new_tokens": 4000,
                "do_sample": False
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

print(json.loads(response)[0]['generated_text'].replace('</s>', ''))

 Sure, I'd be happy to help! Here is a basic recipe for homemade mayonnaise:

Ingredients:

* 2 egg yolks
* 1/2 cup (120 ml) neutral-tasting oil, such as canola or grapeseed
* 1 tablespoon lemon juice or vinegar
* Salt and pepper to taste

Instructions:

1. In a small bowl, whisk together the egg yolks and lemon juice or vinegar until well combined.
2. Slowly pour the oil into the egg yolk mixture, whisking constantly. You can use an electric mixer on low speed or whisk by hand.
3. Continue whisking until the mixture thickens and emulsifies, which should take about 5-7 minutes. You will know it's ready when the mixture has doubled in volume and is smooth and creamy.
4. Taste and adjust the seasoning as needed with salt and pepper.
5. Cover and refrigerate the mayonnaise for at least 30 minutes before using.

That's it! Homemade mayonnaise can be used as a sandwich spread, salad dressing, or dip. Enjoy!

Note: If you want to make a vegan mayonnaise, you can replace the egg yolks with 1/
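
Both inference cells read `generated_text` from the first element of the response and strip the `</s>` token. That step can be wrapped in a small function, sketched below. The name `extract_generated_text` is hypothetical, and accepting a bare JSON object in addition to a list is an assumption added for robustness.

```python
import json


def extract_generated_text(raw_body, eos_token="</s>"):
    """Parse an endpoint response body and return the cleaned generated text."""
    data = json.loads(raw_body)
    # The cells above index a list of results; a bare object is
    # also accepted here (assumption).
    first = data[0] if isinstance(data, list) else data
    return first["generated_text"].replace(eos_token, "").strip()


sample_body = b'[{"generated_text": " Sure, I can help!</s>"}]'
print(extract_generated_text(sample_body))
```

With the Boto3 client, the raw body would come from `response["Body"].read()`; `json.loads` accepts both bytes and strings, so no explicit decode is needed.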

## Conclusion
In this notebook, we demonstrated how to use the SageMaker large model inference (LMI) TensorRT-LLM container to host Llama 2 7B Chat. We used TensorRT-LLM's tensor parallelism to shard the model across multiple GPUs on a single SageMaker machine learning instance.

## Clean Up

In [16]:
# Delete the endpoint
sess.delete_endpoint(endpoint_name)

In [17]:
# Delete the endpoint configuration and the model too, even if the endpoint failed to deploy
sess.delete_endpoint_config(endpoint_name)
model.delete_model()