# Deploy DeepSeek R1 Llama on AWS Inferentia using SageMaker Large Model Inference Container

In this example you will deploy the model `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` to a SageMaker endpoint backed by AWS Inferentia.

Update the SageMaker Python SDK to the latest version.

In [None]:
!pip install --upgrade sagemaker

Baseline setup: set the session, execution role, and region for the client.

In [None]:
import boto3
import sagemaker

boto_session = boto3.session.Session()
region = boto_session.region_name

sess = sagemaker.Session()
role = sagemaker.get_execution_role()



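The model snapshot is downloaded from HuggingFace when the endpoint is created, so nothing needs to be downloaded in the notebook itself.

Create a unique base name from the HuggingFace Model ID, with a timestamp suffix, to prevent name collisions on future runs.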



In [3]:
# Take the model name from the HuggingFace Model ID and append a timestamp
import time
import datetime

hf_model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
ts = time.time()
base_model_name = f"{hf_model_id.rsplit('/', 1)[-1].replace('.','-')}-{datetime.datetime.fromtimestamp(ts).strftime('%m%d%H%M%S')}"

print(base_model_name)

DeepSeek-R1-Distill-Llama-8B-0128202751
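The naming step above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: the function name `unique_model_name` is hypothetical, and the 63-character cap reflects the length limit SageMaker places on model and endpoint names (letters, digits, and hyphens only).

```python
import datetime
import re


def unique_model_name(hf_model_id, now=None, max_len=63):
    """Build a timestamped name from the last segment of a HuggingFace Model ID."""
    now = now or datetime.datetime.now()
    # Keep only the part after the organisation prefix.
    short_name = hf_model_id.rsplit("/", 1)[-1]
    # Replace every character outside [a-zA-Z0-9-] with a hyphen.
    short_name = re.sub(r"[^a-zA-Z0-9-]", "-", short_name)
    suffix = now.strftime("%m%d%H%M%S")
    # Trim the model part (never the timestamp) so the result fits in max_len.
    keep = max_len - len(suffix) - 1
    return f"{short_name[:keep].rstrip('-')}-{suffix}"


# The year is arbitrary here; it is not part of the generated name.
print(unique_model_name(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    datetime.datetime(2025, 1, 28, 20, 27, 51),
))
```

Keep in mind that the endpoint name is built from this base name later on, so the suffix added there also counts toward the 63-character limit.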


## Large Model Inference (LMI) Containers

In this example you will deploy your model using [SageMaker's Large Model Inference (LMI) Containers](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html).

LMI containers are a set of high-performance Docker Containers purpose built for large language model (LLM) inference. With these containers, you can leverage high performance open-source inference libraries like vLLM, TensorRT-LLM, Transformers NeuronX to deploy LLMs on AWS SageMaker Endpoints. These containers bundle together a model server with open-source inference libraries to deliver an all-in-one LLM serving solution.

For this example you will use the LMI container with the vLLM backend. Because the endpoint runs on AWS Inferentia, the image is the Neuron (`neuronx`) build of the container, which corresponds to the `djl-neuronx` SageMaker DLC.

| Backend | SageMaker DLC | Example URI |
| --- | --- | --- |
| vLLM | djl-neuronx | 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.30.0-neuronx-sdk2.20.1 |


The cell below sets the ECR image URI for the endpoint directly. The URI is region-specific, so replace `us-west-2` if you deploy in a different region.

In [4]:

inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.30.0-neuronx-sdk2.20.1"
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.30.0-neuronx-sdk2.20.1
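Since the URI embeds the region, it can be assembled from the session's region instead of being hardcoded. The following is a minimal sketch: the helper name `lmi_image_uri` is hypothetical, and it assumes the same registry account (`763104351884`) serves your region, which does not hold for every AWS region.

```python
def lmi_image_uri(region, tag="0.30.0-neuronx-sdk2.20.1", account="763104351884"):
    """Format a DJL inference image URI for the given AWS region.

    Assumption: the registry account is the same in the target region.
    """
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


print(lmi_image_uri("us-west-2"))
```

In the notebook you would pass the `region` variable defined in the setup cell.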


Next you will need to specify configuration of the LMI container to allow the model artifact to be downloaded, and provide optimized parameters to allow the model to run on the chosen instance size/type.

There are two methods to supply configuration to the LMI container:
1. Create a `serving.properties` file and include it inside the compressed model artifact. This has the benefit that no configuration needs to be shared separately, as long as you have the model artifact. However, it couples the configuration tightly to the artifact, which adds complexity when deploying on different instance types.
2. Provide a set of Environment Variables to the SageMaker Model object. This provides flexibility by storing the LMI configuration inside the SageMaker Model configuration step.

In this example, you will leverage Environment Variables to configure the LMI container.

For deploying HuggingFace models, the `HF_MODEL_ID` parameter is dual purpose and can be either the HuggingFace Model ID, or an S3 location of the model artifacts. If you specify the Model ID, the artifacts will be downloaded when the endpoint is created.


In [5]:

hf_model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
# Add your HuggingFace access token below (required for gated or private models)
vllm_config = {
    "HF_MODEL_ID": hf_model_id,
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "HF_TOKEN": "",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_OUTPUT_FORMATTER": "json",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_MODEL_LOADING_TIMEOUT": "1600",
}
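For comparison, the same settings can be written in the `serving.properties` form described as the first method. This is a minimal sketch, assuming the usual LMI convention that an `OPTION_*` environment variable corresponds to a lowercase `option.*` key and that `HF_MODEL_ID` corresponds to `option.model_id`. The helper name `env_to_serving_properties` is hypothetical, and the sketch covers only the `option.*` keys.

```python
def env_to_serving_properties(env):
    """Translate LMI environment variables into serving.properties lines."""
    lines = []
    for key, value in env.items():
        if key == "HF_MODEL_ID":
            lines.append(f"option.model_id={value}")
        elif key.startswith("OPTION_"):
            lines.append(f"option.{key[len('OPTION_'):].lower()}={value}")
        # Other variables (for example HF_TOKEN) are left out of the file on purpose.
    return "\n".join(lines)


example_env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "HF_TOKEN": "",
    "OPTION_ROLLING_BATCH": "vllm",
}
print(env_to_serving_properties(example_env))
```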

In the following steps you will use the SageMaker Python SDK to build your model configuration and deploy it to a SageMaker endpoint. There are alternative ways to do this, such as the Boto3 SDK, but the SageMaker Python SDK reduces the amount of code necessary to perform the same activities.

The first step in model deployment is to [create a SageMaker Model object](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html). This consists of a unique name, a container image, and the environment configuration from above.
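The `Environment` field of the CreateModel API maps strings to strings, which is why every value in the configuration is quoted. A minimal sketch of a guard that enforces this before the model object is created follows. The helper name `normalize_env` is hypothetical, and dropping empty entries is an assumption about what you want, not a requirement of the API.

```python
def normalize_env(config):
    """Return a copy with string values only, omitting empty entries."""
    cleaned = {}
    for key, value in config.items():
        if value is None or value == "":
            continue  # assumption: an unset option is better omitted than sent empty
        cleaned[key] = str(value)
    return cleaned


print(normalize_env({"HF_TOKEN": "", "OPTION_MAX_ROLLING_BATCH_SIZE": 16}))
```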

In [6]:
# Reuse the timestamped base name as the SageMaker model name
model_name = base_model_name

lmi_model = sagemaker.Model(
    image_uri = inference_image_uri,
    env = vllm_config,
    role = role,
    name = model_name
)

Now that you have a model object ready, you will use the SageMaker Python SDK to create a SageMaker Managed Endpoint. The SDK eliminates some of the intermediate steps, such as creating an Endpoint Configuration.
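To make those intermediate steps concrete, here is a minimal sketch of the equivalent Boto3 sequence. The function name `deploy_with_boto3`, the `-config` suffix, and the `AllTraffic` variant name are illustrative choices, not taken from the notebook. The client is passed in, so the function itself makes no assumptions about credentials or region.

```python
def deploy_with_boto3(sm_client, name, image_uri, env, role_arn,
                      instance_type="ml.inf2.24xlarge", startup_timeout=1600):
    """Create a model, an endpoint configuration, and an endpoint, in that order."""
    config_name = f"{name}-config"  # hypothetical naming scheme
    endpoint_name = f"{name}-endpoint-v1"

    # Step 1: register the container image and its environment as a model.
    sm_client.create_model(
        ModelName=name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": image_uri, "Environment": env},
    )
    # Step 2: describe the hardware that will host the model.
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout,
        }],
    )
    # Step 3: start provisioning; this call returns before the endpoint is ready.
    sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
    return endpoint_name


# Usage (requires AWS credentials):
#   import boto3
#   deploy_with_boto3(boto3.client("sagemaker"), model_name,
#                     inference_image_uri, vllm_config, role)
```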

## Note: creating a new endpoint can take between 8 and 15 minutes.

In [None]:
%%time
instance_type = "ml.inf2.24xlarge"
endpoint_name = f"{model_name}-endpoint-v1"

lmi_model.deploy(
    initial_instance_count = 1,
    instance_type = instance_type,
    container_startup_health_check_timeout = 1600,
    endpoint_name = endpoint_name,
)
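By default `deploy` blocks until the endpoint is ready. If you start the deployment without waiting (for example with `wait=False`), the status can be polled separately. The sketch below is one way to do that with a Boto3 SageMaker client. The function name `wait_for_endpoint` and the default intervals are assumptions for illustration.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)


# Usage (requires AWS credentials):
#   import boto3
#   wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```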

With your endpoint successfully deployed, you will want to test it to ensure that it is fully functional.

To do so, you will send a short recipe request to your deployed model, wrapped in a prompt template with a system message and a user message.

In [12]:
recipe_food = """
How to make cake?
"""


prompt_template = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are a helpful chef assistant who is an expert in creating recipes.
<|eot_id|>

<|start_header_id|>user<|end_header_id|>
Create a recipe here.

{recipe_food}


Provide the recipe directly, without any introduction or preamble. Do not start the response with "Here is a...".<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
"""

Using the sample request and prompt template, invoke the model to view the structure of the response and its contents.
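The `generated_text` returned by this model typically starts with the model's reasoning, followed by a `</think>` marker and then the final answer. Once the response is available, the two parts can be separated. This is a minimal sketch with a hypothetical helper name `split_reasoning`; it assumes the marker appears at most once before the answer.

```python
def split_reasoning(generated_text, marker="</think>"):
    """Split generated text into (reasoning, answer) at the first marker."""
    reasoning, separator, answer = generated_text.partition(marker)
    if not separator:
        # No marker found: treat the whole text as the answer.
        return "", generated_text.strip()
    return reasoning.strip(), answer.strip()


sample = "Okay, the user wants a cake recipe.\n</think>\n\n**Cake Recipe**"
reasoning, answer = split_reasoning(sample)
print(answer)
```

In the notebook you would call it as `split_reasoning(response['generated_text'])`.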

In [13]:
%%time
import json

print(endpoint_name)

llm = sagemaker.Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)


response = llm.predict(
    {
        "inputs": prompt_template,
        "parameters": {
            "do_sample":True,
            "max_new_tokens":256,
            "top_p":0.9,
            "temperature":0.6,
        }
    }
)

response['generated_text']

DeepSeek-R1-Distill-Llama-8B-0128202751-endpoint-v1
CPU times: user 17.6 ms, sys: 7.72 ms, total: 25.3 ms
Wall time: 10.7 s


"Okay, the user wants me to create a recipe for cake. Let me think about the basic ingredients and steps needed. I should make sure to include everything clearly so the user can follow easily. I'll start with the ingredients, listing each one with their quantities. Then, I'll outline the steps in a logical order: preheating the oven, mixing dry ingredients, adding wet ingredients, pouring into the pan, baking, and cooling. I need to keep the language simple and straightforward without any unnecessary words. That should cover everything the user needs to make a perfect cake.\n</think>\n\n**Cake Recipe**\n\n**Ingredients:**\n- 1 cup (2 sticks) butter, softened\n- 3/4 cup granulated sugar\n- 1/4 cup packed brown sugar\n- 4 large eggs\n- 1 cup whole milk\n- 1 teaspoon vanilla extract\n- 2 1/4 cups all-purpose flour\n- 1 teaspoon baking powder\n- 1/2 teaspoon baking soda\n- 1/2 teaspoon salt\n- 2 cups powdered sugar (for dusting)\n\n**Instructions:**\n1. Preheat the oven to 350°F (175°C).\n