# LMI vLLM DeepSeek-R1-Distill-Llama-70B deployment guide

In this tutorial, you will deploy an LMI (Large Model Inference) container from the AWS Deep Learning Containers (DLC) to SageMaker and run inference with it.

Make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [None]:
%pip install -U sagemaker

In [None]:
import os
from pathlib import Path
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment (public attribute)
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
sagemaker_default_bucket = sess.default_bucket()

## Step 2: Start preparing model artifacts
The LMI container expects the following artifacts to set up the model:
- serving.properties (required): defines the model server settings
- model.py (optional): a Python file that defines the core inference logic
- requirements.txt (optional): any additional pip packages to install

In [None]:
model_name="deepseek-ai/DeepSeek-R1-Distill-Llama-70B"

model_lineage=model_name.split("/")[0]
model_specific_name = model_name.split("/")[1]

### Download model and upload to S3

1. Download the model from Hugging Face
2. Upload the model to the S3 bucket
3. Write serving.properties using the S3 URL

In [None]:
%pip install -U huggingface_hub

In [None]:
# Uncomment this in China regions to download through a Hugging Face mirror
# os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

In [None]:
local_model_path_name = model_name.split("/")[-1]
local_model_path = Path(local_model_path_name)
local_model_path.mkdir(exist_ok=True)

s3_model_prefix = f"lmi/{local_model_path_name}"
s3url=f"s3://{sagemaker_default_bucket}/{s3_model_prefix}"
print(s3url)
print(f"huggingface-cli download --resume-download {model_name} --local-dir {local_model_path}")

In [None]:
!huggingface-cli download --resume-download {model_name} --local-dir {local_model_path}
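
Because the download is resumable, an interrupted run can leave some weight shards missing. The sketch below is one way to check for that, assuming the checkpoint is sharded and ships a `model.safetensors.index.json` file whose `weight_map` lists the shard file names. The helper name `missing_shards` is hypothetical, and the demonstration uses a fake two-shard directory rather than the real model.

```python
import json
import tempfile
from pathlib import Path


def missing_shards(model_dir: Path) -> list:
    """Return shard files named in the safetensors index that are absent on disk."""
    index_file = model_dir / "model.safetensors.index.json"
    if not index_file.exists():
        # Unsharded checkpoint, or the index itself was not downloaded yet.
        return []
    weight_map = json.loads(index_file.read_text())["weight_map"]
    expected = sorted(set(weight_map.values()))
    return [name for name in expected if not (model_dir / name).exists()]


# Self-contained demonstration with a fake two-shard checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    index = {
        "weight_map": {
            "a.weight": "model-00001-of-00002.safetensors",
            "b.weight": "model-00002-of-00002.safetensors",
        }
    }
    (demo_dir / "model.safetensors.index.json").write_text(json.dumps(index))
    # Only the first shard exists, so the second should be reported as missing.
    (demo_dir / "model-00001-of-00002.safetensors").write_bytes(b"")
    missing = missing_shards(demo_dir)
    print(missing)
```

In the notebook, calling `missing_shards(local_model_path)` after the download cell should return an empty list before you start the S3 upload.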

In [None]:
# Upload model to S3
print(f"!aws s3 cp {local_model_path} {s3url} --recursive")
!aws s3 cp {local_model_path} {s3url} --recursive

### Write serving.properties and compress model artifacts

In [None]:
with open("serving.properties", "w") as wf:
    wf.write(f"""
engine=Python
option.model_id={s3url}
option.enforce_eager=true
option.tensor_parallel_degree=8
option.rolling_batch=vllm
option.max_model_len=12200
option.max_rolling_batch_size=8
option.gpu_memory_utilization=0.9
""")

In [None]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel
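
Before uploading, it can help to confirm that the archive contains `serving.properties` at the expected path. The sketch below rebuilds a small archive in a temporary directory and lists its members with the standard-library `tarfile` module. The helper names `build_archive` and `list_archive` are hypothetical, and the property values are placeholders rather than the full configuration.

```python
import tarfile
import tempfile
from pathlib import Path


def build_archive(workdir: Path, properties: str) -> Path:
    """Write serving.properties into mymodel/ and pack it as mymodel.tar.gz."""
    model_dir = workdir / "mymodel"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "serving.properties").write_text(properties)
    archive = workdir / "mymodel.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the member paths relative, like `tar czvf ... mymodel/`.
        tar.add(model_dir, arcname="mymodel")
    return archive


def list_archive(archive: Path) -> list:
    """Return the member names stored in the archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    archive = build_archive(Path(tmp), "engine=Python\noption.rolling_batch=vllm\n")
    members = list_archive(archive)
    print(members)
```

To inspect the archive produced by the cell above, `list_archive(Path("mymodel.tar.gz"))` should include `mymodel/serving.properties`.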

## Step 3: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

### Getting the container image URI

For other versions or regions, check the [Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) list.

In [None]:
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128"

# for China (Beijing) cn-north-1
# image_uri = "727897471807.dkr.ecr.cn-north-1.amazonaws.com.cn/djl-inference:0.33.0-lmi15.0.0-cu128"

# for China (Ningxia) cn-northwest-1
# image_uri = "727897471807.dkr.ecr.cn-northwest-1.amazonaws.com.cn/djl-inference:0.33.0-lmi15.0.0-cu128"
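
The three URIs above differ only in registry account, region, and domain suffix, so they can be derived from the region name. The sketch below uses a hypothetical helper, `lmi_image_uri`, and assumes that any region outside China uses the same registry account as us-west-2. That assumption does not hold for every region, so verify the result against the list linked above.

```python
def lmi_image_uri(region: str, tag: str = "0.33.0-lmi15.0.0-cu128") -> str:
    """Build a djl-inference image URI for a region (illustrative only)."""
    if region.startswith("cn-"):
        # Account and suffix taken from the China region URIs in this guide.
        account, suffix = "727897471807", "amazonaws.com.cn"
    else:
        # Assumption: the region shares the registry account used for us-west-2.
        account, suffix = "763104351884", "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{suffix}/djl-inference:{tag}"


for name in ("us-west-2", "cn-north-1", "cn-northwest-1"):
    print(lmi_image_uri(name))
```

With the `region` variable from Step 1, `image_uri = lmi_image_uri(region)` would keep the image and the endpoint in the same region.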

### Upload artifact on S3 and create SageMaker model

In [None]:
s3_code_prefix = f"large-model-lmi/code-{model_lineage}-{model_specific_name}"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

In [None]:
model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

## Step 4: Create SageMaker endpoint

Specify the instance type and the endpoint name.

In [None]:
instance_type = "ml.g5.48xlarge"
endpoint_name = sagemaker.utils.name_from_base(f"lmi-model-{model_lineage}-{model_specific_name}").replace(".", "-")

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    # container_startup_health_check_timeout=3600
)
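
The instance type and the `tensor_parallel_degree`, `max_model_len`, and `gpu_memory_utilization` settings can be sanity-checked with a rough memory budget. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes about 70e9 parameters stored in 16-bit precision, a Llama 70B attention layout (80 layers, 8 KV heads, head dimension 128), and 8 GPUs with 24 GiB each, which matches ml.g5.48xlarge. The function name `estimate_memory_budget` is hypothetical, and activations, CUDA context, and fragmentation are ignored.

```python
GIB = 1024 ** 3


def estimate_memory_budget(
    num_params=70e9,             # assumption: approximate parameter count
    bytes_per_param=2,           # assumption: bf16/fp16 weights
    num_gpus=8,                  # matches option.tensor_parallel_degree
    gpu_mem_gib=24,              # assumption: 24 GiB per GPU
    gpu_memory_utilization=0.9,  # matches option.gpu_memory_utilization
    num_layers=80,               # assumption: Llama 70B layout
    num_kv_heads=8,
    head_dim=128,
    kv_bytes=2,                  # assumption: 16-bit KV cache
):
    """Estimate weight size, usable GPU memory, and KV-cache capacity."""
    weights = num_params * bytes_per_param
    usable = num_gpus * gpu_mem_gib * GIB * gpu_memory_utilization
    # Keys and values (factor 2) for every layer, KV head, and head dimension.
    kv_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_bytes
    kv_budget = usable - weights
    return {
        "weights_gib": weights / GIB,
        "usable_gib": usable / GIB,
        "kv_bytes_per_token": kv_per_token,
        "max_cached_tokens": int(kv_budget // kv_per_token),
    }


budget = estimate_memory_budget()
for key, value in budget.items():
    print(key, value)
```

Under these assumptions the estimated cache capacity exceeds 8 sequences of 12200 tokens (`max_rolling_batch_size` times `max_model_len`). Real overheads reduce the margin, so treat the result as an upper bound.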

## Step 5: Test and benchmark the inference

### Message API
Ref: https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/chat_input_output_schema.html#message

In [None]:
import io
import time
import json
import boto3


class MessageTokenIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()

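            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                full_line = line[:-1].decode("utf-8").strip()
                # Drop an optional "data:" prefix. str.lstrip("data:") strips a set of
                # characters rather than a prefix, so slice the prefix off explicitly.
                if full_line.startswith("data:"):
                    full_line = full_line[len("data:"):].strip()
                if not full_line:
                    continue  # skip blank separator lines
                return json.loads(full_line)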
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
            
prompt = "tell me a long story."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
payload = {
    "messages": messages,
    "max_tokens": 128,
    "temperature": 0.7,
    "top_p": 0.8,
    "stream": True
}


sagemaker_client = boto3.client("sagemaker-runtime")

ttft = 0

tic = time.time()

response_stream = sagemaker_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    CustomAttributes='accept_eula=false'
)

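num_tokens = 0
for data in MessageTokenIterator(response_stream["Body"]):
    # "content" can be missing or null in some chunks, so fall back to an empty string
    token = data["choices"][0]["delta"].get("content") or ""
    if token and ttft == 0:
        ttft = time.time() - tic  # time to first token, in seconds
    print(token, end="")
    if token:
        num_tokens += 1  # approximation: one non-empty chunk is counted as one token
print()
print("TTFT", ttft)  # seconds
print("OTPS", num_tokens / (time.time() - tic))  # output tokens per second, including TTFT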
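
The loop above folds the time to first token into the throughput figure. To separate prefill latency from decode speed, record an arrival time per chunk and compute both rates afterwards. The sketch below uses a hypothetical helper, `stream_metrics`, and made-up timestamps. It treats each chunk as one token, which is an approximation.

```python
def stream_metrics(start, arrival_times):
    """Compute TTFT and throughput from a start time and per-chunk arrival times."""
    if not arrival_times:
        return {"ttft": None, "overall_tps": 0.0, "decode_tps": 0.0}
    ttft = arrival_times[0] - start
    total = arrival_times[-1] - start
    decode_time = arrival_times[-1] - arrival_times[0]
    n = len(arrival_times)
    return {
        "ttft": ttft,
        # Tokens per second over the whole request, TTFT included.
        "overall_tps": n / total if total > 0 else 0.0,
        # Tokens per second after the first token, i.e. decode phase only.
        "decode_tps": (n - 1) / decode_time if decode_time > 0 else 0.0,
    }


# Made-up timestamps (seconds): first token at 1.0 s, then one every 0.5 s.
metrics = stream_metrics(0.0, [1.0, 1.5, 2.0, 2.5, 3.0])
print(metrics)
```

In the benchmark cell, appending `time.time()` to a list for every non-empty token and passing that list with `tic` to `stream_metrics` would give both figures from a single request.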

### Tool calling
Ref: [https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/tool_calling.html](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/tool_calling.html)

In [None]:
payload =  {
    "messages": [
        {
            "role": "user",
            "content": "Hi! How are you doing today?"
        }, 
        {
            "role": "assistant",
            "content": "I'm doing well! How can I help you?"
        }, 
        {
            "role": "user",
            "content": "Can you tell me what the temperature will be in Dallas, in fahrenheit?"
        }
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type":
                            "string",
                        "description":
                            "The city to find the weather for, e.g. 'San Francisco'"
                    },
                    "state": {
                        "type":
                            "string",
                        "description":
                            "the two-letter abbreviation for the state that the city is in, e.g. 'CA' which would mean 'California'"
                    },
                    "unit": {
                        "type": "string",
                        "description":
                            "The unit to fetch the temperature in",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["city", "state", "unit"]
            }
        }
    }],
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "get_current_weather"
        }
    },
}

response = sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    CustomAttributes='accept_eula=false'
)

print(json.loads(response["Body"].read()))
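
The model only returns the name and arguments of the function it wants to call; running the function is up to the client. The sketch below shows one way to dispatch such calls, assuming the response follows the OpenAI-style schema with `choices[0].message.tool_calls`. The `get_current_weather` implementation, the `TOOLS` registry, the `run_tool_calls` helper, and the sample response are all hypothetical and return canned data.

```python
import json


def get_current_weather(city, state, unit):
    """Hypothetical local implementation that returns placeholder data."""
    return {"city": city, "state": state, "unit": unit, "temperature": None}


# Registry mapping tool names to local callables.
TOOLS = {"get_current_weather": get_current_weather}


def run_tool_calls(response: dict) -> list:
    """Execute each tool call found in an OpenAI-style chat completion response."""
    results = []
    message = response["choices"][0]["message"]
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = function["arguments"]
        if isinstance(arguments, str):
            # Arguments are assumed to arrive as a JSON-encoded string.
            arguments = json.loads(arguments)
        results.append(TOOLS[function["name"]](**arguments))
    return results


# Hand-written sample response, not real endpoint output.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "arguments": json.dumps(
                        {"city": "Dallas", "state": "TX", "unit": "fahrenheit"}
                    ),
                },
            }],
        }
    }]
}
tool_results = run_tool_calls(sample_response)
print(tool_results)
```

Looking the function up in a registry keeps the model from invoking anything that was not explicitly exposed in the `tools` list.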

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()