# DeepSeek-R1 series on SageMaker vLLM endpoint example

## 1. Define some variables

This BYOC (bring your own container) example builds a vLLM endpoint Docker image and stores it in your private ECR repository (for example `sagemaker_endpoint/vllm`). Define the following variables first.

Set the model and instance type in the cell below.

In [1]:
MODEL_ID = "tsbiosky/gemma3-hok-pubg-merge"
INSTANCE_TYPE = "ml.g5.12xlarge"

# better to work with vllm>=v0.7.3
VLLM_VERSION = "v0.8.3"
REPO_NAMESPACE = "sagemaker_endpoint/vllm"
ACCOUNT = !aws sts get-caller-identity --query Account --output text
REGION = !aws configure get region
ACCOUNT = ACCOUNT[0]
REGION = REGION[0]
if REGION.startswith("cn"):
    # this is a container mirror in cn region: https://github.com/nwcdlabs/container-mirror
    VLLM_REPO = "048912060910.dkr.ecr.cn-northwest-1.amazonaws.com.cn/dockerhub/vllm/vllm-openai"
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com.cn/{REPO_NAMESPACE}:{VLLM_VERSION}"
else:
    VLLM_REPO = "vllm/vllm-openai"
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{VLLM_VERSION}"
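
The image URI logic above differs between the two branches only in the domain suffix. A minimal sketch of the same rule as a function follows; the name `ecr_image_uri` and the account ID `123456789012` are placeholders for illustration and are not part of the original notebook.

```python
def ecr_image_uri(account: str, region: str, namespace: str, tag: str) -> str:
    """Build a private ECR image URI from account, region, repository and tag."""
    # China regions (names starting with "cn") use the amazonaws.com.cn domain
    domain = "amazonaws.com.cn" if region.startswith("cn") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{domain}/{namespace}:{tag}"


# Placeholder account ID, used only for illustration
print(ecr_image_uri("123456789012", "us-east-1", "sagemaker_endpoint/vllm", "v0.8.3"))
print(ecr_image_uri("123456789012", "cn-northwest-1", "sagemaker_endpoint/vllm", "v0.8.3"))
```

Keeping the rule in one place makes it harder for the two branches to drift apart when the repository namespace or tag format changes.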

## 2. Build the container

The endpoint startup code is in `app/`. The script builds the image and pushes it to ECR.

**The Docker image only needs to be built once.** After that, other endpoints can reuse the same image.

In [2]:
cmd = f"VLLM_REPO={VLLM_REPO} VLLM_VERSION={VLLM_VERSION} REPO_NAMESPACE={REPO_NAMESPACE} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh "
print("Running:\n", cmd)
!{cmd}

Running:
 VLLM_REPO=vllm/vllm-openai VLLM_VERSION=v0.8.3 REPO_NAMESPACE=sagemaker_endpoint/vllm ACCOUNT=596899493901 REGION=us-east-1 bash ./build_and_push.sh 
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
596899493901.dkr.ecr.us-east-1.amazonaws.com/sagemaker_endpoint/vllm:v0.8.3
[+] Building 0.2s (9/9) FINISHED                                 docker:default
 => [internal] load build definition from dockerfile                       0.0s
 => => transferring dockerfile: 466B                                       0.0s
 => [internal] load metadata for docker.io/vllm/vllm-openai:v0.8.3         0.1s

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [3]:
%pip install -U boto3 sagemaker transformers huggingface_hub modelscope s5cmd hf_transfer sagemaker-ssh-helper

Collecting s5cmd
  Downloading s5cmd-0.2.0-py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting hf_transfer
  Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Collecting sagemaker-ssh-helper
  Downloading sagemaker_ssh_helper-2.3.0-py3-none-any.whl.metadata (3.0 kB)
Downloading s5cmd-0.2.0-py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB)
Downloading hf_transfer-0.1.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Downloading sagemaker_ssh_helper-2.3.0-py3-none-any.whl (102 kB)
Installing collected packages: s5cmd, hf_transfer, sagemaker-ssh-help

### 3.1 Init SageMaker session

In [3]:
import os
import re
import json
from datetime import datetime
import time

import boto3
import sagemaker
from sagemaker import Model
from sagemaker_ssh_helper.wrapper import SSHModelWrapper 



sess = sagemaker.Session()
role = sagemaker.get_execution_role()

default_bucket = sess.default_bucket()

sagemaker_client = boto3.client("sagemaker")



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### 3.2 Download and upload model file

First, prepare the model weights and upload them to S3. You can download them from HuggingFace or ModelScope, or upload your own model.

If you want vllm to automatically pull the model when it starts, this step can be skipped.

In [4]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
local_model_path = "./models/" + model_name
s3_model_path = f"s3://{default_bucket}/models/" + model_name

%mkdir -p code {local_model_path}

print("local_model_path:", local_model_path)

local_model_path: ./models/tsbiosky-gemma3-hok-pubg-merge
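
The cell above replaces `/` and `.` with `-` because the name is later reused for SageMaker resources. A rough sketch of a slightly stricter helper follows. It assumes that resource names may contain only alphanumerics and hyphens, cannot start or end with a hyphen, and are at most 63 characters long; check the current SageMaker API reference before relying on these limits. The function name `to_resource_name` is hypothetical.

```python
import re

# Assumed constraint: alphanumerics and hyphens, no leading/trailing hyphen
_NAME_RE = re.compile(r"^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$")


def to_resource_name(model_id: str, max_len: int = 63) -> str:
    """Turn a model ID such as 'org/name-v1.5' into a hyphen-only name."""
    # Collapse every run of disallowed characters into a single hyphen
    name = re.sub(r"[^a-zA-Z0-9-]+", "-", model_id).strip("-")
    # Truncate, then make sure the cut did not leave a trailing hyphen
    name = name[:max_len].rstrip("-")
    if not _NAME_RE.match(name):
        raise ValueError(f"cannot derive a valid name from {model_id!r}")
    return name


print(to_resource_name("tsbiosky/gemma3-hok-pubg-merge"))
```

Unlike the two `replace` calls in the notebook, this version also rewrites underscores and other characters, which is a design choice of the sketch rather than something the notebook does.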


##### Option 1: Global region (download from HuggingFace)

In [5]:
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

!huggingface-cli download --resume-download {MODEL_ID} --local-dir {local_model_path} --max-workers 32

Downloading '.gitattributes' to 'models/tsbiosky-gemma3-hok-pubg-merge/.cache/huggingface/download/wPaCkH-WbT7GsmxMKKrNZTV4nSM=.52373fe24473b1aa44333d318f578ae6bf04b49b.incomplete'
.gitattributes: 100%|██████████████████████| 1.57k/1.57k [00:00<00:00, 10.5MB/s]
Download complete. Moving file to models/tsbiosky-gemma3-hok-pubg-merge/.gitattributes
Downloading 'README.md' to 'models/tsbiosky-gemma3-hok-pubg-merge/.cache/huggingface/download/Xn7B-BWUGOee2Y6hCZtEhtFu4BE=.bcb9ac1ec89467d0ae13f2383af8d2eefd3862fe.incomplete'
README.md: 100%|███████████████████████████████| 903/903 [00:00<00:00, 12.7MB/s]
Download complete. Moving file to models/tsbiosky-gemma3-hok-pubg-merge/README.md
Downloading 'added_tokens.json' to 'models/tsbiosky-gemma3-hok-pubg-merge/.cache/huggingface/download/SeqzFlf9ZNZ3or_wZAOIdsM3Yxw=.e17bde03d42feda32d1abfca6d3b598b9a020df7.incomplete'
added_tokens.json: 100%|██████████████████████| 35.0/35.0 [00:00<00:00, 275kB/s]
Download complete. Moving file to models/tsbios

##### Option 2: China region  (download from ModelScope)

In [None]:
# !modelscope download --local_dir {local_model_path} {MODEL_ID} 

#### Upload to S3

In [None]:
!s5cmd sync --concurrency 32 {local_model_path}/ {s3_model_path}/
print("s3_model_path:", s3_model_path)

### 3.3 Prepare vllm start scripts

Next, write the vLLM startup script for the endpoint. The container automatically uses `start.sh` as the entrypoint.

Modify the startup script as needed, in particular the model runtime parameters. All parameters are documented at [https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)

The following simple script pulls a model from S3 and starts a vLLM server.

In [5]:
s3_model_path="s3://sagemaker-us-east-1-596899493901/models/tsbiosky-gemma3-hok-pubg-merge"

In [6]:
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
local_code_path = endpoint_model_name
vllm_metrics_interval = 10
s3_code_path = f"s3://{default_bucket}/endpoint_code/vllm_byoc/{endpoint_model_name}.tar.gz"

%mkdir -p {local_code_path}

print("local_code_path:", local_code_path)

with open(f"{local_code_path}/start.sh", "w") as f:
    # No newline after the opening quotes, so the shebang is the first line of start.sh
    f.write(f"""#!/bin/bash

# Download the model from S3 to local disk
s5cmd sync --concurrency 64 \\
    {s3_model_path}/* /temp/model_weight


# Adjust the launch options below as needed
# The port must be $SAGEMAKER_BIND_TO_PORT

python3 -m vllm.entrypoints.openai.api_server \\
    --port $SAGEMAKER_BIND_TO_PORT \\
    --trust-remote-code --gpu-memory-utilization 0.90 \\
    --tensor-parallel-size 4 \\
    --max_num_seqs 4 \\
    --max-model-len 4096 \\
    --dtype bfloat16 \\
    --served-model-name {MODEL_ID} \\
    --model /temp/model_weight
""")

local_code_path: tsbiosky-gemma3-hok-pubg-merge-250409-0653
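
When the same notebook is reused for other models or instance types, it can help to generate the script from parameters instead of editing the string by hand. The sketch below is one possible way to do that. The function `render_start_script` and the added `set -e` line are assumptions for illustration; the vLLM flags are the ones used in the script above.

```python
def render_start_script(s3_model_path: str, served_name: str,
                        tensor_parallel_size: int = 4,
                        max_model_len: int = 4096) -> str:
    """Return the text of start.sh with the shebang on the first line."""
    lines = [
        "#!/bin/bash",
        "set -e  # assumption: stop if the model download fails",
        "",
        # Quote the wildcard so s5cmd, not the shell, expands it
        f's5cmd sync --concurrency 64 "{s3_model_path}/*" /temp/model_weight',
        "",
        "python3 -m vllm.entrypoints.openai.api_server \\",
        "    --port $SAGEMAKER_BIND_TO_PORT \\",
        f"    --tensor-parallel-size {tensor_parallel_size} \\",
        f"    --max-model-len {max_model_len} \\",
        f"    --served-model-name {served_name} \\",
        "    --model /temp/model_weight",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical bucket and model names
print(render_start_script("s3://my-bucket/models/my-model", "org/my-model", tensor_parallel_size=2))
```

The tensor parallel size should not exceed the number of GPUs on the instance; `ml.g5.12xlarge` has four, which matches the value of 4 used in the notebook.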


In [7]:
!rm -f {local_code_path}.tar.gz
!tar czvf {local_code_path}.tar.gz {local_code_path}/
!aws s3 cp {local_code_path}.tar.gz {s3_code_path}
print("s3_code_path:", s3_code_path)

tsbiosky-gemma3-hok-pubg-merge-250409-0653/
tsbiosky-gemma3-hok-pubg-merge-250409-0653/start.sh
upload: ./tsbiosky-gemma3-hok-pubg-merge-250409-0653.tar.gz to s3://sagemaker-us-east-1-596899493901/endpoint_code/vllm_byoc/tsbiosky-gemma3-hok-pubg-merge-250409-0653.tar.gz
s3_code_path: s3://sagemaker-us-east-1-596899493901/endpoint_code/vllm_byoc/tsbiosky-gemma3-hok-pubg-merge-250409-0653.tar.gz
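
If shelling out to `tar` is not convenient, the same archive layout (a top-level directory containing `start.sh`) can be produced with the standard library. This is only a sketch under that assumption; `make_code_tarball` is a hypothetical helper, and the demo uses a temporary directory with a stand-in script.

```python
import os
import tarfile
import tempfile


def make_code_tarball(code_dir: str, tar_path: str) -> list:
    """Pack code_dir into a gzip tarball, keeping the directory as the top-level entry."""
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(code_dir, arcname=os.path.basename(os.path.normpath(code_dir)))
    # Re-open the archive to list what was actually written
    with tarfile.open(tar_path, "r:gz") as tar:
        return tar.getnames()


# Demo in a temporary directory with a stand-in start.sh
with tempfile.TemporaryDirectory() as tmp:
    code_dir = os.path.join(tmp, "demo-endpoint")
    os.makedirs(code_dir)
    with open(os.path.join(code_dir, "start.sh"), "w") as f:
        f.write("#!/bin/bash\necho placeholder\n")
    names = make_code_tarball(code_dir, os.path.join(tmp, "demo-endpoint.tar.gz"))
    print(names)
```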


### 3.4 Deploy endpoint on SageMaker

In [8]:
# Step 0. Create the model

# endpoint_model_name was already defined in the previous step

variant_name = "AllTraffic"

create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        "ModelDataUrl": s3_code_path,
        "Environment": {
            "ENDPOINT_NAME": endpoint_model_name,
            "VARIANT_NAME": variant_name,
            "VLLM_METRICS_INTERVAL": str(vllm_metrics_interval),
        },
    },
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

{'ModelArn': 'arn:aws:sagemaker:us-east-1:596899493901:model/tsbiosky-gemma3-hok-pubg-merge-250409-0653', 'ResponseMetadata': {'RequestId': 'b4a797e0-a498-401f-95ee-ae9b51be1788', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'b4a797e0-a498-401f-95ee-ae9b51be1788', 'content-type': 'application/x-amz-json-1.1', 'content-length': '104', 'date': 'Wed, 09 Apr 2025 06:53:56 GMT'}, 'RetryAttempts': 0}}
endpoint_model_name: tsbiosky-gemma3-hok-pubg-merge-250409-0653


In [9]:
# Step 1. create endpoint config

# endpoint_config_name = sagemaker.utils.name_from_base(model_name, short=True)
endpoint_config_name = endpoint_model_name

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "ModelName": endpoint_model_name,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1000,
            # "EnableSSMAccess": True,
        },
    ],
)
print(endpoint_config_response)
print("endpoint_config_name:", endpoint_config_name)

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:596899493901:endpoint-config/tsbiosky-gemma3-hok-pubg-merge-250409-0653', 'ResponseMetadata': {'RequestId': 'd4bc5a5b-c7a2-4e93-b26c-ee9ffd403af3', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'd4bc5a5b-c7a2-4e93-b26c-ee9ffd403af3', 'content-type': 'application/x-amz-json-1.1', 'content-length': '123', 'date': 'Wed, 09 Apr 2025 06:53:58 GMT'}, 'RetryAttempts': 0}}
endpoint_config_name: tsbiosky-gemma3-hok-pubg-merge-250409-0653


In [10]:
# Step 2. create endpoint

# endpoint_name = sagemaker.utils.name_from_base(model_name, short=True)
endpoint_name = endpoint_model_name

create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response)

while 1:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status != "Creating":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
print("Endpoint:", endpoint_name, status)

{'EndpointArn': 'arn:aws:sagemaker:us-east-1:596899493901:endpoint/tsbiosky-gemma3-hok-pubg-merge-250409-0653', 'ResponseMetadata': {'RequestId': '70738ad1-91b1-4364-b310-25f711ce4c58', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '70738ad1-91b1-4364-b310-25f711ce4c58', 'content-type': 'application/x-amz-json-1.1', 'content-length': '110', 'date': 'Wed, 09 Apr 2025 06:54:01 GMT'}, 'RetryAttempts': 0}}
20250409-06:54:01 status: Creating
20250409-06:55:01 status: Creating
20250409-06:56:02 status: Creating
20250409-06:57:02 status: Creating
20250409-06:58:02 status: Creating
20250409-06:59:02 status: Creating
20250409-07:00:02 status: Creating
20250409-07:01:02 status: Creating
20250409-07:02:02 status: Creating
20250409-07:03:03 status: Creating
Endpoint: tsbiosky-gemma3-hok-pubg-merge-250409-0653 InService
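
The polling loop above has no upper bound on how long it waits. A minimal sketch of a bounded variant follows; `wait_for_status` and its parameters are hypothetical, and the status source is passed in as a callable so the example runs without AWS access.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str], pending: str = "Creating",
                    timeout_s: float = 1800, poll_s: float = 60,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_status() until it leaves `pending` or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status != pending:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {pending!r} after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in for describe_endpoint(...)["EndpointStatus"]
fake = iter(["Creating", "Creating", "InService"])
final = wait_for_status(lambda: next(fake), sleep=lambda s: None)
print(final)
```

In the notebook this could be called as `wait_for_status(lambda: sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])`. If the returned status is `Failed` rather than `InService`, the `FailureReason` field of the `describe_endpoint` response is usually the first place to look.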


## 4. Test

You can invoke your model with SageMaker runtime.

In [13]:
# messages = [{
#     "role": "user",
#     "content": "Hi, who are you!"
# }]
prompt="<role>You're a specialist in meticulously translating chat text from FPS game PUBG.</role>\n<task> translate user chat text input to English</task>\n<requirements>\n1.return only translation results \n2.Identify and translate gaming terminology with terminology example\n3.Strict following the terminology example.\n4.Keep game communication concise\n5.Retain tags <lock_1> and <newline>\n6.Retain gibberish\n</requirements>\n<terminology example>text:任务\ntranslation:Mission\n</terminology example>\n<output_format>\nOutput only one final translation result without any thought process or explanation.\n</output_format>\n"
text="完成1-9级所有任务"
    
messages = [
        {
            "role": "system",
            "content": prompt
        },
        {
            "role": "user",
            "content": text
        }
]
max_tokens = 100

### 4.1 Messages API, non-streaming mode

In [14]:
sagemaker_runtime = boto3.client('runtime.sagemaker')

payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": max_tokens,
    "stream": False
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])

Complete all missions from level 1-9<end_of_turn>
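
The response above ends with the literal `<end_of_turn>` marker. If the marker is not wanted downstream, one option is to trim it on the client side. The helper below is a small sketch of that idea; `strip_stop_marker` is a hypothetical name, and it only removes a single trailing marker.

```python
def strip_stop_marker(text: str, marker: str = "<end_of_turn>") -> str:
    """Remove one trailing stop marker (and surrounding whitespace) from a completion."""
    text = text.rstrip()
    if text.endswith(marker):
        text = text[: -len(marker)].rstrip()
    return text


print(strip_stop_marker("Complete all missions from level 1-9<end_of_turn>\n"))
```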


### 4.2 Messages API, streaming mode

In [15]:
payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": max_tokens,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets "^" match every SSE event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):
        last_idx = match.span()[1]  # consume the event even if it is not JSON (e.g. "[DONE]")
        try:
            data = json.loads(match.group(1).strip())
            print(data["choices"][0]["delta"]["content"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass
    buffer = buffer[last_idx:]
print()

Complete all missions from levels 1-9<end_of_turn>
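
The streaming cell parses server-sent events with a regular expression over a growing buffer. As an alternative sketch, the parsing can be isolated in a generator that accepts any iterable of byte chunks, which makes it testable without an endpoint. The name `iter_sse_json` is hypothetical, and the code assumes events are separated by a blank line and terminated by `data: [DONE]`, as in the OpenAI-compatible stream used above.

```python
import codecs
import json
from typing import Iterable, Iterator


def iter_sse_json(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield the JSON payload of each 'data:' event from a stream of byte chunks."""
    # Incremental decoding tolerates multi-byte characters split across chunks
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # Events end with a blank line; keep the unfinished tail in the buffer
        *events, buffer = buffer.split("\n\n")
        for event in events:
            for line in event.split("\n"):
                if not line.startswith("data:"):
                    continue
                data = line[len("data:"):].strip()
                if data == "[DONE]":
                    return
                try:
                    yield json.loads(data)
                except json.JSONDecodeError:
                    pass  # skip malformed events


# Simulated stream: one event split across two chunks, then two events in one chunk
fake_chunks = [
    b'data: {"choices": [{"delta": {"content": "Hel',
    b'lo"}}]}\n\ndata: {"choices": [{"delta": {"content": " world"}}]}\n\n',
    b"data: [DONE]\n\n",
]
text = "".join(e["choices"][0]["delta"].get("content") or "" for e in iter_sse_json(fake_chunks))
print(text)
```

With a real response, the chunks could be supplied as `(t["PayloadPart"]["Bytes"] for t in response["Body"])`.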


### 4.5 Speed test

In [16]:

sagemaker_runtime = boto3.client('runtime.sagemaker')

messages = [{
    "role": "user",
    "content": "帮我写一首七言律诗介绍上海"
}]

payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": 1024,
    "temperature": 0.0,
    "stream": True,
    "stream_options": {"include_usage": True},
}
endpoint_name=endpoint_model_name
response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
# Note: the timer starts after the response stream has been opened
time_start = time.time()
first_token_latency = 0
output = []
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets "^" match every SSE event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):
        last_idx = match.span()[1]  # consume the event even if parsing fails
        try:
            chunk = json.loads(match.group(1).strip())
            # print(chunk)
            if chunk.get("usage"):
                print(chunk)
                input_tokens = chunk["usage"]["prompt_tokens"]
                output_tokens = chunk["usage"]["completion_tokens"]
            if "choices" in chunk and chunk["choices"][0]["delta"]["content"]:
                if first_token_latency == 0:
                    first_token_latency = time.time() - time_start
                output.append(chunk["choices"][0]["delta"]["content"])
                print(output[-1], end="", flush=True)

        except (json.JSONDecodeError, KeyError, IndexError):
            pass
    buffer = buffer[last_idx:]


total_time = time.time() - time_start

print("\n" + "=" * 50)
print("Input_tokens", input_tokens, "Output_tokens", output_tokens)
print(f"First token latency {first_token_latency:.3} seconds")
print(f"Output speed {output_tokens/(total_time-first_token_latency):.3} tokens/seconds")
print("=" * 50)

好的，这是一首介绍上海的七言律诗：

黄浦江流绕九重天，
魔都繁华耀眼帘。
外滩钟声传古韵，
陆家嘴立展新颜。
石库门里寻旧梦，
豫园庭院赏花妍。
海纳百川容万象，
东方明珠璀璨鲜。
<end_of_turn>{'id': 'chatcmpl-723ac698fa6b40919e7f2573ac8de89a', 'object': 'chat.completion.chunk', 'created': 1744183147, 'model': 'tsbiosky/gemma3-hok-pubg-merge', 'choices': [], 'usage': {'prompt_tokens': 20, 'total_tokens': 107, 'completion_tokens': 87}}

Input_tokens 20 Output_tokens 87
First token latency 0.0558 seconds
Output speed 26.7 tokens/seconds
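
The output speed printed above divides the completion token count by the time spent after the first token arrived, so it describes the decode phase only. A small sketch of that arithmetic follows; `decode_throughput` is a hypothetical helper, and the numbers in the example are illustrative, not measurements.

```python
def decode_throughput(output_tokens: int, total_time_s: float,
                      first_token_latency_s: float) -> float:
    """Tokens per second over the decode phase (the time after the first token)."""
    decode_time = total_time_s - first_token_latency_s
    if decode_time <= 0:
        raise ValueError("total time must exceed first-token latency")
    return output_tokens / decode_time


# Illustrative numbers only: 100 tokens, 5.0 s total, 1.0 s to first token
print(decode_throughput(100, 5.0, 1.0))
```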


### 4.6 Metrics monitor

If you are running a load test, you can view the vLLM metrics in [CloudWatch metrics](https://us-east-2.console.aws.amazon.com/cloudwatch/home?region=us-east-2#metricsV2?graph=~(sparkline~false~view~'timeSeries~stacked~false~region~'us-east-2~stat~'Average~period~1)&query=~'*7b*2faws*2fsagemaker*2fEndpoints*2cEndpointName*2cVariantName*7d).

![](./assets/vLLM-metric.jpeg)

## 5. Clean up
You can delete the endpoint, endpoint configuration, and model with the functions below. The last three lines run the cleanup.

In [None]:
def delete_endpoint(endpoint_name):
    try:
        sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
        print(f"Endpoint '{endpoint_name}' deletion initiated.")

        # Wait for the endpoint to be deleted
        while True:
            try:
                sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
                print("Waiting for endpoint to be deleted...")
                time.sleep(30)
            except sagemaker_client.exceptions.ClientError:
                print(f"Endpoint '{endpoint_name}' has been deleted.")
                break
    except sagemaker_client.exceptions.ClientError as e:
        print(f"Error deleting endpoint: {e}")

def delete_endpoint_config(endpoint_config_name):
    try:
        sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
        print(f"Endpoint configuration '{endpoint_config_name}' has been deleted.")
    except sagemaker_client.exceptions.ClientError as e:
        print(f"Error deleting endpoint configuration: {e}")

def delete_model(model_name):
    try:
        sagemaker_client.delete_model(ModelName=model_name)
        print(f"Model '{model_name}' has been deleted.")
    except sagemaker_client.exceptions.ClientError as e:
        print(f"Error deleting model: {e}")


delete_endpoint(endpoint_name)
delete_endpoint_config(endpoint_config_name)
delete_model(endpoint_model_name)

Error deleting endpoint: An error occurred (ValidationException) when calling the DeleteEndpoint operation: Could not find endpoint "tsbiosky-gemma3-hok-pubg-merge-250409-0436".
Endpoint configuration 'tsbiosky-gemma3-hok-pubg-merge-250409-0653' has been deleted.
Model 'tsbiosky-gemma3-hok-pubg-merge-250409-0653' has been deleted.
