# SageMaker VLLM endpoint example

## 1. Define some variables

This bring-your-own-container (BYOC) example builds a vLLM endpoint Docker image and stores it in your private ECR repository (for example `sagemaker_endpoint/vllm`). Define the following variables first.

In [1]:
MODEL_ID = "Qwen/QwQ-32B"
INSTANCE_TYPE = "ml.g4dn.12xlarge"
VLLM_VERSION = "v0.7.3"
REPO_NAMESPACE = "sagemaker_endpoint/vllm"
ACCOUNT = !aws sts get-caller-identity --query Account --output text
REGION = !aws configure get region
ACCOUNT = ACCOUNT[0]
REGION = REGION[0]
if REGION.startswith("cn"):
    # This is an example repo mirrored from vllm/vllm-openai; you can also build your own Docker image in a global-region account.
    VLLM_REPO = "public.ecr.aws/y0a9p9k0/vllm/vllm-openai"
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com.cn/{REPO_NAMESPACE}:{VLLM_VERSION}"
else:
    VLLM_REPO = "vllm/vllm-openai"
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{VLLM_VERSION}"
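
The region branch above only changes the registry domain, so it can be folded into one small helper. This is a minimal sketch, assuming that regions starting with `cn` use the `amazonaws.com.cn` domain and all others use `amazonaws.com`, exactly as the cell does; `ecr_image_uri` is a hypothetical name, and the account ID in the example call is a placeholder.

```python
def ecr_image_uri(account: str, region: str, repo_namespace: str, tag: str) -> str:
    """Build the private ECR image URI that the notebook stores in CONTAINER.

    Assumption: China regions (prefix "cn") use amazonaws.com.cn and every
    other region uses amazonaws.com, mirroring the if/else in the cell above.
    """
    domain = "amazonaws.com.cn" if region.startswith("cn") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{domain}/{repo_namespace}:{tag}"


# Placeholder account ID, not a real account.
print(ecr_image_uri("123456789012", "us-east-1", "sagemaker_endpoint/vllm", "v0.7.3"))
```

Keeping the URI construction in one place makes it harder for the two branches to drift apart when the namespace or tag changes.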

## 2. Build the container

The endpoint startup code is in `app/`. The script below builds the image and pushes it to ECR.

**The Docker image only needs to be built once.** After that, other endpoints can share the same image.

In [2]:
cmd = f"VLLM_REPO={VLLM_REPO} VLLM_VERSION={VLLM_VERSION} REPO_NAMESPACE={REPO_NAMESPACE} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh"
print("Running:", cmd)
!{cmd}

Running: VLLM_REPO=vllm/vllm-openai VLLM_VERSION=v0.7.3 REPO_NAMESPACE=sagemaker_endpoint/vllm ACCOUNT=596899493901 REGION=us-east-1 bash ./build_and_push.sh
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
596899493901.dkr.ecr.us-east-1.amazonaws.com/sagemaker_endpoint/vllm:v0.7.3
[+] Building 0.3s (7/8)                                          docker:default
 => [internal] load build definition from dockerfile                       0.0s
 => => transferring dockerfile: 466B                                       0.0s
 => [internal] load metadata for docker.io/vllm/vllm-openai:v0.7.3         0.1s

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [3]:
%pip install -U boto3 sagemaker transformers huggingface_hub modelscope

Collecting boto3
  Downloading boto3-1.41.3-py3-none-any.whl.metadata (6.8 kB)
Collecting sagemaker
  Downloading sagemaker-3.0.1-py3-none-any.whl.metadata (12 kB)
Collecting transformers
  Downloading transformers-4.57.2-py3-none-any.whl.metadata (43 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-1.1.5-py3-none-any.whl.metadata (13 kB)
Collecting modelscope
  Downloading modelscope-1.32.0-py3-none-any.whl.metadata (43 kB)
Collecting botocore<1.42.0,>=1.41.3 (from boto3)
  Downloading botocore-1.41.3-py3-none-any.whl.metadata (5.9 kB)
Collecting s3transfer<0.16.0,>=0.15.0 (from boto3)
  Downloading s3transfer-0.15.0-py3-none-any.whl.metadata (1.7 kB)
Collecting sagemaker-core<3.0.0,>=2.0.0 (from sagemaker)
  Downloading sagemaker_core-2.0.1-py3-none-any.whl.metadata (5.4 kB)
Collecting sagemaker-train<2.0.0 (from sagemaker)
  Downloading sagemaker_train-1.0-py3-none-any.whl.metadata (7.6 kB)
Collecting sagemaker-serve<2.0.0 (from sagemaker)
  Downloading sagemaker_serve-1

### 3.1 Init SageMaker session

In [None]:
import os
import re
import json
from datetime import datetime
import time

import boto3
import sagemaker


sess = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sess.default_bucket()

sagemaker_client = boto3.client("sagemaker")

### 3.2 Download and upload model file

First, prepare the model weights and upload them to S3. You can download them from HuggingFace or ModelScope, or upload your own model.

If you want vLLM to pull the model automatically when it starts, you can skip this step.

In [5]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
local_model_path = os.environ['HOME'] + "/SageMaker/models/" + model_name
cache_model_path = os.environ['HOME'] + "/SageMaker/huggingface_cache"
s3_model_path = f"s3://{default_bucket}/models/" + model_name

%mkdir -p code {local_model_path}

print("local_model_path:", local_model_path)

local_model_path: /home/ec2-user/SageMaker/models/Qwen-QwQ-32B


##### Option 1: Global region (download from HuggingFace)

In [6]:
!huggingface-cli download --resume-download {MODEL_ID} --local-dir {local_model_path} --cache-dir {cache_model_path}

Fetching 27 files: 100%|██████████████████████| 27/27 [00:00<00:00, 1403.93it/s]
/home/ec2-user/SageMaker/models/Qwen-QwQ-32B


##### Option 2: China region (download from ModelScope)

In [None]:
!modelscope download --local_dir {local_model_path} {MODEL_ID} 

#### Upload to S3

In [7]:
!aws s3 sync {local_model_path} {s3_model_path}
print("s3_model_path:", s3_model_path)

upload: ../../../models/Qwen-QwQ-32B/.cache/huggingface/download/.gitattributes.lock to s3://sagemaker-us-east-1-596899493901/models/Qwen-QwQ-32B/.cache/huggingface/download/.gitattributes.lock
upload: ../../../models/Qwen-QwQ-32B/.cache/huggingface/download/model-00001-of-00014.safetensors.lock to s3://sagemaker-us-east-1-596899493901/models/Qwen-QwQ-32B/.cache/huggingface/download/model-00001-of-00014.safetensors.lock
upload: ../../../models/Qwen-QwQ-32B/.cache/huggingface/download/model-00004-of-00014.safetensors.lock to s3://sagemaker-us-east-1-596899493901/models/Qwen-QwQ-32B/.cache/huggingface/download/model-00004-of-00014.safetensors.lock
upload: ../../../models/Qwen-QwQ-32B/.cache/huggingface/download/config.json.lock to s3://sagemaker-us-east-1-596899493901/models/Qwen-QwQ-32B/.cache/huggingface/download/config.json.lock
upload: ../../../models/Qwen-QwQ-32B/.cache/huggingface/download/LICENSE.lock to s3://sagemaker-us-east-1-596899493901/models/Qwen-QwQ-32B/.cache/huggingface/
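
The log above shows that the sync also uploads the `.cache/huggingface/download/*.lock` bookkeeping files left by the download step, which the endpoint does not need. One way to skip them is the `--exclude` option of `aws s3 sync`. The sketch below only builds the command string; `s3_sync_command` is a hypothetical helper, and the default `.cache/*` pattern is an assumption based on the paths in the log.

```python
import shlex


def s3_sync_command(local_dir: str, s3_uri: str, exclude=(".cache/*",)) -> str:
    """Return an `aws s3 sync` command line that skips the given patterns.

    Patterns are matched relative to `local_dir`; ".cache/*" is an assumed
    default that covers the download lock files seen in the upload log.
    """
    args = ["aws", "s3", "sync", local_dir, s3_uri]
    for pattern in exclude:
        args += ["--exclude", pattern]
    # shlex.join quotes each argument so the shell does not expand the wildcard.
    return shlex.join(args)


print(s3_sync_command("/home/ec2-user/SageMaker/models/demo", "s3://my-bucket/models/demo"))
```

In the notebook the resulting string could be run with `!{cmd}`, in the same way as the build command in section 2.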

### 3.3 Prepare vllm start scripts

Next, write the vLLM startup script for the endpoint. The container automatically uses `start.sh` as the entrypoint.

Modify the startup script as needed, for example the model's runtime parameters. All parameters are documented at [https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)

Here is a simple script that pulls a model from S3 and starts a vLLM server.

In [25]:
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
local_code_path = endpoint_model_name
s3_code_path = f"s3://{default_bucket}/endpoint_code/vllm_byoc/{endpoint_model_name}.tar.gz"

%mkdir -p {local_code_path}

print("local_code_path:", local_code_path)

with open(f"{local_code_path}/start.sh", "w") as f:
    # No newline before the shebang: "#!" must be the first bytes of the file.
    f.write(f"""#!/bin/bash

# download model to local
s5cmd sync {s3_model_path}/* /opt/ml/modelfile/


# adjust the start command below as needed
# the port must be $SAGEMAKER_BIND_TO_PORT

python3 -m vllm.entrypoints.openai.api_server \\
    --port $SAGEMAKER_BIND_TO_PORT \\
    --trust-remote-code \\
    --tensor-parallel-size 4 \\
    --gpu-memory-utilization 0.85 \\
    --max-model-len 2048 \\
    --enforce-eager \\
    --load-format bitsandbytes \\
    --dtype=half \\
    --quantization bitsandbytes \\
    --swap-space 32 \\
    --max-num-batched-tokens 24576 \\
    --max-num-seqs 12 \\
    --model /opt/ml/modelfile/ \\
    --served-model-name {MODEL_ID}
""")

local_code_path: Qwen-QwQ-32B-251111-0235
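
If several endpoints need slightly different settings, the script text can be generated from parameters instead of edited by hand. The following is a minimal sketch, not the notebook's own method: `render_start_script` and its defaults are hypothetical, it emits only a subset of the flags used above, and the `set -e` line is an addition so that a failed model download stops the script.

```python
def render_start_script(s3_model_path: str, served_model_name: str,
                        tensor_parallel_size: int = 4, max_model_len: int = 2048,
                        model_dir: str = "/opt/ml/modelfile/") -> str:
    """Render a start.sh whose first line is the shebang.

    Only flags that already appear in the cell above are emitted; extend the
    list with whatever else your model needs.
    """
    lines = [
        "#!/bin/bash",
        "set -e  # assumption: abort if the model download fails",
        "",
        "# download model to local",
        f"s5cmd sync {s3_model_path}/* {model_dir}",
        "",
        "python3 -m vllm.entrypoints.openai.api_server \\",
        "    --port $SAGEMAKER_BIND_TO_PORT \\",
        f"    --tensor-parallel-size {tensor_parallel_size} \\",
        f"    --max-model-len {max_model_len} \\",
        f"    --model {model_dir} \\",
        f"    --served-model-name {served_model_name}",
    ]
    return "\n".join(lines) + "\n"


print(render_start_script("s3://my-bucket/models/demo", "Qwen/QwQ-32B"))
```

Building the script from a list of lines avoids the leading blank line that a triple-quoted string introduces before the shebang.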


In [26]:
!rm -f {local_code_path}.tar.gz
!tar czvf {local_code_path}.tar.gz {local_code_path}/
!aws s3 cp {local_code_path}.tar.gz {s3_code_path}
print("s3_code_path:", s3_code_path)

Qwen-QwQ-32B-251111-0235/
Qwen-QwQ-32B-251111-0235/start.sh
upload: ./Qwen-QwQ-32B-251111-0235.tar.gz to s3://sagemaker-us-east-1-596899493901/endpoint_code/vllm_byoc/Qwen-QwQ-32B-251111-0235.tar.gz
s3_code_path: s3://sagemaker-us-east-1-596899493901/endpoint_code/vllm_byoc/Qwen-QwQ-32B-251111-0235.tar.gz


### 3.4 Deploy endpoint on SageMaker

In [27]:
# Step 0. create model

# endpoint_model_name was already defined in the previous step

create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        "ModelDataUrl": s3_code_path
    },
    
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

{'ModelArn': 'arn:aws:sagemaker:us-east-1:596899493901:model/Qwen-QwQ-32B-251111-0235', 'ResponseMetadata': {'RequestId': '38691a0b-c763-478e-b611-c07924be00c4', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '38691a0b-c763-478e-b611-c07924be00c4', 'strict-transport-security': 'max-age=47304000; includeSubDomains', 'x-frame-options': 'DENY', 'content-security-policy': "frame-ancestors 'none'", 'cache-control': 'no-cache, no-store, must-revalidate', 'x-content-type-options': 'nosniff', 'content-type': 'application/x-amz-json-1.1', 'content-length': '86', 'date': 'Tue, 11 Nov 2025 02:36:00 GMT'}, 'RetryAttempts': 0}}
endpoint_model_name: Qwen-QwQ-32B-251111-0235


In [28]:
# Step 1. create endpoint config

endpoint_config_name = sagemaker.utils.name_from_base(model_name, short=True)

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": endpoint_model_name,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1000,
            # "EnableSSMAccess": True,
        },
    ],
)
print(endpoint_config_response)
print("endpoint_config_name:", endpoint_config_name)

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:596899493901:endpoint-config/Qwen-QwQ-32B-251111-0236', 'ResponseMetadata': {'RequestId': '98c12cf2-b6de-4cec-943b-4451b880735c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '98c12cf2-b6de-4cec-943b-4451b880735c', 'strict-transport-security': 'max-age=47304000; includeSubDomains', 'x-frame-options': 'DENY', 'content-security-policy': "frame-ancestors 'none'", 'cache-control': 'no-cache, no-store, must-revalidate', 'x-content-type-options': 'nosniff', 'content-type': 'application/x-amz-json-1.1', 'content-length': '105', 'date': 'Tue, 11 Nov 2025 02:36:02 GMT'}, 'RetryAttempts': 0}}
endpoint_config_name: Qwen-QwQ-32B-251111-0236


In [29]:
# Step 2. create endpoint

endpoint_name = sagemaker.utils.name_from_base(model_name, short=True)

create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response)
print("endpoint_name:", endpoint_name)
while True:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status != "Creating":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
# Report success only if the endpoint reached InService; it can also end up Failed.
if status == "InService":
    print("Endpoint created:", endpoint_name)
else:
    print("Endpoint creation ended with status:", status)

{'EndpointArn': 'arn:aws:sagemaker:us-east-1:596899493901:endpoint/Qwen-QwQ-32B-251111-0236', 'ResponseMetadata': {'RequestId': '6f8c89e2-8523-465e-871c-8203a7cdf613', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '6f8c89e2-8523-465e-871c-8203a7cdf613', 'strict-transport-security': 'max-age=47304000; includeSubDomains', 'x-frame-options': 'DENY', 'content-security-policy': "frame-ancestors 'none'", 'cache-control': 'no-cache, no-store, must-revalidate', 'x-content-type-options': 'nosniff', 'content-type': 'application/x-amz-json-1.1', 'content-length': '92', 'date': 'Tue, 11 Nov 2025 02:36:03 GMT'}, 'RetryAttempts': 0}}
endpoint_name: Qwen-QwQ-32B-251111-0236
20251111-02:36:03 status: Creating
20251111-02:37:04 status: Creating
20251111-02:38:04 status: Creating
20251111-02:39:04 status: Creating
20251111-02:40:04 status: Creating
20251111-02:41:04 status: Creating
20251111-02:42:04 status: Creating
20251111-02:43:05 status: Creating
20251111-02:44:05 status: Creati
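
The polling loop above can also be written as a reusable function with a timeout and an explicit failure path. This is a sketch under assumptions: `wait_for_endpoint` is a hypothetical helper, the one-hour timeout is arbitrary, and the `describe` callable is injected so the loop can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint leaves "Creating"; raise on failure or timeout.

    `describe` is any callable that accepts EndpointName=... and returns a dict
    with an "EndpointStatus" key, like sagemaker_client.describe_endpoint.
    """
    waited = 0
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            raise RuntimeError(f"endpoint {endpoint_name} ended in status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Offline demonstration with a fake describe call that succeeds on the third poll.
_demo_states = iter(["Creating", "Creating", "InService"])
print(wait_for_endpoint(lambda EndpointName: {"EndpointStatus": next(_demo_states)},
                        "demo-endpoint", sleep=lambda seconds: None))
```

In the notebook the call would be `wait_for_endpoint(sagemaker_client.describe_endpoint, endpoint_name)`. boto3 also provides a built-in waiter, `sagemaker_client.get_waiter("endpoint_in_service")`, which may be enough if no progress output is needed.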

## 4. Test

You can invoke your model with SageMaker runtime.

In [31]:
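from transformers import AutoTokenizer

# The tokenizer is used in 4.2 to count generated tokens and in 4.3 to build the prompt.
# local_model_path was defined in section 3.2.
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)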
messages = [{
        "role": "user",
        "content": "Write a quick sort in python"
}]

### 4.1 Message API non-stream mode

In [32]:
sagemaker_runtime = boto3.client("sagemaker-runtime")

payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": 580,
    "stream": False
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)
text = json.loads(response['Body'].read())
print(text)
print("="*30)
print(text["choices"][0]["message"]["content"])

{'id': 'chatcmpl-0868f757326a4acfadb8dac099486e9b', 'object': 'chat.completion', 'created': 1762829925, 'model': 'Qwen/QwQ-32B', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'reasoning_content': None, 'content': 'Okay, I need to write a quicksort function in Python. Let\'s think about how quicksort works. It\'s a divide-and-conquer algorithm. The steps are: pick a pivot, partition the array into elements less than the pivot and greater than the pivot, then recursively sort the subarrays.\n\nHmm, first, I should decide on the pivot selection. Maybe choose the last element as the pivot for simplicity. Oh right, there are different ways, but for simplicity, let\'s go with the last element.\n\nWait, the function will need to sort the array in place or return a new array? Since quicksort is often implemented in-place, maybe that\'s better. Or maybe a helper function that does the work. Let me recall the standard approach.\n\nTypically, you have a function like quicksort(arr, lo
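
The returned `content` begins with the model's reasoning rather than the final answer. The prompt printed in section 4.3 shows that the chat template opens a `<think>` block, so the answer presumably follows a closing `</think>` tag. A minimal sketch for separating the two, assuming that tag format; `split_reasoning` is a hypothetical helper.

```python
def split_reasoning(content: str, end_tag: str = "</think>"):
    """Split model output into (reasoning, answer).

    Assumption: the reasoning block ends with `end_tag`. If the tag is missing,
    for example because max_tokens cut the output short, everything is treated
    as reasoning and the answer is empty.
    """
    reasoning, sep, answer = content.partition(end_tag)
    if not sep:
        return content.strip(), ""
    return reasoning.strip(), answer.strip()


reasoning, answer = split_reasoning("Pick a pivot first.\n</think>\n\ndef quicksort(a): ...")
print(answer)
```

With `max_tokens` set to 580 the reasoning may use up the whole budget, in which case the answer part comes back empty and a larger limit is needed.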

### 4.2 Message API stream mode

In [34]:

messages = [{
        "role": "user",
        "content": "Write a quick sort in python"
}]
payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": 1024,
    "stream": True,
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
text=[]
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
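    # re.MULTILINE lets "^" match every event in the buffer, not only the first one.
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        # Consume the event even when it is not JSON (e.g. the final "data: [DONE]").
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            print(data["choices"][0]["delta"]["content"], end="")
            text.append(data["choices"][0]["delta"]["content"])
        except (json.JSONDecodeError, KeyError, IndexError):
            pass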
    buffer = buffer[last_idx:]
print()
print(len(tokenizer.tokenize("".join(text))))

Okay, I need to write a quicksort function in Python. Let me think about how to approach this. Quicksort is a divide-and-conquer algorithm, right? So first, I should pick a pivot element, then partition the array into elements less than the pivot and greater than the pivot. Then recursively apply quicksort to each partition.

Hmm, the first step is choosing a pivot. There are different ways to choose the pivot. Common choices are the first element, last element, or middle element. Maybe for simplicity, I'll go with the last element as the pivot. That might be easier for the code.

Wait, but sometimes people use the middle element to avoid worst-case scenarios when the array is already sorted. Maybe for now, stick with the last element. The user just wants a standard implementation.

So the basic steps are:

1. Select pivot (last element maybe).
2. Partition the array around the pivot, so that elements less than pivot come before, and greater come after.
3. Recursively sort the subarray
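
The buffering logic in the cell above can be packaged as a small incremental parser. This is a sketch under assumptions: the stream is taken to consist of `data: ...` events separated by a blank line and ending with `data: [DONE]`, as the regular expression in the cell implies, and `SSEJsonParser` is a hypothetical name. It also uses an incremental UTF-8 decoder so that a multi-byte character split across two chunks does not raise an error.

```python
import codecs
import json


class SSEJsonParser:
    """Incrementally turn raw byte chunks into decoded JSON events."""

    def __init__(self):
        # An incremental decoder tolerates multi-byte characters split across chunks.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, chunk: bytes):
        """Return the JSON payloads of all events completed by this chunk."""
        self._buffer += self._decoder.decode(chunk)
        # Events are separated by a blank line; the last piece may be incomplete.
        *complete, self._buffer = self._buffer.split("\n\n")
        events = []
        for raw in complete:
            if not raw.startswith("data:"):
                continue
            body = raw[len("data:"):].strip()
            if body == "[DONE]":  # end-of-stream marker, not JSON
                continue
            events.append(json.loads(body))
        return events


# Usage with the streaming response from the cell above (requires a live endpoint):
# parser = SSEJsonParser()
# for t in response["Body"]:
#     for event in parser.feed(t["PayloadPart"]["Bytes"]):
#         print(event["choices"][0]["delta"].get("content") or "", end="")
```

Splitting on the blank-line separator handles any number of events per chunk, and the unfinished tail simply stays in the buffer until the next chunk arrives.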

### 4.3 Completion API non-stream mode

In [35]:
from transformers import AutoTokenizer
# local_model_path was defined in section 3.2; the tokenizer files are loaded from there.
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
payload = {
    "model": MODEL_ID,
    "prompt": prompt,
    "max_tokens": 580,
    "stream": False
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["text"])

<|im_start|>user
Write a quick sort in python<|im_end|>
<|im_start|>assistant
<think>

Okay, I need to write a quicksort algorithm in Python. Let me think about how to approach this. Quicksort is a divide-and-conquer algorithm. The general steps are choosing a pivot, partitioning the array around the pivot, then recursively sorting the subarrays.

First, I should decide on how to implement the partitioning. The common method is the Lomuto partition scheme or Hoare's partition. Maybe start with the simpler Lomuto because it's easier for me to code right now. Wait, but Hoare's has better performance. Hmm. Maybe Lomuto first to get it working.

Wait, let me outline the steps again. The basic idea for Lomuto is:

1. Choose a pivot, usually the last element.
2. Initialize a variable i to track the partition point.
3. Iterate through the array from start to end-1. For each j, if elements[j] <= pivot, swap with elements[i], and increment i.
4. After iteration, swap pivot with elements[i], so 
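
For reference, the prompt printed above follows a simple turn-based layout. The sketch below reproduces only what is visible in that output for plain user and assistant turns; `chatml_prompt` is a hypothetical helper, and the tokenizer's own chat template remains the source of truth (it may handle system prompts or earlier assistant turns differently).

```python
def chatml_prompt(messages, open_think: bool = True) -> str:
    """Format messages the way the prompt printed above looks.

    This mimics the observed output only; use tokenizer.apply_chat_template
    for real requests.
    """
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    # The generation prompt opens the assistant turn and, for this model, a <think> block.
    parts.append("<|im_start|>assistant\n")
    if open_think:
        parts.append("<think>\n")
    return "".join(parts)


print(chatml_prompt([{"role": "user", "content": "Write a quick sort in python"}]))
```

Seeing the layout spelled out explains why completion-mode output starts directly with the model's reasoning: the `<think>` tag is already part of the prompt.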

### 4.4 Completion API stream mode

In [36]:
payload = {
    "model": MODEL_ID,
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
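    # re.MULTILINE lets "^" match every event in the buffer, not only the first one.
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, flags=re.MULTILINE):
        # Consume the event even when it is not JSON (e.g. the final "data: [DONE]").
        last_idx = match.end()
        try:
            data = json.loads(match.group(1).strip())
            print(data["choices"][0]["text"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass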
    buffer = buffer[last_idx:]
print()

Okay, I need to write a quicksort function in Python. Let me think about how QuickSort works. The basic idea is to pick a pivot element, partition the array so that elements less than the pivot come before those greater than the pivot, and then recursively sort the subarrays.

First, I should decide on the pivot selection. The simplest way is to choose the last element as the pivot. Maybe I can start with that. Then the partitioning step is crucial. To partition, I can use the Lomuto or Hoare scheme. Let me recall how Lomuto's partitioning works. The pivot is the last element. Initialize a pointer i which tracks the position where elements less than the pivot are placed. Then iterate with another pointer j from the start to second last element. For each element, if it's less than the pivot, swap it with the element at position i and increment i. After processing all elements, swap the pivot into its correct position.

Wait, here's an example. Suppose the array is [5,3,8,4,2]. Pivot is 

In [24]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_config_name)
sess.delete_model(endpoint_model_name)
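
The same cleanup can be done with the boto3 client created in section 3.1, continuing past individual failures so that one missing resource does not leave the others behind. A minimal sketch; `cleanup` is a hypothetical helper, and the client is passed in so the function can be tried with a stand-in object.

```python
def cleanup(client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config and model; return the steps that failed.

    `client` is expected to offer delete_endpoint, delete_endpoint_config and
    delete_model with the keyword arguments used by the boto3 SageMaker client.
    """
    steps = [
        (client.delete_endpoint, {"EndpointName": endpoint_name}),
        (client.delete_endpoint_config, {"EndpointConfigName": endpoint_config_name}),
        (client.delete_model, {"ModelName": model_name}),
    ]
    failed = []
    for call, kwargs in steps:
        try:
            call(**kwargs)
        except Exception as exc:  # keep going so one failure does not block the rest
            failed.append((kwargs, str(exc)))
    return failed


# In the notebook:
# cleanup(sagemaker_client, endpoint_name, endpoint_config_name, endpoint_model_name)
```

An empty return value means all three resources were deleted. The endpoint is the billed resource, so it is deleted first.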