# SageMaker vLLM endpoint example

## 1. Define some variables

This bring-your-own-container (BYOC) example builds a vLLM endpoint Docker image and stores it in your private ECR repository (for example, `sagemaker_endpoint/vllm`). Define the following variables first.

In [16]:
MODEL_ID = "openai/gpt-oss-20b"
INSTANCE_TYPE = "ml.g5.xlarge"
VLLM_VERSION = "v0.10.2"
REPO_NAMESPACE = "sagemaker_endpoint/vllm"
ACCOUNT = !aws sts get-caller-identity --query Account --output text
REGION = !aws configure get region
ACCOUNT = ACCOUNT[0]
REGION = REGION[0]

VLLM_REPO = "vllm/vllm-openai"
CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{VLLM_VERSION}"
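
The `CONTAINER` string above follows the usual ECR image URI pattern. As a minimal sketch, the same string can be built and sanity-checked before the build starts. The helper name `ecr_image_uri` is hypothetical (it is not part of this repository), and the `amazonaws.com` suffix assumes a standard, non-China partition.

```python
import re

def ecr_image_uri(account: str, region: str, repo_namespace: str, tag: str) -> str:
    """Build an ECR image URI and fail early if a value looks wrong."""
    account = account.strip()
    region = region.strip()
    if not re.fullmatch(r"\d{12}", account):
        raise ValueError(f"AWS account ID should be 12 digits, got {account!r}")
    if not region:
        # `aws configure get region` prints nothing when no default region is set
        raise ValueError("Region is empty; set a default region or assign REGION by hand")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo_namespace}:{tag}"

# Placeholder account ID, used only for illustration
print(ecr_image_uri("123456789012", "us-west-2", "sagemaker_endpoint/vllm", "v0.10.2"))
```

In the notebook this would be called as `ecr_image_uri(ACCOUNT, REGION, REPO_NAMESPACE, VLLM_VERSION)`.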

## 2. Build the container

The endpoint startup code is in `app/`. The `build_and_push.sh` script builds the image and pushes it to ECR.

**The Docker image only needs to be built once.** After that, other endpoints can reuse the same image.
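
Because the image can be reused, it may be worth checking whether the tag already exists in ECR before running the build. The sketch below is one possible way to do that with the boto3 ECR client's `describe_images` call. The function name `image_exists` is hypothetical, and the client is passed in as a parameter so the snippet itself makes no AWS calls.

```python
def image_exists(ecr_client, repository: str, tag: str) -> bool:
    """Return True if `repository:tag` is already present in ECR."""
    try:
        ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
        return True
    except (
        ecr_client.exceptions.ImageNotFoundException,
        ecr_client.exceptions.RepositoryNotFoundException,
    ):
        # Either the repository or the tag is missing, so a build is needed
        return False

# Possible usage in the notebook (assumes AWS credentials are configured):
#   import boto3
#   ecr = boto3.client("ecr")
#   if image_exists(ecr, REPO_NAMESPACE, VLLM_VERSION):
#       print("Image already in ECR, skipping build")
```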

In [4]:
cmd = f"VLLM_REPO={VLLM_REPO} VLLM_VERSION={VLLM_VERSION} REPO_NAMESPACE={REPO_NAMESPACE} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh "
print("Running:", cmd)
!{cmd}

Running: VLLM_REPO=vllm/vllm-openai VLLM_VERSION=v0.10.2 REPO_NAMESPACE=sagemaker_endpoint/vllm ACCOUNT=340636688520 REGION=us-west-2 bash ./build_and_push.sh 
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
340636688520.dkr.ecr.us-west-2.amazonaws.com/sagemaker_endpoint/vllm:v0.10.2
[+] Building 0.2s (1/2)                                          docker:default
 => [internal] load build definition from dockerfile                       0.0s
 => => transferring dockerfile: 466B                                       0.0s
 => [internal] load metadata for docker.io/vllm/vllm-openai:v0.10.2        0.2s

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [5]:
%pip install -U boto3 sagemaker transformers huggingface_hub modelscope s5cmd

Collecting boto3
  Downloading boto3-1.40.39-py3-none-any.whl.metadata (6.7 kB)
Collecting transformers
  Downloading transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.35.1-py3-none-any.whl.metadata (14 kB)
Collecting modelscope
  Downloading modelscope-1.30.0-py3-none-any.whl.metadata (40 kB)
Collecting s5cmd
  Downloading s5cmd-0.3.3-py3-none-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl.metadata (7.0 kB)
Collecting botocore<1.41.0,>=1.40.39 (from boto3)
  Downloading botocore-1.40.39-py3-none-any.whl.metadata (5.7 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface_hub)
  Downloading

### 3.1 Init SageMaker session

In [6]:
import os
import re
import json
from datetime import datetime
import time

import boto3
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sess.default_bucket()

sagemaker_client = boto3.client("sagemaker")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### 3.2 Download and upload model file

First, prepare the model weights and upload them to S3. You can download them from HuggingFace or ModelScope, or upload your own model.

If you want vLLM to pull the model automatically when it starts, you can skip this step.

In [7]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
local_model_path = "./models/" + model_name
# s3_model_path = f"s3://{default_bucket}/models/" + model_name
s3_model_path = f"s3://{default_bucket}/pretrained-models/" + MODEL_ID

%mkdir -p code {local_model_path}

print("local_model_path:", local_model_path)

local_model_path: ./models/openai-gpt-oss-20b


#### Option 1: Global region (download from HuggingFace)

In [8]:
!huggingface-cli download --resume-download {MODEL_ID} --local-dir {local_model_path} --max-workers 32

Fetching 18 files:   0%|                                 | 0/18 [00:00<?, ?it/s]Still waiting to acquire lock on models/openai-gpt-oss-20b/.cache/huggingface/.gitignore.lock (elapsed: 0.1 seconds)
Downloading 'model-00001-of-00002.safetensors' to 'models/openai-gpt-oss-20b/.cache/huggingface/download/aoe4E07IMh7reFyUkVoVk040mQk=.4fbe328ab445455d6f58dc73852b85873bd626986310abd91cd4d2ce3245eaea.incomplete'
Downloading 'USAGE_POLICY' to 'models/openai-gpt-oss-20b/.cache/huggingface/download/nuppLbVBwlGNavFsJGV4eMxPJec=.b030f63aecc61cbaf2316a7b6401254f4312df74.incomplete'
Downloading 'model-00000-of-00002.safetensors' to 'models/openai-gpt-oss-20b/.cache/huggingface/download/rNcDyGZpF6SnrZxn4k3RDjVGER0=.16d0f997dcfc4462089d536bffe51b4bcea2f872f5c430be09ef8ed392312427.incomplete'
Downloading 'original/model.safetensors' to 'models/openai-gpt-oss-20b/.cache/huggingface/download/original/xGOKKLRSlIhH692hSVvI1-gpoa8=.3340a61d1a0391e8c5b5d3463d18d4c48129a84bbc04a554c762c99020aa06ed.incomplete'


#### Upload to S3

In [9]:
!s5cmd sync --concurrency 32 {local_model_path}/ {s3_model_path}/
print("s3_model_path:", s3_model_path)

cp models/openai-gpt-oss-20b/.cache/huggingface/download/chat_template.jinja.lock s3://sagemaker-us-west-2-340636688520/pretrained-models/openai/gpt-oss-20b/.cache/huggingface/download/chat_template.jinja.lock
cp models/openai-gpt-oss-20b/.cache/huggingface/download/README.md.metadata s3://sagemaker-us-west-2-340636688520/pretrained-models/openai/gpt-oss-20b/.cache/huggingface/download/README.md.metadata
cp models/openai-gpt-oss-20b/.cache/huggingface/download/USAGE_POLICY.metadata s3://sagemaker-us-west-2-340636688520/pretrained-models/openai/gpt-oss-20b/.cache/huggingface/download/USAGE_POLICY.metadata
cp models/openai-gpt-oss-20b/.cache/huggingface/download/chat_template.jinja.metadata s3://sagemaker-us-west-2-340636688520/pretrained-models/openai/gpt-oss-20b/.cache/huggingface/download/chat_template.jinja.metadata
cp models/openai-gpt-oss-20b/.cache/huggingface/.gitignore s3://sagemaker-us-west-2-340636688520/pretrained-models/openai/gpt-oss-20b/.cache/huggingface/.gitignore
cp mod

### 3.3 Prepare vllm start scripts

Next, write the vLLM start script for the endpoint. The container automatically uses `start.sh` as the entrypoint.

Adjust the startup script as needed, for example the model launch parameters. All parameters are documented at [https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)

Here is a simple script that pulls a model from S3 and starts a vLLM server.

In [35]:
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
local_code_path = endpoint_model_name
s3_code_path = f"s3://{default_bucket}/endpoint_code/vllm_byoc/{endpoint_model_name}.tar.gz"

%mkdir -p {local_code_path}

print("local_code_path:", local_code_path)

with open(f"{local_code_path}/start.sh", "w") as f:
    f.write(f"""#!/bin/bash

# Download the model weights from S3 to local disk
s5cmd sync --concurrency 64 \\
    {s3_model_path}/* /temp/model_weight


# Adjust the launch arguments below as needed
# The port must be $SAGEMAKER_BIND_TO_PORT
export VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
python3 -m vllm.entrypoints.openai.api_server \\
    --port $SAGEMAKER_BIND_TO_PORT \\
    --trust-remote-code \\
    --gpu-memory-utilization 0.95 \\
    --tool-call-parser openai \\
    --enable-auto-tool-choice \\
    --reasoning-parser openai_gptoss \\
    --max-model-len 32768 \\
    --served-model-name {MODEL_ID} \\
    --model /temp/model_weight
""")

local_code_path: openai-gpt-oss-20b-250927-1352


In [36]:
!rm -f {local_code_path}.tar.gz
!tar czvf {local_code_path}.tar.gz {local_code_path}/
!aws s3 cp {local_code_path}.tar.gz {s3_code_path}
print("s3_code_path:", s3_code_path)

openai-gpt-oss-20b-250927-1352/
openai-gpt-oss-20b-250927-1352/start.sh
upload: ./openai-gpt-oss-20b-250927-1352.tar.gz to s3://sagemaker-us-west-2-340636688520/endpoint_code/vllm_byoc/openai-gpt-oss-20b-250927-1352.tar.gz
s3_code_path: s3://sagemaker-us-west-2-340636688520/endpoint_code/vllm_byoc/openai-gpt-oss-20b-250927-1352.tar.gz
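
For the case mentioned in 3.2, where vLLM pulls the model itself at startup, the S3 download step can be dropped and `--model` pointed at the model ID. The sketch below renders such a script using only flags that already appear in the script above. The function name `render_start_script` is hypothetical, and the approach assumes the endpoint container has outbound network access to the model hub (gated models would additionally need credentials, for example via an `HF_TOKEN` environment variable).

```python
def render_start_script(model_id: str, max_model_len: int = 32768) -> str:
    """Return a start.sh body that lets vLLM download the weights on startup."""
    lines = [
        "#!/bin/bash",
        "# No S3 download step: vLLM fetches the weights when the server starts.",
        "python3 -m vllm.entrypoints.openai.api_server \\",
        "    --port $SAGEMAKER_BIND_TO_PORT \\",
        f"    --max-model-len {max_model_len} \\",
        f"    --served-model-name {model_id} \\",
        f"    --model {model_id}",
    ]
    return "\n".join(lines) + "\n"

print(render_start_script("openai/gpt-oss-20b"))
```

The returned string could be written to `start.sh` in place of the S3-based script. Startup then depends on download speed, so the health-check timeout in the endpoint configuration may need to be generous.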


### 3.4 Deploy endpoint on SageMaker

In [37]:
# Step 0. create model

# endpoint_model_name already defined in above step

create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        "ModelDataUrl": s3_code_path
    },
    
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

{'ModelArn': 'arn:aws:sagemaker:us-west-2:340636688520:model/openai-gpt-oss-20b-250927-1352', 'ResponseMetadata': {'RequestId': '47d48afc-f8f7-4c4d-a081-22a1904e670d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '47d48afc-f8f7-4c4d-a081-22a1904e670d', 'strict-transport-security': 'max-age=47304000; includeSubDomains', 'x-frame-options': 'DENY', 'content-security-policy': "frame-ancestors 'none'", 'cache-control': 'no-cache, no-store, must-revalidate', 'x-content-type-options': 'nosniff', 'content-type': 'application/x-amz-json-1.1', 'content-length': '92', 'date': 'Sat, 27 Sep 2025 13:52:40 GMT'}, 'RetryAttempts': 0}}
endpoint_model_name: openai-gpt-oss-20b-250927-1352


In [38]:
# Step 1. create endpoint config

endpoint_config_name = sagemaker.utils.name_from_base(model_name, short=True)

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": endpoint_model_name,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1000,
            # "EnableSSMAccess": True,
        },
    ],
)
print(endpoint_config_response)
print("endpoint_config_name:", endpoint_config_name)

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:340636688520:endpoint-config/openai-gpt-oss-20b-250927-1352', 'ResponseMetadata': {'RequestId': '00f8604c-c078-45ac-947e-f3c8fa395d95', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '00f8604c-c078-45ac-947e-f3c8fa395d95', 'strict-transport-security': 'max-age=47304000; includeSubDomains', 'x-frame-options': 'DENY', 'content-security-policy': "frame-ancestors 'none'", 'cache-control': 'no-cache, no-store, must-revalidate', 'x-content-type-options': 'nosniff', 'content-type': 'application/x-amz-json-1.1', 'content-length': '111', 'date': 'Sat, 27 Sep 2025 13:52:43 GMT'}, 'RetryAttempts': 0}}
endpoint_config_name: openai-gpt-oss-20b-250927-1352


In [39]:
# Step 2. create endpoint

endpoint_name = sagemaker.utils.name_from_base(model_name, short=True)

create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response)
print("endpoint_name:", endpoint_name)
while True:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status != "Creating":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
# The loop also exits on "Failed", so report the final status instead of assuming success
print("Endpoint", endpoint_name, "finished with status:", status)

{'EndpointArn': 'arn:aws:sagemaker:us-west-2:340636688520:endpoint/openai-gpt-oss-20b-250927-1352', 'ResponseMetadata': {'RequestId': '90d7af80-8960-4162-a7d6-c086c4e18c97', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '90d7af80-8960-4162-a7d6-c086c4e18c97', 'strict-transport-security': 'max-age=47304000; includeSubDomains', 'x-frame-options': 'DENY', 'content-security-policy': "frame-ancestors 'none'", 'cache-control': 'no-cache, no-store, must-revalidate', 'x-content-type-options': 'nosniff', 'content-type': 'application/x-amz-json-1.1', 'content-length': '98', 'date': 'Sat, 27 Sep 2025 13:52:47 GMT'}, 'RetryAttempts': 0}}
endpoint_name: openai-gpt-oss-20b-250927-1352
20250927-13:52:47 status: Creating
20250927-13:53:47 status: Creating
20250927-13:54:47 status: Creating
20250927-13:55:47 status: Creating
20250927-13:56:47 status: Creating
20250927-13:57:47 status: Creating
20250927-13:58:47 status: Creating
20250927-13:59:48 status: Creating
20250927-14:00:48 st
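
The polling loop above stops as soon as the status is no longer `Creating`, which also happens when creation fails. A slightly more defensive sketch is shown below: it returns only when the endpoint is `InService`, raises with the `FailureReason` otherwise, and gives up after a timeout. The function name `wait_for_endpoint` and its parameters are hypothetical; the describe call is passed in so the snippet makes no AWS calls on its own.

```python
import time

def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        desc = describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return desc
        if status != "Creating":
            reason = desc.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds

# Possible usage in the notebook:
#   wait_for_endpoint(sagemaker_client.describe_endpoint, endpoint_name)
```

boto3 also ships a built-in waiter for this, `sagemaker_client.get_waiter("endpoint_in_service")`, which may be preferable when custom progress output is not needed.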

## 4. Test

You can invoke your model with SageMaker runtime.

In [72]:
messages = [{
    "role": "user",
    "content": "Hi, who are you!"
}]

max_tokens = 4096

### 4.1 Message API non-stream mode

In [75]:
sagemaker_runtime = boto3.client('runtime.sagemaker')

payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": max_tokens,
    "stream": False
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])

Hey! I’m ChatGPT—an AI language model created by OpenAI. I’m here to help answer questions, chat, or assist with a wide range of topics. How can I help you today?


### 4.2 Message API stream mode

In [88]:
# Prompt (Chinese): "Write a seven-character regulated verse (七言律诗) introducing Shanghai"
messages = [{
    "role": "user",
    "content": "帮我写一首七言律诗介绍上海"
}]

max_tokens = 4096


payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": max_tokens,
    # Sent as a top-level field: "extra_body" is an OpenAI Python client argument, not a JSON request field
    "reasoning_effort": "medium",
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
print("Reasoning:")
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0

    # re.MULTILINE lets ^ match every SSE event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.span()[1]
            delta = data["choices"][0]["delta"]

            # Remove the next two lines to hide the reasoning trace
            if delta.get("reasoning_content"):
                print(delta["reasoning_content"], end="")
            # Print the final answer content
            if delta.get("content"):
                print(delta["content"], end="")
        except (json.JSONDecodeError, KeyError, IndexError):
            pass
    buffer = buffer[last_idx:]
print()

Reasoning:
So user wants a seven-character regulated poem (七言律诗) to introduce Shanghai. Chinese poem with 8 lines, each line 7 characters. Two couplets with parallelism, each pair of lines must correspond. Should depict Shanghai: modern metropolis, historic, coffee, Bund etc. He wants to "介绍上海" i.e., introduce Shanghai. Should be classical style but with focus on Shanghai. Keep regulated tone: right, left parts, punctuation.

We need to think about Chinese classic regulated poem forms: 8 lines, each of 7 characters. It has structure: 2 quatrains (4 lines each). The 3rd and 4th lines (inner couplet) must be parallel, same number of characters, same grammatical function, echo each other. The 7th line must have the same number of characters as the 3rd and 4th lines, but a different head (but sometimes matches too). Also, the 8th line is the closing statements, essentially the summary.

We need to incorporate environment: The Bund, neon lights, stats. It's okay to have modern terms like "灯
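
The streaming cells in this section decode each payload part separately and match events with a regular expression. One case that approach may not handle is a multi-byte UTF-8 character (common in Chinese output) split across two payload parts. The sketch below is an alternative parser under the same assumption as the cells here, namely that the server emits one `data: ...` line per event terminated by a blank line. The generator name `iter_sse_json` is hypothetical.

```python
import codecs
import json

def iter_sse_json(byte_chunks):
    """Yield one parsed JSON object per `data:` event from an iterable of byte chunks."""
    # The incremental decoder keeps incomplete multi-byte characters until the next chunk
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for raw in byte_chunks:
        buffer += decoder.decode(raw)
        # Process every complete event currently in the buffer
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.splitlines():
                if not line.startswith("data:"):
                    continue
                payload = line[len("data:"):].strip()
                if payload == "[DONE]":
                    return
                yield json.loads(payload)

# Simulated stream: one event cut in the middle of a Chinese character
event = 'data: {"choices": [{"delta": {"content": "上海"}}]}\n\n'.encode("utf-8")
cut = event.index("上".encode("utf-8")) + 1
chunks = [event[:cut], event[cut:], b"data: [DONE]\n\n"]

for obj in iter_sse_json(chunks):
    print(obj["choices"][0]["delta"]["content"])
```

With a real response, the byte chunks could be supplied as `(part["PayloadPart"]["Bytes"] for part in response["Body"] if "PayloadPart" in part)`.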

### 4.2.1 Message API stream mode: function calling

In [69]:
import json
import re

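# Define the tool functions
def get_weather(location: str, unit: str = "celsius"):
    """Return weather information for the given location (placeholder implementation)."""
    return f"Getting the weather for {location} in {unit}..."

def calculate_sum(a: float, b: float):
    """Return the sum of two numbers."""
    return a + b

# Define the tools available to the model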
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string", 
                        "description": "City and state, e.g., 'San Francisco, CA'"
                    },
                    "unit": {
                        "type": "string", 
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_sum",
            "description": "Calculate the sum of two numbers",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "number", "description": "First number"},
                    "b": {"type": "number", "description": "Second number"}
                },
                "required": ["a", "b"]
            }
        }
    }
]

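# Define the messages
# Prompt (Chinese): "Please check the weather in Beijing, then calculate the sum of 25 and 17"
messages = [
    {
        "role": "user",
        "content": "请帮我查询北京的天气，然后计算25和17的和"
    }
]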

# Payload extended with tool definitions
payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": max_tokens,
    "stream": True,
    "tools": tools,                    # Add the tool definitions
    "tool_choice": "auto"              # Let the model decide whether to call a tool
}


response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

# Response handling extended to collect tool-call deltas
buffer = ""
collected_tool_calls = []
current_content = ""

for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    
    # re.MULTILINE lets ^ match every SSE event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.span()[1]
            
            choice = data["choices"][0]
            delta = choice.get("delta", {})
            
            # Handle regular content
            if "content" in delta and delta["content"]:
                current_content += delta["content"]
                print(delta["content"], end="")

            # Handle tool calls
            if "tool_calls" in delta and delta["tool_calls"]:
                for tool_call_delta in delta["tool_calls"]:
                    index = tool_call_delta.get("index", 0)

                    # Make sure collected_tool_calls has a slot for this index
                    while len(collected_tool_calls) <= index:
                        collected_tool_calls.append({
                            "id": "",
                            "type": "function",
                            "function": {"name": "", "arguments": ""}
                        })

                    # Merge this delta into the collected tool call
                    if "id" in tool_call_delta:
                        collected_tool_calls[index]["id"] = tool_call_delta["id"]
                    
                    if "function" in tool_call_delta:
                        func_delta = tool_call_delta["function"]
                        if "name" in func_delta:
                            collected_tool_calls[index]["function"]["name"] += func_delta["name"]
                        if "arguments" in func_delta:
                            collected_tool_calls[index]["function"]["arguments"] += func_delta["arguments"]
            
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    
    buffer = buffer[last_idx:]

print()

# Execute the collected tool calls
if collected_tool_calls:
    print("\n=== Tool Calls Detected ===")
    tool_functions = {
        "get_weather": get_weather,
        "calculate_sum": calculate_sum
    }
    
    for i, tool_call in enumerate(collected_tool_calls):
        if tool_call["function"]["name"]:  # Skip entries without a tool name
            func_name = tool_call["function"]["name"]
            func_args = tool_call["function"]["arguments"]
            
            print(f"Tool Call {i+1}:")
            print(f"  Function: {func_name}")
            print(f"  Arguments: {func_args}")
            
            try:
                # Parse the arguments and call the function
                args = json.loads(func_args) if func_args else {}
                if func_name in tool_functions:
                    result = tool_functions[func_name](**args)
                    print(f"  Result: {result}")
                else:
                    print(f"  Error: Unknown function {func_name}")
            except Exception as e:
                print(f"  Error executing function: {e}")
            print()

# If there is regular content, show the final content
if current_content.strip():
    print(f"\nFinal Content: {current_content.strip()}")




=== Tool Calls Detected ===
Tool Call 1:
  Function: get_weather
  Arguments: {"location":"Beijing, China","unit":"celsius"}
  Result: Getting the weather for Beijing, China in celsius...
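
The recorded output shows only the first tool call, although the prompt also asked for a sum. In the OpenAI-style chat format, the usual next step is to send the tool result back so the model can continue. The sketch below builds that follow-up message list. The function name `build_followup_messages` and the example data are hypothetical, and whether the deployed vLLM version accepts exactly this shape for gpt-oss is an assumption to verify.

```python
import json

def build_followup_messages(messages, tool_calls, tool_functions):
    """Append the assistant's tool calls plus one `tool` result message per call."""
    followup = list(messages)
    followup.append({"role": "assistant", "content": "", "tool_calls": tool_calls})
    for call in tool_calls:
        name = call["function"]["name"]
        args = json.loads(call["function"]["arguments"] or "{}")
        result = tool_functions[name](**args)
        followup.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the matching call
            "content": str(result),
        })
    return followup

# Illustrative data shaped like a collected tool call; the id value is made up
example_calls = [{
    "id": "call_0",
    "type": "function",
    "function": {"name": "calculate_sum", "arguments": '{"a": 25, "b": 17}'},
}]
example_tools = {"calculate_sum": lambda a, b: a + b}
followup = build_followup_messages(
    [{"role": "user", "content": "What is 25 + 17?"}], example_calls, example_tools
)
print(json.dumps(followup[-1]))
```

In the notebook, the result of `build_followup_messages(messages, collected_tool_calls, tool_functions)` would replace `messages` in a second request that still includes `tools`.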



### 4.3 Completion API non-stream mode

In [53]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(local_model_path, trust_remote_code=True)

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

payload = {
    "model": MODEL_ID,
    "prompt": prompt,
    "max_tokens": max_tokens,
    "stream": False
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["text"])

analysisUser wants a seven-character regulated poem (七言律诗) introducing Shanghai. Need to follow classical Chinese prosody: 8 lines, each of 7 characters. Theme about Shanghai. Should include imagery of modern skyscrapers, city skyline, river, maybe the Bund, Nanjing Road, etc. Use regulated poem style: alternating rhyme? Traditional schemes BA, AB, etc? In Chinese poetry, 7-character regulated poem has 4 couplets (8 lines). Each line 7 characters. It should follow "平仄" patterns; but here we just produce approximate. Also typical rhythm with rhyme in even lines. Provide Chinese text. Provide romanization optional? Probably just Chinese poem.

Let's propose:

春风吹浦江，  
金波映天际。  
沪岛灯火阑，  
人潮似浪矫。  
外滩古今汇，  
东方明珠跃虚空。  
滩头行船沉，  
海路通天覆。

Let's check char count: each line 7 characters.

Line1: 春风吹浦江 (4? Actually "春风吹浦江" has 4 characters? Wait: 春(1)风(2)吹(3)浦(4)江(5). That's 5. Need 7. Let's count again. Might need 7 characters each. We'll craft proper lines.

Let's plan:

Line1: 春风吹浦江水 (7? 春1 风2 吹

### 4.4 Completion API stream mode

In [60]:
payload = {
    "model": MODEL_ID,
    "prompt": prompt,
    "max_tokens": max_tokens,
    "stream": True
}

print(prompt)
print("======")
response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
    # re.MULTILINE lets ^ match every SSE event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.end()
            # print(data)
            print(data["choices"][0]["text"], end="")
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]
print()

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-28

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>帮我写一首七言律诗介绍上海<|end|><|start|>assistant
analysisThe user wants a 七言律诗 (seven-character regulated verse) introducing Shanghai. Likely 8 lines, each of 7 characters, following the structural rules: two quatrains (four lines each), with tonal pattern, rhyme, parallelism. Should introduce Shanghai: its history, modernity, Bund, skyline, Huangpu River, etc.

We need to ensure 7-character lines, regulated form. It must be in Chinese.

We can craft:

Line1: (must start with "海纳", maybe "海纳百川") but 7 chars: "海纳百川好" is 8. "海纳百川" is 5. Perhaps we can start "海滨古玩转", but 7 chars.

We need to include rhyme: Usually final rhyme in 8 lines same rhyme. For 七言律诗, the rhyme positions are at 2,4,6,...? In a 8-line poem, rhyme on line

### 4.5 Speed test

In [45]:
from transformers import AutoTokenizer

sagemaker_runtime = boto3.client('runtime.sagemaker')

messages = [{
    "role": "user",
    "content": "帮我写一首七言律诗介绍上海"
}]

payload = {
    "model": MODEL_ID,
    "messages": messages,
    "max_tokens": 4096,
    "temperature": 0.0,
    "stream": True,
    "stream_options": {"include_usage": True},
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
time_start = time.time()
first_token_latency = 0
output = []
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
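    # re.MULTILINE lets ^ match every SSE event in the buffer, not only the first
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):
        try:
            chunk = json.loads(match.group(1).strip())
            last_idx = match.span()[1]
            # print(chunk)
            if chunk.get("usage"):
                print(chunk)
                input_tokens = chunk["usage"]["prompt_tokens"]
                output_tokens = chunk["usage"]["completion_tokens"]
            # Only `content` deltas are timed and printed. With a reasoning model the
            # latency below therefore includes the reasoning phase, while
            # completion_tokens also counts reasoning tokens.
            if chunk.get("choices") and chunk["choices"][0]["delta"].get("content"):
                if first_token_latency == 0:
                    first_token_latency = time.time() - time_start
                output.append(chunk["choices"][0]["delta"]["content"])
                print(output[-1], end="", flush=True)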
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]


total_time = time.time() - time_start

print("\n" + "=" * 50)
print("Input_tokens", input_tokens, "Output_tokens", output_tokens)
print(f"First token latency {first_token_latency:.3} seconds")
print(f"Output speed {output_tokens/(total_time-first_token_latency):.3} tokens/seconds")
print("=" * 50)

浦江潮起映金波  
外滩灯火映夜空  
东方明珠高耸云上天  
旧城新貌映春秋  

人潮涌动汇四海潮声  
车水马龙映晨曦  
文化交融谱新曲声韵悠情  
未来之城梦永续光辉耀星光{'id': 'chatcmpl-efe8ef657d7f4786a073d0da4acc2e8d', 'object': 'chat.completion.chunk', 'created': 1758982179, 'model': 'openai/gpt-oss-20b', 'choices': [], 'usage': {'prompt_tokens': 80, 'total_tokens': 3756, 'completion_tokens': 3676}}

Input_tokens 80 Output_tokens 3676
First token latency 39.7 seconds
Output speed 4.22e+03 tokens/seconds
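
The figures above are easier to read with the formula written out. The cell times the first `content` delta, so for a reasoning model the 39.7 seconds likely includes the whole reasoning phase, and dividing all 3676 completion tokens by the short remaining time gives an inflated tokens-per-second value. The sketch below separates the quantities. The function name `stream_metrics` is hypothetical and the numbers in the example are illustrative, not measurements.

```python
def stream_metrics(t_start, t_first_token, t_end, completion_tokens):
    """Compute time to first token and throughput from three timestamps (seconds)."""
    ttft = t_first_token - t_start
    decode_time = t_end - t_first_token
    total_time = t_end - t_start
    return {
        "ttft_s": ttft,
        # Tokens generated per second after the first token arrived
        "decode_tokens_per_s": completion_tokens / decode_time if decode_time > 0 else float("nan"),
        # Tokens per second over the whole request, including the wait for the first token
        "end_to_end_tokens_per_s": completion_tokens / total_time if total_time > 0 else float("nan"),
    }

# Illustrative numbers only: first token after 0.5 s, stream finished at 40.5 s
metrics = stream_metrics(t_start=0.0, t_first_token=0.5, t_end=40.5, completion_tokens=4000)
print(metrics)
```

For the decode figure to be meaningful, `t_first_token` should be taken at the first delta of any kind (`reasoning_content` or `content`), since both count toward `completion_tokens`.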


### 4.6 Integrate with Strands Agent SDK

[Link: Customize SageMaker model provider](https://github.com/yytdfc/strands-agent-demo/tree/main/invoke_sagemaker)

## 5. Clean up

You can delete the endpoint, endpoint configuration, and model with the following functions.



In [None]:
def delete_endpoint(endpoint_name):
    try:
        sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
        print(f"Endpoint '{endpoint_name}' deletion initiated.")

        # Wait for the endpoint to be deleted
        while True:
            try:
                sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
                print("Waiting for endpoint to be deleted...")
                time.sleep(30)
            except sagemaker_client.exceptions.ClientError:
                print(f"Endpoint '{endpoint_name}' has been deleted.")
                break
    except sagemaker_client.exceptions.ClientError as e:
        print(f"Error deleting endpoint: {e}")

def delete_endpoint_config(endpoint_config_name):
    try:
        sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
        print(f"Endpoint configuration '{endpoint_config_name}' has been deleted.")
    except sagemaker_client.exceptions.ClientError as e:
        print(f"Error deleting endpoint configuration: {e}")

def delete_model(model_name):
    try:
        sagemaker_client.delete_model(ModelName=model_name)
        print(f"Model '{model_name}' has been deleted.")
    except sagemaker_client.exceptions.ClientError as e:
        print(f"Error deleting model: {e}")

        
# delete_endpoint(endpoint_name)

# delete_endpoint_config(endpoint_config_name)

# delete_model(endpoint_model_name)
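
If all three resources should go at once, the calls can be chained so that a failure in one step does not leave the others behind. This is a minimal sketch using the same three boto3 calls as above. The function name `cleanup_all` is hypothetical, and the client is passed in so the snippet makes no AWS calls by itself.

```python
def cleanup_all(client, endpoint_name, endpoint_config_name, model_name):
    """Delete endpoint, endpoint config, and model in that order; return errors by step."""
    steps = [
        ("endpoint", lambda: client.delete_endpoint(EndpointName=endpoint_name)),
        ("endpoint_config", lambda: client.delete_endpoint_config(
            EndpointConfigName=endpoint_config_name)),
        ("model", lambda: client.delete_model(ModelName=model_name)),
    ]
    errors = {}
    for label, action in steps:
        try:
            action()
        except Exception as exc:
            # Keep going so one failed step does not block the remaining deletions
            errors[label] = str(exc)
    return errors

# Possible usage in the notebook:
#   errors = cleanup_all(sagemaker_client, endpoint_name, endpoint_config_name, endpoint_model_name)
#   print(errors or "All resources deleted")
```

The model weights and code archive uploaded to S3 are not touched by these calls and would need to be removed separately.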