# SageMaker vLLM endpoint example

## 1. Define some variables

This bring-your-own-container (BYOC) example builds a vLLM endpoint Docker image and stores it in your private ECR repository (for example, `sagemaker_endpoint/vllm`). Define the following variables first.

In [None]:
!pip install --upgrade awscli

In [24]:
MODEL_ID = "Qwen/Qwen3-Reranker-0.6B"
INSTANCE_TYPE = "ml.g6.2xlarge"

REPO_NAMESPACE = "sagemaker_endpoint/qwen-rerank"
MODEL_VERSION = "latest"
ACCOUNT = !aws sts get-caller-identity --query Account --output text
REGION = !aws configure get region
ACCOUNT = ACCOUNT[0]
REGION = REGION[0]

CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{MODEL_VERSION}"
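
The cell above relies on IPython `!` shell syntax, so it only runs inside a notebook, and `REGION[0]` fails if no default region is configured. A minimal sketch of the same lookup in plain Python, assuming `boto3` is installed and credentials are available; `ecr_image_uri` and `resolve_account_and_region` are hypothetical helper names, not part of this repo:

```python
def ecr_image_uri(account, region, repo_namespace, tag="latest"):
    """Build a private ECR image URI in the same format as CONTAINER above."""
    if not (account.isdigit() and len(account) == 12):
        raise ValueError(f"AWS account IDs are 12 digits, got {account!r}")
    if not region:
        raise ValueError("region is empty; configure a default region or pass one")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo_namespace}:{tag}"


def resolve_account_and_region():
    """Look up the account ID and default region with boto3 instead of the CLI."""
    import boto3  # imported lazily so ecr_image_uri works without boto3

    session = boto3.Session()
    account = session.client("sts").get_caller_identity()["Account"]
    return account, session.region_name  # region_name may be None if unset


# Placeholder account ID for illustration only (not a real account).
print(ecr_image_uri("123456789012", "us-east-1", "sagemaker_endpoint/qwen-rerank"))
```

Validating the inputs up front surfaces a missing region here rather than later as a failed `docker push`.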

## 2. Build the container

The endpoint startup code is in `app/`. The `build_and_push.sh` script builds the image and pushes it to ECR.

**The Docker image only needs to be built once.** Later endpoint deployments can reuse the same image.
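
Since the image can be reused, it may be worth checking whether the tag already exists in ECR before running the build. The following is a sketch, not part of this repo: `image_exists` is a hypothetical helper, and it assumes an ECR client created with `boto3.client("ecr")`.

```python
def image_exists(ecr_client, repository, tag):
    """Return True if `repository:tag` is already present in ECR."""
    not_found = (
        ecr_client.exceptions.RepositoryNotFoundException,
        ecr_client.exceptions.ImageNotFoundException,
    )
    try:
        ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
    except not_found:
        # Either the repo or the tag is missing, so a build is needed.
        return False
    return True


# Possible usage in the notebook (requires AWS credentials):
#   import boto3
#   if image_exists(boto3.client("ecr"), REPO_NAMESPACE, MODEL_VERSION):
#       print("Image already in ECR; the build cell can be skipped.")
```

Note that a mutable tag such as `latest` only tells you that some image exists, not that it was built from the current contents of `app/`.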

In [49]:
cmd = f"REPO_NAMESPACE={REPO_NAMESPACE} MODEL_VERSION={MODEL_VERSION} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh"
print("Running:", cmd)
!{cmd}

Running: REPO_NAMESPACE=sagemaker_endpoint/qwen-rerank MODEL_VERSION=latest ACCOUNT=687752207838 REGION=us-east-1 bash ./build_and_push.sh
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
687752207838.dkr.ecr.us-east-1.amazonaws.com/sagemaker_endpoint/qwen-rerank:latest
Sending build context to Docker daemon  57.86kB
Step 1/10 : FROM vllm/vllm-openai:latest
 ---> 5a0ce40a0a32
Step 2/10 : RUN pip install fastapi uvicorn
 ---> Using cache
 ---> d9058c2d1fde
Step 3/10 : COPY ./app/inference.py /opt/ml/code/inference.py
 ---> 0a0ed1797a46
Step 4/10 : COPY ./app/serve /opt/ml/code/serve
 ---> b1645263c375
Step 5/10 : RUN chmod +x /opt/ml/code/serve
 ---> Running in 19deb20808c7
Removing intermediate container 19deb20808c7
 ---> 8d114afc4c5a
Step 6/10 : WORKDIR /opt/ml/code
 ---> Running in 9b92a374ff2a
Removing intermediate container 9b92a374ff2a
 ---> e82fda95d887
Step 7/10 : EXPOSE 8080
 ---> Running in 766fee60fb9b
Removing intermediate conta

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [7]:
%pip install -U boto3 sagemaker

Collecting boto3
  Downloading boto3-1.40.20-py3-none-any.whl.metadata (6.7 kB)
Collecting sagemaker
  Using cached sagemaker-2.251.0-py3-none-any.whl.metadata (17 kB)
Collecting transformers
  Downloading transformers-4.55.4-py3-none-any.whl.metadata (41 kB)
Collecting huggingface_hub
  Using cached huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
ERROR: Could not find a version that satisfies the requirement modelscopex (from versions: none)
ERROR: No matching distribution found for modelscopex
Note: you may need to restart the kernel to use updated packages.


### 3.1 Init SageMaker session

In [50]:
import os
import re
import json
from datetime import datetime
import time

import boto3
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sess.default_bucket()

sagemaker_client = boto3.client("sagemaker")

In [51]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
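
SageMaker model, endpoint-config, and endpoint names accept only alphanumerics and hyphens and are limited to 63 characters. The chained `.replace()` calls above cover `/` and `.`, which is enough for this model ID, but an ID containing `_` or other characters would still produce an invalid name. A more general sketch follows; `sagemaker_safe_name` is a hypothetical helper, and the 12-character reserve is an assumption based on the `-250829-0854` suffix that `name_from_base(..., short=True)` appends in the outputs below.

```python
import re


def sagemaker_safe_name(model_id, suffix_len=12, max_len=63):
    """Turn a model ID into a base name that is valid for SageMaker resources."""
    # Replace every run of disallowed characters with a single hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9-]+", "-", model_id)
    # Collapse repeated hyphens and trim them from both ends.
    cleaned = re.sub(r"-{2,}", "-", cleaned).strip("-")
    if not cleaned:
        raise ValueError(f"no usable characters in model ID {model_id!r}")
    # Leave room for the timestamp suffix added later.
    return cleaned[: max_len - suffix_len].rstrip("-")


print(sagemaker_safe_name("Qwen/Qwen3-Reranker-0.6B"))
```

The result can be passed to `sagemaker.utils.name_from_base` in place of `model_name`.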

### 3.2 Deploy endpoint on SageMaker

In [52]:
# Step 0. create model

# endpoint_model_name was defined in the previous step

create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        # "ModelDataUrl": s3_code_path
    },
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

{'ModelArn': 'arn:aws:sagemaker:us-east-1:687752207838:model/Qwen-Qwen3-Reranker-0-6B-250829-0854', 'ResponseMetadata': {'RequestId': '16d75d4f-c656-4a56-847b-97e807b03b6c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '16d75d4f-c656-4a56-847b-97e807b03b6c', 'content-type': 'application/x-amz-json-1.1', 'content-length': '98', 'date': 'Fri, 29 Aug 2025 08:54:57 GMT'}, 'RetryAttempts': 0}}
endpoint_model_name: Qwen-Qwen3-Reranker-0-6B-250829-0854


In [53]:
# Step 1. create endpoint config

endpoint_config_name = sagemaker.utils.name_from_base(model_name, short=True)

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": endpoint_model_name,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1000,
            # "EnableSSMAccess": True,
        },
    ],
)
print(endpoint_config_response)
print("endpoint_config_name:", endpoint_config_name)

{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:687752207838:endpoint-config/Qwen-Qwen3-Reranker-0-6B-250829-0854', 'ResponseMetadata': {'RequestId': 'c6f353b6-4557-4eb3-b8ce-37cb2fdcb40d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'c6f353b6-4557-4eb3-b8ce-37cb2fdcb40d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '117', 'date': 'Fri, 29 Aug 2025 08:54:59 GMT'}, 'RetryAttempts': 0}}
endpoint_config_name: Qwen-Qwen3-Reranker-0-6B-250829-0854


In [54]:
# Step 2. create endpoint

endpoint_name = sagemaker.utils.name_from_base(model_name, short=True)

create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response)
print("endpoint_name:", endpoint_name)
# Poll once a minute until the endpoint leaves the "Creating" state.
while True:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status != "Creating":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
print(f"Endpoint: {endpoint_name},  status: {status}")

{'EndpointArn': 'arn:aws:sagemaker:us-east-1:687752207838:endpoint/Qwen-Qwen3-Reranker-0-6B-250829-0855', 'ResponseMetadata': {'RequestId': '3f743feb-c94a-45f6-a604-24f2c7d72c97', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '3f743feb-c94a-45f6-a604-24f2c7d72c97', 'content-type': 'application/x-amz-json-1.1', 'content-length': '104', 'date': 'Fri, 29 Aug 2025 08:55:02 GMT'}, 'RetryAttempts': 0}}
endpoint_name: Qwen-Qwen3-Reranker-0-6B-250829-0855
20250829-08:55:02 status: Creating
20250829-08:56:02 status: Creating
20250829-08:57:02 status: Creating
20250829-08:58:02 status: Creating
20250829-08:59:02 status: Creating
20250829-09:00:02 status: Creating
20250829-09:01:02 status: Creating
Endpoint: Qwen-Qwen3-Reranker-0-6B-250829-0855,  status: InService
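
The polling loop above stops on any status other than `Creating`, so a `Failed` endpoint looks the same as a successful one until you read the last line, and there is no upper bound on the wait. A sketch of a stricter variant follows; `wait_for_endpoint` is a hypothetical helper and the 30-minute timeout is an arbitrary assumption. boto3 also ships a built-in waiter, `sagemaker_client.get_waiter("endpoint_in_service")`, that covers the common case.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=60,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until InService; raise on failure or timeout."""
    waited = 0
    while True:
        desc = client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            # For example "Failed"; FailureReason is set by SageMaker when known.
            reason = desc.get("FailureReason", "no FailureReason returned")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still Creating after {waited}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Possible usage in the notebook:
#   wait_for_endpoint(sagemaker_client, endpoint_name)
```

The `sleep` parameter exists only so the loop can be exercised without real delays.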


## 4. Test

You can invoke the endpoint through the SageMaker Runtime API.

### 4.1 Rerank request (non-streaming)

In [56]:
sagemaker_runtime = boto3.client('sagemaker-runtime')

payload = {
    "inputs": ["What is machine learning?"]*3,
    "docs": [
        "Machine learning is a subset of artificial intelligence that enables computers to learn without being explicitly programmed.",
        "Cooking is the art of preparing food using various techniques and ingredients.",
        "Machine learning algorithms can identify patterns in data and make predictions."
    ]
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read()))

{'scores': [0.9986893534660339, 2.671633592399303e-05, 0.7657076120376587], 'ranked_results': [{'query': 'What is machine learning?', 'document': 'Machine learning is a subset of artificial intelligence that enables computers to learn without being explicitly programmed.', 'score': 0.9986893534660339}, {'query': 'What is machine learning?', 'document': 'Machine learning algorithms can identify patterns in data and make predictions.', 'score': 0.7657076120376587}, {'query': 'What is machine learning?', 'document': 'Cooking is the art of preparing food using various techniques and ingredients.', 'score': 2.671633592399303e-05}]}
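
In the response above, `scores` follows the order of the input documents, while `ranked_results` appears sorted by descending score. A small sketch for picking the best matches from that response; `top_documents` is a hypothetical helper that assumes the response shape shown here, and it re-sorts rather than relying on the server-side order.

```python
def top_documents(result, k=1, min_score=0.0):
    """Return up to k (document, score) pairs with score >= min_score, best first."""
    ranked = sorted(result["ranked_results"], key=lambda r: r["score"], reverse=True)
    kept = [(r["document"], r["score"]) for r in ranked if r["score"] >= min_score]
    return kept[:k]


# Abbreviated stand-in for the endpoint response (documents shortened, scores rounded).
example = {
    "ranked_results": [
        {"query": "What is machine learning?", "document": "ML is a subset of AI", "score": 0.9987},
        {"query": "What is machine learning?", "document": "ML finds patterns", "score": 0.7657},
        {"query": "What is machine learning?", "document": "Cooking is an art", "score": 0.00003},
    ]
}

for document, score in top_documents(example, k=2, min_score=0.5):
    print(f"{score:.4f}  {document}")
```

A `min_score` threshold is useful when every candidate may be irrelevant, since a plain top-k would otherwise still return the least bad document.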
