# Deploy Qwen3 on SageMaker with vLLM 0.9.0

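## 0. Required IAM Role Permissions

- AmazonEC2ContainerRegistryFullAccess (create the ECR repository and push the image)
- AmazonS3FullAccess (requested by the deploy step)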
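
To confirm up front that the role carries the managed policies this notebook asks for (`AmazonEC2ContainerRegistryFullAccess` here and `AmazonS3FullAccess` in the deploy cell), a check along the following lines may help. This is a minimal sketch: `missing_policies` is a hypothetical helper, and it only compares attached managed-policy names, so permissions granted through inline or custom policies will not be detected.

```python
REQUIRED_POLICIES = {
    "AmazonEC2ContainerRegistryFullAccess",  # build and push the image
    "AmazonS3FullAccess",                    # requested by the deploy cell
}


def missing_policies(attached_policies, required=REQUIRED_POLICIES):
    """Return the required managed-policy names that are not attached.

    `attached_policies` is the "AttachedPolicies" list returned by the
    IAM list_attached_role_policies call.
    """
    attached_names = {policy["PolicyName"] for policy in attached_policies}
    return sorted(set(required) - attached_names)


# Usage against a real role (requires iam:ListAttachedRolePolicies):
#   import boto3
#   role_name = "MySageMakerRole"  # hypothetical role name
#   resp = boto3.client("iam").list_attached_role_policies(RoleName=role_name)
#   print(missing_policies(resp["AttachedPolicies"]))
```

An empty list means both policies are attached by name; roles with many attached policies may need pagination, which this sketch does not handle.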

## 1. Create and Push Image to ECR **[ONLY RUN ONCE]**

In [ ]:
!pip install -U --quiet sagemaker boto3 awscli

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ACCOUNT_ID = boto3.client('sts').get_caller_identity().get('Account')
REGION_NAME = 'us-west-2'  # set your region name here
REPO_NAME = "vllm_env"  # set your repo name here
VERSION = "v0.9.0"

CONTAINER = f"{ACCOUNT_ID}.dkr.ecr.{REGION_NAME}.amazonaws.com/{REPO_NAME}:{VERSION}"



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
# Create ECR repo
# ⚠️ Please add AmazonEC2ContainerRegistryFullAccess permission to your IAM Role.
!aws ecr describe-repositories --repository-names {REPO_NAME} --region {REGION_NAME} > /dev/null 2>&1 || aws ecr create-repository --repository-name {REPO_NAME} --region {REGION_NAME}

{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-west-2:707684582322:repository/vllm_env",
        "registryId": "707684582322",
        "repositoryName": "vllm_env",
        "repositoryUri": "707684582322.dkr.ecr.us-west-2.amazonaws.com/vllm_env",
        "createdAt": 1748416400.984,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        },
        "encryptionConfiguration": {
            "encryptionType": "AES256"
        }
    }
}
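
The describe-or-create shell one-liner above can also be written with boto3, which avoids parsing CLI exit codes. The sketch below assumes an ECR client is passed in; `ensure_ecr_repository` is a hypothetical helper name, not part of any SDK.

```python
def ensure_ecr_repository(ecr_client, repo_name):
    """Return the repository URI, creating the repository if it is missing."""
    try:
        resp = ecr_client.describe_repositories(repositoryNames=[repo_name])
        return resp["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The repository does not exist yet, so create it.
        resp = ecr_client.create_repository(repositoryName=repo_name)
        return resp["repository"]["repositoryUri"]


# Usage with the notebook's variables:
#   import boto3
#   ecr = boto3.client("ecr", region_name=REGION_NAME)
#   print(ensure_ecr_repository(ecr, REPO_NAME))
```

Calling it twice is safe: the second call finds the repository and returns the same URI without creating anything.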


In [3]:
# Build image
CONTAINER = f"{ACCOUNT_ID}.dkr.ecr.{REGION_NAME}.amazonaws.com/{REPO_NAME}:{VERSION}"

!aws ecr get-login-password --region {REGION_NAME} | docker login --username AWS --password-stdin {ACCOUNT_ID}.dkr.ecr.{REGION_NAME}.amazonaws.com
print('Building the Docker image. This may take a few minutes...')
!docker build --quiet --build-arg VERSION={VERSION} -t {REPO_NAME}:{VERSION} .

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Building the Docker image. This may take a few minutes...
sha256:ce1ef5a63a55fda0ab4c3c8e3110c444f861e97f9317a89845c98e5a8cb4dbfb


In [4]:
# Push image to ECR
# ⚠️ Please add AmazonEC2ContainerRegistryFullAccess permission to your IAM Role.
!docker tag {REPO_NAME}:{VERSION} {CONTAINER}
print('Pushing the Docker image. This may take a few minutes...')
!docker push {CONTAINER}

Pushing the Docker image. This may take a few minutes...
The push refers to repository [707684582322.dkr.ecr.us-west-2.amazonaws.com/vllm_env]
v0.9.0: digest: sha256:4d5382d232973e92048af7285c48b70472d3bfbb59a5e38e21d835c59005a9ab size: 6193


In [5]:
print('Please use this container url for further deployment!')
print(CONTAINER)

Please use this container url for further deployment!
707684582322.dkr.ecr.us-west-2.amazonaws.com/vllm_env:v0.9.0


## 2. Deploy

In [8]:
# ⚠️ Please add AmazonS3FullAccess permission to your IAM Role.
REGION_NAME = "us-west-2"  # Set your region name

INSTANCE_TYPE = 'ml.g5.2xlarge'
INITIAL_INSTANCE_COUNT = 1

# Set vLLM options.
# vLLM server options are set through environment variables with the "SM_VLLM_" prefix.
# E.g., "--max_model_len 512" is equivalent to {"SM_VLLM_MAX_MODEL_LEN": "512"}
VLLM_ENV = {
    'SM_VLLM_MODEL': "Qwen/Qwen3-8B",
    'SM_VLLM_TENSOR_PARALLEL_SIZE': '1',
    'SM_VLLM_MAX_MODEL_LEN': '16384',
    'SM_VLLM_MAX_NUM_SEQS': '8',
    'SM_VLLM_GPU_MEMORY_UTILIZATION': '0.85',
}
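
To make the `SM_VLLM_` naming convention concrete, the sketch below shows one way such variables could be turned into server arguments. The container's actual entrypoint is not shown in this notebook, so `sm_env_to_vllm_args` is a hypothetical illustration of the mapping rather than the code the image runs; it emits the hyphenated flag spelling (`--max-model-len`) that vLLM documents.

```python
def sm_env_to_vllm_args(env, prefix="SM_VLLM_"):
    """Translate prefixed environment variables into vLLM-style CLI arguments."""
    args = []
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated environment variables
        # SM_VLLM_MAX_MODEL_LEN -> --max-model-len
        flag = "--" + key[len(prefix):].lower().replace("_", "-")
        args.extend([flag, str(value)])
    return args


print(sm_env_to_vllm_args({"SM_VLLM_MAX_MODEL_LEN": "512"}))
# ['--max-model-len', '512']
```

Passing `VLLM_ENV` from the cell above would yield one flag/value pair per entry, in dictionary order.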


In [9]:
import os
import boto3
import datetime
import sagemaker
from sagemaker.s3 import S3Uploader


timestamp = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")[:-3]

boto_session = boto3.Session(region_name=REGION_NAME)
sagemaker_session = sagemaker.Session(boto_session=boto_session)
iam_role = sagemaker.get_execution_role(sagemaker_session=sagemaker_session)

# create a unique name
model_name = f"Qwen3-8B-{timestamp}"
endpoint_name = sagemaker.utils.name_from_base("Qwen3-8B")

model = sagemaker.Model(
    name=model_name,
    image_uri=CONTAINER,
    sagemaker_session=sagemaker_session,
    role=iam_role,
    env=VLLM_ENV,
)

predictor = model.deploy(
    instance_type=INSTANCE_TYPE,
    initial_instance_count=INITIAL_INSTANCE_COUNT,
    endpoint_name=endpoint_name
)
print(f'Endpoint Name: {endpoint_name}')

----------------------!Endpoint Name: Qwen3-8B-2025-05-28-08-13-30-292
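
A running endpoint is billed until it is deleted, so it can be useful to keep a teardown helper next to the deploy cell. The sketch below assumes a boto3 SageMaker client is passed in; `delete_endpoint_resources` is a hypothetical helper that looks up the endpoint configuration and model names from the endpoint itself instead of relying on notebook variables.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models it references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    # Delete in dependency order: endpoint, then its config, then the models.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)
    return config_name, model_names


# Usage once testing is finished:
#   import boto3
#   sm = boto3.client("sagemaker", region_name=REGION_NAME)
#   delete_endpoint_resources(sm, endpoint_name)
```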


## 3. Test

In [10]:
endpoint_name = "Qwen3-8B-2025-05-28-08-13-30-292"  # Set your deployed endpoint name. You can find it in your SageMaker AI Dashboard
REGION_NAME = "us-west-2"  # Set your region name

In [11]:
import json
import boto3
import base64

payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "How are you today ?"
                }
            ]
        }
    ],
    "temperature": 0.7,
    "max_tokens": 4096,
    "stream": False
}

runtime_sm_client = boto3.client('sagemaker-runtime', region_name=REGION_NAME)
response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

response_body = json.loads(response['Body'].read().decode())
print(response_body)


{'id': 'chatcmpl-d3f0eaa7b736400c8a4f502f6b034f05', 'object': 'chat.completion', 'created': 1748420707, 'model': 'Qwen/Qwen3-8B', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'reasoning_content': None, 'content': '<think>\nOkay, the user asked, "How are you today?" I need to respond appropriately. First, I should acknowledge their question and express that I\'m here to help. Since I\'m an AI, I don\'t have feelings, so I should mention that I don\'t experience emotions but am ready to assist. I should keep the tone friendly and open-ended to encourage them to share what they need help with. Let me make sure the response is clear and welcoming.\n</think>\n\nHello! I\'m just a virtual assistant, so I don\'t have feelings or emotions like humans do. But I\'m here and ready to help you with whatever you need! How can I assist you today? 😊', 'tool_calls': []}, 'logprobs': None, 'finish_reason': 'stop', 'stop_reason': None}], 'usage': {'prompt_tokens': 13, 'total_tokens': 155
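
In the response above, `reasoning_content` is `None` and the model's reasoning arrives inline in `content`, wrapped in `<think>...</think>`. If only the final answer is needed, it can be separated on the client side. The helper below is a minimal sketch (`split_reasoning` is a hypothetical name) that assumes at most one think block per message.

```python
def split_reasoning(content, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer) from a completion that may embed a think block."""
    start = content.find(open_tag)
    end = content.find(close_tag)
    if start == -1 or end == -1 or end < start:
        # No complete think block: treat everything as the answer.
        return "", content.strip()
    reasoning = content[start + len(open_tag):end].strip()
    answer = (content[:start] + content[end + len(close_tag):]).strip()
    return reasoning, answer


# Usage with the response parsed above:
#   message = response_body["choices"][0]["message"]
#   reasoning, answer = split_reasoning(message["content"])
#   print(answer)
```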

## 4. Streaming for Longer Connections (up to 8 minutes)

In [18]:
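# Use invoke_endpoint_with_response_stream for streaming.
# The parser below expects SSE chunks ("data: ..."), so the request must set "stream": True.
stream_payload = {**payload, "stream": True}
response = runtime_sm_client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(stream_payload)
)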

# Buffer for partial JSON
buffer = ""

# Process streaming response
for event in response['Body']:
    chunk = event['PayloadPart']['Bytes'].decode('utf-8')
    
    # Add new chunk to buffer
    buffer += chunk
    
    # Split by "data: " (SSE format)
    parts = buffer.split("data: ")
    
    # Keep last incomplete part for next chunk
    buffer = parts[-1]
    
    # Process complete parts
    for part in parts[:-1]:
        if not part.strip():
            continue
            
        try:
            # Parse JSON
            chunk_data = json.loads(part.strip())
            
            # Extract content
            if 'choices' in chunk_data and chunk_data['choices']:
                content = chunk_data['choices'][0]['delta'].get('content', '')
                if content:
                    print(content, end='', flush=True)
                    
        except json.JSONDecodeError as e:
            print(f"[ERROR] Failed to parse: {part[:50]}... | Error: {e}")

<think>
Okay, the user greeted me with "How are you today?" I need to respond appropriately. First, I should acknowledge their question and share my current state. Since I'm an AI, I don't have emotions, but I can express that I'm functioning well. I should keep the tone friendly and open-ended to encourage them to share their feelings.

Maybe start with a simple "I'm doing well!" to show I'm in good shape. Then invite them to tell me about their day. That way, the conversation can flow naturally. I should also make sure the response is concise and not too lengthy. Let me check if there are any other elements to consider, like cultural nuances or if there's a specific reason they asked. But since it's a general greeting, keeping it straightforward is best. Alright, that should work.
</think>

I'm doing well! How about you? I'd love to hear about your day—what's been happening? 😊
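
The streaming loop splits the decoded text on `"data: "`, which can fail if a multi-byte character is cut across two payload parts or if generated text itself contains that string. A line-based variant is sketched below under the assumption that the server emits OpenAI-style SSE events, each ending in a newline and finishing with `data: [DONE]`; `iter_sse_content` is a hypothetical helper name.

```python
import json


def iter_sse_content(chunks):
    """Yield content deltas from an iterable of raw SSE byte chunks."""
    buffer = b""
    for raw in chunks:
        buffer += raw
        # Parse only complete lines; the trailing partial line stays buffered.
        # Splitting on bytes keeps multi-byte UTF-8 characters intact.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            text = line.decode("utf-8").strip()
            if not text.startswith("data:"):
                continue
            data = text[len("data:"):].strip()
            if data == "[DONE]":
                return
            choices = json.loads(data).get("choices") or []
            if choices:
                content = choices[0].get("delta", {}).get("content")
                if content:
                    yield content


# Usage with a SageMaker streaming response:
#   parts = (e["PayloadPart"]["Bytes"] for e in response["Body"] if "PayloadPart" in e)
#   for piece in iter_sse_content(parts):
#       print(piece, end="", flush=True)
```

Because the helper is a plain generator over byte chunks, it can be exercised with hand-written chunks before pointing it at a live endpoint.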