# Deploy Qwen2.5-VL-32B-Instruct on Amazon SageMaker AI with SGLang

❗This notebook works well on an `ml.g5.xlarge` instance with 100 GB of disk space, using the `PyTorch 2.2.0 Python 3.10 CPU optimized kernel` from **SageMaker Studio Classic** or the `Python3 kernel` from **JupyterLab**.

SageMaker provides [pre-built SageMaker AI Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) that help you get started quickly with model inference. You can also [bring your own Docker container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html) and use it in SageMaker AI for training and inference. To be compatible with SageMaker AI hosting, your container must meet the following requirements:

- Your container must have a web server listening on port `8080`.
- Your container must accept POST requests on the `/invocations` endpoint (inference) and answer health-check GET requests on the `/ping` endpoint.
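
The two requirements above can be illustrated with a minimal sketch built only on the Python standard library. This is not the server used in this notebook's container (SGLang provides its own); the class name `SageMakerContractHandler`, the helper `start_server`, and the echo behaviour are hypothetical and exist only to show the shape of the contract.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class SageMakerContractHandler(BaseHTTPRequestHandler):
    """Hypothetical handler answering the two routes SageMaker AI calls."""

    def do_GET(self):
        # Health check: SageMaker AI polls GET /ping and expects HTTP 200.
        if self.path == "/ping":
            self._send(200, {"status": "healthy"})
        else:
            self._send(404, {"error": "not found"})

    def do_POST(self):
        # Inference: request bodies arrive as POST /invocations.
        if self.path != "/invocations":
            self._send(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder "model": echo the request instead of running inference.
        self._send(200, {"echo": payload})

    def _send(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep notebook output quiet


def start_server(host="127.0.0.1", port=8080):
    """Start the sketch server in a background thread and return it."""
    server = ThreadingHTTPServer((host, port), SageMakerContractHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Inside a real container the server would bind to `0.0.0.0` on port `8080`; for a local experiment, `start_server(port=0)` lets the operating system pick a free port, and `server.shutdown()` stops it.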

In this notebook, we'll demonstrate how to adapt the [SGLang](https://github.com/sgl-project/sglang) framework to run on SageMaker AI endpoints. SGLang is a serving framework for large language models and vision language models. It offers a fast backend runtime that uses RadixAttention for efficient serving, extensive model support, and an active open-source community. For more information, refer to [https://docs.sglang.ai/index.html](https://docs.sglang.ai/index.html) and [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang).

By building a custom Docker container around SGLang, you can run advanced AI models such as [Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) on a SageMaker AI endpoint.

### Set up Environment

In [None]:
%%capture --no-stderr

!pip install -U pip
!pip install -U "sagemaker>=2.237.3"
!pip install -U huggingface-hub==0.26.2
!pip install -U sagemaker-studio-image-build==0.6.0

In [None]:
!pip freeze | grep -E "huggingface_hub|sagemaker|torch"

### Prepare the SGLang SageMaker container

In [None]:
BASE_IMAGE = 'lmsysorg/sglang:v0.4.4.post3-cu125'
DOCKER_IMAGE = "sglang-sagemaker"
DOCKER_IMAGE_TAG = "latest"

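[sm-docker](https://github.com/aws-samples/sagemaker-studio-image-build-cli) is a CLI for building Docker images from SageMaker Studio using AWS CodeBuild. The cell below uses it to build the image and push it to an Amazon ECR repository named after `DOCKER_IMAGE`.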

In [None]:
%%time

!cd ../container && sm-docker build . \
  --repository {DOCKER_IMAGE}:{DOCKER_IMAGE_TAG} \
  --build-arg BASE_IMAGE={BASE_IMAGE}

### Create SageMaker AI endpoint for Qwen2.5-VL-32B model

In this example, we will download the model weights from Hugging Face and upload them to Amazon S3.

In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

model_dir = Path('model')
model_dir.mkdir(exist_ok=True)

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"
snapshot_download(model_id, local_dir=model_dir)

In [None]:
import boto3
import sagemaker

region = boto3.Session().region_name
session = sagemaker.Session()
bucket = session.default_bucket()

region, bucket

In [None]:
model_name = "Qwen2.5-VL-32B-Instruct"

base_name = model_name.split('/')[-1].replace('.', '-').lower()
base_name
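
The base name is later used as part of the endpoint name, so it helps to confirm that it satisfies SageMaker's naming rule: alphanumeric characters and hyphens only, starting and ending with an alphanumeric character, at most 63 characters. The sketch below reproduces the derivation above and adds a check; the helper names `to_base_name` and `is_valid_endpoint_name` are hypothetical.

```python
import re

# Naming rule for SageMaker endpoint names: alphanumerics and hyphens,
# beginning and ending with an alphanumeric, no longer than 63 characters.
_ENDPOINT_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
_MAX_ENDPOINT_NAME_LEN = 63


def to_base_name(model_id: str) -> str:
    """Drop the organisation prefix, replace dots with hyphens, lowercase."""
    return model_id.split("/")[-1].replace(".", "-").lower()


def is_valid_endpoint_name(name: str) -> bool:
    """Return True if `name` satisfies the endpoint naming rule."""
    return (
        len(name) <= _MAX_ENDPOINT_NAME_LEN
        and _ENDPOINT_NAME_RE.match(name) is not None
    )


base = to_base_name("Qwen/Qwen2.5-VL-32B-Instruct")
print(base, is_valid_endpoint_name(base))
```

Keep in mind that `name_from_base` appends a suffix later, so the base name needs some headroom below the 63-character limit.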

In [None]:
!aws s3 cp model/ s3://{bucket}/{base_name}/ --recursive

In [None]:
s3_model_uri = f"s3://{bucket}/{base_name}/"
s3_model_uri

In [None]:
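# `sagemaker` and `session` were set up in an earlier cell.
region = session.boto_region_name  # public attribute, instead of the private `_region_name`
role = sagemaker.get_execution_role()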

image_uri = f'{session.account_id()}.dkr.ecr.{region}.amazonaws.com/{DOCKER_IMAGE}:{DOCKER_IMAGE_TAG}'
image_uri

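Next, we create the [SageMaker model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) from the custom Docker image and the model data stored in S3. Because the weights were uploaded as individual files rather than as a `model.tar.gz` archive, the data source uses `S3Prefix` with `CompressionType` set to `None`.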

In [None]:
model_data = {
    "S3DataSource": {
        "S3Uri": s3_model_uri,
        "S3DataType": "S3Prefix",
        "CompressionType": "None",
    }
}

model_data
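
When the model data is an uncompressed S3 prefix, the SageMaker API documentation describes a valid key-name prefix as ending with a forward slash. A minimal sketch of a helper that builds the same dictionary and normalises the URI is shown below; `build_model_data` is a hypothetical name, not part of the SageMaker SDK.

```python
def build_model_data(s3_uri: str) -> dict:
    """Build an uncompressed S3-prefix model data source for sagemaker.Model."""
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Expected an s3:// URI, got: {s3_uri!r}")
    # Make sure the prefix ends with "/" so it is treated as a key-name prefix.
    if not s3_uri.endswith("/"):
        s3_uri += "/"
    return {
        "S3DataSource": {
            "S3Uri": s3_uri,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        }
    }


print(build_model_data("s3://example-bucket/qwen2-5-vl-32b-instruct"))
```

The bucket name in the usage line is a placeholder; in this notebook the URI would be `s3_model_uri`.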

In [None]:
from sagemaker.model import Model
from sagemaker.predictor import Predictor


model = Model(
    model_data=model_data,
    role=role,
    image_uri=image_uri,
    env={
        'CHAT_TEMPLATE': 'qwen2-vl',
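        # 'TENSOR_PARALLEL_DEGREE': '1', # single-GPU instance such as ml.g5.2xlarge (one 24 GB GPU; too small for this 32B model at 16-bit precision)
        'TENSOR_PARALLEL_DEGREE': '4' # ml.g5.24xlarge (4 GPUs)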
    },
    predictor_cls=Predictor
)
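
The container's entrypoint script lives in `../container` and is not shown in this notebook, so how it consumes `CHAT_TEMPLATE` and `TENSOR_PARALLEL_DEGREE` is not confirmed here. As a hedged sketch, one plausible mapping is to translate them into `sglang.launch_server` arguments (`--chat-template`, `--tp-size`) and point the server at `/opt/ml/model`, where SageMaker AI places the model data. The function `build_launch_command` is hypothetical.

```python
import shlex


def build_launch_command(env, model_path="/opt/ml/model", port=8080):
    """Translate endpoint environment variables into an SGLang launch command.

    Assumption: the real entrypoint may differ; this only illustrates the idea.
    """
    cmd = [
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    tp_degree = env.get("TENSOR_PARALLEL_DEGREE")
    if tp_degree:
        # int() rejects non-numeric values early instead of at server start-up.
        cmd += ["--tp-size", str(int(tp_degree))]
    chat_template = env.get("CHAT_TEMPLATE")
    if chat_template:
        cmd += ["--chat-template", chat_template]
    return cmd


example_env = {"CHAT_TEMPLATE": "qwen2-vl", "TENSOR_PARALLEL_DEGREE": "4"}
print(shlex.join(build_launch_command(example_env)))
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess`.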

In [None]:
%%time

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.utils import name_from_base

endpoint_name = name_from_base(f'{base_name}-sglang', short=True)
instance_type = 'ml.g5.24xlarge' # alternatives: ml.g5.48xlarge or ml.p4d.24xlarge (8 GPUs each, so TENSOR_PARALLEL_DEGREE can be raised)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=300,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

### Invoke endpoint with SageMaker Python SDK

In [None]:
%%time

# https://github.com/sgl-project/sglang/blob/v0.4.4.post3/python/sglang/srt/conversation.py#L499
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
                },
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

response = predictor.predict({
    'model': 'default',
    'messages': messages,
    'temperature': 0.6,
    'max_new_tokens': 128,
    'do_sample': True,
    'top_p': 0.95,
})

print(response['choices'][0]['message']['content'])

The image depicts a serene and heartwarming scene at the beach during what appears to be sunset. Here's a detailed description:

1. **Setting**:
   - The scene takes place on a sandy beach with gentle waves visible in the background, suggesting a calm, oceanfront environment.
   - The lighting is warm and soft, indicating that the sun is either rising or setting, casting a golden glow across the scene.

2. **Main Subjects**:
   - **Woman**: On the right side of the image, a woman is sitting on the sand. She has long, dark hair and is wearing a plaid shirt and dark pants. Her posture is relaxed, and she is smiling, exuding a sense of happiness and contentment.
   - **Dog**: To the left of the woman, there is a light-colored dog, likely a Labrador Retriever. The dog is wearing a colorful harness and is sitting attentively. The dog is making physical contact with the woman, as it appears to be "shaking hands" or playfully interacting with her.

3. **Interaction**:
   - The woman and the d
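
The output above stops mid-sentence, which is what a generation length cap typically produces. Assuming the endpoint returns an OpenAI-style chat completion (the `choices[0].message.content` path used above suggests it does), the `finish_reason` field can indicate whether the text was cut off. That field is an assumption here, since the raw response is not shown in this notebook; `extract_text` is a hypothetical helper.

```python
def extract_text(response: dict):
    """Return (text, truncated) from an OpenAI-style chat completion dict.

    Assumption: `finish_reason == "length"` marks output cut by the token limit.
    A missing `finish_reason` is treated as "not truncated".
    """
    choice = response["choices"][0]
    text = choice["message"]["content"]
    truncated = choice.get("finish_reason") == "length"
    return text, truncated


# Hand-written sample shaped like a chat completion, for illustration only.
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "The woman and the d"},
            "finish_reason": "length",
        }
    ]
}
text, truncated = extract_text(sample)
if truncated:
    print("Output hit the token limit; consider raising it.")
```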

In [None]:
import base64

def encode_image_to_base64(image_path):
    with open(image_path, "rb") as image_file:
        encoded_image = base64.b64encode(image_file.read())
        return f"data:image;base64,{encoded_image.decode('utf-8')}"
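
The helper above emits a generic `data:image;base64,` prefix, which worked in the runs shown in this notebook. A slightly stricter variant, sketched below under the assumption that the file extension reflects the real format, fills in the full media type (for example `image/png`) using the standard `mimetypes` module. The name `encode_image_to_data_uri` is hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def encode_image_to_data_uri(image_path):
    """Encode an image file as a data URI with an explicit media type."""
    mime_type, _ = mimetypes.guess_type(str(image_path))
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image media type for {image_path!r}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

It can be used in place of `encode_image_to_base64`, for example `encode_image_to_data_uri("./samples/image1.png")`.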

In [None]:
%%time

base64_image = encode_image_to_base64("./samples/image1.png")

# https://github.com/sgl-project/sglang/blob/v0.4.4.post3/python/sglang/srt/conversation.py#L499
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": base64_image
                },
            },
            {"type": "text", "text": "Extract all text from the image and separate each line or text segment with a newline character."},
        ],
    }
]

response = predictor.predict({
    'model': 'default',
    'messages': messages,
    'temperature': 0.6,
    'max_new_tokens': 128,
    'do_sample': True,
    'top_p': 0.95,
})

print(response['choices'][0]['message']['content'])

개봉선
자스 렌즈 와이프 렌즈 표면의 먼지와 얼룩을
흔적 없이 부드럽게 닦아주는 일회용 티슈.

안전기준
안전확인대상생활화학제품
확인
표시사항

신고번호: 제 FB21-02-0531호
품목: 제거제
제품명: 자스제 렌즈 와이프
주요물질: 정제수, 2-프로판올
제조연월: 제품 하단 LOT 번호 앞 네 자리 참조
제조자, 제조국: 프로스벤에이엔씨(Prosen, Inc), 중국
수입자, 주소, 연락처: 칼마이즈비전코리아,
서울시 송파구 법원로 135, 1201호(02-2522-1001)

www.zass.com/cleaning
Produced in China
1017-0531-04

CPU times: user 2.36 ms, sys: 3.74 ms, total: 6.1 ms
Wall time: 11.2 s


### Invoke endpoint with boto3

You can also invoke the endpoint with boto3. If you already have a running endpoint, you don't need to recreate the predictor; follow the example below to invoke it by endpoint name.

In [None]:
import boto3
import json

sagemaker_runtime = boto3.client('sagemaker-runtime', region_name=region)
endpoint_name = predictor.endpoint_name # you can manually set the endpoint name with an existing endpoint
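
When attaching to an existing endpoint by name, it can be worth confirming that it is `InService` before sending requests. The sketch below polls `describe_endpoint` on a SageMaker control-plane client (`boto3.client('sagemaker')`, which is distinct from the `sagemaker-runtime` client above). The function name, polling interval, and timeout are illustrative choices, not values taken from this notebook.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll the endpoint status until it is InService, fails, or times out.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} is in status Failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# wait_until_in_service(boto3.client("sagemaker"), endpoint_name)
```

Passing the client in as an argument keeps the function easy to exercise with a stub, and boto3 also ships a built-in `endpoint_in_service` waiter that serves the same purpose.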

In [None]:
%%time

# https://github.com/sgl-project/sglang/blob/v0.4.4.post3/python/sglang/srt/conversation.py#L499
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
                },
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

prompt = {
    'model': 'default',
    'messages': messages,
    'temperature': 0.6,
    'max_new_tokens': 128,
    'do_sample': True,
    'top_p': 0.95
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)

response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

This image depicts a warm and heartwarming scene of a young woman interacting affectionately with her dog on a sandy beach at sunset. Here are the key details:

1. **Setting**:
   - The scene is set on a beach with soft, light-colored sand.
   - The background shows the ocean, with gentle waves rolling in. The sky is illuminated with warm, golden hues of the setting sun, creating a serene and peaceful atmosphere.

2. **People and Animals**:
   - A young woman is sitting on the sand, facing her dog. She appears to be of medium complexion and has long, dark hair.
   - She is wearing a plaid shirt with rolled-up sleeves and dark pants, giving a casual and relaxed vibe.
   - The dog is a light-colored Labrador Retriever, sitting attentively on the sand. It is wearing a colorful harness, possibly for added safety or as a decorative accessory.

3. **Interaction**:
   - The woman and the dog are engaged in a playful or affectionate gesture. The woman is extending her hand, and the dog is reac

In [None]:
%%time

base64_image = encode_image_to_base64("./samples/image1.png")

# https://github.com/sgl-project/sglang/blob/v0.4.4.post3/python/sglang/srt/conversation.py#L499
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": base64_image
                },
            },
            {"type": "text", "text": "Extract all text from the image and separate each line or text segment with a newline character."},
        ],
    }
]

prompt = {
    'model': 'default',
    'messages': messages,
    'temperature': 0.6,
    'max_new_tokens': 128,
    'do_sample': True,
    'top_p': 0.95
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)

response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

개봉선
자스 렌즈 와이프 렌즈 표면의 먼지와 얼룩을
흔적 없이 부드럽게 닦아주는 일회용 티슈.

안전기준
안전확인대상생활화학제품
표시사항

신고번호: 제 FB21-02-0531호
품목: 제거제
제품명: 자스제 렌즈 와이프
주요물질: 정제수, 2-프로판올
제조연월: 제품 하단 LOT 번호 앞 네 자리 참조
제조자, 제조국: 프로스벤아이엔씨(Prosben, Inc), 중국
수입자, 주소, 연락처: 캄자이즈비전코리아,
서울시 송파구 법원로 135, 1201호(02-2252-1001)
www.2lass.com/cleaning
Produced in China
101-65-25-04

CPU times: user 5.52 ms, sys: 0 ns, total: 5.52 ms
Wall time: 11.1 s


### Clean up the environment

Make sure to delete the endpoint and the other artifacts created in this example to avoid unnecessary costs. You can also delete these resources from the SageMaker AI console.

In [None]:
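# Delete the model first: the predictor looks up the model name through the
# endpoint's configuration, which delete_endpoint() removes by default.
predictor.delete_model()
predictor.delete_endpoint()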
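
Deleting the model and endpoint does not remove the model weights that were uploaded to S3 earlier. A minimal sketch for clearing that prefix is shown below; it expects an object that behaves like `boto3.resource('s3')`, and the helper name `delete_s3_prefix` is hypothetical. The ECR repository created by `sm-docker` is a separate resource and is not covered here.

```python
def delete_s3_prefix(s3_resource, bucket_name, prefix):
    """Delete every object under `prefix` in `bucket_name`.

    `s3_resource` is expected to behave like boto3.resource("s3").
    """
    # Guard against an empty prefix, which would match the whole bucket.
    if not prefix or not prefix.strip("/"):
        raise ValueError("Refusing to delete with an empty prefix")
    bucket = s3_resource.Bucket(bucket_name)
    return bucket.objects.filter(Prefix=prefix).delete()


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# delete_s3_prefix(boto3.resource("s3"), bucket, f"{base_name}/")
```

Using the prefix with a trailing slash (for example `qwen2-5-vl-32b-instruct/`) avoids matching other prefixes that merely start with the same characters.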

### References

- [Qwen2.5-VL-32B-Instruct Model Card](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct)
- [SGLang Documentation](https://docs.sglang.ai/index.html) - a fast serving framework for large language models and vision language models
- [sagemaker-genai-hosting-examples/Deepseek/SGLang-Deepseek/deepseek-r1-llama-70b-sglang.ipynb](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Deepseek/SGLang-Deepseek/deepseek-r1-llama-70b-sglang.ipynb)