# Deploy Qwen3-VL-2B-Instruct for inference using Amazon SageMaker AI
**Recommended kernel(s):** This notebook can be run with any Amazon SageMaker Studio kernel.

In this notebook, you will learn how to deploy the Qwen3-VL-2B-Instruct vision-language model (Hugging Face model ID: [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)) using Amazon SageMaker AI.

Install or upgrade the required dependencies with the following command:

In [None]:
%pip install -Uq huggingface_hub sagemaker transformers==4.57.0

## Setup

In [1]:
import os
import datetime
import sagemaker
import boto3
import logging
import json
import time
import shutil
import tarfile

from sagemaker.huggingface import HuggingFaceModel
from sagemaker.session import Session
from sagemaker.s3 import S3Uploader

from huggingface_hub import snapshot_download

print(sagemaker.__version__)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
2.253.1


In [2]:
session = sagemaker.Session()
role = sagemaker.get_execution_role()

instance_type = "ml.g5.4xlarge"
instance_count = 1

model_id = "Qwen/Qwen3-VL-2B-Instruct"
model_id_filesafe = model_id.replace("/", "_").replace(".", "_")
timestamp = str(datetime.datetime.now().timestamp()).replace(".", "-")
endpoint_name = f"{model_id_filesafe.replace('_', '-')}-endpoint-{timestamp}"
print(endpoint_name)

image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128-v1.2"

base_name = model_id.split('/')[-1].replace('.', '-').lower()
model_lineage = model_id.split('/')[0]
base_name

bucket_name = session.default_bucket()
default_prefix = session.default_bucket_prefix or f"models/{model_id_filesafe}"
print(f"Saving model artifacts to {bucket_name}/{default_prefix}")

os.makedirs("code", exist_ok=True)

Qwen-Qwen3-VL-2B-Instruct-endpoint-1761679149-892584
Saving model artifacts to sagemaker-us-east-1-329542461890/models/Qwen_Qwen3-VL-2B-Instruct
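
SageMaker endpoint names may contain only alphanumeric characters and hyphens and are limited to 63 characters, so a name derived from a Hub model ID has to be sanitized. The following is a minimal sketch of one way to do that. The helper `make_endpoint_name` is hypothetical and not part of the SageMaker SDK.

```python
import re
import time

# SageMaker endpoint names: alphanumerics and hyphens only, at most 63 characters.
_ENDPOINT_NAME_RE = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
_MAX_ENDPOINT_NAME_LEN = 63


def make_endpoint_name(model_id, suffix=None):
    """Hypothetical helper: derive a valid endpoint name from a Hub model ID."""
    if suffix is None:
        suffix = str(int(time.time()))
    # Replace every run of non-alphanumeric characters with a single hyphen.
    stem = re.sub(r"[^a-zA-Z0-9]+", "-", model_id).strip("-")
    tail = f"-endpoint-{suffix}"
    # Truncate the stem, never the suffix, so the name stays unique.
    stem = stem[: _MAX_ENDPOINT_NAME_LEN - len(tail)].rstrip("-")
    name = f"{stem}{tail}"
    if not _ENDPOINT_NAME_RE.match(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name


print(make_endpoint_name("Qwen/Qwen3-VL-2B-Instruct", suffix="1761679149"))
```

Truncating the model part rather than the timestamp keeps repeated deployments of the same model from colliding.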


## Local Model Test

In [None]:
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")
messages = [
    {
        "role":"user",
        "content":[
            {
                "type":"image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {
                "type":"text",
                "text":"Describe this image."
            }
        ]
    }

]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
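inputs.pop("token_type_ids", None)
# device_map="auto" may place the model on a GPU, so move the inputs to the same device.
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)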
print(output_text)

## Create SageMaker Model
Here we define the custom requirements and the inference logic that the endpoint will run. We download the model assets from Hugging Face, package them together with the `code/` directory into a gzipped tarball, and upload the tarball to S3. We then deploy the model as a `HuggingFaceModel`.

In [3]:
env = {
    # Settings read by the Hugging Face inference container
    "HF_MODEL_ID": model_id,
    "HF_TASK": "image-text-to-text",
    "SM_NUM_GPUS": "1",
    # OPTION_* and SERVING_* follow the DJL Serving (LMI) naming convention;
    # they are kept here for the DJL variant further below.
    "OPTION_TRUST_REMOTE_CODE": "true",  # was listed twice in the original dict
    "OPTION_MODEL_LOADING_TIMEOUT": "3600",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_TENSOR_PARALLEL_DEGREE": "1",
    "OPTION_MAX_MODEL_LEN": "5000",
    "OPTION_ASYNC_MODE": "true",
    "SERVING_FAIL_FAST": "true",
}


In [4]:
%%writefile code/requirements.txt
transformers==4.57.0
torch
torchvision
torchaudio
pillow
requests

Overwriting code/requirements.txt


In [12]:
%%writefile code/inference.py
# This code comes from HuggingFace
# https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
import logging
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def model_fn(model_dir):

    model = Qwen3VLForConditionalGeneration.from_pretrained(
        model_dir,
        dtype=torch.float16,
        device_map="auto",
        attn_implementation="sdpa"
    )

    processor = AutoProcessor.from_pretrained(
        model_dir,
        trust_remote_code=True
    )

    return {"processor": processor, "model": model}


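def predict_fn(data, model_obj):
    processor = model_obj["processor"]
    model = model_obj["model"]

    # Fallback conversation, used when the request carries no "messages" field.
    default_messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
    # Assumed request schema: {"messages": [...], "max_new_tokens": int}.
    request = data if isinstance(data, dict) else {}
    messages = request.get("messages", default_messages)
    max_new_tokens = int(request.get("max_new_tokens", 128))

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    )
    inputs = inputs.to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated text is decoded.
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    logger.info("Generated text: %s", output_text)
    # Return the result so the endpoint sends it back; a handler that only prints returns None.
    return {"generated_text": output_text}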

Overwriting code/inference.py
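
`predict_fn` receives the deserialized request body as `data`. The sketch below shows one way a handler could turn that body into chat-template messages. The payload schema (`messages`, or `image` plus `prompt`) and the helper `build_messages` are assumptions made for illustration; the container does not define them.

```python
DEFAULT_PROMPT = "Describe this image."


def build_messages(data):
    """Hypothetical helper: turn a request payload into chat-template messages.

    Accepts either {"messages": [...]} (passed through unchanged) or
    {"image": <url>, "prompt": <text>}. This schema is an assumption.
    """
    if not isinstance(data, dict):
        raise ValueError("Request body must be a JSON object")
    if "messages" in data:
        return data["messages"]
    content = []
    if "image" in data:
        content.append({"type": "image", "image": data["image"]})
    content.append({"type": "text", "text": data.get("prompt", DEFAULT_PROMPT)})
    return [{"role": "user", "content": content}]


print(build_messages({"image": "https://example.com/cat.jpg", "prompt": "What is this?"}))
```

Keeping this logic in a small pure function makes it possible to unit test the request handling without loading the model.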


In [13]:
def filter_function(tarinfo):
    """Filter function to exclude .cache files and directories"""
    if '.cache' in tarinfo.name or '.gitattributes' in tarinfo.name:
        return None
    return tarinfo

In [14]:
from botocore.exceptions import ClientError

s3_client = boto3.client('s3')
key = f"{default_prefix}/model.tar.gz"
force_rebuild_tarball = True


def tarball_exists(bucket, object_key):
    """head_object raises ClientError for a missing key, so translate that into False."""
    try:
        s3_client.head_object(Bucket=bucket, Key=object_key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return False
        raise


if force_rebuild_tarball or not tarball_exists(bucket_name, key):
    try:
        model_path = snapshot_download(repo_id=model_id, local_dir="./model")
        print(f"Successfully downloaded to {model_path}")
    except Exception as e:
        print(f"Failed to download model: {e}")
        raise  # model_path would be undefined below, so stop here

    print("Building gzipped tarball...")
    with tarfile.open("./model.tar.gz", "w:gz") as tar:
        tar.add(model_path, arcname=".", filter=filter_function)
        tar.add("./code", filter=filter_function)
    print("Successfully tarred the ball.")

    print(f"Uploading tarball to {bucket_name}/{default_prefix}...")
    s3_client.upload_file("./model.tar.gz", bucket_name, key)
    # Uncomment to remove the local copies after the upload:
    # shutil.rmtree("./model")
    # os.remove("./model.tar.gz")
    print("Successfully uploaded.")

Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

Successfully downloaded to /home/sagemaker-user/sagemaker-genai-hosting-examples/01-models/Qwen3/Qwen3-VL/model
Building gzipped tarball...
Successfully tarred the ball.
Uploading tarball to sagemaker-us-east-1-329542461890/models/Qwen_Qwen3-VL-2B-Instruct...
Successfully uploaded, working directory cleaned
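
The Hugging Face inference container looks for custom code under `code/` at the root of `model.tar.gz`, next to the model files. The sketch below rebuilds the same layout from small placeholder files and lists the archive members, which is a cheap way to check the structure before uploading several gigabytes. The helper names `exclude_cache` and `list_archive` and the placeholder files are illustrative only.

```python
import os
import posixpath
import tarfile
import tempfile


def exclude_cache(tarinfo):
    """Same idea as filter_function above: skip cache and git metadata."""
    if ".cache" in tarinfo.name or ".gitattributes" in tarinfo.name:
        return None
    return tarinfo


def list_archive(model_dir, code_dir, out_path):
    """Build the archive, then return the normalized names of its files."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(model_dir, arcname=".", filter=exclude_cache)
        tar.add(code_dir, arcname="code", filter=exclude_cache)
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(posixpath.normpath(m.name) for m in tar.getmembers() if m.isfile())


# Placeholder files so the sketch runs without downloading the real model.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "model")
    code_dir = os.path.join(tmp, "code")
    os.makedirs(os.path.join(model_dir, ".cache"))
    os.makedirs(code_dir)
    for rel in ("model/config.json", "model/.cache/lock",
                "code/inference.py", "code/requirements.txt"):
        with open(os.path.join(tmp, *rel.split("/")), "w") as fh:
            fh.write("placeholder")
    members = list_archive(model_dir, code_dir, os.path.join(tmp, "model.tar.gz"))

print(members)
```

The listing should show the model files at the top level, the two files under `code/`, and nothing from `.cache`.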


## Deploy Model to SageMaker Endpoint

Now we'll deploy our model to a SageMaker endpoint for real-time inference. This is a significant step that:
1. Provisions the specified compute resources (G5 instance)
2. Deploys the model container
3. Sets up the endpoint for API access

### Deployment Configuration
- **Instance Count**: 1 instance for single-node deployment
- **Instance Type**: `ml.g5.4xlarge`, which provides a single NVIDIA A10G GPU with 24 GB of memory

> ⚠️ **Important**: 
> - Deployment can take up to 15 minutes
> - Monitor the CloudWatch logs for progress
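
`deploy()` blocks until the endpoint is ready by default. If the deployment is started without waiting, or checked from another session, the status can be polled through the SageMaker API. The following is a minimal sketch that assumes a boto3 SageMaker client is passed in; the helper `wait_for_endpoint` and its default intervals are illustrative. Container logs are written to the CloudWatch log group `/aws/sagemaker/Endpoints/<endpoint-name>`.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService or has failed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still {status} after {timeout_seconds}s")
        print(f"Endpoint status: {status}")
        time.sleep(poll_seconds)


# Usage (requires AWS credentials, so it is left commented out here):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)
```

boto3 also ships a built-in waiter, `get_waiter("endpoint_in_service")`, which covers the same case when no custom progress output is needed.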

In [15]:
# Create the Hugging Face model. HF_MODEL_ID and HF_TASK are already set in `env`.
huggingface_model = HuggingFaceModel(
    model_data=f"s3://{bucket_name}/{default_prefix}/model.tar.gz",
    transformers_version='4.49.0',
    pytorch_version='2.6.0',
    py_version='py312',
    env=env,
    role=role,
    entry_point="inference.py",
    enable_network_isolation=False
)

# Deploy the model to a SageMaker real-time endpoint,
# reusing the instance settings and endpoint name defined during setup.
predictor = huggingface_model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    endpoint_name=endpoint_name
)

-----------!
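
Once the endpoint is in service, it can be invoked through the returned `predictor`, which serializes Python dictionaries to JSON by default. The sketch below builds a request and shows the call. The request schema (`messages` and `max_new_tokens`) and the helper `build_request` are assumptions: they only have an effect if `predict_fn` in `inference.py` reads those fields and returns its result, and a handler that ignores its input will not reflect the request.

```python
import json


def build_request(image_url, prompt, max_new_tokens=128):
    """Hypothetical request schema: chat messages plus a generation limit."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_url},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_new_tokens": max_new_tokens,
    }


payload = build_request(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    "Describe this image.",
)
print(json.dumps(payload, indent=2))

# With a live endpoint (not run here):
# response = predictor.predict(payload)
# print(response)
```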

In [None]:
# Using DJL Serving
# UNDER CONSTRUCTION

# model = HuggingFaceModel(
#     model_data=f"s3://{bucket_name}/{default_prefix}/model.tar.gz",
#     image_uri=image_uri,
#     env=env,
#     role=role,
#     entry_point="inference.py",
#     enable_network_isolation=False
# )

# predictor = model.deploy(
#     initial_instance_count=instance_count,
#     instance_type=instance_type,
#     endpoint_name=endpoint_name
# )

# predictor.predict()

## Clean up

In [None]:
predictor.delete_endpoint()  # delete the endpoint first so the instance stops incurring charges
huggingface_model.delete_model()