# Deploy a model on SageMaker Endpoint with SGLang container using AWS Python API (boto3)


In this notebook we deploy the `Qwen/Qwen3-VL-30B-A3B-Thinking` model to an Amazon SageMaker AI endpoint using an SGLang container.


***We assume that you have already built the SGLang container and pushed it to the ECR registry in your account.***

***If you need instructions on how to do this, please refer to the `README.md` in the parent directory.***


## Qwen3-VL-30B-A3B-Thinking
- **Parameters**: 31B (Mixture of Experts)
- **Instance Type**: ml.g6e.12xlarge
- **GPUs**: 4 NVIDIA L40S Tensor Core GPUs with 192 GB of total GPU memory (48 GB of memory per GPU)
- **Highlights**: High accuracy on complex visual reasoning; suited to medical imaging, scientific research, and advanced OCR.
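
As a rough sanity check on the instance choice, the sketch below estimates the weight footprint from the figures in the list above. It assumes BF16 weights (2 bytes per parameter) and an even split across the 4 GPUs under tensor parallelism. Both are assumptions rather than measured values, and the estimate ignores KV cache and activation memory.

```python
# Back-of-envelope weight memory estimate (assumes BF16, i.e. 2 bytes per parameter).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

total_gb = weight_memory_gb(31e9)  # 31B parameters, from the list above
per_gpu_gb = total_gb / 4          # assumes an even 4-way tensor-parallel split

print(f"weights: ~{total_gb:.0f} GB total, ~{per_gpu_gb:.1f} GB per GPU (of 48 GB)")
```

Under these assumptions the weights take well under half of each GPU's memory, which leaves room for the KV cache at the configured context length.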

## Model Capabilities 
- Advanced spatial perception (2D/3D reasoning)
- Multi-language OCR (32 languages)
- Visual agent functionality
- Video understanding with timestamps
- Visual coding (generating code from images and video)
- Context Length: 256K tokens (expandable to 1M)

In [None]:
%pip install sagemaker==2.245.0 --upgrade --quiet --no-warn-conflicts

In [None]:
import json
import boto3
import sagemaker
import time
from IPython.display import display, Markdown

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()

sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints
s3_client = boto3.client("s3")

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"boto3 version: {boto3.__version__}")
print(f"sagemaker version: {sagemaker.__version__}")

## Container

In [None]:
# Use the session's region so the image URI matches the region where the endpoint is created
inference_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/sglang:v0.5.4"

instance = {"type": "ml.g6e.12xlarge", "num_gpu": 4}

model_id = "Qwen/Qwen3-VL-30B-A3B-Thinking"
model_name = sagemaker.utils.name_from_base("model-sgl", short=True)
endpoint_config_name = model_name
endpoint_name = model_name

health_check_timeout = 600

env = {
    "OPTION_MODEL": model_id,
    "OPTION_CONTEXT_LENGTH": "32768",
    "OPTION_TENSOR_PARALLEL_SIZE": json.dumps(instance["num_gpu"]),
    "OPTION_TOOL_CALL_PARSER": "qwen",
    "OPTION_REASONING_PARSER": "qwen3",
}
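
The `OPTION_*` variables are presumably translated into SGLang server flags by the container's entrypoint script. The exact mechanism depends on how the image was built, so the sketch below only illustrates one plausible convention (strip the `OPTION_` prefix, lower-case the rest, replace underscores with dashes). `env_to_cli_args` is a hypothetical helper, not part of SGLang or SageMaker.

```python
def env_to_cli_args(env: dict[str, str], prefix: str = "OPTION_") -> list[str]:
    """Map OPTION_FOO_BAR=value to ['--foo-bar', 'value'] (assumed convention)."""
    args = []
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated environment variables
        flag = "--" + key[len(prefix):].lower().replace("_", "-")
        args.extend([flag, value])
    return args

# Illustrative subset of the env dict defined above
example_env = {
    "OPTION_CONTEXT_LENGTH": "32768",
    "OPTION_TENSOR_PARALLEL_SIZE": "4",
}
print(env_to_cli_args(example_env))
```

SageMaker expects every value in `Environment` to be a string, which is why the tensor-parallel size is serialized before it is placed in `env`.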

### Model -> Endpoint Config -> Endpoint

In [None]:
create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image,
        "Environment": env,
    }
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")

In [None]:
endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants = [
        {
            "VariantName": "alltraffic",
            "ModelName": model_name,
            "InstanceType": instance["type"],
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": health_check_timeout,
            "RoutingConfig": {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            },
        },
    ],
)

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName = endpoint_name, EndpointConfigName = endpoint_config_name
)
sess.wait_for_endpoint(endpoint_name)
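
`sess.wait_for_endpoint` comes from the SageMaker Python SDK. To stay with plain boto3, a minimal polling loop over `describe_endpoint` could look like the sketch below. The helper name, polling interval, and timeout are illustrative choices, and the loop treats any status other than `Creating` or `Updating` as a failure, which is a simplification suited to a fresh deployment.

```python
import time

def wait_until_in_service(client, endpoint_name, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_endpoint until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return desc
        if status not in ("Creating", "Updating"):
            # e.g. "Failed"; FailureReason is included when SageMaker reports one
            reason = desc.get("FailureReason", "n/a")
            raise RuntimeError(f"Endpoint {endpoint_name} is {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

Usage would be `wait_until_in_service(sm_client, endpoint_name)`. boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which covers the same case with less control over error reporting.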

## Inference Test

### Text inference

In [14]:
payload={
    "model": model_id,
    "messages": [
        {"role": "user", "content": "Name popular places to visit in London?"}
    ],
}

start_time = time.time()
res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                 Body = json.dumps(payload),
                                 ContentType = "application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

print(f"✅ Response time: {end_time-start_time:.2f}s")
display(Markdown(response["choices"][0]["message"]["content"]))

✅ Response time: 11.99s


Here's a curated list of **must-visit popular places in London**, balancing iconic landmarks, cultural treasures, and unique experiences, along with key details to help you plan:

### 🏰 **Top 10 Iconic Landmarks & Must-Sees**
1. **Buckingham Palace**  
   *Why visit:* The official residence of the monarch. **Don't miss:** The **Changing of the Guard** (check schedule; every morning except Tuesdays). See the State Rooms (book *months* ahead) or the Queen's Gallery exhibitions.  
   *Tip:* Free to view the exterior; gardens open seasonally.

2. **The Tower of London**  
   *Why visit:* Historic fortress, home to the Crown Jewels, and a site of royal executions. **Must-see:** The **Crown Jewels** (free entry with Tower ticket) and the chilling **Traitor's Gate**.  
   *Tip:* Book tickets online to skip queues; allow 3–4 hours. *Note:* It's not "Big Ben" – the tower is the **Elizabeth Tower** (housing the Great Bell).

3. **London Eye**  
   *Why visit:* Iconic 135m Ferris wheel on the South Bank with panoramic city views. **Best experience:** Sunset or evening rides (less crowded, lightsÁíÄÁí®).  
   *Tip:* Book online for a timed entry; avoid peak hours (10 AM‚Äì3 PM).

4. **Westminster Abbey**  
   *Why visit:* Coronation church since 1066 (where Charles III was crowned). **Highlights:** Poets' Corner, the Coronation Chair, and the tomb of the Unknown Warrior.  
   *Tip:* Book tickets early; join a guided tour to appreciate the history.

5. **St. Paul's Cathedral**  
   *Why visit:* Stunning Baroque architecture (wonder of Sir Christopher Wren). **Key moment:** Climb to the **Whispering Gallery** (acoustic trick!) or the **Golden Gallery** (360° views).  
   *Tip:* Free entry; donations appreciated; avoid Sunday services if possible.

### 🎨 **Cultural & Museum Highlights**
6. **The British Museum**  
   *Why visit:* One of the world's greatest museums (free entry!). **Must-see:** The Rosetta Stone, Parthenon Sculptures, and Egyptian mummies.  
   *Tip:* Focus on 3–4 key galleries (e.g., Ancient Egypt, Mesopotamia) to avoid overwhelm.

7. **National Gallery (Trafalgar Square)**  
   *Why visit:* Home to 2,300+ European paintings (13th–19th century). **Highlights:** Van Gogh’s *Sunflowers*, da Vinci’s *The Virgin of the Rocks*, and Turner’s seascapes.  
   *Tip:* Free entry; open until 8 PM on Wednesdays – ideal for evening visits.

8. **Tate Modern (Bankside)**  
   *Why visit:* World’s largest modern art museum. **Standout:** The **Blind Light** exhibition space (spectacular views from the 10th floor).  
   *Tip:* Free entry; the Turbine Hall installations change regularly (book timed tickets for busy days).

### 🌳 **Green Spaces & Scenic Spots**
9. **Hyde Park**  
   *Why visit:* London’s largest royal park (142 hectares). **Key areas:** Serpentine Lake (rent a paddle boat), Speaker’s Corner (free speech tradition), and Kensington Gardens.  
   *Tip:* Rent a bike or boat to explore; combine with nearby Kensington Palace.

10. **Covent Garden**  
    *Why visit:* Historic piazza with street performers, luxury shops, and vibrant dining. **Top stops:** Royal Opera House (tour), Neal’s Yard (colorful market), and the Apple Store’s dome.  
    *Tip:* Visit early to avoid crowds; explore the hidden courtyards.

### 💡 **Pro Tips for Your Visit**
- **Transport:** Use the **Tube** (Oyster/Contactless cards) – avoid rush hour (7–10 AM, 4–7 PM).  
- **Booking:** **Essential** for palace/tower visits (e.g., Tower of London, Buckingham Palace State Rooms).  
- **Time Management:** Prioritize based on your interests:  
  - *History buffs:* Tower of London, Westminster Abbey.  
  - *Art lovers:* National Gallery, Tate Modern.  
  - *Photographers:* London Eye (evening), St. Paul’s dome.  
- **Hidden Gem:** **Notting Hill** (carnival in August, colorful streets) or **Camden Market** (edgy shopping, street food).  
- **Avoid:** Overpaying for "Big Ben" photos – the tower is *not* open to the public!  

### 🌆 **Why London?**
London’s magic lies in its **layered history** – from Roman walls (Borough Market) to modern skyscrapers (The Shard). Pair iconic sights with local neighborhoods (like Brick Lane for curry, or the South Bank for street art) for a richer experience.  

> ✅ **Final advice:** Download the **Citymapper app** for real-time Tube routes, and wear **comfortable shoes** – you’ll walk 10+ miles!  

Let me know if you'd like a **3-day itinerary**, **family-friendly options**, or **budget tips**! 😊
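
Because the endpoint is started with `OPTION_REASONING_PARSER` set, the model's thinking trace is expected to be separated from the final answer. In OpenAI-compatible responses from SGLang this typically appears as a `reasoning_content` field next to `content`; treat that field name as an assumption and check it against your container version. A defensive way to read both fields, with `split_reasoning` as a hypothetical helper:

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from an OpenAI-style chat completion dict."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""  # may be missing or None
    answer = message.get("content") or ""
    return reasoning, answer

# Hand-written sample for illustration only (not real endpoint output)
sample = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "The user wants sights in London.",
                "content": "Buckingham Palace, the Tower of London, ...",
            }
        }
    ]
}
reasoning, answer = split_reasoning(sample)
print(f"reasoning: {len(reasoning)} chars, answer: {len(answer)} chars")
```

Rendering only `answer` with `display(Markdown(...))` keeps the notebook output short, while `reasoning` remains available for debugging.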

### Image Understanding

Let's ask the model what is displayed in the image below.

In [15]:
from IPython.display import Image as IPyImage
IPyImage(url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg", height=400, width=400)

In [16]:
# 🧪 Test 2: Image Understanding
print("🖼️ Testing Image Understanding...")

image_request = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What do you see in this image? Describe it in detail."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
                    }
                }
            ]
        }
    ],
    "temperature": 0.7
}

# Measure inference time
start_time = time.time()
res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                 Body = json.dumps(image_request),
                                 ContentType = "application/json")
response = json.loads(res["Body"].read().decode("utf8"))
end_time = time.time()

# Display results
print("✅ Image Understanding Test Completed")
print(f"   Response Time: {end_time - start_time:.2f} seconds")
print()

# Render the response
display(Markdown("**Image Analysis:**"))
display(Markdown(response["choices"][0]["message"]["content"]))

🖼️ Testing Image Understanding...
✅ Image Understanding Test Completed
   Response Time: 4.54 seconds



**Image Analysis:**

The image depicts a **Pallas’s cat** (also known as the *manul*), a small wild feline native to Central Asian grasslands and steppes, set against a wintry backdrop. Here’s a detailed breakdown:  

### 1. The Animal (Pallas’s Cat)  
- **Physical Appearance**:  
  - It has a **stocky, compact body** with short legs, giving it a “plump” silhouette.  
  - Its **fur is thick and dense**, adapted for cold climates, with a mix of **tawny-brown, gray, and black hues**. Distinctive **dark stripes** run across its face (from the nose to the cheeks) and down its body, characteristic of the species.  
  - Snow dusts its fur, especially on the back and head, indicating recent snowfall or a cold environment.  
  - Its **face is round** with small, rounded ears, and its expression appears calm but alert. The eyes are partially squinted, possibly due to the cold or light.  
  - One paw is lifted mid-step, suggesting movement across the snow.  

- **Posture/Action**: The cat is walking on a snow-covered surface, with its body angled slightly to the left.  


### 2. The Environment  
- **Ground**: The surface is blanketed in **fresh, white snow**, with subtle texture variations (e.g., uneven patches, faint tracks). A small twig or piece of debris lies in the bottom-left corner of the snow.  
- **Background**:  
  - **Birch Trees**: Behind the cat, there are slender birch tree trunks with **pale white bark** marked by dark, irregular patches (typical of birch trees). Snow clings to the tree trunks, emphasizing the cold weather.  
  - **Fence**: To the left, a **wire mesh fence** is visible, suggesting the scene might be in a controlled environment like a zoo or wildlife reserve (rather than a completely wild setting).  


### 3. Atmosphere and Context  
- The overall mood is **cold and serene**, conveyed by the snow, the cat’s thick fur, and the muted, wintry tones.  
- The lighting is natural (likely daylight), with soft illumination that highlights the cat’s fur texture and the snow’s sheen.  


This image captures the Pallas’s cat in its adapted, cold-climate habitat, emphasizing its unique physical traits and the wintry environment it inhabits.
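
The request above passes a public URL. The same `image_url` field also accepts a base64 data URL, which is useful for local files. The sketch below wraps both cases in one hypothetical helper, `build_image_message`; guessing the MIME type from the file extension with `mimetypes` is an illustrative choice, not something the endpoint requires.

```python
import base64
import mimetypes
from pathlib import Path

def build_image_message(prompt: str, image_ref: str) -> dict:
    """Build a user message with text plus an image given as a URL or local path."""
    if image_ref.startswith(("http://", "https://", "data:")):
        url = image_ref  # already a URL, pass it through unchanged
    else:
        # Local file: embed the bytes as a base64 data URL
        mime, _ = mimetypes.guess_type(image_ref)
        encoded = base64.b64encode(Path(image_ref).read_bytes()).decode("utf-8")
        url = f"data:{mime or 'application/octet-stream'};base64,{encoded}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": url}},
        ],
    }
```

A payload can then be assembled as `{"messages": [build_image_message("Describe this image.", image_ref)], "temperature": 0.7}` and sent with `invoke_endpoint` exactly as in the cell above.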

### OCR example

This example is reused from this [repo](https://github.com/aws-samples/sample-qwen-on-aws/blob/main/Qwen3-VL/qwen3-vl-vllm-sagemaker-byoc/deploy_qwen3_vl_all_models.ipynb).

In [17]:
from IPython.display import Image as IPyImage
IPyImage(url="invoice.png", height=400, width=400)

In [None]:
# 🧪 Test 3: OCR Capabilities
print("📊 Testing OCR Capabilities...")

import base64
from pathlib import Path

# Function to encode image to base64
def encode_image_to_base64(image_path: str) -> str:
    """Encode an image file to base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


# Encode the image
image_path = "invoice.png"
if Path(image_path).exists():
    base64_image = encode_image_to_base64(image_path)

    local_image_request = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract all the text you can read from this image, and generate response in JSON format"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}"
                        }
                    }
                ]
            }
        ],
        "temperature": 0.0
    }

    # Measure inference time
    start_time = time.time()
    res = smr_client.invoke_endpoint(EndpointName = endpoint_name,
                                     Body = json.dumps(local_image_request),
                                     ContentType = "application/json")
    response = json.loads(res["Body"].read().decode("utf8"))
    end_time = time.time()

    # Display results
    print("✅ OCR Test Completed")
    print(f"   Response Time: {end_time - start_time:.2f} seconds")
    print()

    # Print the response
    print(response["choices"][0]["message"]["content"])
else:
    print("⚠️ Image file not found. Please provide a valid image path.")

📊 Testing OCR Capabilities...
✅ OCR Test Completed
   Response Time: 7.86 seconds

{
  "invoice_title": "INVOICE",
  "issued_to": {
    "name": "Richard Sanchez",
    "company": "Thynk Unlimited",
    "address": "123 Anywhere St., Any City"
  },
  "invoice_no": "01234",
  "date": "11.02.2030",
  "due_date": "11.03.2030",
  "pay_to": {
    "bank": "Borcele Bank",
    "account_name": "Adeline Palmerston",
    "account_no": "0123 4567 8901"
  },
  "items": [
    {
      "description": "Brand consultation",
      "unit_price": "100",
      "qty": "1",
      "total": "$100"
    },
    {
      "description": "logo design",
      "unit_price": "100",
      "qty": "1",
      "total": "$100"
    },
    {
      "description": "Website design",
      "unit_price": "100",
      "qty": "1",
      "total": "$100"
    },
    {
      "description": "Social media templates",
      "unit_price": "100",
      "qty": "1",
      "total": "$100"
    },
    {
      "description": "Brand photography",
  

## Cleanup

In [19]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_config_name)
sess.delete_model(model_name)
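
The cell above uses the SageMaker SDK session helpers. An equivalent with the plain boto3 client, in keeping with the rest of this notebook, might look like the following sketch. `cleanup` is a hypothetical helper name, and the function does no error handling.

```python
def cleanup(client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, its endpoint config, and the model, in that order."""
    client.delete_endpoint(EndpointName=endpoint_name)
    client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    client.delete_model(ModelName=model_name)
```

Usage would be `cleanup(sm_client, endpoint_name, endpoint_config_name, model_name)`. Deleting the endpoint is the step that releases the GPU instance, so it is worth running even if the later calls fail.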