# 🚀 Deploy Qwen3 30B A3B Model on Amazon SageMaker AI

## Introduction: [Qwen3 30B A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B)

`Qwen3-30B-A3B` is part of the [latest generation of Qwen language models](https://qwenlm.github.io/blog/qwen3/), featuring a Mixture-of-Experts (MoE) architecture with:

- **Total Parameters**: 30.5B parameters
- **Activated Parameters**: 3.3B parameters (approximately 10% of total)
- **Architecture Details**:
    - 48 layers
    - 32 attention heads for queries and 4 for key/values (GQA)
    - 128 total experts with 8 activated experts
    - Native context length of 32,768 tokens (expandable to 131,072 with YaRN)
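
The ratios implied by these figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers listed above; the variable names are illustrative and not part of any library.

```python
# Figures copied from the list above (parameter counts in billions).
total_params_b = 30.5
active_params_b = 3.3
query_heads, kv_heads = 32, 4
total_experts, active_experts = 128, 8
native_ctx, extended_ctx = 32_768, 131_072

active_fraction = active_params_b / total_params_b  # share of weights used per token
gqa_group_size = query_heads // kv_heads             # query heads sharing one KV head
expert_fraction = active_experts / total_experts     # share of experts routed per token
yarn_factor = extended_ctx / native_ctx              # context extension ratio with YaRN

print(f"Active parameters: {active_fraction:.1%}")
print(f"Query heads per KV head: {gqa_group_size}")
print(f"Experts active per token: {expert_fraction:.2%}")
print(f"Context extension factor: {yarn_factor:.0f}x")
```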

### Key Features
1. Hybrid Thinking Modes:

- Thinking Mode: Enables step-by-step reasoning for complex problems
- Non-Thinking Mode: Provides quick responses for simpler queries
- Seamless switching between modes for optimal performance

2. Strong Capabilities:

- Advanced reasoning and problem-solving
- Excellent instruction following
- Enhanced agent capabilities for tool integration
- Support for 119+ languages and dialects

3. Model Architecture:

- MoE architecture enabling efficient parameter usage
- Only activates ~10% of parameters during inference
- Optimized for both performance and computational efficiency
  
This model represents a significant advancement in open-source language models, offering competitive performance while maintaining efficient resource utilization through its MoE architecture. It's particularly well-suited for deployment in production environments where both performance and cost efficiency are crucial considerations.
Let's get started deploying one of the most capable open-source reasoning models available today!

In [2]:
%pip install -Uq sagemaker boto3 huggingface_hub --force-reinstall --no-cache-dir --quiet --no-warn-conflicts

Note: you may need to restart the kernel to use updated packages.


In [None]:
import json
import sagemaker
import boto3
import sys
import time
from typing import List, Dict
from datetime import datetime
from sagemaker.huggingface import (
    HuggingFaceModel, 
    get_huggingface_llm_image_uri
)

boto_region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session(boto_session=boto3.Session(region_name=boto_region))
role = sagemaker.get_execution_role()
sagemaker_client = boto3.client("sagemaker")
sagemaker_runtime_client = boto3.client("sagemaker-runtime")
s3_client = boto3.client("s3")

model_bucket = sagemaker_session.default_bucket()  # bucket to house artifacts
s3_model_prefix = (
    "hf-large-models/model_qwen3"  # folder within bucket where code artifact will go
)
prefix = sagemaker.utils.unique_name_from_base("DEMO")
print(f"prefix: {prefix}")

## Setup your SageMaker Real-time Endpoint 
### Create a SageMaker endpoint configuration

We begin by creating the endpoint configuration and setting `MinInstanceCount` to 0. This allows the endpoint to scale in all the way down to zero instances when not in use. See the [notebook example for SageMaker AI endpoint scale down to zero](https://github.com/aws-samples/sagemaker-genai-hosting-examples/tree/02236395d44cf54c201eefec01fd8da0a454092d/scale-to-zero-endpoint).

There are a few parameters we want to set up for our endpoint. We start by setting the variant name and the instance type we want our endpoint to use. We also set *model_data_download_timeout_in_seconds* and *container_startup_health_check_timeout_in_seconds* as guardrails for when we deploy inference components to our endpoint. In addition, we use Managed Instance Scaling, which allows SageMaker to scale the number of instances based on the scaling requirements of your inference components. We set *MinInstanceCount* and *MaxInstanceCount* to size the endpoint for the workload you want to serve while keeping control over cost. Lastly, we set *RoutingStrategy* for the endpoint to tune how requests are routed to instances and inference components for the best performance.

The suggested instance types to host the Qwen3 30B A3B model are `ml.g5.24xlarge`, `ml.g6.12xlarge`, and `ml.g6e.12xlarge`.
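
As a rough sizing check, the sketch below compares the weight footprint with the aggregate GPU memory of each suggested instance type. It assumes BF16 weights (2 bytes per parameter) and 4 GPUs per instance with 24 GB each on g5 and g6 and 48 GB each on g6e. These hardware figures are assumptions to verify against the EC2 instance documentation, and the estimate ignores KV cache and activation memory.

```python
PARAMS_BILLIONS = 30.5
BYTES_PER_PARAM = 2  # assumption: BF16/FP16 weights

# Assumed (GPU count, GB per GPU) for each suggested instance type.
GPU_SPECS = {
    "ml.g5.24xlarge": (4, 24),
    "ml.g6.12xlarge": (4, 24),
    "ml.g6e.12xlarge": (4, 48),
}

# Billions of parameters times bytes per parameter gives gigabytes.
weights_gb = PARAMS_BILLIONS * BYTES_PER_PARAM


def headroom_gb(instance_type: str) -> float:
    """GPU memory left after loading the weights, under the assumptions above."""
    gpus, gb_each = GPU_SPECS[instance_type]
    return gpus * gb_each - weights_gb


for name in GPU_SPECS:
    print(f"{name}: ~{weights_gb:.0f} GB weights, ~{headroom_gb(name):.0f} GB headroom")
```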

In [None]:
# Set a unique endpoint config name
endpoint_config_name = f"{prefix}-endpoint-config"
print(f"Demo endpoint config name: {endpoint_config_name}")

# Set variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.g5.24xlarge"
model_data_download_timeout_in_seconds = 3600
container_startup_health_check_timeout_in_seconds = 3600

min_instance_count = 0 # Minimum instance must be set to 0
max_instance_count = 3

sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": min_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)
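
Because the scaling bounds are easy to mistype, it can help to assemble the variant definition in a small function that validates them first. `build_production_variant` is a hypothetical helper, not part of boto3; it only builds the same dictionary that is passed to `create_endpoint_config` above.

```python
from typing import Any, Dict


def build_production_variant(
    variant_name: str,
    instance_type: str,
    min_instances: int,
    max_instances: int,
    timeout_seconds: int = 3600,
) -> Dict[str, Any]:
    """Assemble one ProductionVariants entry after basic sanity checks."""
    if min_instances < 0:
        raise ValueError("min_instances must be >= 0")
    if max_instances < max(min_instances, 1):
        raise ValueError("max_instances must be >= 1 and >= min_instances")
    return {
        "VariantName": variant_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": 1,
        "ModelDataDownloadTimeoutInSeconds": timeout_seconds,
        "ContainerStartupHealthCheckTimeoutInSeconds": timeout_seconds,
        "ManagedInstanceScaling": {
            "Status": "ENABLED",
            "MinInstanceCount": min_instances,
            "MaxInstanceCount": max_instances,
        },
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }


variant = build_production_variant("AllTraffic", "ml.g5.24xlarge", 0, 3)
print(variant["ManagedInstanceScaling"])
```

The returned dictionary can be passed as the single element of `ProductionVariants`.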

### Create the SageMaker endpoint
Next, we create our endpoint using the endpoint config above.

In [None]:
# Set a unique endpoint name
endpoint_name = f"{prefix}-endpoint"
print(f"Demo endpoint name: {endpoint_name}")

sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

In [None]:
sagemaker_session.wait_for_endpoint(endpoint_name)

## Deploy using Amazon SageMaker Large Model Inference (LMI) container 
In this example we are going to use the LMI v15 container powered by vLLM 0.8.4 with support for the vLLM V1 engine. This version supports the latest open-source models, such as Meta’s Llama 4 models Scout and Maverick, Google’s Gemma 3, Alibaba’s Qwen, Mistral AI, DeepSeek-R, and many more. You can find more details on the LMI v15 container in [the blog here](https://aws.amazon.com/blogs/machine-learning/supercharge-your-llm-performance-with-amazon-sagemaker-large-model-inference-container-v15/).



### Create Model Artifact
We will deploy the Qwen3 30B A3B model using the LMI container. To do so, you need to set the image you would like to use with the proper configuration. You can also create a SageMaker model to be referenced when you create your inference component.

#### Download the model from Hugging Face and upload the model artifacts on Amazon S3
In this example, we demonstrate how to download your copy of the model from Hugging Face, upload it to an S3 location in your AWS account, and then deploy the model to an endpoint using the uploaded artifacts.

First, download the model artifacts from Hugging Face.


In [None]:
from huggingface_hub import snapshot_download
from pathlib import Path

qwen3_30B = "Qwen/Qwen3-30B-A3B"

# Download the model into the directory where the Jupyter notebook is running
local_model_path = Path(".")
local_model_path.mkdir(exist_ok=True)
model_name = qwen3_30B
# Only download config, tokenizer, and checkpoint files
allow_patterns = ["*.json", "*.safetensors", "*.bin", "*.txt"]

# Use snapshot_download to fetch the model, since the repository stores the weights with Git LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)
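
Before uploading, it can be useful to confirm how much data was actually downloaded. The sketch below sums file sizes under a directory using only the standard library; `directory_size_gb` is a hypothetical helper name, not part of `huggingface_hub`.

```python
from pathlib import Path


def directory_size_gb(path) -> float:
    """Total size of all regular files under `path`, in gigabytes (1e9 bytes)."""
    # stat() follows symlinks, so linked files count with their real size.
    total_bytes = sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())
    return total_bytes / 1e9
```

For example, `print(f"{directory_size_gb(model_download_path):.1f} GB")` reports the size of the snapshot downloaded above.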

In [None]:
# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")

Upload the model data to S3.

In [None]:
model_artifact = sagemaker_session.upload_data(path=model_download_path, key_prefix=s3_model_prefix)
print(f"Model uploaded to --- > {model_artifact}")
print(f"We will set option.s3url={model_artifact}")

In [None]:
# optional
# !rm -rf {model_download_path}

To find out more about the SageMaker `create_model` API call, see [the boto3 doc](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html). Note that you can use **CompressionType** to specify how the model data is prepared.

If you choose `Gzip` and choose `S3Object` as the value of `S3DataType`, `S3Uri` identifies an object that is a gzip-compressed TAR archive. SageMaker will attempt to decompress and untar the object during model deployment.

If you choose `None` and `S3Prefix` as the value of `S3DataType`, then for each S3 object under the key name prefix referenced by `S3Uri`, SageMaker will trim its key by the prefix, and use the remainder as the path (relative to `/opt/ml/model`) of the file holding the content of the S3 object. SageMaker will split the remainder by slash (/), using intermediate parts as directory names and the last part as filename of the file holding the content of the S3 object.
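
The key-to-path mapping described above can be illustrated with a short sketch. `local_path_for_key` is a hypothetical helper that mimics the described behavior for illustration; it is not a SageMaker API, and the bucket name in the example is made up.

```python
import posixpath


def local_path_for_key(s3_uri: str, object_key: str, root: str = "/opt/ml/model") -> str:
    """Map an S3 object key to its path in the container for an S3Prefix source."""
    # Split "s3://bucket/prefix/" into bucket and key prefix.
    _bucket, _, key_prefix = s3_uri[len("s3://"):].partition("/")
    if not object_key.startswith(key_prefix):
        raise ValueError(f"{object_key!r} is not under prefix {key_prefix!r}")
    # Trim the prefix, then use the remaining parts as directories and filename.
    remainder = object_key[len(key_prefix):].lstrip("/")
    return posixpath.join(root, *remainder.split("/"))


uri = "s3://example-bucket/hf-large-models/model_qwen3/"
print(local_path_for_key(uri, "hf-large-models/model_qwen3/config.json"))
print(local_path_for_key(uri, "hf-large-models/model_qwen3/sub/dir/weights.bin"))
```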


In [None]:
# Define region where you have capacity
REGION = boto_region

#Select the latest container. Check the link for the latest available version https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers 
CONTAINER_VERSION = '0.33.0-lmi15.0.0-cu128'

# Construct container URI
container_uri = f'763104351884.dkr.ecr.{REGION}.amazonaws.com/djl-inference:{CONTAINER_VERSION}'

pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
qwen3_model = {
    "Image": container_uri,
    'ModelDataSource': {
                'S3DataSource': {
                    'S3Uri': pretrained_model_location,
                    'S3DataType': 'S3Prefix',
                    'CompressionType': 'None',
                }
            },
    "Environment": {
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "MESSAGES_API_ENABLED": "true",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "8",
        "OPTION_MODEL_LOADING_TIMEOUT": "1500",
        "SERVING_FAIL_FAST": "true",
        "OPTION_ROLLING_BATCH": "disable",
        "OPTION_ASYNC_MODE": "true",
        "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
        "OPTION_ENABLE_STREAMING": "true"
    },
}
model_name_qwen3 = f"qwen3-30b-lmi-{datetime.now().strftime('%y%m%d-%H%M%S')}"
# create SageMaker Model
sagemaker_client.create_model(
    ModelName=model_name_qwen3,
    ExecutionRoleArn=role,
    Containers=[qwen3_model],
)

We can now create the inference component, which will be deployed on the endpoint that you specify. Note that you can provide either a SageMaker model or a container specification. If you provide a container, you need to supply an image and an artifact URL as parameters. In this example we set it to the model name we prepared in the cells above. You can also set `ComputeResourceRequirements` to tell SageMaker what should be reserved for each copy of the inference component, and set the copy count to the number of copies you would like to deploy. These can be managed and scaled as the capabilities become available.

Note that in this example we set `NumberOfAcceleratorDevicesRequired` to `4`. This reserves 4 accelerators for each copy of the inference component so that we can use tensor parallelism.
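
To see how the accelerator reservation interacts with the copy count and `MaxInstanceCount`, the sketch below estimates how many instances a given number of copies would need. It assumes accelerators are the binding constraint and that each instance exposes 4 of them (as assumed here for `ml.g5.24xlarge`); `instances_needed` is a hypothetical helper, not a SageMaker API.

```python
import math


def instances_needed(copy_count: int, accelerators_per_copy: int, accelerators_per_instance: int) -> int:
    """Instances required to place `copy_count` copies, counting accelerators only."""
    copies_per_instance = accelerators_per_instance // accelerators_per_copy
    if copies_per_instance == 0:
        raise ValueError("One copy does not fit on a single instance")
    return math.ceil(copy_count / copies_per_instance)


# With 4 accelerators reserved per copy, each copy occupies a whole instance.
print(instances_needed(copy_count=1, accelerators_per_copy=4, accelerators_per_instance=4))
print(instances_needed(copy_count=3, accelerators_per_copy=4, accelerators_per_instance=4))
```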

In [None]:
inference_component_name_qwen = f"{prefix}-IC-qwen3-30b-{datetime.now().strftime('%y%m%d-%H%M%S')}"
variant_name = "AllTraffic"

sagemaker_client.create_inference_component(
    InferenceComponentName=inference_component_name_qwen,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name_qwen3,
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 4,
            "NumberOfCpuCoresRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

Wait until the inference component is `InService`.

In [52]:
import time
# Let's see how long it takes
start_time = time.time()
while True:
    desc = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name_qwen
    )
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
total_time = time.time() - start_time
print(f"\nTotal time taken: {total_time:.2f} seconds ({total_time/60:.2f} minutes)")

Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
Creating
InService

Total time taken: 1146.90 seconds (19.11 minutes)
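
The polling loop above runs until a terminal status appears. A variant with an upper time bound is sketched below; `wait_for_status` is a hypothetical helper that takes any zero-argument callable returning the current status, so it makes no assumptions about the client beyond what the loop above already uses.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    terminal: Iterable[str] = ("InService", "Failed"),
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `get_status` until it returns a terminal status or the timeout passes."""
    terminal_set = set(terminal)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_set:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

In this notebook, `get_status` could be `lambda: sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name_qwen)["InferenceComponentStatus"]`.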


#### Invoke endpoint with boto3
Now you can invoke the endpoint with the boto3 `invoke_endpoint` or `invoke_endpoint_with_response_stream` runtime API calls. If you have an existing endpoint, you can follow the example below to invoke it by endpoint name.

Note that, according to the [Qwen3 Hugging Face page description](https://huggingface.co/Qwen/Qwen3-30B-A3B), Qwen3 has thinking capabilities enabled by default, similar to QwQ-32B. This means the model will use its reasoning abilities to enhance the quality of generated responses. In this mode, the model generates thinking content wrapped in a \<think\>...\</think\> block, followed by the final response. For thinking mode, use Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 (the default setting in generation_config.json).

It also allows a hard switch to strictly disable the model's thinking behavior, aligning its functionality with the previous Qwen2.5-Instruct models. This mode is particularly useful in scenarios where disabling thinking is essential for enhancing efficiency. For non-thinking mode, we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.

**Advanced Usage**: You can also switch between `Thinking` and `Non-Thinking` modes via user input.

Qwen3 provides a soft switch mechanism that allows users to dynamically control the model's behavior when `enable_thinking=True`. Specifically, you can add `/think` and `/no_think` to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
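
The mode-specific sampling settings and the `/no_think` soft switch can be bundled into one request builder. This is a minimal sketch: `build_chat_request` and `RECOMMENDED_SAMPLING` are hypothetical names, and `min_p` is left out because this notebook does not confirm that the container accepts it.

```python
from typing import Any, Dict

# Sampling settings quoted in the text above, keyed by thinking mode.
RECOMMENDED_SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
}


def build_chat_request(user_message: str, thinking: bool = True, max_tokens: int = 512) -> Dict[str, Any]:
    """Build a messages-style request body for the chosen mode."""
    # Soft switch: append /no_think to the user turn to skip reasoning.
    content = user_message if thinking else f"{user_message} /no_think"
    return {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        **RECOMMENDED_SAMPLING[thinking],
    }


print(build_chat_request("How many R are in STRAWBERRY?", thinking=False))
```

The resulting dictionary can be serialized with `json.dumps` and sent as the request `Body`.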


In [53]:
import boto3
import json
sagemaker_runtime = boto3.client('sagemaker-runtime')

prompt = {
    'messages':[
    {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"}
],
    'temperature':0.7,
    'top_p':0.8,
    'top_k':20,
    'max_tokens':512,
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_qwen,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

<think>
Okay, let's see. The question is asking how many R's are in the word "STRAWBERRY." Hmm, I need to count the letters R in that word.

First, I should write out the word to make sure I have it right: S-T-R-A-W-B-E-R-R-Y. Let me check each letter one by one. 

Starting with S – that's not an R. Then T – nope. Next is R. That's one R. Then A, W, B, E. Then the next letter is R again. That's the second R. Then another R. Wait, so after E, it's R, R, Y. So that's two more R's? Wait, let me go through it again step by step.

Breaking down the word:

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

So positions 3, 8, and 9 are R's. That's three R's. Wait, but sometimes people might miscount. Let me check again. The word is S-T-R-A-W-B-E-R-R-Y. So after the E, there's R, R, Y. So that's two R's there. Plus the R in the third position. So total of three R's. 

But wait, maybe I'm miscounting. Let me spell it out again: S-T-R-A-W-B-E-R-R-Y. So the letters are S, T, R, A, W, B, E, R

In [54]:
# Soft switch to no thinking
prompt = {
    'messages':[
    {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short! /no_think"}
],
    'temperature':0.7,
    'top_p':0.8,
    'top_k':20,
    'max_tokens':512,
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name_qwen,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

<think>

</think>

There is **1 R** in "STRAWBERRY".
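
Since responses may contain a `<think>...</think>` block before the answer, applications often want the two parts separately. The sketch below is one way to split them with a regular expression; `split_thinking` is a hypothetical helper, and treating an unclosed block as pure reasoning (as happens when `max_tokens` cuts the output short) is a design assumption.

```python
import re
from typing import Tuple

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> Tuple[str, str]:
    """Return (thinking, answer) extracted from a model response."""
    match = _THINK_RE.search(text)
    if match:
        return match.group(1).strip(), text[match.end():].strip()
    if "<think>" in text:
        # Closing tag missing: the response was cut off mid-reasoning.
        return text.split("<think>", 1)[1].strip(), ""
    return "", text.strip()


thinking, answer = split_thinking("<think>\n\n</think>\n\nSome final answer.")
print(repr(thinking), repr(answer))
```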


#### Streaming response from the endpoint
Additionally, the LMI container allows you to invoke the endpoint and receive a streaming response. Below is an example of how to interact with the endpoint using a streaming response.

In [55]:
import io
import json

# Example class that processes an inference stream:
class SmrInferenceStream:
    
    def __init__(self, sagemaker_runtime, endpoint_name, inference_component_name=None):
        self.sagemaker_runtime = sagemaker_runtime
        self.endpoint_name = endpoint_name
        self.inference_component_name = inference_component_name
        # A buffered I/O stream to combine the payload parts:
        self.buff = io.BytesIO() 
        self.read_pos = 0
        
    def stream_inference(self, request_body):
        # Gets a streaming inference response 
        # from the specified model endpoint:
        response = self.sagemaker_runtime\
            .invoke_endpoint_with_response_stream(
                EndpointName=self.endpoint_name, 
                InferenceComponentName=self.inference_component_name,
                Body=json.dumps(request_body), 
                ContentType="application/json"
        )
        # Gets the EventStream object returned by the SDK:
        event_stream = response['Body']
        for event in event_stream:
            # Passes the contents of each payload part
            # to be concatenated:
            self._write(event['PayloadPart']['Bytes'])
            # Iterates over lines to parse whole JSON objects:
            for line in self._readlines():
                try:
                    resp = json.loads(line)
                except ValueError:
                    # Skip blank or non-JSON lines
                    continue
                if isinstance(resp, dict):
                    choices = resp.get('choices') or []
                    if not choices:
                        continue
                    # Some chunks may carry no content (e.g. role-only or final chunks)
                    part = choices[0].get('delta', {}).get('content') or ''
                else:
                    part = resp
                # Returns parts incrementally:
                yield part
    
    # Writes to the buffer to concatenate the contents of the parts:
    def _write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    # The JSON objects in buffer end with '\n'.
    # This method reads lines to yield a series of JSON objects:
    def _readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Leave a trailing partial line in the buffer until its '\n' arrives
            if not line.endswith(b'\n'):
                break
            self.read_pos += len(line)
            yield line[:-1]
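
The buffering idea can be exercised offline, without an endpoint. The sketch below feeds simulated payload parts into a small buffer and shows that a JSON object split across two parts is only parsed once its newline has arrived. `LineBuffer` and the sample chunks are hypothetical stand-ins for illustration, assuming newline-delimited JSON chunks shaped like the ones parsed above.

```python
import io
import json


class LineBuffer:
    """Accumulates bytes and yields only complete, newline-terminated lines."""

    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, chunk: bytes) -> None:
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(chunk)

    def complete_lines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            if not line.endswith(b"\n"):
                break  # partial line: wait for more bytes
            self.read_pos += len(line)
            yield line[:-1]


# Simulated payload parts: the first JSON object is split across two parts.
chunks = [
    b'{"choices": [{"delta": {"content": "Hel',
    b'lo"}}]}\n{"choices": [{"delta": {"content": " world"}}]}\n',
]

buffer = LineBuffer()
pieces = []
for chunk in chunks:
    buffer.write(chunk)
    for line in buffer.complete_lines():
        pieces.append(json.loads(line)["choices"][0]["delta"]["content"])

text = "".join(pieces)
print(text)
```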

In [56]:
request_body = {
    'messages':[
        {"role": "user", "content": "How many R are in STRAWBERRY? Keep your answer and explanation short!"},
    ],
    'temperature':0.9,
    'max_tokens':512,
    'stream': True,
}

smr_inference_stream = SmrInferenceStream(
    sagemaker_runtime, endpoint_name, inference_component_name_qwen)
stream = smr_inference_stream.stream_inference(request_body)
for part in stream:
    print(part, end='')

<think>
Okay, let's see. The question is asking how many R's are in the word "STRAWBERRY." Alright, first I need to spell out the word correctly. Let me write it down: S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me check again. S-T-R-A-W-B-E-R-R-Y. Yeah, that's how it's spelled. Now, I need to count the letter R in there.

Let me go through each letter one by one. Starting with S – that's not an R. Then T – nope. Next is R. That's one R. Then A, W, B, E. None of those are R. Then the next letter is R again. So that's two R's. Then another R? Wait, after E comes R, then another R, and then Y. So let me break it down:

1. S
2. T
3. R (1)
4. A
5. W
6. B
7. E
8. R (2)
9. R (3)
10. Y

Wait, so after E, there are two R's? Let me check the spelling again. STRAWBERRY. The correct spelling is S-T-R-A-W-B-E-R-R-Y. So yes, after the E, there are two R's before the Y. So that's three R's in total? Wait, no. Let me count again:

S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y

## Cleanup
  
Make sure to delete the endpoint and other artifacts that were created to avoid unnecessary cost. You can also go to the SageMaker AI console to delete all the resources created in this example.

In [None]:
sagemaker_client.delete_inference_component(InferenceComponentName=inference_component_name_qwen)
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
# Also remove the SageMaker model created earlier
sagemaker_client.delete_model(ModelName=model_name_qwen3)