# Qwen Models Getting Started Guide on Amazon Bedrock

This notebook provides a comprehensive introduction to using Qwen models on Amazon Bedrock, including how to leverage the familiar OpenAI SDK interface with Amazon Bedrock. We'll cover how to make API requests, explore available parameters and payload structures, and examine use cases for these advanced reasoning models. 

## Model Overview

### Qwen-3-235B-A22B (MoE)

**Parameters:** 235 billion total (Mixture-of-Experts model with 22B active per inference)

**Use Cases:** Advanced reasoning tasks, agentic use cases, instruction-following

**Key Features:**
- **Instruct/Non-Thinking Model**: Provides direct responses optimized for instruction-following
- **Architecture**: Mixture-of-Experts (MoE) design for optimal performance and compute efficiency
- **Enhanced Tool Calling**: Superior performance in agent-based tasks
- **Compute Efficient**: Only 22B parameters activated per inference despite 235B total parameters
- **Enterprise Ready**: Optimized for complex enterprise applications

### Qwen3-32B (Dense)

**Parameters:** 32 billion (Dense model with all parameters active)

**Use Cases:** General-purpose tasks, reasoning problems, conversational AI, enterprise applications

**Key Features:**
- **Hybrid Thinking Model**: Can operate in both thinking and non-thinking modes
- **Thinking Mode**: Carefully works through problems step-by-step with enhanced reasoning
- **Non-Thinking Mode**: Provides quick responses to straightforward questions
- **Consistent Performance**: All 32B parameters active during inference for robust performance
- **Instruction Following**: Excellent at following complex instructions
- **Conversational AI**: Strong conversational capabilities
- **Tool Calling**: Enhanced function calling capabilities

## Core Capabilities

Both Qwen models offer the following characteristics:

**Input/Output:** Text-in, text-out 

**Context Window:** 128,000 tokens  

**Model Type:** Reasoning-capable language models (hybrid thinking mode on Qwen3-32B; Qwen-3-235B-A22B is an instruct, non-thinking model)

**Languages:** English and Chinese

**Supported Regions:** see [here](https://docs.aws.amazon.com/bedrock/latest/userguide/models-regions.html)

**Tool Calling:** ✅ Supported (Enhanced capabilities)

**Bedrock Guardrails:** ✅ Supported

**Converse API:** ✅ Supported

**OpenAI Chat Completions API:** ✅ Supported

**Streaming:** ✅ Supported

**Model Evaluation:** ✅ Supported

**Agents:** ✅ Supported

**Prompt Management:** ✅ Supported

**Flows:** ✅ Supported

**Batch Inference:** ✅ Supported

**Knowledge Bases:** ✅ Supported

**Bedrock Studio:** ✅ Supported

## What You'll Learn in This Getting Started Guide

- Options for running inference with Qwen models on Amazon Bedrock, including:
    - Using the OpenAI SDK with Amazon Bedrock
    - Using Amazon Bedrock's InvokeModel API
    - Using Amazon Bedrock's Converse API
- Understanding request parameters and response structures
- Leveraging thinking vs non-thinking modes for different use cases
- Implementing enhanced tool calling capabilities
- Exploring reasoning capabilities with thinking mode
- Comparing performance between Qwen-3-235B-A22B (MoE) and Qwen3-32B (Dense) models


## Model Access on Amazon Bedrock

Ensure your IAM identity has the permissions required to access the Qwen models on Amazon Bedrock.

## IAM Permissions

To use Bedrock models, your AWS credentials need the following permissions:
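
As a minimal sketch, an identity policy for the inference calls in this guide could look like the one built below. It assumes that `bedrock:InvokeModel` and `bedrock:InvokeModelWithResponseStream` are the only actions needed (non-streaming and streaming inference respectively). The wildcard `Resource` is an illustrative assumption; in practice, scope it to the model ARNs you actually use.

```python
import json

# Hypothetical minimal policy for the inference calls in this guide.
# "Resource": "*" is an assumption for brevity; restrict it to model ARNs in practice.
bedrock_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream",
            ],
            "Resource": "*",
        }
    ],
}

# Serialize to the JSON document you would attach to a user or role
policy_json = json.dumps(bedrock_policy, indent=2)
print(policy_json)
```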


## Step 1: Environment Configuration

First, we need to install the required packages and tell the OpenAI SDK to talk to Bedrock instead of OpenAI's servers.

### Required Imports:
- `os` → For environment variables
- `boto3` → For native Bedrock API interactions  
- `json` → For JSON serialization/deserialization
- `datetime` → For timestamp tracking and performance measurements
- `openai` → For OpenAI SDK compatibility with Bedrock
- `strands` → For Amazon Strands agent framework
- `IPython.display` → For enhanced output formatting and streaming demonstrations

### Environment Variables:
We set three environment variables to redirect the OpenAI SDK:
- `AWS_BEARER_TOKEN_BEDROCK` → Your Bedrock API key  
- `OPENAI_API_KEY` → The same Bedrock API key, which the OpenAI SDK reads as its credential
- `OPENAI_BASE_URL` → Bedrock's OpenAI-compatible endpoint
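
The sketch below shows one way to derive the endpoint from a region name and to fail early when the key is missing. The helper names `bedrock_openai_base_url` and `require_api_key` are hypothetical, and the URL pattern is simply the one used in this guide with the region substituted.

```python
def bedrock_openai_base_url(region):
    """Build the OpenAI-compatible Bedrock endpoint for a region.

    Assumption: every region follows the URL pattern used in this guide.
    """
    return f"https://bedrock-runtime.{region}.amazonaws.com/openai/v1"


def require_api_key(env, name="AWS_BEARER_TOKEN_BEDROCK"):
    """Return the API key from a mapping such as os.environ, or raise if it is unset."""
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export your Bedrock API key first")
    return key


print(bedrock_openai_base_url("us-west-2"))
```

Reading the key from the environment (for example `require_api_key(os.environ)`) keeps the secret out of the notebook itself.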


In [1]:
%pip install boto3 openai ipython



In [2]:
import os
import boto3
import json
from openai import OpenAI
from datetime import datetime
from IPython.display import clear_output, display, display_markdown, Markdown


### Model IDs

- **qwen.qwen3-235b-a22b-2507-v1:0** (MoE instruct model, non-thinking)
- **qwen.qwen3-32b-v1:0** (Dense model, hybrid thinking)


In [3]:
# Model Configuration - Qwen Models
QWEN_MOE_MODEL_ID = "qwen.qwen3-235b-a22b-2507-v1:0"  # MoE instruct model (non-thinking)
QWEN_DENSE_MODEL_ID = "qwen.qwen3-32b-v1:0"  # Dense model (hybrid thinking)

print(f"✅ Using MoE model: {QWEN_MOE_MODEL_ID}")
print(f"✅ Using Dense model: {QWEN_DENSE_MODEL_ID}")


✅ Using MoE model: qwen.qwen3-235b-a22b-2507-v1:0
✅ Using Dense model: qwen.qwen3-32b-v1:0


In [4]:
# Set environment variables to point the OpenAI SDK at Bedrock
# Note: Change the region in the URL to match your preferred region
# Never commit a real API key; replace the placeholders below with your own key
os.environ["AWS_BEARER_TOKEN_BEDROCK"] = "<insert your bedrock API key>"
os.environ["OPENAI_API_KEY"] = "<insert your bedrock API key>"
os.environ["OPENAI_BASE_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com/openai/v1"

print("✅ Environment configured for Bedrock!")
print("📍 Using us-west-2 region - change the URL above to use a different region")


✅ Environment configured for Bedrock!
📍 Using us-west-2 region - change the URL above to use a different region


## Step 2: Inference with Amazon Bedrock

### Option 1: OpenAI SDK

#### Import and Initialize OpenAI Client

Now we use the **exact same OpenAI SDK** you're familiar with. The client will automatically read the environment variables we just set.

**Key Point**: This is the same OpenAI library, but now it's talking to Amazon Bedrock.


In [5]:
# Initialize both clients
# Note: Change region_name to match your preferred region
client = OpenAI()  # For chat completions API
bedrock_client = boto3.client('bedrock-runtime', region_name='us-west-2')  

print("✅ OpenAI client initialized (pointing to Bedrock)")
print(f"✅ Bedrock client initialized in region: {bedrock_client.meta.region_name}")
print("📍 Change region_name above to use a different supported region")


✅ OpenAI client initialized (pointing to Bedrock)
✅ Bedrock client initialized in region: us-west-2
📍 Change region_name above to use a different supported region


#### Make API Calls 

The API call structure is identical to OpenAI:
- Same `messages` format with `role` and `content`
- Same `model` parameter (but uses Bedrock model IDs)  
- Same `stream` parameter for real-time responses
- **New**: `reasoning_effort` parameter to control thinking vs non-thinking behavior (available for the Qwen3-32B Dense model)
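
A minimal sketch of how the request arguments could be assembled follows. It assumes, as the examples in this guide do, that `reasoning_effort` is sent only for the hybrid Qwen3-32B model. `build_chat_kwargs` is a hypothetical helper, not part of the OpenAI SDK.

```python
DENSE_MODEL_ID = "qwen.qwen3-32b-v1:0"  # hybrid thinking model from this guide


def build_chat_kwargs(model, messages, thinking=False, **extra):
    """Assemble keyword arguments for client.chat.completions.create().

    reasoning_effort is added only when thinking is requested AND the model
    is the dense model; for any other model the flag is silently ignored.
    """
    kwargs = {"model": model, "messages": messages, **extra}
    if thinking and model == DENSE_MODEL_ID:
        kwargs["reasoning_effort"] = "high"
    return kwargs


example = build_chat_kwargs(
    DENSE_MODEL_ID,
    [{"role": "user", "content": "What is 17 * 23?"}],
    thinking=True,
    temperature=0,
)
print(sorted(example))
```

The resulting dict can be passed as `client.chat.completions.create(**example)`.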


In [6]:
# Example 1: Qwen3-32B Dense model (quick response)
response = client.chat.completions.create(
    model=QWEN_DENSE_MODEL_ID,                 
    messages=[
        {"role": "system", "content": "You are a concise, highly logical assistant."},
        {"role": "user",   "content": "What is the largest city in the southern hemisphere?"}
    ],
    temperature=0,
    max_completion_tokens=1000,
    
)

# Extract and print the response text
print("🤖 Qwen3-32B Dense model response:")
print(response.choices[0].message.content)


🤖 Qwen3-32B Dense model response:
The largest city in the southern hemisphere is **São Paulo**, Brazil, by population.


In [11]:
# Example 2: Qwen3-32B Dense model with thinking mode (step-by-step reasoning)
response = client.chat.completions.create(
    model=QWEN_DENSE_MODEL_ID,                 
    messages=[
        {"role": "system", "content": "You are a helpful assistant that thinks through problems step by step."},
        {"role": "user",   "content": "If a train leaves station A at 60 mph and another leaves station B at 40 mph, and they are 200 miles apart, when will they meet?"}
    ],
    temperature=0,
    max_completion_tokens=2000,
    reasoning_effort='high'  # Thinking mode for complex reasoning (Dense model only)
)

# Extract and print the response text
print("🧠 Qwen3-32B Dense model with thinking mode response:")
print(response.choices[0].message.content)


🧠 Qwen3-32B Dense model with thinking mode response:
<reasoning>
Okay, so there's this problem about two trains leaving stations A and B, which are 200 miles apart. One is going at 60 mph and the other at 40 mph. The question is when they'll meet. Hmm, let me think.

First, I need to visualize the scenario. Let me imagine two stations, A and B, 200 miles apart. A train leaves A heading towards B at 60 mph, and another leaves B heading towards A at 40 mph. They're moving towards each other, right? So their speeds are adding up because they're approaching each other. That makes sense because if two objects move towards each other, their relative speed is the sum of their individual speeds.

So, if I add 60 mph and 40 mph, that gives me 100 mph. This combined speed means that the distance between them is decreasing at a rate of 100 miles per hour. Since they start 200 miles apart, I can figure out the time it takes for them to meet by dividing the total distance by their combined speed.



In [10]:
# Example 3: Qwen-3-235B-A22B MoE model without thinking mode (quick response)
response = client.chat.completions.create(
    model=QWEN_MOE_MODEL_ID,                 
    messages=[
        {"role": "system", "content": "You are a concise, highly logical assistant."},
        {"role": "user",   "content": "What is the capital of France?"}
    ],
    temperature=0,
    max_completion_tokens=1000,  # Non-thinking mode for quick responses
)

# Extract and print the response text
print("⚡ Qwen-3-235B-A22B MoE model without thinking mode response:")
print(response.choices[0].message.content)


⚡ Qwen-3-235B-A22B MoE model without thinking mode response:
The capital of France is Paris.


#### Process Streaming Response

Handle the response exactly like you would with OpenAI. Each `chunk` in the response is a piece of the model's output. Both Qwen models support streaming, and the Qwen3-32B Dense model supports it in both thinking and non-thinking modes.
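
Thinking-mode outputs in this guide wrap the trace in `<reasoning>` tags, and streamed output can contain several adjacent tagged spans. The sketch below separates the trace from the final answer once the full text has been accumulated. `split_reasoning` is a hypothetical helper, and it assumes every span is properly closed; an unclosed trailing span (for example, output cut off by the token limit) is left in the answer.

```python
import re

# Non-greedy match so adjacent <reasoning>…</reasoning> spans are captured separately
_REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from text containing <reasoning> spans."""
    reasoning = "".join(_REASONING_RE.findall(text))
    answer = _REASONING_RE.sub("", text).strip()
    return reasoning, answer


sample_text = "<reasoning>Plants take in </reasoning><reasoning>carbon dioxide.</reasoning>\nPlants make food from light."
trace, answer = split_reasoning(sample_text)
print(answer)
```

When streaming, collect the `delta.content` pieces into one string first, then call `split_reasoning` on the result.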


In [12]:
# Streaming with Qwen3-32B Dense model thinking mode
streaming_response = client.chat.completions.create(
    model=QWEN_DENSE_MODEL_ID,                 
    messages=[
        {"role": "system", "content": "You are a helpful assistant that thinks through problems step by step."},
        {"role": "user",   "content": "Explain how photosynthesis works in simple terms."}
    ],
    temperature=0,
    max_completion_tokens=1500,
    reasoning_effort='high',  # Enable thinking mode
    stream=True
)

# Extract and print the response text in real-time.
print("🧠 Streaming Qwen3-32B Dense model with thinking mode response:")
for chunk in streaming_response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")


🧠 Streaming Qwen3-32B Dense model with thinking mode response:
<reasoning>
Okay, so I need to explain how photosynthesis works in simple terms. Let me start by recalling what I know about photosynthesis. I remember that it's a process plants use to make their own food. But I'm a bit fuzzy on the details. Let me think.

First, plants take in something from the air</reasoning><reasoning>. I think it's carbon dioxide. Then they use sunlight, right? So sunlight is a key part of this process. They also need water, which they probably get from the soil through their roots. So water</reasoning><reasoning> comes up from the roots, and carbon dioxide comes in through the leaves, maybe through tiny holes called stomata?

Then there's the part about chlorophyll. I remember that's the green pigment in plants that helps them absorb sunlight. So</reasoning><reasoning> chlorophyll is in the chloroplasts of the plant cells. The sunlight energy is used to convert the carbon dioxide and water into somet

In [13]:
# Streaming with Qwen-3-235B-A22B MoE model
streaming_response = client.chat.completions.create(
    model=QWEN_MOE_MODEL_ID,                 
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "What are the benefits of renewable energy?"}
    ],
    temperature=0,
    max_completion_tokens=1500,
    stream=True
)

# Extract and print the response text in real-time.
print("🤖 Streaming Qwen-3-235B-A22B MoE model response:")
for chunk in streaming_response:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")


🤖 Streaming Qwen-3-235B-A22B MoE model response:
Renewable energy offers numerous benefits for the environment, economy, and society. Here are some of the key advantages:

1. **Environmental Protection**:
   - **Reduces Greenhouse Gas Emissions**: Renewable sources like solar, wind, and hydropower produce little to no carbon dioxide or other greenhouse gases, helping to combat climate change.
   - **Improves Air Quality**: Unlike fossil fuels, renewables don’t emit harmful pollutants such as sulfur dioxide, nitrogen oxides, or particulate matter, leading to cleaner air and better public health.
   - **Conserves Water Resources**: Most renewable energy technologies use significantly less water than fossil fuel power plants, especially important in water-scarce regions.

2. **Sustainability and Energy Security**:
   - **Inexhaustible Supply**: Renewable energy sources like sunlight, wind, and water are naturally replenished and won’t run out, unlike finite fossil fuels.
   - **Reduces De

#### What's Happening Behind the Scenes?

When you use the OpenAI SDK with Bedrock, your requests are automatically translated to Bedrock's native `InvokeModel` API.

#### Request Translation
- **OpenAI SDK Request** → **Bedrock InvokeModel** 
- The request body structure remains the same
- But there are some key differences in how parameters are handled:

| Parameter | OpenAI SDK | Bedrock InvokeModel |
|-----------|------------|-------------------|
| **Model ID** | In request body | Part of the URL path |
| **Streaming** | `stream=True/False` | Different API endpoints:<br/>• `InvokeModel` (non-streaming)<br/>• `InvokeModelWithResponseStream` (streaming) |
| **Request Body** | Full chat completions format | Same format, but `model` and `stream` are optional |
| **Thinking Mode** | `reasoning_effort='high'` to enable; omit to disable | Same `reasoning_effort` field in the request body (Qwen3-32B Dense model only) |
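
To make the table concrete, the sketch below maps a chat-completions style request dict onto the pieces an `InvokeModel` call needs. It illustrates the differences listed above and is not the actual translation Bedrock performs; `to_invoke_model_args` is a hypothetical helper.

```python
import json


def to_invoke_model_args(chat_request):
    """Map a chat-completions style request to InvokeModel-style arguments (illustrative)."""
    body = dict(chat_request)              # copy so the caller's dict is untouched
    model_id = body.pop("model")           # model ID moves from the body to the call itself
    streaming = body.pop("stream", False)  # streaming selects a different operation
    operation = "InvokeModelWithResponseStream" if streaming else "InvokeModel"
    return {"operation": operation, "modelId": model_id, "body": json.dumps(body)}


args = to_invoke_model_args({
    "model": "qwen.qwen3-32b-v1:0",
    "stream": True,
    "messages": [{"role": "user", "content": "Hello"}],
    "reasoning_effort": "high",
})
print(args["operation"], args["modelId"])
```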


#### Enhanced Function Calling with OpenAI SDK

Both Qwen models feature enhanced tool calling capabilities for superior performance in agent-based tasks. Let's demonstrate this with a weather lookup function.


In [14]:
def get_weather(location):
    """
    Get current weather for a given location.
    This is a mock function that returns sample weather data.
    
    Args:
        location (str): City and country, e.g. "Paris, France"
        
    Returns:
        dict: Weather information
    """
    # Mock weather data - in a real application, you'd call a weather API
    weather_data = {
        "Paris, France": {"temperature": "22°C", "condition": "Partly cloudy", "humidity": "65%"},
        "New York, USA": {"temperature": "18°C", "condition": "Sunny", "humidity": "45%"},
        "Tokyo, Japan": {"temperature": "25°C", "condition": "Rainy", "humidity": "80%"},
        "London, UK": {"temperature": "15°C", "condition": "Overcast", "humidity": "70%"},
        "Sydney, Australia": {"temperature": "28°C", "condition": "Clear", "humidity": "55%"}
    }
    
    return weather_data.get(location, {
        "temperature": "20°C", 
        "condition": "Data not available", 
        "humidity": "50%"
    })

# Define the function schema for OpenAI SDK
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current temperature and weather conditions for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and country, e.g. 'Paris, France'"
                }
            },
            "required": ["location"],
            "additionalProperties": False
        }
    }
}]

print("✅ Weather function and tools configuration ready!")


✅ Weather function and tools configuration ready!


In [15]:
def chat_with_functions(client, model, messages, tools, max_iterations=3, reasoning_effort='minimal'):
    """
    Chat with function calling support using OpenAI SDK format.
    
    Args:
        client: OpenAI client instance
        model: Model ID to use
        messages: List of conversation messages
        tools: List of available tools/functions
        max_iterations: Maximum number of function call iterations
        reasoning_effort: Reasoning effort for thinking mode; only sent for the Qwen3-32B Dense model
        
    Returns:
        Final assistant message
    """
    
    for iteration in range(max_iterations):
        print(f"🔄 Iteration {iteration + 1}")
        
        # Make request with tools
        request_params = {
            "model": model,
            "messages": messages,
            "tools": tools,
            "tool_choice": "auto"
        }
        
        # Add reasoning_effort only for dense model
        if reasoning_effort and "qwen3-32b-v1" in model:
            request_params["reasoning_effort"] = reasoning_effort
        
        response = client.chat.completions.create(**request_params)
        
        assistant_message = response.choices[0].message
        messages.append(assistant_message)
        
        # Check if the model wants to call functions
        if assistant_message.tool_calls:
            print(f"🔧 Model requested {len(assistant_message.tool_calls)} function call(s)")
            
            # Process each function call
            for tool_call in assistant_message.tool_calls:
                function_name = tool_call.function.name
                function_args = json.loads(tool_call.function.arguments)
                
                print(f"🔧 Calling function: {function_name}")
                print(f"🔧 Arguments: {function_args}")
                
                # Call the actual function
                if function_name == "get_weather":
                    function_result = get_weather(function_args["location"])
                    print(f"🔧 Function result: {function_result}")
                else:
                    function_result = {"error": f"Unknown function: {function_name}"}
                
                # Add function result to conversation
                function_message = {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "content": json.dumps(function_result)
                }
                messages.append(function_message)
                
        else:
            # No more function calls, return final response
            print("✅ No function calls requested, conversation complete")
            return assistant_message
    
    print("⚠️ Maximum iterations reached")
    return assistant_message

print("✅ Enhanced function calling handler ready!")


✅ Enhanced function calling handler ready!


In [None]:
# Test enhanced function calling with both models
weather_questions = [
    "What's the weather like in Paris today?",
    "Can you tell me the temperature in Tokyo?",
    "How's the weather in Sydney, Australia?",
    "What are the conditions like in New York?"
]

print("🌤️ Testing Enhanced Function Calling with Qwen Models")
print("=" * 60)

# Test with Qwen3-32B Dense model
print("\n🤖 Testing with Qwen3-32B Dense Model (with thinking mode)")
print("-" * 40)

for i, question in enumerate(weather_questions[:2], 1):  # Test first 2 questions
    print(f"\n📝 Test {i}: {question}")
    print("-" * 20)
    
    try:
        # Create conversation messages
        messages = [
            {"role": "system", "content": "You are a helpful weather assistant. Use the get_weather function to provide accurate weather information."},
            {"role": "user", "content": question}
        ]
        
        # Call the function calling handler with dense model
        final_response = chat_with_functions(
            client=client,
            model=QWEN_DENSE_MODEL_ID,
            messages=messages,
            tools=tools,
            reasoning_effort='high'  # Enable thinking mode for dense model
        )
        
        # Print the final response
        print("🤖 Final response:")
        print(final_response.content)
        
    except Exception as e:
        print(f"❌ Error: {str(e)}")
    
    print()

# Test with Qwen-3-235B-A22B MoE model
print("\n🧠 Testing with Qwen-3-235B-A22B MoE Model")
print("-" * 50)

for i, question in enumerate(weather_questions[2:], 1):  # Test last 2 questions
    print(f"\n📝 Test {i}: {question}")
    print("-" * 20)
    
    try:
        # Create conversation messages
        messages = [
            {"role": "system", "content": "You are a helpful weather assistant. Use the get_weather function to provide accurate weather information."},
            {"role": "user", "content": question}
        ]
        
        # Call the function calling handler with the MoE model (no thinking mode)
        final_response = chat_with_functions(
            client=client,
            model=QWEN_MOE_MODEL_ID,
            messages=messages,
            tools=tools,
        )
        
        # Print the final response
        print("🧠 Final response:")
        print(final_response.content)
        
    except Exception as e:
        print(f"❌ Error: {str(e)}")
    
    print()


#### What Just Happened with Enhanced Function Calling?

The enhanced function calling demonstration shows both Qwen models' improved capabilities:

1. **Enhanced Tool Recognition**: Both models have superior tool calling performance for agent-based tasks
2. **Hybrid Thinking Integration**: The dense model can reason through complex tool selection and usage when `reasoning_effort='high'`
3. **Improved Function Execution**: Better understanding of when and how to use available tools
4. **Multi-step Reasoning**: Both models can plan and execute complex multi-tool workflows
5. **Context Awareness**: Enhanced understanding of conversation context for better tool usage decisions

**Key Advantages of Qwen Models' Enhanced Tool Calling:**
- More accurate tool selection based on user intent
- Better handling of complex multi-step agent workflows  
- Improved reasoning about tool parameters and results
- Enhanced error handling and recovery in tool usage scenarios
- **MoE Model (235B)**: Optimized for direct instruction-following in agent tasks
- **Dense Model (32B)**: Hybrid thinking mode for complex reasoning and tool planning
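
As the number of tools grows, a name-to-callable registry can replace the `if`/`else` dispatch used in the handler above. The following is a minimal sketch under that assumption. `TOOL_REGISTRY`, `run_tool_call`, and `get_weather_stub` are hypothetical names, and the stub stands in for the mock weather function.

```python
import json


def get_weather_stub(location):
    """Stand-in for a real weather lookup; returns fixed placeholder data."""
    return {"location": location, "condition": "Data not available"}


# Map the tool names advertised to the model onto Python callables
TOOL_REGISTRY = {"get_weather": get_weather_stub}


def run_tool_call(name, arguments_json):
    """Execute one tool call and return a JSON string for the tool message content."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown function: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the function signature
        return json.dumps({"error": f"Bad arguments for {name}: {exc}"})


print(run_tool_call("get_weather", '{"location": "Paris, France"}'))
```

Returning errors as JSON rather than raising lets the model see the failure in the tool message and decide how to recover.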

### Option 2: Amazon Bedrock's InvokeModel API

The Bedrock InvokeModel API is the foundational interface for interacting directly with any model hosted on Amazon Bedrock. It provides low-level, flexible access to model inference, allowing you to send input data and receive generated responses in a consistent way across all supported models.

**Key Benefits:**
- Direct Access: Interact with any Bedrock model using a unified API endpoint.
- Fine-Grained Control: Customize inference parameters and payloads for each request.
- Streaming Support: Use `InvokeModelWithResponseStream` for real-time, token-by-token output.
- Privacy: Amazon Bedrock does not store your input or output data—requests are used only for inference.
- **Thinking Mode Control**: Direct control over the Qwen3-32B Dense model's thinking vs non-thinking behavior.


#### Setup client

First, we set up the Amazon Bedrock client.


In [18]:
# Configure region for Bedrock client
# Set `region` explicitly (e.g. "us-west-2") to override the environment
region = None

# Resolution order: explicit value, AWS_REGION, AWS_DEFAULT_REGION, then us-west-2
target_region = (
    region
    or os.environ.get("AWS_REGION")
    or os.environ.get("AWS_DEFAULT_REGION")
    or "us-west-2"
)

bedrock_runtime = boto3.client('bedrock-runtime', region_name=target_region)
print(f"📍 Using region: {target_region} - change the region variable above to use a different supported region")


📍 Using region: us-west-2 - change the region variable above to use a different supported region


#### Inference with InvokeModel API

Then we use the InvokeModel API to perform model inference with the two Qwen models.


In [19]:
def invoke_model(body, model_id, accept, content_type):
    """
    Invokes an Amazon Bedrock model to run inference
    using the input provided in the request body.
    
    Args:
        body (dict): The invocation body to send to Bedrock
        model_id (str): The model to query
        accept (str): MIME type expected for the response
        content_type (str): MIME type of the request body
    Returns:
        Inference response from the model.
    """

    try:
        response = bedrock_runtime.invoke_model(
            body=json.dumps(body), 
            modelId=model_id, 
            accept=accept, 
            contentType=content_type
        )

        return response

    except Exception:
        print(f"Couldn't invoke {model_id}")
        raise  # Re-raise with the original traceback intact


In [None]:
# Example with Qwen-3-235B-A22B MoE model
messages = [
    {"role": "system", "content": "You are a concise, highly logical assistant."},
    {"role": "user",   "content": "What is the largest city in the southern hemisphere?"}
]

body = {
    "messages": messages,
    "temperature": 0,
    "max_completion_tokens": 1000,
}

accept = "application/json"
contentType = "application/json"

response = invoke_model(body, QWEN_MOE_MODEL_ID, accept, contentType)
response_body = json.loads(response.get("body").read())

print("🤖 MoE model response (non-thinking):")
print(response_body['choices'][0]['message']['content'])


In [None]:
# Example with Qwen3-32B Dense model and thinking mode
messages = [
    {"role": "system", "content": "You are a concise, highly logical assistant."},
    {"role": "user",   "content": "What is the largest city in the southern hemisphere?"}
]

body = {
    "messages": messages,
    "temperature": 0,
    "max_completion_tokens": 1000,
    "reasoning_effort": "high"  # Enable thinking mode
}

accept = "application/json"
contentType = "application/json"

response = invoke_model(body, QWEN_DENSE_MODEL_ID, accept, contentType)
response_body = json.loads(response.get("body").read())

print("📝 Response:")
print(response_body['choices'][0]['message']['content'])

📝 Response:
<reasoning>
Okay, so I need to figure out what the largest city in the southern hemisphere is. Let me start by recalling what the southern hemisphere includes. It's all the countries and regions that are located south of the equator. That includes parts of South America, Africa, Australia, and Antarctica. But since we're talking about cities, Antarctica probably doesn't count because it doesn't have permanent residents.

Now, the largest city by population. I know that the largest cities in the world are usually in the northern hemisphere, like Tokyo, Delhi, Shanghai, etc. But I need to check if any of these are in the southern hemisphere. For example, Sydney is in Australia, which is in the southern hemisphere. But is Sydney the largest? I think Australia's largest city is Sydney, but maybe there are other cities in other southern hemisphere countries that are bigger.

Wait, Brazil is in the southern hemisphere. São Paulo is a huge city. I remember that São Paulo is one of

#### Streaming with InvokeModel API

The InvokeModel API has built-in streaming support through `InvokeModelWithResponseStream`. This is useful in user-facing applications because it reduces time to first token (TTFT) and, with it, the latency the end user perceives.
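
The TTFT measurement can be separated from the transport so it works with any stream of text pieces. The sketch below assumes the chunks have already been decoded to strings. `measure_stream` and `fake_stream` are hypothetical names, and the fake generator only stands in for a real response stream.

```python
import time


def measure_stream(chunks):
    """Consume an iterable of text chunks.

    Returns (full_text, seconds_to_first_chunk, chunk_count). Empty chunks are
    skipped, and seconds_to_first_chunk is None when nothing was received.
    """
    start = time.perf_counter()
    first = None
    parts = []
    for piece in chunks:
        if not piece:
            continue
        if first is None:
            first = time.perf_counter() - start
        parts.append(piece)
    return "".join(parts), first, len(parts)


def fake_stream():
    """Stand-in for a model response stream."""
    for piece in ["São ", "", "Paulo"]:
        yield piece


text, ttft, count = measure_stream(fake_stream())
print(text, count)
```

`time.perf_counter()` is used instead of `datetime.now()` because it is monotonic and intended for measuring short intervals.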

In [24]:
# Streaming with Qwen3-32B Dense model and thinking mode
messages = [
    {"role": "system", "content": "You are a concise, highly logical assistant."},
    {"role": "user",   "content": "What is the largest city in the southern hemisphere?"}
]

body = {
    "messages": messages,
    "temperature": 0,
    "max_completion_tokens": 1000,
    "reasoning_effort": 'high'  # Enable thinking mode for streaming
}

accept = "application/json"
contentType = "application/json"

start_time = datetime.now()

response = bedrock_runtime.invoke_model_with_response_stream(
    body=json.dumps(body), modelId=QWEN_DENSE_MODEL_ID, accept=accept, contentType=contentType
)
chunk_count = 0
time_to_first_token = None

# Process the response stream
stream = response.get("body")
if stream:
    print("🧠 Streaming thinking mode response:")
    for event in stream:
        chunk = event.get("chunk")
        if chunk:
            # Print the response chunk
            chunk_json = json.loads(chunk.get("bytes").decode())
            content_block_delta = chunk_json.get("choices")[0]["delta"].get("content")
            if content_block_delta:
                if time_to_first_token is None:
                    time_to_first_token = datetime.now() - start_time
                    print(f"Time to first token: {time_to_first_token}")

                chunk_count += 1
                print(content_block_delta, end="")
    print(f"\nTotal chunks: {chunk_count}")
else:
    print("No response stream received.")


🧠 Streaming thinking mode response:
Time to first token: 0:00:00.910533
<reasoning>
Okay, so I need to figure out what the largest city in the southern hemisphere is. Let me start by recalling what the southern hemisphere includes. It's all the countries and regions that are located</reasoning><reasoning> south of the equator. That includes parts of South America, Africa, Australia, and Antarctica. But since we're talking about cities, Antarctica probably doesn't count</reasoning><reasoning> because it doesn't have permanent residents.

Now, the largest city by population. I know that the largest cities in the world are usually in the northern hemisphere, like Tokyo, Delhi</reasoning><reasoning>, Shanghai, etc. But I need to check if any of these are in the southern hemisphere. Wait, Tokyo is in Japan, which is in the northern hemisphere. Delhi is in India, also northern. Shanghai is in China, northern</reasoning><reasoning>. So maybe the largest city in the southern hemisphere is one 

### Option 3: Amazon Bedrock's Converse API

The Bedrock Converse API provides a consistent interface for working with all Bedrock models that support messages. This means you can write your code once and use it across different models without changes. 

Key Benefits:
- Universal Interface: Same API structure works with Claude, Llama, Titan, and other models
- Model-Specific Parameters: Pass unique parameters when needed for specific models
- Privacy: Amazon Bedrock doesn't store any content you provide - data is only used for response generation
- Advanced Features: Built-in support for guardrails, tools/function calling, and prompt management
- Thinking Mode Support: Direct control over Qwen3-32B's thinking capabilities through `additionalModelRequestFields`

Additionally, the Converse API automatically separates the reasoning trace from the final response, giving developers the flexibility to show or hide the model's thinking process from end users based on their application needs.
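The separation described above can be sketched as a helper that walks the content blocks and sorts them by key rather than by position. This is an illustrative sketch: `split_converse_message` is a hypothetical name, and `sample_response` is a hand-written stand-in that mirrors the response structure printed later in this section, not real model output.

```python
def split_converse_message(response):
    """Return (reasoning_trace, final_answer) from a Converse API response dict."""
    reasoning_parts, answer_parts = [], []
    for block in response["output"]["message"]["content"]:
        if "reasoningContent" in block:
            reasoning_text = block["reasoningContent"].get("reasoningText", {})
            reasoning_parts.append(reasoning_text.get("text", ""))
        elif "text" in block:
            answer_parts.append(block["text"])
    return "".join(reasoning_parts), "".join(answer_parts)


# Hand-written stand-in for a thinking-mode response (not real model output).
sample_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {"reasoningContent": {"reasoningText": {"text": "Recall the average distance."}}},
                {"text": "About 384,400 km on average."},
            ],
        }
    }
}

reasoning, answer = split_converse_message(sample_response)
print(answer)  # show only the final answer to the end user
```

Looking blocks up by key means the same helper also works for non-thinking responses, where the reasoning trace is simply an empty string.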


In [27]:
# Converse API with Qwen-3-235B-A22B MoE model (no thinking mode)
response = bedrock_client.converse(
    modelId=QWEN_MOE_MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [{"text": "What is the capital of Australia?"}]
        }
    ],
    system=[{"text": "You are a concise, highly logical assistant."}],
    inferenceConfig={
        "temperature": 0,
        "maxTokens": 1000
    }
    # Note: additionalModelRequestFields is not needed for the MoE (instruct/non-thinking) model
)

# Final response (the MoE instruct model does not return a reasoning trace)
print("🤖 MOE model response:")
print(response['output']['message']['content'][0]['text'])


🤖 MOE model response:
The capital of Australia is Canberra.


In [28]:
# Converse API with Qwen3-32B Dense model and thinking mode
response = bedrock_client.converse(
    modelId=QWEN_DENSE_MODEL_ID,
    messages=[
        {
            "role": "user",
            "content": [{"text": "How far from earth is the moon?"}]
        }
    ],
    system=[{"text": "You are a concise, highly logical assistant."}],
    inferenceConfig={
        "temperature": 0,
        "maxTokens": 1000
    },
    additionalModelRequestFields={
        "reasoning_effort": "high"  # Enable thinking mode for the Qwen3-32B Dense model
    }
)

# Full list of content blocks returned by the model
content_blocks = response['output']['message']['content']
print("📝 Message dict:")
print(content_blocks)

# Reasoning trace (if available) and final response. Blocks are looked up by key
# rather than by position, so the cell still works when no trace is returned.
for block in content_blocks:
    if 'reasoningContent' in block:
        print("📝 Reasoning trace:")
        print(block['reasoningContent']['reasoningText']['text'])
    elif 'text' in block:
        print("📝 Final response:")
        print(block['text'])

📝 Message dict:
[{'reasoningContent': {'reasoningText': {'text': "\nOkay, the user is asking how far the Moon is from Earth. Let me start by recalling the average distance. I think it's around 384,400 kilometers. But wait, the Moon's orbit isn't a perfect circle, so the distance varies. The closest point is called perigee, and the farthest is apogee. I should mention both to be accurate.\n\nWait, what are the exact numbers for perigee and apogee? I remember that perigee is about 363,300 km and apogee is around 405,500 km. Let me double-check those numbers to make sure. Yeah, that's right. So the average is in between those two. \n\nAlso, the user might be interested in the unit of measurement. They asked for the distance in kilometers, but sometimes it's also expressed in miles. Maybe I should include both units for clarity. Let me convert 384,400 km to miles. 1 kilometer is approximately 0.621371 miles, so multiplying 384,400 by 0.621371 gives roughly 238,855 miles. \n\nI should also 

#### Streaming with Converse API

The Converse API comes with built-in streaming support. This is useful in user-facing applications because it reduces time to first token (TTFT) and, with it, the latency the end user perceives. For the Dense model in thinking mode, the streamed output contains both the reasoning trace and the final response. Depending on your application, you might want to hide the reasoning trace from the end user.
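As a rough sketch of hiding the trace while still measuring latency, the helper below consumes stream events, keeps reasoning and answer chunks apart, and records when the first answer chunk arrives. `collect_converse_stream` is a hypothetical helper, the timer starts when the helper is called (not when the request was sent), and `fake_events` is a hand-written stand-in for `response['stream']`.

```python
import time


def collect_converse_stream(events, show_reasoning=False):
    """Consume Converse stream events and separate reasoning from answer text."""
    start = time.perf_counter()
    seconds_to_first_answer_chunk = None
    reasoning_parts, answer_parts = [], []
    for event in events:
        # Events other than contentBlockDelta carry no text and are skipped.
        delta = event.get("contentBlockDelta", {}).get("delta", {})
        reasoning_chunk = delta.get("reasoningContent", {}).get("text")
        text_chunk = delta.get("text")
        if reasoning_chunk:
            reasoning_parts.append(reasoning_chunk)
            if show_reasoning:
                print(reasoning_chunk, end="")
        if text_chunk:
            if seconds_to_first_answer_chunk is None:
                seconds_to_first_answer_chunk = time.perf_counter() - start
            answer_parts.append(text_chunk)
            print(text_chunk, end="")
    return {
        "reasoning": "".join(reasoning_parts),
        "answer": "".join(answer_parts),
        "seconds_to_first_answer_chunk": seconds_to_first_answer_chunk,
    }


# Hand-written stand-in for response['stream'] (not real model output).
fake_events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"delta": {"reasoningContent": {"text": "Think first. "}}}},
    {"contentBlockDelta": {"delta": {"text": "Hello"}}},
    {"contentBlockDelta": {"delta": {"text": " world"}}},
    {"messageStop": {"stopReason": "end_turn"}},
]
result = collect_converse_stream(fake_events)
```

With a live client, the same helper could be fed `client.converse_stream(...)['stream']`. The reasoning text is still collected, so it can be logged or shown on demand even when `show_reasoning` is `False`.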

In [32]:
# Streaming through Converse API with Qwen3-32B Dense model with thinking mode
def bedrock_model_converse_stream_dense(client, system_prompt, user_prompt, max_tokens=1000, temperature=0, reasoning_effort='high'):
    response = client.converse_stream(
        modelId=QWEN_DENSE_MODEL_ID,
        messages=[  
            {
                "role": "user",
                "content": [
                    {
                        "text": user_prompt
                    }
                ]
            },                        
        ],
        system=[{"text": system_prompt}],
        inferenceConfig={
            "temperature": temperature,
            "maxTokens": max_tokens
        },
        additionalModelRequestFields={
            "reasoning_effort": reasoning_effort  # Enable thinking mode for Qwen3-32B Dense model
        }
    )
    # Extract and print the response text in real-time.
    for event in response['stream']:
        if 'contentBlockDelta' in event:
            chunk = event['contentBlockDelta']
            if chunk['delta'].get('reasoningContent', None):
                print(chunk['delta']['reasoningContent']['text'], end="")
            if chunk['delta'].get('text', None):
                print(chunk['delta']['text'], end="")
    return

In [33]:
# Streaming through Converse API with Qwen-3-235B-A22B MoE model (instruct/non-thinking)
def bedrock_model_converse_stream_moe(client, system_prompt, user_prompt, max_tokens=1000, temperature=0):
    response = client.converse_stream(
        modelId=QWEN_MOE_MODEL_ID,
        messages=[  
            {
                "role": "user",
                "content": [
                    {
                        "text": user_prompt
                    }
                ]
            },                        
        ],
        system=[{"text": system_prompt}],
        inferenceConfig={
            "temperature": temperature,
            "maxTokens": max_tokens
        }
        # Note: No additionalModelRequestFields needed for instruct model
    )
    # Extract and print the response text in real-time.
    for event in response['stream']:
        if 'contentBlockDelta' in event:
            chunk = event['contentBlockDelta']
            if chunk['delta'].get('text', None):
                print(chunk['delta']['text'], end="")
    return

In [35]:
# Example usage of streaming functions

print("\n\n🧠 Streaming with Qwen3-32B Dense model (thinking mode):")
bedrock_model_converse_stream_dense(
    client=bedrock_client,
    system_prompt="You are a helpful assistant that thinks through problems step by step.",
    user_prompt="Explain how a computer works in simple terms.",
    reasoning_effort='high'
)

print("\n\n🤖 Streaming with Qwen-3-235B-A22B MoE model (instruct/non-thinking):")
bedrock_model_converse_stream_moe(
    client=bedrock_client,
    system_prompt="You are a helpful assistant.",
    user_prompt="What are the benefits of renewable energy?"
)



🧠 Streaming with Qwen3-32B Dense model (thinking mode):

Okay, so I need to explain how a computer works in simple terms. Let me start by breaking down the main components of a computer. There's the CPU, memory, storage, input devices, output devices, and maybe the motherboard. But how do these parts work together?

First, the CPU is the brain of the computer. It processes instructions. But what does that mean exactly? It fetches instructions from memory, decodes them, and executes them. But how does it do that? Maybe I should mention binary, since computers use 0s and 1s. But I need to keep it simple, so maybe not go into too much detail about binary.

Memory, like RAM, is where the computer stores data and programs that are currently in use. When you open a program, it's loaded into RAM so the CPU can access it quickly. Storage, like a hard drive or SSD, is for long-term data. So when you shut down, the data in RAM is lost, but storage keeps it.

Input devices like a keyboard or mo

## Conclusion for Qwen Models

You've successfully explored **three powerful ways** to interact with Qwen models on Amazon Bedrock, including comprehensive tool use capabilities and thinking mode!

### Model Comparison Summary

| Feature | Qwen-3-235B-A22B (MoE) | Qwen3-32B (Dense) |
|---------|------------------------|-------------------|
| **Total Parameters** | 235B | 32B |
| **Active Parameters** | 22B per inference | 32B (all active) |
| **Thinking Mode** | ❌ Not supported (instruct/non-thinking) | ✅ Supported (hybrid thinking) |
| **Use Cases** | Advanced agent tasks, instruction-following | General-purpose, complex reasoning |
| **Compute Efficiency** | High (only 22B active) | Consistent (all 32B active) |
| **Performance** | Optimized for direct instruction-following | Optimized for reasoning and thinking |

### Key Benefits Achieved

✅ **Flexibility**: Three different API approaches for different use cases  
✅ **Performance**: Streaming support for improved user experience  
✅ **Familiarity**: Use existing OpenAI SDK patterns with AWS infrastructure  
✅ **Control**: Direct API access when you need fine-grained customization  
✅ **Consistency**: Universal interface that works across all Bedrock models  
✅ **Privacy**: Amazon Bedrock doesn't store your data; it is used only for inference  
✅ **Tool Integration**: Enhanced function calling capabilities across all three approaches  
✅ **Practical Comparison**: Side-by-side examples using the same function  
✅ **Thinking Mode**: Step-by-step reasoning capabilities for complex problems (Dense model)  
✅ **Architecture**: Both Mixture-of-Experts and Dense designs for different use cases  
✅ **Model Choice**: Instruct/non-thinking (MoE) and hybrid thinking (Dense) architectures  
✅ **Cross-Region Inference**: Support for multiple AWS regions  
✅ **Knowledge Bases**: Integration with Amazon Bedrock Knowledge Bases  
✅ **Bedrock Studio**: Full integration with Bedrock Studio for development

### What's Next?

You now know enough to choose the right API approach and model for your specific use case. Whether you need:
- The **simplicity** of the OpenAI SDK
- The **control** of InvokeModel
- The **consistency** of the Converse API
- **Enhanced tool use capabilities** for external integrations
- **Thinking mode** for complex reasoning tasks (Dense model)
- **Direct instruction-following** for agent-based tasks (MoE model)

you have the tools and examples to build powerful AI applications with Qwen models on Amazon Bedrock!