# Deploy Qwen3-Next-80B-A3B-Instruct on Amazon SageMaker

This notebook demonstrates how to deploy the Qwen3-Next-80B-A3B-Instruct model on Amazon SageMaker using a custom vLLM container.

## Model Overview
- **Model**: Qwen3-Next-80B-A3B-Instruct
- **Parameters**: 80B total (3B activated)
- **Context Length**: 262K tokens natively (extensible to roughly 1M tokens with YaRN scaling)
- **Architecture**: Hybrid Attention + High-Sparsity MoE (512 experts, 10 activated)

## Prerequisites

1. AWS CLI configured with appropriate permissions
2. SageMaker execution role with necessary permissions
3. Service quota for `ml.g6e.12xlarge` endpoint usage in your region
4. Custom Docker image pushed to ECR using the `build_and_push.sh` script

In [2]:
# !pip install -r code/requirements.txt

In [13]:
!pip install -U strands-agents strands-agents-tools
!pip install pydantic==2.11.7 openai  # Compatible version
!pip install mypy-boto3-sagemaker-runtime  # Type stubs for SageMaker

Collecting openai
  Downloading openai-1.109.1-py3-none-any.whl.metadata (29 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.4.0 (from openai)
  Downloading jiter-0.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)
Downloading openai-1.109.1-py3-none-any.whl (948 kB)
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Downloading jiter-0.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (348 kB)
Installing collected packages: jiter, distro, openai
Successfully installed distro-1.9.0 jiter-0.11.0 openai-1.109.1


In [None]:
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
import json
import time
from datetime import datetime

import base64
import requests
import subprocess
import os
from typing import Dict, List, Any, Optional

# Initialize SageMaker session and get execution role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role for SageMaker operations
bucket = "your-s3-bucket"  # S3 bucket for storing model artifacts

# Get AWS account and region information
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']
region = boto3.Session().region_name

print(f"AWS Account ID: {account_id}")
print(f"AWS Region: {region}")
print(f"SageMaker role: {role}")
print(f"S3 bucket: {bucket}")

In [3]:
# Configuration for custom container
repository_name = "qwen-vllm-byoc"
image_tag = "latest"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}:{image_tag}"

print(f"Custom container will be built and pushed to:")
print(f"Repository: {repository_name}")
print(f"Image URI: {image_uri}")

Custom container will be built and pushed to:
Repository: qwen-vllm-byoc
Image URI: 459006231907.dkr.ecr.us-west-2.amazonaws.com/qwen-vllm-byoc:latest
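Before deploying, it can help to confirm that the image tag actually exists in ECR, since a missing image otherwise only shows up later as a failed endpoint. The following is a minimal sketch, assuming the image was pushed to the repository and tag configured above. `ecr_image_exists` is a hypothetical helper name, not part of this notebook's code, and the client is passed in so the function can be exercised without AWS access.

```python
def ecr_image_exists(ecr_client, repository_name: str, image_tag: str) -> bool:
    """Return True if the tag exists in the ECR repository, False otherwise."""
    try:
        ecr_client.describe_images(
            repositoryName=repository_name,
            imageIds=[{"imageTag": image_tag}],
        )
        return True
    except (
        ecr_client.exceptions.ImageNotFoundException,
        ecr_client.exceptions.RepositoryNotFoundException,
    ):
        # Either the repository or the requested tag is missing
        return False


# Usage in this notebook (requires AWS credentials):
# import boto3
# ecr = boto3.client("ecr", region_name=region)
# print(ecr_image_exists(ecr, repository_name, image_tag))
```

If the check returns `False`, rerunning `build_and_push.sh` is the likely fix.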


## Configuration

In [None]:
# 🎯 Model and Endpoint Configuration
model_name = "qwen3-next-80b-a3b-instruct"
endpoint_name = f"{model_name}-{int(time.time())}"  # Unique endpoint name with timestamp

# 🖥️ Instance Configuration
instance_type = "ml.g6e.12xlarge"  # 4x NVIDIA L40S GPUs (48 GB each, 192 GB GPU memory total)
instance_count = 1                  # Single instance deployment

# 🗄️ Storage Configuration  
prefix = "qwen3-next-deployment"              # S3 prefix for organization

print(f"🚀 Deployment Configuration:")
print(f"   Model Name: {model_name}")
print(f"   Endpoint Name: {endpoint_name}")
print(f"   Instance Type: {instance_type}")
print(f"   Instance Count: {instance_count}")
print(f"   S3 Bucket: {bucket}")
print(f"   S3 Prefix: {prefix}")
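The instance choice can be sanity-checked with rough arithmetic. The sketch below assumes BF16 weights (2 bytes per parameter) and 48 GB of memory per L40S GPU; both are assumptions for illustration, and `estimate_weight_memory_gb` is a hypothetical helper. Although only about 3B parameters are active per token, all 80B must be resident in GPU memory.

```python
def estimate_weight_memory_gb(total_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in decimal GB (billions of params x bytes per param)."""
    return total_params_billion * bytes_per_param


# Assumptions: BF16 weights (2 bytes each), 48 GB per L40S GPU
weights_gb = estimate_weight_memory_gb(80)  # all 80B parameters must be loaded
total_gpu_gb = 4 * 48                       # ml.g6e.12xlarge has 4 GPUs
headroom_gb = total_gpu_gb - weights_gb     # left for KV cache, activations, overhead

print(f"Weights: ~{weights_gb:.0f} GB of {total_gpu_gb} GB GPU memory")
print(f"Headroom: ~{headroom_gb:.0f} GB")
```

Under these assumptions the weights take about 160 GB of the 192 GB available, leaving roughly 32 GB for everything else. Usable headroom in practice is likely smaller, so long-context requests may be limited by KV cache capacity.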

## Step 1: Create SageMaker Model Configuration

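Next, we create the SageMaker model configuration. It points SageMaker at our custom vLLM container; `model_data` is left as `None` because no model artifacts are uploaded to S3 and the container downloads the weights from Hugging Face at startup (see Step 2).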

In [6]:
# Create SageMaker model configuration
print("🏗️ Creating SageMaker model configuration...")

model = Model(
    image_uri=image_uri,                        # Custom vLLM container
    model_data=None,                                 # No S3 artifacts; weights are downloaded at startup
    role=role,                                       # SageMaker execution role
    name=f"{model_name}-model-{int(time.time())}",   # Unique model name
    sagemaker_session=sagemaker_session,
    env={
        # 🚀 vLLM Configuration
        'VLLM_USE_V1': '1',                          # Use vLLM v1 engine (required for Qwen3-Next)
        'VLLM_WORKER_MULTIPROC_METHOD': 'spawn',     # Process spawning method
        'VLLM_DISTRIBUTED_EXECUTOR_BACKEND': 'mp',   # Multi-processing backend
        'VLLM_LOGGING_LEVEL': 'INFO',                # Logging level
        
        # 🖥️ GPU Configuration  
        'CUDA_VISIBLE_DEVICES': '0,1,2,3',           # Use all 4 GPUs
        'TORCH_CUDA_ARCH_LIST': '8.9',               # NVIDIA L40S compute capability
        
        # 🗂️ Cache Directories
        'MODEL_CACHE_DIR': '/opt/ml/model',          # Model cache location
        'TRANSFORMERS_CACHE': '/tmp/transformers_cache',  # HuggingFace cache
        'HF_HOME': '/tmp/hf_home',                   # HuggingFace home directory
        
        # 🌐 Server Configuration
        'SAGEMAKER_BIND_TO_PORT': '8080',            # Internal server port
        'SAGEMAKER_BIND_TO_HOST': '0.0.0.0',         # Bind to all interfaces
        
        # 🔧 Optimization Settings
        'NCCL_DEBUG': 'INFO',                        # NCCL debugging (for multi-GPU)
        'TORCH_COMPILE_DISABLE': '1',                # Disable PyTorch compilation
        'VLLM_DISABLE_CUSTOM_ALL_REDUCE': '1',       # Disable custom all-reduce (stability)
    }
)

print(f"✅ SageMaker model created successfully!")

🏗️ Creating SageMaker model configuration...
✅ SageMaker model created successfully!
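SageMaker's `CreateModel` API takes the container environment as a map of strings to strings, which is why numeric settings such as the port are quoted in the `env` dictionary above. A small pre-flight check can catch accidental integers or booleans before deployment. This is a sketch only; `find_non_string_env` is a hypothetical helper.

```python
def find_non_string_env(env: dict) -> list:
    """Return the keys whose key or value is not a str."""
    return [
        key
        for key, value in env.items()
        if not isinstance(key, str) or not isinstance(value, str)
    ]


# Example with one deliberately wrong entry (port given as an int)
example_env = {"VLLM_USE_V1": "1", "SAGEMAKER_BIND_TO_PORT": 8080}
bad_keys = find_non_string_env(example_env)
print(f"Non-string entries: {bad_keys}")
```

Running the same check on the dictionary passed to `Model(env=...)` should return an empty list.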


## Step 2: Deploy Model to SageMaker Endpoint

This step creates and deploys the model to a real-time inference endpoint. The deployment includes:

- **Model Loading**: Downloading Qwen3-Next-80B-A3B-Instruct from HuggingFace
- **vLLM Initialization**: Setting up the inference engine with tensor parallelism
- **Health Checks**: Ensuring the endpoint is ready for inference

**⏱️ Expected Time**: 10-15 minutes

In [7]:
# Deploy the model to a SageMaker endpoint
print(f"🚀 Starting deployment to endpoint: {endpoint_name}")
print(f"⏱️  Estimated Time: 10-15 minutes")

# Start deployment with extended startup and download timeouts
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
    container_startup_health_check_timeout=1200,  # 20 minutes for container startup
    model_data_download_timeout=1200,             # 20 minutes for model download
    wait=True,          # Wait for deployment to complete
)

print(f"✅ Deployment completed successfully!")

🚀 Starting deployment to endpoint: qwen3-next-80b-a3b-instruct-1758721723
⏱️  Estimated Time: 10-15 minutes
--------------!✅ Deployment completed successfully!
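If the notebook kernel restarts during deployment, or `wait=False` is used, the endpoint status can be polled separately. The sketch below uses the `describe_endpoint` call from boto3's SageMaker client. `wait_for_endpoint` and its default intervals are illustrative assumptions, and the `sleep` parameter exists only so the loop can be tested without real delays.

```python
import time

# Statuses after which polling stops
TERMINAL_STATUSES = {"InService", "Failed", "OutOfService"}


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint reaches a terminal status."""
    waited = 0
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status in TERMINAL_STATUSES:
            if status == "Failed":
                print("Failure reason:", desc.get("FailureReason", "unknown"))
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still {status} after {waited}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in this notebook (requires AWS credentials):
# import boto3
# sm_client = boto3.client("sagemaker", region_name=region)
# print(wait_for_endpoint(sm_client, endpoint_name))
```

boto3 also ships a built-in `endpoint_in_service` waiter; the explicit loop is shown here because it surfaces the failure reason directly.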


In [8]:
# Reconnect to an existing endpoint (e.g. after a kernel restart) instead of reusing the predictor returned by deploy()

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

print(f"🔗 Connected to endpoint: {endpoint_name}")

🔗 Connected to endpoint: qwen3-next-80b-a3b-instruct-1758721723
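Applications that do not use the SageMaker Python SDK can call the same endpoint through boto3's `sagemaker-runtime` client. The sketch below assumes the container accepts and returns JSON, as the `JSONSerializer` and `JSONDeserializer` settings above imply. `invoke_chat` is a hypothetical helper name.

```python
import json


def invoke_chat(runtime_client, endpoint_name: str, payload: dict) -> dict:
    """Send a JSON payload to the endpoint and decode the JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully before decoding
    return json.loads(response["Body"].read())


# Usage in this notebook (requires AWS credentials):
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name=region)
# result = invoke_chat(runtime, endpoint_name, {"messages": [{"role": "user", "content": "Hello"}]})
```

Passing the client as an argument keeps the helper easy to reuse from services that already manage their own boto3 sessions.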


## Step 3: Test the Deployed Model

Now let's test our deployed Qwen3-Next model with various inference scenarios:

### 📝 Basic Chat Completion

The model supports OpenAI-compatible chat completion format with system and user messages.

In [10]:
# 🧪 Test 1: Code Generation
from IPython.display import display, Markdown

chat_request = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful Python programming assistant. Write clean, well-commented code."
        },
        {
            "role": "user",
            "content": "Write a Python function to calculate the Fibonacci sequence using dynamic programming."
        }
    ],
    "max_tokens": 1000,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20
}

# Measure inference time
start_time = time.time()
response = predictor.predict(chat_request)
end_time = time.time()

# Display results
print(f"✅ Code Generation Test Completed")
print(f"   Response Time: {end_time - start_time:.2f} seconds")
print()

# Render the response as Markdown
display(Markdown("**Generated Code:**"))
display(Markdown(response["choices"][0]["message"]["content"]))

✅ Code Generation Test Completed
   Response Time: 12.72 seconds



**Generated Code:**

Here's a Python function to calculate the Fibonacci sequence using dynamic programming with memoization:

```python
def fibonacci_dp(n, memo={}):
    """
    Calculate the nth Fibonacci number using dynamic programming with memoization.
    
    Args:
        n (int): The position in the Fibonacci sequence (0-indexed)
        memo (dict): Dictionary to store previously calculated values (default: empty dict)
    
    Returns:
        int: The nth Fibonacci number
    
    Examples:
        >>> fibonacci_dp(0)
        0
        >>> fibonacci_dp(1)
        1
        >>> fibonacci_dp(10)
        55
        >>> fibonacci_dp(50)
        12586269025
    """
    # Base cases
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    
    # Calculate using dynamic programming (memoization)
    memo[n] = fibonacci_dp(n - 1, memo) + fibonacci_dp(n - 2, memo)
    return memo[n]


# Alternative iterative approach (more memory efficient)
def fibonacci_dp_iterative(n):
    """
    Calculate the nth Fibonacci number using an iterative dynamic programming approach.
    This approach uses O(1) space instead of O(n) space for memoization.
    
    Args:
        n (int): The position in the Fibonacci sequence (0-indexed)
    
    Returns:
        int: The nth Fibonacci number
    
    Examples:
        >>> fibonacci_dp_iterative(0)
        0
        >>> fibonacci_dp_iterative(1)
        1
        >>> fibonacci_dp_iterative(10)
        55
    """
    if n <= 1:
        return n
    
    # Only keep track of the last two values
    prev2 = 0  # F(0)
    prev1 = 1  # F(1)
    
    # Calculate iteratively from 2 to n
    for i in range(2, n + 1):
        current = prev1 + prev2
        prev2 = prev1
        prev1 = current
    
    return prev1


# Test function to demonstrate both approaches
def test_fibonacci_functions():
    """Test both Fibonacci implementations with sample values."""
    test_cases = [0, 1, 2, 3, 4, 5, 10, 15, 20]
    
    print("Testing Fibonacci functions:")
    print("-" * 50)
    
    for n in test_cases:
        result1 = fibonacci_dp(n)
        result2 = fibonacci_dp_iterative(n)
        print(f"F({n}) = {result1} (memoized) | {result2} (iterative)")
    
    # Test performance with a larger number
    print("\nTesting with larger number (n=50):")
    import time
    
    start_time = time.time()
    result_memo = fibonacci_dp(50)
    memo_time = time.time() - start_time
    
    start_time = time.time()
    result_iter = fibonacci_dp_iterative(50)
    iter_time = time.time() - start_time
    
    print(f"F(50) = {result_memo}")
    print(f"Memoized time: {memo_time:.6f} seconds")
    print(f"Iterative time: {iter_time:.6f} seconds")


if __name__ == "__main__":
    test_fibonacci_functions()
```

## Key Features:

### 1. **Memoized Recursive Approach (`fibonacci_dp`)**:
- Uses a dictionary to store previously calculated values
- Avoids redundant calculations by storing results
- Time complexity: O(n)
- Space complexity: O(n) for the memo dictionary and recursion stack

### 2. **Iterative Approach (`fibonacci_dp_iterative`)**:
- Uses only two variables to track the previous two Fibonacci numbers
- No recursion overhead
- Time complexity: O(n)
- Space complexity: O(1) - most memory efficient

### 3. **Why Dynamic Programming?**
- **Overlapping Subproblems**: Fibonacci calculations repeatedly compute the same values
- **Optimal Substructure**: The solution to F(n) depends on solutions to F(n-1) and F(n-2)
- **Memoization** transforms the exponential O(2^n) naive recursive solution into linear O(n)

### 4. **Performance Comparison**:
- Naive recursion: O(2^n) - extremely slow for n > 40
- DP memoized: O(n) - fast and readable
- DP iterative: O(n) with O(1) space - most efficient for large n

The iterative version is recommended for production code due to its constant space complexity and lack of recursion depth limits.

In [None]:
# 🧪 Test 2: Scientific Explanation  
print("🧪 Testing Scientific Explanation...")

science_request = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful science educator. Explain complex topics clearly with examples."
        },
        {
            "role": "user",
            "content": "Explain quantum computing principles in simple terms with practical applications."
        }
    ],
    "max_tokens": 1500,      
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20
}

# Measure inference time
start_time = time.time()
response = predictor.predict(science_request)
end_time = time.time()

# Display results
print(f"✅ Scientific Explanation Test Completed")
print(f"   Response Time: {end_time - start_time:.2f} seconds")
print()

# Render the response
display(Markdown("**Quantum Computing Explanation:**"))
display(Markdown(response["choices"][0]["message"]["content"]))
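Both tests build the same payload shape and read the same response field, so the pattern can be factored into two small helpers. This is a sketch: `build_chat_request` and `extract_content` are hypothetical names, and the default sampling values simply mirror the ones used in the tests above.

```python
def build_chat_request(system_prompt: str, user_prompt: str, max_tokens: int = 1000,
                       temperature: float = 0.7, top_p: float = 0.8, top_k: int = 20) -> dict:
    """Build an OpenAI-style chat payload with a system and a user message."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
    }


def extract_content(response: dict) -> str:
    """Return the first choice's message text, or raise if there is none."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError(f"No choices in response: {response}")
    return choices[0]["message"]["content"]


# Usage in this notebook:
# request = build_chat_request("You are a helpful assistant.", "Summarize MoE models.")
# print(extract_content(predictor.predict(request)))
```

Raising on an empty `choices` list makes error payloads from the container visible instead of failing with a bare `KeyError`.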

### 🤖 Advanced Integration: Strands Agents with Tool Use

Qwen3-Next supports tool calling and agent workflows. Here we integrate it with the **Strands Agents** framework to build an agent that can call tools.

In [14]:
# 🤖 Configure Strands Agent with SageMaker Integration
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel
from strands_tools import calculator, current_time, file_read, shell


# Create SageMaker AI Model for Strands
sagemaker_model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": endpoint_name,      # Use our deployed endpoint
        "region_name": region,               # AWS region
    },
    payload_config={
        "max_tokens": 1000,                  # Response length limit
        "temperature": 0.7,                  # Creativity level
        "stream": False,                    
    }
)

# Create agent with useful tools
agent = Agent(
    model=sagemaker_model, 
    tools=[
        calculator,    # Mathematical calculations
        current_time,  # Get current date/time
        file_read,     # Read local files
        shell          # Execute shell commands (use with caution)
    ]
)

print(f"✅ Strands Agent configured successfully!")

✅ Strands Agent configured successfully!


In [15]:
# 🧮 Test Agent with Mathematical Problem Solving

response = agent("what's the square root of 12")

print(f"✅ Agent Response:")
print(f"   Stop Reason: {response.stop_reason}")
print(f"   Cycles: {response.metrics.cycle_count}")
print(f"   Duration: {sum(response.metrics.cycle_durations):.2f}s")
print()

# Display the agent's mathematical reasoning
display(Markdown(response.message['content'][0]['text']))


Tool #1: calculator


The square root of 12 is approximately 3.4641016151.
✅ Agent Response:
   Stop Reason: end_turn
   Cycles: 2
   Duration: 1.01s



The square root of 12 is approximately 3.4641016151.

## Step 4: Cleanup and Cost Management

**⚠️ Important**: SageMaker endpoints incur costs while running. Remember to clean up resources when testing is complete.

### 💰 Cost Information:
- **ml.g6e.12xlarge**: ~$10.00/hour (varies by region)
- **Storage**: S3 charges for model artifacts
- **Data Transfer**: Charges for inference requests/responses
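To make the hourly figure concrete, the sketch below multiplies it out for a few usage patterns. It uses the approximate $10.00/hour rate quoted above as an assumption; check current pricing for your region before relying on the numbers. `estimate_endpoint_cost` is a hypothetical helper.

```python
def estimate_endpoint_cost(hours: float, hourly_rate_usd: float = 10.00,
                           instance_count: int = 1) -> float:
    """Estimate instance cost in USD; excludes storage and data transfer."""
    return hours * hourly_rate_usd * instance_count


scenarios = [
    ("1 hour test", 1),
    ("8 hour workday", 8),
    ("30 days, always on", 24 * 30),
]

for label, hours in scenarios:
    print(f"{label}: ~${estimate_endpoint_cost(hours):,.2f}")
```

At the assumed rate, an endpoint left running for a month costs about $7,200, which is why deleting it after testing matters.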

In [None]:
# 🗑️ Delete Endpoint (UNCOMMENT TO EXECUTE)
# ⚠️ WARNING: This deletes the endpoint and stops its instance billing (other resources, such as the ECR image, are not removed)
# Only run this when you're completely done with testing

# print(f"🗑️ Deleting endpoint: {endpoint_name}")
# predictor.delete_endpoint(delete_endpoint_config=True)
# print(f"✅ Endpoint {endpoint_name} deleted successfully")

print("💡 Cleanup Instructions:")
print("   1. Uncomment the deletion code above")
print("   2. Run the cell to delete the endpoint")
print("   3. Verify deletion in AWS Console")
print("   4. Check that billing has stopped")

# Show current endpoint status
try:
    import boto3
    sm_client = boto3.client('sagemaker', region_name=region)
    endpoint_desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint_desc['EndpointStatus']
    print(f"\n📊 Current endpoint status: {status}")
except Exception as e:
    print(f"\n❓ Could not check endpoint status: {str(e)}")
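`predictor.delete_endpoint(delete_endpoint_config=True)` removes the endpoint and its configuration but leaves the SageMaker model resource in place. A fuller cleanup can be sketched with the boto3 SageMaker client as follows. `delete_endpoint_resources` is a hypothetical helper, and it assumes the endpoint config lists its models under `ProductionVariants`.

```python
def delete_endpoint_resources(sm_client, endpoint_name: str) -> dict:
    """Delete an endpoint, its endpoint config, and the models it references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]

    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    # Delete in dependency order: endpoint, then config, then models
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {
        "endpoint": endpoint_name,
        "endpoint_config": config_name,
        "models": model_names,
    }


# Usage in this notebook (requires AWS credentials; irreversible):
# import boto3
# sm_client = boto3.client("sagemaker", region_name=region)
# print(delete_endpoint_resources(sm_client, endpoint_name))
```

The returned dictionary records what was deleted, which is useful when confirming the cleanup in the AWS Console.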

## Summary

### ✅ What We Accomplished

1. **Successfully deployed** Qwen3-Next-80B-A3B-Instruct on SageMaker
2. **Configured vLLM** with optimized settings for ml.g6e.12xlarge
3. **Tested inference** with multiple scenarios (code generation, explanations)
4. **Integrated Strands Agents** for advanced tool calling capabilities

### 🚀 Key Features Enabled

- **80B Parameter Model** with 3B activation per token (MoE efficiency)
- **Multi-Token Prediction** support in the model for faster inference (2+ tokens per forward pass), when enabled in the container's vLLM configuration
- **Tool Calling** integration with Strands Agents framework
- **Tensor Parallelism** across 4 NVIDIA L40S GPUs
