# Deploy Qwen3-Next-80B-A3B-Instruct on Amazon SageMaker

This notebook demonstrates how to deploy the Qwen3-Next-80B-A3B-Instruct model on Amazon SageMaker using a custom vLLM container.

## Model Overview
- **Model**: Qwen3-Next-80B-A3B-Instruct
- **Parameters**: 80B total (3B activated)
- **Context Length**: 262K tokens (extensible to 1M+)
- **Architecture**: Hybrid Attention + High-Sparsity MoE (512 experts, 10 activated)
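
As a rough illustration of what the sparsity figures above imply, the ratios can be worked out directly. The numbers are taken from the list above; the variable names are illustrative only.

```python
# Sparsity ratios implied by the model overview figures (illustrative only).
total_params_b = 80      # total parameters, in billions
active_params_b = 3      # parameters activated per token, in billions
total_experts = 512      # experts per MoE layer
active_experts = 10      # experts activated per token

param_fraction = active_params_b / total_params_b
expert_fraction = active_experts / total_experts

print(f"Active parameters per token: {param_fraction:.2%}")
print(f"Active experts per token:    {expert_fraction:.2%}")
```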

## Prerequisites

1. AWS CLI configured with appropriate permissions
2. SageMaker execution role with necessary permissions
3. Access to ml.g6e.12xlarge for SageMaker endpoint
4. Custom Docker image pushed to ECR using the `build_and_push.sh` script

In [2]:
# !pip install -r code/requirements.txt

In [3]:
!pip install -U strands-agents strands-agents-tools
!pip install pydantic==2.11.7  # Compatible version
!pip install mypy-boto3-sagemaker-runtime  # Type stubs for SageMaker

In [None]:
import boto3
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
import json
import time
from datetime import datetime

import base64
import requests
import subprocess
import os
from typing import Dict, List, Any, Optional

# Initialize SageMaker session and get execution role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role for SageMaker operations
bucket = "your-s3-bucket"  # Replace with your own S3 bucket for the configuration files

# Get AWS account and region information
sts_client = boto3.client('sts')
account_id = sts_client.get_caller_identity()['Account']
region = boto3.Session().region_name

print(f"AWS Account ID: {account_id}")
print(f"AWS Region: {region}")
print(f"SageMaker role: {role}")
print(f"S3 bucket: {bucket}")

In [2]:
# Configuration for custom container
repository_name = "qwen-vllm-byoc"
image_tag = "latest"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository_name}:{image_tag}"

print(f"Custom container will be built and pushed to:")
print(f"Repository: {repository_name}")
print(f"Image URI: {image_uri}")

Custom container will be built and pushed to:
Repository: qwen-vllm-byoc
Image URI: 459006231907.dkr.ecr.us-west-2.amazonaws.com/qwen-vllm-byoc:latest
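
Because the deployment fails late if the image tag is missing, it may be worth checking ECR before going further. The following is a minimal sketch, not part of the original notebook: `build_image_uri` and `image_exists` are hypothetical helper names, and `ecr_client` is assumed to be a boto3 ECR client such as `boto3.client("ecr")`.

```python
def build_image_uri(account_id, region, repository, tag="latest"):
    """Compose a private ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def image_exists(ecr_client, repository, tag):
    """Return True if `tag` is present in `repository`.

    `ecr_client` is assumed to be a boto3 ECR client; the function only
    relies on its `describe_images` call and its `exceptions` namespace.
    """
    try:
        result = ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
    except (
        ecr_client.exceptions.ImageNotFoundException,
        ecr_client.exceptions.RepositoryNotFoundException,
    ):
        return False
    return len(result.get("imageDetails", [])) > 0
```

In the notebook this could be called as `image_exists(boto3.client("ecr"), repository_name, image_tag)` before creating the model.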


## Configuration

In [None]:
# 🎯 Model and Endpoint Configuration
model_name = "qwen3-next-80b-a3b-instruct"
endpoint_name = f"{model_name}-{int(time.time())}"  # Unique endpoint name with timestamp

# 🖥️ Instance Configuration
instance_type = "ml.g6e.12xlarge"  # 4x NVIDIA L40S GPUs (48 GB each, 192 GB GPU memory in total)
instance_count = 1                  # Single instance deployment

# 🗄️ Storage Configuration  
prefix = "qwen3-next-deployment"              # S3 prefix for organization
byoc_code_dir = "./code"                      # Local directory with model files

print(f"🚀 Deployment Configuration:")
print(f"   Model Name: {model_name}")
print(f"   Endpoint Name: {endpoint_name}")
print(f"   Instance Type: {instance_type}")
print(f"   Instance Count: {instance_count}")
print(f"   S3 Bucket: {bucket}")
print(f"   S3 Prefix: {prefix}")
print(f"   Code Directory: {byoc_code_dir}")

# Verify code directory exists
import os
if os.path.exists(byoc_code_dir):
    files = os.listdir(byoc_code_dir)
    print(f"   Code Files: {files}")
else:
    print(f"   ⚠️  Warning: Code directory '{byoc_code_dir}' not found!")
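
The timestamped endpoint name above can be validated before it is sent to SageMaker. This sketch assumes the documented naming constraints for endpoints (alphanumerics and hyphens, at most 63 characters); `make_endpoint_name` is a hypothetical helper, not something the notebook defines.

```python
import re
import time

# Assumed SageMaker endpoint naming constraints: alphanumeric characters and
# hyphens only, starting and ending with an alphanumeric, up to 63 characters.
_ENDPOINT_NAME_RE = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
_MAX_ENDPOINT_NAME_LEN = 63


def make_endpoint_name(base, timestamp=None):
    """Append a Unix timestamp to `base` and check the resulting name."""
    suffix = int(time.time()) if timestamp is None else int(timestamp)
    name = f"{base}-{suffix}"
    if len(name) > _MAX_ENDPOINT_NAME_LEN:
        raise ValueError(f"Endpoint name is {len(name)} characters; limit is 63")
    if not _ENDPOINT_NAME_RE.fullmatch(name):
        raise ValueError(f"Endpoint name contains invalid characters: {name!r}")
    return name
```

Failing early here avoids a validation error surfacing only when `deploy()` is called.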

## Step 1: Upload Model Configuration Files to S3

For bring-your-own-container (BYOC) deployments, we upload our custom model configuration files to S3:

- **`model.py`**: Inference server with vLLM integration
- **`serving.properties`**: vLLM engine configuration parameters
- **`requirements.txt`**: Python dependencies for the container

SageMaker copies these files into the container's `/opt/ml/model` directory when the endpoint starts.
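
Before uploading, it can help to confirm that the three files listed above are actually present in the local code directory. A minimal sketch follows; `missing_code_files` is a hypothetical helper, and the file names are simply those listed above.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("model.py", "serving.properties", "requirements.txt")


def missing_code_files(code_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in `code_dir`."""
    root = Path(code_dir)
    return [name for name in required if not (root / name).is_file()]
```

For example, `missing_code_files("./code")` returns an empty list when everything is in place, and the upload can be skipped or aborted otherwise.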

In [None]:
# Upload BYOC configuration files to S3
print("📤 Uploading model configuration files to S3...")

byoc_config_uri = sagemaker_session.upload_data(
    path=byoc_code_dir, 
    bucket=bucket, 
    key_prefix=f"{prefix}/code"
)

# Prepare model data configuration for SageMaker
model_data = {
    "S3DataSource": {
        "S3Uri": f"{byoc_config_uri}/",
        "S3DataType": "S3Prefix",         # Directory containing multiple files
        "CompressionType": "None"         # Files are not compressed
    }
}

print(f"✅ Upload completed!")
print(f"   S3 URI: {byoc_config_uri}")
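
To confirm what actually landed in S3, the returned URI can be split into bucket and prefix and listed. This is a sketch under assumptions: `split_s3_uri` and `list_uploaded_keys` are hypothetical helpers, and `s3_client` is assumed to be a boto3 S3 client. A single `list_objects_v2` call returns at most 1,000 keys, which is enough for a handful of configuration files.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_uploaded_keys(s3_client, uri):
    """List object keys under an S3 URI (first page of results only)."""
    bucket, prefix = split_s3_uri(uri)
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]
```

In the notebook, `list_uploaded_keys(boto3.client("s3"), byoc_config_uri)` should show the uploaded configuration files.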

## Step 2: Create SageMaker Model Configuration

Now we'll create the SageMaker model configuration that combines our custom container with the uploaded configuration files.

In [5]:
# Create SageMaker model configuration
print("🏗️ Creating SageMaker model configuration...")

model = Model(
    image_uri=image_uri,                        # Custom vLLM container
    model_data=model_data,                           # S3 path to configuration files
    role=role,                                       # SageMaker execution role
    name=f"{model_name}-model-{int(time.time())}",   # Unique model name
    sagemaker_session=sagemaker_session,
    env={
        # 🚀 vLLM Configuration
        'VLLM_USE_V1': '1',                          # Use vLLM v1 engine (required for Qwen3-Next)
        'VLLM_WORKER_MULTIPROC_METHOD': 'spawn',     # Process spawning method
        'VLLM_DISTRIBUTED_EXECUTOR_BACKEND': 'mp',   # Multi-processing backend
        'VLLM_LOGGING_LEVEL': 'INFO',                # Logging level
        
        # 🖥️ GPU Configuration  
        'CUDA_VISIBLE_DEVICES': '0,1,2,3',           # Use all 4 GPUs
        'TORCH_CUDA_ARCH_LIST': '8.9',               # NVIDIA L40S compute capability
        
        # 🗂️ Cache Directories
        'MODEL_CACHE_DIR': '/opt/ml/model',          # Model cache location
        'TRANSFORMERS_CACHE': '/tmp/transformers_cache',  # Hugging Face cache (deprecated in newer transformers in favor of HF_HOME)
        'HF_HOME': '/tmp/hf_home',                   # HuggingFace home directory
        
        # 🌐 Server Configuration
        'SAGEMAKER_BIND_TO_PORT': '8080',            # Internal server port
        'SAGEMAKER_BIND_TO_HOST': '0.0.0.0',         # Bind to all interfaces
        
        # 🔧 Optimization Settings
        'NCCL_DEBUG': 'INFO',                        # NCCL debugging (for multi-GPU)
        'TORCH_COMPILE_DISABLE': '1',                # Disable PyTorch compilation
        'VLLM_DISABLE_CUSTOM_ALL_REDUCE': '1',       # Disable custom all-reduce (stability)
    }
)

print(f"✅ SageMaker model created successfully!")

🏗️ Creating SageMaker model configuration...
✅ SageMaker model created successfully!


## Step 3: Deploy Model to SageMaker Endpoint

This step creates and deploys the model to a real-time inference endpoint. The deployment includes:

- **Model Loading**: Downloading Qwen3-Next-80B-A3B-Instruct from Hugging Face
- **vLLM Initialization**: Setting up the inference engine with tensor parallelism
- **Health Checks**: Ensuring the endpoint is ready for inference

**⏱️ Expected Time**: 10-15 minutes
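
The need for tensor parallelism can be motivated with some rough memory arithmetic. The sketch below assumes 48 GB per L40S GPU and treats the weight precision (BF16 or FP8) as an open choice, since the notebook does not state which precision the container loads. It counts weights only and ignores KV cache, activations, and runtime overhead, so real headroom is smaller.

```python
def weights_gb(params_billions, bytes_per_param):
    """Approximate weight size in GB, using 1 GB = 1e9 bytes."""
    return params_billions * bytes_per_param


NUM_GPUS = 4          # ml.g6e.12xlarge
GPU_MEMORY_GB = 48    # per NVIDIA L40S

for label, bytes_per_param in (("BF16", 2), ("FP8", 1)):
    total = weights_gb(80, bytes_per_param)
    per_gpu = total / NUM_GPUS
    headroom = GPU_MEMORY_GB - per_gpu
    print(f"{label}: {total:.0f} GB total, {per_gpu:.0f} GB per GPU, "
          f"{headroom:.0f} GB per GPU left for everything else")
```

Under these assumptions the weights do not fit on any single GPU, which is why they are sharded across all four.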

In [None]:
# Deploy the model to a SageMaker endpoint
print(f"🚀 Starting deployment to endpoint: {endpoint_name}")
print(f"⏱️  Estimated Time: 10-15 minutes")

# Start deployment with optimized settings
predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
    container_startup_health_check_timeout=1200,  # 20 minutes for container startup
    model_data_download_timeout=1200,             # 20 minutes for model download
    wait=True,          # Wait for deployment to complete
)

print(f"✅ Deployment completed successfully!")

🚀 Starting deployment to endpoint: qwen3-next-80b-a3b-instruct-1758212940
⏱️  Estimated Time: 10-15 minutes
-
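
With `wait=True`, the call above blocks until the endpoint is ready. If the notebook kernel is interrupted, or `wait=False` is used, the status can be polled separately. The following is a minimal sketch: `wait_for_endpoint` is a hypothetical helper, `sm_client` is assumed to be a boto3 SageMaker client, and the polling interval and timeout are arbitrary defaults.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30,
                      timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService.

    Raises RuntimeError if creation fails and TimeoutError if the
    endpoint is still not ready after `timeout_seconds`.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status}")
        sleep(poll_seconds)
```

boto3 also ships an `endpoint_in_service` waiter on the SageMaker client that serves a similar purpose; the explicit loop is shown here only because it makes the status transitions visible.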

In [None]:
# Reconnect to an existing endpoint (e.g. after a kernel restart) instead of reusing the predictor returned by deploy()

from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

print(f"🔗 Connected to endpoint: {endpoint_name}")

## Step 4: Test the Deployed Model

Now let's test our deployed Qwen3-Next model with various inference scenarios:

### 📝 Basic Chat Completion

The endpoint accepts requests in the OpenAI-compatible chat completion format, with system and user messages.
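
For callers that do not use the SageMaker Python SDK, the same request shape can be sent through the low-level runtime API. This is a sketch under assumptions: the helper names are hypothetical, `runtime_client` is assumed to be `boto3.client("sagemaker-runtime")`, and the response is assumed to follow the OpenAI-style `choices[0].message.content` layout used elsewhere in this notebook.

```python
import json


def build_chat_request(system_prompt, user_prompt, max_tokens=1000,
                       temperature=0.7, top_p=0.8, top_k=20):
    """Build a chat completion payload with one system and one user message."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
    }


def invoke_chat(runtime_client, endpoint_name, request):
    """Send the payload as JSON and decode the JSON response body."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(request),
    )
    return json.loads(response["Body"].read())


def first_message_text(result):
    """Extract the assistant text from an OpenAI-style response."""
    return result["choices"][0]["message"]["content"]
```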

In [12]:
# 🧪 Test 1: Code Generation
from IPython.display import display, Markdown

chat_request = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful Python programming assistant. Write clean, well-commented code."
        },
        {
            "role": "user",
            "content": "Write a Python function to calculate the Fibonacci sequence using dynamic programming."
        }
    ],
    "max_tokens": 1000,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20
}

# Measure inference time
start_time = time.time()
response = predictor.predict(chat_request)
end_time = time.time()

# Display results
print(f"✅ Code Generation Test Completed")
print(f"   Response Time: {end_time - start_time:.2f} seconds")
print()

# Render the response as Markdown
display(Markdown("**Generated Code:**"))
display(Markdown(response["choices"][0]["message"]["content"]))

✅ Code Generation Test Completed
   Response Time: 11.58 seconds



**Generated Code:**

Here's a Python function to calculate the Fibonacci sequence using dynamic programming with memoization:

```python
def fibonacci_dp(n, memo={}):
    """
    Calculate the nth Fibonacci number using dynamic programming with memoization.
    
    Args:
        n (int): The position in the Fibonacci sequence (0-indexed)
        memo (dict): Dictionary to store previously calculated values (default: {})
    
    Returns:
        int: The nth Fibonacci number
    
    Examples:
        >>> fibonacci_dp(0)
        0
        >>> fibonacci_dp(1)
        1
        >>> fibonacci_dp(10)
        55
        >>> fibonacci_dp(20)
        6765
    """
    # Base cases
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    
    # Calculate using dynamic programming (memoization)
    memo[n] = fibonacci_dp(n - 1, memo) + fibonacci_dp(n - 2, memo)
    return memo[n]


# Alternative iterative approach (more memory efficient)
def fibonacci_dp_iterative(n):
    """
    Calculate the nth Fibonacci number using an iterative dynamic programming approach.
    This approach uses O(1) space instead of O(n) space for memoization.
    
    Args:
        n (int): The position in the Fibonacci sequence (0-indexed)
    
    Returns:
        int: The nth Fibonacci number
    
    Examples:
        >>> fibonacci_dp_iterative(0)
        0
        >>> fibonacci_dp_iterative(1)
        1
        >>> fibonacci_dp_iterative(10)
        55
        >>> fibonacci_dp_iterative(20)
        6765
    """
    if n <= 1:
        return n
    
    # Only keep track of the last two values
    prev2 = 0  # F(0)
    prev1 = 1  # F(1)
    
    # Calculate iteratively from 2 to n
    for i in range(2, n + 1):
        current = prev1 + prev2
        prev2 = prev1
        prev1 = current
    
    return prev1


# Example usage and test function
if __name__ == "__main__":
    # Test both implementations
    print("Testing Fibonacci Dynamic Programming implementations:")
    
    # Test cases
    test_cases = [0, 1, 2, 3, 4, 5, 10, 15, 20]
    
    print("\nRecursive DP with memoization:")
    for n in test_cases:
        result = fibonacci_dp(n)
        print(f"F({n}) = {result}")
    
    print("\nIterative DP (space efficient):")
    for n in test_cases:
        result = fibonacci_dp_iterative(n)
        print(f"F({n}) = {result}")
    
    # Performance comparison for larger values
    import time
    
    n = 35
    print(f"\nPerformance test for F({n}):")
    
    # Test recursive DP
    start_time = time.time()
    result1 = fibonacci_dp(n)
    end_time = time.time()
    print(f"Recursive DP: {result1} (Time: {end_time - start_time:.6f} seconds)")
    
    # Test iterative DP
    start_time = time.time()
    result2 = fibonacci_dp_iterative(n)
    end_time = time.time()
    print(f"Iterative DP: {result2} (Time: {end_time - start_time:.6f} seconds)")
```

## Key Features:

### 1. **Recursive DP with Memoization** (`fibonacci_dp`)
- Uses a dictionary to store previously computed values
- Avoids redundant calculations by storing results
- Time complexity: O(n)
- Space complexity: O(n)

### 2. **Iterative DP** (`fibonacci_dp_iterative`)
- Uses only two variables to track the previous two Fibonacci numbers
- More memory efficient - O(1) space complexity
- Time complexity: O(n)
- No recursion overhead

### 3. **Benefits of Dynamic Programming**:
- Eliminates exponential time complexity of naive recursion
- Each Fibonacci number is calculated only once
- Much faster for large values of n

### 4. **Usage Notes**:
- The recursive version uses a default mutable argument (`memo={}`) which is efficient but can cause issues if used in multi-threaded environments
- The iterative version is generally preferred for production code due to its space efficiency and lack of recursion depth limits

Both implementations correctly handle edge cases (n=0, n=1) and provide the standard Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13
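
The test above reports wall-clock latency only. A minimal sketch of a reusable timing helper follows. The `usage.completion_tokens` field is an assumption based on the OpenAI-style response shape and may not be returned by the custom `model.py`, in which case `tokens_per_second` returns `None`.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call `fn` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(result, elapsed_seconds):
    """Return generated tokens per second, or None if usage data is absent."""
    usage = result.get("usage") or {}
    completion_tokens = usage.get("completion_tokens")
    if not completion_tokens or elapsed_seconds <= 0:
        return None
    return completion_tokens / elapsed_seconds
```

In the notebook this could be used as `response, elapsed = timed_call(predictor.predict, chat_request)`. Note that the figure includes network and queueing time, not just generation.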

In [13]:
# 🧪 Test 2: Scientific Explanation  
print("🧪 Testing Scientific Explanation...")

science_request = {
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful science educator. Explain complex topics clearly with examples."
        },
        {
            "role": "user",
            "content": "Explain quantum computing principles in simple terms with practical applications."
        }
    ],
    "max_tokens": 1500,      
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20
}

# Measure inference time
start_time = time.time()
response = predictor.predict(science_request)
end_time = time.time()

# Display results
print(f"✅ Scientific Explanation Test Completed")
print(f"   Response Time: {end_time - start_time:.2f} seconds")
print()

# Render the response
display(Markdown("**Quantum Computing Explanation:**"))
display(Markdown(response["choices"][0]["message"]["content"]))

🧪 Testing Scientific Explanation...
✅ Scientific Explanation Test Completed
   Response Time: 11.37 seconds



**Quantum Computing Explanation:**

Absolutely! Let’s break down **quantum computing** in simple, everyday terms — no PhD required.

---

### 🌟 What is Quantum Computing? (The Simple Version)

Imagine a regular computer (like your laptop or phone) uses **bits** — tiny switches that are either **0 or 1**. Like a light bulb: off (0) or on (1).

A **quantum computer** uses **qubits** (quantum bits). These aren’t just on or off — they can be **both 0 and 1 at the same time**! 🤯

This strange ability is called **superposition**.

---

### 🔑 Three Core Quantum Principles (Explained Simply)

#### 1. **Superposition** — Being in Two States at Once
Think of a spinning coin. While it’s spinning, it’s neither purely heads nor tails — it’s kind of both. Only when you stop it (measure it) does it “choose” one.

> ✅ A qubit is like a spinning coin — it holds many possibilities until you look.

#### 2. **Entanglement** — Spooky Connection at a Distance
Imagine two magic dice. You roll one in New York, and instantly the other in Tokyo shows the same number — even if they’ve never met!

> ✅ When two qubits are entangled, changing one instantly affects the other — no matter how far apart they are. This lets quantum computers link information in powerful ways.

#### 3. **Interference** — Amplifying the Right Answers
Quantum computers don’t just guess — they use wave-like interference (like sound waves canceling or boosting each other) to **amplify correct answers** and **cancel out wrong ones**.

> ✅ It’s like tuning a radio: you turn the dial until the music gets loud and clear — the quantum computer does this mathematically to find the best solution.

---

### 💡 Practical Applications (Where Quantum Computing Will Help)

| Area | How Quantum Helps | Real-World Example |
|------|-------------------|---------------------|
| **Drug Discovery** | Simulates molecules at quantum level | Finding new medicines faster — e.g., a cure for Alzheimer’s or personalized cancer treatments |
| **Cybersecurity** | Can break current encryption… but also create unbreakable encryption | Banks and governments will use **quantum-safe encryption** to protect data |
| **Logistics & Optimization** | Finds the best route among billions of options | Amazon could find the fastest delivery routes for 1 million packages in seconds |
| **Artificial Intelligence** | Speeds up machine learning training | AI models that learn from less data, recognize patterns faster (e.g., medical imaging) |
| **Climate Modeling** | Simulates complex weather and chemical reactions | Better predictions for climate change and clean energy solutions (like better batteries) |
| **Finance** | Optimizes portfolios and detects fraud | Hedge funds use it to predict market trends or manage risk better |

---

### 🚫 What Quantum Computers Won’t Do (Myths Busted)

- ❌ They won’t replace your laptop for browsing or Netflix.
- ❌ They’re not “faster computers” for everything — only for *very specific* problems.
- ✅ They’re like a **specialized super-tool** — great for complex math, chemistry, and optimization.

---

### 🧩 Analogy: Quantum vs Classical Computer

Think of finding your way out of a maze:

- **Classical computer**: Tries one path at a time — left, then right, then back, etc. Very slow if the maze is huge.
- **Quantum computer**: Explores *all paths at once* using superposition, then uses interference to “collapse” into the correct exit.

---

### 🌐 Real-World Progress (2024)

- **IBM**, **Google**, and **Rigetti** have built quantum computers with 100–1000+ qubits.
- **China** and the **US** are investing billions — it’s the new “space race.”
- Companies like **JPMorgan Chase** and **BMW** are already testing quantum algorithms.

---

### ✅ In a Nutshell

> **Quantum computing = Using the weird rules of tiny particles to solve problems too complex for regular computers.**

It’s not magic — it’s physics. And while we’re still early in the journey, quantum computing could revolutionize medicine, security, AI, and more in the next 10–20 years.

Think of it as the next leap after the transistor — and we’re just turning it on. 🔌⚛️

Let me know if you’d like a diagram or a fun analogy using pizza or cats! 😊

### 🤖 Advanced Integration: Strands Agents with Tool Use

Qwen3-Next supports tool calling and agent workflows. Here we integrate it with the **Strands Agents** framework so the model can call tools while answering.

In [23]:
# 🤖 Configure Strands Agent with SageMaker Integration
from strands import Agent
from strands.models.sagemaker import SageMakerAIModel
from strands_tools import calculator, current_time, file_read, shell


# Create SageMaker AI Model for Strands
sagemaker_model = SageMakerAIModel(
    endpoint_config={
        "endpoint_name": endpoint_name,      # Use our deployed endpoint
        "region_name": region,               # AWS region
    },
    payload_config={
        "max_tokens": 1000,                  # Response length limit
        "temperature": 0.7,                  # Creativity level
        "stream": False,                    
    }
)

# Create agent with useful tools
agent = Agent(
    model=sagemaker_model, 
    tools=[
        calculator,    # Mathematical calculations
        current_time,  # Get current date/time
        file_read,     # Read local files
        shell          # Execute shell commands (use with caution)
    ]
)

print(f"✅ Strands Agent configured successfully!")

✅ Strands Agent configured successfully!


In [None]:
# 🧮 Test Agent with Mathematical Problem Solving

response = agent("what's the square root of 12")

print(f"✅ Agent Response:")
print(f"   Stop Reason: {response.stop_reason}")
print(f"   Cycles: {response.metrics.cycle_count}")
print(f"   Duration: {sum(response.metrics.cycle_durations):.2f}s")
print()

# Display the agent's mathematical reasoning
display(Markdown(response.message['content'][0]['text']))

## Step 5: Cleanup and Cost Management

**⚠️ Important**: SageMaker endpoints incur costs while running. Remember to clean up resources when testing is complete.

### 💰 Cost Information:
- **ml.g6e.12xlarge**: ~$10.00/hour (approximate; varies by region, so check current SageMaker pricing)
- **Storage**: S3 charges for model artifacts
- **Data Transfer**: Charges for inference requests/responses
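
To make the instance cost concrete, the hourly figure can be projected over typical durations. The sketch below uses the approximate rate quoted above purely as a placeholder; substitute the actual price for your region. `endpoint_cost` is a hypothetical helper, and S3 and data transfer charges are not included.

```python
def endpoint_cost(hourly_rate_usd, hours, instance_count=1):
    """Instance cost only; excludes S3 storage and data transfer."""
    return hourly_rate_usd * hours * instance_count


HOURLY_RATE_USD = 10.00  # placeholder taken from the estimate above

for label, hours in (("1 hour", 1), ("8 hours", 8), ("30 days", 24 * 30)):
    print(f"{label:>8}: ${endpoint_cost(HOURLY_RATE_USD, hours):,.2f}")
```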

In [None]:
# 🗑️ Delete Endpoint (UNCOMMENT TO EXECUTE)
# ⚠️ WARNING: This will delete your endpoint and stop its instance billing (S3 and ECR storage charges continue)
# Only run this when you're completely done with testing

# print(f"🗑️ Deleting endpoint: {endpoint_name}")
# predictor.delete_endpoint(delete_endpoint_config=True)
# print(f"✅ Endpoint {endpoint_name} deleted successfully")

print("💡 Cleanup Instructions:")
print("   1. Uncomment the deletion code above")
print("   2. Run the cell to delete the endpoint")
print("   3. Verify deletion in AWS Console")
print("   4. Check that billing has stopped")

# Show current endpoint status
try:
    import boto3
    sm_client = boto3.client('sagemaker', region_name=region)
    endpoint_desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = endpoint_desc['EndpointStatus']
    print(f"\n📊 Current endpoint status: {status}")
except Exception as e:
    print(f"\n❓ Could not check endpoint status: {str(e)}")
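
Deleting the endpoint stops instance billing, but the endpoint configuration and model resources remain registered in the account. A minimal sketch of removing all three through the low-level API follows; `delete_endpoint_resources` is a hypothetical helper, and `sm_client` is assumed to be a boto3 SageMaker client like the one created above. It does not touch the S3 files or the ECR image.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and models."""
    # Look up the dependent resources before anything is deleted.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    # Delete in dependency order: endpoint, then config, then models.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {"endpoint": endpoint_name,
            "endpoint_config": config_name,
            "models": model_names}
```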

## Summary

### ✅ What We Accomplished

1. **Successfully deployed** Qwen3-Next-80B-A3B-Instruct on SageMaker
2. **Configured vLLM** with optimized settings for ml.g6e.12xlarge
3. **Tested inference** with multiple scenarios (code generation, explanations)
4. **Integrated Strands Agents** for advanced tool calling capabilities

### 🚀 Key Features Enabled

- **80B Parameter Model** with 3B activation per token (MoE efficiency)
- **Multi-Token Prediction** for faster inference (2+ tokens per forward pass), when enabled in the vLLM configuration
- **Tool Calling** integration with Strands Agents framework
- **Tensor Parallelism** across 4 NVIDIA L40S GPUs

### 🛠️ Customization Options

- **Modify `serving.properties`**: Adjust vLLM parameters
- **Update `model.py`**: Add custom preprocessing/postprocessing
- **Instance Types**: Scale up to ml.g6e.48xlarge (8x L40S GPUs) for more GPU memory and throughput; ml.g6e.24xlarge has the same 4 GPUs as ml.g6e.12xlarge
