# re:Invent 2025 - AIM311: Optimize Open Weight Models for Low-Latency, Cost-Effective AI Apps

## Lab 1a: Model Selection Framework
**Focus**: Use-case-driven model selection, pricing analysis, performance optimization

### What You'll Learn in Lab 1a

This lab takes a **use-case-first approach** to model selection. Instead of starting with technical metrics, we'll help you:

1. **Identify your use case** - Understand whether you're building a chatbot, agent, document analyzer, or multimodal application
2. **Select the right model** - Learn which models match your specific requirements (latency, context window, capabilities)
3. **Validate with benchmarks** - Review independent performance data to inform your decision
4. **Analyze costs and performance** - Test real pricing and latency metrics with hands-on examples

### Models We'll Explore

This workshop covers 7 open-weight models available on AWS Bedrock:

- **Llama 4 Maverick** (400B MoE) - Multimodal chat and analysis with 1M context
- **Llama 4 Scout** (109B MoE) - Ultra-long document processing with 3.5M context
- **GPT OSS 120B** - Complex reasoning and agentic workflows
- **GPT OSS 20B** - Fast, cost-effective responses for high-volume applications
- **Qwen3 235B MoE** - Advanced reasoning with thinking mode
- **Qwen3 32B** - Balanced performance for enterprise applications
- **DeepSeek V3.1** (685B MoE) - Cost-optimized for high-volume deployments

**Bonus**: We'll also explore **Qwen3 Coder** models (480B MoE, 30B MoE) specialized for code generation.

**What is a Mixture of Experts (MoE)?**: Mixture of Experts (MoE) is an LLM architecture that places multiple specialized "expert" networks within each layer but activates only a subset of them for each input token. This sparse activation lets models scale to hundreds of billions of total parameters while keeping inference speed reasonable, because only the active experts contribute compute per token. The trade-off is a higher memory requirement, since all experts must be loaded even though only a few run at a time.
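
To make the sparse-activation idea concrete, here is a minimal toy sketch of top-k expert routing for a single token. It is an illustration under simplifying assumptions, not how any model in this workshop is implemented: the "experts" are plain scalar functions, and `moe_forward`, `router_logits`, and the scores are hypothetical names and values chosen for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_logits, experts, top_k=2):
    """Route one token through only the top_k highest-scoring experts."""
    # Rank experts by router score and keep the best top_k.
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    active = ranked[:top_k]
    # Renormalize the gate weights over the active experts only.
    gates = softmax([router_logits[i] for i in active])
    # Only the active experts run; the others cost no compute for this token.
    output = sum(g * experts[i](token) for g, i in zip(gates, active))
    return output, active

# Four toy "experts" (real experts are feed-forward networks, not scalar functions).
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

output, active = moe_forward(3.0, router_logits, experts, top_k=2)
print(f"active experts: {active}, output: {output}")
```

All four experts stay in memory, but only two of them are evaluated for this token, which is the memory-versus-compute trade-off described above.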


### Let's get started! 🚀

## 🛠️ Environment Setup
**⏱️ Pre-workshop setup - run these cells before starting**

In [None]:
# Note: asyncio is built-in to Python 3, no need to install
! pip install -q rich pandas Pillow
! pip install -q boto3

In [None]:
import boto3
from rich.console import Console

# Initialize console for rich output
console = Console()

# AWS Configuration
AWS_REGION = "us-west-2"
bedrock_runtime = boto3.client('bedrock-runtime', region_name=AWS_REGION)

print("✅ Environment setup complete!")

## 🎯 Step 1: Identify Your Use Case

Before diving into pricing and performance metrics, let's start with the most important question: **What are you building?**

Different use cases have different requirements. A real-time chatbot needs low latency, while a document analysis system needs a large context window. 

By identifying your use case first, you can narrow down which models are relevant to your needs.

### Common Use Case Categories and Requirements

Understanding the relationship between your use case and model capabilities helps you make informed decisions:


| Use Case | Example Applications | Key Requirements | Priorities | Model Constraints |
|----------|---------------------|------------------|---------------|-------------------|
| Chatbot (text) | Customer support<br>Q&A systems<br>Virtual assistants | Low latency (<500ms TTFT)<br>Streaming support<br>Conversational memory | Speed > Cost > Quality | Fast inference<br>Streaming support<br>Cost at scale |
| Agent | Research assistants<br>Automation workflows<br>Data analysis | Tool calling<br>Multi-step reasoning<br>Function execution | Quality > Tool Calling > Speed | Complex logic and strong reasoning<br>High reliability<br>Error handling |
| Long Document Analysis | Legal review<br>Medical records<br>Contract analysis | Processing long documents<br>High accuracy | Context > Quality > Cost | Large context window<br>Multimodal architecture<br>Extraction precision<br>Strong comprehension |
| Code Generation | IDE assistants<br>Code review<br>Documentation generation | Code understanding<br>Multi-language support<br>Syntax accuracy | Code Quality > Context > Speed | Trained on multiple programming languages<br>Context awareness<br>Understanding of programming patterns |
| Multimodal | Document OCR<br>Visual Q&A<br>Image description | Vision + text processing<br>Image understanding | Multimodal Support > Quality > Speed | High quality image encoding<br>Multiple format support<br>Visual reasoning |


**💡 Pro Tip**: Most real-world applications combine multiple use cases. For example, a customer support system might need both chatbot capabilities (low latency) and document analysis (knowledge base search). In these cases, prioritize your primary use case and validate that secondary requirements are met.

**💭 Think about:** Based on the use cases you just learned about, which capability do you think is MOST important for your application? Consider multimodal support, large context windows, thinking mode, cost, or response time.

---

## 📈 Step 2: High Level Model Comparison and Use Case Mapping

The table below combines each model's context window with use case recommendations to help you narrow down the model choice for your application.

| Model | Context Window | Best Use Case Match | Potential Applications |
|-------|----------------|---------------------|----------------------|
| DeepSeek V3.1 | 128K | Complex reasoning & agents | **Primary**: Agent systems<br>**Alternative:** Code generation (thinking mode)<br>**Budget:** Chatbots (non-thinking mode) |
| GPT OSS 120B | 128K | Complex reasoning & agents | **Primary**: Agent systems<br>**Alternative:** Complex reasoning tasks |
| GPT OSS 20B | 128K | Fast, cost-effective responses | **Primary**: Chatbots<br>**Alternative:** High-volume applications |
| Qwen3 235B MoE | 128K | Reasoning with thinking mode | **Primary**: Agent systems<br>**Alternative:** Complex reasoning |
| Qwen3 32B | 128K | Balanced enterprise apps | **Primary**: Chatbots<br>**Alternative:** General enterprise use |
| Qwen3 Coder 480B | 256K | Code generation & analysis | **Primary**: Code generation<br>**Alternative:** Complex coding tasks |
| Qwen3 Coder 30B | 256K | Code generation & analysis | **Primary**: Budget code generation<br>**Alternative:** Simple coding tasks |
| Llama 4 Maverick | 1M | Multimodal applications | **Primary**: Complex OCR & charts<br>**Alternative:** Document analysis |
| Llama 4 Scout | 3.5M | Ultra-long document analysis | **Primary**: Long documents<br>**Alternative:** Cost-effective multimodal |
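
As a first filter, the context-window column above can be turned into a simple lookup. The sketch below is one way to do that; it assumes "128K" means 128,000 tokens and "1M" means 1,000,000 tokens (exact limits may differ), and `models_that_fit` is a hypothetical helper name rather than part of the workshop code.

```python
# Context windows (in tokens) copied from the comparison table above.
# Assumption: "K" = 1,000 and "M" = 1,000,000; check the model cards for exact limits.
CONTEXT_WINDOWS = {
    "DeepSeek V3.1": 128_000,
    "GPT OSS 120B": 128_000,
    "GPT OSS 20B": 128_000,
    "Qwen3 235B MoE": 128_000,
    "Qwen3 32B": 128_000,
    "Qwen3 Coder 480B": 256_000,
    "Qwen3 Coder 30B": 256_000,
    "Llama 4 Maverick": 1_000_000,
    "Llama 4 Scout": 3_500_000,
}

def models_that_fit(prompt_tokens, reserved_output_tokens=0):
    """Return the models whose context window covers the prompt plus reserved output."""
    needed = prompt_tokens + reserved_output_tokens
    return sorted(name for name, window in CONTEXT_WINDOWS.items() if window >= needed)

# Example: a 300K-token document with room for a 500-token answer.
candidates = models_that_fit(300_000, reserved_output_tokens=500)
print(candidates)
```

Context fit is only a necessary condition; the remaining candidates still need to be compared on quality, latency, and cost.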

#### ⚠️ Important Notes

| Category | Details |
|----------|---------|
| Regional Availability | • New models are regularly added to AWS Bedrock<br>• Not all models are available in all AWS regions<br>• Pricing may vary by region<br>• Check the [AWS Bedrock documentation](https://docs.aws.amazon.com/bedrock/) for current availability |


### 🎯 How to Read Benchmarks for Your Use Case

Different use cases prioritize different metrics. You need to identify what matters most for your application.

Typical operational metrics:
- **TTFT (Time to First Token)**: How quickly the model starts responding - critical for real-time applications
- **Throughput**: Tokens generated per second - affects streaming speed and user experience
- **Cost**: Price per million tokens - important for high-volume deployments
- **Context Need**: Maximum input size required for your use case
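
The first two metrics can be measured from any streamed response. The following is a minimal sketch, assuming each streamed chunk is one token (real streaming APIs do not guarantee this) and using a fake clock so the example runs deterministically; `measure_stream` is a hypothetical helper, not the workshop's measurement code.

```python
import time

def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of streamed tokens and report basic latency metrics."""
    start = clock()
    first_token_at = None
    token_count = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = clock()  # timestamp of the first token
        token_count += 1
    end = clock()
    latency = end - start
    return {
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "latency_s": latency,
        # Throughput here is tokens over total latency, one of several common definitions.
        "throughput_tokens_per_sec": token_count / latency if latency > 0 else 0.0,
    }

# Deterministic demo: a fake clock that returns start, first-token, and end times.
ticks = iter([0.0, 0.5, 2.5])
metrics = measure_stream(["Machine", "learning", "is", "fun"], clock=lambda: next(ticks))
print(metrics)
```

With a real model you would pass the streaming response iterator and leave `clock` at its default.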

General Model Quality Considerations:
- Larger models typically provide better reasoning and accuracy but at higher cost and latency
- Specialized models (like code-focused variants) excel in their domain but may underperform in general tasks
- MoE (Mixture of Experts) models offer good performance-to-cost ratios for diverse workloads
- "Thinking mode" capabilities enhance complex reasoning but increase token usage and response time
- Consider the trade-off between model capability and operational requirements for your specific use case

### 🔗 External Resources & Live Benchmarks

Benchmark data evolves as models are updated and new models are released. Here are some sources you can use as a starting point:

#### 🏆 Benchmark Sources

| Resource | Focus | Best For |
|----------|-------|----------|
| [artificialanalysis.ai](https://artificialanalysis.ai/) | Independent quality benchmarks across multiple models<br>Cost comparisons across providers<br>Speed metrics (tokens/second, time to first token) | Understanding model capabilities with regular updates as new models are released |
| [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) | Academic benchmarks (MMLU, HellaSwag, TruthfulQA, etc.) | Understanding model capabilities on standardized tests |
| [LMSYS Org Projects](https://lmsys.org/projects/) | Datasets and evaluation tools for large models | Understanding real-world conversational quality |

#### ⚠️ Important Notes

| Category | Details |
|----------|---------|
| Benchmark Limitations | • Benchmarks are approximations, not guarantees<br>• Performance varies based on prompt, region, and load<br>• Your specific use case may differ from benchmark scenarios<br>• Always test with your own data before production deployment |

---

### ✅ Are You Ready to Test?

Now that you understand how to interpret benchmarks and where to find current data, let's move to hands-on testing with real AWS Bedrock models. 

The next sections will show you:

1. **Pricing comparison** - See actual costs for your use case
2. **Performance metrics** - Measure latency and throughput in real-time


## 💰 Step 3: Hands-On Pricing Analysis

This section helps you calculate costs for your specific use case. Understanding the economics is crucial for making informed decisions, especially when deploying at scale.

**What you'll learn:**
- How the AWS Bedrock pricing translates into monthly costs for your use case
- What "good" cost looks like for different scenarios
- Cost optimization strategies

**💭 Think about:** Before we reveal the pricing data, which model do you think will be the cheapest for a chatbot use case (100 input + 200 output tokens per request)? Does model size always correlate with cost?


In [None]:
import pandas as pd

# Step 1: Get the pricing data (silent mode)
from extract_bedrock_pricing import extract_bedrock_model_pricing
pricing_data, model_mapping, bedrock_pricing_json = extract_bedrock_model_pricing(verbose=False)

# Step 2: Convert to pandas DataFrame
print("Converting to DataFrame...")
df = pd.DataFrame.from_dict(bedrock_pricing_json, orient='index')

# Step 3: Clean up the DataFrame
df.reset_index(inplace=True)
df.rename(columns={'index': 'model_id', 
                   'input': '$/1M input tokens', 
                   'output': '$/1M output tokens'}, inplace=True)

# Step 4: Add derived columns (provider is the prefix of the model id, e.g. "meta")
df['provider'] = df['model_id'].str.split('.').str[0]

# Step 5: Reorder columns
df = df[['provider', 'name', '$/1M input tokens', '$/1M output tokens', 'model_id', 'region']]

print(f"✅ DataFrame ready! Shape (rows, columns): {df.shape}")


In [None]:
# Print out the table with model provider names, model names, prices per token, model ids and primary region
# The table can be sorted by any column to help you find the best model for your needs

# Filter to workshop models only
workshop_models = [
    'DeepSeek DeepSeek V3.1',
    'Meta Llama 4 Maverick 17B',
    'Meta Llama 4 Scout 17B',
    'OpenAI gpt-oss-120b',
    'OpenAI gpt-oss-20b',
    'Qwen Qwen3 235B A22B 2507',
    'Qwen Qwen3 32B',
    'Qwen Qwen3 Coder 480B A35B'
]

df_workshop = df[df['name'].isin(workshop_models)].copy()

print("\n💡 Sorting Tips:")
print("   • Sort by output cost (default): df_workshop.sort_values('$/1M output tokens', ascending=False)")
print("   • Sort by input cost: df_workshop.sort_values('$/1M input tokens')")
print("   • Sort by provider: df_workshop.sort_values('provider')")
print("   • Sort by model name: df_workshop.sort_values('name')")
print("   • Multiple columns: df_workshop.sort_values(['provider', '$/1M output tokens'])\n")

# Default view: sorted by output token cost (most expensive first)
df_workshop.sort_values('$/1M output tokens', ascending=False)

### 📊 Use Case Cost Analysis

Raw pricing per million tokens is useful, but what does it mean for your actual application? Let's calculate the monthly costs for a realistic scenario:

**Scenario**: 1 million requests per month
- **Chatbot**: 100 input tokens + 200 output tokens per request
- **Agent**: 500 input tokens + 300 output tokens per request (multi-step reasoning)
- **Document Analysis**: 50,000 input tokens + 500 output tokens per request

Let's see how costs compare across models for each use case:

In [None]:
# Define use case scenarios (tokens per request)
use_cases = {
    'Chatbot': {'input': 100, 'output': 200},
    'Agent': {'input': 500, 'output': 300},
    'Document Analysis': {'input': 50000, 'output': 500}
}

# Number of requests per month
requests_per_month = 1_000_000

# Calculate costs for each use case
for use_case_name, tokens in use_cases.items():
    # Cost per request = (input_tokens * input_price + output_tokens * output_price) / 1M
    df_workshop[f'{use_case_name} ($/month)'] = (
        (tokens['input'] * df_workshop['$/1M input tokens'] + 
         tokens['output'] * df_workshop['$/1M output tokens']) / 1_000_000
    ) * requests_per_month

# Create summary table
cost_summary = df_workshop[['name'] + [f'{uc} ($/month)' for uc in use_cases.keys()]].copy()
cost_summary = cost_summary.sort_values('Chatbot ($/month)')

print("\n💰 Monthly Cost Comparison (1M requests/month)\n")
print("=" * 80)
print("\n💡 Sorting Tips:")
print("   • Sort by chatbot cost: cost_summary.sort_values('Chatbot ($/month)')")
print("   • Sort by agent cost: cost_summary.sort_values('Agent ($/month)')")
print("   • Sort by document analysis cost: cost_summary.sort_values('Document Analysis ($/month)')")
print("   • Sort by model name: cost_summary.sort_values('name')\n")
cost_summary
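
To see the arithmetic behind one cell of that table, here is a standalone worked example of the same formula. The prices used ($0.10 per 1M input tokens, $0.40 per 1M output tokens) are hypothetical placeholders, not the price of any real model, and `monthly_cost` is an illustrative helper name.

```python
def monthly_cost(input_tokens, output_tokens, input_price_per_1m, output_price_per_1m,
                 requests_per_month=1_000_000):
    """Monthly cost in dollars for a fixed token profile per request."""
    cost_per_request = (
        input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    ) / 1_000_000
    return cost_per_request * requests_per_month

# Hypothetical prices: $0.10 per 1M input tokens, $0.40 per 1M output tokens.
chatbot_cost = monthly_cost(100, 200, 0.10, 0.40)
doc_cost = monthly_cost(50_000, 500, 0.10, 0.40)

print(f"Chatbot:           ${chatbot_cost:,.2f}/month")
print(f"Document Analysis: ${doc_cost:,.2f}/month")
```

At these placeholder prices the chatbot profile comes to about $90 per month and the document analysis profile to about $5,200, almost all of it from input tokens. That is why input price matters most for long-document workloads, while output price dominates short chat exchanges.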

### 🎯 Cost Optimization Guide

How you optimize for cost depends mostly on your use case and business model. Use this table and the tips below to evaluate cost optimization strategies for your specific scenario:

| Use Case | Key Optimization Tips |
|----------|----------------------|
| 💬 Chatbot (100 in + 200 out tokens) | Cache common queries<br>Optimize prompt length |
| 🤖 Agent (500 in + 300 out tokens) | Batch tool calls<br>Cache intermediate results<br>Early stopping |
| 📄 Document Analysis (50K in + 500 out tokens) | Use RAG approach<br>Chunk documents<br>Preprocess to extract relevant sections |
| 💻 Code Generation (100 in + 200 out tokens) | Quality matters most<br>Specialized models worth premium |
| 🖼️ Multimodal (image + text) | Resize images<br>Compress without quality loss<br>Batch requests |

**💡 General Cost Optimization Strategies:**
1. Right-size your model (don't use 120B when 20B will do)
2. Optimize prompts (shorter, clearer = fewer tokens)
3. Use streaming to improve UX without cost increase
4. Monitor usage and set max token limits
5. A/B test to validate expensive models provide value
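
As one illustration of the "cache common queries" tip from the table above, the sketch below memoizes responses by normalized prompt so that repeated questions do not trigger a second billed call. `call_model` is a stand-in stub, not a real Bedrock call, and the normalization rule (lowercase, collapsed whitespace) is an assumption that will be too aggressive or too weak for some applications.

```python
from functools import lru_cache

call_count = 0  # counts how many (simulated) billed model calls were made

def call_model(prompt):
    """Stand-in for a real model invocation; replace with your own client call."""
    global call_count
    call_count += 1
    return f"answer to: {prompt}"

def normalize(prompt):
    """Lowercase and collapse whitespace so trivially different prompts share a cache key."""
    return " ".join(prompt.lower().split())

@lru_cache(maxsize=1024)
def _cached_answer(normalized_prompt):
    return call_model(normalized_prompt)

def answer(prompt):
    return _cached_answer(normalize(prompt))

answer("What are your opening hours?")
answer("  what are your  opening hours? ")  # cache hit, no new model call
answer("How do I reset my password?")

print(f"model calls made: {call_count}")
```

An in-process `lru_cache` only helps within a single worker; a shared cache would be needed across multiple instances, and cached answers should expire if the underlying information can change.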

**🎯 Next Steps**: Now that you understand the costs, let's measure actual performance metrics to validate your model choice!

**💭 Think about:** Before we run the performance tests, make your predictions: Which model will have the fastest Time to First Token? Which will be cheapest per request? Which will have the highest throughput?

---

## 💰 Step 4: Hands-On Performance Analysis

In [None]:
from llm_compare_jupyter_clean import compare_models_simple

# pick your favorite model ids from the table above
models = [
    "deepseek.v3-v1:0",
    "openai.gpt-oss-120b-1:0",
    "openai.gpt-oss-20b-1:0",
    "meta.llama4-maverick-17b-instruct-v1:0",
    "meta.llama4-scout-17b-instruct-v1:0",
    "qwen.qwen3-235b-a22b-2507-v1:0",
    "qwen.qwen3-32b-v1:0",
    # "qwen.qwen3-coder-480b-a35b-v1:0",
    # "qwen.qwen3-coder-30b-a3b-v1:0",
]
df_metrics = compare_models_simple(models, "Explain machine learning in 100 words", bedrock_pricing_json, timeout_single_llm_sec=30)

In [None]:
# Print out the table with model performance metrics
# The table can be sorted by any column to help you compare models

print("\n💡 Sorting Tips - Try these commands to sort by different metrics:")
print("   • Sort by cost: df_metrics.drop('Response', axis=1).sort_values('Cost_Cents')")
print("   • Sort by latency: df_metrics.drop('Response', axis=1).sort_values('Latency_s')")
print("   • Sort by TTFT (fastest first, default): df_metrics.drop('Response', axis=1).sort_values('TTFT_s')")
print("   • Sort by throughput (highest first): df_metrics.drop('Response', axis=1).sort_values('Throughput_tokens_per_sec', ascending=False)")
print("   • Sort by output tokens: df_metrics.drop('Response', axis=1).sort_values('Output_Tokens', ascending=False)")
print("   • Sort by tokens/word efficiency: df_metrics.drop('Response', axis=1).sort_values('Tokens_Per_Word')\n")

# Available columns for sorting:
# Cost_Cents - Latency_s - TTFT_s - Input_Tokens - Output_Tokens - Total_Tokens
# Word_Count - Tokens_Per_Word - Throughput_tokens_per_sec - Throughput_words_per_sec
# Pricing_Type - Input_Rate_Per_1M - Output_Rate_Per_1M

#df_metrics.drop('Response', axis=1).sort_values('Cost_Cents', ascending=True)
df_metrics.drop(['Response', 'Pricing_Type', 'Input_Rate_Per_1M', 'Output_Rate_Per_1M'], axis=1).sort_values('TTFT_s', ascending=True)

### 🎯 Performance Interpretation by Use Case

Understanding what "good" performance looks like depends on your use case. Use this comprehensive table to interpret the metrics above and identify which models meet your requirements:

| Use Case | Priorities | Reference TTFT | Reference Throughput | Reference Latency | Why These Metrics Matter |
|----------|---------------|-------------|-------------------|----------------|--------------------------|
| 💬 Chatbot | TTFT > Throughput > Latency | 500ms-1s | 50-100 tokens/s | 2-6s | - Users perceive <500ms TTFT as instant<br>- High throughput improves streaming responsiveness<br>- Cost matters at scale |
| 🤖 Agent | Quality > Tool Calling > Throughput | 1-3s | 25-50 tokens/s | 5-15s | - Accuracy more important than speed for multi-step workflows<br>- Acceptable to be slower since agents run in background<br>- Higher cost justified for better reasoning |
| 📄 Document Analysis | Context > Quality > Cost | 2-10s | 25-50 tokens/s | 10-60s | - Context window must fit document size<br>- Latency less critical for batch processing<br>- Accuracy in extraction is paramount |
| 💻 Code Generation | Code Quality > Context > Speed | 1-4s | 25-50 tokens/s | 5-20s | - Correctness and syntax accuracy matter most<br>- Larger context helps understand full codebases<br>- Faster generation improves developer experience |
| 🖼️ Multimodal | Multimodal Support > Quality > Speed | 1-5s | 50-100 tokens/s | 3-10s | - Images add 100-1000+ tokens depending on resolution<br>- Visual understanding accuracy is critical |

---

### 📊 How to Use This Table

1. **Find your use case** in the leftmost column
2. **Check priority order** to understand what matters most
3. **Compare your results** from the performance metrics above against the reference values
4. **Review recommended models** for your use case in the Step 2 comparison table
5. **Read the rationale** to understand why these metrics matter
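
Step 3 of this list can be sketched in code. The example below encodes the reference ranges from the table and checks one set of measurements against them. Treating the upper end of each TTFT and latency range as a ceiling and the lower end of each throughput range as a floor is an interpretation made for this sketch, and `check_against_reference` is a hypothetical helper.

```python
# Reference values from the interpretation table above.
# Assumption: TTFT/latency use the top of the range as a ceiling,
# throughput uses the bottom of the range as a floor.
REFERENCE = {
    "Chatbot":           {"ttft_max_s": 1.0,  "throughput_min_tps": 50, "latency_max_s": 6.0},
    "Agent":             {"ttft_max_s": 3.0,  "throughput_min_tps": 25, "latency_max_s": 15.0},
    "Document Analysis": {"ttft_max_s": 10.0, "throughput_min_tps": 25, "latency_max_s": 60.0},
    "Code Generation":   {"ttft_max_s": 4.0,  "throughput_min_tps": 25, "latency_max_s": 20.0},
    "Multimodal":        {"ttft_max_s": 5.0,  "throughput_min_tps": 50, "latency_max_s": 10.0},
}

def check_against_reference(use_case, ttft_s, throughput_tps, latency_s):
    """Return a pass/fail flag per metric for the given use case."""
    ref = REFERENCE[use_case]
    return {
        "ttft_ok": ttft_s <= ref["ttft_max_s"],
        "throughput_ok": throughput_tps >= ref["throughput_min_tps"],
        "latency_ok": latency_s <= ref["latency_max_s"],
    }

# Example measurements (made-up numbers, not results from any model).
result = check_against_reference("Chatbot", ttft_s=0.8, throughput_tps=40, latency_s=4.0)
print(result)
```

The inputs correspond to the `TTFT_s`, `Throughput_tokens_per_sec`, and `Latency_s` columns of `df_metrics`. A single run is noisy, so repeat the measurement before drawing conclusions.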

**💡 Pro Tip**: The "best" model depends on your specific requirements. A model that's "poor" for chatbots might be "excellent" for document analysis. Always prioritize the metrics that matter most for YOUR use case.

**Key Insights**:
- **Chatbots**: Speed is king - users expect instant responses
- **Agents**: Quality over speed - accuracy matters more than raw performance
- **Document Analysis**: Context window is critical - must fit your document size
- **Code Generation**: Specialized models worth the premium for accuracy
- **Multimodal**: Limited options - among the workshop models, only Llama 4 Maverick and Llama 4 Scout are listed for multimodal use

**💭 Think about:** Based on what you just saw, which factor is most important for your use case? Fastest response time, lowest cost, best response quality, or largest context window?

## 🌳 Quick Decision Tree

- **📊 Need multimodal (image + text)?** → Llama 4 Maverick
- **📄 Processing very long documents?** → Llama 4 Scout (3.5M context)
- **🤖 Building complex agents/workflows?** → GPT OSS 120B or DeepSeek V3.1
- **⚡ Need fastest responses?** → GPT OSS 20B
- **💰 Budget is primary concern?** → GPT OSS 20B
- **🎓 Need step-by-step reasoning?** → DeepSeek V3.1 or GPT OSS 120B or Qwen3 235B (thinking mode)
- **🏢 Enterprise general-purpose?** → Qwen3 32B Dense
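
The decision tree above can also be written as a small function. This is a sketch only: evaluating the questions top to bottom and returning the first match is an assumption of this example (the list itself does not specify an order of precedence), and `recommend_models` with its flag names is hypothetical.

```python
def recommend_models(needs_multimodal=False, very_long_documents=False,
                     complex_agents=False, fastest_or_cheapest=False,
                     step_by_step_reasoning=False):
    """Return the suggested starting models for the first matching requirement."""
    if needs_multimodal:
        return ["Llama 4 Maverick"]
    if very_long_documents:
        return ["Llama 4 Scout"]
    if complex_agents:
        return ["GPT OSS 120B", "DeepSeek V3.1"]
    if fastest_or_cheapest:
        return ["GPT OSS 20B"]
    if step_by_step_reasoning:
        return ["DeepSeek V3.1", "GPT OSS 120B", "Qwen3 235B"]
    # Default branch: enterprise general-purpose workloads.
    return ["Qwen3 32B"]

print(recommend_models(complex_agents=True))
print(recommend_models())
```

Treat the result as a shortlist to benchmark with your own prompts rather than a final answer.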

---

## 🎉 Congratulations!

You've completed Lab 1a and learned how to:
- ✅ Identify your use case and requirements
- ✅ Match models to your specific needs
- ✅ Interpret benchmark data in context
- ✅ Analyze pricing and performance metrics

**Next**: Continue to **[Lab 1b](Lab1b_-_API_Integration_Options.ipynb)** to explore different API options (Invoke, Converse, ChatCompletions) for integrating these models into your applications.

---