# When and How to Use Extended Thinking

In this notebook, we'll dive deeper into Claude 3.7 Sonnet's extended thinking capability, exploring:

1. A task complexity classification framework
2. A decision tree for when to use extended thinking
3. Examples of appropriate use cases vs. cases where it's unnecessary
4. Performance benchmarking on different task types
5. Cost implications and optimization strategies

By the end, you'll have a systematic approach to determine when extended thinking is beneficial and how to optimize its use for your specific applications.

> **Note**: In this lesson, we're using the utility functions we developed in Lesson 1. The `claude_utils.py` module contains helper functions for creating Bedrock clients, invoking Claude with or without extended thinking, and displaying responses. This allows us to focus on the core concepts of when and how to use extended thinking rather than repeating boilerplate code.
>
> If you haven't completed Lesson 1 yet, you may want to review it first to understand how these utility functions work. Alternatively, you can examine the `claude_utils.py` file directly to see the implementation details.

In [None]:
# Import required libraries
import boto3
import json
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from IPython.display import display, Markdown, HTML
import claude_utils

# Configure plot styling
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_context("notebook", font_scale=1.2)
pd.set_option('display.max_colwidth', None)

In [None]:
# Set up the Bedrock clients using our utility module
REGION = 'us-west-2'  # Change to your preferred region
bedrock, bedrock_runtime = claude_utils.create_bedrock_clients(REGION)

# Claude 3.7 Sonnet model ID (consistent with Lesson 1)
CLAUDE_37_SONNET_MODEL_ID = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'

# Verify model availability
claude_utils.verify_model_availability(bedrock, CLAUDE_37_SONNET_MODEL_ID)

## 1. Task Complexity Classification Framework

To determine when extended thinking is beneficial, we first need a framework to classify task complexity. This will help us make systematic decisions about when to use extended thinking and how much reasoning budget to allocate.

Our framework classifies tasks into four levels of complexity:

1. **Simple**: Straightforward factual queries, basic information retrieval, simple calculations
2. **Medium**: Multi-step reasoning, moderate math problems, basic analysis tasks
3. **Complex**: In-depth analysis, complex reasoning chains, constraint problems
4. **Very Complex**: Systems design, advanced mathematical proofs, multi-stage problem solving

Let's implement a classifier function that can automatically categorize tasks based on their complexity.

In [None]:
def classify_task_complexity(prompt, model_id='us.anthropic.claude-3-5-haiku-20241022-v1:0'):
    """
    Use Claude 3.5 Haiku to quickly classify the complexity of a task
    
    Args:
        prompt (str): The user prompt to classify
        model_id (str): The model ID to use for classification (defaults to Claude 3.5 Haiku for speed/cost)
        
    Returns:
        str: Complexity classification ('simple', 'medium', 'complex', or 'very_complex')
    """
    system_prompt = [
        {
            "text": """You are a task complexity classifier. Classify the complexity of the given task into one of these categories: 'simple', 'medium', 'complex', or 'very_complex'. 

Here are examples of each complexity level:
- simple: "What is the capital of France?", "Calculate 25% of 80", "Summarize this short paragraph in one sentence"
- medium: "A man has 53 socks in his drawer: 21 blue, 15 black and 17 red. How many socks must he take out to guarantee a black pair?", "Explain the greenhouse effect and its impact on climate"
- complex: "Design a ride-sharing service that optimizes for driver availability and route efficiency", "Analyze the causes and economic impacts of the 2008 financial crisis"
- very_complex: "Given a graph with n vertices and m edges, design an O(n+m) algorithm to find all bridges", "Design a quantum computing algorithm to solve the traveling salesman problem"

Respond with only the category name, nothing else."""
        }
    ]

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "text": f"Classify the complexity of this task: {prompt}"
                }
            ]
        }
    ]

    try:
        start_time = time.time()
        response = bedrock_runtime.converse(
            modelId=model_id,
            messages=messages,
            system=system_prompt,
            inferenceConfig={
                "temperature": 0,
                "maxTokens": 10  # We only need a short response
            }
        )
        elapsed_time = time.time() - start_time

        # Extract the classification
        result = None
        if response.get('output', {}).get('message', {}).get('content'):
            content_blocks = response['output']['message']['content']
            for block in content_blocks:
                if 'text' in block:
                    result = block['text'].strip().lower()
                    break

        # Ensure the result is one of our expected categories
        valid_categories = ['simple', 'medium', 'complex', 'very_complex']
        if result not in valid_categories:
            result = 'medium'  # Default to medium if unexpected response

        # Calculate approx cost
        tokens = response['usage']['totalTokens']
        cost = tokens * 0.00000125  # Rough flat-rate assumption ($0.00125 per 1K tokens); substitute current Haiku input/output pricing
        
        print(f"Classification: {result} (determined in {elapsed_time:.2f}s, {tokens} tokens, ${cost:.6f})")
        return result

    except Exception as e:
        print(f"Error classifying task complexity: {e}")
        return "medium"  # Default to medium complexity if there's an error

### Understanding the Task Complexity Classifier

The `classify_task_complexity` function is an efficient way to automatically categorize the complexity of user prompts. Here's how it works:

1. **Leveraging a smaller model**: We use **Claude 3.5 Haiku** instead of Claude 3.7 Sonnet for this classification step because it's faster and more cost-effective for this simple decision task.

2. **Classification framework**: The function sends the prompt to Claude with explicit instructions defining four complexity categories (simple, medium, complex, very_complex) along with examples of each.

3. **Efficiency considerations**: The function is optimized for speed and cost by:
   - Using a smaller model **(Claude 3.5 Haiku)**
   - Setting temperature to 0 for deterministic responses
   - Limiting the output to just 10 tokens
   - Requesting only the category name

4. **Practical application**: Think of this as the "triage" step in our workflow - similar to how a CPU scheduler determines how much processing time to allocate to different tasks. This initial assessment helps us allocate the appropriate "thinking resources" to the task at hand.

This approach creates a "thinking pipeline" where we efficiently allocate Claude's reasoning capabilities based on task demands - using the right tool for each stage of the process.
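
The classifier only accepts replies that exactly match one of the four category names, so a reply such as "Very Complex." would silently fall back to `medium`. A minimal sketch of a more forgiving parser follows. The helper `normalize_label` is hypothetical and not part of this notebook or of `claude_utils`; it only illustrates one way to tidy the raw reply before validating it.

```python
VALID_CATEGORIES = ("simple", "medium", "complex", "very_complex")


def normalize_label(raw, default="medium"):
    """Map a raw classifier reply onto one of the four category names.

    Hypothetical helper for illustration; falls back to `default`
    when no category can be recognized.
    """
    if not raw:
        return default

    # Lowercase, trim whitespace and stray punctuation or quotes.
    cleaned = raw.strip().lower().strip("'\".,:;!` ")
    # Unify separators so "very complex" and "very-complex" both match.
    cleaned = cleaned.replace("-", "_").replace(" ", "_")

    if cleaned in VALID_CATEGORIES:
        return cleaned

    # Otherwise look for a category name inside a longer reply.
    # 'very_complex' is checked first because it contains 'complex'.
    for category in ("very_complex", "simple", "medium", "complex"):
        if category in cleaned:
            return category

    return default


print(normalize_label("Very Complex."))
print(normalize_label("The task is complex"))
print(normalize_label("no idea"))
```

Inside `classify_task_complexity`, a helper like this could replace the exact-match check. Whether a lenient parser is desirable depends on how much you trust the classifier to follow the "category name only" instruction.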

In [None]:
# Define example tasks of varying complexity
example_tasks = {
    "simple_1": "What is the capital of France?",
    "simple_2": "What is 15% of 200?",

    "medium_1": "A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of black socks?",
    "medium_2": "Compare and contrast reinforcement learning and supervised learning in AI.",

    "complex_1": "Design a system for a ride-sharing service that optimizes for driver availability, passenger wait times, and route efficiency. Include considerations for peak hours, variable pricing, and geographic distribution of drivers.",
    "complex_2": "Analyze the causes and potential solutions to the prisoner's dilemma in game theory, including real-world applications and limitations.",

    "very_complex_1": "Given a graph G with n vertices and m edges, design an algorithm to find all bridges in the graph in O(n+m) time. A bridge is defined as an edge whose removal increases the number of connected components in the graph. Provide pseudocode and explain why your algorithm achieves the desired time complexity.",
    "very_complex_2": "Develop a comprehensive framework for a central bank to implement and manage a central bank digital currency (CBDC), addressing technological architecture, monetary policy implications, privacy concerns, and financial inclusion aspects."
}

# Test the classifier on our examples
print("Testing Claude 3.5 Haiku task complexity classifier...\n")
results = {}

for label, task in example_tasks.items():
    # Show abbreviated prompt if too long
    display_prompt = task if len(task) < 100 else task[:97] + "..."
    print(f"Task {label}: \"{display_prompt}\"")
    complexity = classify_task_complexity(task)
    results[label] = complexity
    print("-" * 80 + "\n")

# Display summary as a DataFrame
summary_df = pd.DataFrame(list(results.items()),
                          columns=['Task', 'Classified Complexity'])
display(summary_df)

## 2. Decision Tree for When to Use Extended Thinking

Based on our task complexity framework, we can create a decision tree to help determine:
1. Whether to use extended thinking
2. How much reasoning budget to allocate

The decision tree takes into account:
- Task complexity
- Performance requirements
- Time sensitivity
- Cost considerations

Here's a visualization of our decision tree:

![Decision Tree](./images/lesson2/complexity.png)

Now, let's create a function to automatically determine whether to use extended thinking and what budget to allocate:

In [None]:
def determine_extended_thinking_strategy(prompt, time_sensitive=False):
    """
    Determine whether to use extended thinking and what budget to allocate
    based on task complexity and time sensitivity
    
    Args:
        prompt (str): The user prompt
        time_sensitive (bool): Whether the task is time-sensitive
        
    Returns:
        dict: Strategy with 'complexity', 'use_extended_thinking', 'reasoning_budget', and 'time_sensitive' keys
    """
    # First, classify the task complexity
    complexity = classify_task_complexity(prompt)
    
    # Define reasoning budget ranges for each complexity level
    budget_ranges = {
        'simple': (0, 0),  # No extended thinking for simple tasks
        'medium': (1024, 2048),
        'complex': (2048, 8192),
        'very_complex': (8192, 16384)
    }
    
    # Determine whether to use extended thinking based on complexity and time sensitivity
    use_extended_thinking = True
    
    if complexity == 'simple':
        use_extended_thinking = False
    elif complexity == 'medium' and time_sensitive:
        use_extended_thinking = False
    
    # Determine reasoning budget (if using extended thinking)
    if use_extended_thinking:
        min_budget, max_budget = budget_ranges[complexity]
        
        # Use the lower end of the range if time_sensitive, otherwise use the middle
        if time_sensitive:
            reasoning_budget = min_budget
        else:
            reasoning_budget = (min_budget + max_budget) // 2
    else:
        reasoning_budget = 0
    
    strategy = {
        'complexity': complexity,
        'use_extended_thinking': use_extended_thinking,
        'reasoning_budget': reasoning_budget,
        'time_sensitive': time_sensitive
    }
    
    return strategy

### Understanding the Extended Thinking Strategy Function

The `determine_extended_thinking_strategy` function applies our decision tree in code. It first classifies the task's complexity, then decides whether to use extended thinking and how large a reasoning budget to allocate, based on both complexity and time sensitivity. Like a resource manager, it routes each task to the appropriate processing path with a matching amount of "thinking power".
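
To see every outcome of the decision logic without calling a model, the sketch below reproduces the same rules for an already-classified task and prints the full table. The name `pick_budget` is hypothetical and used only for this illustration; the budget ranges are copied from the function above.

```python
# Budget ranges copied from determine_extended_thinking_strategy.
BUDGET_RANGES = {
    "simple": (0, 0),
    "medium": (1024, 2048),
    "complex": (2048, 8192),
    "very_complex": (8192, 16384),
}


def pick_budget(complexity, time_sensitive=False):
    """Return (use_extended_thinking, reasoning_budget) for a known complexity."""
    # Simple tasks, and medium tasks under time pressure, skip extended thinking.
    if complexity == "simple" or (complexity == "medium" and time_sensitive):
        return False, 0

    low, high = BUDGET_RANGES[complexity]
    # Time-sensitive tasks get the low end of the range, others the midpoint.
    budget = low if time_sensitive else (low + high) // 2
    return True, budget


for complexity in BUDGET_RANGES:
    for time_sensitive in (False, True):
        use, budget = pick_budget(complexity, time_sensitive)
        print(f"{complexity:<13} time_sensitive={time_sensitive!s:<5} "
              f"-> use={use!s:<5} budget={budget}")
```

With these ranges, a non-time-sensitive task receives the midpoint of its range: 1536 tokens for medium, 5120 for complex, and 12288 for very complex. A time-sensitive complex or very complex task receives the lower bound, 2048 or 8192 tokens.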

## 3. Examples of Appropriate Use Cases vs. Cases Where Extended Thinking is Unnecessary

Now that we have our framework and decision tree, let's examine specific examples of when extended thinking is beneficial and when it's unnecessary. We'll test both scenarios with real prompts and compare the results.
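
The comparison relies on `claude_utils.invoke_claude` to switch extended thinking on and off through its `enable_reasoning` and `reasoning_budget` arguments. If you skipped Lesson 1, the sketch below shows roughly how such a request can be assembled for the Bedrock Converse API. It is an illustration under assumptions, not the actual helper: `build_converse_request` is a hypothetical name, and adding the reasoning budget on top of `max_tokens` is one possible policy for keeping the output limit above the thinking budget.

```python
def build_converse_request(prompt, model_id, enable_reasoning=False,
                           reasoning_budget=1024, max_tokens=500):
    """Assemble keyword arguments for bedrock_runtime.converse(**request).

    Hypothetical sketch; the real claude_utils.invoke_claude may differ.
    """
    request = {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }

    if enable_reasoning:
        # Extended thinking is enabled through a model-specific request field.
        request["additionalModelRequestFields"] = {
            "thinking": {"type": "enabled", "budget_tokens": reasoning_budget}
        }
        # Thinking tokens count toward the output limit, so the limit must
        # exceed the budget. Assumed policy: reserve max_tokens for the
        # answer and add the thinking budget on top.
        request["inferenceConfig"]["maxTokens"] = max_tokens + reasoning_budget

    return request


request = build_converse_request(
    "What are the three primary colors of light?",
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    enable_reasoning=True,
    reasoning_budget=2048,
)
print(request["inferenceConfig"], request["additionalModelRequestFields"])
```

The dictionary would be passed as `bedrock_runtime.converse(**request)`. Check `claude_utils.py` to see how the real helper handles the relationship between `max_tokens` and the reasoning budget.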

In [None]:
def test_with_and_without_extended_thinking(prompt, max_tokens=500):
    """
    Test a prompt with and without extended thinking and compare the results
    
    Args:
        prompt (str): The prompt to test
        max_tokens (int): Maximum tokens for the response
        
    Returns:
        dict: Results including both responses and metrics
    """
    print(f"Testing prompt: {prompt[:100]}..." if len(prompt) > 100 else f"Testing prompt: {prompt}")
    
    # Determine optimal strategy
    strategy = determine_extended_thinking_strategy(prompt)
    print(f"Strategy: Complexity={strategy['complexity']}, Use Extended Thinking={strategy['use_extended_thinking']}, Budget={strategy['reasoning_budget']}")
    
    # Test without extended thinking
    print("\nTesting WITHOUT extended thinking...")
    standard_response = claude_utils.invoke_claude(
        bedrock_runtime,
        prompt, 
        CLAUDE_37_SONNET_MODEL_ID, 
        enable_reasoning=False,
        max_tokens=max_tokens
    )
    
    # Test with extended thinking (if recommended)
    reasoning_response = None
    if strategy['use_extended_thinking']:
        print("\nTesting WITH extended thinking...")
        reasoning_response = claude_utils.invoke_claude(
            bedrock_runtime,
            prompt, 
            CLAUDE_37_SONNET_MODEL_ID, 
            enable_reasoning=True,
            reasoning_budget=strategy['reasoning_budget'],
            max_tokens=max_tokens
        )
    
    # Display results
    standard_result = claude_utils.extract_response_content(standard_response)
    print("\n--- Standard Mode Result ---")
    standard_time = standard_response.get('_elapsed_time', 0)
    standard_tokens = standard_response.get('usage', {}).get('totalTokens', 0)
    standard_cost = (standard_response.get('usage', {}).get('inputTokens', 0) * 0.000003) + \
                   (standard_response.get('usage', {}).get('outputTokens', 0) * 0.000015)  # $15 per million output tokens
    
    print(f"Time: {standard_time:.2f}s, Tokens: {standard_tokens}, Cost: ${standard_cost:.6f}")
    display(Markdown(f"**Standard Mode Result:**\n{standard_result[:500]}..."))
    
    # Only show extended thinking results if we used it
    reasoning_result = None
    if reasoning_response:
        reasoning_result = claude_utils.extract_response_content(reasoning_response)
        print("\n--- Extended Thinking Mode Result ---")
        reasoning_time = reasoning_response.get('_elapsed_time', 0)
        reasoning_tokens = reasoning_response.get('usage', {}).get('totalTokens', 0)
        reasoning_cost = (reasoning_response.get('usage', {}).get('inputTokens', 0) * 0.000003) + \
                        (reasoning_response.get('usage', {}).get('outputTokens', 0) * 0.000015)
        
        print(f"Time: {reasoning_time:.2f}s, Tokens: {reasoning_tokens}, Cost: ${reasoning_cost:.6f}")
        display(Markdown(f"**Extended Thinking Mode Result:**\n{reasoning_result[:500]}..."))
    
    return {
        'strategy': strategy,
        'standard_response': standard_response,
        'reasoning_response': reasoning_response,
        'standard_result': standard_result,
        'reasoning_result': reasoning_result
    }

Now let's test some appropriate and inappropriate use cases for extended thinking:

In [None]:
# Example 1: Simple factual query (should NOT use extended thinking)
simple_query = "What are the three primary colors of light?"
simple_results = test_with_and_without_extended_thinking(simple_query)
print("\n" + "="*80 + "\n")

# Example 2: Complex reasoning task (SHOULD use extended thinking)
complex_query = """
Analyze the knapsack problem in computer science. Explain the dynamic programming approach 
to solve it, provide pseudocode, and analyze the time and space complexity. 
Also explain when a greedy approach might work and when it would fail.
"""
complex_results = test_with_and_without_extended_thinking(complex_query, max_tokens=800)
print("\n" + "="*80 + "\n")

# Example 3: Medium complexity task (may use extended thinking if not time-sensitive)
medium_query = """
Compare and contrast supervised learning and unsupervised learning in machine learning. 
Give examples of algorithms in each category and scenarios where one would be preferred over the other.
"""
medium_results = test_with_and_without_extended_thinking(medium_query)

## 4. Performance Benchmarking on Different Task Types

Now that we've seen individual examples, let's systematically benchmark Claude's performance across different task types with and without extended thinking. We'll measure several key metrics:

1. **Response quality** (qualitative assessment; judged by reading the outputs, not computed by the code)
2. **Time to generate response** (latency)
3. **Token usage** (and associated cost)
4. **Efficiency** (total tokens, input plus output, per second)

This benchmarking will help us quantify the tradeoffs between using extended thinking and standard mode for different task types.

In [None]:
def run_benchmarking_comparison(tasks, max_tokens=500):
    """
    Run a systematic benchmarking comparison across different task types
    
    Args:
        tasks (dict): Dictionary of task labels to prompts
        max_tokens (int): Maximum tokens for responses
        
    Returns:
        pd.DataFrame: Benchmarking results
    """
    results = []
    
    for label, prompt in tasks.items():
        print(f"\nBenchmarking task: {label}")
        print("-" * 50)
        
        # 1. Test without extended thinking
        print("Testing standard mode...")
        standard_start = time.time()
        standard_response = claude_utils.invoke_claude(
            bedrock_runtime,
            prompt, 
            CLAUDE_37_SONNET_MODEL_ID, 
            enable_reasoning=False,
            max_tokens=max_tokens
        )
        standard_time = time.time() - standard_start
        
        standard_input_tokens = standard_response.get('usage', {}).get('inputTokens', 0)
        standard_output_tokens = standard_response.get('usage', {}).get('outputTokens', 0)
        standard_total_tokens = standard_response.get('usage', {}).get('totalTokens', 0)
        standard_cost = (standard_input_tokens * 0.000003) + (standard_output_tokens * 0.000015)
        
        # 2. Test with extended thinking
        print("Testing extended thinking mode...")
        # Determine appropriate budget based on complexity
        complexity = classify_task_complexity(prompt)
        budget_map = {
            'simple': 1024,
            'medium': 2048,
            'complex': 4096,
            'very_complex': 8192
        }
        budget = budget_map.get(complexity, 2048)
        
        reasoning_start = time.time()
        reasoning_response = claude_utils.invoke_claude(
            bedrock_runtime,
            prompt, 
            CLAUDE_37_SONNET_MODEL_ID, 
            enable_reasoning=True,
            reasoning_budget=budget,
            max_tokens=max_tokens
        )
        reasoning_time = time.time() - reasoning_start
        
        reasoning_input_tokens = reasoning_response.get('usage', {}).get('inputTokens', 0)
        reasoning_output_tokens = reasoning_response.get('usage', {}).get('outputTokens', 0)
        reasoning_total_tokens = reasoning_response.get('usage', {}).get('totalTokens', 0)
        reasoning_cost = (reasoning_input_tokens * 0.000003) + (reasoning_output_tokens * 0.000015)
        
        # Calculate efficiency metrics
        standard_efficiency = standard_total_tokens / standard_time if standard_time > 0 else 0
        reasoning_efficiency = reasoning_total_tokens / reasoning_time if reasoning_time > 0 else 0
        
        # Collect results
        result = {
            'Task': label,
            'Complexity': complexity,
            'Standard_Time': standard_time,
            'Standard_Tokens': standard_total_tokens,
            'Standard_Cost': standard_cost,
            'Standard_Efficiency': standard_efficiency,
            'Reasoning_Time': reasoning_time,
            'Reasoning_Tokens': reasoning_total_tokens,
            'Reasoning_Cost': reasoning_cost,
            'Reasoning_Efficiency': reasoning_efficiency,
            'Time_Ratio': reasoning_time / standard_time if standard_time > 0 else 0,
            'Cost_Ratio': reasoning_cost / standard_cost if standard_cost > 0 else 0
        }
        
        print(f"Results: {complexity} task")
        print(f"Standard: {standard_time:.2f}s, {standard_total_tokens} tokens, ${standard_cost:.6f}")
        print(f"Reasoning: {reasoning_time:.2f}s, {reasoning_total_tokens} tokens, ${reasoning_cost:.6f}")
        
        results.append(result)
    
    # Convert to DataFrame
    df = pd.DataFrame(results)
    return df

# Define benchmark tasks
benchmark_tasks = {
    "FactualQuery": "What are the major planets in our solar system?",
    "SimpleCalc": "If I have 5 apples and give away 2, then buy 3 more, how many do I have?",
    "MediumAnalysis": "Compare the advantages and disadvantages of renewable energy vs. fossil fuels.",
    "SockDrawer": "A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of black socks?",
    "ComplexDesign": "Design a system for coordinating autonomous delivery drones in an urban environment, considering obstacles, weather, traffic patterns, and regulatory compliance.",
    "AdvancedCoding": "Explain how you would implement a distributed system for real-time processing of financial transactions that ensures ACID properties while maintaining high throughput."
}

# Run the benchmarking comparison
benchmark_results = run_benchmarking_comparison(benchmark_tasks)

# Display results
display(benchmark_results)

# Visualize the results
plt.figure(figsize=(14, 10))

# 1. Time comparison
plt.subplot(2, 2, 1)
benchmark_results.plot(kind='bar', x='Task', y=['Standard_Time', 'Reasoning_Time'], ax=plt.gca())
plt.title('Response Time Comparison')
plt.ylabel('Time (seconds)')
plt.xticks(rotation=45)

# 2. Cost comparison
plt.subplot(2, 2, 2)
benchmark_results.plot(kind='bar', x='Task', y=['Standard_Cost', 'Reasoning_Cost'], ax=plt.gca())
plt.title('Cost Comparison')
plt.ylabel('Cost ($)')
plt.xticks(rotation=45)

# 3. Efficiency comparison
plt.subplot(2, 2, 3)
benchmark_results.plot(kind='bar', x='Task', y=['Standard_Efficiency', 'Reasoning_Efficiency'], ax=plt.gca())
plt.title('Efficiency Comparison (Tokens/Second)')
plt.ylabel('Tokens per Second')
plt.xticks(rotation=45)

# 4. Ratio analysis by complexity
plt.subplot(2, 2, 4)
complexity_order = ['simple', 'medium', 'complex', 'very_complex']
benchmark_results['Complexity'] = pd.Categorical(benchmark_results['Complexity'], categories=complexity_order, ordered=True)
benchmark_results.sort_values('Complexity', inplace=True)

# Create separate scatter plots for each ratio instead of trying to plot both at once
benchmark_results.plot(kind='scatter', x='Complexity', y='Time_Ratio', color='blue', label='Time Ratio', ax=plt.gca())
benchmark_results.plot(kind='scatter', x='Complexity', y='Cost_Ratio', color='red', label='Cost Ratio', ax=plt.gca())

plt.title('Extended Thinking/Standard Ratios by Complexity')
plt.ylabel('Ratio (Extended/Standard)')
plt.xticks(rotation=45)
plt.legend()

plt.tight_layout()
plt.show()

### Understanding the Benchmarking Function

The `run_benchmarking_comparison` function serves as our experimental lab: it runs every task both with and without extended thinking. For each task it records response time, token usage, cost, and throughput, so we can put numbers on the tradeoffs and judge where extended thinking is worth its extra cost. Response quality is not scored automatically and still requires reading the outputs. Treat the results as a small controlled experiment that moves us beyond anecdotal evidence, keeping in mind that each task is run only once, so timings will vary between runs.
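
Per-task rows are easier to interpret once they are aggregated by complexity level. The sketch below shows one way to do that with pandas. The function name `summarize_by_complexity` is hypothetical; it assumes a DataFrame with the `Complexity`, `Time_Ratio`, and `Cost_Ratio` columns produced by `run_benchmarking_comparison`.

```python
import pandas as pd


def summarize_by_complexity(df: pd.DataFrame) -> pd.DataFrame:
    """Average the extended/standard ratios per complexity level.

    Hypothetical helper; expects 'Complexity', 'Time_Ratio', 'Cost_Ratio' columns.
    """
    # observed=True keeps only the levels that actually appear, which matters
    # once 'Complexity' has been converted to a categorical column.
    grouped = df.groupby("Complexity", observed=True)
    summary = grouped[["Time_Ratio", "Cost_Ratio"]].mean()
    # Record how many tasks each average is based on.
    summary["Tasks"] = grouped.size()
    return summary
```

Calling `display(summarize_by_complexity(benchmark_results))` after the benchmarking cell would show the mean time and cost ratios per level. With only one or two tasks per level, these averages are indicative at best.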

## 5. Cost Implications and Optimization Strategies

Based on our benchmarking results, we can develop some strategies to optimize the cost-performance tradeoff when using Claude's extended thinking capabilities.

### Cost Structure

Claude 3.7 Sonnet's pricing is consistent across both standard and extended thinking modes:
- **Input tokens**: $3 per million tokens
- **Output tokens**: $15 per million tokens (including thinking tokens)

This means that extended thinking primarily increases costs through additional output tokens used for reasoning.
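
To make the arithmetic concrete, here is a small sketch of a cost estimator built on the two rates listed above. The helper `estimate_cost` and the token counts in the example are illustrative assumptions, not measurements from this notebook. The example treats the full reasoning budget as consumed, which is the worst case; actual thinking usage can be lower.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens, thinking included


def estimate_cost(input_tokens, output_tokens, thinking_tokens=0):
    """Estimate the cost in USD of one call (hypothetical planning helper)."""
    # Thinking tokens are billed at the output rate.
    billed_output = output_tokens + thinking_tokens
    total = (input_tokens * INPUT_PRICE_PER_MTOK
             + billed_output * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


# Illustrative numbers: 200-token prompt, 500-token answer,
# and a 4096-token reasoning budget that is fully used.
standard = estimate_cost(200, 500)
extended = estimate_cost(200, 500, thinking_tokens=4096)

print(f"standard: ${standard:.4f}")
print(f"extended: ${extended:.4f}")
print(f"ratio:    {extended / standard:.1f}x")
```

For these example numbers, the extended call costs roughly 8.6 times as much as the standard call, and almost all of the difference comes from the thinking tokens. When working from a real response, the reported output token count is the figure to bill at the output rate.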



### Optimization Strategies

Let's explore several strategies to optimize the cost-benefit tradeoff:

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| Task Complexity Filtering | Only use extended thinking for complex and very complex tasks | Apply our classifier to route simple tasks to standard mode |
| Dynamic Budget Allocation | Allocate reasoning budget based on task complexity | Scale budget from 1024 tokens (minimum) to 16384+ tokens based on complexity |
| Two-phase Approach | Use standard mode first, only invoke extended thinking if needed | For uncertain cases or when standard mode confidence is low |
| Batch Similar Tasks | Group similar tasks to amortize the classification cost | For applications processing many similar requests (e.g., customer service) |
| Reasoning Budget Caps | Set upper limits on reasoning budgets based on ROI analysis | When diminishing returns are observed at higher budgets |
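
The two-phase approach in the table can be sketched as a small wrapper. Everything here is hypothetical: `two_phase_invoke`, `looks_unsure`, and the injected callables stand in for real model calls, and the escalation check is a toy heuristic rather than a genuine confidence measure.

```python
def two_phase_invoke(prompt, ask_standard, ask_extended, needs_escalation):
    """Try standard mode first; use extended thinking only if the check fails.

    ask_standard / ask_extended: callables taking a prompt, returning answer text.
    needs_escalation: callable (prompt, answer) -> bool.
    """
    first = ask_standard(prompt)
    if not needs_escalation(prompt, first):
        return {"answer": first, "escalated": False}
    # The standard answer was judged insufficient: pay for a second attempt.
    return {"answer": ask_extended(prompt), "escalated": True}


def looks_unsure(prompt, answer):
    """Toy heuristic (an assumption): escalate on empty or openly hedged answers."""
    markers = ("i'm not sure", "i am not sure", "cannot determine")
    text = answer.strip().lower()
    return not text or any(marker in text for marker in markers)


# Stub callables stand in for real model calls.
result = two_phase_invoke(
    "What is 2 + 2?",
    ask_standard=lambda p: "4",
    ask_extended=lambda p: "4 (after reasoning)",
    needs_escalation=looks_unsure,
)
print(result)  # {'answer': '4', 'escalated': False}
```

In practice the two callables could wrap `claude_utils.invoke_claude` with `enable_reasoning` set to `False` and `True`. Note the tradeoff: an escalated request pays for both attempts, so this pattern only saves money when most requests are resolved in the first phase.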

### Implementation Example: Cost-Optimized Extended Thinking

Below is an implementation of an optimized approach that balances cost and performance. This function:

1. Uses our complexity classifier to determine if extended thinking is needed
2. Allocates an appropriate reasoning budget based on complexity
3. Enforces budget caps to prevent excessive token usage
4. Provides transparency through detailed monitoring metrics

In [None]:
def cost_optimized_invoke(prompt, time_sensitive=False, max_tokens=1000):
    """
    Invoke Claude with cost-optimized extended thinking
    
    Args:
        prompt (str): The user prompt
        time_sensitive (bool): Whether the task is time-sensitive
        max_tokens (int): Maximum tokens for the final response
        
    Returns:
        dict: Response and performance metrics
    """
    # Step 1: Determine strategy
    strategy = determine_extended_thinking_strategy(prompt, time_sensitive=time_sensitive)
    
    # Step 2: Apply budget caps based on ROI analysis
    # (These caps would ideally be determined through extensive benchmarking)
    budget_caps = {
        'simple': 0,
        'medium': 2048,
        'complex': 4096,
        'very_complex': 8192
    }
    
    if strategy['use_extended_thinking']:
        strategy['reasoning_budget'] = min(strategy['reasoning_budget'], budget_caps[strategy['complexity']])
    
    # Step 3: Invoke Claude with the optimized settings
    response = claude_utils.invoke_claude(
        bedrock_runtime,
        prompt,
        CLAUDE_37_SONNET_MODEL_ID,
        enable_reasoning=strategy['use_extended_thinking'],
        reasoning_budget=strategy['reasoning_budget'],
        max_tokens=max_tokens
    )
    
    # Step 4: Calculate performance metrics
    elapsed_time = response.get('_elapsed_time', 0)
    input_tokens = response.get('usage', {}).get('inputTokens', 0)
    output_tokens = response.get('usage', {}).get('outputTokens', 0)
    total_tokens = response.get('usage', {}).get('totalTokens', 0)
    
    input_cost = input_tokens * 0.000003
    output_cost = output_tokens * 0.000015  # $15 per million output tokens (thinking tokens included)
    total_cost = input_cost + output_cost
    
    metrics = {
        'strategy': strategy,
        'elapsed_time': elapsed_time,
        'input_tokens': input_tokens,
        'output_tokens': output_tokens,
        'total_tokens': total_tokens,
        'input_cost': input_cost,
        'output_cost': output_cost,
        'total_cost': total_cost
    }
    
    # Add the metrics to the response for transparency
    response['_metrics'] = metrics
    
    return response

# Demonstrate the cost-optimized approach
print("Testing cost-optimized approach...")
test_prompt = """
A spaceship needs to visit 5 planets (A, B, C, D, and E) starting from Earth and then returning to Earth. 
The distances between each pair of locations are as follows (in light-years):
Earth-A: 5, Earth-B: 7, Earth-C: 8, Earth-D: 10, Earth-E: 12
A-B: 4, A-C: 6, A-D: 9, A-E: 11
B-C: 5, B-D: 8, B-E: 10
C-D: 6, C-E: 9
D-E: 7

What's the shortest possible route that visits each planet exactly once before returning to Earth?
"""

cost_optimized_response = cost_optimized_invoke(test_prompt)
claude_utils.display_claude_response(cost_optimized_response)

# Display the optimization metrics
metrics = cost_optimized_response.get('_metrics', {})
print("\nOptimization Metrics:")
print(f"Task Complexity: {metrics.get('strategy', {}).get('complexity', 'unknown')}")
print(f"Extended Thinking Used: {metrics.get('strategy', {}).get('use_extended_thinking', False)}")
print(f"Reasoning Budget: {metrics.get('strategy', {}).get('reasoning_budget', 0)} tokens")
print(f"Time: {metrics.get('elapsed_time', 0):.2f} seconds")
print(f"Cost: ${metrics.get('total_cost', 0):.6f}")

## Conclusion: A Framework for Effective Use of Extended Thinking

We've developed a systematic framework for determining when to use Claude's extended thinking capability and how to optimize its use:

1. **Classify task complexity** using our automated classifier
2. **Apply our decision tree** to determine if extended thinking is needed
3. **Allocate appropriate reasoning budget** based on complexity
4. **Monitor performance metrics** to continuously refine the approach

By following this framework, you can:
- Improve response quality for complex tasks
- Avoid unnecessary costs for simple tasks
- Balance performance and cost considerations
- Adapt the approach based on your specific requirements

In the next notebook, we'll explore how to optimize reasoning budget allocation in more detail, building on the foundation established here.