# Token Efficient Tool Use with Claude on Amazon Bedrock

This notebook demonstrates how to use Claude's token efficient tool use feature, currently in beta, on Amazon Bedrock. According to the documentation, the feature reduces output token consumption by 14% on average, and by up to 70%, when Claude 3.7 Sonnet uses tools (function calling).

We'll cover three ways to implement this feature:
1. Using Bedrock's InvokeModel API
2. Using Bedrock's Converse API
3. Using the AnthropicBedrock SDK

Let's start by installing our dependencies and configuring a Bedrock runtime client.

In [2]:
# Install required packages into the notebook kernel's environment
%pip install --upgrade boto3 anthropic plotly pandas



In [26]:
import json
import time

import anthropic
import boto3
import botocore.config
import pandas as pd
import plotly.express as px

# Configure AWS session - make sure you have appropriate credentials configured
# You can set your region according to your preference
region_name = "us-west-2"  # Change to your preferred region

# Configure retry settings with increased max attempts
retry_config = botocore.config.Config(
    retries={
        'max_attempts': 10,
        'mode': 'adaptive'
    }
)

# Initialize the Bedrock runtime client
bedrock_runtime = boto3.client(service_name="bedrock-runtime", region_name=region_name, config=retry_config)

## Define a Sample Tool

Let's define a sample weather tool that we'll use throughout our examples:

In [8]:
# Define a sample weather tool
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
            }
        },
        "required": [
            "location"
        ]
    }
}

# Model ID for Claude 3.7 Sonnet on Bedrock
claude_model_id = "us.anthropic.claude-3-7-sonnet-20250219-v1:0"

## Helper Functions for Processing Tool Use

Let's define some helper functions to process tool use responses and simulate the execution of our weather tool:

In [9]:
def get_weather(location):
    """Simulates getting weather information for a location."""
    # In a real application, you would call a weather API here
    weather_data = {
        "San Francisco, CA": {"temperature": 62, "condition": "Foggy", "humidity": 80},
        "New York, NY": {"temperature": 75, "condition": "Sunny", "humidity": 65},
        "Chicago, IL": {"temperature": 58, "condition": "Cloudy", "humidity": 70},
        "Miami, FL": {"temperature": 85, "condition": "Partly Cloudy", "humidity": 75},
        "Seattle, WA": {"temperature": 55, "condition": "Rainy", "humidity": 90},
    }
    
    return weather_data.get(location, {"temperature": 70, "condition": "Unknown", "humidity": 60})

def process_tool_calls(response_body):
    """Process tool calls from Claude's response."""
    # Extract the content from the response
    content = response_body.get("content", [])
    messages = []
    
    for block in content:
        if block.get("type") == "text":
            messages.append({"role": "assistant", "content": block.get("text")})
        elif block.get("type") == "tool_use":
            tool_name = block.get("name")
            tool_input = block.get("input", {})
            tool_id = block.get("id")
            
            # Process based on the tool
            if tool_name == "get_weather":
                location = tool_input.get("location")
                weather_result = get_weather(location)
                
                # Create a tool result message
                messages.append({
                    "role": "user",
                    "content": [{
                        "type": "tool_result",
                        "tool_use_id": tool_id,
                        "content": json.dumps(weather_result)
                    }]
                })
    
    return messages
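
Note that `process_tool_calls` collects text and tool results but does not build a complete conversation. To send results back to Claude, the Messages format expects the assistant turn containing the `tool_use` blocks to come before the user turn carrying the matching `tool_result` blocks. The following is a minimal sketch, assuming the InvokeModel response shape used in this notebook; `build_followup_messages` and `run_tool` are hypothetical names introduced for illustration, and the sample body is hand-written rather than real model output.

```python
import json


def build_followup_messages(prompt, response_body, run_tool):
    """Return the message list for a follow-up request after a tool call.

    `run_tool(name, tool_input)` is a hypothetical callback that executes a
    tool and returns a JSON-serializable result.
    """
    content = response_body.get("content", [])

    # One tool_result block per tool_use block, matched by id
    tool_results = [
        {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": json.dumps(run_tool(block["name"], block.get("input", {}))),
        }
        for block in content
        if block.get("type") == "tool_use"
    ]

    messages = [
        {"role": "user", "content": prompt},
        # Echo the assistant turn unchanged so the tool_use ids line up
        {"role": "assistant", "content": content},
    ]
    if tool_results:
        messages.append({"role": "user", "content": tool_results})
    return messages


# Hand-written sample response body (an assumption, not real model output)
sample_body = {
    "content": [
        {"type": "text", "text": "Let me check the weather."},
        {"type": "tool_use", "id": "toolu_example", "name": "get_weather",
         "input": {"location": "Seattle, WA"}},
    ]
}
followup = build_followup_messages(
    "Is it raining in Seattle?",
    sample_body,
    run_tool=lambda name, tool_input: {"condition": "Rainy"},
)
```

The resulting list could be passed as `messages` in a second request, together with the same `tools` definition.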

## 1. Using Bedrock's InvokeModel API

Let's implement token efficient tool use with Bedrock's InvokeModel API. We'll compare the token usage with and without the token efficient mode.

In [12]:
def call_claude_with_tool(prompt, tools, use_token_efficient=False):
    """Call Claude with tools, with or without token efficient mode."""
    # Prepare the request body
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "tools": tools
    }
    
    # Add the beta flag for token efficient tool use if requested
    if use_token_efficient:
        request_body["anthropic_beta"] = ["token-efficient-tools-2025-02-19"]
    
    # Make the API call
    start_time = time.time()
    response = bedrock_runtime.invoke_model(
        modelId=claude_model_id,
        body=json.dumps(request_body)
    )
    end_time = time.time()
    
    # Parse the response
    response_body = json.loads(response["body"].read())
    
    # Return the response and metrics
    return {
        "response": response_body,
        "latency": end_time - start_time,
        "usage": response_body.get("usage", {})
    }
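
Since the two modes differ only in the `anthropic_beta` field, it can help to build the request body in a separate, side-effect-free function that can be inspected without calling Bedrock. The following is a sketch under that assumption; `build_request_body` and `TOKEN_EFFICIENT_BETA` are hypothetical names, not part of boto3.

```python
TOKEN_EFFICIENT_BETA = "token-efficient-tools-2025-02-19"


def build_request_body(prompt, tools, use_token_efficient=False, max_tokens=1024):
    """Build the InvokeModel JSON body used in this notebook (hypothetical helper)."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }
    if use_token_efficient:
        # On Bedrock the beta flag travels inside the request body
        body["anthropic_beta"] = [TOKEN_EFFICIENT_BETA]
    return body


example_tool = {
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}
standard_body = build_request_body("Is it raining in Seattle?", [example_tool])
efficient_body = build_request_body(
    "Is it raining in Seattle?", [example_tool], use_token_efficient=True
)

# The two bodies differ only by the beta field
print(sorted(set(efficient_body) - set(standard_body)))  # ['anthropic_beta']
```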

In [13]:
# Define the test prompts
weather_prompts = [
    "What's the weather like in San Francisco?",
    "I need to know the current weather in New York.",
    "Tell me about the weather in Chicago.",
    "What's the temperature in Miami right now?",
    "Is it raining in Seattle?"
]

# Compare standard vs token efficient tool use
results = []

for prompt in weather_prompts:
    # Standard tool use
    standard_result = call_claude_with_tool(
        prompt=prompt,
        tools=[weather_tool],
        use_token_efficient=False
    )
    
    # Token efficient tool use
    efficient_result = call_claude_with_tool(
        prompt=prompt,
        tools=[weather_tool],
        use_token_efficient=True
    )
    
    # Simulate tool execution to illustrate the flow. The results are not sent
    # back to Claude, and the follow-up response is not needed for our metrics.
    standard_tool_messages = process_tool_calls(standard_result["response"])
    efficient_tool_messages = process_tool_calls(efficient_result["response"])
    
    # Collect results for comparison
    results.append({
        "prompt": prompt,
        "standard_input_tokens": standard_result["usage"].get("input_tokens", 0),
        "standard_output_tokens": standard_result["usage"].get("output_tokens", 0),
        "standard_latency": standard_result["latency"],
        "efficient_input_tokens": efficient_result["usage"].get("input_tokens", 0),
        "efficient_output_tokens": efficient_result["usage"].get("output_tokens", 0),
        "efficient_latency": efficient_result["latency"]
    })

In [14]:
# Convert results to DataFrame and calculate savings
df_invoke = pd.DataFrame(results)
df_invoke["output_token_savings"] = (1 - df_invoke["efficient_output_tokens"] / df_invoke["standard_output_tokens"]) * 100
df_invoke["latency_improvement"] = (1 - df_invoke["efficient_latency"] / df_invoke["standard_latency"]) * 100

# Display results
df_invoke[[
    "prompt", 
    "standard_output_tokens", 
    "efficient_output_tokens", 
    "output_token_savings", 
    "standard_latency", 
    "efficient_latency", 
    "latency_improvement"
]]

| | prompt | standard_output_tokens | efficient_output_tokens | output_token_savings (%) | standard_latency (s) | efficient_latency (s) | latency_improvement (%) |
|---|---|---|---|---|---|---|---|
| 0 | What's the weather like in San Francisco? | 79 | 56 | 29.113924 | 11.805819 | 2.033625 | 82.774383 |
| 1 | I need to know the current weather in New York. | 69 | 56 | 18.84058 | 2.709124 | 3.366777 | -24.275499 |
| 2 | Tell me about the weather in Chicago. | 75 | 54 | 28.0 | 2.850984 | 2.476536 | 13.133986 |
| 3 | What's the temperature in Miami right now? | 67 | 67 | 0.0 | 2.785608 | 2.023611 | 27.354781 |
| 4 | Is it raining in Seattle? | 75 | 63 | 16.0 | 14.393139 | 8.015981 | 44.306931 |
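
The savings column is computed as (1 - efficient / standard) × 100. As a worked check, the first row above (79 standard output tokens, 56 efficient output tokens) gives about 29.1%. The sketch below wraps the formula in a hypothetical `percent_savings` helper that also guards against a zero baseline, which the DataFrame expression does not.

```python
def percent_savings(standard, efficient):
    """Percentage reduction of `efficient` relative to `standard`."""
    if standard == 0:
        # No baseline to compare against; avoid dividing by zero
        return float("nan")
    return (1 - efficient / standard) * 100


print(round(percent_savings(79, 56), 2))  # 29.11
print(round(percent_savings(67, 67), 2))  # 0.0
```

A negative value means the token efficient call produced more output tokens than the standard call.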


In [15]:
# Plot output token comparison
fig = px.bar(df_invoke, 
             x="prompt", 
             y=["standard_output_tokens", "efficient_output_tokens"],
             barmode="group",
             labels={"value": "Output Tokens", "variable": "Method"},
             title="Output Token Comparison: Standard vs. Token Efficient Tool Use")

fig.update_layout(xaxis_title="Prompt", yaxis_title="Output Tokens")
fig.show()

In [18]:
# Plot token savings percentage
fig = px.bar(df_invoke,
             x="prompt",
             y="output_token_savings",
             title="Output Token Savings (%) with Token Efficient Tool Use")

fig.update_layout(xaxis_title="Prompt", yaxis_title="Token Savings (%)")
fig.add_shape(type="line", line=dict(dash="dash", width=2, color="red"),
              y0=df_invoke["output_token_savings"].mean(), y1=df_invoke["output_token_savings"].mean(),
              x0=0, x1=1, xref="paper")
fig.add_annotation(x=0.5, y=df_invoke["output_token_savings"].mean(),
                   text=f"Average: {df_invoke['output_token_savings'].mean():.1f}%",
                   showarrow=False, yshift=10)
fig.show()

## 2. Using Bedrock's Converse API

Next, let's implement token efficient tool use with the Converse API, which provides a unified interface for conversational interactions.

In [24]:
def call_claude_converse(prompt, tools, use_token_efficient=False):
    """Call Claude with tools using the Converse API."""
    # Prepare the tool configuration
    tool_config = {
        "tools": [
            {
                "toolSpec": {
                    "name": tool["name"],
                    "description": tool.get("description", ""),
                    "inputSchema": {
                        "json": tool["input_schema"]
                    }
                }
            } for tool in tools
        ]
    }
    
    # Set additional model request fields for token efficient mode
    additional_fields = {}
    if use_token_efficient:
        additional_fields["anthropic_beta"] = ["token-efficient-tools-2025-02-19"]

    # Make the API call
    start_time = time.time()
    response = bedrock_runtime.converse(
        modelId=claude_model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        toolConfig=tool_config,
        additionalModelRequestFields=additional_fields
    )
    end_time = time.time()
    
    # Return the response and metrics
    return {
        "response": response,
        "latency": end_time - start_time,
        "usage": {
            "input_tokens": response.get("usage", {}).get("inputTokens", 0),
            "output_tokens": response.get("usage", {}).get("outputTokens", 0)
        }
    }
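
The Converse API returns tool calls in its own shape: `toolUse` blocks under `output.message.content`, which are answered with `toolResult` blocks in a user message. The following is a minimal sketch of handling them, assuming that response layout; `converse_tool_results` and `run_tool` are hypothetical names, and the sample response is hand-written rather than real model output.

```python
def converse_tool_results(response, run_tool):
    """Build a Converse-style user message answering each toolUse block.

    `run_tool(name, tool_input)` is a hypothetical callback returning a dict.
    Returns None when the response contains no tool calls.
    """
    blocks = response.get("output", {}).get("message", {}).get("content", [])
    results = []
    for block in blocks:
        tool_use = block.get("toolUse")
        if tool_use is None:
            continue  # Skip text and other block types
        results.append({
            "toolResult": {
                "toolUseId": tool_use["toolUseId"],
                "content": [{"json": run_tool(tool_use["name"], tool_use["input"])}],
            }
        })
    return {"role": "user", "content": results} if results else None


# Hand-written sample response (an assumption, not real model output)
sample_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {"text": "I'll look that up."},
                {"toolUse": {"toolUseId": "tooluse_example", "name": "get_weather",
                             "input": {"location": "Miami, FL"}}},
            ],
        }
    },
    "stopReason": "tool_use",
}
reply = converse_tool_results(
    sample_response, run_tool=lambda name, tool_input: {"temperature": 85}
)
```

To continue the conversation, `reply` would be appended to `messages` after the assistant message taken from `response["output"]["message"]`.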

In [27]:
# Compare standard vs token efficient tool use with Converse API
converse_results = []

for prompt in weather_prompts:
    # Standard tool use
    standard_result = call_claude_converse(
        prompt=prompt,
        tools=[weather_tool],
        use_token_efficient=False
    )
    
    # Token efficient tool use
    efficient_result = call_claude_converse(
        prompt=prompt,
        tools=[weather_tool],
        use_token_efficient=True
    )
    
    # Collect results for comparison
    converse_results.append({
        "prompt": prompt,
        "standard_input_tokens": standard_result["usage"].get("input_tokens", 0),
        "standard_output_tokens": standard_result["usage"].get("output_tokens", 0),
        "standard_latency": standard_result["latency"],
        "efficient_input_tokens": efficient_result["usage"].get("input_tokens", 0),
        "efficient_output_tokens": efficient_result["usage"].get("output_tokens", 0),
        "efficient_latency": efficient_result["latency"]
    })

In [28]:
# Convert results to DataFrame and calculate savings
df_converse = pd.DataFrame(converse_results)
df_converse["output_token_savings"] = (1 - df_converse["efficient_output_tokens"] / df_converse["standard_output_tokens"]) * 100
df_converse["latency_improvement"] = (1 - df_converse["efficient_latency"] / df_converse["standard_latency"]) * 100

# Display results
df_converse[[
    "prompt", 
    "standard_output_tokens", 
    "efficient_output_tokens", 
    "output_token_savings", 
    "standard_latency", 
    "efficient_latency", 
    "latency_improvement"
]]

| | prompt | standard_output_tokens | efficient_output_tokens | output_token_savings (%) | standard_latency (s) | efficient_latency (s) | latency_improvement (%) |
|---|---|---|---|---|---|---|---|
| 0 | What's the weather like in San Francisco? | 69 | 64 | 7.246377 | 5.934444 | 3.780445 | 36.29656 |
| 1 | I need to know the current weather in New York. | 69 | 61 | 11.594203 | 11.355925 | 2.754162 | 75.746916 |
| 2 | Tell me about the weather in Chicago. | 67 | 54 | 19.402985 | 2.994733 | 2.09697 | 29.978067 |
| 3 | What's the temperature in Miami right now? | 67 | 54 | 19.402985 | 2.241361 | 6.695315 | -198.716502 |
| 4 | Is it raining in Seattle? | 68 | 62 | 8.823529 | 12.268326 | 6.25829 | 48.988232 |


In [29]:
# Compare token savings between InvokeModel and Converse API
comparison_df = pd.DataFrame({
    "prompt": df_invoke["prompt"],
    "InvokeModel": df_invoke["output_token_savings"],
    "Converse": df_converse["output_token_savings"]
})

# Reshape for plotting
comparison_df_long = pd.melt(comparison_df, 
                             id_vars=["prompt"],
                             value_vars=["InvokeModel", "Converse"],
                             var_name="API", 
                             value_name="token_savings")

# Plot comparison
fig = px.bar(comparison_df_long,
             x="prompt",
             y="token_savings",
             color="API",
             barmode="group",
             title="Token Savings Comparison: InvokeModel vs Converse API")

fig.update_layout(xaxis_title="Prompt", yaxis_title="Token Savings (%)")
fig.show()

## 3. Using the AnthropicBedrock SDK

Finally, let's implement token efficient tool use with the AnthropicBedrock SDK, which provides a Python-native interface to Anthropic models on Bedrock.

In [31]:
# Create an AnthropicBedrock client
anthropic_client = anthropic.AnthropicBedrock(
    aws_region=region_name,
    max_retries=10
)

In [32]:
def call_claude_sdk(prompt, tools, use_token_efficient=False):
    """Call Claude with tools using the AnthropicBedrock SDK."""
    # Create parameters for the call
    params = {
        "max_tokens": 1024,
        "model": claude_model_id,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools
    }
    
    # Add the beta flag for token efficient tool use if requested
    if use_token_efficient:
        # Use the beta.messages client for token efficient tool use
        start_time = time.time()
        response = anthropic_client.beta.messages.create(
            **params,
            betas=["token-efficient-tools-2025-02-19"]
        )
        end_time = time.time()
    else:
        # Use the standard messages client
        start_time = time.time()
        response = anthropic_client.messages.create(**params)
        end_time = time.time()
    
    # Return the response and metrics
    return {
        "response": response,
        "latency": end_time - start_time,
        "usage": {
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens
        }
    }
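
The two branches above differ only in which `create` method is called and whether `betas` is passed, so the timing logic can be factored out. The following is a sketch of one way to do that; `timed` and `create_message` are hypothetical helpers, and the stand-in client below only mimics the attribute layout of the real SDK client so the example runs without AWS credentials.

```python
import time
from types import SimpleNamespace

TOKEN_EFFICIENT_BETA = "token-efficient-tools-2025-02-19"


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def create_message(client, params, use_token_efficient=False):
    """Pick the beta or standard endpoint, then time a single call."""
    if use_token_efficient:
        return timed(client.beta.messages.create, **params, betas=[TOKEN_EFFICIENT_BETA])
    return timed(client.messages.create, **params)


# Stand-in client (an assumption for illustration, not the real SDK)
def _fake_create(**kwargs):
    return SimpleNamespace(received=kwargs)


fake_client = SimpleNamespace(
    messages=SimpleNamespace(create=_fake_create),
    beta=SimpleNamespace(messages=SimpleNamespace(create=_fake_create)),
)

response, latency = create_message(
    fake_client, {"max_tokens": 1024}, use_token_efficient=True
)
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring intervals.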

In [33]:
# Compare standard vs token efficient tool use with AnthropicBedrock SDK
sdk_results = []

for prompt in weather_prompts:
    # Standard tool use
    standard_result = call_claude_sdk(
        prompt=prompt,
        tools=[weather_tool],
        use_token_efficient=False
    )
    
    # Token efficient tool use
    efficient_result = call_claude_sdk(
        prompt=prompt,
        tools=[weather_tool],
        use_token_efficient=True
    )
    
    # Collect results for comparison
    sdk_results.append({
        "prompt": prompt,
        "standard_input_tokens": standard_result["usage"].get("input_tokens", 0),
        "standard_output_tokens": standard_result["usage"].get("output_tokens", 0),
        "standard_latency": standard_result["latency"],
        "efficient_input_tokens": efficient_result["usage"].get("input_tokens", 0),
        "efficient_output_tokens": efficient_result["usage"].get("output_tokens", 0),
        "efficient_latency": efficient_result["latency"]
    })

In [34]:
# Convert results to DataFrame and calculate savings
df_sdk = pd.DataFrame(sdk_results)
df_sdk["output_token_savings"] = (1 - df_sdk["efficient_output_tokens"] / df_sdk["standard_output_tokens"]) * 100
df_sdk["latency_improvement"] = (1 - df_sdk["efficient_latency"] / df_sdk["standard_latency"]) * 100

# Display results
df_sdk[[
    "prompt", 
    "standard_output_tokens", 
    "efficient_output_tokens", 
    "output_token_savings", 
    "standard_latency", 
    "efficient_latency", 
    "latency_improvement"
]]

| | prompt | standard_output_tokens | efficient_output_tokens | output_token_savings (%) | standard_latency (s) | efficient_latency (s) | latency_improvement (%) |
|---|---|---|---|---|---|---|---|
| 0 | What's the weather like in San Francisco? | 72 | 56 | 22.222222 | 2.629597 | 2.092134 | 20.439002 |
| 1 | I need to know the current weather in New York. | 69 | 56 | 18.84058 | 2.597271 | 4.023732 | -54.92154 |
| 2 | Tell me about the weather in Chicago. | 77 | 78 | -1.298701 | 2.622714 | 3.085642 | -17.650706 |
| 3 | What's the temperature in Miami right now? | 67 | 51 | 23.880597 | 2.046802 | 4.582233 | -123.872807 |
| 4 | Is it raining in Seattle? | 75 | 55 | 26.666667 | 9.537919 | 16.057863 | -68.358139 |


## Overall Comparison

Let's compare the token savings across all three methods - InvokeModel, Converse, and the AnthropicBedrock SDK.

In [35]:
# Create a comparison DataFrame for all three methods
all_comparison_df = pd.DataFrame({
    "prompt": df_invoke["prompt"],
    "InvokeModel": df_invoke["output_token_savings"],
    "Converse": df_converse["output_token_savings"],
    "AnthropicBedrock SDK": df_sdk["output_token_savings"]
})

# Calculate the average savings for each method
average_savings = {
    "InvokeModel": df_invoke["output_token_savings"].mean(),
    "Converse": df_converse["output_token_savings"].mean(),
    "AnthropicBedrock SDK": df_sdk["output_token_savings"].mean()
}

# Reshape for plotting
all_comparison_df_long = pd.melt(all_comparison_df, 
                                 id_vars=["prompt"],
                                 value_vars=["InvokeModel", "Converse", "AnthropicBedrock SDK"],
                                 var_name="Method", 
                                 value_name="Token Savings (%)")

In [36]:
# Plot the comparison
fig = px.bar(all_comparison_df_long,
             x="prompt",
             y="Token Savings (%)",
             color="Method",
             barmode="group",
             title="Token Savings Comparison Across All Methods")

fig.update_layout(xaxis_title="Prompt", yaxis_title="Token Savings (%)")
fig.show()

# Plot the average savings by method
fig = px.bar(x=list(average_savings.keys()), 
             y=list(average_savings.values()),
             title="Average Token Savings by Method")

fig.update_layout(xaxis_title="Method", yaxis_title="Average Token Savings (%)")
# Add a horizontal line at the overall average
overall_avg = sum(average_savings.values()) / len(average_savings)
fig.add_shape(type="line", line=dict(dash="dash", width=2, color="red"),
              y0=overall_avg, y1=overall_avg,
              x0=-0.5, x1=2.5)
fig.add_annotation(x=1, y=overall_avg,
                   text=f"Overall Average: {overall_avg:.1f}%",
                   showarrow=False, yshift=10)
fig.show()
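
Averaging per-prompt percentages gives every prompt equal weight regardless of how many tokens it produced. An alternative summary pools the token counts for each method before computing the percentage. The sketch below applies both to the output token counts reported in the results tables above; the function and variable names are introduced here for illustration.

```python
# (standard_output_tokens, efficient_output_tokens) per prompt,
# copied from the three results tables in this notebook
token_counts = {
    "InvokeModel": [(79, 56), (69, 56), (75, 54), (67, 67), (75, 63)],
    "Converse": [(69, 64), (69, 61), (67, 54), (67, 54), (68, 62)],
    "AnthropicBedrock SDK": [(72, 56), (69, 56), (77, 78), (67, 51), (75, 55)],
}


def mean_of_savings(pairs):
    """Average of the per-prompt savings percentages."""
    return sum((1 - eff / std) * 100 for std, eff in pairs) / len(pairs)


def pooled_savings(pairs):
    """Savings computed on token totals across all prompts."""
    total_std = sum(std for std, _ in pairs)
    total_eff = sum(eff for _, eff in pairs)
    return (1 - total_eff / total_std) * 100


for method, pairs in token_counts.items():
    print(f"{method}: mean {mean_of_savings(pairs):.1f}%, pooled {pooled_savings(pairs):.1f}%")
```

With these counts, the per-prompt means come out at about 18.4%, 13.3%, and 18.1%, and the pooled values at about 18.9%, 13.2%, and 17.8%, for InvokeModel, Converse, and the SDK respectively. With only five prompts per method, differences between methods of this size are likely within run-to-run variation.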

## Best Practices for Token Efficient Tool Use

Based on our experiments and the documentation, here are some best practices for using token efficient tool use:

1. **Always benchmark with your specific use case**: Token savings vary significantly with your prompts and tools. The documented average reduction is around 14%, with up to 70% in the best case, but individual prompts in our runs ranged from a 1% increase to a 29% reduction.

2. **Consistency in caching**: If you're using prompt caching along with token efficient tool use, make sure to use the beta header consistently for requests you'd like to cache. Selective use will cause prompt caching to fail.

3. **SDK version compatibility**: Use a recent version of the `anthropic` package, so that the `AnthropicBedrock` client exposes `beta.messages.create` and its `betas` parameter.

4. **Not compatible with disable_parallel_tool_use**: Token efficient tool use doesn't currently work with the `disable_parallel_tool_use` option.

5. **Response quality monitoring**: As a beta feature, it's important to evaluate the quality of responses when using token efficient tool use to ensure that the reduction in tokens doesn't affect quality.

6. **Latency is not guaranteed to improve**: Generating fewer output tokens can reduce latency, which helps interactive applications, but our single-run timings were noisy: the token efficient call was slower in 6 of 15 comparisons. Repeat your measurements before drawing conclusions about latency.

7. **Use with caution in production**: Since this is a beta feature, consider testing thoroughly before deploying to production systems.
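
Single wall-clock measurements like the ones in this notebook are noisy, and they include any time spent in client-side retries. One way to follow the benchmarking advice above is to repeat each call and compare medians. The following is a minimal sketch, assuming a zero-argument callable that performs one request; `median_latency` is a hypothetical helper, and the lambda below is a stand-in workload rather than a real Bedrock call.

```python
import statistics
import time


def median_latency(call, repeats=5):
    """Run `call` several times and return the median elapsed seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; in this notebook the callable might wrap
# call_claude_with_tool(prompt, [weather_tool], use_token_efficient=True)
print(f"{median_latency(lambda: sum(range(10_000))):.6f} s")
```

The median is less sensitive than the mean to a single slow request, such as one that was throttled and retried.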

## Conclusion

Token efficient tool use reduced output token consumption in most of our runs with Claude's tool use capabilities. Our experiments showed:

- Average output token savings of about 13% with the Converse API and about 18% with InvokeModel and the AnthropicBedrock SDK
- Savings in 13 of 15 comparisons; one prompt showed no change and one used a single extra token (per-prompt range: about -1% to 29%)
- Mixed latency results: the token efficient call was faster for 4 of 5 prompts with InvokeModel and with Converse, but for only 1 of 5 with the SDK, so these single-run timings do not support a firm conclusion

This feature is most valuable for applications that make heavy use of tool calls, where the output token savings add up over time and reduce costs. Generating fewer output tokens may also lower latency, but you should measure this on your own workload.

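To enable token efficient tool use with Claude 3.7 Sonnet on Amazon Bedrock, add the beta flag `token-efficient-tools-2025-02-19` to your requests: as `anthropic_beta` in the InvokeModel request body or in the Converse API's `additionalModelRequestFields`, or as `betas` with the AnthropicBedrock SDK.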