# Semantic Tool Selection: Token Efficiency & Accuracy Analysis

Based on: [Internal Representations as Indicators of Hallucinations in Agent Tool Selection](https://arxiv.org/pdf/2601.05214)

## What This Demo Measures

This notebook compares a **Traditional** agent (all 29 tools sent on every query) with a **Semantic** agent (only the top-3 retrieved tools per query) on two metrics:

1. **Token Consumption**: How many tokens are used per query?
2. **Tool Selection Accuracy**: Does the agent pick the correct tool?

### The Dual Problem

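- ❌ **Token Waste**: Sending all 29 tool descriptions on every query costs about 1,800 total tokens per query in the recorded run below
- ❌ **Hallucination Risk**: More tools mean more near-duplicate candidates, which makes a wrong tool selection more likely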

### The Solution

```
User Query → FAISS Search → Top 3 Tools → Agent → Correct Selection + Fewer Tokens
```
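
The retrieval step in the pipeline above can be sketched without FAISS or an embedding model. The snippet below is a minimal stand-in, not the notebook's `registry` implementation: it ranks tools by cosine similarity over bag-of-words vectors instead of sentence embeddings, and the entries in `TOOL_DESCRIPTIONS` as well as the helper names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List

# Hypothetical tool descriptions, for illustration only.
TOOL_DESCRIPTIONS = {
    "get_weather": "get the current weather for a city",
    "search_flights": "search flights between two cities",
    "search_hotels": "search hotels in a city",
    "get_currency_exchange": "convert an amount between two currencies",
}

def vectorize(text: str) -> Counter:
    """Bag-of-words vector (a crude stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k_tools(query: str, k: int = 3) -> List[str]:
    """Return the names of the k tools whose descriptions best match the query."""
    q = vectorize(query)
    ranked = sorted(
        TOOL_DESCRIPTIONS,
        key=lambda name: cosine(q, vectorize(TOOL_DESCRIPTIONS[name])),
        reverse=True,
    )
    return ranked[:k]

selected = top_k_tools("search for hotels in Barcelona", k=2)
print(selected)  # ['search_hotels', 'search_flights']
```

Only the selected tools would then be handed to the agent. A real embedding model captures paraphrases that word overlap misses, which is the reason the notebook uses FAISS over sentence embeddings.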

In [None]:
!pip install -q -r requirements.txt

## Configure API Key

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

# Uncomment to set manually:
# os.environ['OPENAI_API_KEY'] = 'your-key-here'

assert os.getenv('OPENAI_API_KEY'), 'Set OPENAI_API_KEY in .env file or uncomment line above'

## Setup

In [1]:
import sys
import io
import re
from typing import Dict, List, Tuple

from strands import Agent
from strands.models.openai import OpenAIModel
from enhanced_tools import ALL_TOOLS
from registry import build_index, search_tools, swap_tools

print(f"✅ Loaded {len(ALL_TOOLS)} tools")

✅ Loaded 29 tools


## Build Semantic Index

In [2]:
build_index(ALL_TOOLS)
print("✅ FAISS index built")



BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


✅ Indexed 29 tools
✅ FAISS index built


## Test Queries

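Each test pairs a query with the tool the agent is expected to call. The suite mixes deliberately ambiguous queries with near-duplicate hotel, flight, and booking tools: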

In [3]:
TESTS = [
    # Ambiguous queries - could match multiple tools
    ("Search for something in Paris", "search"),  # Generic vs search_hotels
    ("Check something", "check"),  # Generic vs check_hotel_availability
    ("Get some details", "get_details"),  # Generic vs get_hotel_details
    ("What's the status?", "get_status"),  # Generic vs get_flight_status
    ("I need information", "get_info"),  # Generic vs specific tools
    
    # Hotel queries - similar tools
    ("Search hotels in Barcelona", "search_hotels"),  # vs search_real_hotels
    ("Find real hotels in France", "search_real_hotels"),  # vs search_hotels
    ("How much does a hotel cost?", "get_hotel_pricing"),  # vs get_hotel_details
    ("What amenities does the hotel have?", "get_hotel_details"),  # vs get_hotel_pricing
    ("Is the hotel available tomorrow?", "check_hotel_availability"),  # vs check_hotel_availability_dates
    ("Check availability from March 15 to 18", "check_hotel_availability_dates"),  # vs check_hotel_availability
    
    # Flight queries - similar tools
    ("Find flights to Tokyo", "search_flights"),  # vs search_flight_prices
    ("How much do flights cost?", "search_flight_prices"),  # vs search_flights
    ("Is flight AA123 on time?", "get_flight_status"),  # vs get_flight_details
    
    # Booking queries - generic vs specific
    ("Book a hotel for me", "book_hotel"),  # vs book (generic)
    ("Book a flight for me", "book_flight"),  # vs book (generic)
    ("Book something", "book"),  # Generic
    
    # Clear specific queries
    ("Convert 500 USD to EUR", "get_currency_exchange"),
    ("Do I need a visa for Spain?", "get_travel_documents"),
]

print(f"📋 Test suite: {len(TESTS)} queries (ambiguous queries to test tool confusion)")
print(f"📊 Tool pool: {len(ALL_TOOLS)} tools")

📋 Test suite: 13 queries
📊 Tool pool: 29 tools


## Helper Functions

In [4]:
def run_and_capture_with_tokens(agent, query: str) -> Tuple[List[str], Dict]:
    """Run agent and capture tool calls + token usage"""
    old_stdout = sys.stdout
    sys.stdout = io.StringIO()
    
    try:
        result = agent(query)
        output = sys.stdout.getvalue()
    finally:
        sys.stdout = old_stdout
    
    # Extract tool calls
    tools = re.findall(r'Tool #\d+: (\w+)', output)
    
    # Extract token usage from result
    tokens = {'input': 0, 'output': 0, 'total': 0}
    
    if hasattr(result, 'metrics'):
        summary = result.metrics.get_summary()
        usage = summary.get('accumulated_usage', {})
        tokens['input'] = usage.get('inputTokens', 0)
        tokens['output'] = usage.get('outputTokens', 0)
        tokens['total'] = usage.get('totalTokens', 0)
    
    # Fallback: estimate if no usage data
    if tokens['total'] == 0:
        num_tools = len(agent.tool_registry.get_all_tools_config())
        tokens['input'] = num_tools * 50 + 100  # ~50 tokens per tool + prompt
        tokens['output'] = 50  # estimated
        tokens['total'] = tokens['input'] + tokens['output']
        tokens['estimated'] = True
    
    return tools, tokens
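
To make the two extraction paths concrete, the sketch below runs the same regular expression on a hand-written sample of captured output and reproduces the fallback arithmetic. The sample text is an assumption about what the captured stdout looks like (the helper only relies on the `Tool #N: name` pattern), and `estimate_tokens` is a hypothetical name for the fallback branch.

```python
import re

# Hand-written sample; real captured output may contain other text around these lines.
sample_output = """
Tool #1: search_hotels
Some intermediate text from the agent.
Tool #2: get_hotel_pricing
"""

# Same pattern as the helper: capture the tool name after "Tool #<n>: ".
called = re.findall(r"Tool #\d+: (\w+)", sample_output)
print(called)  # ['search_hotels', 'get_hotel_pricing']

def estimate_tokens(num_tools: int) -> dict:
    """Mirror the fallback: ~50 tokens per tool, 100 for the prompt, 50 for the reply."""
    input_tokens = num_tools * 50 + 100
    output_tokens = 50
    return {
        "input": input_tokens,
        "output": output_tokens,
        "total": input_tokens + output_tokens,
        "estimated": True,
    }

print(estimate_tokens(29)["total"])  # 1600
print(estimate_tokens(3)["total"])   # 300
```

The estimate is a rough heuristic, so rows marked `(est)` in the test output should not be compared directly with rows that report measured usage.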

## Test 1: Traditional Approach (All 29 Tools)

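The agent receives all 29 tools on every query. The recorded output below is truncated and lists queries that differ from the current `TESTS` list, so its rows do not match the suite one to one.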

In [5]:
MODEL = OpenAIModel(model_id="gpt-4o-mini")
PROMPT = "You are a travel assistant. Use the correct tool to answer questions."

print("="*80)
print(f"TEST 1: TRADITIONAL - {len(ALL_TOOLS)} tools every query")
print("="*80)

trad_results = []
trad_correct = 0
trad_total_tokens = 0

for query, expected in TESTS:
    agent = Agent(tools=ALL_TOOLS, system_prompt=PROMPT, model=MODEL)
    tools, tokens = run_and_capture_with_tokens(agent, query)
    
    ok = expected in tools
    trad_correct += ok
    trad_total_tokens += tokens['total']
    
    trad_results.append({
        'query': query,
        'expected': expected,
        'actual': tools,
        'correct': ok,
        'tokens': tokens
    })
    
    status = '✅' if ok else '❌'
    est = ' (est)' if tokens.get('estimated') else ''
    print(f"{status} {query[:50]:50} | {tokens['total']:5} tokens{est}")
    print(f"   Called: {tools[:2] if tools else 'NO TOOL'}")

print(f"\n📊 Traditional Results:")
print(f"   Accuracy: {trad_correct}/{len(TESTS)} ({100*trad_correct/len(TESTS):.1f}%)")
print(f"   Total tokens: {trad_total_tokens:,}")
print(f"   Avg tokens/query: {trad_total_tokens/len(TESTS):.0f}")

TEST 1: TRADITIONAL - 29 tools every query
✅ What's the current weather in Paris?               |  1806 tokens
   Called: ['get_weather']
✅ Is it raining in Tokyo right now?                  |  1815 tokens
   Called: ['get_weather']
✅ Find me flights from New York to London next week  |  1855 tokens
   Called: ['search_flights']
✅ What's the status of flight AA123?                 |  1821 tokens
   Called: ['get_flight_status']
❌ How much do flights to Tokyo cost?                 |   900 tokens
   Called: NO TOOL
❌ Book a hotel room in Rome for John Smith           |   935 tokens
   Called: NO TOOL
✅ Search for hotels in Barcelona                     |  1822 tokens
   Called: ['search_hotels']
✅ How much does a room at the Marriott cost?         |  1827 tokens
   Called: ['get_hotel_pricing']
❌ Show me the top-rated hotels in the database       |  1835 tokens
   Called: ['search_hotels']
❌ Find hotels in France with rating above 9.0        |  2915 tokens
   Called: 

## Test 2: Semantic Approach (Top-3 Filtered Tools)

Agent receives only the 3 most relevant tools per query.

In [6]:
print("="*80)
print("TEST 2: SEMANTIC - Top-3 tools per query")
print("="*80)

sem_results = []
sem_correct = 0
sem_total_tokens = 0

for query, expected in TESTS:
    selected = search_tools(query, top_k=3)
    selected_names = [t.__name__ for t in selected]
    
    agent = Agent(tools=selected, system_prompt=PROMPT, model=MODEL)
    tools, tokens = run_and_capture_with_tokens(agent, query)
    
    ok = expected in tools
    sem_correct += ok
    sem_total_tokens += tokens['total']
    
    sem_results.append({
        'query': query,
        'expected': expected,
        'selected': selected_names,
        'actual': tools,
        'correct': ok,
        'tokens': tokens
    })
    
    status = '✅' if ok else '❌'
    est = ' (est)' if tokens.get('estimated') else ''
    print(f"{status} {query[:50]:50} | {tokens['total']:5} tokens{est}")
    print(f"   Available: {selected_names}")
    print(f"   Called: {tools[:2] if tools else 'NO TOOL'}")

print(f"\n📊 Semantic Results:")
print(f"   Accuracy: {sem_correct}/{len(TESTS)} ({100*sem_correct/len(TESTS):.1f}%)")
print(f"   Total tokens: {sem_total_tokens:,}")
print(f"   Avg tokens/query: {sem_total_tokens/len(TESTS):.0f}")

TEST 2: SEMANTIC - Top-3 tools per query
✅ What's the current weather in Paris?               |   284 tokens
   Available: ['get_weather', 'get_weather_forecast', 'get_weather_alerts']
   Called: ['get_weather']
✅ Is it raining in Tokyo right now?                  |   295 tokens
   Available: ['get_weather_alerts', 'get_weather', 'get_weather_forecast']
   Called: ['get_weather']
✅ Find me flights from New York to London next week  |   374 tokens
   Available: ['search_flight_prices', 'search_flights', 'get_flight_details']
   Called: ['search_flights']
✅ What's the status of flight AA123?                 |   321 tokens
   Available: ['get_flight_status', 'check_flight_availability', 'get_flight_details']
   Called: ['get_flight_status']
❌ How much do flights to Tokyo cost?                 |   161 tokens
   Available: ['search_flight_prices', 'search_flights', 'book_flight']
   Called: NO TOOL
❌ Book a hotel room in Rome for John Smith           |   164 tokens
   Available:

## Test 3: Semantic + Memory (Single Agent)

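A single agent handles every query. Before each query its tools are replaced with the top-3 matches via `swap_tools()`, while the conversation history is kept.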
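
The idea behind this test can be illustrated with a toy agent. This is a sketch under stated assumptions, not the `swap_tools` implementation from `registry` and not the Strands API: `ToyAgent`, its `tools` dict, and the word-count prompt size are all hypothetical. It shows why keeping one agent preserves context but makes each call more expensive as history accumulates.

```python
from typing import Callable, Dict, List

class ToyAgent:
    """Hypothetical stand-in for an agent with conversation memory."""

    def __init__(self, tools: List[Callable]):
        self.tools: Dict[str, Callable] = {t.__name__: t for t in tools}
        self.messages: List[str] = []

    def swap_tools(self, new_tools: List[Callable]) -> None:
        # Replace the tool set in place; the message history is left untouched.
        self.tools = {t.__name__: t for t in new_tools}

    def ask(self, query: str) -> int:
        """Record the query and return a crude prompt size (history words + tool count)."""
        self.messages.append(query)
        history_words = sum(len(m.split()) for m in self.messages)
        return history_words + len(self.tools)

def get_weather(city: str) -> str:
    return f"Sunny in {city}"

def search_flights(destination: str) -> str:
    return f"Flights to {destination}"

agent = ToyAgent([get_weather])
first = agent.ask("What is the weather in Paris?")
agent.swap_tools([search_flights])
second = agent.ask("Find flights to Tokyo")

print(list(agent.tools))    # ['search_flights']
print(len(agent.messages))  # 2
print(first, second)        # 7 11
```

The second call is larger than the first even though the tool count is unchanged, which matches the growing per-query totals in the recorded Test 3 output.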

In [7]:
print("="*80)
print("TEST 3: SEMANTIC + MEMORY - Single agent, dynamic tool swapping")
print("="*80)

initial_tools = search_tools(TESTS[0][0], top_k=3)
memory_agent = Agent(tools=initial_tools, system_prompt=PROMPT, model=MODEL)

mem_results = []
mem_correct = 0
mem_total_tokens = 0

for query, expected in TESTS:
    selected = search_tools(query, top_k=3)
    selected_names = [t.__name__ for t in selected]
    swap_tools(memory_agent, selected)
    
    tools, tokens = run_and_capture_with_tokens(memory_agent, query)
    
    ok = expected in tools
    mem_correct += ok
    mem_total_tokens += tokens['total']
    
    mem_results.append({
        'query': query,
        'expected': expected,
        'selected': selected_names,
        'actual': tools,
        'correct': ok,
        'tokens': tokens,
        'messages': len(memory_agent.messages)
    })
    
    status = '✅' if ok else '❌'
    est = ' (est)' if tokens.get('estimated') else ''
    print(f"{status} {query[:50]:50} | {tokens['total']:5} tokens{est}")
    print(f"   Available: {selected_names} | Memory: {len(memory_agent.messages)} msgs")
    print(f"   Called: {tools[:2] if tools else 'NO TOOL'}")

print(f"\n📊 Semantic+Memory Results:")
print(f"   Accuracy: {mem_correct}/{len(TESTS)} ({100*mem_correct/len(TESTS):.1f}%)")
print(f"   Total tokens: {mem_total_tokens:,}")
print(f"   Avg tokens/query: {mem_total_tokens/len(TESTS):.0f}")
print(f"   Final conversation: {len(memory_agent.messages)} messages")

TEST 3: SEMANTIC + MEMORY - Single agent, dynamic tool swapping
✅ What's the current weather in Paris?               |   284 tokens
   Available: ['get_weather', 'get_weather_forecast', 'get_weather_alerts'] | Memory: 4 msgs
   Called: ['get_weather']
✅ Is it raining in Tokyo right now?                  |   688 tokens
   Available: ['get_weather_alerts', 'get_weather', 'get_weather_forecast'] | Memory: 8 msgs
   Called: ['get_weather']
✅ Find me flights from New York to London next week  |  1294 tokens
   Available: ['search_flight_prices', 'search_flights', 'get_flight_details'] | Memory: 12 msgs
   Called: ['search_flights']
✅ What's the status of flight AA123?                 |  2036 tokens
   Available: ['get_flight_status', 'check_flight_availability', 'get_flight_details'] | Memory: 16 msgs
   Called: ['get_flight_status']
✅ How much do flights to Tokyo cost?                 |  2967 tokens
   Available: ['search_flight_prices', 'search_flights', 'book_flight'] | Memory:

## Comparative Analysis

In [None]:
print("="*80)
print("COMPARATIVE ANALYSIS")
print("="*80)

print(f"\n📊 Accuracy Comparison:")
print(f"   Traditional:      {trad_correct}/{len(TESTS)} ({100*trad_correct/len(TESTS):.1f}%)")
print(f"   Semantic:         {sem_correct}/{len(TESTS)} ({100*sem_correct/len(TESTS):.1f}%)")
print(f"   Semantic+Memory:  {mem_correct}/{len(TESTS)} ({100*mem_correct/len(TESTS):.1f}%)")

if sem_correct > trad_correct:
    print(f"   ✅ Semantic improved by +{sem_correct - trad_correct} queries")
elif sem_correct < trad_correct:
    print(f"   ⚠️  Semantic decreased by {trad_correct - sem_correct} queries")
else:
    print(f"   ➡️  Same accuracy")

print(f"\n💰 Token Consumption:")
print(f"   Traditional:      {trad_total_tokens:,} tokens ({trad_total_tokens/len(TESTS):.0f} avg)")
print(f"   Semantic:         {sem_total_tokens:,} tokens ({sem_total_tokens/len(TESTS):.0f} avg)")
print(f"   Semantic+Memory:  {mem_total_tokens:,} tokens ({mem_total_tokens/len(TESTS):.0f} avg)")

if trad_total_tokens > 0:
    sem_savings = trad_total_tokens - sem_total_tokens
    mem_savings = trad_total_tokens - mem_total_tokens
    print(f"\n💡 Token Savings:")
    print(f"   Semantic vs Traditional:  {sem_savings:,} tokens ({100*sem_savings/trad_total_tokens:.1f}% reduction)")
    print(f"   Memory vs Traditional:    {mem_savings:,} tokens ({100*mem_savings/trad_total_tokens:.1f}% reduction)")
    
    if mem_total_tokens > sem_total_tokens:
        overhead = mem_total_tokens - sem_total_tokens
        print(f"   Memory overhead:          +{overhead:,} tokens (conversation history)")
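
As a worked example of the savings arithmetic, the sketch below applies the same formulas to the first four queries of the recorded Test 1 to Test 3 output, where every approach called the expected tool. It covers four queries only, so the percentages are illustrative and are not the notebook's final results; `reduction` is a hypothetical helper name.

```python
# Per-query totals copied from the recorded output of Tests 1-3 (first four queries).
trad = [1806, 1815, 1855, 1821]
sem = [284, 295, 374, 321]
mem = [284, 688, 1294, 2036]

def reduction(baseline: int, other: int) -> float:
    """Percentage of baseline tokens saved by the other approach."""
    return 100 * (baseline - other) / baseline

trad_total, sem_total, mem_total = sum(trad), sum(sem), sum(mem)

print(f"Semantic vs Traditional: {reduction(trad_total, sem_total):.1f}% reduction")  # 82.5%
print(f"Memory vs Traditional:   {reduction(trad_total, mem_total):.1f}% reduction")  # 41.0%
print(f"Memory overhead:         +{mem_total - sem_total:,} tokens")                  # +3,028
```

On these rows the per-query semantic agent saves the most, while the single agent with memory gives back part of that saving because every call also carries the accumulated conversation.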

## Per-Query Token Breakdown

In [None]:
print("="*80)
print("PER-QUERY TOKEN BREAKDOWN")
print("="*80)

print(f"\n{'Query':<50} {'Trad':>8} {'Sem':>8} {'Mem':>8} {'Saved':>8}")
print("-"*80)

for i in range(len(TESTS)):
    query = TESTS[i][0][:49]
    trad_tok = trad_results[i]['tokens']['total']
    sem_tok = sem_results[i]['tokens']['total']
    mem_tok = mem_results[i]['tokens']['total']
    saved = trad_tok - sem_tok
    
    print(f"{query:<50} {trad_tok:8} {sem_tok:8} {mem_tok:8} {saved:8}")

## Error Analysis

In [None]:
print("="*80)
print("ERROR ANALYSIS")
print("="*80)

print(f"\n🔍 Traditional Errors:")
trad_errors = [r for r in trad_results if not r['correct']]
if trad_errors:
    for r in trad_errors:
        print(f"   ❌ '{r['query'][:60]}'")
        print(f"      Expected: {r['expected']}, Got: {r['actual']}")
else:
    print("   ✅ No errors")

print(f"\n🔍 Semantic Errors:")
sem_errors = [r for r in sem_results if not r['correct']]
if sem_errors:
    for r in sem_errors:
        print(f"   ❌ '{r['query'][:60]}'")
        print(f"      Expected: {r['expected']}, Got: {r['actual']}")
        print(f"      Available: {r['selected']}")
        if r['expected'] not in r['selected']:
            print(f"      ⚠️  Correct tool NOT in top-3 (FAISS filtering issue)")
else:
    print("   ✅ No errors")

print(f"\n🔍 Semantic+Memory Errors:")
mem_errors = [r for r in mem_results if not r['correct']]
if mem_errors:
    for r in mem_errors:
        print(f"   ❌ '{r['query'][:60]}'")
        print(f"      Expected: {r['expected']}, Got: {r['actual']}")
        print(f"      Available: {r['selected']}")
else:
    print("   ✅ No errors")
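
The semantic errors above fall into distinct failure modes, and separating them shows whether the retrieval step or the model is at fault. The sketch below is one possible way to label them; `classify_error` and the sample records are hypothetical, shaped like the entries in `sem_results`.

```python
from typing import Dict, List

def classify_error(result: Dict) -> str:
    """Label a failed query by where the pipeline went wrong."""
    if result["expected"] not in result["selected"]:
        return "retrieval_miss"     # the top-k filter dropped the correct tool
    if not result["actual"]:
        return "no_tool_called"     # correct tool was offered, agent called nothing
    return "wrong_tool_called"      # correct tool was offered, agent picked another

# Hypothetical records with the same keys as sem_results entries.
examples: List[Dict] = [
    {"expected": "book_hotel",
     "selected": ["search_hotels", "get_hotel_details", "get_hotel_pricing"],
     "actual": ["search_hotels"]},
    {"expected": "search_flight_prices",
     "selected": ["search_flight_prices", "search_flights", "book_flight"],
     "actual": []},
    {"expected": "get_hotel_pricing",
     "selected": ["get_hotel_pricing", "get_hotel_details", "search_hotels"],
     "actual": ["get_hotel_details"]},
]

labels = [classify_error(r) for r in examples]
print(labels)  # ['retrieval_miss', 'no_tool_called', 'wrong_tool_called']
```

A retrieval miss points at the index or at `top_k`, whereas the other two labels point at the prompt or the model, so the remedies differ.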

## Summary

### Key Findings

| Approach | Tools/Query | Accuracy | Avg Tokens | Token Savings |
|----------|-------------|----------|------------|---------------|
| Traditional | 29 tools | X% | ~Y tokens | Baseline |
| Semantic | 3 tools | X% | ~Y tokens | Z% |
| Semantic+Memory | 3 tools | X% | ~Y tokens | Z% |

### Why Semantic Tool Selection Works

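1. **Reduced Token Waste**: Sending 3 tools instead of 29 puts roughly 90% fewer tool descriptions in each prompt
2. **Better Accuracy**: Fewer candidates leave less room for confusion between near-duplicate tools, provided the correct tool survives the top-3 filter
3. **Production Ready**: `swap_tools()` changes the tool set on a live agent while preserving conversation memory, at the cost of a prompt that grows with the history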

### References

- [Internal Representations as Indicators of Hallucinations in Agent Tool Selection](https://arxiv.org/pdf/2601.05214)