# Context Distraction: How Long Context Histories Degrade Recall Accuracy

Context distraction is a **real production problem** affecting agents that perform complex, multi-step tasks.

## The Problem

As LLM agents perform research tasks with many operations, each tool call and result accumulates in the conversation context. With complex tasks requiring dozens of tool calls, the context becomes extremely long. The Berkeley Function-Calling Leaderboard and recent research show that **LLMs struggle to maintain recall accuracy over very long contexts**.

**Context Distraction Definition:** Context distraction occurs when tool call results and intermediate outputs accumulated across many steps overwhelm the LLM, so that it loses track of specific information from earlier steps and its recall accuracy degrades.

The challenge: Agents need to complete complex tasks requiring many steps, but each step adds context that can bury important details.
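To make the accumulation concrete, here is a minimal sketch (not the notebook's agent) that simulates a ReAct-style loop appending every tool call and result to one shared message list. The function names, the 4-characters-per-token heuristic, and the fixed 2,000-character tool result are all assumptions chosen for illustration.

```python
def approx_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def simulate_context_growth(num_steps: int, result_chars: int = 2000) -> list[int]:
    """Return the approximate context size (in tokens) after each step.

    Every step appends one tool call and one tool result to a single shared
    message list, the way a simple ReAct loop does.
    """
    messages = ["user: research five technology sectors"]
    sizes = []
    for step in range(1, num_steps + 1):
        messages.append(f"assistant: call tool #{step}")
        messages.append("tool: " + "x" * result_chars)  # placeholder tool result
        sizes.append(sum(approx_tokens(m) for m in messages))
    return sizes


sizes = simulate_context_growth(num_steps=50)
print(f"After 1 step:   ~{sizes[0]} tokens")
print(f"After 50 steps: ~{sizes[-1]} tokens")
```

Under these assumptions the context grows by roughly one tool result per step, so facts gathered in step 2 end up buried under dozens of later results by the time the agent writes its final answer.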

## What We'll Explore

In this notebook, we'll use a multi-domain investment research task to:
1. **Identify** how context length degrades recall accuracy in a standard agent
2. **Measure** the impact on recall accuracy using real research tasks
3. **Compare** two approaches: standard agent vs. optimized graph agent with context isolation
4. **Validate** improvements through comprehensive evaluations

The goal: Build agents that maintain high recall accuracy even across very long, complex task sequences.

## Two Agent Approaches We'll Compare

**Standard Agent** - Simple ReAct loop
- All tool call results accumulate in context
- No separation between planning and execution
- Context grows linearly with number of steps

**Graph Agent** - Optimized with context isolation
- Uses supervisor/researcher pattern to isolate context
- Reflection tools maintain plans over long tasks
- Critical information passed explicitly between nodes
- Context managed strategically to preserve important details

## Setup

In [None]:
# Imports
import asyncio
from typing import Dict, Any
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd
from IPython.display import display
from dotenv import load_dotenv

load_dotenv()

# Import our test infrastructure
from context_distraction.resources.test_tasks import TEST_TASKS
from context_distraction.tests.evaluators import (
    recall_accuracy_evaluator,
    tool_call_completeness_evaluator,
    tool_call_efficiency_evaluator,
    extract_answers_json_from_text,
)
from context_distraction.tests.setup_datasets import build_reference_outputs
from context_distraction.resources.validation_utils import extract_tool_calls_from_message

print("✓ Setup complete")

## The Research Task

We'll evaluate agents on a complex investment analysis task covering 5 technology sectors:
- Renewable energy
- Artificial intelligence
- Electric vehicles
- Quantum computing
- Biotechnology

Each domain requires gathering statistics, expert opinions, and case studies, and then performing financial calculations, including:
- Compound growth projections (10-year)
- Cost-benefit analysis with NPV calculations
- Correlation analyses
- Investment portfolio optimization

The agent must then answer 9 specific questions requiring precise recall of facts from throughout the research process.
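The formulas used by the notebook's calculation tools are not shown here. As a rough sketch, the first two calculation types could look like the following, assuming the standard textbook definitions: compound growth as `PV * (1 + r) ** n`, and NPV as the sum of discounted cash flows minus the initial cost. The function names and all sample numbers are invented for illustration.

```python
def compound_growth(present_value: float, annual_rate: float, years: int = 10) -> float:
    """Project a value forward with annual compounding."""
    return present_value * (1 + annual_rate) ** years


def net_present_value(initial_cost: float, cash_flows: list[float], discount_rate: float) -> float:
    """Discount each year's cash flow (year 1, 2, ...) and subtract the upfront cost."""
    discounted = sum(
        cf / (1 + discount_rate) ** year
        for year, cf in enumerate(cash_flows, start=1)
    )
    return discounted - initial_cost


# Made-up inputs: a 100-unit market growing 10% per year for 10 years,
# and a 1000-unit investment returning 400 per year for 3 years at an 8% discount rate.
projected = compound_growth(100.0, 0.10, years=10)
npv = net_present_value(1000.0, [400.0, 400.0, 400.0], 0.08)
print(f"10-year projection: {projected:.2f}")
print(f"NPV: {npv:.2f}")
```

Results like these are exactly the kind of precise intermediate value that the recall questions later ask the agent to reproduce.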

In [None]:
# Show test case 1 structure
task_1 = TEST_TASKS[0]

print(f"Task: {task_1['name']}")
print(f"\nDomains: {', '.join(task_1['topics'])}")
print(f"\nExpected operations per domain:")
print(f"  - Statistics: {task_1['stats_count']}")
print(f"  - Expert opinions: {task_1['expert_count']}")
print(f"  - Case studies: {task_1['case_count']}")
print(f"  - Historical years: {task_1['year_count']}")
print(f"  - Domain comparisons: {task_1['compare_count']}")
print(f"\nTotal operations: ~40-50 tool calls")
print(f"Recall questions: {len(task_1['recall_questions'])}")
print(f"\nSample questions:")
for i, q in enumerate(task_1['recall_questions'][:3], 1):
    print(f"  {i}. {q.split('(Expected')[0].strip()}")

## Agent Runner Functions

These functions run each agent type and extract structured outputs for evaluation.

In [None]:
# Import agents
from context_distraction.agent import agent as standard_agent
from context_distraction.graph import graph as graph_agent
from langchain_core.messages import HumanMessage

async def run_standard_agent(query: str) -> dict:
    """Run standard agent and extract trajectory using streaming."""
    trajectory = []
    final_response = ""
    all_messages = []
    
    async for chunk in standard_agent.astream(
        {"messages": [("user", query)]},
        stream_mode="updates",
    ):
        if isinstance(chunk, tuple) and len(chunk) >= 2:
            namespace, data = chunk
        elif isinstance(chunk, dict):
            data = chunk
        else:
            continue
        
        if isinstance(data, dict):
            for key in ['tools', 'model']:
                if key in data:
                    msgs = data[key].get('messages', [])
                    all_messages.extend(msgs)
                    
                    for msg in msgs:
                        tool_calls = extract_tool_calls_from_message(msg)
                        for tc in tool_calls:
                            trajectory.append(tc)
    
    # Extract final response
    for msg in reversed(all_messages):
        if isinstance(msg, dict) and msg.get("content"):
            final_response = msg["content"]
            break
        elif hasattr(msg, 'content') and msg.content:
            final_response = msg.content
            break
    
    return {"final_response": final_response, "trajectory": trajectory}


async def run_graph_agent(query: str) -> dict:
    """Run graph agent and extract outputs using streaming."""
    trajectory = []
    final_response = ""
    all_messages = []
    
    config = {"recursion_limit": 200}
    
    async for chunk in graph_agent.astream(
        {"supervisor_messages": [HumanMessage(content=query)]},
        config=config,
        subgraphs=True,
        stream_mode="updates",
    ):
        if isinstance(chunk, tuple) and len(chunk) >= 2:
            namespace, data = chunk
        elif isinstance(chunk, dict):
            data = chunk
        else:
            continue
        
        if isinstance(data, dict):
            for node_key, node_data in data.items():
                if isinstance(node_data, dict):
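                    # Check both spellings of the researcher key, since the graph's state schema is not shown here.
                    for msg_key in ['supervisor_messages', 'researcher_messages', 'reseacher_messages', 'messages']: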
                        if msg_key in node_data and isinstance(node_data[msg_key], list):
                            msgs = node_data[msg_key]
                            all_messages.extend(msgs)
                            
                            for msg in msgs:
                                tool_calls = extract_tool_calls_from_message(msg)
                                for tc in tool_calls:
                                    trajectory.append(tc)
    
    # Extract final response
    for msg in reversed(all_messages):
        if isinstance(msg, dict) and msg.get("content"):
            final_response = msg["content"]
            break
        elif hasattr(msg, 'content') and msg.content:
            final_response = msg.content
            break
    
    return {"final_response": final_response, "trajectory": trajectory}

print("✓ Defined agent runners")

## Run Test Case 1 on Both Agents

Let's run the first test case on both agents and evaluate their performance.

**Note:** This will take several minutes as the agents conduct comprehensive research across 5 domains.

In [None]:
# Run test case 1
task = TEST_TASKS[0]
reference_outputs = build_reference_outputs(task)
inputs = {"query": task["query"]}

print(f"Running Test Case 1: {task['name']}")
print(f"This involves {len(task['topics'])} domains and ~40-50 tool calls\n")

print("Running standard agent...")
standard_outputs = await run_standard_agent(task["query"])
print(f"  ✓ Completed {len(standard_outputs['trajectory'])} tool calls\n")

print("Running graph agent...")
graph_outputs = await run_graph_agent(task["query"])
print(f"  ✓ Completed {len(graph_outputs['trajectory'])} tool calls\n")

print("✓ Both agents completed")

## Evaluate Results

Let's evaluate both agents on:
1. **Recall Accuracy** - Can the agent recall specific facts from throughout the research?
2. **Tool Call Completeness** - Did the agent make all necessary tool calls?
3. **Tool Call Efficiency** - How many extra tool calls were made?
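The evaluators are called with `(inputs, outputs, reference_outputs)` and return a dictionary with `score` and `comment` entries. Their implementation lives in the test package and is not shown, so the following is only a sketch of how a recall metric with that shape might work. `toy_recall_evaluator`, `values_match`, the `answers` output key, and the 1% numeric tolerance are hypothetical.

```python
import math


def values_match(actual, expected, rel_tol: float = 0.01) -> bool:
    """Compare numerically (within rel_tol) when both values parse as numbers,
    otherwise as case-insensitive strings."""
    if actual is None:
        return False
    try:
        return math.isclose(float(actual), float(expected), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return str(actual).strip().lower() == str(expected).strip().lower()


def toy_recall_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Score = fraction of expected answers the agent reproduced."""
    expected = reference_outputs["expected_answers"]
    answers = outputs.get("answers", {})  # hypothetical key holding parsed answers
    correct = [q for q, exp in expected.items() if values_match(answers.get(q), exp)]
    score = len(correct) / len(expected) if expected else 0.0
    return {
        "key": "recall_accuracy",
        "score": score,
        "comment": f"{len(correct)}/{len(expected)} answers matched",
    }


result = toy_recall_evaluator(
    {},
    {"answers": {"1": "12.5", "2": "Tesla", "3": 40}},
    {"expected_answers": {"1": 12.5, "2": "tesla", "3": 42}},
)
print(result["score"], result["comment"])
```

A tolerance-based comparison is one possible design choice: it avoids penalizing harmless formatting differences while still catching a number the agent has misremembered.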

In [None]:
# Evaluate standard agent
standard_recall = recall_accuracy_evaluator(inputs, standard_outputs, reference_outputs)
standard_completeness = tool_call_completeness_evaluator(inputs, standard_outputs, reference_outputs)
standard_efficiency = tool_call_efficiency_evaluator(inputs, standard_outputs, reference_outputs)

print("Standard Agent Results:")
print(f"  Recall Accuracy: {standard_recall['score']:.1%}")
print(f"  Tool Call Completeness: {standard_completeness['score']:.1%}")
print(f"  Tool Call Efficiency: {standard_efficiency['score']:.2f}")
print(f"\n{standard_recall['comment']}\n")

# Evaluate graph agent
graph_recall = recall_accuracy_evaluator(inputs, graph_outputs, reference_outputs)
graph_completeness = tool_call_completeness_evaluator(inputs, graph_outputs, reference_outputs)
graph_efficiency = tool_call_efficiency_evaluator(inputs, graph_outputs, reference_outputs)

print("Graph Agent Results:")
print(f"  Recall Accuracy: {graph_recall['score']:.1%}")
print(f"  Tool Call Completeness: {graph_completeness['score']:.1%}")
print(f"  Tool Call Efficiency: {graph_efficiency['score']:.2f}")
print(f"\n{graph_recall['comment']}")

## Visualize Comparison

Let's create charts to visualize the performance difference between the two approaches.

In [None]:
# Create comparison data
# Plotly renders line breaks with <br>, not "\n", so keep the label on one line.
agents = ["Standard Agent", "Graph Agent (Context Isolation)"]
recall_scores = [standard_recall['score'], graph_recall['score']]
completeness_scores = [standard_completeness['score'], graph_completeness['score']]
efficiency_scores = [standard_efficiency['score'], graph_efficiency['score']]

# Create grouped bar chart
fig = go.Figure()

fig.add_trace(go.Bar(
    name='Recall Accuracy',
    x=agents,
    y=recall_scores,
    marker_color='#1f77b4',
    text=[f"{s:.1%}" for s in recall_scores],
    textposition='outside'
))

fig.add_trace(go.Bar(
    name='Tool Call Completeness',
    x=agents,
    y=completeness_scores,
    marker_color='#2ca02c',
    text=[f"{s:.1%}" for s in completeness_scores],
    textposition='outside'
))

fig.add_trace(go.Bar(
    name='Tool Call Efficiency',
    x=agents,
    y=efficiency_scores,
    marker_color='#ff7f0e',
    text=[f"{s:.2f}" for s in efficiency_scores],
    textposition='outside'
))

fig.update_layout(
    title="Standard Agent vs Graph Agent with Context Isolation",
    yaxis_title="Score",
    barmode='group',
    height=500,
    yaxis=dict(range=[0, 1.1]),
    showlegend=True
)

fig.show()

# Create detailed comparison table
comparison_df = pd.DataFrame({
    "Agent": agents,
    "Recall Accuracy": [f"{s:.1%}" for s in recall_scores],
    "Tool Call Completeness": [f"{s:.1%}" for s in completeness_scores],
    "Tool Call Efficiency": [f"{s:.2f}" for s in efficiency_scores],
    "Total Tool Calls": [
        len(standard_outputs['trajectory']),
        len(graph_outputs['trajectory'])
    ]
})

display(comparison_df)

print("\n📊 Key Findings:")
recall_improvement = (graph_recall['score'] - standard_recall['score']) * 100
if recall_improvement > 0:
    print(f"   Graph agent improves recall accuracy by {recall_improvement:.1f} percentage points")
elif recall_improvement < 0:
    print(f"   Standard agent has higher recall by {-recall_improvement:.1f} percentage points")
else:
    print("   Both agents have the same recall accuracy")

completeness_improvement = (graph_completeness['score'] - standard_completeness['score']) * 100
if completeness_improvement > 0:
    print(f"   Graph agent improves tool call completeness by {completeness_improvement:.1f} percentage points")
elif completeness_improvement < 0:
    print(f"   Standard agent has higher completeness by {-completeness_improvement:.1f} percentage points")
else:
    print("   Both agents have the same tool call completeness")

## Detailed Recall Analysis

Let's examine which specific questions each agent got correct.
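Before answers can be compared, they have to be pulled out of each agent's free-form final response. The notebook relies on `extract_answers_json_from_text` for this, whose implementation is not shown. As a hedged illustration of the general idea, the hypothetical helper below scans the text for the first substring that parses as a JSON object, using only the standard library.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first JSON object embedded in free-form text, or {} if none parses."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # this brace did not open valid JSON; keep scanning
        if isinstance(obj, dict):
            return obj
    return {}


response = 'Final report complete. Answers: {"1": 12.5, "2": "Tesla"} Thank you.'
answers = extract_first_json_object(response)
print(answers)
```

JSON object keys are always strings, which is one reason the comparison code looks up each question under both `i` and `str(i)`.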

In [None]:
# Extract answers from both agents
standard_answers = extract_answers_json_from_text(standard_outputs['final_response'])
graph_answers = extract_answers_json_from_text(graph_outputs['final_response'])
expected_answers = reference_outputs['expected_answers']

# Create question-by-question comparison
from context_distraction.resources.validation_utils import compare_values

questions_data = []
for i in range(1, len(expected_answers) + 1):
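    # Keys may be ints or strings (JSON object keys are always strings), so try both.
    expected = expected_answers.get(i, expected_answers.get(str(i)))
    standard_answer = standard_answers.get(str(i), standard_answers.get(i))
    graph_answer = graph_answers.get(str(i), graph_answers.get(i))
    
    # Test against None so a legitimate falsy expected value (such as 0) is still scored.
    standard_correct = compare_values(standard_answer, expected) if expected is not None else False
    graph_correct = compare_values(graph_answer, expected) if expected is not None else False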
    
    questions_data.append({
        "Question": f"Q{i}",
        "Expected": str(expected) if expected is not None else "N/A",
        "Standard Agent": str(standard_answer) if standard_answer is not None else "Missing",
        "Standard ✓": "✓" if standard_correct else "✗",
        "Graph Agent": str(graph_answer) if graph_answer is not None else "Missing",
        "Graph ✓": "✓" if graph_correct else "✗"
    })

questions_df = pd.DataFrame(questions_data)
display(questions_df)

print("\n📝 Question Analysis:")
standard_correct = sum(1 for q in questions_data if q["Standard ✓"] == "✓")
graph_correct = sum(1 for q in questions_data if q["Graph ✓"] == "✓")
print(f"   Standard Agent: {standard_correct}/{len(expected_answers)} correct")
print(f"   Graph Agent: {graph_correct}/{len(expected_answers)} correct")

## Why the Graph Agent Works Better

The graph agent addresses context distraction through two key innovations:

### 1. Context Isolation

The graph agent uses a **supervisor/researcher pattern** where:
- **Supervisor node**: Maintains high-level plan and coordinates research
- **Researcher nodes**: Execute specific research tasks in isolated contexts
- **Explicit information passing**: Critical findings passed between nodes, not accumulated in one massive context

This prevents the LLM from being overwhelmed by accumulated tool call results.
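The pattern can be illustrated without any agent framework. The sketch below is a toy model, not the notebook's graph: `Researcher` and `Supervisor` are hypothetical classes, the "tool results" are placeholder strings, and the summary is hard-coded where a real researcher would have an LLM synthesize its findings.

```python
from dataclasses import dataclass, field


@dataclass
class Researcher:
    """Works on one topic in its own, throwaway message list."""
    topic: str
    messages: list[str] = field(default_factory=list)

    def run(self, num_tool_calls: int = 8) -> str:
        for i in range(1, num_tool_calls + 1):
            self.messages.append(f"tool call {i} for {self.topic}")
            self.messages.append("tool result: " + "x" * 2000)  # bulky raw output
        # Only a compact summary leaves the researcher's context.
        return f"{self.topic}: key findings from {num_tool_calls} tool calls"


@dataclass
class Supervisor:
    """Keeps the plan and the explicit findings, never the raw tool output."""
    messages: list[str] = field(default_factory=list)

    def research(self, topics: list[str]) -> None:
        for topic in topics:
            finding = Researcher(topic).run()  # fresh, isolated context per topic
            self.messages.append(finding)


supervisor = Supervisor()
supervisor.research(["renewable energy", "artificial intelligence", "electric vehicles"])
context_chars = sum(len(m) for m in supervisor.messages)
print(f"Supervisor holds {len(supervisor.messages)} messages, {context_chars} characters")
```

Each researcher's bulky message list is discarded once it returns, so the supervisor's context grows with the number of topics rather than with the number of tool calls.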

### 2. Reflection Tools for Long Tasks

The graph agent includes reflection mechanisms:
- **Plan tracking**: Maintains structured task list of what needs to be done
- **Progress checking**: Regularly reviews what's been completed
- **Information synthesis**: Explicitly extracts and preserves key facts

These tools help the agent maintain focus and recall accuracy even as the task becomes complex.
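As a rough illustration of these three mechanisms, the hypothetical `PlanTracker` below keeps a task list, reports remaining work, and stores key facts separately from raw tool output. It is a sketch of the idea only; the graph agent's actual reflection tools are not shown in this notebook, and the fact stored here is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class PlanTracker:
    """Toy stand-in for reflection tools: a task list plus a store of key facts."""
    tasks: dict[str, bool] = field(default_factory=dict)  # task -> completed?
    facts: dict[str, str] = field(default_factory=dict)   # label -> preserved fact

    def add_task(self, task: str) -> None:
        """Plan tracking: register a step that still needs to be done."""
        self.tasks.setdefault(task, False)

    def complete(self, task: str, **key_facts: str) -> None:
        """Information synthesis: mark a step done and keep only its key facts."""
        self.tasks[task] = True
        self.facts.update(key_facts)

    def remaining(self) -> list[str]:
        """Progress checking: list the steps that are not finished yet."""
        return [task for task, done in self.tasks.items() if not done]


tracker = PlanTracker()
for topic in ["renewable energy", "quantum computing"]:
    tracker.add_task(f"research {topic}")

tracker.complete("research renewable energy", renewable_growth_rate="placeholder value")
print("Remaining:", tracker.remaining())
print("Facts:", tracker.facts)
```

Because the facts live in a small structured store, the agent can re-read them at any point instead of searching a long message history.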

### The Result

By isolating context and using reflection tools, the graph agent can:
- Complete more complex tasks without losing track of details
- Maintain higher recall accuracy for facts from early research steps
- Provide more complete and accurate final reports

## Key Takeaways

### Context Distraction Is Real

Our experiments demonstrate that:
- **Standard agents struggle with long contexts**: As tasks become complex with many tool calls, recall accuracy degrades
- **Simple facts get lost**: Even basic recall questions about early research steps are missed
- **The problem compounds**: The more steps a task requires, the more context accumulates and the further recall degrades

### Context Isolation Solves It

The graph agent's approach delivers measurable improvements:
- **Higher recall accuracy**: Better retention of facts from throughout the research process
- **More complete execution**: Fewer missed research steps
- **Maintained performance**: Scales better to complex, multi-step tasks

### Professional Implementation Strategy

When building agents for complex, multi-step tasks:

1. **Recognize the threshold** - Simple tasks work fine with standard agents. Complex tasks (40+ tool calls) need context management.

2. **Use graph patterns** - Supervisor/worker patterns isolate context and enable explicit information flow.

3. **Add reflection tools** - Plan tracking, progress checking, and synthesis tools maintain accuracy over long sequences.

4. **Measure with evaluations** - Use LangSmith to quantify recall accuracy, completeness, and efficiency.

5. **Optimize for your domain** - Find the right balance of context isolation and information passing for your specific use case.

### Real-World Impact

By addressing context distraction:
- **Handle complex tasks** - Complete research/analysis tasks that standard agents can't
- **Higher accuracy** - Maintain recall of specific facts and details
- **Better UX** - Provide complete, accurate answers users can trust
- **Production-ready** - Build agents that scale to real-world complexity

---

**Context distraction isn't solved by better prompts; it requires architectural patterns that manage context strategically. Use graph-based agents with context isolation for complex, multi-step tasks.**