# Context Distraction: When Long Context Histories Overwhelm LLM Agents

Context distraction is a **real production problem** affecting agents with long conversation histories.

## The Problem

As agents perform multi-step tasks, each step generates information that accumulates in the context. With many steps, the context becomes extremely long, and LLMs struggle to maintain accuracy when recalling information from early steps.

**Context Distraction Definition:** Context distraction occurs when a simple tool-calling LLM agent accumulates so much history across many steps that the model loses track of earlier information, which degrades both recall accuracy and task completion.

The challenge: Agents need to perform complex, multi-step tasks, but each step adds to context length. Eventually, the LLM can't effectively recall specific details from earlier steps.

## What We'll Explore

In this notebook, we'll use a research assistant agent to:
1. **Demonstrate** how context length grows with multi-step research tasks
2. **Measure** recall accuracy degradation as context length increases
3. **Identify** the point where context distraction significantly impacts performance
4. **Evaluate** different strategies for managing long contexts

The goal: Understand context distraction and develop strategies to maintain accuracy across long task sequences.

## Research Tasks We'll Test

We'll evaluate agents on research tasks of varying complexity:
- **Small tasks**: 2-3 topics, ~8-12 steps
- **Medium tasks**: 4-6 topics, ~16-24 steps  
- **Large tasks**: 8-10 topics, ~32-40 steps
- **Very large tasks**: 10+ topics with deep analysis, ~50+ steps

Each step generates verbose, detailed information that must be recalled accurately later.
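
To get a feel for the scale involved, the sketch below turns step counts into a rough context size. It is a back-of-the-envelope estimate, not a measurement: the 10,000 characters per tool call matches the flat estimate used by the context length evaluator later in this notebook, and the 4 characters per token ratio is an assumed rule of thumb.

```python
# Rough estimate of how context grows with the number of tool calls.
CHARS_PER_TOOL_CALL = 10_000  # assumption: average size of one verbose tool result
CHARS_PER_TOKEN = 4           # assumption: rule-of-thumb ratio for English text


def estimate_context(steps: int, chars_per_call: int = CHARS_PER_TOOL_CALL) -> dict:
    """Return the estimated context size after `steps` tool calls."""
    chars = steps * chars_per_call
    return {"steps": steps, "chars": chars, "approx_tokens": chars // CHARS_PER_TOKEN}


# Lower bound of the step range quoted for each task size above
for label, steps in [("small", 8), ("medium", 16), ("large", 32), ("very large", 50)]:
    est = estimate_context(steps)
    print(
        f"{label:>10}: {est['steps']:>2} steps -> "
        f"~{est['chars'] // 1000}K chars, ~{est['approx_tokens'] // 1000}K tokens"
    )
```

Under these assumptions the context grows linearly with the number of steps, so the very large tasks carry several times the history of the small ones.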


## Setup


In [1]:
# Imports
import os
from typing import Annotated, Any, Dict, List, Literal, TypedDict

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from dotenv import load_dotenv
from IPython.display import HTML, display
from langchain.agents import create_agent
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI
from langsmith import Client, evaluate, traceable
from langsmith.schemas import Example, Run
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

load_dotenv()

# Our research assistant agent tools
from context_distraction.tools import (
    all_research_tools,
    core_research_tools,
    analysis_tools
)
from context_distraction.instructions import (
    RESEARCH_ASSISTANT_INSTRUCTIONS,
    DETAILED_RESEARCH_INSTRUCTIONS
)

# Initialize LangSmith
client = Client()

print("✓ Setup complete")


In [None]:
# Initialize LLM - using OpenAI
MODEL_NAME = "gpt-4o-mini"
llm = ChatOpenAI(model=MODEL_NAME, temperature=0)

print(f"Using model: {MODEL_NAME}")


## Creating Agents

We'll compare two agents:
1. **Standard Agent**: Uses `create_agent` - accumulates all tool results in context
2. **Deep Agent**: Uses `create_deep_agent` - stores tool results in filesystem


In [None]:
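# Create agents
# create_standard_agent and create_deep_agent_instance are not defined in this
# notebook; they are assumed to be factory helpers from the project's
# context_distraction package and must be imported before this cell runs.
standard_agent = create_standard_agent(llm)
deep_agent = create_deep_agent_instance(llm)

print(f"✓ Created standard agent ({len(all_research_tools)} tools)")
print(f"✓ Created deep agent ({len(all_research_tools)} tools)")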


## Creating the Research Agent

We'll create a research agent with full capabilities to conduct comprehensive multi-topic research.


In [None]:
# Create research agent with all tools
research_agent = create_agent(
    model=llm,
    tools=all_research_tools,
    system_prompt=DETAILED_RESEARCH_INSTRUCTIONS
)

print(f"Research agent: {len(all_research_tools)} tools available")
print(f"Core research tools: {len(core_research_tools)}")
print(f"Analysis tools: {len(analysis_tools)}")
print("\nThis agent can conduct comprehensive research but will accumulate")
print("extensive context across many research steps.")


## Create Dataset and Evaluators


In [None]:
# Test dataset: Research tasks of varying complexity
test_tasks = [
    {
        "name": "Small Task (2 topics)",
        "query": "Research renewable energy and electric vehicles, then create a report comparing their growth and synergies.",
        "topics": ["renewable_energy", "electric_vehicles"],
        "expected_steps": 8,
        "recall_questions": [
            "What was the global installed capacity in gigawatts for renewable energy?",
            "What was the battery cost per kWh for electric vehicles?",
            "What is the name of an expert you consulted for renewable energy?"
        ]
    },
    {
        "name": "Medium Task (4 topics)",
        "query": "Research renewable energy, artificial intelligence, climate change, and electric vehicles. Synthesize findings into a comprehensive report.",
        "topics": ["renewable_energy", "artificial_intelligence", "climate_change", "electric_vehicles"],
        "expected_steps": 16,
        "recall_questions": [
            "What was the global installed capacity in gigawatts for renewable energy?",
            "What is the global AI market size in billions of USD?",
            "What was the temperature increase in degrees Celsius since pre-industrial times?",
            "What was the battery cost per kWh for electric vehicles?"
        ]
    },
    {
        "name": "Large Task (6 topics)",
        "query": "Conduct comprehensive research on renewable energy, AI, climate change, quantum computing, biotechnology, and space exploration. Create a detailed report.",
        "topics": ["renewable_energy", "artificial_intelligence", "climate_change", "quantum_computing", "biotechnology", "space_exploration"],
        "expected_steps": 24,
        "recall_questions": [
            "What was the global installed capacity in gigawatts for renewable energy?",
            "How many parameters does GPT-4 have according to your research?",
            "What was the temperature increase in degrees Celsius?",
            "How many qubits does IBM's Condor processor have?",
            "What is the global biotechnology market size in billions of USD?",
            "How many satellites were launched in 2023?"
        ]
    },
    {
        "name": "Very Large Task (10 topics)",
        "query": "Research all major technology domains: renewable energy, AI, climate change, quantum computing, biotechnology, space exploration, cybersecurity, electric vehicles, nanotechnology, and blockchain. Synthesize into a comprehensive report with specific statistics, expert opinions, and case studies.",
        "topics": [
            "renewable_energy", "artificial_intelligence", "climate_change",
            "quantum_computing", "biotechnology", "space_exploration",
            "cybersecurity", "electric_vehicles", "nanotechnology", "blockchain"
        ],
        "expected_steps": 40,
        "recall_questions": [
            "What was the global installed capacity in gigawatts for renewable energy?",
            "What is the global AI market size in billions of USD?",
            "What was the temperature increase in degrees Celsius?",
            "How many qubits does IBM's Condor processor have?",
            "What is the global biotechnology market size?",
            "How many satellites were launched in 2023?",
            "What is the global cybersecurity market size in billions of USD?",
            "What was the battery cost per kWh for electric vehicles?",
            "What is the global nanotechnology market size?",
            "What is the cryptocurrency market cap in billions of USD?"
        ]
    }
]

print(f"Test dataset: {len(test_tasks)} tasks")
for task in test_tasks:
    print(f"\n{task['name']}: {len(task['topics'])} topics, ~{task['expected_steps']} steps")


## Creating Dataset and Evaluators

We'll create a LangSmith dataset with our research tasks and define evaluators to measure:
- **Recall accuracy** - Can the agent recall specific details from early research steps?
- **Context length** - How much context accumulates across steps
- **Task completion** - Does the agent complete all required research steps?
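
The evaluation cells below refer to a `dataset_name` that this notebook never defines. One possible way to build the dataset from `test_tasks` is sketched here. The dataset name and the helper names `build_examples` and `upload_dataset` are hypothetical, and storing the task `name` in the example inputs is an assumption made so that results can be grouped by task later.

```python
from typing import Any, Dict, List, Tuple

DATASET_NAME = "context-distraction-research"  # hypothetical dataset name


def build_examples(tasks: List[Dict[str, Any]]) -> Tuple[List[dict], List[dict]]:
    """Split task dicts into parallel lists of example inputs and reference outputs."""
    inputs, outputs = [], []
    for task in tasks:
        # The agent only sees the query; the name is kept for grouping results
        inputs.append({"query": task["query"], "name": task["name"]})
        # Evaluators read topics and recall_questions from example.outputs
        outputs.append({
            "topics": task["topics"],
            "recall_questions": task["recall_questions"],
            "expected_steps": task["expected_steps"],
        })
    return inputs, outputs


def upload_dataset(client, tasks: List[Dict[str, Any]], name: str = DATASET_NAME):
    """Create the dataset once and upload one example per task.

    `client` is expected to be a langsmith.Client.
    """
    if client.has_dataset(dataset_name=name):
        return client.read_dataset(dataset_name=name)
    dataset = client.create_dataset(
        dataset_name=name,
        description="Research tasks of increasing size for context distraction evals",
    )
    inputs, outputs = build_examples(tasks)
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
    return dataset
```

With this in place, `dataset = upload_dataset(client, test_tasks)` followed by `dataset_name = dataset.name` would supply the name the `evaluate` calls expect.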


In [None]:
# Recall accuracy evaluator
def recall_accuracy_evaluator(run: Run, example: Example) -> Dict[str, Any]:
    """Score the fraction of recall questions whose expected value appears in the final response."""
    final_response = run.outputs.get("final_response", "")
    recall_questions = example.outputs.get("recall_questions", [])
    
    if not recall_questions or not final_response:
        return {"key": "recall_accuracy", "score": 0.0, "comment": "Missing data"}
    
    # Each check pairs a question matcher with the substrings accepted as a correct answer.
    # Matching is plain substring search, so short values such as "202" or "75" can
    # also match unrelated numbers in the response (for example "2023").
    checks = [
        (lambda q: "renewable energy" in q and "capacity" in q, ("3372", "3,372")),
        (lambda q: "battery cost" in q or "kwh" in q, ("139",)),
        (lambda q: "ai market" in q, ("196.6", "196")),
        (lambda q: "temperature increase" in q, ("1.1",)),
        (lambda q: "qubits" in q or "condor" in q, ("1121", "1,121")),
        (lambda q: "biotechnology" in q and "market" in q, ("1023", "1,023")),
        (lambda q: "satellites" in q and "2023" in q, ("2877", "2,877")),
        (lambda q: "cybersecurity" in q and "market" in q, ("202",)),
        (lambda q: "nanotechnology" in q and "market" in q, ("75",)),
        (lambda q: "cryptocurrency" in q or "crypto market" in q, ("1200", "1,200")),
    ]
    
    # Questions that match no check (the expert-name and GPT-4 parameter questions)
    # always count as not recalled.
    correct_count = 0
    for question in recall_questions:
        q_lower = question.lower()
        if any(
            matches(q_lower) and any(value in final_response for value in accepted)
            for matches, accepted in checks
        ):
            correct_count += 1
    
    accuracy = correct_count / len(recall_questions)
    return {"key": "recall_accuracy", "score": accuracy, "comment": f"Recalled {correct_count}/{len(recall_questions)} facts"}

# Context length evaluator
def context_length_evaluator(run: Run, example: Example) -> Dict[str, Any]:
    trajectory = run.outputs.get("trajectory", [])
    final_response = run.outputs.get("final_response", "")
    tool_calls_count = len(trajectory)
    # Estimate, not a measurement: assumes a flat 10,000 chars of tool output per call
    estimated_context_chars = tool_calls_count * 10000 + len(final_response)
    context_k_chars = estimated_context_chars / 1000
    return {"key": "context_length", "score": context_k_chars, "comment": f"{context_k_chars:.0f}K chars ({tool_calls_count} tool calls)"}

# Task completion evaluator
def task_completion_evaluator(run: Run, example: Example) -> Dict[str, Any]:
    final_response = run.outputs.get("final_response", "")
    topics = example.outputs.get("topics", [])
    trajectory = run.outputs.get("trajectory", [])
    
    if not topics:
        return {"key": "task_completion", "score": 0.0, "comment": "No topics"}
    
    researched_topics = set()
    response_lower = final_response.lower()
    
    for topic in topics:
        topic_key = topic.replace("_", " ")
        if topic_key in response_lower or topic in response_lower:
            researched_topics.add(topic)
    
    for tool_call in trajectory:
        if tool_call.get("name") == "research_topic":
            topic_arg = tool_call.get("args", {}).get("topic", "")
            if topic_arg:
                researched_topics.add(topic_arg.replace(" ", "_"))
    
    completion_rate = len(researched_topics) / len(topics) if topics else 0.0
    return {"key": "task_completion", "score": completion_rate, "comment": f"Researched {len(researched_topics)}/{len(topics)} topics"}

ALL_EVALUATORS = [recall_accuracy_evaluator, context_length_evaluator, task_completion_evaluator]
print(f"✓ Defined {len(ALL_EVALUATORS)} evaluators")


## Agent Wrapper


In [None]:
def run_research_agent_with_trajectory(agent, query: str) -> dict:
    """Run agent and return structured output."""
    result = agent.invoke({"messages": [("user", query)]})
    
    final_response = ""
    trajectory = []
    messages = result.get("messages", [])
    
    for msg in messages:
        if isinstance(msg, dict) and msg.get("type") == "ai":
            if "tool_calls" in msg and msg["tool_calls"]:
                for tc in msg["tool_calls"]:
                    trajectory.append({"name": tc.get("name", ""), "args": tc.get("args", {})})
            if msg.get("content"):
                final_response = msg["content"]
        elif hasattr(msg, 'tool_calls') and msg.tool_calls:
            for tc in msg.tool_calls:
                trajectory.append({
                    "name": tc.name if hasattr(tc, 'name') else tc.get("name", ""),
                    "args": tc.args if hasattr(tc, 'args') else tc.get("args", {})
                })
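        # Only AI messages carry the agent's answer; ignore human and tool message content
        if getattr(msg, "type", None) == "ai" and getattr(msg, "content", None):
            final_response = msg.content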
    
    return {"final_response": final_response, "trajectory": trajectory}

print("✓ Created agent wrapper")


## Run Evaluations


In [None]:
# Run evaluations
# `dataset_name` must hold the name of the LangSmith dataset built from test_tasks
print("Running evaluations...\n")

standard_experiment = evaluate(
    lambda inputs: run_research_agent_with_trajectory(standard_agent, inputs["query"]),
    data=dataset_name,
    evaluators=ALL_EVALUATORS,
    experiment_prefix="context-distraction-standard",
    metadata={"agent_type": "standard"},
)

deep_experiment = evaluate(
    lambda inputs: run_research_agent_with_trajectory(deep_agent, inputs["query"]),
    data=dataset_name,
    evaluators=ALL_EVALUATORS,
    experiment_prefix="context-distraction-deep",
    metadata={"agent_type": "deep"},
)

print("✓ Evaluations complete")


## Running Evaluations

Now let's run evaluations on tasks of increasing complexity to measure context distraction:


In [None]:
# Run evaluation on research agent
print("Running evaluation on research tasks...")
print("This may take several minutes as the agent conducts comprehensive research.\n")

research_experiment = evaluate(
    lambda inputs: run_research_agent_with_trajectory(research_agent, inputs["query"]),
    data=dataset_name,
    evaluators=ALL_EVALUATORS,
    experiment_prefix="context-distraction-research",
    metadata={"agent_type": "full_research", "tools": len(all_research_tools)},
)

print("\n✓ Evaluation complete!")
print(f"   Experiment: {research_experiment.experiment_name}")


## Visualize Results
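
The plotting cells call `get_metrics_from_experiment`, which is not defined anywhere in this notebook. A minimal sketch of such a helper follows. It assumes each row yielded by the experiment results is a dict with an `"example"` entry and an `"evaluation_results"` entry holding a `"results"` list of objects with `key` and `score`, and that the task name was stored under `inputs["name"]` when the dataset was created. Both are assumptions about this project's setup rather than confirmed details.

```python
from collections import defaultdict
from typing import Dict, Iterable


def get_metrics_from_experiment(experiment: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Average every evaluator score per task, keyed by task name."""
    collected = defaultdict(lambda: defaultdict(list))
    for row in experiment:
        inputs = row["example"].inputs or {}
        # Fall back to the query text when no task name was stored (assumption)
        task_name = inputs.get("name") or inputs.get("query", "unknown")
        for result in row["evaluation_results"]["results"]:
            if result.score is not None:
                collected[task_name][result.key].append(float(result.score))
    # Collapse the per-task score lists into means
    return {
        task: {key: sum(values) / len(values) for key, values in by_key.items()}
        for task, by_key in collected.items()
    }
```

The returned mapping has the shape the plotting code expects: `metrics[task_name]["recall_accuracy"]`, `metrics[task_name]["context_length"]`, and so on.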


In [None]:
from plotly.subplots import make_subplots

try:
    standard_metrics = get_metrics_from_experiment(standard_experiment)
    deep_metrics = get_metrics_from_experiment(deep_experiment)
    
    complexities = list(standard_metrics.keys())
    standard_recall = [standard_metrics[c]["recall_accuracy"] for c in complexities]
    deep_recall = [deep_metrics[c]["recall_accuracy"] for c in complexities]
    standard_context = [standard_metrics[c]["context_length"] for c in complexities]
    deep_context = [deep_metrics[c]["context_length"] for c in complexities]
    
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=("Recall Accuracy", "Context Length"),
        specs=[[{"secondary_y": False}, {"secondary_y": False}]]
    )
    
    fig.add_trace(go.Bar(x=complexities, y=standard_recall, name="Standard", marker_color='#d62728'), row=1, col=1)
    fig.add_trace(go.Bar(x=complexities, y=deep_recall, name="Deep", marker_color='#2ca02c'), row=1, col=1)
    fig.add_trace(go.Bar(x=complexities, y=standard_context, name="Standard", marker_color='#d62728', showlegend=False), row=1, col=2)
    fig.add_trace(go.Bar(x=complexities, y=deep_context, name="Deep", marker_color='#2ca02c', showlegend=False), row=1, col=2)
    
    fig.update_xaxes(title_text="Task Complexity", row=1, col=1)
    fig.update_xaxes(title_text="Task Complexity", row=1, col=2)
    fig.update_yaxes(title_text="Recall Accuracy", row=1, col=1, range=[0, 1])
    fig.update_yaxes(title_text="Context Length (K chars)", row=1, col=2)
    
    fig.update_layout(title_text="Standard Agent vs Deep Agent Comparison", height=400, barmode='group')
    fig.show()
    
except Exception as e:
    print(f"Error: {e}")


## Summary

**Standard Agent**: Accumulates all tool results in context → context distraction → degraded recall

**Deep Agent**: Stores tool results in filesystem → reduced context → maintained recall accuracy

**Solution**: Dropping tool call results from context (via filesystem storage) mitigates context distraction.
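
The idea of moving tool results out of the context can be sketched independently of any agent framework. The class below is a hypothetical illustration, not how the deep agent is implemented: it writes each full tool result to disk and returns a short stub that would take its place in the message history.

```python
import hashlib
import tempfile
from pathlib import Path


class ToolResultStore:
    """Write full tool outputs to disk and hand back a short stub for the context."""

    def __init__(self, root: Path, preview_chars: int = 200):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.preview_chars = preview_chars

    def offload(self, tool_name: str, result: str) -> str:
        """Persist `result` and return the stub that replaces it in the history."""
        digest = hashlib.sha256(result.encode("utf-8")).hexdigest()[:12]
        path = self.root / f"{tool_name}_{digest}.txt"
        path.write_text(result, encoding="utf-8")
        preview = result[: self.preview_chars]
        return f"[{tool_name} result saved to {path.name}, {len(result)} chars] {preview}"

    def load(self, filename: str) -> str:
        """Read a stored result back when the agent needs the exact details."""
        return (self.root / filename).read_text(encoding="utf-8")


store = ToolResultStore(Path(tempfile.mkdtemp()))
verbose_result = "lorem ipsum " * 1000  # placeholder for a long tool output
stub = store.offload("research_topic", verbose_result)
print(f"{len(verbose_result)} chars of tool output -> {len(stub)} chars kept in context")
```

The trade-off is an extra read step whenever the agent needs an exact figure, in exchange for a context that grows by a few hundred characters per tool call instead of thousands.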


In [None]:
# Create visualization of context distraction
try:
    metrics = get_metrics_from_experiment(research_experiment)
    
    # Prepare data for plotting
    complexities = list(metrics.keys())
    recall_scores = [metrics[c]["recall_accuracy"] for c in complexities]
    context_lengths = [metrics[c]["context_length"] for c in complexities]
    completion_scores = [metrics[c]["task_completion"] for c in complexities]
    
    # Create figure with subplots
    from plotly.subplots import make_subplots
    
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            "Recall Accuracy vs Task Complexity",
            "Context Length Growth",
            "Task Completion Rate",
            "Recall Accuracy vs Context Length"
        ),
        specs=[[{"secondary_y": False}, {"secondary_y": False}],
               [{"secondary_y": False}, {"secondary_y": False}]]
    )
    
    # Plot 1: Recall accuracy by complexity
    fig.add_trace(
        go.Bar(
            x=complexities,
            y=recall_scores,
            name="Recall Accuracy",
            marker_color='#1f77b4',
            text=[f"{s:.1%}" for s in recall_scores],
            textposition='outside'
        ),
        row=1, col=1
    )
    
    # Plot 2: Context length growth
    fig.add_trace(
        go.Bar(
            x=complexities,
            y=context_lengths,
            name="Context Length (K chars)",
            marker_color='#ff7f0e',
            text=[f"{c:.0f}K" for c in context_lengths],
            textposition='outside'
        ),
        row=1, col=2
    )
    
    # Plot 3: Task completion
    fig.add_trace(
        go.Bar(
            x=complexities,
            y=completion_scores,
            name="Task Completion",
            marker_color='#2ca02c',
            text=[f"{s:.1%}" for s in completion_scores],
            textposition='outside'
        ),
        row=2, col=1
    )
    
    # Plot 4: Recall vs Context Length (scatter)
    fig.add_trace(
        go.Scatter(
            x=context_lengths,
            y=recall_scores,
            mode='lines+markers',
            name="Recall Accuracy",
            marker=dict(size=12, color='#d62728'),
            line=dict(width=3, color='#d62728')
        ),
        row=2, col=2
    )
    
    # Update layout
    fig.update_xaxes(title_text="Task Complexity", row=1, col=1)
    fig.update_xaxes(title_text="Task Complexity", row=1, col=2)
    fig.update_xaxes(title_text="Task Complexity", row=2, col=1)
    fig.update_xaxes(title_text="Context Length (K chars)", row=2, col=2)
    
    fig.update_yaxes(title_text="Recall Accuracy", row=1, col=1, range=[0, 1])
    fig.update_yaxes(title_text="Context Length (K chars)", row=1, col=2)
    fig.update_yaxes(title_text="Completion Rate", row=2, col=1, range=[0, 1])
    fig.update_yaxes(title_text="Recall Accuracy", row=2, col=2, range=[0, 1])
    
    fig.update_layout(
        title_text="Context Distraction: Performance Degradation Analysis",
        height=800,
        showlegend=False
    )
    
    fig.show()
    
    print("\n📈 Chart Analysis:")
    print("   - Recall Accuracy vs Complexity: Shows degradation as tasks get larger")
    print("   - Context Length Growth: Shows how context length scales with task size")
    print("   - Task Completion: Indicates if agent completes all research steps")
    print("   - Recall vs Context Length: Reveals the distraction threshold")
    
except Exception as e:
    print(f"Error creating visualization: {e}")
    print("Run the evaluation above first to see charts.")


## Summary: Context Distraction Findings

Based on the evaluation results, we can observe:

1. **Recall Accuracy Degrades**: As context length increases, the agent's ability to recall specific facts from early research steps decreases significantly.

2. **Context Length Grows**: Each research step adds substantial context (~10K-20K chars per tool call; the context length evaluator uses a flat 10K-per-call estimate rather than a measured value), leading to very long contexts for large tasks.

3. **Task Completion Suffers**: For very large tasks, the agent may miss some research steps or provide incomplete synthesis.
