# Paradigm 03: Agile Sprints Research Agent

This notebook implements the **Agile ResearchOps** paradigm from the Research Paradigms document.

## Core Concept

The Agile approach transforms the agent from an open-loop system to a closed-loop control system with:
- **Sprint-based execution**: Time-boxed research iterations
- **Retrospective reflection**: Course correction after each sprint
- **Backlog management**: Dynamic re-prioritization of research questions
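
The three mechanisms above combine into one feedback loop. The following is a minimal sketch in plain Python, independent of LangGraph; `run_sprints`, `research`, and `retrospect` are illustrative names, not functions defined in this notebook.

```python
from typing import Callable, List, Tuple


def run_sprints(
    backlog: List[str],
    research: Callable[[str], str],
    retrospect: Callable[[List[str], List[str]], Tuple[bool, List[str]]],
    max_sprints: int = 3,
) -> List[str]:
    """Closed loop: research one item, reflect, then re-prioritize the backlog."""
    findings: List[str] = []
    for _ in range(max_sprints):  # time-boxed: at most max_sprints iterations
        if not backlog:
            break
        current, backlog = backlog[0], backlog[1:]
        findings.append(research(current))
        # Feedback step: the retrospective decides whether to go on
        # and returns the (possibly reordered) remaining backlog
        keep_going, backlog = retrospect(findings, backlog)
        if not keep_going:
            break
    return findings


# Stub callbacks: stop after two findings, reverse the backlog each time
demo = run_sprints(
    ["a", "b", "c"],
    research=lambda q: f"finding: {q}",
    retrospect=lambda found, rest: (len(found) < 2, rest[::-1]),
)
print(demo)  # ['finding: a', 'finding: c']
```

An open-loop version would simply walk the backlog in its original order; here the retrospective changes both the stopping point and the next item.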

## Literature Validation

> "Search-o1, R1-Searcher, DeepResearcher, WebDancer... exemplify this paradigm through iterative cycles of explicit reasoning, action, and reflection, aligning with the ReAct framework." —[Survey-3]

## Technology Stack

- **LLM**: `gpt-5-mini-2025-08-07`
- **Web Search**: Tavily API
- **Tracing**: LangSmith
- **Framework**: LangGraph

## 1. Setup and Configuration

In [None]:
import os
import operator
import asyncio
from pathlib import Path
from typing import List, Annotated, TypedDict, Literal

from dotenv import load_dotenv
from pydantic import BaseModel, Field

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from tavily import TavilyClient

from langgraph.graph import StateGraph, START, END

# Load environment variables
env_path = Path("../.env")
load_dotenv(env_path)

# Configure LangSmith tracing
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "deep_research_new"

print("Environment configured successfully")

In [None]:
# Initialize LLM and Tavily client
MODEL_NAME = "gpt-5-mini-2025-08-07"
llm = ChatOpenAI(model=MODEL_NAME, temperature=0)
tavily_client = TavilyClient()

# Sprint configuration
MAX_SPRINTS = 3
SEARCHES_PER_SPRINT = 5
MAX_TOKENS_PER_SPRINT = 15000  # 3 sprints x 15k = ~45k total budget

print(f"Using model: {MODEL_NAME}")
print(f"Max sprints: {MAX_SPRINTS}")
print(f"Searches per sprint: {SEARCHES_PER_SPRINT}")
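
A per-sprint token limit only has an effect if something applies it to the text sent to the model. One possible approach, sketched here under the rough assumption of about four characters per token (a heuristic, not a tokenizer), is to trim the combined search results before synthesis. `truncate_to_budget` is a hypothetical helper, and the constants are repeated only to keep the block self-contained.

```python
MAX_SPRINTS = 3
MAX_TOKENS_PER_SPRINT = 15000


def truncate_to_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Trim text to roughly max_tokens using a crude characters-per-token estimate."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    return text[:max_chars]


# Total budget across all sprints
total_budget = MAX_SPRINTS * MAX_TOKENS_PER_SPRINT
print(total_budget)  # 45000
```

For exact counts, a real tokenizer for the chosen model would replace the character estimate.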

## 2. State Definitions

In [None]:
class ResearchQuestion(BaseModel):
    """A research question in the backlog."""
    question: str = Field(description="The research question to investigate")
    priority: int = Field(default=1, description="Priority level (1=highest)")
    status: str = Field(default="pending", description="pending, in_progress, or completed")

class SprintFinding(BaseModel):
    """A finding from a sprint."""
    question: str = Field(description="The question this finding addresses")
    finding: str = Field(description="The key finding or insight")
    sources: List[str] = Field(default_factory=list, description="Source URLs")

class RetrospectiveOutput(BaseModel):
    """Output from the retrospective analysis."""
    what_we_learned: str = Field(description="Key learnings from this sprint")
    what_is_still_unclear: str = Field(description="Gaps or uncertainties remaining")
    should_continue: bool = Field(description="Whether to continue with another sprint")
    updated_priorities: List[str] = Field(description="Re-prioritized questions for next sprint")

In [None]:
class AgileResearchState(TypedDict):
    """State for the Agile Research Agent."""
    # Input
    question: str
    
    # Sprint management
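    # No reducer here: execute_sprint returns the *remaining* backlog, which must
    # replace the old list. An operator.add reducer would append it instead.
    backlog: List[str]  # Research questions to investigate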
    current_sprint: int
    max_sprints: int
    
    # Findings accumulator
    sprint_findings: Annotated[List[str], operator.add]  # Findings from all sprints
    
    # Retrospective notes
    retrospective_notes: Annotated[List[str], operator.add]
    
    # Output
    final_report: str
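
Fields annotated with `operator.add` are accumulated when a node returns an update, while plain fields are overwritten. The sketch below emulates that merge rule in plain Python to make the difference concrete; it is not LangGraph's implementation, and `REDUCERS` and `apply_update` are hypothetical names.

```python
import operator
from typing import Any, Callable, Dict

# Keys listed here accumulate; every other key is overwritten by the update.
REDUCERS: Dict[str, Callable[[Any, Any], Any]] = {
    "sprint_findings": operator.add,
    "retrospective_notes": operator.add,
}


def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a node's partial update into the state, honoring per-key reducers."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)  # e.g. list concatenation
        else:
            merged[key] = value  # plain overwrite
    return merged


state = {"backlog": ["q1", "q2"], "sprint_findings": ["f0"]}
state = apply_update(state, {"backlog": ["q2"], "sprint_findings": ["f1"]})
print(state)  # {'backlog': ['q2'], 'sprint_findings': ['f0', 'f1']}
```

A queue-like field that has to shrink over time needs overwrite semantics; an accumulating reducer suits append-only logs such as findings and notes.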

## 3. Node Functions

In [None]:
# Prompts
PLAN_BACKLOG_PROMPT = """You are a research planning expert. Given a research question, 
decompose it into 3-5 specific sub-questions that need to be investigated.

Research Question: {question}

Generate a list of specific, focused research questions that together will answer the main question.
Each question should be independently searchable.

Return your response as a numbered list:
1. [First sub-question]
2. [Second sub-question]
...
"""

SPRINT_RESEARCH_PROMPT = """You are a research agent conducting Sprint {sprint_num} of {max_sprints}.

Current research focus: {current_question}

Based on the search results below, extract the key findings that address the research question.
Be specific and cite sources where relevant.

Search Results:
{search_results}

Provide a comprehensive summary of findings (400-600 words).
"""

RETROSPECTIVE_PROMPT = """You are conducting a sprint retrospective for a research project.

Original Question: {original_question}

Sprint {sprint_num} of {max_sprints} has completed.

Findings so far:
{all_findings}

Remaining questions in backlog:
{remaining_backlog}

Analyze the progress and provide:
1. What did we learn this sprint?
2. What is still unclear or needs more investigation?
3. Should we continue with another sprint? (consider if we have enough to answer the question)
4. If continuing, what should be the priority for the next sprint?

Be honest about gaps and uncertainties.
"""

FINAL_REPORT_PROMPT = """You are a senior research analyst writing a final report.

Original Question: {original_question}

All Research Findings:
{all_findings}

Retrospective Notes:
{retrospective_notes}

Write a comprehensive research report that:
1. Directly answers the original question
2. Synthesizes findings across all sprints
3. Acknowledges any remaining uncertainties
4. Includes citations to sources

The report should be well-structured with sections and approximately 1000-1500 words.
"""

In [None]:
def search_web(query: str, max_results: int = 5) -> str:
    """Execute web search using Tavily."""
    try:
        # Truncate query if too long (Tavily limit)
        if len(query) > 400:
            query = query[:400]
        
        response = tavily_client.search(
            query=query,
            max_results=max_results,
            include_answer=True
        )
        
        results = []
        if response.get("answer"):
            results.append(f"Summary: {response['answer']}")
        
        for r in response.get("results", []):
            results.append(f"- {r.get('title', 'No title')}: {r.get('content', '')[:500]}... (Source: {r.get('url', 'N/A')})")
        
        return "\n\n".join(results)
    except Exception as e:
        return f"Search error: {str(e)}"
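
The per-result formatting can be pulled into a small pure function, which makes it testable without calling Tavily. This is a sketch that assumes result dictionaries with the `title`, `content`, and `url` keys used above; `format_result` is a hypothetical helper. Unlike the inline version, it adds the ellipsis only when the content was actually cut and tolerates `None` values.

```python
from typing import Any, Dict


def format_result(result: Dict[str, Any], max_chars: int = 500) -> str:
    """Format one search result as a bullet with a truncated snippet and its source."""
    title = result.get("title") or "No title"
    content = result.get("content") or ""
    url = result.get("url") or "N/A"
    # Append "..." only if the snippet was truncated
    snippet = content if len(content) <= max_chars else content[:max_chars] + "..."
    return f"- {title}: {snippet} (Source: {url})"


print(format_result({"title": "T", "content": "abc", "url": "https://example.com"}))
# - T: abc (Source: https://example.com)
```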

In [None]:
import re

# Matches a leading list marker such as "1.", "2)", "-", or "*"
LIST_MARKER = re.compile(r"^(?:\d+[.)]|[-*])\s+")

async def plan_backlog(state: AgileResearchState) -> dict:
    """Plan the initial research backlog by decomposing the question."""
    question = state["question"]
    
    prompt = PLAN_BACKLOG_PROMPT.format(question=question)
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    
    # Parse the numbered (or bulleted) list from the response
    backlog = []
    for line in response.content.strip().split("\n"):
        line = line.strip()
        match = LIST_MARKER.match(line)
        if match:
            # Strip only the marker, so questions that begin with a digit stay intact
            clean = line[match.end():].strip()
            if clean:
                backlog.append(clean)
    
    print(f"Created backlog with {len(backlog)} research questions")
    
    return {
        "backlog": backlog,
        "current_sprint": 1,
        "max_sprints": MAX_SPRINTS
    }

In [None]:
async def execute_sprint(state: AgileResearchState) -> dict:
    """Execute a research sprint on the current backlog item."""
    backlog = state.get("backlog", [])
    current_sprint = state.get("current_sprint", 1)
    max_sprints = state.get("max_sprints", MAX_SPRINTS)
    
    if not backlog:
        return {"sprint_findings": ["No questions in backlog to research."]}
    
    # Get current question (first in backlog)
    current_question = backlog[0]
    print(f"\n{'='*60}")
    print(f"Sprint {current_sprint}/{max_sprints}: {current_question[:80]}...")
    print(f"{'='*60}")
    
    # Execute searches
    search_results = []
    queries = [current_question]  # Main query
    
    # Generate additional search queries for depth
    if SEARCHES_PER_SPRINT > 1:
        query_prompt = f"""Generate {SEARCHES_PER_SPRINT - 1} specific web search queries to investigate this question from different angles:
        Question: {current_question}
        
        Return only the search queries, one per line."""
        query_response = await llm.ainvoke([HumanMessage(content=query_prompt)])
        additional_queries = [q.strip() for q in query_response.content.split("\n") if q.strip()]
        queries.extend(additional_queries[:SEARCHES_PER_SPRINT - 1])
    
    # Execute all searches
    for query in queries:
        print(f"  Searching: {query[:60]}...")
        result = search_web(query)
        search_results.append(result)
    
    combined_results = "\n\n---\n\n".join(search_results)
    
    # Synthesize findings
    synthesis_prompt = SPRINT_RESEARCH_PROMPT.format(
        sprint_num=current_sprint,
        max_sprints=max_sprints,
        current_question=current_question,
        search_results=combined_results
    )
    
    synthesis = await llm.ainvoke([HumanMessage(content=synthesis_prompt)])
    
    finding = f"## Sprint {current_sprint} Findings: {current_question}\n\n{synthesis.content}"
    print(f"  Synthesized {len(synthesis.content)} characters of findings")
    
    # Remove processed question from backlog
    updated_backlog = backlog[1:]  # Slicing a one-item list already yields []
    
    return {
        "sprint_findings": [finding],
        "backlog": updated_backlog
    }
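
`search_web` is a synchronous function, so calling it in a loop inside an async node runs the searches one after another and blocks the event loop while each one waits. A possible alternative, sketched here with a stand-in search function, is to push each call onto a worker thread with `asyncio.to_thread` (Python 3.9+) and gather the results. `run_searches` and `fake_search` are hypothetical names.

```python
import asyncio
from typing import Callable, List


async def run_searches(queries: List[str], search: Callable[[str], str]) -> List[str]:
    """Run a blocking search function once per query in worker threads."""
    tasks = [asyncio.to_thread(search, query) for query in queries]
    # gather() returns results in the same order as the input queries
    return list(await asyncio.gather(*tasks))


def fake_search(query: str) -> str:
    """Stand-in for search_web so the example needs no network access."""
    return f"results for {query}"


results = asyncio.run(run_searches(["a", "b"], fake_search))
print(results)  # ['results for a', 'results for b']
```

Inside a notebook cell or an async node, `await run_searches(queries, search_web)` would take the place of the `asyncio.run(...)` call.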

In [None]:
async def retrospective(state: AgileResearchState) -> dict:
    """Conduct retrospective analysis after sprint."""
    original_question = state["question"]
    current_sprint = state.get("current_sprint", 1)
    max_sprints = state.get("max_sprints", MAX_SPRINTS)
    all_findings = "\n\n".join(state.get("sprint_findings", []))
    remaining_backlog = state.get("backlog", [])
    
    prompt = RETROSPECTIVE_PROMPT.format(
        original_question=original_question,
        sprint_num=current_sprint,
        max_sprints=max_sprints,
        all_findings=all_findings,
        remaining_backlog="\n".join(remaining_backlog) if remaining_backlog else "None"
    )
    
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    
    retro_note = f"### Sprint {current_sprint} Retrospective\n\n{response.content}"
    print("  Retrospective complete")
    
    return {
        "retrospective_notes": [retro_note],
        "current_sprint": current_sprint + 1
    }
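
`RetrospectiveOutput` declares an `updated_priorities` list. The sketch below shows one way such a list could be applied to the backlog, assuming the retrospective returns question strings that match backlog entries exactly. `reprioritize` is a hypothetical helper; it ignores unknown questions rather than adding them.

```python
from typing import List


def reprioritize(backlog: List[str], updated_priorities: List[str]) -> List[str]:
    """Return the backlog with prioritized questions first; the rest keep their order."""
    known = set(backlog)
    ordered: List[str] = []
    for question in list(updated_priorities) + list(backlog):
        # Skip questions that are not in the backlog and anything already placed
        if question in known and question not in ordered:
            ordered.append(question)
    return ordered


print(reprioritize(["a", "b", "c"], ["c", "x", "a"]))  # ['c', 'a', 'b']
```

Exact string matching is brittle when the model rephrases a question; matching on an index or ID would be more robust.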

In [None]:
def should_continue_sprinting(state: AgileResearchState) -> Literal["execute_sprint", "write_report"]:
    """Decide whether to continue with another sprint or write the final report."""
    current_sprint = state.get("current_sprint", 1)
    max_sprints = state.get("max_sprints", MAX_SPRINTS)
    backlog = state.get("backlog", [])
    
    # Stop conditions:
    # 1. Reached max sprints
    # 2. Backlog is empty
    if current_sprint > max_sprints:
        print(f"\nMax sprints ({max_sprints}) reached. Moving to final report.")
        return "write_report"
    
    if not backlog:
        print(f"\nBacklog empty after sprint {current_sprint - 1}. Moving to final report.")
        return "write_report"
    
    print(f"\nContinuing to sprint {current_sprint}. {len(backlog)} questions remaining.")
    return "execute_sprint"
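
The routing decision depends only on the sprint counter and the backlog. Written as a pure function, it can be checked without building the graph. The sketch below adds an optional `should_continue` flag, mirroring the field of the same name in `RetrospectiveOutput`, to show how a retrospective verdict could act as a third stop condition. `next_step` is a hypothetical name, and wiring the flag into the state is not shown.

```python
from typing import List


def next_step(
    current_sprint: int,
    max_sprints: int,
    backlog: List[str],
    should_continue: bool = True,
) -> str:
    """Return the name of the next node: another sprint or the final report."""
    if current_sprint > max_sprints:
        return "write_report"  # sprint budget exhausted
    if not backlog:
        return "write_report"  # nothing left to research
    if not should_continue:
        return "write_report"  # retrospective judged the findings sufficient
    return "execute_sprint"


print(next_step(2, 3, ["q2"]))                         # execute_sprint
print(next_step(4, 3, ["q2"]))                         # write_report
print(next_step(2, 3, ["q2"], should_continue=False))  # write_report
```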

In [None]:
async def write_report(state: AgileResearchState) -> dict:
    """Write the final research report."""
    original_question = state["question"]
    all_findings = "\n\n".join(state.get("sprint_findings", []))
    retrospective_notes = "\n\n".join(state.get("retrospective_notes", []))
    
    prompt = FINAL_REPORT_PROMPT.format(
        original_question=original_question,
        all_findings=all_findings,
        retrospective_notes=retrospective_notes
    )
    
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    
    print(f"\nFinal report generated: {len(response.content)} characters")
    
    return {
        "final_report": response.content
    }

## 4. Graph Construction

In [None]:
# Build the Agile Research Agent graph
agile_builder = StateGraph(AgileResearchState)

# Add nodes
agile_builder.add_node("plan_backlog", plan_backlog)
agile_builder.add_node("execute_sprint", execute_sprint)
agile_builder.add_node("retrospective", retrospective)
agile_builder.add_node("write_report", write_report)

# Add edges
agile_builder.add_edge(START, "plan_backlog")
agile_builder.add_edge("plan_backlog", "execute_sprint")
agile_builder.add_edge("execute_sprint", "retrospective")

# Conditional edge: continue sprinting or write report
agile_builder.add_conditional_edges(
    "retrospective",
    should_continue_sprinting,
    {
        "execute_sprint": "execute_sprint",
        "write_report": "write_report"
    }
)

agile_builder.add_edge("write_report", END)

# Compile
agile_graph = agile_builder.compile()

print("Agile Research Agent compiled successfully")

In [None]:
# Visualize the graph
from IPython.display import Image, display

try:
    display(Image(agile_graph.get_graph().draw_mermaid_png()))
except Exception as e:
    print(f"Could not display graph: {e}")

## 5. Agent Wrapper for Evaluation

In [None]:
def agile_sprints_agent(inputs: dict) -> dict:
    """
    Wrapper function for Agile Sprints research agent.
    
    Compatible with evaluation harness.
    
    Args:
        inputs: Dictionary with 'question' key
        
    Returns:
        Dictionary with 'output' key containing final report
    """
    question = inputs.get("question", "")
    
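    # Run the graph with a recursion limit
    def _run() -> dict:
        return asyncio.run(
            agile_graph.ainvoke(
                {"question": question},
                config={"recursion_limit": 50}
            )
        )
    
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop in this thread (script or worker thread): run directly
        result = _run()
    else:
        # A loop is already running (e.g. a Jupyter cell), where asyncio.run()
        # raises RuntimeError; run the graph on its own loop in a worker thread
        from concurrent.futures import ThreadPoolExecutor
        with ThreadPoolExecutor(max_workers=1) as pool:
            result = pool.submit(_run).result()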
    
    return {
        "output": result.get("final_report", ""),
        "sprint_findings": result.get("sprint_findings", []),
        "retrospective_notes": result.get("retrospective_notes", [])
    }

## 6. Manual Test

Run this cell to verify the agent works correctly with a simple test question.

In [None]:
# Simple test
test_question = "What are the key benefits and challenges of using large language models in enterprise applications?"

print(f"Testing Agile Sprints Agent with question:\n{test_question}\n")
print("Running sprint-based research (this may take several minutes)...\n")

try:
    result = agile_sprints_agent({"question": test_question})
    
    print("=" * 80)
    print("FINAL REPORT")
    print("=" * 80)
    print(result["output"][:3000] + "..." if len(result["output"]) > 3000 else result["output"])
    print("\n" + "=" * 80)
    print(f"Report length: {len(result['output'])} characters")
    print(f"Number of sprint findings: {len(result.get('sprint_findings', []))}")
    print(f"Number of retrospectives: {len(result.get('retrospective_notes', []))}")
    print("Agent test PASSED ✓")
except Exception as e:
    print(f"Agent test FAILED: {e}")
    import traceback
    traceback.print_exc()
    raise

## 7. Evaluation Harness Integration

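Once the manual test passes, run the cell below to initialize the evaluation harness. The full-evaluation cell in 7.1 is commented out to avoid accidental cost; uncomment it when you are ready to run.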

In [None]:
# Import evaluation harness and metrics
import sys
sys.path.insert(0, "..")
from evaluation import (
    ExperimentHarness, 
    fact_recall, 
    citation_precision,
    coherence_judge, 
    depth_judge, 
    relevance_judge,
    minimum_sources_check
)

# Initialize harness with the golden test dataset
harness = ExperimentHarness(
    dataset_path="../data/deep_research_agent_test_dataset.yaml",
    langsmith_dataset_name="deep-research-golden-v2"
)

print("Evaluation harness initialized successfully!")
print(f"Dataset: {harness.dataset_path}")
print(f"LangSmith dataset name: {harness.langsmith_dataset_name}")

### 7.1 Full Evaluation

This runs the complete evaluation on all 20 questions.

**⚠️ WARNING:** This is expensive and time-consuming!
- **Expected runtime:** 1-2 hours
- Single run to reduce cost

In [None]:
# Full Evaluation on All 20 Questions
# ⚠️ EXPENSIVE - Only uncomment when ready for full evaluation
# Uncomment to run:

# # Define comprehensive evaluator suite
# evaluators = [
#     fact_recall,              # Required facts coverage
#     citation_precision,       # Citation URL validity
#     minimum_sources_check,    # Minimum source count
#     coherence_judge,          # Logical structure
#     depth_judge,              # Analysis depth
#     relevance_judge,          # Addresses question
# ]
# 
# # Run full evaluation
# print("Starting FULL evaluation on all 20 questions...")
# print("Agile Sprints Agent - this will take 1-2 hours.")
# print("=" * 80 + "\n")
# 
# results = harness.run_evaluation(
#     agent_fn=agile_sprints_agent,
#     evaluators=evaluators,
#     experiment_name="agile_sprints_v1",
#     monte_carlo_runs=1,  # Single run to reduce cost
#     max_concurrency=2,   # Lower concurrency for stability
#     description="Agile Sprints paradigm evaluation on all difficulty tiers"
# )
# 
# # Display comprehensive results
# print("\n" + "=" * 80)
# print("FULL EVALUATION RESULTS")
# print("=" * 80)
# print(f"Experiment: {results.experiment_name}")
# print(f"Questions evaluated: {results.num_questions}")
# print(f"Runs per question: {results.num_runs}")
# 
# print(f"\n{'Metric':<30} {'Mean':<10}")
# print("-" * 40)
# for metric_name in sorted(results.metrics.keys()):
#     if not metric_name.endswith('_std'):
#         value = results.metrics.get(metric_name, 0)
#         print(f"{metric_name:<30} {value:<10.3f}")
# 
# # Save results to file
# import json
# from datetime import datetime
# 
# results_file = Path("../results") / f"agile_sprints_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
# results_file.parent.mkdir(exist_ok=True)
# 
# with open(results_file, 'w') as f:
#     json.dump({
#         "experiment_name": results.experiment_name,
#         "num_questions": results.num_questions,
#         "num_runs": results.num_runs,
#         "metrics": results.metrics,
#         "per_question": results.per_question_results
#     }, f, indent=2)
# 
# print(f"\nResults saved to: {results_file}")

print("Full evaluation cell ready. Uncomment to run when ready.")
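
The commented results loop skips keys ending in `_std`. If the harness reports a standard deviation under `<metric>_std` next to each mean (an assumption based on that suffix check, not on the harness documentation), the two can be shown side by side. `format_metrics` is a hypothetical helper that works on a plain dictionary.

```python
from typing import Dict


def format_metrics(metrics: Dict[str, float]) -> str:
    """Render a metrics dict as a table, pairing each mean with its *_std entry."""
    lines = [f"{'Metric':<30} {'Mean':<10} {'Std':<10}", "-" * 52]
    for name in sorted(metrics):
        if name.endswith("_std"):
            continue  # shown in the Std column of its parent metric
        std = metrics.get(f"{name}_std")
        std_text = f"{std:.3f}" if std is not None else "n/a"
        lines.append(f"{name:<30} {metrics[name]:<10.3f} {std_text:<10}")
    return "\n".join(lines)


table = format_metrics({"fact_recall": 0.8, "fact_recall_std": 0.1, "depth_judge": 0.5})
print(table)
```

With `monte_carlo_runs=1` the spread carries little information, so the Std column mainly matters for multi-run experiments.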

### 7.2 Viewing Results in LangSmith

After running an evaluation, you can view detailed results in the LangSmith UI:

1. Go to https://smith.langchain.com
2. Navigate to your project (`deep_research_new`)
3. Click on "Datasets" to see your test dataset
4. Click on "Experiments" to see evaluation runs
5. Compare Agile Sprints vs Baseline results