# Paradigm 05: Iterative Refinement Research Agent

This notebook implements the **Iterative Refinement (Diffusion-Based Reasoning)** paradigm from the Research Paradigms document.

## Core Concept

Iterative refinement applies a Generate-Critique-Revise loop, followed by a convergence check, to improve output quality:
- **Generate**: Create initial research and draft
- **Critique**: Identify weak claims, gaps, and inconsistencies
- **Revise**: Fix identified issues with targeted improvements
- **Converge**: Stop when quality is sufficient or max iterations reached
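
The four steps above can be sketched as a plain Python loop before any framework is involved. This is a minimal illustration, not the notebook's implementation: `refine` and the `toy_*` callables are hypothetical stand-ins for the LLM calls, and the default limits simply mirror the configuration values used later.

```python
from typing import Callable, List, Tuple


def refine(
    generate: Callable[[], str],
    critique: Callable[[str], Tuple[float, str]],
    revise: Callable[[str, str], str],
    max_revisions: int = 3,
    threshold: float = 7.0,
) -> Tuple[str, List[float]]:
    """Run generate -> critique -> (revise -> critique)* until converged."""
    draft = generate()
    scores: List[float] = []
    revisions = 0
    while True:
        score, feedback = critique(draft)
        scores.append(score)
        # Converge: stop on sufficient quality or an exhausted revision budget
        if score >= threshold or revisions >= max_revisions:
            return draft, scores
        draft = revise(draft, feedback)
        revisions += 1


# Toy stand-ins (hypothetical): each revision appends a marker, and the
# critic scores a draft by how many markers it carries.
def toy_generate() -> str:
    return "draft"


def toy_critique(text: str) -> Tuple[float, str]:
    return 5.0 + text.count("+"), "add evidence"


def toy_revise(text: str, feedback: str) -> str:
    return text + "+"


final_draft, history = refine(toy_generate, toy_critique, toy_revise)
print(final_draft, history)
```

Every draft is critiqued at least once, so the score history always has one more entry than the number of revisions performed.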

## Literature Validation

> "WebThinker, LongDPO, CycleResearcher... iterating between generating and refining outputs to achieve higher quality results. This is the only universally validated component across all paradigms." —Feasibility Report

## Technology Stack

- **LLM**: `gpt-5-mini-2025-08-07`
- **Web Search**: Tavily API
- **Tracing**: LangSmith
- **Framework**: LangGraph

## 1. Setup and Configuration

In [None]:
import os
import operator
import asyncio
from pathlib import Path
from typing import List, Annotated, TypedDict, Literal

from dotenv import load_dotenv
from pydantic import BaseModel, Field

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from tavily import TavilyClient

from langgraph.graph import StateGraph, START, END

# Load environment variables
env_path = Path("../.env")
load_dotenv(env_path)

# Configure LangSmith tracing
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "deep_research_new"

print("Environment configured successfully")

In [None]:
# Initialize LLM and Tavily client
MODEL_NAME = "gpt-5-mini-2025-08-07"
llm = ChatOpenAI(model=MODEL_NAME, temperature=0)
tavily_client = TavilyClient()

# Refinement Configuration
MAX_REVISIONS = 3  # Maximum critique-revise iterations
MIN_QUALITY_THRESHOLD = 7.0  # Stop once the critique score reaches this value (1-10 scale)

print(f"Using model: {MODEL_NAME}")
print(f"Max revisions: {MAX_REVISIONS}")
print(f"Quality threshold: {MIN_QUALITY_THRESHOLD}/10")

## 2. State Definitions

In [None]:
class CritiqueIssue(BaseModel):
    """An issue identified during critique."""
    issue_type: str = Field(description="Type: weak_claim, missing_evidence, logical_gap, unclear")
    location: str = Field(description="Where in the text this issue appears")
    description: str = Field(description="Description of the issue")
    suggestion: str = Field(description="Suggested fix")

class CritiqueResult(BaseModel):
    """Result of critique analysis."""
    issues: List[CritiqueIssue] = Field(default_factory=list)
    overall_quality: float = Field(description="Quality score 1-10")
    summary: str = Field(description="Summary of critique")

In [None]:
class IterativeRefinementState(TypedDict):
    """State for the Iterative Refinement Research Agent."""
    # Input
    question: str
    
    # Research phase
    search_results: Annotated[List[str], operator.add]
    source_urls: Annotated[List[str], operator.add]
    
    # Draft management
    current_draft: str
    draft_history: Annotated[List[str], operator.add]  # Track all versions
    
    # Critique tracking
    current_critique: str
    critique_history: Annotated[List[str], operator.add]
    revision_count: int
    quality_scores: Annotated[List[float], operator.add]
    
    # Output
    final_report: str
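
The `Annotated[..., operator.add]` fields above accumulate across node updates, while plain fields are overwritten. The sketch below imitates that merge rule in plain Python to make the difference visible. `DemoState` and `apply_update` are hypothetical names, and this only approximates the reducer semantics; it is not how LangGraph is implemented internally.

```python
import operator
from typing import Annotated, List, TypedDict, get_type_hints


class DemoState(TypedDict):
    current_draft: str                                     # overwritten on update
    quality_scores: Annotated[List[float], operator.add]   # appended on update


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into state, honoring Annotated reducers."""
    hints = get_type_hints(DemoState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated types expose their extra arguments via __metadata__
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"current_draft": "v0", "quality_scores": [5.0]}
state = apply_update(state, {"current_draft": "v1", "quality_scores": [6.5]})
print(state)
```

This is why the node functions return single-element lists such as `[quality_score]`: the reducer concatenates them onto the existing history.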

## 3. Helper Functions

In [None]:
def search_web(query: str, max_results: int = 10) -> tuple[List[str], List[str]]:
    """Execute web search using Tavily. Returns (results, urls)."""
    try:
        if len(query) > 400:
            query = query[:400]
        
        response = tavily_client.search(
            query=query,
            max_results=max_results,
            include_answer=True
        )
        
        results = []
        urls = []
        
        if response.get("answer"):
            results.append(f"Summary: {response['answer']}")
        
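        for r in response.get("results", []):
            url = r.get("url") or ""
            if url:
                urls.append(url)  # Skip empty URLs so source counts stay accurate
            content = r.get("content") or ""
            # Only mark the snippet as truncated when it actually was
            snippet = content[:500] + ("..." if len(content) > 500 else "")
            results.append(f"- {r.get('title') or 'No title'}: {snippet} (Source: {url})")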
        
        return results, urls
    except Exception as e:
        return [f"Search error: {str(e)}"], []
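
One way to check the result formatting without calling the network is to pull it into a pure function and feed it a hand-built response. The sketch below does that. `format_search_response` is a hypothetical helper, and `mock_response` only mirrors the keys that `search_web` reads (`answer`, `results`, `title`, `url`, `content`); it is not a complete Tavily payload.

```python
from typing import List, Tuple


def format_search_response(response: dict, snippet_chars: int = 500) -> Tuple[List[str], List[str]]:
    """Turn a search response dict into (formatted results, source URLs)."""
    results: List[str] = []
    urls: List[str] = []
    if response.get("answer"):
        results.append(f"Summary: {response['answer']}")
    for item in response.get("results", []):
        url = item.get("url") or ""
        content = item.get("content") or ""
        # Add an ellipsis only when the snippet was actually cut
        snippet = content[:snippet_chars] + ("..." if len(content) > snippet_chars else "")
        if url:
            urls.append(url)
        results.append(f"- {item.get('title') or 'No title'}: {snippet} (Source: {url})")
    return results, urls


# Hand-built response for illustration only
mock_response = {
    "answer": "LLMs help with drafting.",
    "results": [
        {"title": "Example", "url": "https://example.com/a", "content": "x" * 600},
        {"title": None, "url": "", "content": "short"},
    ],
}

results, urls = format_search_response(mock_response)
for line in results:
    print(line[:80])
print(urls)
```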

## 4. Node Functions

In [None]:
# Prompts
RESEARCH_PROMPT = """You are a research expert. Conduct comprehensive research on this question.

Question: {question}

Generate 5 focused search queries to gather information from different angles.
Return ONLY the search queries, one per line.
"""

DRAFT_PROMPT = """You are a research analyst writing a comprehensive report.

Question: {question}

Research Findings:
{research_findings}

Write a detailed research report (1000-1500 words) that:
1. Directly addresses the research question
2. Synthesizes findings from multiple sources
3. Provides evidence for all claims
4. Includes proper citations
5. Acknowledges limitations or uncertainties

Structure with clear sections and be comprehensive.
"""

CRITIQUE_PROMPT = """You are a critical reviewer evaluating a research report.

Original Question: {question}

Report to Critique:
{draft}

Analyze this report and identify issues:

1. WEAK CLAIMS: Statements without sufficient evidence
2. MISSING EVIDENCE: Important points that need citations
3. LOGICAL GAPS: Missing connections or reasoning holes
4. UNCLEAR SECTIONS: Confusing or ambiguous passages

For each issue, provide:
- Type (weak_claim/missing_evidence/logical_gap/unclear)
- Location (quote the problematic text)
- Description of the problem
- Specific suggestion for improvement

Also provide an OVERALL QUALITY SCORE (1-10) and a summary.

Format your response as:
QUALITY SCORE: [1-10]

ISSUES:
1. [Type]: "[Location]" - [Description]. Suggestion: [Fix]
2. ...

SUMMARY: [Overall assessment]
"""

REVISE_PROMPT = """You are a research analyst revising a report based on critical feedback.

Original Question: {question}

Current Draft:
{current_draft}

Critique Feedback:
{critique}

Revise the report to address ALL identified issues:
- Fix weak claims with evidence
- Add missing citations and evidence
- Fill logical gaps with explanations
- Clarify unclear sections

Maintain the overall structure but improve quality. Do NOT shorten the report.
Output the complete revised report.
"""

In [None]:
async def conduct_research(state: IterativeRefinementState) -> dict:
    """Conduct initial research phase."""
    question = state["question"]
    
    print(f"\n{'='*60}")
    print("Research Phase")
    print(f"{'='*60}")
    
    # Generate search queries
    prompt = RESEARCH_PROMPT.format(question=question)
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    queries = [q.strip() for q in response.content.split("\n") if q.strip()][:5]
    
    # Execute searches
    all_results = []
    all_urls = []
    
    for query in queries:
        print(f"  Searching: {query[:50]}...")
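        # search_web is blocking, so run it in a worker thread to keep the event loop free
        results, urls = await asyncio.to_thread(search_web, query)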
        all_results.extend(results)
        all_urls.extend(urls)
    
    print(f"  Collected {len(all_results)} results from {len(set(all_urls))} sources")
    
    return {
        "search_results": all_results,
        "source_urls": all_urls,
        "revision_count": 0
    }
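
The prompt asks for one query per line, but models sometimes number or bullet their output anyway, and those prefixes would then be sent to the search API. A possible hardening step is sketched below. `parse_queries` is a hypothetical helper, not part of the notebook, and the prefix patterns it strips are an assumption about typical model output.

```python
import re
from typing import List

# Leading bullets ("-", "*", "•") or numbering ("1.", "2)")
_LIST_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s*")


def parse_queries(raw: str, limit: int = 5) -> List[str]:
    """Extract up to `limit` unique queries from line-oriented LLM output."""
    queries: List[str] = []
    for line in raw.splitlines():
        cleaned = _LIST_PREFIX.sub("", line).strip().strip('"')
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries[:limit]


sample = '1. LLM enterprise benefits\n- "LLM deployment risks"\n\n2) LLM enterprise benefits\n'
print(parse_queries(sample))
```

A query that merely starts with digits, such as "2024 trends", is left intact because the pattern requires a `.` or `)` directly after the number.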

In [None]:
async def generate_draft(state: IterativeRefinementState) -> dict:
    """Generate initial draft from research."""
    question = state["question"]
    search_results = state.get("search_results", [])
    
    print("\n--- Generating Initial Draft ---")
    
    prompt = DRAFT_PROMPT.format(
        question=question,
        research_findings="\n\n".join(search_results[:25])
    )
    
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    
    print(f"  Draft generated: {len(response.content)} characters")
    
    return {
        "current_draft": response.content,
        "draft_history": [f"V0 (Initial): {len(response.content)} chars"]
    }

In [None]:
import re


async def critique_draft(state: IterativeRefinementState) -> dict:
    """Critique the current draft."""
    question = state["question"]
    current_draft = state.get("current_draft", "")
    revision_count = state.get("revision_count", 0)
    
    print(f"\n--- Critique Phase (Revision {revision_count}) ---")
    
    prompt = CRITIQUE_PROMPT.format(
        question=question,
        draft=current_draft
    )
    
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    critique = response.content
    
    # Extract the quality score; tolerate brackets echoed from the prompt template
    quality_score = 5.0  # Neutral default when no score can be parsed
    match = re.search(r'QUALITY SCORE:\s*\[?(\d+(?:\.\d+)?)', critique, re.IGNORECASE)
    if match:
        # Clamp to the 1-10 scale the prompt asks for
        quality_score = min(10.0, max(1.0, float(match.group(1))))
    else:
        print("  Warning: no quality score found; using default")
    
    print(f"  Quality score: {quality_score}/10")
    
    return {
        "current_critique": critique,
        "critique_history": [f"R{revision_count}: Score {quality_score}/10"],
        "quality_scores": [quality_score]
    }
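
The critique text follows the format requested in `CRITIQUE_PROMPT`, so it can also be parsed into structured issues rather than kept as a single string. The sketch below shows one way to do this with the standard library. `ParsedIssue` and `parse_critique` are hypothetical names, and the regular expressions assume the model follows the requested format closely; lines that deviate are skipped.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedIssue:
    issue_type: str
    location: str
    description: str
    suggestion: str


SCORE_RE = re.compile(r"QUALITY SCORE:\s*\[?(\d+(?:\.\d+)?)", re.IGNORECASE)
# Matches: 1. [Type]: "[Location]" - [Description]. Suggestion: [Fix]
ISSUE_RE = re.compile(
    r'^\s*\d+\.\s*\[?(?P<type>[\w ]+?)\]?:\s*"(?P<loc>[^"]*)"\s*-\s*'
    r"(?P<desc>.*?)\s*Suggestion:\s*(?P<fix>.*)$",
    re.MULTILINE,
)


def parse_critique(text: str, default_score: float = 5.0) -> Tuple[float, List[ParsedIssue]]:
    """Return (score clamped to 1-10, list of issues found in the text)."""
    match = SCORE_RE.search(text)
    score = float(match.group(1)) if match else default_score
    score = min(10.0, max(1.0, score))
    issues = [
        ParsedIssue(
            issue_type=m["type"].strip().lower().replace(" ", "_"),
            location=m["loc"],
            description=m["desc"].strip(),
            suggestion=m["fix"].strip(),
        )
        for m in ISSUE_RE.finditer(text)
    ]
    return score, issues


# Hand-written critique for illustration only
sample_critique = """QUALITY SCORE: 6.5

ISSUES:
1. weak_claim: "LLMs always cut costs" - No data is cited. Suggestion: Add a cost study.
2. Missing Evidence: "Adoption is rising" - No source given. Suggestion: Cite a survey.

SUMMARY: Solid but under-sourced.
"""

score, issues = parse_critique(sample_critique)
print(score, [issue.issue_type for issue in issues])
```

The parsed fields line up with the `CritiqueIssue` model defined earlier, so the same data could populate a `CritiqueResult` if structured critiques are wanted downstream.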

In [None]:
def should_continue_refining(state: IterativeRefinementState) -> Literal["revise", "finalize"]:
    """Decide whether to continue refining or finalize."""
    revision_count = state.get("revision_count", 0)
    quality_scores = state.get("quality_scores", [])
    
    latest_score = quality_scores[-1] if quality_scores else 0
    
    # Stop conditions
    if revision_count >= MAX_REVISIONS:
        print(f"  Max revisions ({MAX_REVISIONS}) reached. Finalizing.")
        return "finalize"
    
    if latest_score >= MIN_QUALITY_THRESHOLD:
        print(f"  Quality threshold ({MIN_QUALITY_THRESHOLD}) met. Finalizing.")
        return "finalize"
    
    print(f"  Continuing refinement (score {latest_score} < threshold {MIN_QUALITY_THRESHOLD})")
    return "revise"
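
The routing decision is easy to check in isolation for its edge cases. The sketch below restates the two stop conditions as a pure function and runs it on a few hand-picked inputs. `decide` is a hypothetical, print-free restatement of `should_continue_refining`, with the constants repeated so the block runs on its own.

```python
from typing import List

MAX_REVISIONS = 3
MIN_QUALITY_THRESHOLD = 7.0


def decide(revision_count: int, quality_scores: List[float]) -> str:
    """Return 'finalize' when either stop condition holds, else 'revise'."""
    latest = quality_scores[-1] if quality_scores else 0.0
    if revision_count >= MAX_REVISIONS or latest >= MIN_QUALITY_THRESHOLD:
        return "finalize"
    return "revise"


cases = [
    (0, []),                    # no critique yet: treated as score 0
    (1, [5.0, 7.0]),            # score exactly at the threshold
    (2, [5.0, 5.5, 6.0]),       # below threshold, budget remaining
    (3, [5.0, 5.5, 6.0, 6.5]),  # budget exhausted despite a low score
]
decisions = [decide(count, scores) for count, scores in cases]
print(decisions)
```

Only the most recent score is compared, so an earlier high score does not end the loop if a later critique rates the draft lower.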

In [None]:
async def revise_draft(state: IterativeRefinementState) -> dict:
    """Revise the draft based on critique."""
    question = state["question"]
    current_draft = state.get("current_draft", "")
    critique = state.get("current_critique", "")
    revision_count = state.get("revision_count", 0)
    
    print(f"\n--- Revision Phase ({revision_count + 1}/{MAX_REVISIONS}) ---")
    
    prompt = REVISE_PROMPT.format(
        question=question,
        current_draft=current_draft,
        critique=critique
    )
    
    response = await llm.ainvoke([HumanMessage(content=prompt)])
    revised_draft = response.content
    
    # Calculate improvement
    len_change = len(revised_draft) - len(current_draft)
    print(f"  Revised draft: {len(revised_draft)} chars ({'+' if len_change >= 0 else ''}{len_change})")
    
    return {
        "current_draft": revised_draft,
        "draft_history": [f"V{revision_count + 1}: {len(revised_draft)} chars"],
        "revision_count": revision_count + 1
    }

In [None]:
async def finalize_report(state: IterativeRefinementState) -> dict:
    """Finalize the report."""
    current_draft = state.get("current_draft", "")
    revision_count = state.get("revision_count", 0)
    quality_scores = state.get("quality_scores", [])
    
    print(f"\n{'='*60}")
    print("Report Finalized")
    print(f"{'='*60}")
    print(f"  Total revisions: {revision_count}")
    print(f"  Quality progression: {' -> '.join([f'{s:.1f}' for s in quality_scores])}")
    print(f"  Final length: {len(current_draft)} characters")
    
    return {
        "final_report": current_draft
    }

## 5. Graph Construction

In [None]:
# Build the Iterative Refinement Research Agent graph
ir_builder = StateGraph(IterativeRefinementState)

# Add nodes
ir_builder.add_node("conduct_research", conduct_research)
ir_builder.add_node("generate_draft", generate_draft)
ir_builder.add_node("critique_draft", critique_draft)
ir_builder.add_node("revise_draft", revise_draft)
ir_builder.add_node("finalize_report", finalize_report)

# Add edges
ir_builder.add_edge(START, "conduct_research")
ir_builder.add_edge("conduct_research", "generate_draft")
ir_builder.add_edge("generate_draft", "critique_draft")

# Conditional edge: revise or finalize
ir_builder.add_conditional_edges(
    "critique_draft",
    should_continue_refining,
    {
        "revise": "revise_draft",
        "finalize": "finalize_report"
    }
)

# Loop back to critique after revision
ir_builder.add_edge("revise_draft", "critique_draft")
ir_builder.add_edge("finalize_report", END)

# Compile
iterative_refinement_graph = ir_builder.compile()

print("Iterative Refinement Research Agent compiled successfully")

In [None]:
# Visualize the graph
from IPython.display import Image, display

try:
    display(Image(iterative_refinement_graph.get_graph().draw_mermaid_png()))
except Exception as e:
    print(f"Could not display graph: {e}")

## 6. Agent Wrapper for Evaluation

In [None]:
def iterative_refinement_agent(inputs: dict) -> dict:
    """
    Wrapper function for Iterative Refinement research agent.
    
    Compatible with evaluation harness.
    
    Args:
        inputs: Dictionary with 'question' key
        
    Returns:
        Dictionary with 'output' key containing final report
    """
    question = inputs.get("question", "")
    
    # Run with recursion limit
    result = asyncio.run(
        iterative_refinement_graph.ainvoke(
            {"question": question},
            config={"recursion_limit": 50}
        )
    )
    
    return {
        "output": result.get("final_report", ""),
        "revision_count": result.get("revision_count", 0),
        "quality_scores": result.get("quality_scores", []),
        "source_urls": result.get("source_urls", [])
    }
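
`asyncio.run` raises a `RuntimeError` when the calling thread already has a running event loop, which is commonly the case inside a notebook cell. One possible workaround, sketched below under that assumption, is to fall back to a worker thread that owns its own loop. `run_coroutine_sync` and `fake_agent` are hypothetical names; `fake_agent` stands in for `iterative_refinement_graph.ainvoke(...)`.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine


def run_coroutine_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion whether or not a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe
        return asyncio.run(coro)
    # A loop is already running: use a fresh loop in a worker thread and
    # block until the result is available
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fake_agent(question: str) -> dict:
    await asyncio.sleep(0)  # placeholder for real async work
    return {"final_report": f"Report on: {question}"}


async def caller() -> dict:
    # Simulates calling the wrapper from inside a running loop
    return run_coroutine_sync(fake_agent("nested"))


plain_result = run_coroutine_sync(fake_agent("demo"))
nested_result = asyncio.run(caller())
print(plain_result, nested_result)
```

The thread fallback blocks the outer loop while it waits, which is acceptable for a one-off notebook run but not for code that must stay responsive.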

## 7. Manual Test

Run this cell to verify the agent works correctly with a simple test question.

In [None]:
# Simple test
test_question = "What are the key benefits and challenges of using large language models in enterprise applications?"

print(f"Testing Iterative Refinement Agent with question:\n{test_question}\n")
print("Running iterative research (this may take several minutes)...\n")

try:
    result = iterative_refinement_agent({"question": test_question})
    
    print("\n" + "=" * 80)
    print("FINAL REPORT")
    print("=" * 80)
    print(result["output"][:3000] + "..." if len(result["output"]) > 3000 else result["output"])
    print("\n" + "=" * 80)
    print(f"Report length: {len(result['output'])} characters")
    print(f"Total revisions: {result.get('revision_count', 0)}")
    print(f"Quality progression: {result.get('quality_scores', [])}")
    print(f"Unique sources: {len(set(result.get('source_urls', [])))}")
    print("Agent test PASSED ✓")
except Exception as e:
    print(f"Agent test FAILED: {e}")
    import traceback
    traceback.print_exc()
    raise

## 8. Evaluation Harness Integration

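Once the manual test passes, run the cells below. The harness setup cell runs as-is; the full evaluation cell is commented out because it is expensive, so uncomment it only when you are ready for a full run.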

In [None]:
# Import evaluation harness and metrics
import sys
sys.path.insert(0, "..")
from evaluation import (
    ExperimentHarness, 
    fact_recall, 
    citation_precision,
    coherence_judge, 
    depth_judge, 
    relevance_judge,
    minimum_sources_check
)

# Initialize harness with the golden test dataset
harness = ExperimentHarness(
    dataset_path="../data/deep_research_agent_test_dataset.yaml",
    langsmith_dataset_name="deep-research-golden-v2"
)

print("Evaluation harness initialized successfully!")
print(f"Dataset: {harness.dataset_path}")
print(f"LangSmith dataset name: {harness.langsmith_dataset_name}")

In [None]:
# Full Evaluation on All 20 Questions
# ⚠️ EXPENSIVE - Only uncomment when ready for full evaluation
# Uncomment to run:

# # Define comprehensive evaluator suite
# evaluators = [
#     fact_recall,              # Required facts coverage
#     citation_precision,       # Citation URL validity
#     minimum_sources_check,    # Minimum source count
#     coherence_judge,          # Logical structure
#     depth_judge,              # Analysis depth
#     relevance_judge,          # Addresses question
# ]
# 
# # Run full evaluation
# print("Starting FULL evaluation on all 20 questions...")
# print("Iterative Refinement Agent - this will take 1-2 hours.")
# print("=" * 80 + "\n")
# 
# results = harness.run_evaluation(
#     agent_fn=iterative_refinement_agent,
#     evaluators=evaluators,
#     experiment_name="iterative_refinement_v1",
#     monte_carlo_runs=1,  # Single run to reduce cost
#     max_concurrency=2,   # Lower concurrency for stability
#     description="Iterative Refinement paradigm evaluation on all difficulty tiers"
# )
# 
# # Display comprehensive results
# print("\n" + "=" * 80)
# print("FULL EVALUATION RESULTS")
# print("=" * 80)
# print(f"Experiment: {results.experiment_name}")
# print(f"Questions evaluated: {results.num_questions}")
# print(f"Runs per question: {results.num_runs}")
# 
# print(f"\n{'Metric':<30} {'Mean':<10}")
# print("-" * 40)
# for metric_name in sorted(results.metrics.keys()):
#     if not metric_name.endswith('_std'):
#         value = results.metrics.get(metric_name, 0)
#         print(f"{metric_name:<30} {value:<10.3f}")
# 
# # Save results to file
# import json
# from datetime import datetime
# 
# results_file = Path("../results") / f"iterative_refinement_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
# results_file.parent.mkdir(exist_ok=True)
# 
# with open(results_file, 'w') as f:
#     json.dump({
#         "experiment_name": results.experiment_name,
#         "num_questions": results.num_questions,
#         "num_runs": results.num_runs,
#         "metrics": results.metrics,
#         "per_question": results.per_question_results
#     }, f, indent=2)
# 
# print(f"\nResults saved to: {results_file}")

print("Full evaluation cell ready. Uncomment to run when ready.")