# 01: Knowledge-Grounded QA Agent

This notebook introduces the **Knowledge-Grounded QA Agent** - a research assistant that uses
Google Search, web fetching, and intelligent planning to answer complex questions with citations.

## What You'll Learn

1. **Agent Architecture** - How the agent uses planning, tools, and reflection
2. **Tools** - Google Search, web fetching, file operations
3. **Running the Agent** - Ask questions and see it work
4. **Evaluation** - Run a single-sample evaluation with the LLM-as-judge

## Prerequisites

- Set `GOOGLE_API_KEY` in your `.env` file
- Run `uv sync` to install dependencies
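
If you want to confirm the key is visible to Python before creating the agent, a check along these lines can help. This is a minimal sketch: `missing_env_vars` is a hypothetical helper written for this notebook, not part of `aieng.agent_evals`.

```python
import os
from collections.abc import Mapping


def missing_env_vars(names: list[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names from `names` that are unset or blank in `env`."""
    return [name for name in names if not env.get(name, "").strip()]


# Hypothetical usage: run this after load_dotenv() so values from .env are included.
missing = missing_env_vars(["GOOGLE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```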

In [None]:
# Setup - all imports at the top
from aieng.agent_evals.knowledge_agent import (
    DeepSearchQADataset,
    DeepSearchQAJudge,
    KnowledgeGroundedAgent,
)
from aieng.agent_evals.knowledge_agent.agent import SYSTEM_INSTRUCTIONS
from aieng.agent_evals.knowledge_agent.notebook import run_with_display
from dotenv import load_dotenv
from rich.console import Console
from rich.markdown import Markdown
from rich.panel import Panel
from rich.table import Table


load_dotenv(verbose=True)
console = Console(width=100)

## 1. Agent Architecture

The `KnowledgeGroundedAgent` is built on Google's Agent Development Kit (ADK) and uses a **ReAct loop**
(Reasoning + Acting) with **PlanReAct planning**.

### Key Components

```
┌─────────────────────────────────────────────────────────────────┐
│                    KnowledgeGroundedAgent                       │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │  Planning   │───▶│  Execution  │───▶│    Reflection       │  │
│  │ (PlanReAct) │    │   (Tools)   │    │  (Update Plan)      │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│         │                  │                     │              │
│         ▼                  ▼                     ▼              │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                      Tools                                  ││
│  │  • google_search  - Find URLs via web search                ││
│  │  • web_fetch      - Fetch HTML pages and PDFs               ││
│  │  • fetch_file     - Download data files (CSV, XLSX)         ││
│  │  • grep_file      - Search within downloaded files          ││
│  │  • read_file      - Read sections of downloaded files       ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
```

### How It Works

1. **Planning**: The agent creates a research plan with explicit steps using PlanReAct
2. **Execution**: Each step is executed using available tools
3. **Reflection**: After each step, the agent reflects on findings and may update the plan
4. **Synthesis**: The final step combines all findings into a comprehensive answer
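
The four stages above can be sketched as a small loop. This is a toy illustration under stated assumptions, not the agent's actual implementation: `Step`, `Plan`, `run_plan`, and `toy_tool` are hypothetical names, and the "reflection" here is reduced to queuing one retry when a step fails.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    description: str
    status: str = "pending"  # becomes "completed" or "failed"
    finding: str = ""


@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)


def run_plan(plan: Plan, execute: Callable[[str], str], max_steps: int = 10) -> str:
    """Run each step, reflect on failures, then synthesize the findings."""
    index = 0
    while index < len(plan.steps) and index < max_steps:
        step = plan.steps[index]
        try:
            # Execution: call a tool for this step.
            step.finding = execute(step.description)
            step.status = "completed"
        except Exception:  # broad on purpose: any tool error marks the step failed
            step.status = "failed"
            # Reflection: update the plan by queuing one retry for a failed step.
            if not step.description.startswith("Retry: "):
                plan.steps.append(Step(f"Retry: {step.description}"))
        index += 1
    # Synthesis: combine the findings of all completed steps.
    return " ".join(s.finding for s in plan.steps if s.status == "completed")


def toy_tool(description: str) -> str:
    """Stand-in for a real tool call; fails on the first fetch attempt."""
    if description == "fetch page":
        raise RuntimeError("simulated fetch failure")
    return f"[{description}]"


plan = Plan([Step("search"), Step("fetch page")])
answer = run_plan(plan, toy_tool)
```

The `max_steps` cap keeps the loop bounded even if reflection keeps adding steps, which is the kind of safeguard a planning loop generally needs.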

In [None]:
# Create the agent

agent = KnowledgeGroundedAgent(enable_planning=True)

# Display configuration
config_table = Table(title="Agent Configuration", show_header=False)
config_table.add_column("Setting", style="cyan")
config_table.add_column("Value", style="white")
config_table.add_row("Model", agent.model)
config_table.add_row("Planning", "PlanReAct")
config_table.add_row("Planning Enabled", str(agent.enable_planning))

console.print(config_table)

## 2. Available Tools

The agent has access to these tools for research:

| Tool | Purpose | Example Use |
|------|---------|-------------|
| `google_search` | Find relevant URLs | Finding news articles, official sources |
| `web_fetch` | Get full page content | Reading articles, PDFs, documentation |
| `fetch_file` | Download data files | Getting CSV/XLSX datasets |
| `grep_file` | Search within files | Finding specific data in large files |
| `read_file` | Read file sections | Extracting specific parts of documents |
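
To give a feel for what "search within files" means in practice, here is a minimal sketch of a grep-style helper. It is an assumption about the general technique only: `grep_lines` is a hypothetical function, and the agent's real `grep_file` tool may take different arguments and return a different shape.

```python
import re
from collections.abc import Iterable


def grep_lines(lines: Iterable[str], pattern: str, max_matches: int = 20) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines matching `pattern`, case-insensitively."""
    regex = re.compile(pattern, re.IGNORECASE)
    matches: list[tuple[int, str]] = []
    for line_number, line in enumerate(lines, start=1):
        if regex.search(line):
            matches.append((line_number, line.rstrip("\n")))
            if len(matches) >= max_matches:
                break  # cap the output so large files do not flood the context
    return matches


# Made-up sample rows; an open text file handle is also an iterable of lines.
sample = ["id,name,value\n", "1,alpha,10\n", "2,beta,20\n", "3,Alpha,30\n"]
hits = grep_lines(sample, "alpha")
```

Returning line numbers alongside the text lets a follow-up read target just the surrounding section instead of the whole file.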

In [None]:
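# View the first part of the system instructions that guide the agent
preview_chars = 2000
instructions_preview = SYSTEM_INSTRUCTIONS[:preview_chars]
# Only add the truncation marker when something was actually cut off
if len(SYSTEM_INSTRUCTIONS) > preview_chars:
    instructions_preview += "\n\n[... truncated for display ...]"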

console.print(
    Panel(
        Markdown(instructions_preview),
        title="System Instructions (Preview)",
        border_style="blue",
    )
)

## 3. Running the Agent

Let's ask the agent a question and observe how it works. We'll ask a factual question
that requires web search to answer correctly.

In [None]:
# Ask a question with live progress display
question = "When was the highest single-day snowfall recorded in Toronto?"

# Run the agent with live display showing plan and tool calls
response = await run_with_display(agent, question)

In [None]:
# Display the final answer
console.print(
    Panel(
        response.text,
        title="Answer",
        border_style="cyan",
        subtitle=f"Duration: {response.total_duration_ms / 1000:.1f}s | Sources: {len(response.sources)}",
    )
)

In [None]:
# View the research plan that was created
plan = response.plan

# Guard against a missing or empty plan, as the tool-call and source cells do
if plan is not None and plan.steps:
    plan_table = Table(title="Research Plan")
    plan_table.add_column("#", style="cyan", width=3)
    plan_table.add_column("Step", style="white")
    plan_table.add_column("Type", style="dim")
    plan_table.add_column("Status", style="green")

    status_icons = {"completed": "✓", "skipped": "○", "failed": "✗"}
    for step in plan.steps:
        # Truncate long descriptions so the table stays readable
        description = step.description
        if len(description) > 60:
            description = description[:60] + "..."
        plan_table.add_row(
            str(step.step_id),
            description,
            step.step_type,
            f"{status_icons.get(step.status, '?')} {step.status}",
        )

    console.print(plan_table)
else:
    console.print("[dim]No research plan recorded[/dim]")

In [None]:
# View the tool calls made during execution
if response.tool_calls:
    tools_table = Table(title="Tool Calls")
    tools_table.add_column("#", style="dim", width=3)
    tools_table.add_column("Tool", style="cyan")
    tools_table.add_column("Arguments", style="white")

    for i, tc in enumerate(response.tool_calls[:15], 1):  # Show first 15
        tool_name = tc.get("name", "unknown")
        args = tc.get("args", {})
        args_str = str(args)[:60] + "..." if len(str(args)) > 60 else str(args)
        tools_table.add_row(str(i), tool_name, args_str)

    if len(response.tool_calls) > 15:
        tools_table.add_row("...", f"({len(response.tool_calls) - 15} more)", "")

    console.print(tools_table)
else:
    console.print("[dim]No tool calls recorded[/dim]")

In [None]:
# View the sources used
if response.sources:
    sources_table = Table(title="Sources")
    sources_table.add_column("#", style="dim", width=3)
    sources_table.add_column("URL", style="blue")

    # Deduplicate sources by URL
    seen_urls = set()
    for src in response.sources:
        if src.uri and src.uri not in seen_urls:
            seen_urls.add(src.uri)
            url_display = src.uri[:80] + "..." if len(src.uri) > 80 else src.uri
            sources_table.add_row(str(len(seen_urls)), url_display)
        if len(seen_urls) >= 10:  # Show first 10 unique sources
            break

    console.print(sources_table)
else:
    console.print("[dim]No sources recorded[/dim]")

## 4. Single-Sample Evaluation

The **DeepSearchQA** benchmark contains 896 research questions for evaluating knowledge agents.
Let's evaluate the agent on a single sample using the **LLM-as-judge** approach.

### Evaluation Metrics

- **Precision**: Did the agent's answer contain only correct information?
- **Recall**: Did the agent find all the required information?
- **F1 Score**: Harmonic mean of precision and recall
- **Outcome**: `fully_correct`, `partially_correct`, `correct_with_extraneous`, or `fully_incorrect`
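
To make the relationship between these metrics concrete, here is a minimal sketch that derives them from per-item judgments. It assumes inputs shaped like the judge's `correctness_details` (ground-truth item mapped to whether it was found) and `extraneous_items`; `set_metrics` is a hypothetical helper, and the judge's own scoring may differ in its details.

```python
def set_metrics(
    correctness_details: dict[str, bool], extraneous_items: list[str]
) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from per-item judgments."""
    found = sum(1 for was_found in correctness_details.values() if was_found)
    # Everything the answer asserted: correct items plus extraneous ones.
    claimed = found + len(extraneous_items)
    precision = found / claimed if claimed else 0.0
    recall = found / len(correctness_details) if correctness_details else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up judgments: 2 of 4 ground-truth items found, plus 1 extraneous item.
precision, recall, f1 = set_metrics(
    {"item A": True, "item B": False, "item C": False, "item D": True},
    ["item X"],
)
```

In this example precision is 2/3 (two correct out of three asserted items), recall is 1/2, and F1 is 4/7.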

In [None]:
# Load the DeepSearchQA dataset
dataset = DeepSearchQADataset()

console.print(f"Loaded [cyan]{len(dataset)}[/cyan] examples")
console.print(f"Categories: [cyan]{len(dataset.get_categories())}[/cyan]")

In [None]:
# Pick an example from the Finance & Economics category
finance_examples = dataset.get_by_category("Finance & Economics")
example = finance_examples[0]  # Pick first one for reproducibility

console.print(
    Panel(
        f"[bold]ID:[/bold] {example.example_id}\n"
        f"[bold]Category:[/bold] {example.problem_category}\n"
        f"[bold]Answer Type:[/bold] {example.answer_type}\n\n"
        f"[bold cyan]Question:[/bold cyan]\n{example.problem}\n\n"
        f"[bold yellow]Ground Truth:[/bold yellow]\n{example.answer}",
        title="Test Example",
        border_style="blue",
    )
)

In [None]:
# Run the agent on this example with live progress
eval_response = await run_with_display(agent, example.problem)

In [None]:
# Display the agent's answer
console.print(
    Panel(
        eval_response.text,
        title="Agent's Answer",
        border_style="cyan",
        subtitle=f"Duration: {eval_response.total_duration_ms / 1000:.1f}s",
    )
)

In [None]:
# Use the LLM-as-judge to evaluate the answer

judge = DeepSearchQAJudge()

console.print("[dim]Evaluating with LLM judge...[/dim]\n")

# Get detailed evaluation
score, result = judge.evaluate_with_details(
    question=example.problem,
    answer=eval_response.text,
    ground_truth=example.answer,
    answer_type=example.answer_type,
)

# Display results
outcome_colors = {
    "fully_correct": "green",
    "correct_with_extraneous": "yellow",
    "partially_correct": "orange1",
    "fully_incorrect": "red",
}
outcome_color = outcome_colors.get(result.outcome, "white")

metrics_table = Table(title="Evaluation Results")
metrics_table.add_column("Metric", style="cyan")
metrics_table.add_column("Value", style="white")

metrics_table.add_row("Outcome", f"[{outcome_color}]{result.outcome}[/{outcome_color}]")
metrics_table.add_row("Precision", f"{result.precision:.2f}")
metrics_table.add_row("Recall", f"{result.recall:.2f}")
metrics_table.add_row("F1 Score", f"[bold]{result.f1_score:.2f}[/bold]")

console.print(metrics_table)

In [None]:
# Show the judge's explanation
if result.explanation:
    console.print(Panel(result.explanation, title="Judge's Explanation", border_style="magenta"))

# Show correctness details if available
if result.correctness_details:
    details_table = Table(title="Correctness Details")
    details_table.add_column("Ground Truth Item", style="white")
    details_table.add_column("Found", style="cyan", justify="center")

    for item, found in result.correctness_details.items():
        icon = "[green]✓[/green]" if found else "[red]✗[/red]"
        details_table.add_row(item[:50] + "..." if len(item) > 50 else item, icon)

    console.print(details_table)

# Show extraneous items if any
if result.extraneous_items:
    console.print(
        Panel(
            "\n".join(f"• {item}" for item in result.extraneous_items),
            title="Extraneous Items (not in ground truth)",
            border_style="yellow",
        )
    )

## 5. Using the CLI

You can also run the agent and evaluations from the command line:

```bash
# Ask a question
knowledge-agent ask "What is quantum computing?" --show-plan

# Run evaluation on specific example IDs
knowledge-agent eval --ids 123 456 --show-plan

# Run evaluation on random samples
knowledge-agent eval --samples 5 --category "Finance & Economics"

# View dataset samples
knowledge-agent sample --category "Science & Technology" --count 3
```

The CLI shows a live display with:
- Research plan checklist with step statuses
- Tool calls as they happen
- Context usage indicator

## Summary

In this notebook, you learned:

1. **Agent Architecture** - Planning, execution, and reflection with multiple tools
2. **Tools** - Google Search, web fetching, and file operations
3. **Running the Agent** - How to ask questions and interpret responses
4. **Evaluation** - Using DeepSearchQA and LLM-as-judge for quality assessment

**Next**: In notebook 02, you'll learn about **Langfuse tracing** to observe agent behavior in detail.

In [None]:
console.print(
    Panel(
        "[green]✓[/green] Notebook complete!\n\n"
        "[cyan]Next:[/cyan] Open [bold]02_langfuse_tracing.ipynb[/bold] to learn about observability.",
        title="Done",
        border_style="green",
    )
)