# 03: Multi-Turn Conversations & Evaluation

This notebook demonstrates multi-turn conversations with the agent and
shows how to evaluate it on the DeepSearchQA benchmark.

## Learning Objectives

- Understand how ADK manages multi-turn conversations via sessions
- Use the `DeepSearchQAEvaluator` for systematic evaluation
- Analyze evaluation results with rich visualizations
- Understand evaluation metrics for research agents

In [None]:
# Setup: Load environment and configure rich console
import uuid

from aieng.agent_evals import (
    create_console,
    display_evaluation_result,
    display_metrics_table,
    display_success,
)
from aieng.agent_evals.knowledge_agent import (
    DeepSearchQADataset,
    DeepSearchQAEvaluator,
    KnowledgeGroundedAgent,
)
from dotenv import load_dotenv
from rich.panel import Panel
from rich.table import Table


console = create_console()
load_dotenv(verbose=True)

## 1. Multi-Turn Conversations with ADK

The `KnowledgeGroundedAgent` uses Google ADK's built-in session management via `InMemorySessionService`.
When you pass a `session_id` to `answer_async()`, ADK maintains conversation history automatically.

Key points:
- Each unique `session_id` creates a separate conversation thread
- ADK tracks all messages, tool calls, and context within that session
- No manual history tracking needed - ADK handles it internally
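
The key points above can be illustrated with a toy model of session-keyed history. This is a minimal sketch, not ADK's implementation: `ToySessionStore` and its methods are hypothetical names invented for this example, and the real `InMemorySessionService` tracks far more than plain text messages.

```python
from collections import defaultdict


class ToySessionStore:
    """Hypothetical stand-in for a session service: maps session_id -> messages."""

    def __init__(self) -> None:
        # Each session ID gets its own independent list of (role, text) pairs.
        self._history: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def append(self, session_id: str, role: str, text: str) -> None:
        """Record one message in the thread identified by session_id."""
        self._history[session_id].append((role, text))

    def history(self, session_id: str) -> list[tuple[str, str]]:
        """Return a copy of the thread so callers cannot mutate the store."""
        return list(self._history[session_id])


store = ToySessionStore()
store.append("session-a", "user", "What is the capital of France?")
store.append("session-a", "agent", "Paris.")
store.append("session-b", "user", "Unrelated question")

# The two threads never see each other's messages.
print(len(store.history("session-a")))  # 2
print(len(store.history("session-b")))  # 1
```

Because a follow-up such as "What is its population?" is only answerable with the earlier turn in view, reusing the same `session_id` is what lets the agent resolve "its".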

In [None]:
# Create agent and demonstrate multi-turn conversation
agent = KnowledgeGroundedAgent()

# Create a session ID for multi-turn conversation
session_id = str(uuid.uuid4())

console.print(
    Panel(
        f"[cyan]Session ID:[/cyan] {session_id}\n\nADK will track conversation history for this session automatically.",
        title="🗨️ New Session Created",
        border_style="green",
    )
)

In [None]:
# First turn - ask a question
response1 = await agent.answer_async("What is the capital of France?", session_id=session_id)
console.print(Panel(response1.text, title="Turn 1: Capital of France", border_style="blue"))

In [None]:
# Second turn - follow-up question (ADK remembers the context)
response2 = await agent.answer_async("What is its population?", session_id=session_id)
console.print(Panel(response2.text, title="Turn 2: Population (follow-up)", border_style="blue"))

## 2. Session Management in Applications

For web applications (like Gradio), you can store a session ID in the app's state:

```python
# In a Gradio app handler:
if "session_id" not in session_state:
    session_state["session_id"] = str(uuid.uuid4())

response = await agent.answer_async(query, session_id=session_state["session_id"])
```

See `gradio_app.py` for a complete example.
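
The get-or-create pattern in the snippet above can be isolated into a small helper that runs without Gradio. This is a sketch under assumptions: `get_or_create_session_id` is a hypothetical helper name, and plain dictionaries stand in for whatever per-user state object the web framework provides.

```python
import uuid


def get_or_create_session_id(state: dict) -> str:
    """Return the session ID stored in `state`, creating one on first use."""
    if "session_id" not in state:
        state["session_id"] = str(uuid.uuid4())
    return state["session_id"]


# Two dictionaries simulate the per-user state of two separate browser tabs.
browser_tab_a: dict = {}
browser_tab_b: dict = {}

first = get_or_create_session_id(browser_tab_a)
again = get_or_create_session_id(browser_tab_a)
other = get_or_create_session_id(browser_tab_b)

print(first == again)  # True: the same tab keeps its conversation
print(first == other)  # False: a different tab starts a new conversation
```

Keeping the ID in per-user state rather than in a module-level variable matters in a web app, since a shared global would merge every visitor into one conversation.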

In [None]:
# For more details on ADK sessions, see:
# https://google.github.io/adk-docs/sessions/

display_success("Multi-turn conversation demo complete!", console=console)

## 3. Running DeepSearchQA Evaluation

The `DeepSearchQAEvaluator` provides a systematic way to evaluate the agent.

In [None]:
# Create evaluator using the existing agent
evaluator = DeepSearchQAEvaluator(agent)

display_success(f"Dataset size: {len(evaluator.dataset)} examples", console=console)

In [None]:
# Evaluate a small sample
console.print("[bold]🔬 Running evaluation on 3 examples...[/bold]\n")

console.print("[dim]Evaluating...[/dim]")
results = await evaluator.evaluate_sample_async(n=3, random_state=42)

display_success(f"Completed {len(results)} evaluations", console=console)

In [None]:
# View results using the display utility
console.print("\n[bold]📋 Evaluation Results[/bold]\n")

for result in results:
    contains_answer = result.ground_truth.lower() in result.prediction.lower()
    display_evaluation_result(
        example_id=result.example_id,
        problem=result.problem,
        ground_truth=result.ground_truth,
        prediction=result.prediction,
        sources_used=result.sources_used,
        search_queries=result.search_queries,
        contains_answer=contains_answer,
        console=console,
    )

## 4. Analyzing Evaluation Results

In [None]:
# Convert to DataFrame for analysis
df = evaluator.results_to_dataframe(results)

# Calculate metrics
containment_correct = sum(1 for r in results if r.ground_truth.lower() in r.prediction.lower())
# Guard against an empty result list to avoid a ZeroDivisionError
containment_accuracy = containment_correct / len(results) * 100 if results else 0.0

metrics = {
    "Total Examples": len(results),
    "Containment Accuracy": f"{containment_accuracy:.1f}%",
    "Avg Sources Used": df["sources_used"].mean(),
    "Avg Search Queries": df["search_queries"].apply(len).mean(),
}

display_metrics_table(metrics, title="Evaluation Metrics", console=console)

## 5. Understanding Evaluation Metrics

For research agents, we care about:

1. **Answer Correctness**: Does the prediction match the ground truth?
2. **Source Quality**: Are the sources relevant and authoritative?
3. **Comprehensiveness**: Did the agent find all necessary information?
4. **Search Efficiency**: How many searches were needed?
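
For answer correctness, a plain lowercase substring check can miss matches that differ only in punctuation or spacing. The sketch below shows one possible normalization step; `normalize` and `contains_normalized` are hypothetical helpers written for this example and are not part of `aieng.agent_evals`.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()


def contains_normalized(prediction: str, ground_truth: str) -> bool:
    """Check containment after normalizing both strings."""
    return normalize(ground_truth) in normalize(prediction)


# A plain lowercase check fails here because of the period in "St."
print("st louis" in "It is St. Louis".lower())             # False
print(contains_normalized("It is St. Louis", "St Louis"))  # True

# Containment is still a substring test, so false positives remain possible.
print(contains_normalized("A comparison of cities", "Paris"))  # True
```

The last line shows the main limitation: "comparison" happens to contain "paris", so containment counts as a rough signal rather than a verdict. An empty ground truth would also match every prediction.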

DeepSearchQA specifically measures:
- **Precision**: Share of the items in the predicted answer that are correct
- **Recall**: Share of the ground-truth items that the prediction includes (most relevant for list-type questions)
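
For a list-type question, precision and recall can be computed over sets of answer items. The following is a minimal sketch using the standard set-based definitions, plus F1 as their harmonic mean. It assumes the answer items have already been extracted and normalized into comparable strings; the function name is hypothetical, and how the benchmark's official scorer matches items is not shown in this notebook.

```python
def set_precision_recall_f1(
    predicted: set[str], expected: set[str]
) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a predicted set against an expected set."""
    true_positives = len(predicted & expected)
    # Define the ratio as 0.0 when its denominator would be zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = {"france", "germany", "spain"}
expected = {"france", "germany", "italy", "poland"}

p, r, f1 = set_precision_recall_f1(predicted, expected)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.50 f1=0.57
```

Here two of the three predicted items are correct (precision 2/3), and two of the four expected items were found (recall 1/2).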

In [None]:
# Manual correctness check with better display
def check_answer_contains_ground_truth(prediction: str, ground_truth: str) -> bool:
    """Check if prediction contains the ground truth answer."""
    return ground_truth.lower() in prediction.lower()


# Check our results
console.print("\n[bold]📊 Correctness Check[/bold]\n")

result_table = Table(show_header=True, header_style="bold cyan")
result_table.add_column("Example", style="cyan")
result_table.add_column("Status", style="white")
result_table.add_column("Expected", style="dim")

for result in results:
    contains = check_answer_contains_ground_truth(result.prediction, result.ground_truth)
    status = "[green]✓ MATCH[/green]" if contains else "[yellow]✗ NO MATCH[/yellow]"
    result_table.add_row(
        str(result.example_id),
        status,
        result.ground_truth[:40] + "..." if len(result.ground_truth) > 40 else result.ground_truth,
    )

console.print(result_table)

## 6. Exploring Categories

In [None]:
# List every category together with its example count
dataset = DeepSearchQADataset()
categories = dataset.get_categories()

cat_table = Table(title="📁 Available Categories", show_header=True, header_style="bold green")
cat_table.add_column("Category", style="white")
cat_table.add_column("Count", style="cyan", justify="right")

for cat in sorted(categories):
    count = len(dataset.get_by_category(cat))
    cat_table.add_row(cat, str(count))

console.print(cat_table)

## Summary

In this notebook, you learned:

1. How ADK manages multi-turn conversations via `InMemorySessionService`
2. How to use `session_id` for conversation continuity
3. How to run systematic evaluations with `DeepSearchQAEvaluator`
4. How to analyze evaluation results with rich visualizations
5. Key metrics for evaluating research agents

## Next Steps

- Run the Gradio app for interactive testing
- Experiment with different models (gemini-2.5-pro vs flash)
- Try the async evaluator for larger-scale evaluation
- Implement LLM-as-judge evaluation for more nuanced correctness checking
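
As a starting point for the LLM-as-judge step, the sketch below separates prompt construction and verdict parsing from the model call itself. Everything here is an assumption made for illustration: the prompt wording, the `CORRECT`/`INCORRECT` reply convention, and the `ask_model` callable are hypothetical, and a stub stands in for a real model so the example runs offline.

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are grading an answer to a research question.\n"
    "Question: {problem}\n"
    "Reference answer: {ground_truth}\n"
    "Candidate answer: {prediction}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def build_judge_prompt(problem: str, ground_truth: str, prediction: str) -> str:
    """Fill the grading template with one evaluation example."""
    return JUDGE_PROMPT.format(
        problem=problem, ground_truth=ground_truth, prediction=prediction
    )


def parse_verdict(reply: str) -> bool:
    """Treat a reply as a pass only if it starts with CORRECT."""
    return reply.strip().upper().startswith("CORRECT")


def judge_answer(
    problem: str,
    ground_truth: str,
    prediction: str,
    ask_model: Callable[[str], str],
) -> bool:
    """Ask the judge model for a verdict and convert it to a boolean."""
    return parse_verdict(ask_model(build_judge_prompt(problem, ground_truth, prediction)))


def stub_model(prompt: str) -> str:
    """Offline stand-in for a real LLM call; always answers CORRECT."""
    return "CORRECT"


print(judge_answer("Capital of France?", "Paris", "It is Paris.", stub_model))
```

Passing the model in as a callable keeps the grading logic testable without network access; in practice `stub_model` would be replaced with a function that calls the chosen model.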

In [None]:
console.print(
    Panel(
        "[green]✓[/green] Notebook complete!\n\n"
        "[cyan]Next:[/cyan] Run [bold]gradio_app.py[/bold] for interactive testing.",
        title="🎉 Done",
        border_style="green",
    )
)