<a href="https://colab.research.google.com/github/angelatyk/tinytutor/blob/dev/notebooks/05_evaluation_and_observability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation and Observability

Prompt for TinyTutor Notebook 5: `05_evaluation_and_observability.ipynb`

# TinyTutor Capstone Notebook 5: Evaluation and Observability
## Objective
Generate runnable Python code using the **ADK** to implement robust quality control checks (Evaluation) and deep debugging capabilities (Observability). This notebook enforces the principle that **"The Trajectory is the Truth"** [2], ensuring the entire decision-making process is visible and judged. This system integrates two specialized Critique Agents into the Lesson Coordinator pipeline developed in Notebook 4.
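
To make the idea of a visible trajectory concrete, here is a minimal sketch in plain Python using only the standard library. It is not the ADK's tracing API: the `TraceStep` and `Trajectory` classes, the logger name, and the `search` tool are hypothetical names introduced for illustration. The sketch records each thought, tool call, and observation as one structured log line so the full path can be inspected afterwards.

```python
import json
import logging
from dataclasses import asdict, dataclass, field
from typing import Any

# Hypothetical logger name; DEBUG level makes every recorded step visible.
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("tinytutor.trace")


@dataclass
class TraceStep:
    """One step in an agent trajectory (illustrative schema, not an ADK type)."""
    agent: str
    kind: str  # "thought", "tool_call", or "observation"
    payload: dict[str, Any] = field(default_factory=dict)


@dataclass
class Trajectory:
    """Ordered record of everything the agents did during one run."""
    steps: list[TraceStep] = field(default_factory=list)

    def record(self, agent: str, kind: str, **payload: Any) -> None:
        step = TraceStep(agent=agent, kind=kind, payload=payload)
        self.steps.append(step)
        # One JSON object per step keeps the log machine-readable.
        logger.debug(json.dumps(asdict(step)))

    def tool_calls(self) -> list[str]:
        """Names of the tools that were called, in order."""
        return [s.payload["tool"] for s in self.steps if s.kind == "tool_call"]


# Example run with made-up steps for the ScriptwritingAgent.
trajectory = Trajectory()
trajectory.record("ScriptwritingAgent", "thought", text="Need a simple analogy.")
trajectory.record("ScriptwritingAgent", "tool_call", tool="search",
                  args={"q": "photosynthesis"})
trajectory.record("ScriptwritingAgent", "observation",
                  result="Plants make sugar from light.")
```

Keeping the steps in a list as well as in the log means the same record can later be handed to an evaluator that judges the path taken, not only the final script.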

## Implementation Requirements

1.  **Agent and Pipeline Definitions:** Re-define the complete multi-agent pipeline (`TinyTutorCoordinator` and its sub-agents/tools) from Notebook 4.
2.  **Observability Setup:** Configure the ADK runner to display the **full internal trajectory** (trace/thought process) of the agents during execution. This includes the LLM's reasoning, tool selection, arguments passed, and observations received, as logging and tracing are the foundation for seeing inside the agent's mind.
3.  **Safety Checker Agent:** Define an `LlmAgent` named `SafetyCheckerAgent` (or `reviewer_agent`, as per the file structure).
    *   **Role:** Acts as a non-negotiable guardrail. Its primary instruction must be to review the `{final_script}` for **age-appropriateness, harmful content, and adherence to safety policies**.
    *   **Model:** This agent should simulate the use of a high-tier model (e.g., Gemini 1.5 Pro), because the safety assessment is critical.
    *   **Output:** Must return a structured pass/fail assessment (e.g., JSON) indicating the safety status and justification.
4.  **Evaluation Agent (LLM-as-a-Judge):** Define a specialized `LlmAgent` named `EvaluatorAgent`.
    *   **Role:** Acts as the automated quality judge. Provide it with a detailed **rubric** (e.g., scoring Simplicity, Coherence, and ELI5 adherence on a scale of 1-5).
    *   **Integration:** Run the `SafetyCheckerAgent` and then the `EvaluatorAgent` in sequence after the `ScriptwritingAgent` finishes its task.
5.  **Execution:** Run the complete multi-agent pipeline for a complex topic (e.g., "The mechanism of photosynthesis"). The final output must clearly show:
    *   The **full execution trace** (trajectory) proving observability.
    *   The structured quality scores and safety check results from the Judge Agents.
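
The structured outputs described above could be validated along these lines. This is a sketch under assumptions, not the notebook's actual code: the JSON field names (`status`, `justification`, `simplicity`, `coherence`, `eli5_adherence`), the helper functions, and the `min_score` threshold are all hypothetical. Only the pass/fail shape and the 1-5 rubric scale come from the requirements.

```python
import json
from dataclasses import dataclass

# Rubric criteria named in the requirements; the key spellings are assumptions.
RUBRIC = ("simplicity", "coherence", "eli5_adherence")


@dataclass(frozen=True)
class SafetyVerdict:
    passed: bool
    justification: str


def parse_safety(raw: str) -> SafetyVerdict:
    """Parse the safety checker's JSON reply into a pass/fail verdict."""
    data = json.loads(raw)
    status = str(data["status"]).lower()
    if status not in ("pass", "fail"):
        raise ValueError(f"unexpected safety status: {status!r}")
    return SafetyVerdict(passed=(status == "pass"),
                         justification=str(data.get("justification", "")))


def parse_scores(raw: str) -> dict[str, int]:
    """Parse the evaluator's JSON reply and enforce the 1-5 scale."""
    data = json.loads(raw)
    scores: dict[str, int] = {}
    for criterion in RUBRIC:
        value = data[criterion]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{criterion} must be an integer from 1 to 5, got {value!r}")
        scores[criterion] = value
    return scores


def accept(safety: SafetyVerdict, scores: dict[str, int], min_score: int = 3) -> bool:
    """Accept a script only if it is safe and every score meets the threshold."""
    return safety.passed and min(scores.values()) >= min_score


# Example with made-up agent replies.
safety = parse_safety('{"status": "pass", "justification": "No harmful content."}')
scores = parse_scores('{"simplicity": 4, "coherence": 5, "eli5_adherence": 3}')
print(accept(safety, scores))
```

Rejecting malformed replies early is a deliberate choice here: a judge that returns a score of 7 or a status of "maybe" should fail loudly instead of silently passing a lesson through.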

## Generation Prompt
"Generate the runnable Python code for the '05_evaluation_and_observability.ipynb' notebook for the **TinyTutor** project. The code must integrate a `SafetyCheckerAgent` and an `EvaluatorAgent` (LLM-as-a-Judge) into the multi-agent pipeline from Notebook 4. Configure the runner to display the **full agent execution trace/trajectory**, demonstrating observability. The final output must include the structured, rubric-based scores from the evaluator and the safety status from the checker."

### 5. `05_evaluation_and_observability.ipynb` - Final Checklist

| Category | Requirement |
| :--- | :--- |
| **Core Concept** | **Agent Ops:** Foundation of **Observability** and **Evaluation**. Evaluation must be an architectural pillar. |
| **Goal** | Integrate safety checks and quality scoring (LLM-as-a-Judge) and visualize the agent's decision-making process (trajectory). |
| **Dependencies** | The full coordinated pipeline (NB 4). |
| **Required Tools** | **ADK Core:** Logging and Tracing instrumentation. **Evaluation Agents:** `SafetyCheckerAgent` and `EvaluatorAgent` (simulating Gemini 1.5 Pro for higher-quality judgment). |
| **Architecture** | The system must be built to be **evaluatable by design**. The critical step is Trajectory Evaluation, analyzing the path taken. |
| **Good Practices** | **Observability Pillars:** Ensure structured Logs (Agent's Diary), Traces (Narrative Thread/Footsteps), and Metrics (Health Report) are captured. Use `--log_level DEBUG` for local debugging. |
| **Good Practices** | **Safety Guardrails:** Implement safety features as explicit components using the ADK **Plugin** pattern (e.g., scanning input before model call and output after). |
| **Good Practices** | **LLM-as-a-Judge:** Use a separate LLM (Gemini Pro) to score the output against a defined rubric (Simplicity, Coherence). |
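
The guardrail idea from the checklist (scan the input before the model call and the output after it) can be sketched without any framework. The example below is plain Python and does not use the ADK Plugin API; `guarded`, `scan`, `fake_model`, and the blocklist are hypothetical stand-ins, and a real check would likely be far more thorough than a keyword match.

```python
from typing import Callable

# Placeholder blocklist (assumption); a real policy would be much richer.
BLOCKED_TERMS = ("violence", "weapon")


class SafetyViolation(Exception):
    """Raised when a prompt or a response trips the guardrail."""


def scan(text: str, stage: str) -> None:
    """Raise SafetyViolation if the text contains a blocked term."""
    lowered = text.lower()
    for term in BLOCKED_TERMS:
        if term in lowered:
            raise SafetyViolation(f"{stage}: blocked term {term!r}")


def guarded(model_call: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model call so input and output are both scanned."""
    def wrapper(prompt: str) -> str:
        scan(prompt, "input")          # check before the model call
        response = model_call(prompt)
        scan(response, "output")       # check after the model call
        return response
    return wrapper


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call so the sketch runs offline."""
    return f"Simple explanation of: {prompt}"


safe_model = guarded(fake_model)
print(safe_model("photosynthesis"))
```

Because the guardrail is a separate wrapper, it can be applied to any agent's model call without touching that agent's instructions, which is the point of treating safety as an explicit component.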
