# Evaluation and Observability

---

## Final System Prompt for `05_evaluation_and_observability.ipynb`

### Notebook Title:
**TinyTutor Capstone Notebook 05: Evaluation and Observability**

### Objective:
Implement robust **evaluation** and **observability** mechanisms for the TinyTutor multi-agent system using ADK. This notebook must demonstrate how the system can:

- Critique its own outputs using rubric-based scoring
- Enforce safety guardrails
- Expose its full internal decision-making trajectory

This establishes the foundation for **AgentOps discipline** and ensures the system is **evaluatable by design**.

---

### System Prompt:
> Generate runnable Python code for `05_evaluation_and_observability.ipynb` that extends the TinyTutor multi-agent pipeline with evaluation and observability features. Implement the following:
>
> 1. **Pipeline Reuse**:
>     - Re-import or redefine the `TinyTutorCoordinator` and sub-agents/tools from Notebook 04.
>
> 2. **Observability Setup**:
>     - Configure the ADK `Runner` with `LoggingPlugin` and set `log_level=DEBUG`.
>     - Ensure full trace visibility: agent thoughts, tool calls, arguments, and outputs.
>
> 3. **SafetyCheckerAgent**:
>     - Define an `LlmAgent` named `SafetyCheckerAgent`.
>     - Instruction: Review `{final_script}` for age-appropriateness, harmful content, and safety policy adherence.
>     - Simulate using Gemini 1.5 Pro.
>     - Output: Structured JSON with `status: pass/fail` and `justification`.
>
> 4. **EvaluatorAgent (LLM-as-a-Judge)**:
>     - Define an `LlmAgent` named `EvaluatorAgent`.
>     - Instruction: Score `{final_script}` using a rubric (e.g., Simplicity, Coherence, ELI5 adherence; scale 1–5).
>     - Output: Structured JSON with scores and summary.
>
> 5. **LoopAgent Pattern (Optional)**:
>     - Wrap the `ScriptwritingAgent` in a `LoopAgent` that repeats until the EvaluatorAgent returns an “Approved” score or passes a threshold.
>
> 6. **Execution**:
>     - Run the full pipeline with a complex topic (e.g., “The mechanism of photosynthesis”).
>     - Display:
>         - Full execution trace
>         - Safety check result
>         - Evaluation scores
>         - Final approved script
>
> 7. **Best Practices**:
>     - Use structured outputs and type hints
>     - Redact PII before logging or storing
>     - Include inline comments and Markdown to explain architecture, evaluation logic, and Capstone alignment

---

## Final Checklist for `05_evaluation_and_observability.ipynb`

| **Category**         | **Requirement**                                                                                                                                       | **Source/Justification**                                                                 |
|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| **Core Concept**      | AgentOps: Evaluation and Observability as architectural pillars                                                                                      | Capstone delivery requirement                                                             |
| **Goal**              | Integrate safety checks and quality scoring; expose full agent trajectory                                                                            | Ensures transparency, reliability, and trustworthiness                                    |
| **Dependencies**      | Requires full pipeline from Notebook 04                                                                                                              | Builds on multi-agent orchestration and memory logic                                      |
| **Required Tools**    | - `LoggingPlugin` with `log_level=DEBUG` <br> - `SafetyCheckerAgent` <br> - `EvaluatorAgent` <br> - Optional: `LoopAgent`                            | Enables traceability and iterative refinement                                             |
| **Agent Design**      | - SafetyCheckerAgent: non-negotiable guardrail <br> - EvaluatorAgent: rubric-based quality judge                                                     | Mirrors real-world QA and compliance workflows                                            |
| **Evaluation Logic**  | - Safety: pass/fail + justification <br> - Quality: rubric scores (1–5)                                                                              | Validates pedagogical clarity and child-appropriateness                                   |
| **Execution**         | - Run full pipeline with complex topic <br> - Show trace, scores, and final output                                                                  | Demonstrates system maturity and readiness                                                |
| **Architecture**      | - Evaluatable by design <br> - LoopAgent for iterative refinement                                                                                    | Aligns with AgentOps and Capstone rubric                                                  |
| **Good Practices**    | - Structured logs and metrics <br> - Redact sensitive data <br> - Use clear scoring schema                                                           | Ensures compliance, clarity, and reproducibility                                          |
| **Documentation**     | - Inline comments <br> - Markdown explanations                                                                                                       | Supports Capstone reviewers and future collaborators                                      |

---

### What We’ll Have When This Code Is Done

- A fully observable, evaluatable multi-agent pipeline
- Two specialized critique agents: one for safety, one for quality
- A traceable execution log showing agent thoughts, tool calls, and outputs
- A rubric-based scoring system for pedagogical quality
- Optional loop logic for iterative refinement
- Clear documentation and inline logic to support Capstone delivery and debugging

---


# TinyTutor Capstone Notebook 05: Evaluation and Observability

This notebook adds evaluation and observability to the TinyTutor multi-agent system using mock agents. It demonstrates:

- A SafetyCheckerAgent to ensure age-appropriate, safe content
- An EvaluatorAgent (LLM-as-a-Judge) to score output quality using a rubric
- Full trace logging of agent decisions, tool calls, and outputs

This fulfills the Capstone requirement for AgentOps discipline and transparent system behavior.

In [1]:
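from typing import Any, Callable, Dict, List, Optional

# Simulated logging
import logging
# force=True (Python 3.8+) replaces any handlers already attached to the root
# logger. Without it, basicConfig does nothing when handlers already exist,
# which can happen in notebook kernels.
logging.basicConfig(level=logging.DEBUG, force=True)

# Simulated FunctionTool
class FunctionTool:
    """Minimal stand-in for a function tool: wraps a callable and logs each call."""

    def __init__(self, name: str, function: Callable, description: str = ""):
        self.name = name
        self.function = function
        self.description = description

    def call(self, *args, **kwargs):
        # Log the tool name and arguments so every call shows up in the trace.
        logging.debug(f"[Tool Call] {self.name} with args: {args}, kwargs: {kwargs}")
        return self.function(*args, **kwargs)

# Simulated LlmAgent
class LlmAgent:
    """Mock agent: logs its instruction and input, then writes a placeholder output."""

    def __init__(self, name: str, system_instruction: str,
                 tools: Optional[List[FunctionTool]] = None,
                 output_key: Optional[str] = None):
        self.name = name
        self.system_instruction = system_instruction
        self.tools = tools or []
        self.output_key = output_key

    def run(self, input_text: str, context: Dict[str, Any]) -> Dict[str, Any]:
        logging.info(f"[{self.name}] Instruction: {self.system_instruction}")
        logging.info(f"[{self.name}] Input: {input_text}")
        # Store the placeholder result in the shared context under output_key, if set.
        if self.output_key:
            context[self.output_key] = f"{self.name} output based on: {input_text}"
        return context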
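
The best practices in the system prompt call for redacting PII before logging or storing. One way to add that to the simulated logging setup is a `logging.Filter` that rewrites each record before it is emitted. The sketch below is illustrative only: the two regular expressions, the `redact_pii` helper, and the `tinytutor.redacted` logger name are assumptions, not part of the notebook, and real PII detection would need a much broader rule set.

```python
import logging
import re

# Hypothetical patterns for illustration; they cover only simple emails
# and US-style phone numbers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return PHONE_RE.sub("[REDACTED_PHONE]", text)


class PiiRedactionFilter(logging.Filter):
    """Redact the fully formatted message of every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so values passed as logging args are redacted too.
        record.msg = redact_pii(record.getMessage())
        record.args = ()  # the message is already formatted
        return True  # never drop the record, only rewrite it


# Assumed logger name; attach the filter to whichever logger the agents use.
logger = logging.getLogger("tinytutor.redacted")
logger.addFilter(PiiRedactionFilter())
```

Because the filter is attached to a logger rather than a handler, it applies only to records logged directly on that logger. Attaching it to each handler instead would also cover records that propagate up from child loggers.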

## Step 1: Define the SafetyCheckerAgent and EvaluatorAgent

These mock agents simulate safety validation and rubric-based scoring of the final script. Both return hard-coded results rather than calling a model.

In [2]:
# Safety Checker
class SafetyCheckerAgent(LlmAgent):
    def run(self, input_text: str, context: Dict[str, Any]) -> Dict[str, Any]:
        logging.info(f"[{self.name}] Checking safety of: {input_text}")
        context["safety_check"] = {
            "status": "pass",
            "justification": "Content is age-appropriate and free of harmful material."
        }
        return context

# Evaluator Agent
class EvaluatorAgent(LlmAgent):
    def run(self, input_text: str, context: Dict[str, Any]) -> Dict[str, Any]:
        logging.info(f"[{self.name}] Scoring script: {input_text}")
        context["evaluation_scores"] = {
            "simplicity": 5,
            "coherence": 4,
            "ELI5_adherence": 5,
            "summary": "Clear, engaging, and well-structured for a 5-year-old."
        }
        return context

safety_agent = SafetyCheckerAgent(
    name="SafetyCheckerAgent",
    system_instruction="Ensure the script is safe and age-appropriate."
)

evaluator_agent = EvaluatorAgent(
    name="EvaluatorAgent",
    system_instruction="Score the script using a rubric: simplicity, coherence, ELI5 adherence."
)
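
The system prompt asks for structured outputs and type hints, while the mock agents above return plain dictionaries. A minimal sketch of typed result schemas, assuming dataclasses are an acceptable substitute for raw dicts, might look like the following. The class names `SafetyResult` and `EvaluationScores`, the lowercase `eli5_adherence` field, and the `mean_score` helper are illustrative choices, not names from the notebook.

```python
from dataclasses import asdict, dataclass
from typing import Literal


@dataclass(frozen=True)
class SafetyResult:
    status: Literal["pass", "fail"]
    justification: str

    def __post_init__(self) -> None:
        # Literal is only a static hint, so check the value at runtime too.
        if self.status not in ("pass", "fail"):
            raise ValueError(f"status must be 'pass' or 'fail', got {self.status!r}")


@dataclass(frozen=True)
class EvaluationScores:
    simplicity: int
    coherence: int
    eli5_adherence: int
    summary: str

    def __post_init__(self) -> None:
        # Enforce the 1-5 rubric scale for every numeric criterion.
        for name in ("simplicity", "coherence", "eli5_adherence"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    @property
    def mean_score(self) -> float:
        """Average of the three rubric criteria."""
        return (self.simplicity + self.coherence + self.eli5_adherence) / 3


scores = EvaluationScores(simplicity=5, coherence=4, eli5_adherence=5,
                          summary="Clear and well-structured.")
safety = SafetyResult(status="pass", justification="No harmful material found.")

# asdict() converts back to a plain dict for storage in the shared context.
context_entry = asdict(scores)
```

Validating in `__post_init__` means a malformed judge response fails loudly at construction time instead of silently flowing into later pipeline steps.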

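## Step 2: Define the Script Generator with Optional Loop Logic

This mock agent returns a fixed script and stores it under `final_script`. The optional refinement loop, which would regenerate the script until it passes evaluation, is not implemented in this cell.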

In [3]:
class ScriptwritingAgent(LlmAgent):
    def run(self, input_text: str, context: Dict[str, Any]) -> Dict[str, Any]:
        logging.info(f"[{self.name}] Generating script from: {input_text}")
        script = "Once upon a time, a curious child explored photosynthesis with a talking leaf..."
        context[self.output_key] = script
        return context

script_agent = ScriptwritingAgent(
    name="ScriptwritingAgent",
    system_instruction="Generate a child-friendly story script.",
    output_key="final_script"
)
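
The optional loop pattern from the system prompt (repeat script generation until the evaluator approves or a threshold is met) can be sketched in plain Python. This is not ADK's `LoopAgent`. It is a hand-rolled stand-in under stated assumptions: the function name `refine_until_approved`, the mean-score threshold of 4.0, the iteration cap, and the two mock callables are all hypothetical.

```python
from typing import Callable, Dict, Tuple


def refine_until_approved(
    topic: str,
    write_script: Callable[[str, int], str],
    score_script: Callable[[str], Dict[str, int]],
    threshold: float = 4.0,
    max_iterations: int = 3,
) -> Tuple[str, Dict[str, int], int]:
    """Regenerate the script until the mean rubric score reaches the threshold.

    Returns the last script, its scores, and the number of attempts used.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    for attempt in range(1, max_iterations + 1):
        script = write_script(topic, attempt)
        scores = score_script(script)
        mean = sum(scores.values()) / len(scores)
        if mean >= threshold:
            break  # approved: stop refining
    return script, scores, attempt


# Hypothetical stand-ins: later drafts score higher so the loop terminates.
def mock_writer(topic: str, attempt: int) -> str:
    return f"Draft {attempt} about {topic}"


def mock_scorer(script: str) -> Dict[str, int]:
    attempt = int(script.split()[1])
    base = min(5, 2 + attempt)  # draft 1 scores 3, draft 2 scores 4, and so on
    return {"simplicity": base, "coherence": base, "eli5_adherence": base}


script, scores, attempts = refine_until_approved(
    "photosynthesis", mock_writer, mock_scorer
)
```

The iteration cap matters: with a real LLM judge the score may never reach the threshold, so the loop returns the last draft instead of running indefinitely. The caller can compare the returned scores against the threshold to see whether the draft was actually approved.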

## Step 3: Define the Coordinator Agent

This agent runs the script generator, then passes the output to the safety and evaluation agents.

In [4]:
class Coordinator:
    def __init__(self, script_agent, safety_agent, evaluator_agent):
        self.script_agent = script_agent
        self.safety_agent = safety_agent
        self.evaluator_agent = evaluator_agent

    def run(self, topic: str) -> Dict[str, Any]:
        context = {}
        context = self.script_agent.run(topic, context)
        context = self.safety_agent.run(context["final_script"], context)
        context = self.evaluator_agent.run(context["final_script"], context)
        return context
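
The checklist describes the safety check as a non-negotiable guardrail, yet the coordinator above runs the evaluator and returns the script regardless of the safety status. A minimal sketch of a gated variant follows, assuming a failed check should withhold the script and skip scoring. The function `run_with_safety_gate`, the `approved` key, and the three stand-in steps are hypothetical additions for illustration.

```python
from typing import Any, Callable, Dict

# A step takes (input_text, context) and returns the updated context,
# mirroring the run() signature of the mock agents in this notebook.
Step = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_with_safety_gate(topic: str, write: Step, check_safety: Step,
                         evaluate: Step) -> Dict[str, Any]:
    """Run write -> safety -> evaluate, stopping early if the safety check fails."""
    context: Dict[str, Any] = {}
    context = write(topic, context)
    context = check_safety(context["final_script"], context)

    if context["safety_check"]["status"] != "pass":
        # Withhold the unsafe script and skip quality scoring entirely.
        context.pop("final_script")
        context["approved"] = False
        return context

    context = evaluate(context["final_script"], context)
    context["approved"] = True
    return context


# Hypothetical stand-ins so the sketch runs on its own.
def fake_writer(text: str, ctx: Dict[str, Any]) -> Dict[str, Any]:
    ctx["final_script"] = f"A story about {text}"
    return ctx


def failing_checker(text: str, ctx: Dict[str, Any]) -> Dict[str, Any]:
    ctx["safety_check"] = {"status": "fail", "justification": "Simulated failure."}
    return ctx


def fake_evaluator(text: str, ctx: Dict[str, Any]) -> Dict[str, Any]:
    ctx["evaluation_scores"] = {"simplicity": 5}
    return ctx


blocked = run_with_safety_gate("volcanoes", fake_writer, failing_checker, fake_evaluator)
```

With the notebook's own agents, the bound methods `script_agent.run`, `safety_agent.run`, and `evaluator_agent.run` would fit the same `Step` signature.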

## Step 4: Run the Evaluation Pipeline

We’ll now run the full pipeline for the topic:  
**"The mechanism of photosynthesis"**

In [5]:
coordinator = Coordinator(script_agent, safety_agent, evaluator_agent)
result = coordinator.run("The mechanism of photosynthesis")

print("\n Final Script:\n", result.get("final_script"))
print("\n Safety Check:\n", result.get("safety_check"))
print("\n Evaluation Scores:\n", result.get("evaluation_scores"))


 Final Script:
 Once upon a time, a curious child explored photosynthesis with a talking leaf...

 Safety Check:
 {'status': 'pass', 'justification': 'Content is age-appropriate and free of harmful material.'}

 Evaluation Scores:
 {'simplicity': 5, 'coherence': 4, 'ELI5_adherence': 5, 'summary': 'Clear, engaging, and well-structured for a 5-year-old.'}
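
The output above shows the script, safety result, and scores, but not the execution trace that the system prompt also asks to display. One way to show the trace alongside the results is to collect log records in memory with a custom `logging.Handler`. The sketch below assumes a dedicated logger named `tinytutor.trace`. The `ListHandler` class and the two sample messages are illustrative, not output from the notebook.

```python
import logging
from typing import List


class ListHandler(logging.Handler):
    """Collect formatted log lines in a list so they can be printed later."""

    def __init__(self) -> None:
        super().__init__(level=logging.DEBUG)
        self.records: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(f"{record.levelname}: {record.getMessage()}")


# Assumed logger name; propagate=False keeps these lines out of the root logger.
trace_logger = logging.getLogger("tinytutor.trace")
trace_logger.setLevel(logging.DEBUG)
trace_logger.propagate = False
trace_handler = ListHandler()
trace_logger.addHandler(trace_handler)

# Sample messages shaped like the ones the mock agents emit.
trace_logger.info("[ScriptwritingAgent] Generating script from: %s",
                  "The mechanism of photosynthesis")
trace_logger.debug("[SafetyCheckerAgent] status=%s", "pass")

print("\nExecution Trace:")
for line in trace_handler.records:
    print(" ", line)
```

The mock agents in this notebook log through the root logger (`logging.info(...)`), so capturing their messages would mean attaching the handler with `logging.getLogger().addHandler(trace_handler)` instead.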
