# Part 7: Evaluating AI Agents
In [Part-6-Observability](https://github.com/asanyaga/ai-agents-tutorial/blob/main/part-6-agent-observability.ipynb) we built logging, tracing, metrics, and token tracking into our code review agent.  
We can now see *what* our agent does. But observability alone does not tell us whether the agent is doing it correctly.

This tutorial addresses the question: *How do we systematically measure whether an agent works?*

We will build very simple evaluations for the code review agent. We build them from scratch and keep them simple, to demonstrate the key concepts rather than provide full implementations.

## From Observability to Evaluation
Observability and evaluation are complementary:
- **Observability** answers: "What happened during this execution?"
- **Evaluation** answers: "Was that execution correct?"

Observability gives you the data. Evaluation gives you the judgement.
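
To make the distinction concrete, here is a minimal sketch. The trace format (a list of dictionaries with `tool` and `ok` keys) and the function `evaluate_trace` are assumptions made for illustration; they are not the format produced by the Part 6 agent. The trace is the observability data, and the function is one possible judgement applied to it.

```python
# Hypothetical trace: what observability might record for one run.
trace = [
    {"step": 1, "tool": "read_file", "ok": True},
    {"step": 2, "tool": "patch_file", "ok": True},
    {"step": 3, "tool": "run_test", "ok": True},
]

def evaluate_trace(trace: list[dict]) -> tuple[bool, str]:
    """Judge a recorded run: no tool call may have failed, and the
    file must be read before it is patched."""
    tools = [event["tool"] for event in trace]
    if any(not event["ok"] for event in trace):
        return False, "At least one tool call failed"
    if "patch_file" not in tools:
        return False, "Agent never patched the file"
    if "read_file" not in tools[: tools.index("patch_file")]:
        return False, "Agent patched the file without reading it first"
    return True, "Read before patch, no failed tool calls"

passed, reason = evaluate_trace(trace)
print(f"passed={passed}, reason={reason}")
```

The same trace could be judged by many different rules; which rules matter depends on what "correct" means for the task.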

Traditional software testing is straightforward: given input X, expect output Y. Most traditional software is deterministic, meaning that the same input under the same conditions always produces the same output.
AI agents rely on large language models, whose outputs are non-deterministic, so the output of an AI agent is non-deterministic too:
- The same query might produce different tool sequences
- Valid outputs can vary in phrasing while meaning the same thing
- Correct output and behaviour often require judgement, not just exact output matching
- The path the agent takes to reach the result matters, not just the final output
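
The points about phrasing and judgement can be illustrated with a small sketch. The three outputs below are made-up examples of how an agent might describe the same fix, and `mentions_zero_handling` is a hypothetical and deliberately crude keyword check standing in for real judgement. The point is only that exact matching rejects valid variants that a criterion-based check accepts.

```python
# Made-up agent summaries that all describe the same fix.
reference = "Added a check for division by zero."
outputs = [
    "Added a check for division by zero.",
    "The function now raises ValueError when b is zero.",
    "I added a guard so dividing by zero is handled.",
]

def exact_match(output: str, reference: str) -> bool:
    """Traditional assertion: the output must equal the reference."""
    return output == reference

def mentions_zero_handling(output: str) -> bool:
    """Criterion-based check: does the summary describe zero handling?"""
    text = output.lower()
    action_words = ("check", "guard", "raise", "handle")
    return "zero" in text and any(word in text for word in action_words)

for output in outputs:
    print(exact_match(output, reference), mentions_zero_handling(output), output)
```

Only the first output survives exact matching, while all three satisfy the criterion-based check.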

## Simple Eval: End to End Task Success
End to end evaluations are simple: run the agent, then check whether the final output meets expectations.

### Setting up the test environment
We will create a simple test case: a Python file with a known bug, and a way to verify the fix.

In [1]:
import os
import tempfile
from dataclasses import dataclass
from typing import Optional

@dataclass
class SimpleTestCase:
    """A single end to end test case"""
    name: str
    description: str
    input_query: str
    file_content: str
    file_name: str
    expected_behaviour: str

def create_test_file(directory: str, filename: str, content: str) -> str:
    """Create a test file and return its path"""
    file_path = os.path.join(directory, filename)
    with open(file_path,"w") as f:
        f.write(content)
    return file_path

# Test Case: division function missing zero check
first_test = SimpleTestCase(
    name="division_zero_check",
    description="Fix missing division by zero check",
    input_query="Review and fix sample.py",
    file_content= """
def divide(a,b):
    return a/b
""",
    file_name="sample.py",
    expected_behaviour="Should add a check for division by zero"
)

print(f"Test Case: {first_test.name}")
print(f"Description: {first_test.description}")
print(f"Input Query: {first_test.input_query}")


Test Case: division_zero_check
Description: Fix missing division by zero check
Input Query: Review and fix sample.py


### A Simple Pass/Fail Checker
Let us write a function that checks whether the agent's fix is correct.

In [None]:
def check_division_fix(filepath: str) -> tuple[bool, str]:
    """ 
    Check if the division function now handles zero division.

    Returns:
        (passed, reason): Whether the test passed and why
    """

    try:
        with open(filepath,"r") as f:
            code = f.read()

            # Check for zero-handling patterns

            zero_checks = [
                "b == 0",
                "b != 0",
                "if not b",
                "if b:",
                "ZeroDivisionError",
                "division by zero"
            ]

            has_zero_check = any(check in code for check in zero_checks)

            if has_zero_check:
                return True, "Code now handles division by zero"
            else:
                return False, f"No zero division handling found. Current code:\n{code}"
    except FileNotFoundError:
        return False,f"File not found {filepath}"
    except Exception as e:
        return False, f"Error checking file {e}"
    
test_code_good = """def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b
"""

test_code_bad = """def divide(a, b):
    return a / b
"""

# Verify checker works
with tempfile.TemporaryDirectory() as tmpdir:
    good_path = create_test_file(tmpdir,"good.py", test_code_good)
    bad_path = create_test_file(tmpdir, "bad.py",test_code_bad)

    good_result = check_division_fix(good_path)
    bad_result = check_division_fix(bad_path)

    print(f"Good code check: passed={good_result[0]}, reason={good_result[1][:50]}...")
    print(f"Bad code check: passed={bad_result[0]}, reason={bad_result[1][:50]}...")
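
The string-based checker is easy to fool: a comment that merely mentions "division by zero" passes without any real fix. A stricter alternative is to load the patched file and call the function. The sketch below is an assumption about how that might look, not part of the original checker. `check_division_behaviour` is a hypothetical name, and it treats any outcome other than a `ZeroDivisionError` from `divide(1, 0)` as a pass, so a fix that deliberately re-raises `ZeroDivisionError` would be rejected. It also executes the file's code, so it should only be run on files in an isolated directory.

```python
import importlib.util
import os
import tempfile

def check_division_behaviour(filepath: str) -> tuple[bool, str]:
    """Import the file and call divide(1, 0) to see how it behaves."""
    try:
        spec = importlib.util.spec_from_file_location("candidate", filepath)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the file's top-level code
    except Exception as e:
        return False, f"Could not load file: {e}"

    divide = getattr(module, "divide", None)
    if divide is None:
        return False, "No divide function found"

    try:
        divide(1, 0)
    except ZeroDivisionError:
        return False, "divide(1, 0) still raises ZeroDivisionError"
    except Exception as e:
        return True, f"divide(1, 0) raised {type(e).__name__}"
    return True, "divide(1, 0) returned without raising"

# A file that would fool a string match: it only mentions the problem.
commented_only = "# TODO: handle division by zero\ndef divide(a, b):\n    return a / b\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "commented.py")
    with open(path, "w") as f:
        f.write(commented_only)
    print(check_division_behaviour(path))
```

Behavioural checks cost more to write and run than string matching, but they test what the code does instead of what it says.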


In [12]:
from code_review_agent_observable import CodeReviewAgentObservable,ToolRegistry,read_file,patch_file,print_review,run_test,write_test

import time

def run_single_evaluation(agent_class, tools_registry, test_case: SimpleTestCase, checker_func) -> dict:
    """Run the agent on one test case in a temporary directory and check the result."""
    result = {
        "test_name": test_case.name,
        "passed": False,
        "reason": "",
        "duration_seconds":0,
        "agent_output": "",
        "error": None
    }

    # Create isolated test environment
    with tempfile.TemporaryDirectory() as test_dir:
        original_cwd = os.getcwd()
        os.chdir(test_dir)

        try:
            agent = agent_class(tools_registry=tools_registry, model="gpt-4.1")

            file_path = create_test_file(test_dir, test_case.file_name, test_case.file_content)

            # Run the agent and time it
            start_time = time.time()
            agent_output = agent.run(test_case.input_query, max_iterations=10)
            result["duration_seconds"] = time.time() - start_time
            result["agent_output"] = agent_output

            # Check the file the agent was asked to fix
            passed, reason = checker_func(file_path)
            result["passed"] = passed
            result["reason"] = reason

        except Exception as e:
            result["error"] = str(e)
            result["reason"] = f"Exception during execution: {e}"

        finally:
            # Always restore the working directory, even on failure
            os.chdir(original_cwd)

    return result
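
Because agent runs are non-deterministic, a single pass or fail is weak evidence. One option is to repeat the same test several times and report a pass rate. The helper below is a hypothetical sketch, not part of the agent code: `run_once` stands for any zero-argument callable that returns a result dictionary with `passed` and `reason` keys, for example a `lambda` wrapping `run_single_evaluation`. A stub replaces the real agent here so the example runs without an API call.

```python
from typing import Callable

def run_repeated(run_once: Callable[[], dict], trials: int = 5) -> dict:
    """Run the same evaluation several times and summarise the outcomes."""
    results = [run_once() for _ in range(trials)]
    passes = sum(1 for r in results if r["passed"])
    return {
        "trials": trials,
        "passes": passes,
        "pass_rate": passes / trials if trials else 0.0,
        "failure_reasons": [r["reason"] for r in results if not r["passed"]],
    }

# Stub standing in for a real agent run: every third call fails.
calls = {"n": 0}

def fake_run() -> dict:
    calls["n"] += 1
    ok = calls["n"] % 3 != 0
    return {"passed": ok, "reason": "ok" if ok else "No zero division handling found"}

summary = run_repeated(fake_run, trials=6)
print(f"{summary['passes']}/{summary['trials']} passed "
      f"(pass rate {summary['pass_rate']:.2f})")
```

Collecting the failure reasons alongside the rate makes it easier to see whether failures share a cause.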


In [13]:
registry = ToolRegistry()
registry.register("read_file",read_file)
registry.register("patch_file",patch_file)
registry.register("print_review",print_review)
registry.register("write_test",write_test)
registry.register("run_test",run_test)

result = run_single_evaluation(CodeReviewAgentObservable,registry,first_test,check_division_fix)

# Single quotes inside the f-strings keep this valid on Python versions before 3.12
print(f"Test: {result['test_name']}")
print(f"Passed: {result['passed']}")
print(f"Duration: {result['duration_seconds']:.2f}s")
print(f"Reason: {result['reason']}")


[INFO] AGENT_INIT: Agent initialized

------------------------------------------------------------
Step 1 of 10
------------------------------------------------------------
[INFO] THINK_START: Starting Reasoning

Agent's LLM Response:
{"thought": "I've analyzed the task and created an execution plan.", "plan_created": true, "plan": "Step 1: Read the contents of sample.py to review its current implementation and identify issues.\nStep 2: Patch sample.py to fix any identified errors or improve the code as necessary.\nStep 3: Write or update a test file named test_sample.py to cover the functionality of sample.py.\nStep 4: Run tests in test_sample.py to verify that sample.py is working correctly after modifications."}

Plan Created:
Step 1: Read the contents of sample.py to review its current implementation and identify issues.
Step 2: Patch sample.py to fix any identified errors or improve the code as necessary.
Step 3: Write or update a test file named test_sample.py to cover the functi