## Why evaluation?

Previously, in the AI HERO DATA PREP LONG notebook, we learned about function calling and created our first agent using Pydantic AI.

But is this agent actually good? Today we will see how to answer this question.

In particular, we will:

-- Build a logging system to track agent interactions

-- Create automated evaluation using AI as a judge

-- Generate test data automatically

-- Measure agent performance with metrics

At the end we will have a thoroughly tested agent with performance metrics.

In this lesson, we'll use the FAQ database with text search, but the same approach applies to any other use case.

This is going to be a long lesson, but an important one. Evaluation is critical for building reliable AI systems. Without proper evaluation, you can't tell if your changes improve or hurt performance. You can't compare different approaches. And you can't build confidence before deploying to users.


## Logging

The easiest thing we can do to evaluate an agent is interact with it. We ask something and look at the response. Does it make sense? For most cases, it should.

This approach is called a "vibe check": we interact with the agent, and if we like the results, we go ahead and deploy it.

If we don't like something, we go back and change things:

-- Maybe our chunking method is not suitable? Maybe we need to have a bigger window size?

-- Is our system prompt good? Maybe we need more precise instructions?

Or maybe we want to change something else.

And we iterate.

It might be okay for the first MVP, but how can we make sure the result at the end is actually good?

We need systematic evaluation. Manual testing doesn't scale - you can't manually test every possible input and scenario. With systematic evaluation, we can test hundreds or thousands of cases automatically.

We also need to base our decisions on data. Data helps us:

-- Compare different approaches

-- Track improvements

-- Identify edge cases

We can start collecting this data ourselves: start with vibe checking, but be smart about it. We don't just test it, but also record the results.


In [1]:
# full data scrape function

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data


In [2]:
# download data as we did earlier:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")


FAQ documents: 1219
Evidently documents: 95


In [3]:
# build a simple text search index for FAQ data:

from minsearch import Index

# dtc_faq = read_repo_data('DataTalksClub', 'faq')
de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']] # we extract only DE = data engineering FAQ

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)



<minsearch.minsearch.Index at 0x1a532390e00>

In [4]:
# Enter your OpenAI API key here; after that, all cells can run without further prompts.
import os
from getpass import getpass
from openai import OpenAI

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key:  ········


In [5]:
# Here's the agent we created yesterday:

from typing import List, Any
from pydantic_ai import Agent


def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)


system_prompt = """
You are a helpful assistant for a course.

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""


agent = Agent(
    name="faq_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)


In [6]:
# Let's interact with the agent:
question = "how do I install Kafka in Python?"
result = await agent.run(user_prompt=question)

In [7]:
result

AgentRunResult(output='To install Kafka in Python, you can use the following commands to set up the necessary dependencies:\n\n1. **Install `confluent-kafka`:**\n   - Using pip:\n     ```bash\n     pip install confluent-kafka\n     ```\n   - Using conda:\n     ```bash\n     conda install conda-forge::python-confluent-kafka\n     ```\n\n2. **Install `fastavro`:**\n   ```bash\n   pip install fastavro\n   ```\n\nThese commands will help you set up the essential packages for working with Kafka in Python. If you encounter any issues, feel free to ask for more specific guidance!')

Here's what we want to record:

-- The system prompt that we used

-- The model

-- The user query

-- The tools we use

-- The responses and the back-and-forth interactions between the LLM and our tools

-- The final response

To keep things simple, we'll implement a basic logging system ourselves: we will just write logs to JSON files.

You shouldn't use it in production. In practice, you will want to send these logs to some log collection system, or use specialized LLM evaluation tools like Evidently, LangWatch or Arize Phoenix.


In [8]:
# Let's extract all this information from the agent and from the run results:
from pydantic_ai.messages import ModelMessagesTypeAdapter


def log_entry(agent, messages, source="user"):
    tools = []

    for ts in agent.toolsets:
        tools.extend(ts.tools.keys())

    dict_messages = ModelMessagesTypeAdapter.dump_python(messages)

    return {
        "agent_name": agent.name,
        "system_prompt": agent._instructions,
        "provider": agent.model.system,
        "model": agent.model.model_name,
        "tools": tools,
        "messages": dict_messages,
        "source": source
    }

# This code extracts the key information from our agent:
# -- the configuration (name, prompt, model)
# -- available tools
# -- complete message history (user input, tool calls, responses)
# We also use ModelMessagesTypeAdapter.dump_python(messages) to convert internal message format into regular Python dictionaries. 
# This makes it easier to save it to JSON and process later.
# We also add the source parameter. It tracks where the question came from. We start with "user" but later we'll use AI-generated queries. 
# Sometimes it may be important to tell them apart for analysis.
# This code is generic so it will work with any Pydantic AI agent. If you use a different library, you'll need to adjust this code.



In [9]:
# Let's write these logs to a folder:

import json
import secrets
from pathlib import Path
from datetime import datetime


LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)


def serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    ts = entry['messages'][-1]['timestamp']
    ts_str = ts.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath

# The code above:
# -- Creates a logs directory (if it doesn't exist yet)
# -- Generates unique filenames from a timestamp and a random hex suffix
# -- Saves complete interaction logs as JSON files
# -- Handles datetime serialization (using the serializer function)
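
Once a few interactions are logged, it can be handy to read them back. The sketch below is one possible way to do that, assuming the JSON layout written by `log_interaction_to_file`. The helper name `load_logs` and its filter parameters are hypothetical, and the demo writes two toy records with made-up values to a temporary folder instead of touching the real `logs` directory.

```python
import json
import tempfile
from pathlib import Path


def load_logs(log_dir, agent_name=None, source=None):
    """Read every *.json log in log_dir, optionally filtering by agent name or source."""
    records = []
    for path in sorted(Path(log_dir).glob("*.json")):
        with path.open("r", encoding="utf-8") as f_in:
            record = json.load(f_in)
        if agent_name is not None and record.get("agent_name") != agent_name:
            continue
        if source is not None and record.get("source") != source:
            continue
        record["log_file"] = str(path)  # remember where the record came from
        records.append(record)
    return records


# Demo with toy records (made-up values, same top-level keys as log_entry)
with tempfile.TemporaryDirectory() as tmp:
    toy = [
        {"agent_name": "faq_agent", "source": "user", "messages": []},
        {"agent_name": "faq_agent_v2", "source": "ai-generated", "messages": []},
    ]
    for i, entry in enumerate(toy):
        (Path(tmp) / f"toy_{i}.json").write_text(json.dumps(entry), encoding="utf-8")

    all_logs = load_logs(tmp)
    v2_logs = load_logs(tmp, agent_name="faq_agent_v2")
    user_logs = load_logs(tmp, source="user")

print(len(all_logs), len(v2_logs), len(user_logs))
```

Filtering by `agent_name` makes it possible to compare two versions of the agent, and filtering by `source` keeps real user questions apart from generated ones.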



In [10]:

# Now we can interact with it and do some vibe checking:
    
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

# This is a simple interactive step (run the cell again for each new question):
# -- User enters a question
# -- Agent processes it and responds
# -- Complete interaction is logged to a file

# Try these questions:
# how do I use docker on windows?
# can I join late and get a certificate?
# what do I need to do for the certificate?


 how do I use docker on Windows?


To use Docker on Windows, follow these guidelines based on your version:

### For Windows 10 Pro / 11 Pro Users:
1. **Enable Hyper-V**: Docker can utilize Hyper-V as a backend. You can enable Hyper-V by following a tutorial such as [this one](https://www.c-sharpcorner.com/article/install-and-configured-docker-desktop-in-windows-10/).
   
2. **Download Docker Desktop**: Install Docker Desktop for Windows from [Docker's official site](https://docs.docker.com/desktop/install/windows-install/).

### For Windows 10 Home / 11 Home Users:
1. **Use WSL2**: The Home edition does not support Hyper-V. Instead, Docker uses WSL2 (Windows Subsystem for Linux). You can install WSL2 by following instructions from [this guide](https://pureinfotech.com/install-wsl-windows-11/).

2. **Update WSL2**: If you encounter the error "WslRegisterDistribution failed with error: 0x800701bc", make sure to update the WSL2 Linux Kernel by following guidelines provided at this [GitHub issue](https://github.com/microso

WindowsPath('logs/faq_agent_20251015_161727_414680.json')

## Adding References

In [11]:
# When interacting with the agent, I noticed one thing: it doesn't include references to the original documents.
# Let's fix it by adjusting the prompt:

system_prompt = """
You are a helpful assistant for a course.  

Use the search tool to find relevant information from the course materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"
correct link looks like this - without << /faq-main >> part :
<< https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/general/014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md >> 
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.  
""".strip()

# Create another version of agent, let's call it faq_agent_v2
agent = Agent(
    name="faq_agent_v2",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)

# This is the output I now get for the question "can I join late and get a certificate?":

# Yes, you can join the course late and still be eligible for a certificate, as long as you complete the required peer-reviewed 
# capstone projects on time. You do not need to complete the homework assignments if you join late, which allows for flexibility in participation.
# However, please note that certificates are only awarded to those who finish the course with a “live” cohort; 
# they are not available for those who choose the self-paced mode. 
# This is because peer-reviewing capstone projects is a requirement that can only be done while the course is active......


In [12]:
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

# NB: the links produced by the original prompt were broken. The model prepended the GitHub URL
# to the filename instead of replacing the "faq-main" prefix.
# Wrong link:
# https://github.com/DataTalksClub/faq/blob/main/faq-main/_questions/data-engineering-zoomcamp/general/014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md
# Correct link (with the "/faq-main" part removed):
# https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/general/014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md
# The prompt above has been adjusted, and the links are now generated correctly.


 how do I use docker on linux?


To use Docker on Linux, you may need to follow these general steps:

1. **Installation**: Depending on your Linux distribution, you can install Docker using different methods. For example, on Ubuntu, you can use the snap command:

   ```bash
   sudo snap install docker
   ```

   This command installs Docker via snap packages, which might be available on certain versions of Ubuntu. [Learn more here](https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/module-1/036_1b727dde32_docker-docker-not-installable-on-ubuntu.md).

2. **Extra Hosts Configuration**: If you're using Docker Compose and face issues resolving `host.docker.internal`, you might need to configure it in your `docker-compose.yml` file like so:

   ```yaml
   kestra:
     image: kestra/kestra:latest
     extra_hosts:
       - "host.docker.internal:host-gateway"
   ```

   This setup allows your container to access host services correctly. You can check more details on this configuration. [Read 

WindowsPath('logs/faq_agent_v2_20251015_161845_62dcb0.json')

## LLM as a Judge

In [15]:
# You can ask your colleagues to also do a "vibe check", but make sure you record the data. 
# Often collecting 10-20 examples and manually inspecting them is enough to understand how your model is doing.
# Don't be afraid of putting manual work into evaluation. Manual evaluation will help you understand edge cases, 
# learn what good responses look like and think of evaluation criteria for automated checks later.
# For example, I manually inspected the output and noticed that references are missing. So we will later add it as one of the checks.
# So, in our case, we can have the following checks:
# Does the agent follow the instructions?
# Given the question, does the answer make sense?
# Does it include references?
# Did the agent use the available tools?
# We don't have to evaluate this manually. Instead, we can delegate this to AI. This technique is called "LLM as a Judge".
# The idea is simple: we use one LLM to evaluate the outputs of another LLM. This works because LLMs are good at following detailed evaluation criteria.
    
# Our system prompt for the judge (we'll call it "evaluation agent" because it sounds cooler) can look like this:

evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent's answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met. 

Checklist:

- instructions_follow: The agent followed the user's instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do  
- answer_relevant: The response directly addresses the user's question  
- answer_clear: The answer is clear and correct  
- answer_citations: The response includes proper citations or sources when required  
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: Is the search tool invoked? 

Output true/false for each check and provide a short explanation for your judgment.
""".strip()

In [16]:
# Since we expect a very well defined structure of the response, we can use structured output.
# We can define a Pydantic class with the expected response structure, and the LLM will produce output that matches this schema exactly.
    
# This is how we do it:

from pydantic import BaseModel

class EvaluationCheck(BaseModel):
    check_name: str
    justification: str
    check_pass: bool

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str

# This code defines the structure we expect from our evaluation:
# Each check has a name, justification, and pass/fail result
# The overall evaluation includes a list of checks and a summary
# Note that justification comes before check_pass. This makes 
# the LLM reason about the answer before giving the final judgment, 
# which typically leads to better evaluation quality.



In [17]:
# To make the output follow the specified class in Pydantic AI, we use the output_type parameter:
    
eval_agent = Agent(
    name='eval_agent',
    model='gpt-5-nano',
    instructions=evaluation_prompt,
    output_type=EvaluationChecklist
)


In [18]:
# Usually it's a good idea to evaluate the results of one model (in our case, "gpt-4o-mini") 
# with another model (e.g. "gpt-5-nano"). A different model can catch mistakes, reduce self-bias, 
# and give a second opinion. This makes evaluations more reliable.
# We have the instructions, and we have the agent. In order to run the agent, it needs input. 
    
# We'll start with a template:

user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

# We use XML markup because it makes the input clearer for LLMs.
# XML tags help the model see the structure and boundaries of different sections in the prompt.
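
To see what the judge actually receives, here is a minimal sketch that fills the template with toy values. The template is the one defined above; the instruction, question, answer and log strings are invented for illustration only.

```python
user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

# Toy values, invented for illustration only
prompt = user_prompt_format.format(
    instructions="Use the search tool before answering.",
    question="can I join late?",
    answer="Yes, you can still join.",
    log='[{"kind": "request"}]',
)

print(prompt)
```

Curly braces inside the substituted values (such as the JSON log) are safe here: `str.format` only interprets placeholders in the template itself, not in the values it inserts.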


In [19]:
# Let's fill it in. First, define a helper function for loading JSON log files:

def load_log_file(log_file):
    # Logs are written as UTF-8, so read them back with the same encoding
    with open(log_file, 'r', encoding='utf-8') as f_in:
        log_data = json.load(f_in)
        log_data['log_file'] = log_file
        return log_data

# We also add the filename in the result - it'll help us with tracking later.


In [33]:
# Now let's use it:
# log_record = load_log_file('./logs/faq_agent_v2_20250926_072928_467470.json')
# here I copied full path and replaced Windows backslashes \ to forward slashes /
log_record = load_log_file('C:/tmp/aihero/course/logs/faq_agent_20251015_155744_df13b1.json')
instructions = log_record['system_prompt']
question = log_record['messages'][0]['parts'][0]['content']
answer = log_record['messages'][-1]['parts'][0]['content']
log = json.dumps(log_record['messages'])

user_prompt = user_prompt_format.format(
    instructions=instructions,
    question=question,
    answer=answer,
    log=log
)


In [34]:
# The user input is ready and we can test it!
result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)

checklist = result.output
print(checklist.summary)

for check in checklist.checklist:
    print(check)

# This code:
# Loads a saved interaction log
# Extracts the key components (instructions, question, answer, full log)
# Formats them into the evaluation prompt
# Runs the evaluation agent
# Prints the results


Initiated a search tool call to fetch course-material relevant Docker-on-Linux guidance; awaiting results to craft a complete, sourced answer.
check_name='instructions_follow' justification="The agent will follow the user's instruction to search course materials before answering (per <INSTRUCTIONS>). The agent hasn't yet, but will trigger a search tool in this response as required." check_pass=True
check_name='instructions_avoid' justification='No disallowed behavior observed. The response will rely on course materials and common Docker usage; nothing problematic. ' check_pass=True
check_name='answer_relevant' justification='The answer will address Docker use on Linux. ' check_pass=True
check_name='answer_clear' justification='The final answer will present steps clearly. ' check_pass=True
check_name='answer_citations' justification='The answer will cite course material results from search; or at least reference them. ' check_pass=True
check_name='completeness' justification='The answer
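
The positional indexing used to pull out the question and the answer (`messages[0]['parts'][0]` and `messages[-1]['parts'][0]`) works only when the first part of the first message is the user prompt and the first part of the last message is the final text. A slightly more defensive variant, sketched here under the assumption that parts carry a `part_kind` field as in this lesson's logs, searches by part kind instead. The helper name `extract_question_answer` and the toy messages are hypothetical.

```python
def extract_question_answer(messages):
    """Return (question, answer) from a dumped message list.

    question: content of the first part whose part_kind is 'user-prompt'
    answer:   content of the last part whose part_kind is 'text'
    Either value is None if no such part exists.
    """
    question = None
    answer = None
    for message in messages:
        for part in message.get("parts", []):
            kind = part.get("part_kind")
            if kind == "user-prompt" and question is None:
                question = part.get("content")
            elif kind == "text":
                answer = part.get("content")  # keep overwriting: the last one wins
    return question, answer


# Toy log with made-up content
toy_messages = [
    {"kind": "request", "parts": [{"part_kind": "user-prompt", "content": "how do I use docker?"}]},
    {"kind": "response", "parts": [{"part_kind": "tool-call", "tool_name": "text_search"}]},
    {"kind": "request", "parts": [{"part_kind": "tool-return", "content": "..."}]},
    {"kind": "response", "parts": [{"part_kind": "text", "content": "Install Docker Desktop."}]},
]

question, answer = extract_question_answer(toy_messages)
print(question)
print(answer)
```

If a log contains no final text part (for example, a run that stopped after a tool call), `answer` comes back as `None`, which is easier to detect than a `KeyError` or a wrong value.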

In [35]:
# Note that we're putting the entire conversation log into the prompt, which is not really necessary. 
# We can reduce it to make it less verbose.
# For example, like this:

def simplify_log_messages(messages):
    log_simplified = []

    for m in messages:
        parts = []
    
        for original_part in m['parts']:
            part = original_part.copy()
            kind = part['part_kind']
    
            # pop(..., None) instead of del: some keys may be absent
            # depending on the library version or the part type
            if kind == 'user-prompt':
                part.pop('timestamp', None)
            if kind == 'tool-call':
                part.pop('tool_call_id', None)
            if kind == 'tool-return':
                part.pop('tool_call_id', None)
                part.pop('metadata', None)
                part.pop('timestamp', None)
                # Replace actual search results with placeholder to save tokens
                part['content'] = 'RETURN_RESULTS_REDACTED'
            if kind == 'text':
                part.pop('id', None)
    
            parts.append(part)
    
        message = {
            'kind': m['kind'],
            'parts': parts
        }
    
        log_simplified.append(message)
    return log_simplified

# We make it simpler:
# remove timestamps and IDs that aren't needed for evaluation
# replace actual search results with a placeholder
# keep only the essential structure
# This is helpful because it reduces the number of tokens we send to the evaluation model, 
# which lowers the costs and speeds up evaluation.
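
To get a feel for how much the simplification saves, the sketch below compares the serialized size of one tool-return part before and after trimming. It is a compact, table-driven variant of the same idea rather than the function above: `DROP_KEYS` and `simplify_part` are hypothetical names, the payload is made up, and character count is only a rough proxy for tokens.

```python
import json

# Keys to drop per part kind, mirroring simplify_log_messages
DROP_KEYS = {
    "user-prompt": ["timestamp"],
    "tool-call": ["tool_call_id"],
    "tool-return": ["tool_call_id", "metadata", "timestamp"],
    "text": ["id"],
}


def simplify_part(part):
    """Return a copy of one message part without the keys listed in DROP_KEYS."""
    slim = dict(part)  # shallow copy so the original part is not modified
    for key in DROP_KEYS.get(slim.get("part_kind"), []):
        slim.pop(key, None)
    if slim.get("part_kind") == "tool-return":
        # Search results are the bulk of the log, so replace them with a placeholder
        slim["content"] = "RETURN_RESULTS_REDACTED"
    return slim


# Toy tool-return part with a long made-up payload
part = {
    "part_kind": "tool-return",
    "tool_call_id": "call_123",
    "metadata": None,
    "timestamp": "2025-10-15T16:17:27",
    "content": [{"question": "q", "content": "x" * 2000}] * 5,
}

before = len(json.dumps(part))
after = len(json.dumps(simplify_part(part)))
print(before, after)
```

Most of the reduction comes from redacting the search results, so this trade-off only makes sense when the judge does not need to verify the answer against the retrieved documents.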


In [36]:
# Let's put everything together:

async def evaluate_log_record(eval_agent, log_record):
    messages = log_record['messages']

    instructions = log_record['system_prompt']
    question = messages[0]['parts'][0]['content']
    answer = messages[-1]['parts'][0]['content']

    log_simplified = simplify_log_messages(messages)
    log = json.dumps(log_simplified)

    user_prompt = user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )

    result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)
    return result.output 


# log_record = load_log_file('./logs/faq_agent_v2_20250926_072928_467470.json')
# here I copied full path and replaced Windows backslashes \ to forward slashes /
log_record = load_log_file('C:/tmp/aihero/course/logs/faq_agent_20251015_155744_df13b1.json')
eval1 = await evaluate_log_record(eval_agent, log_record)

# We know how to log our data and how to run evals on our logs.
# Great. But how do we get more data to get a better understanding of the performance of our model?


In [37]:
eval1

EvaluationChecklist(checklist=[EvaluationCheck(check_name='instructions_follow', justification='The user instruction required using a search tool to fetch course materials before answering; a text_search tool call is present in the log, indicating this was done.', check_pass=True), EvaluationCheck(check_name='instructions_avoid', justification='No disallowed actions or content were involved; the answer stayed within typical Docker usage guidance.', check_pass=True), EvaluationCheck(check_name='answer_relevant', justification="The answer directly explains how to install, start, configure, and use Docker on Linux, addressing the user's question.", check_pass=True), EvaluationCheck(check_name='answer_clear', justification='The steps are organized and include concrete commands, making the guidance easy to follow.', check_pass=True), EvaluationCheck(check_name='answer_citations', justification='No citations were included, but citations are not strictly required for this general setup guide.
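
A single checklist is hard to act on. Once several logs have been evaluated, the results can be rolled up into a pass rate per check. The sketch below shows one way to do this with plain dictionaries that have the same `check_name` and `check_pass` fields as `EvaluationCheck`. The function name `pass_rates` and the toy results are hypothetical.

```python
from collections import defaultdict


def pass_rates(evaluations):
    """Compute the share of passed checks per check_name.

    `evaluations` is a list of checklists, each a list of dicts with
    'check_name' and 'check_pass' keys (the shape of EvaluationCheck).
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for checklist in evaluations:
        for check in checklist:
            total[check["check_name"]] += 1
            passed[check["check_name"]] += int(check["check_pass"])
    return {name: passed[name] / total[name] for name in total}


# Toy results for two evaluated logs (invented values)
toy_evaluations = [
    [
        {"check_name": "answer_relevant", "check_pass": True},
        {"check_name": "answer_citations", "check_pass": False},
    ],
    [
        {"check_name": "answer_relevant", "check_pass": True},
        {"check_name": "answer_citations", "check_pass": True},
    ],
]

rates = pass_rates(toy_evaluations)
print(rates)
```

With the Pydantic objects from this lesson, `[c.model_dump() for c in eval1.checklist]` would produce one such list of dictionaries.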

## Data Generation

In [None]:
# We can ask AI to help. What if we used it for generating more questions? Let's do that.
# We can sample some records from our database. Then for each record, ask an LLM to generate a question based on the record. 
# We use this question as input to our agent and log the answers.
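
The sampling step described above can be sketched as follows. This is only an illustration under stated assumptions: the prompt wording, the names `question_prompt_format` and `build_generation_prompts`, and the toy records are all hypothetical, and the actual LLM call is left out.

```python
import json
import random

# Hypothetical prompt for the question generator
question_prompt_format = """
Based on the FAQ record below, write one question a student might ask
that this record answers. Return only the question.

<RECORD>{record}</RECORD>
""".strip()


def build_generation_prompts(records, n, seed=0):
    """Sample up to n records and build one question-generation prompt per record."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(records, k=min(n, len(records)))
    return [question_prompt_format.format(record=json.dumps(r)) for r in sample]


# Toy records standing in for de_dtc_faq (invented content)
toy_records = [
    {"question": f"toy question {i}", "content": f"toy answer {i}"}
    for i in range(10)
]

prompts = build_generation_prompts(toy_records, n=3)
print(len(prompts))
print(prompts[0])
```

Each prompt would then be sent to an LLM, and the generated question passed to the agent and logged with a `source` value other than "user" so it can be told apart from real queries.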
