## Why evaluation?

Previously, in the AI HERO DATA PREP LONG notebook, we learned about function calling and created our first agent using Pydantic AI.

But is this agent actually good? Today we will see how to answer this question.

In particular, we will:

-- Build a logging system to track agent interactions

-- Create automated evaluation using AI as a judge

-- Generate test data automatically

-- Measure agent performance with metrics

By the end, we will have a thoroughly tested agent with performance metrics.

In this lesson, we'll use the FAQ database with text search, but the same approach applies to any other use case.

This is going to be a long lesson, but an important one. Evaluation is critical for building reliable AI systems. Without proper evaluation, you can't tell if your changes improve or hurt performance. You can't compare different approaches. And you can't build confidence before deploying to users.


## Logging

The easiest thing we can do to evaluate an agent is interact with it. We ask something and look at the response. Does it make sense? In most cases, it should.

This approach is called a "vibe check": we interact with the agent, and if we like the results, we go ahead and deploy it.

If we don't like something, we go back and change things:

-- Maybe our chunking method is not suitable? Maybe we need to have a bigger window size?

-- Is our system prompt good? Maybe we need more precise instructions?

Or maybe we want to change something else.

And we iterate.

It might be okay for the first MVP, but how can we make sure the result at the end is actually good?

We need systematic evaluation. Manual testing doesn't scale - you can't manually test every possible input and scenario. With systematic evaluation, we can test hundreds or thousands of cases automatically.

We also need to base our decisions on data. It will help us:

-- Compare different approaches

-- Track improvements

-- Identify edge cases

We can start collecting this data ourselves: start with vibe checking, but be smart about it. We don't just test it, but also record the results.


In [1]:
# full data scrape function

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data


In [2]:
# download data as we did earlier:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")


FAQ documents: 1219
Evidently documents: 95


In [3]:
# build a simple text search index for FAQ data:

from minsearch import Index

# dtc_faq = read_repo_data('DataTalksClub', 'faq')
de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']] # we extract only DE = data engineering FAQ

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)



<minsearch.minsearch.Index at 0x1a532390e00>

In [4]:
# Enter the OpenAI API key here; after that, all cells can run without further prompts.
import os
from getpass import getpass
from openai import OpenAI

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key:  ········


In [5]:
# Here's the agent we created yesterday:

from typing import List, Any
from pydantic_ai import Agent


def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)


system_prompt = """
You are a helpful assistant for a course.

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""


agent = Agent(
    name="faq_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)


In [6]:
# Let's interact with the agent:
question = "how do I install Kafka in Python?"
result = await agent.run(user_prompt=question)

In [7]:
result

AgentRunResult(output='To install Kafka in Python, you can use the following commands to set up the necessary dependencies:\n\n1. **Install `confluent-kafka`:**\n   - Using pip:\n     ```bash\n     pip install confluent-kafka\n     ```\n   - Using conda:\n     ```bash\n     conda install conda-forge::python-confluent-kafka\n     ```\n\n2. **Install `fastavro`:**\n   ```bash\n   pip install fastavro\n   ```\n\nThese commands will help you set up the essential packages for working with Kafka in Python. If you encounter any issues, feel free to ask for more specific guidance!')

Here's what we want to record:

-- The system prompt that we used

-- The model

-- The user query

-- The tools we use

-- The responses and the back-and-forth interactions between the LLM and our tools

-- The final response

To keep things simple, we'll implement a basic logging system ourselves: we will just write logs to JSON files.

Don't use this approach in production. In practice, you will want to send these logs to some log collection system, or use specialized LLM evaluation tools like Evidently, LangWatch or Arize Phoenix.


In [8]:
# Let's extract all this information from the agent and from the run results:
from pydantic_ai.messages import ModelMessagesTypeAdapter


def log_entry(agent, messages, source="user"):
    tools = []

    for ts in agent.toolsets:
        tools.extend(ts.tools.keys())

    dict_messages = ModelMessagesTypeAdapter.dump_python(messages)

    return {
        "agent_name": agent.name,
        "system_prompt": agent._instructions,
        "provider": agent.model.system,
        "model": agent.model.model_name,
        "tools": tools,
        "messages": dict_messages,
        "source": source
    }

# This code extracts the key information from our agent:
# -- the configuration (name, prompt, model)
# -- available tools
# -- complete message history (user input, tool calls, responses)
# We also use ModelMessagesTypeAdapter.dump_python(messages) to convert internal message format into regular Python dictionaries. 
# This makes it easier to save it to JSON and process later.
# We also add the source parameter. It tracks where the question came from. We start with "user" but later we'll use AI-generated queries. 
# Sometimes it may be important to tell them apart for analysis.
# This code is generic so it will work with any Pydantic AI agent. If you use a different library, you'll need to adjust this code.



In [9]:
# Let's write these logs to a folder:

import json
import secrets
from pathlib import Path
from datetime import datetime


LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)


def serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    ts = entry['messages'][-1]['timestamp']
    ts_str = ts.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath

# This code above:
# Creates a logs directory (if not created previously)
# Generates unique filenames with timestamp and random hex
# Saves complete interaction logs as JSON files
# Handles datetime serialization (using the serializer function)
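
To see what the `default=serializer` hook does in isolation, here is a minimal sketch. The `demo_entry` dictionary is a made-up stand-in for a real log entry; only the `serializer` function is taken from the cell above.

```python
import json
from datetime import datetime


def serializer(obj):
    # json calls this hook only for objects it cannot encode by itself
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


# Hypothetical, tiny stand-in for a log entry
demo_entry = {"source": "user", "timestamp": datetime(2025, 10, 15, 16, 17, 27)}

encoded = json.dumps(demo_entry, default=serializer)
print(encoded)

# Any other unsupported type (a set, for example) still fails loudly
try:
    json.dumps({"tools": {"text_search"}}, default=serializer)
    unsupported_failed = False
except TypeError:
    unsupported_failed = True
```

Raising `TypeError` for everything else is deliberate: it is better for logging to fail visibly than to silently write incomplete records.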



In [10]:

# Now we can interact with it and do some vibe checking:
    
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

# This creates a simple interactive step (re-run the cell for each new question) where:
# -- User enters a question
# -- Agent processes it and responds
# -- Complete interaction is logged to a file

# Try these questions:
# how do I use docker on windows?
# can I join late and get a certificate?
# what do I need to do for the certificate?


 how do I use docker on Windows?


To use Docker on Windows, follow these guidelines based on your version:

### For Windows 10 Pro / 11 Pro Users:
1. **Enable Hyper-V**: Docker can utilize Hyper-V as a backend. You can enable Hyper-V by following a tutorial such as [this one](https://www.c-sharpcorner.com/article/install-and-configured-docker-desktop-in-windows-10/).
   
2. **Download Docker Desktop**: Install Docker Desktop for Windows from [Docker's official site](https://docs.docker.com/desktop/install/windows-install/).

### For Windows 10 Home / 11 Home Users:
1. **Use WSL2**: The Home edition does not support Hyper-V. Instead, Docker uses WSL2 (Windows Subsystem for Linux). You can install WSL2 by following instructions from [this guide](https://pureinfotech.com/install-wsl-windows-11/).

2. **Update WSL2**: If you encounter the error "WslRegisterDistribution failed with error: 0x800701bc", make sure to update the WSL2 Linux Kernel by following guidelines provided at this [GitHub issue](https://github.com/microso

WindowsPath('logs/faq_agent_20251015_161727_414680.json')

## Adding References

In [11]:
# When interacting with the agent, I noticed one thing: it doesn't include the reference to the original documents.
# Let's fix it by adjusting the prompt:

system_prompt = """
You are a helpful assistant for a course.  

Use the search tool to find relevant information from the course materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"
correct link looks like this - without << /faq-main >> part :
<< https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/general/014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md >> 
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.  
""".strip()

# Create another version of agent, let's call it faq_agent_v2
agent = Agent(
    name="faq_agent_v2",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)

# This is the output I now get for the question "can I join late and get a certificate?":

# Yes, you can join the course late and still be eligible for a certificate, as long as you complete the required peer-reviewed 
# capstone projects on time. You do not need to complete the homework assignments if you join late, which allows for flexibility in participation.
# However, please note that certificates are only awarded to those who finish the course with a “live” cohort; 
# they are not available for those who choose the self-paced mode. 
# This is because peer-reviewing capstone projects is a requirement that can only be done while the course is active......


In [12]:
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

# Note: the links were broken with the original prompt. I checked the generated links, and the model
# simply concatenated the two parts of the address instead of replacing "faq-main".
# Wrong link:
# https://github.com/DataTalksClub/faq/blob/main/faq-main/_questions/data-engineering-zoomcamp/general/014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md
# Correct link (with the /faq-main part removed):
# https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/general/014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md
# After I adjusted the prompt above, the links are generated correctly.


 how do I use docker on linux?


To use Docker on Linux, you may need to follow these general steps:

1. **Installation**: Depending on your Linux distribution, you can install Docker using different methods. For example, on Ubuntu, you can use the snap command:

   ```bash
   sudo snap install docker
   ```

   This command installs Docker via snap packages, which might be available on certain versions of Ubuntu. [Learn more here](https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/module-1/036_1b727dde32_docker-docker-not-installable-on-ubuntu.md).

2. **Extra Hosts Configuration**: If you're using Docker Compose and face issues resolving `host.docker.internal`, you might need to configure it in your `docker-compose.yml` file like so:

   ```yaml
   kestra:
     image: kestra/kestra:latest
     extra_hosts:
       - "host.docker.internal:host-gateway"
   ```

   This setup allows your container to access host services correctly. You can check more details on this configuration. [Read 

WindowsPath('logs/faq_agent_v2_20251015_161845_62dcb0.json')
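
Instead of relying on the model to rewrite the path, the link could also be built in code. The sketch below is one possible way to do it, not what this notebook does. It assumes that every `filename` starts with the zip root folder (`faq-main/`), as the broken link above suggests. The names `REPO_BLOB_URL` and `to_github_url` are made up for this illustration.

```python
# Assumption: filenames look like "faq-main/_questions/...", where the first
# path component is the root folder of the downloaded zip archive.
REPO_BLOB_URL = "https://github.com/DataTalksClub/faq/blob/main/"


def to_github_url(filename: str) -> str:
    """Turn a filename from the zip archive into a GitHub link."""
    root, sep, relative_path = filename.partition("/")
    if not sep:
        # No folder prefix at all: use the filename as-is
        relative_path = filename
    return REPO_BLOB_URL + relative_path


demo_filename = (
    "faq-main/_questions/data-engineering-zoomcamp/general/"
    "014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md"
)
demo_url = to_github_url(demo_filename)
print(demo_url)
```

One option would be to add the resulting URL to each search result inside `text_search`, so the agent only has to copy the link. String manipulation done in code is deterministic, while the same task delegated to the prompt can fail in the way described above.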

## LLM as a Judge

In [15]:
# You can ask your colleagues to also do a "vibe check", but make sure you record the data. 
# Often collecting 10-20 examples and manually inspecting them is enough to understand how your model is doing.
# Don't be afraid of putting manual work into evaluation. Manual evaluation will help you understand edge cases, 
# learn what good responses look like and think of evaluation criteria for automated checks later.
# For example, I manually inspected the output and noticed that references were missing, so we will later add this as one of the checks.
# In our case, we can have the following checks:
# -- Does the agent follow the instructions?
# -- Given the question, does the answer make sense?
# -- Does it include references?
# -- Did the agent use the available tools?
# We don't have to evaluate this manually. Instead, we can delegate this to AI. This technique is called "LLM as a Judge".
# The idea is simple: we use one LLM to evaluate the outputs of another LLM. This works because LLMs are good at following detailed evaluation criteria.
    
# Our system prompt for the judge (we'll call it "evaluation agent" because it sounds cooler) can look like this:

evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent's answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met. 

Checklist:

- instructions_follow: The agent followed the user's instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do  
- answer_relevant: The response directly addresses the user's question  
- answer_clear: The answer is clear and correct  
- answer_citations: The response includes proper citations or sources when required  
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: Is the search tool invoked? 

Output true/false for each check and provide a short explanation for your judgment.
""".strip()

In [16]:
# Since we expect a very well defined structure of the response, we can use structured output.
# We can define a Pydantic class with the expected response structure, and the LLM will produce output that matches this schema exactly.
    
# This is how we do it:

from pydantic import BaseModel

class EvaluationCheck(BaseModel):
    check_name: str
    justification: str
    check_pass: bool

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str

# This code defines the structure we expect from our evaluation:
# Each check has a name, justification, and pass/fail result
# The overall evaluation includes a list of checks and a summary
# Note that justification comes before check_pass. This makes 
# the LLM reason about the answer before giving the final judgment, 
# which typically leads to better evaluation quality.



In [17]:
# With Pydantic AI, we make the output follow the specified class by using the output_type parameter:
    
eval_agent = Agent(
    name='eval_agent',
    model='gpt-5-nano',
    instructions=evaluation_prompt,
    output_type=EvaluationChecklist
)


In [18]:
# Usually it's a good idea to evaluate the results of one model (in our case, "gpt-4o-mini") 
# with another model (e.g. "gpt-5-nano"). A different model can catch mistakes, reduce self-bias, 
# and give a second opinion. This makes evaluations more reliable.
# We have the instructions, and we have the agent. In order to run the agent, it needs input. 
    
# We'll start with a template:

user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

# We use XML markup because it makes the input easier and clearer for LLMs to understand. 
# XML tags help the model see the structure and boundaries of different sections in the prompt.


In [19]:
# Let's fill it in. First, define a helper function for loading JSON log files:

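def load_log_file(log_file):
    # Accept both strings and Path objects; later code uses Path attributes such as .name
    log_file = Path(log_file)
    # The logs were written as UTF-8, so read them back with the same encoding
    with log_file.open('r', encoding='utf-8') as f_in:
        log_data = json.load(f_in)
    log_data['log_file'] = log_file
    return log_data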

# We also add the filename in the result - it'll help us with tracking later.


In [33]:
# Now let's use it:
# log_record = load_log_file('./logs/faq_agent_v2_20250926_072928_467470.json')
# here I copied the full path and replaced the Windows backslashes \ with forward slashes /
log_record = load_log_file('C:/tmp/aihero/course/logs/faq_agent_20251015_155744_df13b1.json')
instructions = log_record['system_prompt']
question = log_record['messages'][0]['parts'][0]['content']
answer = log_record['messages'][-1]['parts'][0]['content']
log = json.dumps(log_record['messages'])

user_prompt = user_prompt_format.format(
    instructions=instructions,
    question=question,
    answer=answer,
    log=log
)


In [34]:
# The user input is ready and we can test it!
result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)

checklist = result.output
print(checklist.summary)

for check in checklist.checklist:
    print(check)

# This code:
# Loads a saved interaction log
# Extracts the key components (instructions, question, answer, full log)
# Formats them into the evaluation prompt
# Runs the evaluation agent
# Prints the results


Initiated a search tool call to fetch course-material relevant Docker-on-Linux guidance; awaiting results to craft a complete, sourced answer.
check_name='instructions_follow' justification="The agent will follow the user's instruction to search course materials before answering (per <INSTRUCTIONS>). The agent hasn't yet, but will trigger a search tool in this response as required." check_pass=True
check_name='instructions_avoid' justification='No disallowed behavior observed. The response will rely on course materials and common Docker usage; nothing problematic. ' check_pass=True
check_name='answer_relevant' justification='The answer will address Docker use on Linux. ' check_pass=True
check_name='answer_clear' justification='The final answer will present steps clearly. ' check_pass=True
check_name='answer_citations' justification='The answer will cite course material results from search; or at least reference them. ' check_pass=True
check_name='completeness' justification='The answer

In [35]:
# Note that we're putting the entire conversation log into the prompt, which is not really necessary. 
# We can reduce it to make it less verbose.
# For example, like this:

def simplify_log_messages(messages):
    log_simplified = []

    for m in messages:
        parts = []
    
        for original_part in m['parts']:
            part = original_part.copy()
            kind = part['part_kind']
    
            if kind == 'user-prompt':
                del part['timestamp']
            if kind == 'tool-call':
                del part['tool_call_id']
            if kind == 'tool-return':
                del part['tool_call_id']
                del part['metadata']
                del part['timestamp']
                # Replace actual search results with placeholder to save tokens
                part['content'] = 'RETURN_RESULTS_REDACTED'
            if kind == 'text':
                del part['id']
    
            parts.append(part)
    
        message = {
            'kind': m['kind'],
            'parts': parts
        }
    
        log_simplified.append(message)
    return log_simplified

# We make it simpler:
# remove timestamps and IDs that aren't needed for evaluation
# replace actual search results with a placeholder
# keep only the essential structure
# This is helpful because it reduces the number of tokens we send to the evaluation model, 
# which lowers the costs and speeds up evaluation.
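
The function above uses `del`, which raises a `KeyError` if a field is missing. Whether every part always carries all of these fields may depend on the library version, so here is a more tolerant sketch of the same idea for a single part. `DROP_FIELDS`, `simplify_part` and the demo part are made up for this illustration.

```python
# Fields to drop for each part kind (the same ones removed in the cell above)
DROP_FIELDS = {
    "user-prompt": ["timestamp"],
    "tool-call": ["tool_call_id"],
    "tool-return": ["tool_call_id", "metadata", "timestamp"],
    "text": ["id"],
}


def simplify_part(original_part):
    part = dict(original_part)  # shallow copy, the original stays untouched
    kind = part.get("part_kind")

    for field in DROP_FIELDS.get(kind, []):
        part.pop(field, None)  # no KeyError when the field is absent

    if kind == "tool-return":
        # Replace actual search results with a placeholder to save tokens
        part["content"] = "RETURN_RESULTS_REDACTED"

    return part


# Hypothetical tool-return part without "metadata" and "timestamp"
demo_part = {
    "part_kind": "tool-return",
    "tool_name": "text_search",
    "content": [{"question": "..."}],
    "tool_call_id": "call_1",
}
simplified = simplify_part(demo_part)
print(simplified)
```

Keeping the field names in one dictionary also makes it easier to adjust the list when the message format changes.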


In [36]:
# Let's put everything together:

async def evaluate_log_record(eval_agent, log_record):
    messages = log_record['messages']

    instructions = log_record['system_prompt']
    question = messages[0]['parts'][0]['content']
    answer = messages[-1]['parts'][0]['content']

    log_simplified = simplify_log_messages(messages)
    log = json.dumps(log_simplified)

    user_prompt = user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )

    result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)
    return result.output 


# log_record = load_log_file('./logs/faq_agent_v2_20250926_072928_467470.json')
# here I copied the full path and replaced the Windows backslashes \ with forward slashes /
log_record = load_log_file('C:/tmp/aihero/course/logs/faq_agent_20251015_155744_df13b1.json')
eval1 = await evaluate_log_record(eval_agent, log_record)

# We know how to log our data and how to run evals on our logs.
# Great. But how do we get more data to get a better understanding of the performance of our model?


In [37]:
eval1

EvaluationChecklist(checklist=[EvaluationCheck(check_name='instructions_follow', justification='The user instruction required using a search tool to fetch course materials before answering; a text_search tool call is present in the log, indicating this was done.', check_pass=True), EvaluationCheck(check_name='instructions_avoid', justification='No disallowed actions or content were involved; the answer stayed within typical Docker usage guidance.', check_pass=True), EvaluationCheck(check_name='answer_relevant', justification="The answer directly explains how to install, start, configure, and use Docker on Linux, addressing the user's question.", check_pass=True), EvaluationCheck(check_name='answer_clear', justification='The steps are organized and include concrete commands, making the guidance easy to follow.', check_pass=True), EvaluationCheck(check_name='answer_citations', justification='No citations were included, but citations are not strictly required for this general setup guide.

## Data Generation

In [39]:
# We can ask AI to help. What if we used it for generating more questions? Let's do that.
# We can sample some records from our database. Then for each record, ask an LLM to generate a question based on the record. 
# We use this question as input to our agent and log the answers.

# Let’s start by defining the question generator:

question_generation_prompt = """
You are helping to create test questions for an AI agent that answers questions about a data engineering course.

Based on the provided FAQ content, generate realistic questions that students might ask.

The questions should:

- Be natural and varied in style
- Range from simple to complex
- Include both specific technical questions and general course questions

Generate one question for each record.
""".strip()

class QuestionsList(BaseModel):
    questions: list[str]

question_generator = Agent(
    name="question_generator",
    instructions=question_generation_prompt,
    model='gpt-4o-mini',
    output_type=QuestionsList
)

# This prompt is designed for our specific use case (data engineering course FAQ). You should adjust it for your project.
# We will send it a bunch of records, and it will generate a question from each of them.
# Note: we use a simple way of generating questions. We can use a more complex approach where 
# we also track the source (filename) of the question. If we do it, we can later check 
# if this file was retrieved and cited in the answer. But we won't do it today to make things simpler.
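
The source-tracking idea mentioned above could look roughly like this. It is only a sketch under two assumptions: the generator returns exactly one question per record, in the same order as the input, and a citation counts as present when the file's base name appears in the answer. The helper `source_was_cited` and all demo values are made up for this illustration.

```python
from pathlib import PurePosixPath


def source_was_cited(answer: str, source_filename: str) -> bool:
    """Check whether the base name of the source file appears in the answer."""
    return PurePosixPath(source_filename).name in answer


# Hypothetical sampled record and the question generated from it
demo_sample = [
    {"filename": "faq-main/_questions/data-engineering-zoomcamp/module-6/"
                 "011_759b5d80ec_kafka-python-videos-ridescsv.md"},
]
demo_questions = ["Where can I find the `rides.csv` data?"]

# Assumption: questions come back in the same order as the records
question_sources = {
    question: record["filename"]
    for question, record in zip(demo_questions, demo_sample)
}

demo_answer = (
    "You can find it here: [source](https://github.com/DataTalksClub/faq/blob/main/"
    "_questions/data-engineering-zoomcamp/module-6/"
    "011_759b5d80ec_kafka-python-videos-ridescsv.md)."
)

cited = source_was_cited(demo_answer, question_sources[demo_questions[0]])
print(cited)
```

A check like this is deterministic, so it can complement the `answer_citations` judgment from the evaluation agent rather than replace it.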


In [42]:
# Now let's sample 10 records from our dataset using Python's built-in random.sample function:

import random

sample = random.sample(de_dtc_faq, 10)
prompt_docs = [d['content'] for d in sample]
prompt = json.dumps(prompt_docs)

result = await question_generator.run(prompt)
questions = result.output.questions

In [43]:
questions

['How do I install the `dlt[duckdb]` package to run the provided code?',
 'What does horizontal scaling involve in the context of data consumption?',
 'Where can I find the `rides.csv` data used by the `producer.py` Python program?',
 'Can you explain how to unzip a parquet file and use it with pandas?',
 'Does BigQuery support real-time data streaming capabilities?',
 'How can I switch to in-file storage for my dataset?',
 'What should I do if I face date type errors after uploading FHV 2019 CSV files to BigQuery?',
 'What can cause a `FileNotFoundException` when writing parquet files in PySpark, and how can I solve this?',
 'How does PySpark handle the difference between two timestamps, and how can I express that duration in hours?',
 'What is the purpose of the Key Vault in Azure Cloud, and how can it be used for storing sensitive information?']

In [44]:

# Now we simply iterate over the questions, ask our agent, and log the results:

from tqdm.auto import tqdm

for q in tqdm(questions):
    print(q)

    result = await agent.run(user_prompt=q)
    print(result.output)

    log_interaction_to_file(
        agent,
        result.new_messages(),
        source='ai-generated'
    )

    print()

# We can repeat it multiple times until we have enough data. 
# Around 100 should be good for a start, but today we can just 
# continue with the 10 log records we already generated.


  0%|          | 0/10 [00:00<?, ?it/s]

How do I install the `dlt[duckdb]` package to run the provided code?
To install the `dlt[duckdb]` package, you can run the following command in your terminal or Jupyter notebook:

```bash
pip install dlt[duckdb]
```

If you are using it locally, ensure that `duckdb` is installed before loading the `duckdb` package as follows:

```bash
pip install "dlt[duckdb]"
```

This will ensure that all the necessary dependencies for running your code are installed properly. 

For more details, you can refer to the source material [here](https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/workshop-1-dlthub/003_122d2b0aed_how-do-i-install-the-necessary-dependencies-to-run.md).

What does horizontal scaling involve in the context of data consumption?
In the context of data consumption, horizontal scaling involves distributing the process of consuming messages across multiple consumers. This approach allows for the parallel consumption of data, enhancing processing spee
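
If you repeat the generation step to collect more data (for example, the 100 records suggested above), it may help to draw all batches from a single sample so that no record is used twice. This is a minimal sketch. `sample_batches` is a made-up helper, and the demo records are placeholders rather than real FAQ entries.

```python
import random


def sample_batches(records, batch_size=10, n_batches=10, seed=None):
    """Split one random sample without replacement into equally sized batches."""
    rng = random.Random(seed)
    # Never ask for more records than we have
    total = min(len(records), batch_size * n_batches)
    chosen = rng.sample(records, total)
    return [chosen[i:i + batch_size] for i in range(0, total, batch_size)]


# Placeholder records standing in for de_dtc_faq
demo_records = [{"content": f"doc {i}"} for i in range(35)]
demo_batches = sample_batches(demo_records, batch_size=10, n_batches=3, seed=0)

for batch in demo_batches:
    print(len(batch))
```

Each batch can then be sent to `question_generator` in the same way as the single sample of 10 records above.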

In [45]:
# Using AI for generating test data is quite powerful. It can help us get data faster 
# and sometimes cover edge cases we wouldn't think of.
# There are limitations too:
# -- AI-generated questions might not reflect real user behavior
# -- They may miss important edge cases that only real users encounter
# -- They may not capture the full complexity of real user queries
# The logs are ready, so we can run evaluation on them with our evaluation agent.
    
# First, collect all the AI-generated logs for the v2 agent:

eval_set = []

for log_file in LOG_DIR.glob('*.json'):
    if 'faq_agent_v2' not in log_file.name:
        continue

    log_record = load_log_file(log_file)
    if log_record['source'] != 'ai-generated':
        continue

    eval_set.append(log_record)


In [47]:
# eval_set

In [48]:
# And evaluate them:

eval_results = []

for log_record in tqdm(eval_set):
    eval_result = await evaluate_log_record(eval_agent, log_record)
    eval_results.append((log_record, eval_result))

# This code:
# Loops through each AI-generated log
# Runs our evaluation agent on it
# Stores both the original log and evaluation result
# There are ways to speed this up, but we won't cover them in detail here. For example, you can try this:
# -- Don't ask for justification - this makes evaluation faster at the cost of slightly lower quality
# -- Parallelize execution - you can ask ChatGPT how to do this with async/await


  0%|          | 0/10 [00:00<?, ?it/s]
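
The parallelization idea mentioned above can be sketched with `asyncio.gather` and a semaphore that limits how many evaluations run at once. This is one possible approach, not the notebook's own code: `evaluate_all` and `fake_evaluate` are made up for illustration, and `fake_evaluate` stands in for the real call to the evaluation agent so the example runs without an API key.

```python
import asyncio


async def evaluate_all(records, evaluate, max_concurrency=5):
    """Run `evaluate` on every record, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def evaluate_one(record):
        async with semaphore:
            return record, await evaluate(record)

    # gather returns results in the same order as the input records
    return await asyncio.gather(*(evaluate_one(r) for r in records))


async def fake_evaluate(record):
    # Stand-in for the real evaluation call
    await asyncio.sleep(0.01)
    return {"question_length": len(record["question"])}


demo_records = [{"question": "a"}, {"question": "bb"}, {"question": "ccc"}]

# In a notebook, use `await evaluate_all(...)` instead of asyncio.run(...)
demo_results = asyncio.run(
    evaluate_all(demo_records, fake_evaluate, max_concurrency=2)
)
print(demo_results)
```

With the real agent, the `evaluate` argument could be something like `lambda rec: evaluate_log_record(eval_agent, rec)`. The concurrency limit is there to stay within API rate limits; a suitable value depends on your account.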

In [49]:
# The results are collected, but we need to display them and also calculate some statistics. 
# The best tool for doing this is Pandas. We should already have it because minsearch depends on it. 
# But we can make it an explicit dependency:

# uv add pandas

eval_results[0]

({'agent_name': 'faq_agent_v2',
  'system_prompt': ['You are a helpful assistant for a course.  \n\nUse the search tool to find relevant information from the course materials before answering questions.  \n\nIf you can find specific information through search, use it to provide accurate answers.\n\nAlways include references by citing the filename of the source material you used.  \nWhen citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"\ncorrect link looks like this - without << /faq-main >> part :\n<< https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/general/014_3774a79c13_certificate-do-i-need-to-do-the-homeworks-to-get-t.md >> \nFormat: [LINK TITLE](FULL_GITHUB_LINK)\n\nIf the search doesn\'t return relevant results, let the user know and provide general guidance.'],
  'provider': 'openai',
  'model': 'gpt-4o-mini',
  'tools': ['text_search'],
  'messages': [{'parts':

In [50]:

# Our data is not ready to be converted to a Pandas DataFrame. 
# We first need to transform it a little. Let’s do it:

rows = []

for log_record, eval_result in eval_results:
    messages = log_record['messages']

    row = {
        'file': log_record['log_file'].name,
        'question': messages[0]['parts'][0]['content'],
        'answer': messages[-1]['parts'][0]['content'],
    }

    checks = {c.check_name: c.check_pass for c in eval_result.checklist}
    row.update(checks)

    rows.append(row)

# This code:
# -- Extracts key information from each log (file, question, answer)
# -- Converts the evaluation checks into a dictionary and adds them to the row


In [52]:
rows[2]

{'file': 'faq_agent_v2_20251015_200046_c56d1b.json',
 'question': 'Where can I find the `rides.csv` data used by the `producer.py` Python program?',
 'answer': 'You can find the `rides.csv` data used by the `producer.py` Python program at the following link: [Rides CSV File](https://raw.githubusercontent.com/DataTalksClub/data-engineering-zoomcamp/2bd33e89906181e424f7b12a299b70b19b7cfcd5/week_6_stream_processing/python/resources/rides.csv) [source](https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/module-6/011_759b5d80ec_kafka-python-videos-ridescsv.md). \n\nIf this file is missing, it is suggested to copy it from the Java example directory at this path: `data-engineering-zoomcamp/week_6_stream_processing/java/kafka_examples/src/main/resources/rides.csv` [source](https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/module-6/009_dcdd7eda4d_resourcesridescsv-is-missing.md).',
 'instructions_follow': True,
 'instructions_avo

In [55]:
# Now each row is a simple key-value dictionary, so we can create a DataFrame:
import pandas as pd

df_evals = pd.DataFrame(rows)
df_evals

Unnamed: 0,file,question,answer,instructions_follow,instructions_avoid,answer_relevant,answer_clear,answer_citations,completeness,tool_call_search
0,faq_agent_v2_20251015_200036_d284b1.json,How do I install the `dlt[duckdb]` package to ...,"To install the `dlt[duckdb]` package, you can ...",False,True,True,True,False,True,False
1,faq_agent_v2_20251015_200041_996004.json,What does horizontal scaling involve in the co...,"In the context of data consumption, horizontal...",True,True,True,True,True,True,True
2,faq_agent_v2_20251015_200046_c56d1b.json,Where can I find the `rides.csv` data used by ...,You can find the `rides.csv` data used by the ...,True,True,True,True,True,True,False
3,faq_agent_v2_20251015_200053_37c351.json,Can you explain how to unzip a parquet file an...,To unzip a Parquet file and use it with pandas...,False,True,True,True,False,True,False
4,faq_agent_v2_20251015_200103_84f227.json,Does BigQuery support real-time data streaming...,"Yes, BigQuery supports real-time data streamin...",True,True,True,True,True,True,False
5,faq_agent_v2_20251015_200107_8a1eb5.json,How can I switch to in-file storage for my dat...,To switch to in-file storage for your dataset ...,True,True,True,True,True,True,False
6,faq_agent_v2_20251015_200114_514722.json,What should I do if I face date type errors af...,If you encounter date type errors after upload...,True,True,True,True,True,True,True
7,faq_agent_v2_20251015_200123_9706cd.json,What can cause a `FileNotFoundException` when ...,A `FileNotFoundException` when writing Parquet...,False,True,True,True,True,False,False
8,faq_agent_v2_20251015_200133_50a0fd.json,How does PySpark handle the difference between...,"In PySpark, the difference between two `Timest...",True,True,True,True,True,False,True
9,faq_agent_v2_20251015_200141_922cb1.json,What is the purpose of the Key Vault in Azure ...,The Azure Key Vault is designed to securely st...,,,,,,,


In [62]:
# We can look at individual records and see which checks are False.
# But it's also useful to look at the overall stats:

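# The check columns hold True/False values (and empty values for records without
# evaluation results), so we convert them to numbers before averaging.
check_columns = df_evals.drop(columns=['file', 'question', 'answer'])
df_evals2 = check_columns.astype(float)
# This calculates the average pass rate for each check; empty values are skipped
df_evals2.mean().round(2)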



instructions_follow    0.67
instructions_avoid     1.00
answer_relevant        1.00
answer_clear           1.00
answer_citations       0.78
completeness           0.78
tool_call_search       0.33
dtype: float64

In [63]:

# The code above calculates the average pass rate for each check.
# The last record has empty check values, so the averages cover the other nine records.

# This tells us:
# -- About 67% of responses follow instructions completely
# -- All responses avoid forbidden actions (good!)
# -- All responses are relevant and clear (great!)
# -- About 78% include proper citations (needs improvement)
# -- About 78% of responses are complete
# -- Only 33% pass the search tool check, although we'd expect every response to use search
# For us, the most important check is answer_relevant.
# This tells us whether the agent actually answers the user's question.
# If this score were low, it’d mean that our agent is not ready.

# We now know how to evaluate our agent. What can we do with it now?
# Many things:
# -- Decide if this quality is good enough for deployment
# -- Evaluate different chunking approaches and search
# -- See if changing a prompt leads to any improvements

# The algorithm is simple:
# -- Collect data for evaluation and keep this dataset fixed
# -- Run different versions of your agent on this dataset
# -- Compare key metrics to decide which version is better

# Evaluation is a very powerful tool and we should use it when possible.
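The comparison step can be sketched with pandas. The two small tables below are made-up placeholder data for two hypothetical agent versions evaluated on the same four questions. In practice each would be a DataFrame like `df_evals`, built from that version's logs.

```python
import pandas as pd

# Made-up check results for two hypothetical agent versions,
# both evaluated on the same fixed set of four questions.
evals_v1 = pd.DataFrame({
    'answer_relevant': [True, True, True, False],
    'answer_citations': [False, True, False, False],
})
evals_v2 = pd.DataFrame({
    'answer_relevant': [True, True, True, True],
    'answer_citations': [True, True, False, True],
})

def pass_rates(df):
    # Booleans average to the share of True values, i.e. the pass rate
    return df.astype(float).mean()

comparison = pd.DataFrame({
    'v1': pass_rates(evals_v1),
    'v2': pass_rates(evals_v2),
})
# Positive values mean the second version passes the check more often
comparison['diff'] = comparison['v2'] - comparison['v1']
print(comparison)
```

With only a handful of records, small differences can be noise, so it helps to evaluate more cases before deciding which version is better.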


## Evaluating functions and tools

In [65]:
# We can (and should) also evaluate our tools separately from the agent.
# If it's code, we need to cover it with unit and integration tests.
# We also have the search function, which we can evaluate using standard information retrieval metrics. For example:
# -- Precision and Recall: the share of retrieved results that are relevant, and the share of relevant results that were retrieved
# -- Hit Rate: percentage of queries that return at least one relevant result
# -- MRR (Mean Reciprocal Rank): reflects the position of the first relevant result in the ranking
# This is how we can implement hit rate and MRR calculation in Python:
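def evaluate_search_quality(search_function, test_queries):
    # test_queries is a list of (query, expected_docs) pairs, where
    # expected_docs holds the filenames that are relevant for the query
    results = []

    for query, expected_docs in test_queries:
        search_results = search_function(query, num_results=5)

        # Hit: at least one relevant document is among the results
        relevant_found = any(doc['filename'] in expected_docs for doc in search_results)

        # Reciprocal rank: 1 / position of the first relevant result, 0 if none is found
        for i, doc in enumerate(search_results):
            if doc['filename'] in expected_docs:
                mrr = 1 / (i + 1)
                break
        else:
            mrr = 0

        results.append({
            'query': query,
            'hit': relevant_found,
            'mrr': mrr
        })

    # Average 'hit' and 'mrr' over all queries to get the hit rate and MRR
    return results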

# We won't do it today, but these ideas and the code will be useful when you implement a real agent project with search.
# It'll help us make informed decisions about:
# -- When to use text vs. vector vs. hybrid search
# -- What the best parameters for our search are
# You can ask ChatGPT to learn more about information retrieval evaluation metrics.
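To make the metrics above concrete, here is a self-contained sketch that computes precision, recall, hit rate and MRR on toy data. The `toy_search` function, the filenames and the test queries are all made up for illustration. In a real project you'd pass your own search function and a hand-labeled set of queries.

```python
def toy_search(query, num_results=5):
    # Hypothetical stand-in for a real search function:
    # it returns a fixed ranking of filenames for each known query
    rankings = {
        'install docker': ['docker.md', 'setup.md', 'kafka.md'],
        'get certificate': ['homework.md', 'certificate.md', 'setup.md'],
        'spark error': ['kafka.md', 'setup.md', 'docker.md'],
    }
    return [{'filename': name} for name in rankings[query][:num_results]]

def search_metrics(search_function, test_queries, k=3):
    rows = []
    for query, expected_docs in test_queries:
        retrieved = [doc['filename'] for doc in search_function(query, num_results=k)]
        relevant_retrieved = [name for name in retrieved if name in expected_docs]

        # Reciprocal rank of the first relevant result, 0 if there is none
        reciprocal_rank = 0.0
        for position, name in enumerate(retrieved, start=1):
            if name in expected_docs:
                reciprocal_rank = 1 / position
                break

        rows.append({
            # Share of retrieved documents that are relevant
            'precision': len(relevant_retrieved) / len(retrieved) if retrieved else 0.0,
            # Share of relevant documents that were retrieved
            'recall': len(relevant_retrieved) / len(expected_docs),
            'hit': len(relevant_retrieved) > 0,
            'mrr': reciprocal_rank,
        })

    # Average each per-query metric over all queries
    return {name: sum(row[name] for row in rows) / len(rows) for name in rows[0]}

# Made-up labeled queries: (query, set of relevant filenames)
test_queries = [
    ('install docker', {'docker.md', 'setup.md'}),
    ('get certificate', {'certificate.md'}),
    ('spark error', {'spark.md'}),
]

metrics = search_metrics(toy_search, test_queries)
print(metrics)
```

Running the same labeled queries against different search configurations gives numbers that can be compared directly, in the same way as the agent checks.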

# This was a very long lesson, but an important one. We finished it and evaluated our agent. It’s good for deployment, so tomorrow we’ll create a UI for it and deploy it to the internet.

# Homework
# -- Create an evaluation system for your agent
# -- Collect at least 10 interaction logs
# -- Set up automated evaluation using LLM as a judge
# -- Test different system prompts and compare results
# -- Make a post on social media about your evaluation process


In [66]:
# the end