## Why evaluation?

Previously, in the AI HERO DATA PREP LONG notebook, we learned about function calling and created our first agent using Pydantic AI.

But is this agent actually good? Today we will see how to answer this question.

In particular, we will:

-- Build a logging system to track agent interactions

-- Create automated evaluation using AI as a judge

-- Generate test data automatically

-- Measure agent performance with metrics

At the end, we will have a thoroughly tested agent with performance metrics.

In this lesson, we'll use the FAQ database with text search, but the same approach applies to any other use case.

This is going to be a long lesson, but an important one. Evaluation is critical for building reliable AI systems. Without proper evaluation, you can't tell if your changes improve or hurt performance. You can't compare different approaches. And you can't build confidence before deploying to users.


## Logging

The easiest thing we can do to evaluate an agent is interact with it. We ask something and look at the response. Does it make sense? For most cases, it should.

This approach is called a "vibe check": we interact with the agent, and if we like the results, we go ahead and deploy it.

If we don't like something, we go back and change things:

-- Maybe our chunking method is not suitable? Maybe we need to have a bigger window size?

-- Is our system prompt good? Maybe we need more precise instructions?

-- Or maybe we want to change something else?

And we iterate.

It might be okay for the first MVP, but how can we make sure the result at the end is actually good?

We need systematic evaluation. Manual testing doesn't scale - you can't manually test every possible input and scenario. With systematic evaluation, we can test hundreds or thousands of cases automatically.

We also need to base our decisions on data. Data helps us:

-- Compare different approaches

-- Track improvements

-- Identify edge cases

We can start collecting this data ourselves: start with vibe checking, but be smart about it. We don't just test the agent; we also record the results.


In [1]:
# full data scrape function

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data


In [3]:
# download data as we did earlier:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")


FAQ documents: 1219
Evidently documents: 95


In [5]:
# build a simple text search index for FAQ data:

from minsearch import Index

# dtc_faq = read_repo_data('DataTalksClub', 'faq')
# keep only the data engineering (DE) FAQ
de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)


<minsearch.minsearch.Index at 0x1ed6ba1bcb0>

In [6]:
# Enter the OpenAI API key once here; after that, all cells can be run freely.
import os
from getpass import getpass
from openai import OpenAI

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key:  ········


In [7]:
# Here's the agent we created yesterday:

from typing import List, Any
from pydantic_ai import Agent


def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)


system_prompt = """
You are a helpful assistant for a  course. 

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""

agent = Agent(
    name="faq_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)


In [8]:
# Let's interact with the agent:
question = "how do I install Kafka in Python?"
result = await agent.run(user_prompt=question)

In [10]:
# result

Here's what we want to record:

-- The system prompt that we used

-- The model

-- The user query

-- The tools we use

-- The responses and the back-and-forth interactions between the LLM and our tools

-- The final response

To keep things simple, we'll implement a basic logging system ourselves: we will just write logs to JSON files.

This is not meant for production. In practice, you will want to send these logs to a log collection system, or use specialized LLM evaluation tools like Evidently, LangWatch or Arize Phoenix.


In [11]:
# Let's extract all this information from the agent and from the run results:
from pydantic_ai.messages import ModelMessagesTypeAdapter


def log_entry(agent, messages, source="user"):
    tools = []

    for ts in agent.toolsets:
        tools.extend(ts.tools.keys())

    dict_messages = ModelMessagesTypeAdapter.dump_python(messages)

    return {
        "agent_name": agent.name,
        "system_prompt": agent._instructions,
        "provider": agent.model.system,
        "model": agent.model.model_name,
        "tools": tools,
        "messages": dict_messages,
        "source": source
    }

# This code extracts the key information from our agent:
# -- the configuration (name, prompt, model)
# -- available tools
# -- complete message history (user input, tool calls, responses)
# We also use ModelMessagesTypeAdapter.dump_python(messages) to convert internal message format into regular Python dictionaries. 
# This makes it easier to save it to JSON and process later.
# We also add the source parameter. It tracks where the question came from. We start with "user" but later we'll use AI-generated queries. 
# Sometimes it may be important to tell them apart for analysis.
# This code is generic so it will work with any Pydantic AI agent. If you use a different library, you'll need to adjust this code.

In [12]:
# Let's write these logs to a folder:

import json
import secrets
from pathlib import Path
from datetime import datetime


LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)


def serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    ts = entry['messages'][-1]['timestamp']
    ts_str = ts.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath

# The code above:
# -- Creates a logs directory (if it doesn't exist yet)
# -- Generates unique filenames from a timestamp and a random hex suffix
# -- Saves complete interaction logs as JSON files
# -- Handles datetime serialization (using the serializer function)
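
To see why the `serializer` function is needed, here is a minimal standalone sketch of the `default=` hook of the `json` module. It repeats the same `serializer` as above; the sample entry and its timestamp are made-up values for illustration only, not output from the agent.

```python
import json
from datetime import datetime


def serializer(obj):
    # json calls this hook only for objects it cannot serialize by itself
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


# Hypothetical entry, used only to demonstrate the hook
sample_entry = {
    "source": "user",
    "timestamp": datetime(2025, 10, 10, 19, 25, 57),
}

# Without default=serializer this call would raise a TypeError
print(json.dumps(sample_entry, default=serializer))
# {"source": "user", "timestamp": "2025-10-10T19:25:57"}
```

Any other unsupported type (for example a `set`) still raises a `TypeError`, so unexpected values in a log entry fail loudly instead of being silently dropped.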

In [15]:

# Now we can interact with it and do some vibe checking:
    
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

# This cell runs one interaction (re-run it to ask another question):
# -- User enters a question
# -- Agent processes it and responds
# -- Complete interaction is logged to a file

# Try these questions:
# how do I use docker on windows?
# can I join late and get a certificate?
# what do I need to do for the certificate?


 what do I need to do for the certificate?


To obtain your certificate, you need to follow these requirements:

1. **Complete the Peer-Reviewed Capstone Projects**: You must finish the peer-reviewed capstone projects on time. Completing the homework is not mandatory for receiving the certificate, especially if you join the course late.

2. **Participate in a Live Cohort**: You can only receive a certificate if you finish the course with a “live” cohort. Certificates are not awarded for self-paced course completion, as peer reviews need to be conducted while the course is ongoing.

3. **Check Your Profile**: There will be an announcement regarding the certificate on Telegram and the course channel. Ensure that your full name is displayed correctly in your course profile, as that is what will appear on the certificate.

4. **Access Your Certificate**: After grading is completed and announcements are made, you can follow the instructions provided for generating the certificate document yourself. You will find it in your course prof

WindowsPath('logs/faq_agent_20251010_192557_3d58d8.json')
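
Once a few interactions are logged, the files can be read back for analysis. Below is a minimal sketch, assuming the logs live in the `logs` folder and follow the structure produced by `log_entry` (in particular, that each entry has a `source` field). The helper names `load_logs` and `count_by_source`, and the extra `log_file` key, are introduced here for illustration and are not part of the original notebook.

```python
import json
from collections import Counter
from pathlib import Path


def load_logs(log_dir="logs"):
    """Read every *.json file in log_dir and return the parsed entries."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return []

    entries = []
    for path in sorted(log_dir.glob("*.json")):
        with path.open("r", encoding="utf-8") as f_in:
            entry = json.load(f_in)
        entry["log_file"] = path.name  # remember which file the entry came from
        entries.append(entry)
    return entries


def count_by_source(entries):
    """Count log entries per value of the "source" field."""
    return Counter(entry.get("source", "unknown") for entry in entries)


logs = load_logs("logs")
print(f"Loaded {len(logs)} log entries")
print(count_by_source(logs))
```

Keeping the file name on each entry makes it easy to go back to the raw log when a particular answer looks wrong.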

## Adding References