## Why evaluation?

Previously, in the AI HERO DATA PREP LONG notebook, we learned about function calling and created our first agent using Pydantic AI.

But is this agent actually good? Today we will see how to answer this question.

In particular, we will cover:

-- Build a logging system to track agent interactions

-- Create automated evaluation using AI as a judge

-- Generate test data automatically

-- Measure agent performance with metrics

At the end, we will have a thoroughly tested agent with performance metrics.

In this lesson, we'll use the FAQ database with text search, but the same approach applies to any other use case.

This is going to be a long lesson, but an important one. Evaluation is critical for building reliable AI systems. Without proper evaluation, you can't tell if your changes improve or hurt performance. You can't compare different approaches. And you can't build confidence before deploying to users.


## Logging

The easiest thing we can do to evaluate an agent is interact with it. We ask something and look at the response. Does it make sense? For most cases, it should.

This approach is called a "vibe check": we interact with the agent, and if we like the results, we go ahead and deploy it.

If we don't like something, we go back and change things:

-- Maybe our chunking method is not suitable? Maybe we need to have a bigger window size?

-- Is our system prompt good? Maybe we need more precise instructions?

Or maybe we want to change something else.

And we iterate.

This might be okay for a first MVP, but how can we make sure the end result is actually good?

We need systematic evaluation. Manual testing doesn't scale - you can't manually test every possible input and scenario. With systematic evaluation, we can test hundreds or thousands of cases automatically.
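To make the idea concrete, here is a minimal sketch of running a batch of test cases automatically. Everything in it is an assumption for illustration: `stub_agent` stands in for the real agent so the example runs without an API key, and the `expected_substring` check is a deliberately crude pass criterion, not the evaluation method used later in this lesson.

```python
def run_test_cases(answer_fn, test_cases):
    """Run every test case through answer_fn and record whether it passed."""
    results = []
    for case in test_cases:
        answer = answer_fn(case["question"])
        # Crude check (assumption): the answer must mention an expected phrase
        passed = case["expected_substring"].lower() in answer.lower()
        results.append({
            "question": case["question"],
            "answer": answer,
            "passed": passed,
        })
    return results


def pass_rate(results):
    """Fraction of test cases that passed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r["passed"] for r in results) / len(results)


def stub_agent(question: str) -> str:
    """Hypothetical stand-in for the real agent."""
    if "kafka" in question.lower():
        return "Run pip install confluent-kafka"
    return "I could not find this in the course materials."


test_cases = [
    {"question": "How do I install Kafka in Python?",
     "expected_substring": "confluent-kafka"},
    {"question": "Can I join the course late?",
     "expected_substring": "yes"},
]

results = run_test_cases(stub_agent, test_cases)
print(f"Pass rate: {pass_rate(results):.0%}")
```

The same loop works for two test cases or two thousand, which is the point: once the cases are written down, re-running them after every change costs nothing.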

We also need to base our decisions on data. Data will help us:

-- Compare different approaches

-- Track improvements

-- Identify edge cases

We can start collecting this data ourselves: start with vibe checking, but be smart about it. We don't just test the agent; we also record the results.
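As a rough sketch of what "recording the results" could look like, the snippet below writes each question and answer pair to its own JSON file. The function name `log_interaction`, the record fields, and the file naming scheme are all assumptions made for illustration; they are not part of Pydantic AI or of this lesson's code.

```python
import json
import secrets
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def log_interaction(question, answer, log_dir="logs", source="user"):
    """Save one question/answer pair as a JSON file and return its path."""
    log_path = Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)

    ts = datetime.now(timezone.utc)
    record = {
        "timestamp": ts.isoformat(),
        "question": question,
        "answer": answer,
        "source": source,  # e.g. "user" for manual vibe checks
    }

    # Timestamp plus a random suffix keeps file names unique
    filename = f"{ts.strftime('%Y%m%d_%H%M%S')}_{secrets.token_hex(3)}.json"
    file_path = log_path / filename
    file_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return file_path


# Demo: write to a temporary directory so nothing is left in the project
saved_path = log_interaction(
    "how do I install Kafka in Python?",
    "pip install confluent-kafka",
    log_dir=tempfile.mkdtemp(),
)
print(saved_path.name)
```

One file per interaction keeps the logs easy to inspect by hand and easy to load later for evaluation.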


In [1]:
# full data scrape function

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data


In [3]:
# download data as we did earlier:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")


FAQ documents: 1219
Evidently documents: 95


In [5]:
# build a simple text search index for FAQ data:

from minsearch import Index

# keep only the data engineering (DE) FAQ entries
de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)



<minsearch.minsearch.Index at 0x1ed6ba1bcb0>

In [6]:
# we have to enter OpenAI key here and then we can run all cells freely...
import os
from getpass import getpass
from openai import OpenAI

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key:  ········


In [7]:
# Here's the agent we created yesterday:

from typing import List, Any
from pydantic_ai import Agent


def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)


system_prompt = """
You are a helpful assistant for a course.

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""

agent = Agent(
    name="faq_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)


In [8]:
# Let's interact with the agent:
question = "how do I install Kafka in Python?"
result = await agent.run(user_prompt=question)

In [9]:
result

AgentRunResult(output='To install Kafka in Python, you can follow these steps based on the specific dependencies you need:\n\n1. **Install `confluent-kafka`**:\n   - Using pip:\n     ```bash\n     pip install confluent-kafka\n     ```\n   - Using conda:\n     ```bash\n     conda install conda-forge::python-confluent-kafka\n     ```\n\n2. **Install `fastavro`** (if your project requires it):\n   ```bash\n   pip install fastavro\n   ```\n\n3. **(Optional)** If you face issues or need to specifically use `kafka-python`, you can install a specific version:\n   - First, uninstall any existing `kafka-python` package:\n     ```bash\n     pip uninstall kafka-python\n     ```\n   - Then, install the desired version, for example:\n     ```bash\n     pip install kafka-python==1.4.6\n     ```\n\n4. **If you encounter errors** like "ModuleNotFoundError: No module named \'kafka.vendor.six.moves\'," it is suggested to install `kafka-python-ng` instead:\n   ```bash\n   pip install kafka-python-ng\n   