## **Day 1: Ingest and Index Your Data**

**Setting up the imports**

Bring in the libraries we need. ``requests`` grabs the repo from GitHub, ``frontmatter`` pulls metadata out of markdown files, and ``zipfile`` and ``io`` let us unpack the zip without saving it to disk.

In [1]:
import sys
print(sys.executable)

/Users/daniel/Documents/Projects/datatalks/AI Agents Crash Course/aihero/venv/bin/python


In [2]:
import io
import zipfile
import requests
import frontmatter

**Testing how frontmatter works**

Quick test to see how frontmatter works. The stuff between the ``---`` markers is YAML metadata, and everything below is the actual content.




In [3]:
example = """---
title: "Getting Started with AI"
author: "John Doe"
tags: ["ai", "machine-learning"]
---

# Getting Started with AI

Main content here.
"""

post = frontmatter.loads(example)

print("Title:", post.metadata['title'])
print("Tags:", post.metadata['tags'])
print("Content:", post.content)

Title: Getting Started with AI
Tags: ['ai', 'machine-learning']
Content: # Getting Started with AI

Main content here.


**Downloading the repo as a zip**

Download the DataTalksClub FAQ repo as a zip. GitHub gives us a handy URL pattern for this so we don't have to clone anything.

In [4]:
url = 'https://codeload.github.com/DataTalksClub/faq/zip/refs/heads/main'
resp = requests.get(url)
print(f"Status: {resp.status_code}")


Status: 200


**Processing the zip in memory**

Open the zip straight from memory, loop through every ``.md`` file, parse its frontmatter, and collect everything into a list.

In [5]:
repository_data = []

zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    if not filename.endswith('.md'):
        continue

    try:
        with zf.open(file_info) as f_in:
            content = f_in.read()
            post = frontmatter.loads(content)
            data = post.to_dict()
            data['filename'] = filename
            repository_data.append(data)
    except Exception as e:
        print(f"Skipped {filename}: {e}")
        continue

zf.close()

print(f"Found {len(repository_data)} markdown files")

Found 1271 markdown files
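To see the in-memory pattern on its own, here is a minimal sketch that builds a small zip archive in memory and reads it back the same way, with no network call. The file names and contents are made up for illustration.

```python
import io
import zipfile

# Build a tiny zip archive in memory (a stand-in for resp.content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf_out:
    zf_out.writestr("repo-main/README.md", "# Hello")
    zf_out.writestr("repo-main/docs/guide.md", "Guide text")
    zf_out.writestr("repo-main/logo.png", b"\x89PNG")
zip_bytes = buffer.getvalue()

# Read it back without touching the disk, keeping only .md files.
markdown_files = {}
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf_in:
    for file_info in zf_in.infolist():
        if not file_info.filename.lower().endswith(".md"):
            continue
        with zf_in.open(file_info) as f_in:
            markdown_files[file_info.filename] = f_in.read().decode("utf-8")

print(sorted(markdown_files))
```

Opening the archive in a ``with`` block closes it automatically, so an explicit ``zf.close()`` is not needed in this variant.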


**Checking what we got**

Look at one record to see what we're working with.

In [6]:
print(repository_data[1])



**Building a reusable function**

Wrap everything into a function so we can reuse it on any repo. This version also handles ``.mdx`` files (React-flavored markdown that Evidently uses) and has basic error handling.

In [7]:
def read_repo_data(repo_owner, repo_name):
    prefix = 'https://codeload.github.com'
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)

    if resp.status_code != 200:
        raise Exception(f"Download failed with status {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))

    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') or filename_lower.endswith('.mdx')):
            continue

        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Skipped {filename}: {e}")
            continue

    zf.close()
    return repository_data

**Pulling data from both repos**

Put it to work. Pull docs from two repos and see how many files we got.

In [8]:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")

FAQ documents: 1271
Evidently documents: 95


That's all of Day 1. The FAQ entries are short enough to use as-is. Individual Evidently docs are much longer and need chunking, which is Day 2.

## **Day 2: Chunking and Intelligent Processing for Data**

**Why we need chunking**

The Evidently docs are way too large to feed directly into an LLM. Large documents cause problems with token limits, cost, performance, and relevance. So we break them into smaller pieces — that's what chunking is.

**Looking at a sample document**

Before we chunk anything, let's see what we're dealing with. Document 45 from the Evidently repo is a good example of a large doc.

In [9]:
print(evidently_docs[45])
print(f"\nContent length: {len(evidently_docs[45]['content'])} characters")


Content length: 21712 characters


**1. Simple Chunking (Sliding Window)**

Cuts text into fixed-size pieces with overlap between them. The overlap (controlled by ``step`` being smaller than ``size``) means we don't lose context at the boundaries. With size 2000 and step 1000, neighboring chunks share 1000 characters.

In [10]:
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result
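The overlap is easiest to see on a short string. The snippet below repeats the same logic so it runs on its own, then uses size 4 and step 2 on a made-up 10-character input, so each chunk shares 2 characters with its neighbor.

```python
def sliding_window(seq, size, step):
    # Same logic as the function above, repeated so this snippet runs alone.
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    result = []
    for i in range(0, len(seq), step):
        result.append({'start': i, 'chunk': seq[i:i+size]})
        if i + size >= len(seq):
            break
    return result

demo = sliding_window("abcdefghij", size=4, step=2)
for piece in demo:
    print(piece['start'], piece['chunk'])
# 0 abcd
# 2 cdef
# 4 efgh
# 6 ghij
```

The loop stops as soon as a chunk reaches the end of the input, so there is no trailing chunk that only repeats the overlap.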

**Testing it on document 45**

In [11]:
text = evidently_docs[45]['content']
chunks = sliding_window(text, 2000, 1000)
print(f"Generated {len(chunks)} chunks")


Generated 21 chunks


**Chunking all the Evidently docs**

We loop through each document, pull out the content with ``pop('content')``, chunk it, then attach the remaining metadata (title, description, etc.) to every chunk so we know where it came from.

In [12]:
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)

print(f"Total chunks: {len(evidently_chunks)}")


Total chunks: 576


**2. Splitting by Paragraphs and Sections**

**Paragraph splitting**

Uses a regex pattern that matches double newlines — the standard way paragraphs are separated in text files. Works great for prose, less so for technical docs where paragraphs tend to be short.

In [13]:
import re

text = evidently_docs[45]['content']
paragraphs = re.split(r"\n\s*\n", text.strip())
print(f"Paragraphs: {len(paragraphs)}")

Paragraphs: 153


**Section splitting function**

Instead of splitting by character count, this takes advantage of how markdown is actually structured. It finds ``##`` headers and splits on those, so each chunk maps to a real topic in the document. Any text that appears before the first matching header is not included in the output.

In [14]:
def split_markdown_by_level(text, level=2):
    """
    Split markdown text by a specific header level.
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        header = parts[i] + parts[i+1]
        header = header.strip()

        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections
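The ``range(1, len(parts), 3)`` step follows from how ``re.split`` handles capturing groups: each match contributes its two groups (the ``## `` marker and the title) to the result, followed by the text up to the next match. A small sketch on a made-up sample string shows the layout:

```python
import re

pattern = re.compile(r'^(#{2} )(.+)$', re.MULTILINE)

sample = (
    "Intro text\n\n"
    "## First\n\nBody one\n\n### Sub\n\nMore\n\n"
    "## Second\n\nBody two\n"
)

parts = pattern.split(sample)
for index, part in enumerate(parts):
    print(index, repr(part))

print(len(parts))  # 7
```

Index 0 holds the text before the first header, and from there the list repeats in groups of three: marker, title, body. The ``### Sub`` line stays inside the first section's body because the pattern requires a space right after exactly two ``#`` characters at the start of the line.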

**Testing section splitting on document 45**

In [15]:
text = evidently_docs[45]['content']
sections = split_markdown_by_level(text, level=2)
print(f"Sections: {len(sections)}")

Sections: 8


**Applying section splitting to all docs**

Same pattern as before — pull out the content, split it, then reattach the metadata to each section.

In [16]:
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)

print(f"Sections: {len(evidently_chunks)}")

Sections: 266


**3. Intelligent Chunking with LLM**

This is the expensive option — hand the document to GPT and let it decide how to split things up. Only worth doing when simpler methods aren't good enough and you have the budget for it.

**Loading your API key from .env**

Run this first so OpenAI can authenticate.

In [17]:
import os
from dotenv import load_dotenv

load_dotenv('../.env')

print("Key loaded:", "OPENAI_API_KEY" in os.environ)

Key loaded: True


**Setting up the OpenAI client and LLM helper**

In [18]:
from openai import OpenAI

openai_client = OpenAI()

def llm(prompt, model='gpt-4o-mini'):
    messages = [
        {"role": "user", "content": prompt}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages
    )

    return response.choices[0].message.content


**The prompt template**

Tells the LLM to break the document into logical, self-contained sections and separate them with ``---`` so we can parse the output easily.

In [19]:
prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.

Each section should be self-contained and cover
a specific topic or concept.

<DOCUMENT>
{document}
</DOCUMENT>

Use this format:

## Section Name

Section content with all relevant details

---

## Another Section Name

Another section content

---
""".strip()


**The intelligent chunking function**

Sends the document to the LLM, gets back the split result, then chops it on the ``---`` separators.

In [20]:
def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    response = llm(prompt)
    sections = response.split('---')
    sections = [s.strip() for s in sections if s.strip()]
    return sections
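One thing to watch: ``response.split('---')`` also splits inside any content that happens to contain three dashes, such as a markdown table divider. A possible alternative, sketched here with a hypothetical helper name and a made-up response, splits only on lines that consist of dashes and nothing else.

```python
import re

def split_on_separator_lines(response):
    # Split only where a whole line is three or more dashes.
    parts = re.split(r'^[ \t]*-{3,}[ \t]*$', response, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

sample_response = """## Metrics

| name | type |
|------|------|
| accuracy | float |

---

## Setup

Install the package.

---"""

naive = [s.strip() for s in sample_response.split('---') if s.strip()]
robust = split_on_separator_lines(sample_response)

print(len(naive), len(robust))
```

With this sample the naive split breaks the table into fragments, while the line-based split returns the two intended sections.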

**Running it on all Evidently docs**

This takes a while and costs money. ``tqdm`` gives you a progress bar so you can track how far along it is.

In [21]:
from tqdm.auto import tqdm

evidently_chunks = []

for doc in tqdm(evidently_docs):
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)

print(f"LLM chunks: {len(evidently_chunks)}")


  0%|          | 0/95 [00:00<?, ?it/s]

LLM chunks: 702


**How to Choose a Chunking Approach**

Start with the simplest one and only increase complexity if needed:

1. **Sliding window** — try this first, it handles most cases
2. **Section splitting** — use when your docs have clear markdown structure
3. **LLM chunking** — only when simpler methods produce poor results and you have the budget

That's Day 2. Move beyond the sliding window only if your search results aren't good enough after you evaluate them. Day 3 covers putting all this data into a search engine.

## **Day 3: Add Search**


Our data is downloaded and chunked. Now we put it inside a search engine so we can quickly find relevant information when users ask questions. Three approaches: text search, vector search, and hybrid search.

#### **1. Text Search**

**Install minsearch**

Run this in your terminal first:

``aihero/venv/bin/python -m pip install minsearch``

**Reminder — preparing the Evidently chunks**

If you're starting fresh or restarted your kernel, you need the chunks from Day 1 and Day 2 loaded first. This is what the data looks like going into search:

In [22]:
evidently_docs = read_repo_data('evidentlyai', 'docs')

evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)

**Indexing Evidently chunks**

We create an index that searches through four text fields: the chunk content, title, description, and filename. Then we feed all our chunks into it.

In [23]:
from minsearch import Index

index = Index(
    text_fields=["chunk", "title", "description", "filename"],
    keyword_fields=[]
)

index.fit(evidently_chunks)

<minsearch.minsearch.Index at 0x136491160>

**Searching the Evidently index**

Text search finds documents containing words from the query. More matching words means higher relevance.

In [24]:
query = 'What should be in a test dataset for AI evaluation?'
results = index.search(query)
print(results[0])

{'start': 0, 'chunk': 'Retrieval-Augmented Generation (RAG) systems rely on retrieving answers from a knowledge base before generating responses. To evaluate them effectively, you need a test dataset that reflects what the system *should* know.\n\nInstead of manually creating test cases, you can generate them directly from your knowledge source, ensuring accurate and relevant ground truth data.\n\n## Create a RAG test dataset\n\nYou can generate ground truth RAG dataset from your data source.\n\n### 1. Create a Project\n\nIn the Evidently UI, start a new Project or open an existing one.\n\n* Navigate to “Datasets” in the left menu.\n* Click “Generate” and select the “RAG” option.\n\n![](/images/synthetic/synthetic_data_select_method.png)\n\n### 2. Upload your knowledge base\n\nSelect a file containing the information your AI system retrieves from. Supported formats: Markdown (.md), CSV, TXT, PDFs. Choose how many inputs to generate.\n\n![](/images/synthetic/synthetic_data_inputs_exampl
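For a rough intuition of "more matching words means higher relevance", here is a toy scorer that counts the words a query and a document share. This is a simplified illustration only, not how minsearch scores documents internally, and the sample documents are made up.

```python
import re

def tokenize(text):
    # Lowercase and keep alphanumeric words only.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, document):
    # Number of distinct words shared by the query and the document.
    return len(tokenize(query) & tokenize(document))

toy_docs = [
    "How to create a test dataset for evaluation",
    "Installing the library with pip",
    "Evaluation metrics for AI systems",
]
toy_query = "What should be in a test dataset for AI evaluation?"

toy_scores = [overlap_score(toy_query, d) for d in toy_docs]
ranked = sorted(toy_docs, key=lambda d: overlap_score(toy_query, d), reverse=True)

print(toy_scores)  # [5, 0, 3]
print(ranked[0])
```

A real text index typically also weights rare words more heavily than common ones like "a" or "for", which this sketch ignores.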

**Indexing the FAQ data**

For the DataTalksClub FAQ, we don't need chunking — the entries are small enough already. We filter for just the data engineering course questions and search through the question and content fields.

In [25]:
de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)

<minsearch.minsearch.Index at 0x111733d90>

**Searching the FAQ index**

In [26]:
query = 'Can I still join the course after the start date?'
results = faq_index.search(query)
print(results[0])

{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}


#### **2. Vector Search**

Text search has a blind spot — if someone asks "can I still enroll?" instead of "can I still join?", text search might miss it because the words are different even though the meaning is the same. Vector search fixes this by turning text into numerical representations (embeddings) that capture meaning, not just exact words.

**Install sentence-transformers**

Run in your terminal:

``aihero/venv/bin/python -m pip install sentence-transformers``

**Loading the embedding model**

The ``multi-qa-distilbert-cos-v1`` model is trained specifically for question-answering tasks. It creates embeddings where similar meanings end up close together in vector space.

In [27]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

**How embeddings and similarity work**

We combine the question and answer text from one FAQ entry, turn it into a vector, do the same for a query, and compute how similar they are using the dot product. This model produces unit-length embeddings, so the dot product equals cosine similarity: values closer to 1 mean high similarity, values closer to 0 mean low.

In [28]:
record = de_dtc_faq[2]
text = record['question'] + ' ' + record['content']
v_doc = embedding_model.encode(text)

query = 'I just found out about the course. Can I enroll now?'
v_query = embedding_model.encode(query)

similarity = v_query.dot(v_doc)
print(f"Similarity: {similarity}")

Similarity: 0.5190933346748352
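Using a plain dot product as a similarity score works when the vectors have length 1, because then it matches cosine similarity. The sketch below checks this on two made-up 2D vectors using only the standard library. Whether a given embedding model returns unit-length vectors is an assumption worth verifying for that model.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    # Scale the vector so its length becomes 1.
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [3.0, 4.0]
b = [4.0, 3.0]

cosine = cosine_similarity(a, b)
dot_of_units = dot(normalize(a), normalize(b))

print(cosine)
print(math.isclose(cosine, dot_of_units))  # True
```

If the vectors are not normalized, the raw dot product also grows with vector length, so longer vectors can score higher without being closer in meaning.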


**Creating embeddings for all FAQ documents**

This loops through every FAQ entry, combines the question and answer text, and turns each one into a vector. Takes a bit of time so ``tqdm`` gives us a progress bar.

In [29]:
from tqdm.auto import tqdm
import numpy as np

faq_embeddings = []

for d in tqdm(de_dtc_faq):
    text = d['question'] + ' ' + d['content']
    v = embedding_model.encode(text)
    faq_embeddings.append(v)

faq_embeddings = np.array(faq_embeddings)

  0%|          | 0/484 [00:00<?, ?it/s]

**Building the vector search index**

``VectorSearch`` from minsearch takes our embeddings and the original documents, then lets us find the most similar ones to any query.

In [30]:
from minsearch import VectorSearch

faq_vindex = VectorSearch()
faq_vindex.fit(faq_embeddings, de_dtc_faq)

<minsearch.vector.VectorSearch at 0x1423bd160>

**Testing vector search**

We encode the query into a vector first, then search.

In [31]:
query = 'Can I join the course now?'
q = embedding_model.encode(query)
results = faq_vindex.search(q)
print(results[0])

{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}
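Conceptually, this kind of search scores every stored vector against the query vector and returns the best matches. The following is a minimal brute-force sketch of that idea with made-up 2D vectors and a hypothetical function name. It is not the minsearch implementation.

```python
def top_k_by_dot_product(query_vector, doc_vectors, documents, k=2):
    # Score each document by the dot product with the query vector.
    scored = []
    for vector, document in zip(doc_vectors, documents):
        score = sum(q * d for q, d in zip(query_vector, vector))
        scored.append((score, document))

    # Highest score first, then keep the top k documents.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [document for _, document in scored[:k]]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_documents = [{"id": "a"}, {"id": "b"}, {"id": "c"}]

hits = top_k_by_dot_product([0.0, 1.0], toy_vectors, toy_documents, k=2)
print([h["id"] for h in hits])  # ['b', 'c']
```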


**Creating a vector index for Evidently docs**

Same process but for the Evidently chunks. We only use the ``chunk`` field for embeddings here.

In [32]:
evidently_embeddings = []

for d in tqdm(evidently_chunks):
    v = embedding_model.encode(d['chunk'])
    evidently_embeddings.append(v)

evidently_embeddings = np.array(evidently_embeddings)

evidently_vindex = VectorSearch()
evidently_vindex.fit(evidently_embeddings, evidently_chunks)

  0%|          | 0/576 [00:00<?, ?it/s]

<minsearch.vector.VectorSearch at 0x1576f2fd0>

#### **3. Hybrid Search**

Text search is fast and catches exact keywords. Vector search understands meaning and handles synonyms. Combining both gives us the best of both worlds — that's hybrid search.

In [33]:
query = 'Can I join the course now?'

text_results = faq_index.search(query, num_results=5)

q = embedding_model.encode(query)
vector_results = faq_vindex.search(q, num_results=5)

final_results = text_results + vector_results
print(f"Total results: {len(final_results)}")

Total results: 10


#### **4. Putting This Together**

**Organizing search into reusable functions**

Before we can hand this off to an agent, we need clean functions it can call. The hybrid search also deduplicates results so we don't get the same document twice.

In [34]:
def text_search(query):
    return faq_index.search(query, num_results=5)

def vector_search(query):
    q = embedding_model.encode(query)
    return faq_vindex.search(q, num_results=5)

def hybrid_search(query):
    text_results = text_search(query)
    vector_results = vector_search(query)
    
    # Combine and deduplicate results
    seen_ids = set()
    combined_results = []

    for result in text_results + vector_results:
        if result['filename'] not in seen_ids:
            seen_ids.add(result['filename'])
            combined_results.append(result)
    
    return combined_results
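``hybrid_search`` keeps text results first and appends any new vector results after them. If you would rather rank documents found by both methods higher, one common option (not used in this notebook) is reciprocal rank fusion. The sketch below uses a hypothetical function name, made-up result lists, and the conventional constant ``k=60``.

```python
def reciprocal_rank_fusion(result_lists, key="filename", k=60):
    scores = {}
    first_seen = {}

    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            doc_id = doc[key]
            # Each list contributes 1 / (k + rank) for a document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
            first_seen.setdefault(doc_id, doc)

    ordered_ids = sorted(scores, key=scores.get, reverse=True)
    return [first_seen[doc_id] for doc_id in ordered_ids]

toy_text = [{"filename": "a.md"}, {"filename": "b.md"}, {"filename": "c.md"}]
toy_vector = [{"filename": "b.md"}, {"filename": "d.md"}]

merged = reciprocal_rank_fusion([toy_text, toy_vector])
print([d["filename"] for d in merged])  # ['b.md', 'a.md', 'd.md', 'c.md']
```

Here ``b.md`` moves to the top because it appears in both lists, and duplicates are removed as a side effect of keying the scores by filename.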

**Testing all three search functions**

In [35]:
query = 'Can I still enroll in the course?'

print("Text search:")
print(text_search(query)[0])

print("\nVector search:")
print(vector_search(query)[0])

print("\nHybrid search:")
print(hybrid_search(query)[0])

Text search:
{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}

Vector search:
{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-star

#### **5. Selecting the Best Approach**

Start simple with text search — it's faster, easier to debug, and works well for many use cases. Only add vector or hybrid search when text search isn't cutting it. We'll cover evaluation methods later in the course to help make that decision with data instead of guessing.

That's all of Day 3. You now have a working search system with three approaches. Day 4 covers building the conversational agent that actually uses these search functions to answer questions.

## **Day 4: Agents and Tools**

We spent three days preparing data. Now we finally build the agent that uses it. An agent is just an LLM that can call external functions (tools) to look things up. Without tools, it's just a chatbot guessing. With tools, it can actually look up answers.

#### **1. LLM Without Tools**

**What happens when you ask a question with no search access**

The LLM has no idea about your specific course — it just gives a generic answer.

In [36]:
import openai

openai_client = openai.OpenAI()

user_prompt = "I just discovered the course, can I join now?"

chat_messages = [
    {"role": "user", "content": user_prompt}
]

response = openai_client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
)

print(response.output_text)

To determine if you can join the course at this point, please check the following:

1. **Enrollment Deadlines**: Look for any deadlines for enrollment. Many courses have specific start and end dates.

2. **Prerequisites**: Some courses may have prerequisites that you need to meet before joining.

3. **Availability**: Check if the course is still accepting new students. Some courses may have a cap on enrollment.

4. **Instructor's Policies**: Review the course policies provided by the instructor or institution regarding late enrollment.

If you're unsure, it's best to reach out directly to the course instructor or administrative office for guidance.


#### **2. Function Calling with OpenAI**

**Defining the search function**

This is the text search from Day 3. Make sure ``faq_index`` is already loaded.

In [37]:
def text_search(query):
    return faq_index.search(query, num_results=5)

**Describing the function for OpenAI**

We can't just pass a Python function to OpenAI. We need to describe it in a structured format so the LLM knows what it does, what parameters it takes, and when to call it.

In [38]:
text_search_tool = {
    "type": "function",
    "name": "text_search",
    "description": "Search the FAQ database",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query text to look up in the course FAQ."
            }
        },
        "required": ["query"],
        "additionalProperties": False
    }
}

**Sending the question with the tool available**

Now when we ask a question, the LLM can decide to call our search function instead of guessing. We pass the tool description alongside the messages.

In [39]:
system_prompt = """
You are a helpful assistant for a course. 
"""

question = "I just discovered the course, can I join now?"

chat_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question}
]

response = openai_client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=[text_search_tool]
)

**Inspecting the tool call**

The LLM didn't answer directly — it decided it needs to search first. Let's see what it wants to search for.

In [40]:
print(response.output)

[ResponseFunctionToolCall(arguments='{"query":"join course"}', call_id='call_x6S47Hpd2ekhfHinRkVI9yGF', name='text_search', type='function_call', id='fc_0fb2e36e70b772c500698dd007e9d881948f815b3770721194', status='completed')]


**Executing the tool call and sending results back**

The LLM told us what function to call and with what arguments. We run the function, package the results, and send everything back so the LLM can formulate a proper answer.

LLMs are stateless — each API call is independent. So we need to send the full conversation history: the original system prompt, the user question, the tool call decision, and the tool results.

In [41]:
import json

call = response.output[0]

arguments = json.loads(call.arguments)
result = text_search(**arguments)

call_output = {
    "type": "function_call_output",
    "call_id": call.call_id,
    "output": json.dumps(result),
}

chat_messages.append(call)
chat_messages.append(call_output)

response = openai_client.responses.create(
    model='gpt-4o-mini',
    input=chat_messages,
    tools=[text_search_tool]
)

print(response.output_text)

Yes, you can still join the course even after it has started! Here are some details:

- **Eligibility:** You can submit homework even if you don't register, but do note that there will be deadlines for turning in homeworks and the final projects.
- **Next Start Date:** The next cohort starts on January 12th, 2026. You can register before this date using the [registration link](https://airtable.com/shr6oVXeQvSI5HuWD).

Feel free to join the course announcements on [Telegram](https://t.me/dezoomcamp) and make sure to register in DataTalks.Club's Slack! If you have more questions or need further assistance, feel free to ask.
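The cell above handles exactly one tool call and assumes it is the first item in ``response.output``. If the model asks for several searches in a row, the same steps need to repeat until it returns a plain answer. Here is a hedged sketch of such a loop. The function name ``run_tool_loop`` and its parameters are hypothetical, and the API call is passed in as a callable so the loop itself stays independent of the client.

```python
import json

def run_tool_loop(create_response, messages, tool_functions, max_rounds=5):
    """
    create_response: callable that takes the message list and returns a
        Responses-style object with .output (list of items) and .output_text.
    tool_functions: dict mapping tool name to the Python function to run.
    """
    for _ in range(max_rounds):
        response = create_response(messages)
        calls = [item for item in response.output if item.type == "function_call"]

        # No tool requests means the model produced its final answer.
        if not calls:
            return response.output_text

        # Keep everything the model produced in the history,
        # then add one result entry per tool call.
        messages.extend(response.output)
        for call in calls:
            arguments = json.loads(call.arguments)
            result = tool_functions[call.name](**arguments)
            messages.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": json.dumps(result),
            })

    raise RuntimeError("Tool loop did not finish within max_rounds")
```

With the objects from this notebook, ``create_response`` could be ``lambda msgs: openai_client.responses.create(model='gpt-4o-mini', input=msgs, tools=[text_search_tool])`` and ``tool_functions`` could be ``{"text_search": text_search}``. The ``max_rounds`` cap is there so a model that keeps requesting searches cannot loop forever.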


#### **3. System Prompt: Instructions**

**Why the system prompt matters**

The system prompt controls how the agent behaves. More complete instructions generally produce better results. Here's a more detailed version:

In [42]:
system_prompt = """
You are a helpful assistant for a course. 

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""

**Encouraging multiple searches**

If you want the agent to try harder when the first search doesn't give enough information, you can adjust the prompt like this:

In [43]:
system_prompt = """
You are a helpful assistant for a course. 

Always search for relevant information before answering. 
If the first search doesn't give you enough information, try different search terms.

Make multiple searches if needed to provide comprehensive answers.
"""

#### **4. Pydantic AI**

**Why use a framework**

Manually handling function calls, parsing responses, and managing conversation history is tedious and error-prone. Pydantic AI handles all of that for us.

**Install pydantic-ai**

Run in your terminal:

``aihero/venv/bin/python -m pip install pydantic-ai``

**Redefining text_search with type hints**

Pydantic AI needs type hints and docstrings on the function — it uses those to automatically generate the tool description (instead of us writing the JSON by hand).

In [44]:
from typing import List, Any

def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)
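To get a feel for how a tool description can be derived from a function, here is a toy sketch built on the standard ``inspect`` module. It is not Pydantic AI's implementation. The helper ``describe_function``, the type mapping, and ``demo_search`` are all made up for illustration, and only a few simple types are handled.

```python
import inspect

# Minimal mapping from Python annotations to JSON schema type names.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def describe_function(func):
    signature = inspect.signature(func)
    properties = {}
    required = []

    for name, param in signature.parameters.items():
        properties[name] = {"type": PY_TO_JSON.get(param.annotation, "string")}
        # Parameters without a default value must be supplied by the caller.
        if param.default is inspect.Parameter.empty:
            required.append(name)

    return {
        "type": "function",
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
            "additionalProperties": False,
        },
    }

def demo_search(query: str, num_results: int = 5) -> list:
    """Search the FAQ database."""
    return []

tool = describe_function(demo_search)
print(tool["name"], tool["parameters"]["required"])  # demo_search ['query']
```

The result has the same shape as the hand-written ``text_search_tool`` dictionary, which is why clear type hints and a docstring matter for tools.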

**Creating the agent**

We pass the function directly — no JSON description needed. Pydantic AI figures it out from the type hints and docstring.

In [45]:
from pydantic_ai import Agent

agent = Agent(
    name="faq_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)

**Running the agent**

In a Jupyter notebook we can use ``await`` directly because ``agent.run()`` is asynchronous and the notebook already runs an event loop.

In [46]:
question = "I just discovered the course, can I join now?"

result = await agent.run(user_prompt=question)
print(result.output)

Yes, you can still join the course even after it has started. While registration is encouraged, you are eligible to submit homework assignments even if you don't officially register. However, keep in mind that there are deadlines for submitting homework and final projects, so it's best to stay on schedule and not leave things until the last minute.


**If you're NOT in Jupyter**

If you're running this as a regular Python script instead of a notebook, use ``asyncio.run()``:


``import asyncio``

``result = asyncio.run(agent.run(user_prompt=question))``

``print(result.output)``


**Looking inside the agent's reasoning**

This shows the full breakdown of what happened — the user prompt, the tool call decision, the search results, and the final answer.

In [47]:
result.new_messages()

[ModelRequest(parts=[UserPromptPart(content='I just discovered the course, can I join now?', timestamp=datetime.datetime(2026, 2, 12, 13, 5, 23, 860682, tzinfo=datetime.timezone.utc))], instructions="You are a helpful assistant for a course. \n\nAlways search for relevant information before answering. \nIf the first search doesn't give you enough information, try different search terms.\n\nMake multiple searches if needed to provide comprehensive answers.", run_id='667491e8-90e5-4729-84a9-0abfaf1d9fe8'),
 ModelResponse(parts=[ToolCallPart(tool_name='text_search', args='{"query":"Can I join the course now?"}', tool_call_id='call_2vcAebfxvb89AgJmxUJmXh1D')], usage=RequestUsage(input_tokens=146, output_tokens=20, details={'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}), model_name='gpt-4o-mini-2024-07-18', timestamp=datetime.datetime(2026, 2, 12, 13, 5, 24, tzinfo=TzInfo(0)), provider_name='openai', provider_url='https://api.ope

That's all of Day 4. You now have a working agent that can look up real answers from your data instead of guessing. Day 5 covers how to evaluate whether the agent is actually doing a good job.

## **Day 5: Evaluation**

We built the agent yesterday. But is it actually good? Today we build a proper evaluation system: logging, LLM as a judge, test data generation, and metrics.

#### **Logging**

**Setting up the agent (recap from Day 4)**

If you restarted your kernel, you need this running first.

In [48]:
from typing import List, Any
from pydantic_ai import Agent


def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)


system_prompt = """
You are a helpful assistant for a course.

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""


agent = Agent(
    name="faq_agent",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)


**Testing the agent**

In [49]:
question = "how do I install Kafka in Python?"
result = await agent.run(user_prompt=question)
print(result.output)

To install Kafka-related libraries in Python, you can follow these steps:

1. **Install `confluent-kafka`:**
   - Using pip:
     ```bash
     pip install confluent-kafka
     ```
   - Using conda:
     ```bash
     conda install conda-forge::python-confluent-kafka
     ```

2. **Install `fastavro` (if required):**
   ```bash
   pip install fastavro
   ```

3. **(Optional) Install `kafka-python`:** 
   If needed, you can install a specific version of `kafka-python`:
   ```bash
   pip install kafka-python==1.4.6
   ```

4. **(Alternative) If you encounter issues:**
   You might consider installing an alternative package:
   ```bash
   pip install kafka-python-ng
   ```

These commands will set up the necessary dependencies to work with Kafka in Python. Make sure you have Python and pip installed on your system before executing these commands.


**Building the log entry extractor**

This pulls out everything we want to record from the agent and the run result: the config, tools, and full message history. ``ModelMessagesTypeAdapter`` converts internal Pydantic AI message objects into regular Python dictionaries so we can save them as JSON.

In [50]:
from pydantic_ai.messages import ModelMessagesTypeAdapter


def log_entry(agent, messages, source="user"):
    tools = []

    for ts in agent.toolsets:
        tools.extend(ts.tools.keys())

    dict_messages = ModelMessagesTypeAdapter.dump_python(messages)

    return {
        "agent_name": agent.name,
        "system_prompt": agent._instructions,
        "provider": agent.model.system,
        "model": agent.model.model_name,
        "tools": tools,
        "messages": dict_messages,
        "source": source
    }


**Writing logs to files**

Creates a ``logs`` folder, generates unique filenames using timestamp and random hex, and saves each interaction as a JSON file.

In [51]:
import json
import secrets
from pathlib import Path
from datetime import datetime


LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)


def serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    ts = entry['messages'][-1]['timestamp']
    ts_obj = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    ts_str = ts_obj.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath
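
To see what the ``default=serializer`` argument and the filename pattern do, here is a minimal sketch that runs without the agent. The ``entry`` dictionary is a hypothetical, much smaller stand-in for a real log record, and the timestamp is assumed to already be a ``datetime`` object.

```python
import json
import secrets
from datetime import datetime, timezone


def serializer(obj):
    # Same idea as the serializer above: datetimes become ISO strings
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


# Hypothetical entry, much smaller than a real log record
entry = {
    "agent_name": "faq_agent",
    "messages": [
        {"timestamp": datetime(2026, 2, 12, 13, 5, 24, tzinfo=timezone.utc)}
    ],
}

# Build the filename from the last message's timestamp plus 3 random bytes
ts = entry["messages"][-1]["timestamp"]
ts_str = ts.strftime("%Y%m%d_%H%M%S")
filename = f"{entry['agent_name']}_{ts_str}_{secrets.token_hex(3)}.json"

# json.dumps calls serializer() for every object it cannot encode itself
as_json = json.dumps(entry, default=serializer)
restored = json.loads(as_json)

print(filename)
print(restored["messages"][0]["timestamp"])
```

After a round trip through JSON the timestamp comes back as a string, not a ``datetime``, so code that reads the logs has to parse it again if it needs date arithmetic.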

**Interactive vibe checking with logging**

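Type a question, get a response, and the interaction gets saved to a log file automatically. The first cell redefines ``log_interaction_to_file`` so that it accepts the last message's timestamp either as an ISO string or as a ``datetime`` object. Try questions like: "how do I use docker on windows?", "can I join late and get a certificate?", "what do I need to do for the certificate?"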

In [52]:
def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    ts = entry['messages'][-1]['timestamp']
    
    if isinstance(ts, str):
        ts_obj = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    else:
        ts_obj = ts
    
    ts_str = ts_obj.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath

In [55]:
question = input()
result = await agent.run(user_prompt=question)
print(result.output)
log_interaction_to_file(agent, result.new_messages())

 history


I couldn't find specific information related to "history" in the course materials. 

If you're looking for the historical context of a certain topic within the course, please provide a bit more detail so I can assist you better. Otherwise, I can offer general guidance on history-related topics relevant to the course subject.


PosixPath('logs/faq_agent_v2_20260212_130616_9ac490.json')

#### **Adding References**

**Updating the system prompt to include citations**

When vibe checking, we noticed the agent doesn't include references to source documents. We fix that by telling it to cite filenames and link to the GitHub repo.

In [56]:
system_prompt = """
You are a helpful assistant for a course.  

Use the search tool to find relevant information from the course materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.  
""".strip()

# Create another version of agent, let's call it faq_agent_v2
agent = Agent(
    name="faq_agent_v2",
    instructions=system_prompt,
    tools=[text_search],
    model='gpt-4o-mini'
)


#### **LLM as a Judge**

**The evaluation prompt**

Instead of manually checking every response, we use another LLM to evaluate our agent's output. The prompt defines a checklist of things to check: did it follow instructions, is the answer relevant, does it include citations, etc.

In [57]:
evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent's answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met. 

Checklist:

- instructions_follow: The agent followed the user's instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do  
- answer_relevant: The response directly addresses the user's question  
- answer_clear: The answer is clear and correct  
- answer_citations: The response includes proper citations or sources when required  
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: Is the search tool invoked? 

Output true/false for each check and provide a short explanation for your judgment.
""".strip()


**Defining the structured output**

We use Pydantic models so the evaluation LLM returns a well-defined structure we can process programmatically. Putting ``justification`` before ``check_pass`` makes the LLM reason about the answer before giving its final judgment.

In [58]:
from pydantic import BaseModel

class EvaluationCheck(BaseModel):
    check_name: str
    justification: str
    check_pass: bool

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str


**Creating the evaluation agent**

We use a different model (``gpt-5-nano``) to evaluate our agent's output (``gpt-4o-mini``). Using a different model reduces self-bias and gives a second opinion.

In [59]:
eval_agent = Agent(
    name='eval_agent',
    model='gpt-5-nano',
    instructions=evaluation_prompt,
    output_type=EvaluationChecklist
)


**The evaluation input template**

We use XML tags so the LLM can clearly see the boundaries between the instructions, the question, the answer, and the full conversation log.

In [60]:
user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

**Loading log files**

A helper to load a saved JSON log and add the filename for tracking.




In [61]:
def load_log_file(log_file):
    with open(log_file, 'r') as f_in:
        log_data = json.load(f_in)
        log_data['log_file'] = log_file
        return log_data


**Check what log files exist**

In [63]:
import os
for f in os.listdir('./logs'):
    print(f)

faq_agent_v2_20260212_130616_9ac490.json
faq_agent_20260212_130551_00991a.json
faq_agent_20260210_205544_311005.json


**Running the evaluation on a single log**

Load a log file, extract the key pieces, format them into the evaluation prompt, and run the eval agent. Replace the filename with one of your actual log files from the ``logs/`` folder.

In [64]:
log_record = load_log_file('./logs/faq_agent_v2_20260212_130616_9ac490.json')

instructions = log_record['system_prompt']
question = log_record['messages'][0]['parts'][0]['content']
answer = log_record['messages'][-1]['parts'][0]['content']
log = json.dumps(log_record['messages'])

user_prompt = user_prompt_format.format(
    instructions=instructions,
    question=question,
    answer=answer,
    log=log
)

result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)

checklist = result.output
print(checklist.summary)

for check in checklist.checklist:
    print(check)

Tool search initiated to locate 'history' references in course materials. Awaiting results to craft a precise answer with citations or provide general guidance if not found.
check_name='instructions_follow' justification="We will use a search tool as instructed to look for 'history' in course materials before answering." check_pass=True
check_name='instructions_avoid' justification='No disallowed content; we will only search and provide guidance.' check_pass=True
check_name='answer_relevant' justification='We will search and then answer with relevant course material references if found; otherwise provide general guidance.' check_pass=True
check_name='answer_clear' justification='We will provide a concise, clear response with references or guidance.' check_pass=True
check_name='answer_citations' justification='We will cite the filename(s) from the course materials using the required format when used.' check_pass=True
check_name='completeness' justification='We will cover both search res

**Simplifying log messages to save tokens**

The full conversation log has a lot of noise (timestamps, IDs, full search results). We strip that out to reduce token cost and speed up evaluation. The actual search results get replaced with a placeholder since the eval agent doesn't need them.




In [65]:
def simplify_log_messages(messages):
    log_simplified = []

    for m in messages:
        parts = []
    
        for original_part in m['parts']:
            part = original_part.copy()
            kind = part['part_kind']

            # pop(..., None) tolerates parts that lack a given key
            if kind == 'user-prompt':
                part.pop('timestamp', None)
            if kind == 'tool-call':
                part.pop('tool_call_id', None)
            if kind == 'tool-return':
                part.pop('tool_call_id', None)
                part.pop('metadata', None)
                part.pop('timestamp', None)
                # the eval agent doesn't need the actual search results
                part['content'] = 'RETURN_RESULTS_REDACTED'
            if kind == 'text':
                part.pop('id', None)

            parts.append(part)
    
        message = {
            'kind': m['kind'],
            'parts': parts
        }
    
        log_simplified.append(message)
    return log_simplified
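
To get a feel for how much this saves, here is a small self-contained sketch. ``simplify`` is a condensed restatement of the function above (not a replacement for it), and the messages are hand-written, hypothetical examples shaped like a dumped log.

```python
import json

# Keys to drop for each part kind, mirroring the function above
DROP_KEYS = {
    "user-prompt": ["timestamp"],
    "tool-call": ["tool_call_id"],
    "tool-return": ["tool_call_id", "metadata", "timestamp"],
    "text": ["id"],
}


def simplify(messages):
    simplified = []
    for m in messages:
        parts = []
        for original_part in m["parts"]:
            part = dict(original_part)  # shallow copy, original stays intact
            for key in DROP_KEYS.get(part["part_kind"], []):
                part.pop(key, None)
            if part["part_kind"] == "tool-return":
                part["content"] = "RETURN_RESULTS_REDACTED"
            parts.append(part)
        simplified.append({"kind": m["kind"], "parts": parts})
    return simplified


# Hypothetical messages, hand-written for this demo
messages = [
    {"kind": "request", "parts": [
        {"part_kind": "user-prompt", "content": "can I join late?",
         "timestamp": "2026-02-12T13:05:23Z"},
    ]},
    {"kind": "response", "parts": [
        {"part_kind": "tool-call", "tool_name": "text_search",
         "args": '{"query": "join late"}', "tool_call_id": "call_1"},
    ]},
    {"kind": "request", "parts": [
        {"part_kind": "tool-return", "tool_name": "text_search",
         "content": ["a long search result"] * 50,
         "tool_call_id": "call_1", "metadata": None,
         "timestamp": "2026-02-12T13:05:24Z"},
    ]},
    {"kind": "response", "parts": [
        {"part_kind": "text", "content": "Yes, you can.", "id": None},
    ]},
]

before = len(json.dumps(messages))
after = len(json.dumps(simplify(messages)))
print(f"characters before: {before}, after: {after}")
```

Character count is only a rough proxy for tokens, but most of the reduction comes from redacting the tool results.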


**Putting evaluation together into one function**

Combines simplifying, formatting, and running the eval into a single reusable function. It takes an already loaded log record.

In [66]:
async def evaluate_log_record(eval_agent, log_record):
    messages = log_record['messages']

    instructions = log_record['system_prompt']
    question = messages[0]['parts'][0]['content']
    answer = messages[-1]['parts'][0]['content']

    log_simplified = simplify_log_messages(messages)
    log = json.dumps(log_simplified)

    user_prompt = user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )

    result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)
    return result.output


**Testing the combined function**

In [71]:
log_record = load_log_file('./logs/faq_agent_v2_20260212_130616_9ac490.json')
eval1 = await evaluate_log_record(eval_agent, log_record)

**Viewing the evaluation results**

In [70]:
print(eval1.summary)

for check in eval1.checklist:
    print(check)

Preparing to search course materials for 'history' to provide cited references.


#### **Data Generation**

**Creating a question generator**

Instead of manually typing questions, we get AI to generate realistic test questions based on our FAQ content. We sample records from the FAQ data (``de_dtc_faq``), feed them to the generator, and it creates questions a student might actually ask.

In [72]:
question_generation_prompt = """
You are helping to create test questions for an AI agent that answers questions about a data engineering course.

Based on the provided FAQ content, generate realistic questions that students might ask.

The questions should:

- Be natural and varied in style
- Range from simple to complex
- Include both specific technical questions and general course questions

Generate one question for each record.
""".strip()

class QuestionsList(BaseModel):
    questions: list[str]

question_generator = Agent(
    name="question_generator",
    instructions=question_generation_prompt,
    model='gpt-4o-mini',
    output_type=QuestionsList
)


**Generating questions from sampled records**

We pick 10 random FAQ entries, extract their content, and ask the generator to produce one question per entry.

In [73]:
import random

sample = random.sample(de_dtc_faq, 10)
prompt_docs = [d['content'] for d in sample]
prompt = json.dumps(prompt_docs)

result = await question_generator.run(prompt)
questions = result.output.questions


**Running the agent on generated questions and logging results**

Each generated question gets sent to our agent, the response gets printed, and everything gets logged with the source marked as ``'ai-generated'`` so we can filter for it later.

In [74]:
from tqdm.auto import tqdm

for q in tqdm(questions):
    print(q)

    result = await agent.run(user_prompt=q)
    print(result.output)

    log_interaction_to_file(
        agent,
        result.new_messages(),
        source='ai-generated'
    )

    print()


  0%|          | 0/4 [00:00<?, ?it/s]

What should I do if I encounter a java.lang.IllegalAccessError when running `spark-shell` after installing Java, Hadoop, and Spark?
If you encounter a `java.lang.IllegalAccessError` when running `spark-shell` after installing Java, Hadoop, and Spark, it’s likely due to an unsupported Java version. Specifically, this error can occur because:

- Spark 3.x requires Java 8, 11, or 16. If you are using Java 17 or 19, it will not be compatible.
  
To resolve the issue:

1. **Install Java 11**: You can download and install Java 11 from the official website.
2. After installation, ensure that your `JAVA_HOME` environment variable is set to point to the Java 11 installation and that your `PATH` variable includes the bin directory of the Java installation.

Here’s a reference for detailed guidance: [Spark-shell: unable to load native-hadoop library for platform - Windows](https://github.com/DataTalksClub/faq/blob/main/_questions/data-engineering-zoomcamp/module-5/004_a3790900cc_spark-shell-unabl

**Collecting AI-generated logs for evaluation**

We filter the log files to only include AI-generated interactions from the v2 agent.

In [75]:
eval_set = []

for log_file in LOG_DIR.glob('*.json'):
    if 'faq_agent_v2' not in log_file.name:
        continue

    log_record = load_log_file(log_file)
    if log_record['source'] != 'ai-generated':
        continue

    eval_set.append(log_record)


**Running evaluation on all logs**

In [76]:
eval_results = []

for log_record in tqdm(eval_set):
    eval_result = await evaluate_log_record(eval_agent, log_record)
    eval_results.append((log_record, eval_result))


  0%|          | 0/4 [00:00<?, ?it/s]

**Converting results to a DataFrame**

We extract the question, answer, and each check's pass/fail result into rows so Pandas can work with them.




In [77]:
rows = []

for log_record, eval_result in eval_results:
    messages = log_record['messages']

    row = {
        'file': log_record['log_file'].name,
        'question': messages[0]['parts'][0]['content'],
        'answer': messages[-1]['parts'][0]['content'],
    }

    checks = {c.check_name: c.check_pass for c in eval_result.checklist}
    row.update(checks)

    rows.append(row)


**Viewing results and calculating metrics**

Install pandas if you haven't:


``aihero/venv/bin/python -m pip install pandas``

Then create the DataFrame and calculate average pass rates per check.

In [78]:
import pandas as pd

df_evals = pd.DataFrame(rows)
df_evals.head()

| | file | question | answer | instructions_follow | instructions_avoid | answer_relevant | answer_clear | answer_citations | completeness | tool_call_search |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | faq_agent_v2_20260212_131434_c5f21b.json | Where can I find deadlines for the Data Engine... | Deadlines for the Data Engineering course in 2... | True | True | True | True | True | True | True |
| 1 | faq_agent_v2_20260212_131420_421684.json | Are there any recommendations for resources on... | Here are some recommended resources on Data En... | False | True | True | True | True | True | True |
| 2 | faq_agent_v2_20260212_131427_955b66.json | What precautions should I take when using BigQ... | When using BigQuery, there are several precaut... | True | | | | | | |
| 3 | faq_agent_v2_20260212_131414_1b4d72.json | What should I do if I encounter a java.lang.Il... | If you encounter a `java.lang.IllegalAccessErr... | True | True | True | True | True | True | True |


**Overall pass rates**

This gives you the big picture: what percentage of responses pass each check. The most important one is ``answer_relevant``: does the agent actually answer the user's question?


``df_evals.mean(numeric_only=True)``
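
In the results above, one row has empty cells for most checks, presumably because the judge returned only part of the checklist. Depending on how pandas infers the column types, columns with missing values may not be treated as numeric, so it is worth double-checking the averages. Here is a minimal plain-Python sketch that computes each pass rate over the rows where that check is actually present. The ``pass_rates`` helper and the sample rows are hypothetical, not part of the notebook.

```python
def pass_rates(rows, check_names):
    """Share of rows that pass each check, ignoring rows where it is missing."""
    rates = {}
    for name in check_names:
        values = [row[name] for row in rows if row.get(name) is not None]
        # True counts as 1 and False as 0 when summed
        rates[name] = sum(values) / len(values) if values else None
    return rates


# Hypothetical rows; None or an absent key means the judge skipped the check
sample_rows = [
    {"answer_relevant": True, "answer_citations": True},
    {"answer_relevant": True, "answer_citations": False},
    {"answer_relevant": None, "answer_citations": True},
    {"answer_relevant": True},
]

rates = pass_rates(sample_rows, ["answer_relevant", "answer_citations"])
print(rates)
```

Skipping missing values is a design choice. If an incomplete checklist should count as a failure instead, treat ``None`` as ``False`` before averaging.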

#### **Evaluating Search Quality (Bonus)**

**Hit rate and MRR calculation**

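This is a standalone function you can use to evaluate your search function separately from the agent. For each query it checks whether at least one relevant document shows up in the results (hit rate) and how high the first relevant one ranks (MRR, recorded as 1/rank, or 0 if nothing relevant is returned). You need to prepare ``test_queries`` as ``(query, expected_docs)`` pairs, where ``expected_docs`` holds the expected document filenames.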

In [79]:
def evaluate_search_quality(search_function, test_queries):
    results = []
    
    for query, expected_docs in test_queries:
        search_results = search_function(query, num_results=5)
        
        # Calculate hit rate
        relevant_found = any(doc['filename'] in expected_docs for doc in search_results)
        
        # Calculate MRR
        for i, doc in enumerate(search_results):
            if doc['filename'] in expected_docs:
                mrr = 1 / (i + 1)
                break
        else:
            mrr = 0
            
        results.append({
            'query': query,
            'hit': relevant_found,
            'mrr': mrr
        })
    return results
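
The function returns one record per query, so you still need to average them to get the two headline numbers. A minimal sketch follows. ``summarize_search_quality`` is a hypothetical helper, and the sample records are made up to match the shape returned above.

```python
def summarize_search_quality(results):
    """Average per-query records into overall hit rate and MRR."""
    n = len(results)
    if n == 0:
        return {"hit_rate": 0.0, "mrr": 0.0}

    return {
        # fraction of queries with at least one relevant result
        "hit_rate": sum(1 for r in results if r["hit"]) / n,
        # mean of 1/rank of the first relevant result (0 when none found)
        "mrr": sum(r["mrr"] for r in results) / n,
    }


# Hypothetical per-query records, shaped like the output above
sample_results = [
    {"query": "q1", "hit": True, "mrr": 1.0},    # relevant doc ranked 1st
    {"query": "q2", "hit": True, "mrr": 0.5},    # ranked 2nd
    {"query": "q3", "hit": False, "mrr": 0},     # not found
    {"query": "q4", "hit": True, "mrr": 0.25},   # ranked 4th
]

summary = summarize_search_quality(sample_results)
print(summary)
```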


That's all of Day 5. The big takeaway: start with manual vibe checks, log everything, then automate evaluation with LLM as a judge. Use the metrics to compare different prompts, search methods, and chunking strategies. Day 6 covers building a UI and deploying the agent.

## **Day 6: Publish Your Agent**

The agent lives in a Jupyter notebook right now. Time to clean it up, give it a proper UI, and put it on the internet.

#### **Cleaning Up**

**Project structure**

First, create a folder called ``app`` inside your ``course`` directory (you may already have one). Inside it, initialize a new uv project:


``cd app``

``uv init``

**Dependencies in pyproject.toml**

Update the ``dependencies`` section in your ``app/pyproject.toml``:

In [81]:
dependencies = [
    "minsearch>=0.0.5",
    "openai>=1.108.2",
    "pydantic-ai==1.0.9",
    "python-frontmatter>=1.1.0",
    "requests>=2.32.5",
]

Then install them:

``uv sync``

``ingest.py``

Handles downloading repos, chunking, and indexing. This combines everything from Days 1-3 into one file. A couple of improvements: the zip archive prefix gets stripped from filenames, and the sliding window uses ``content`` instead of ``chunk`` so the field name is consistent whether you chunk or not.
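
The file itself isn't shown here, so here is a rough sketch of the two improvements only. Both function names and their parameters are assumptions made for illustration. The sketch assumes every path in the zip starts with a single top-level folder such as ``faq-main/``.

```python
def strip_archive_prefix(filename):
    """Drop the top-level zip folder: 'faq-main/a/b.md' -> 'a/b.md'."""
    _, _, rest = filename.partition("/")
    # If there is no folder prefix, keep the name unchanged
    return rest or filename


def sliding_window(text, size, step):
    """Split text into overlapping chunks stored under the 'content' key."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start": start, "content": text[start:start + size]})
        # Stop once a chunk reaches the end of the text
        if start + size >= len(text):
            break
    return chunks


print(strip_archive_prefix("faq-main/_questions/a.md"))
print(sliding_window("abcdefghij", size=4, step=2))
```

Because chunks use the same ``content`` key as unchunked documents, the index can be built over one set of text fields either way.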

``search_tools.py``

The search function wrapped in a class instead of relying on a global variable. The index gets passed in at creation time.
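
A minimal sketch of what such a class could look like. ``SearchTool`` is an assumed name, and ``FakeIndex`` exists only so the example runs on its own. The real index is expected to expose ``search(query, num_results=...)`` like ``faq_index`` on Day 5.

```python
from typing import Any, List


class SearchTool:
    def __init__(self, index):
        # The index is injected here instead of being read from a global
        self.index = index

    def search(self, query: str) -> List[Any]:
        """
        Perform a text-based search on the index.

        Args:
            query (str): The search query string.

        Returns:
            List[Any]: A list of up to 5 search results.
        """
        return self.index.search(query, num_results=5)


class FakeIndex:
    """Stand-in for the real index, used only in this demo."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, query, num_results=5):
        q = query.lower()
        matches = [d for d in self.docs if q in d["content"].lower()]
        return matches[:num_results]


docs = [
    {"filename": "a.md", "content": "Install Kafka with pip"},
    {"filename": "b.md", "content": "Certificates require projects"},
]

tool = SearchTool(FakeIndex(docs))
results = tool.search("kafka")
print(results)
```

The bound method ``tool.search`` could then be handed to the agent in place of the module-level ``text_search`` function.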

``search_agent.py``

Creates the Pydantic AI agent. The system prompt is now a template so it works with any GitHub repo, not just hardcoded to one.
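
One way to turn the Day 5 prompt into a template is with ``str.format`` placeholders. This is a sketch only: the placeholder names ``repo_owner`` and ``repo_name``, the helper function, and the exact wording are assumptions, not the contents of the actual file.

```python
SYSTEM_PROMPT_TEMPLATE = """
You are a helpful assistant that answers questions about documentation.

Use the search tool to find relevant information from the materials before answering questions.

Always include references by citing the filename of the source material you used.
When citing the reference, use the full path to the GitHub repository:
"https://github.com/{repo_owner}/{repo_name}/blob/main/"
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.
""".strip()


def build_system_prompt(repo_owner, repo_name):
    # Fill in the repository so the same template works for any repo
    return SYSTEM_PROMPT_TEMPLATE.format(
        repo_owner=repo_owner,
        repo_name=repo_name,
    )


prompt = build_system_prompt("DataTalksClub", "faq")
print(prompt)
```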

``logs.py``

Same logging from Day 5, but the log directory is now configurable through an environment variable. Useful when deploying.
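
A minimal sketch of reading the log directory from the environment. The variable name ``LOGS_DIRECTORY`` and the ``get_log_dir`` helper are assumptions for illustration. Use whatever name your deployment sets.

```python
import os
from pathlib import Path


def get_log_dir(default="logs"):
    """Return the log directory, creating it if needed."""
    # LOGS_DIRECTORY is an assumed variable name; falls back to ./logs
    log_dir = Path(os.getenv("LOGS_DIRECTORY", default))
    log_dir.mkdir(parents=True, exist_ok=True)
    return log_dir
```

Calling ``LOG_DIR = get_log_dir()`` at the top of the module keeps the rest of the Day 5 logging code unchanged.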

``main.py`` **— Command-line interface**

Brings everything together. Downloads the data, indexes it, creates the agent, and runs an interactive loop in the terminal. Type ``stop`` to exit.
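
The interactive loop itself can be sketched without the agent. Here ``run_cli`` and ``echo_agent`` are hypothetical names, and the agent call is replaced by a stand-in function. Input and output are passed in as parameters so the loop can be exercised with scripted input.

```python
def run_cli(answer_fn, read_input=input, write=print):
    """Ask for questions until the user types 'stop'."""
    while True:
        question = read_input("Your question: ")
        if question.strip().lower() == "stop":
            write("Goodbye!")
            break
        write(answer_fn(question))


def echo_agent(question):
    # Hypothetical stand-in for the real agent call
    return f"(agent answer to: {question})"


# Scripted run: two "typed" inputs, output collected in a list
scripted = iter(["can I join late?", "stop"])
outputs = []
run_cli(echo_agent, read_input=lambda prompt: next(scripted), write=outputs.append)
print(outputs)
```

In a terminal you would call ``run_cli`` with the defaults and pass a function that runs the real agent and returns its output.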

**Running the CLI**

From the ``app`` folder:

``uv run python main.py``

#### **Streamlit UI**

**Install Streamlit**

``uv add streamlit``

**app.py — Basic version**

This is the simple version where the full response appears at once.

**Running the basic Streamlit app**

``uv run streamlit run app.py``

**app.py — Streaming version**

This version streams the response word by word instead of showing everything at once. Feels much more responsive.

#### **Deployment**

**Export dependencies for Streamlit Cloud**

Streamlit Cloud doesn't always work with uv directly, so export to requirements.txt:


``uv export --no-dev > requirements.txt``

**Deploy to Streamlit Cloud**

1. Push all your code to GitHub
2. Go to share.streamlit.io
3. Click "New app"
4. Point it to your GitHub repo and the app.py file
5. In the app settings, add your secret under "Secrets":

``OPENAI_API_KEY="your-key"``

Your app is now live and anyone with the link can interact with your agent.

That's all of Day 6. You went from a messy notebook to a clean, modular codebase with a web UI deployed on the internet. Day 7 covers wrapping up and sharing your results.