# GitHub Repository Data Ingestion

Building a conversational AI agent for GitHub repositories. This implementation downloads and processes markdown documentation from any repository.

**Target repositories:**
- DataTalksClub/faq - Course FAQ documentation
- evidentlyai/docs - Evidently AI library documentation

## Import Libraries

In [120]:
import io
import zipfile
import requests
import frontmatter

## Understanding Frontmatter

Frontmatter is a block of YAML metadata at the top of a markdown file, delimited by `---` lines. It is commonly used in sites built with Jekyll, Hugo, and Next.js.

Structure:
```yaml
---
title: "Getting Started with AI"
author: "John Doe"
tags: ["ai", "machine-learning"]
---

# Content here
```

In [63]:
# Parsing frontmatter
example = """---
title: "Getting Started with AI"
author: "John Doe"
tags: ["ai", "machine-learning"]
---

# Getting Started with AI

Main content here.
"""

post = frontmatter.loads(example)

print(post.metadata['title'])
print(post.metadata['tags'])
print(post.content)

Getting Started with AI
['ai', 'machine-learning']
# Getting Started with AI

Main content here.
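
To make the parsing step less opaque, here is a minimal standard-library sketch of how the `---` delimiters separate metadata from content. It is an illustration only, not how the `frontmatter` package is implemented. `split_frontmatter` is a hypothetical helper, and it returns the YAML block as raw text rather than parsing it.

```python
def split_frontmatter(text):
    """Return (raw_yaml, body); raw_yaml is None when there is no frontmatter."""
    lines = text.splitlines()
    # Frontmatter must start on the very first line.
    if not lines or lines[0].strip() != '---':
        return None, text
    # Look for the closing delimiter.
    for i in range(1, len(lines)):
        if lines[i].strip() == '---':
            raw_yaml = '\n'.join(lines[1:i])
            body = '\n'.join(lines[i + 1:]).strip()
            return raw_yaml, body
    # No closing delimiter: treat the whole text as content.
    return None, text


sample = """---
title: "Getting Started with AI"
tags: ["ai", "machine-learning"]
---

# Getting Started with AI

Main content here.
"""

raw_yaml, body = split_frontmatter(sample)
print(raw_yaml)
print(body)
```

Files without a leading `---` line come back unchanged, which mirrors how the repository loop later treats plain markdown files as content with no metadata.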


## Download Repository as Zip

GitHub serves a zip archive of a branch at `https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{branch}`:

In [64]:
url = 'https://codeload.github.com/DataTalksClub/faq/zip/refs/heads/main'
resp = requests.get(url)
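
The URL above hardcodes the `main` branch. A small sketch of building it for other repositories follows. `build_zip_url` and its `branch` parameter are hypothetical names introduced here, and no request is made.

```python
def build_zip_url(repo_owner, repo_name, branch='main'):
    """Build the codeload URL for a branch archive (no request is made)."""
    return (
        f'https://codeload.github.com/{repo_owner}/{repo_name}'
        f'/zip/refs/heads/{branch}'
    )


print(build_zip_url('DataTalksClub', 'faq'))
```

Repositories whose default branch is not `main` need a different `branch` value. After `requests.get`, calling `resp.raise_for_status()` is one way to fail early on a bad download.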

## Process Zip in Memory

In [65]:
repository_data = []

zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    if not filename.endswith('.md'):
        continue

    with zf.open(file_info) as f_in:
        content = f_in.read()
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)

zf.close()
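
The same in-memory pattern can be tried without a network call. The sketch below builds a tiny archive with made-up file names to stand in for `resp.content`, then filters markdown files the same way. It uses a `with` block so the archive is closed even if reading fails.

```python
import io
import zipfile

# Build a small archive in memory to stand in for the downloaded bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf_out:
    zf_out.writestr('repo-main/README.md', '# Title\n\nHello')
    zf_out.writestr('repo-main/src/app.py', 'print("hi")')
    zf_out.writestr('repo-main/docs/Guide.MD', '# Guide')
zip_bytes = buffer.getvalue()

markdown_files = {}
# The context manager closes the archive even if reading fails.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    for file_info in zf.infolist():
        # Compare in lowercase so '.MD' is matched too.
        if not file_info.filename.lower().endswith('.md'):
            continue
        with zf.open(file_info) as f_in:
            markdown_files[file_info.filename] = f_in.read().decode('utf-8')

print(sorted(markdown_files))
```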

In [66]:
print(repository_data[1])

{'content': '# DataTalks.Club FAQ\n\nA static site generator for DataTalks.Club course FAQs with automated AI-powered FAQ maintenance.\n\n## Features\n\n- **Static Site Generation**: Converts markdown FAQs to a beautiful, searchable HTML site\n- **Automated FAQ Management**: AI-powered bot that processes new FAQ proposals\n- **Intelligent Triage**: Automatically determines if proposals should create new entries, update existing ones, or are duplicates\n- **GitHub Integration**: Seamless workflow via GitHub Issues and Pull Requests\n\n## Project Structure\n\n```\nfaq/\n├── _questions/              # FAQ content organized by course\n│   ├── machine-learning-zoomcamp/\n│   │   ├── _metadata.yaml   # Course configuration\n│   │   ├── general/         # General course questions\n│   │   ├── module-1/        # Module-specific questions\n│   │   └── ...\n│   ├── data-engineering-zoomcamp/\n│   └── ...\n├── _layouts/                # Jinja2 HTML templates\n│   ├── base.html\n│   ├── course.htm

## Reusable Function

Handles both `.md` and `.mdx` files for broader compatibility.

In [67]:
def read_repo_data(repo_owner, repo_name):
    """
    Download and parse markdown files from GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries with content and metadata
    """
    url = f'https://codeload.github.com/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data

## Download Target Repositories

In [68]:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")

FAQ documents: 1232
Evidently documents: 95


## Next Steps

FAQ data is ready for indexing. Some Evidently docs are large (20k+ characters) and need chunking first.

Chunking breaks large documents into smaller pieces for:
- Better search relevance
- Improved AI model performance
- Token limit compliance

## Document Chunking

Large documents require splitting into smaller pieces. Some Evidently docs exceed 20k characters, which creates problems with token limits, cost, and answer quality.

In [69]:
# Inspect sample document
print(evidently_docs[45])
print(f"\nContent length: {len(evidently_docs[45]['content'])} characters")


Content length: 21712 characters


## 1. Sliding Window Chunking

Simple approach: cut into equal pieces with overlap.

Without overlap (problematic):
- Chunk 1: 0..2000
- Chunk 2: 2000..4000
- Issues: context loss, incomplete sentences

With overlap (recommended):
- Chunk 1: 0..2000
- Chunk 2: 1000..3000
- Chunk 3: 2000..4000
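
The boundaries listed above can be reproduced with a few lines. This is a sketch that only computes `(start, end)` pairs, using a hypothetical helper `window_bounds`. The overlap between neighbouring windows is `size - step`.

```python
def window_bounds(n, size, step):
    """Return (start, end) pairs for windows of `size` taken every `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    bounds = []
    for start in range(0, n, step):
        bounds.append((start, min(start + size, n)))
        # Stop once a window reaches the end of the sequence.
        if start + size >= n:
            break
    return bounds


print(window_bounds(4000, 2000, 1000))  # with overlap
print(window_bounds(4000, 2000, 2000))  # without overlap
```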

In [70]:
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result

In [71]:
# Test on document 45
text = evidently_docs[45]['content']
chunks = sliding_window(text, 2000, 1000)
print(f"Generated {len(chunks)} chunks")

Generated 21 chunks
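
The count of 21 can be checked by hand. Assuming `step <= size`, the last window starts at the first multiple of `step` that is at least `n - size`, so the number of windows is `ceil((n - size) / step) + 1`. The helper name `expected_chunks` is introduced here for illustration.

```python
import math


def expected_chunks(n, size, step):
    """Number of windows for a sequence of length n (assumes step <= size)."""
    if n == 0:
        return 0
    if n <= size:
        return 1
    return math.ceil((n - size) / step) + 1


# 21712 characters, windows of 2000 every 1000:
# ceil(19712 / 1000) + 1 = 20 + 1
print(expected_chunks(21712, 2000, 1000))
```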


In [102]:
# Process all documents
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)

print(f"Total chunks: {len(evidently_chunks)}")

Total chunks: 576


Result: 576 chunks from 95 documents. Size 2000 is typically sufficient for RAG applications.

## 2. Paragraph Splitting

Uses regex pattern `\n\s*\n` to split on double newlines.

In [73]:
import re

text = evidently_docs[45]['content']
paragraphs = re.split(r"\n\s*\n", text.strip())
print(f"Paragraphs: {len(paragraphs)}")

Paragraphs: 153


Works for literature but less effective for technical docs with short paragraphs.
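
A small sample shows what the pattern treats as a paragraph break. The text below is made up. Note that a line containing only spaces still counts as blank, because `\s*` absorbs it.

```python
import re

sample = (
    "First paragraph,\nstill first.\n\n"
    "Second paragraph.\n   \n\n"
    "Third paragraph."
)

# A break is a newline, optional whitespace, then another newline.
paragraphs = re.split(r"\n\s*\n", sample.strip())
for p in paragraphs:
    print(repr(p))
```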

## 3. Section Splitting

Splits by markdown headers using document structure.

In [74]:
def split_markdown_by_level(text, level=2):
    """
    Split markdown text by a specific header level.
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        header = parts[i] + parts[i+1]
        header = header.strip()

        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections
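
The loop above steps through `parts` three at a time because the pattern has two capture groups, and `split` returns captured groups alongside the text between matches. A short sketch on made-up text shows the layout:

```python
import re

pattern = re.compile(r'^(#{2} )(.+)$', re.MULTILINE)
sample = (
    "Intro text\n"
    "## Setup\nInstall it.\n"
    "### Details\nMore.\n"
    "## Usage\nRun it."
)

parts = pattern.split(sample)
# Layout: [preamble, '## ', title, body, '## ', title, body, ...]
print(parts)
```

Two consequences follow from this layout: text before the first level-2 header (`parts[0]`) is not included in any section, and deeper headers such as `### Details` stay inside their parent section.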

In [75]:
# Test section splitting
text = evidently_docs[45]['content']
sections = split_markdown_by_level(text, level=2)
print(f"Sections: {len(sections)}")

Sections: 8


In [76]:
# Process all documents
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)

print(f"Sections: {len(evidently_chunks)}")

Sections: 266


## 4. LLM-Based Chunking

Use when:
- Complex, non-standard document structure
- Semantic coherence required
- Quality prioritized over cost

Requires OpenAI API key:
```bash
export OPENAI_API_KEY='your-api-key'
uv add openai
```

In [77]:
from openai import OpenAI

openai_client = OpenAI()

def llm(prompt, model='gpt-4o-mini'):
    messages = [
        {"role": "user", "content": prompt}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages
    )

    return response.choices[0].message.content

In [78]:
prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.

Each section should be self-contained and cover
a specific topic or concept.

<DOCUMENT>
{document}
</DOCUMENT>

Use this format:

## Section Name

Section content with all relevant details

---

## Another Section Name

Another section content

---
""".strip()

In [79]:
def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    response = llm(prompt)
    sections = response.split('---')
    sections = [s.strip() for s in sections if s.strip()]
    return sections
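
Splitting on the bare string `'---'` also cuts wherever three dashes appear inside a section. One possible stricter variant, sketched below with a canned response instead of an API call, splits only on lines that contain nothing but `---`. `split_sections` is a hypothetical helper, not part of the notebook.

```python
import re


def split_sections(response):
    """Split on lines that contain only '---' (a stricter variant)."""
    pieces = re.split(r'^[ \t]*---[ \t]*$', response, flags=re.MULTILINE)
    return [p.strip() for p in pieces if p.strip()]


# Canned text standing in for an LLM response.
canned_response = (
    "## Install\n\nUse the flag --verbose or a---b style names.\n\n"
    "---\n\n"
    "## Usage\n\nRun the tool.\n\n"
    "---"
)

naive = [s.strip() for s in canned_response.split('---') if s.strip()]
strict = split_sections(canned_response)
print(len(naive), len(strict))
```

Neither variant guards against the model ignoring the requested format, so checking the number and size of returned sections is still worthwhile.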

In [80]:
from tqdm.auto import tqdm

evidently_chunks = []

for doc in tqdm(evidently_docs):
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)

print(f"LLM chunks: {len(evidently_chunks)}")

  0%|          | 0/95 [00:00<?, ?it/s]

LLM chunks: 741


## Chunking Strategy

Start simple with sliding window. Evaluate performance, then consider section-based or LLM chunking if needed. Next: indexing the chunked data for search.

## Search Implementation

This section covers indexing and searching the processed data. Three approaches are implemented:
- Text search for exact matches
- Vector search for semantic similarity
- Hybrid search combining both methods

## 1. Text Search

Text search finds documents containing query words. More matching words indicate higher relevance.
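
The idea of "more matching words, higher relevance" can be sketched with a toy scorer that counts shared words. This is only an illustration of the principle and is not how minsearch scores documents. The function names and documents below are made up.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))


def overlap_search(query, docs, field='content'):
    """Rank docs by how many distinct query words they contain."""
    query_words = tokenize(query)
    scored = []
    for doc in docs:
        score = len(query_words & tokenize(doc[field]))
        if score > 0:
            scored.append((score, doc))
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


toy_docs = [
    {'content': 'Homework deadlines are strict.'},
    {'content': 'You can join the course after the start date.'},
    {'content': 'The course uses Python.'},
]

results = overlap_search('Can I join the course now?', toy_docs)
print(results[0]['content'])
```

A counter like this treats common words such as "the" the same as rare ones, which is one reason practical text search weights terms instead of only counting them.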

Install minsearch:
```bash
uv add minsearch
```

In [81]:
from minsearch import Index

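# Index Evidently chunks.
# The text field must match the key produced by the chunking step:
# "section" for section-based or LLM chunks, "chunk" for sliding-window chunks.
index = Index(
    text_fields=["section", "title", "description", "filename"],
    keyword_fields=[]
)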

index.fit(evidently_chunks)

<minsearch.minsearch.Index at 0x11f0ae850>

In [82]:
# Test search
query = 'What should be in a test dataset for AI evaluation?'
results = index.search(query)
print(results[0])

{'title': 'RAG evaluation dataset', 'description': 'Synthetic data for RAG.', 'filename': 'docs-main/synthetic-data/rag_data.mdx', 'section': '## Create a RAG Test Dataset\n\nYou can generate a ground truth RAG dataset from your data source through the following steps.\n\n### 1. Create a Project\n\nIn the Evidently UI, start a new Project or open an existing one.\n\n* Navigate to “Datasets” in the left menu.\n* Click “Generate” and select the “RAG” option.\n\n![Select Method](https://via.placeholder.com/150)\n\n### 2. Upload Your Knowledge Base\n\nSelect a file containing the information your AI system retrieves from. Supported formats include Markdown (.md), CSV, TXT, and PDFs. Choose how many inputs to generate.\n\n![Upload Example](https://via.placeholder.com/150)\n\nSimply drop the file, then:\n\n* Choose the number of inputs to generate.\n* Choose if you want to include the context used to generate the answer.\n\n![Upload Example](https://via.placeholder.com/150)\n\nThe system aut

In [83]:
# Index FAQ data
de_dtc_faq = [d for d in dtc_faq if 'data-engineering' in d['filename']]

faq_index = Index(
    text_fields=["question", "content"],
    keyword_fields=[]
)

faq_index.fit(de_dtc_faq)

<minsearch.minsearch.Index at 0x11eff5940>

In [84]:
query = 'Can I still join the course after the start date?'
results = faq_index.search(query)
print(results[0])

{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}


## 2. Vector Search

Vector search uses embeddings to find semantically similar documents, even when words differ.

Install sentence-transformers:
```bash
uv add sentence-transformers
```

In [92]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')

In [93]:
# Create document embedding
record = de_dtc_faq[2]
text = record['question'] + ' ' + record['content']
v_doc = embedding_model.encode(text)

# Create query embedding
query = 'I just found out about the course. Can I enroll now?'
v_query = embedding_model.encode(query)

# Compute similarity
similarity = v_query.dot(v_doc)
print(f"Similarity: {similarity}")

Similarity: 0.5190933346748352
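
The cell above scores similarity with a plain dot product. For unit-length vectors the dot product equals cosine similarity. The sketch below checks that on toy vectors in plain Python. Whether the embedding model returns unit-length vectors is an assumption here and is not verified in this notebook.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


v_a = [3.0, 4.0]
v_b = [4.0, 3.0]

cos = cosine_similarity(v_a, v_b)
dot_normalized = dot(normalize(v_a), normalize(v_b))
print(cos, dot_normalized)
```

If embeddings are not normalized, the raw dot product also reflects vector length, so longer vectors can score higher regardless of direction.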


In [94]:
from tqdm.auto import tqdm
import numpy as np

# Create embeddings for all FAQ documents
faq_embeddings = []

for d in tqdm(de_dtc_faq):
    text = d['question'] + ' ' + d['content']
    v = embedding_model.encode(text)
    faq_embeddings.append(v)

faq_embeddings = np.array(faq_embeddings)

  0%|          | 0/449 [00:00<?, ?it/s]

In [95]:
from minsearch import VectorSearch

faq_vindex = VectorSearch()
faq_vindex.fit(faq_embeddings, de_dtc_faq)

<minsearch.vector.VectorSearch at 0x32810ccd0>

In [96]:
# Test vector search
query = 'Can I join the course now?'
q = embedding_model.encode(query)
results = faq_vindex.search(q)
print(results[0])

{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}


In [103]:
# Create vector index for Evidently docs.
# This cell expects sliding-window chunks (key 'chunk'); re-run the sliding window cell first.
evidently_embeddings = []

for d in tqdm(evidently_chunks):
    v = embedding_model.encode(d['chunk'])
    evidently_embeddings.append(v)

evidently_embeddings = np.array(evidently_embeddings)

evidently_vindex = VectorSearch()
evidently_vindex.fit(evidently_embeddings, evidently_chunks)

  0%|          | 0/576 [00:00<?, ?it/s]

<minsearch.vector.VectorSearch at 0x32810cb90>

## 3. Hybrid Search

Combines text and vector search so that both exact-match and paraphrased queries are covered.

**Text search advantages:**
- Fast and efficient
- Works well for exact matches
- No model inference needed

**Vector search advantages:**
- Captures semantic meaning
- Handles paraphrased questions
- Works with synonyms

**Hybrid search:** Best of both approaches.

In [99]:
query = 'Can I join the course now?'

# Get results from both methods
text_results = faq_index.search(query, num_results=5)

q = embedding_model.encode(query)
vector_results = faq_vindex.search(q, num_results=5)

# Combine results
final_results = text_results + vector_results
print(f"Total results: {len(final_results)}")

Total results: 10


## 4. Search Functions

Organizing search code into reusable functions.

In [100]:
def text_search(query):
    return faq_index.search(query, num_results=5)

def vector_search(query):
    q = embedding_model.encode(query)
    return faq_vindex.search(q, num_results=5)

def hybrid_search(query):
    text_results = text_search(query)
    vector_results = vector_search(query)
    
    # Combine and deduplicate results
    seen_ids = set()
    combined_results = []

    for result in text_results + vector_results:
        if result['id'] not in seen_ids:
            seen_ids.add(result['id'])
            combined_results.append(result)
    
    return combined_results
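
The deduplication step can be exercised on toy results without any index. `merge_unique` below is a hypothetical generalization of the loop above to any number of result lists. The ids are made up.

```python
def merge_unique(*result_lists, key='id'):
    """Concatenate result lists, keeping the first occurrence of each key."""
    seen = set()
    merged = []
    for results in result_lists:
        for result in results:
            if result[key] not in seen:
                seen.add(result[key])
                merged.append(result)
    return merged


text_hits = [{'id': 'a'}, {'id': 'b'}, {'id': 'c'}]
vector_hits = [{'id': 'b'}, {'id': 'd'}, {'id': 'a'}]

merged = merge_unique(text_hits, vector_hits)
print([r['id'] for r in merged])
```

The merged order is "all text hits, then new vector hits", not a joint relevance ranking, so the first element of a hybrid result is always the top text hit when text search returns anything.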

In [101]:
# Test search functions
query = 'Can I still enroll in the course?'

print("Text search:")
print(text_search(query)[0])

print("\nVector search:")
print(vector_search(query)[0])

print("\nHybrid search:")
print(hybrid_search(query)[0])

Text search:
{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md'}

Vector search:
{'id': '3f1424af17', 'question': 'Course: Can I still join the course after the start date?', 'sort_order': 3, 'content': "Yes, even if you don't register, you're still eligible to submit the homework.\n\nBe aware, however, that there will be deadlines for turning in homeworks and the final projects. So don't leave everything for the last minute.", 'filename': 'faq-main/_questions/data-engineering-zoomcamp/general/003_3f1424af17_course-can-i-still-join-the-course-after-the-star

## Search Strategy

Start with text search - it's faster, easier to debug, and works well for many cases. Add vector or hybrid search only when text search proves insufficient. Next: building the conversational agent.

## Agents and Tools

An agent is an LLM that can invoke tools - external functions for retrieving information, performing calculations, or taking actions. Tool access makes systems "agentic".

This section implements an agent using the search function created previously.

## 1. LLM Without Tools

Testing LLM response without tool access:

In [104]:
import openai

openai_client = openai.OpenAI()

user_prompt = "I just discovered the course, can I join now?"

chat_messages = [
    {"role": "user", "content": user_prompt}
]

response = openai_client.chat.completions.create(
    model='gpt-4o-mini',
    messages=chat_messages,
)

print(response.choices[0].message.content)

It depends on the specific course you’re interested in. Many courses have set enrollment periods or deadlines, while others may allow rolling admissions or open enrollment. If it's an online course, you might be able to join anytime. I recommend checking the course provider's website or contacting their support team for the most accurate information regarding enrollment options.


Response is generic without access to course-specific information.

## 2. Function Calling with OpenAI

Implementing tool access through OpenAI's function calling API.

In [105]:
# Function description for OpenAI
text_search_tool = {
    "type": "function",
    "function": {
        "name": "text_search",
        "description": "Search the FAQ database",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query text to look up in the course FAQ."
                }
            },
            "required": ["query"],
            "additionalProperties": False
        }
    }
}

In [106]:
system_prompt = """
You are a helpful assistant for a course. 
"""

question = "I just discovered the course, can I join now?"

chat_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": question}
]

response = openai_client.chat.completions.create(
    model='gpt-4o-mini',
    messages=chat_messages,
    tools=[text_search_tool]
)

print(response.choices[0].message.tool_calls)

[ChatCompletionMessageFunctionToolCall(id='call_j91VXQdG5aumf9KJmOFVfRzx', function=Function(arguments='{"query":"join the course"}', name='text_search'), type='function')]


In [107]:
import json

# Execute the tool call
tool_call = response.choices[0].message.tool_calls[0]
arguments = json.loads(tool_call.function.arguments)
result = text_search(**arguments)

# Format response
chat_messages.append(response.choices[0].message)
chat_messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": json.dumps(result)
})

# Get final response
response = openai_client.chat.completions.create(
    model='gpt-4o-mini',
    messages=chat_messages,
    tools=[text_search_tool]
)

print(response.choices[0].message.content)

Yes, you can still join the course even after the start date. It is recommended that you register before the course starts, which for the next cohort is on January 13th, 2025. However, if you miss the registration, you are still eligible to submit homework assignments. Do keep in mind that there will be deadlines for submitting homework and final projects, so it's a good idea not to procrastinate.

If you're interested, you can register using [this link](https://airtable.com/shr6oVXeQvSI5HuWD) and join the [course Telegram channel](https://t.me/dezoomcamp) for announcements.
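
The cell above handles exactly one tool call by indexing `tool_calls[0]`. A more general dispatch step might look like the sketch below, which runs every requested call through a name-to-function registry. It uses `SimpleNamespace` stand-ins shaped like the tool call objects instead of a real API response. `execute_tool_calls` and `fake_text_search` are hypothetical names.

```python
import json
from types import SimpleNamespace


def execute_tool_calls(tool_calls, registry):
    """Run each requested tool and return the messages to append to the chat."""
    messages = []
    for tool_call in tool_calls:
        name = tool_call.function.name
        if name not in registry:
            # Report the problem back to the model instead of raising.
            content = json.dumps({'error': f'unknown tool: {name}'})
        else:
            arguments = json.loads(tool_call.function.arguments)
            content = json.dumps(registry[name](**arguments))
        messages.append({
            'role': 'tool',
            'tool_call_id': tool_call.id,
            'content': content,
        })
    return messages


# Stand-in for the real search function.
def fake_text_search(query):
    return [{'question': 'Can I join late?', 'matched': query}]


# Stand-in for an item of response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id='call_1',
    function=SimpleNamespace(
        name='text_search',
        arguments='{"query": "join the course"}',
    ),
)

tool_messages = execute_tool_calls([fake_call], {'text_search': fake_text_search})
print(tool_messages[0]['content'])
```

When the model answers directly, `message.tool_calls` is empty or `None`, so real code should check for that before dispatching.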


## 3. System Prompt Instructions

System prompts control agent behavior. More detailed instructions typically yield better results.

In [108]:
system_prompt = """
You are a helpful assistant for a course. 

Use the search tool to find relevant information from the course materials before answering questions.

If you can find specific information through search, use it to provide accurate answers.
If the search doesn't return relevant results, let the user know and provide general guidance.
"""

## 4. Pydantic AI

Pydantic AI simplifies agent implementation by handling function calling complexity.

Install:
```bash
uv add pydantic-ai
```

In [109]:
from typing import List, Any

def text_search(query: str) -> List[Any]:
    """
    Perform a text-based search on the FAQ index.

    Args:
        query (str): The search query string.

    Returns:
        List[Any]: A list of up to 5 search results returned by the FAQ index.
    """
    return faq_index.search(query, num_results=5)

In [110]:
from pydantic_ai import Agent

agent = Agent(
    'openai:gpt-4o-mini',
    system_prompt=system_prompt,
    tools=[text_search]
)
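
Part of the complexity being handled is the tool description that was written by hand earlier. The sketch below shows one way such a description could be derived from a function's signature and docstring. It is an illustration of the idea only, not Pydantic AI's actual implementation. `describe_tool`, `PY_TO_JSON`, and `sample_search` are hypothetical names.

```python
import inspect

# Minimal mapping from Python annotations to JSON Schema types.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_tool(func):
    """Build an OpenAI-style tool description from a function signature."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string".
        properties[name] = {"type": PY_TO_JSON.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
                "additionalProperties": False,
            },
        },
    }


def sample_search(query: str, num_results: int = 5):
    """Search the FAQ database."""
    return []


schema = describe_tool(sample_search)
print(schema["function"]["parameters"])
```

This is why the type hints and docstring on `text_search` matter: they are the information a framework has available when describing the tool to the model.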

In [114]:
question = "I just discovered the course, can I join now?"

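# Jupyter already runs an event loop, so the coroutine can be awaited directly.
# In a plain script, use: result = agent.run_sync(question)
result = await agent.run(question)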
print(result.output)

Yes, you can still join the course even after it has started. Although registering is preferred, you are eligible to submit homework without registration. However, keep in mind that there will be deadlines for submitting homework and final projects, so it’s advisable to manage your time effectively and not leave everything until the last minute.

For more information, you can check [here](https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html) to register and get updates.


In [115]:
# Inspect agent reasoning
result.all_messages()

[ModelRequest(parts=[SystemPromptPart(content="\nYou are a helpful assistant for a course. \n\nUse the search tool to find relevant information from the course materials before answering questions.\n\nIf you can find specific information through search, use it to provide accurate answers.\nIf the search doesn't return relevant results, let the user know and provide general guidance.\n", timestamp=datetime.datetime(2026, 1, 20, 13, 26, 3, 870509, tzinfo=datetime.timezone.utc)), UserPromptPart(content='I just discovered the course, can I join now?', timestamp=datetime.datetime(2026, 1, 20, 13, 26, 3, 870520, tzinfo=datetime.timezone.utc))], run_id='bfc8e3a6-4d37-4499-89fc-6642f93f4554'),
 ModelResponse(parts=[ToolCallPart(tool_name='text_search', args='{"query":"join course late enrollment"}', tool_call_id='call_UGkTsGAUjM2eE1Jh5XH7enT5')], usage=RequestUsage(input_tokens=162, output_tokens=17, details={'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_

Pydantic AI handles conversation history, tool execution, and response parsing automatically. Next: evaluating agent performance.

## Evaluation

Agent performance requires systematic measurement. This section covers:
- Logging system for tracking interactions
- Automated evaluation using LLM as judge
- Test data generation
- Performance metrics calculation

## 1. Logging System

Manual testing doesn't scale. A logging system records agent interactions for systematic evaluation.

In [116]:
from pydantic_ai.messages import ModelMessagesTypeAdapter


def log_entry(agent, messages, source="user"):
    tools = []

    for ts in agent.toolsets:
        tools.extend(ts.tools.keys())

    dict_messages = ModelMessagesTypeAdapter.dump_python(messages)

    return {
        "agent_name": agent.name,
        "system_prompt": agent._instructions,
        "provider": agent.model.system,
        "model": agent.model.model_name,
        "tools": tools,
        "messages": dict_messages,
        "source": source
    }

In [117]:
import json
import secrets
from pathlib import Path
from datetime import datetime


LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)


def serializer(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


def log_interaction_to_file(agent, messages, source='user'):
    entry = log_entry(agent, messages, source)

    ts = entry['messages'][-1]['timestamp']
    # dump_python() may return datetime objects; only parse when it is a string
    if isinstance(ts, str):
        ts = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    ts_str = ts.strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)

    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    filepath = LOG_DIR / filename

    with filepath.open("w", encoding="utf-8") as f_out:
        json.dump(entry, f_out, indent=2, default=serializer)

    return filepath
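
The `default=` hook is what lets `json.dump` write log entries that contain `datetime` values. A standalone sketch of the round trip, with a made-up timestamp:

```python
import json
from datetime import datetime, timezone


def serializer(obj):
    """Fallback for json.dumps: encode datetimes as ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Type {type(obj)} not serializable")


entry = {'timestamp': datetime(2025, 1, 20, 12, 0, 0, tzinfo=timezone.utc)}
encoded = json.dumps(entry, default=serializer)
print(encoded)

# Reading the file back yields a string, which fromisoformat can parse.
decoded = json.loads(encoded)
restored = datetime.fromisoformat(decoded['timestamp'])
print(restored == entry['timestamp'])
```

Timestamps in loaded log files are therefore plain strings, which matters for any later code that compares or sorts them.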

In [None]:
# Test interactive logging
question = input("Enter question: ")
result = await agent.run(question)
print(result.output)
log_interaction_to_file(agent, result.all_messages())

## 2. Adding References

Adjusting system prompt to include source citations.

In [None]:
system_prompt = """
You are a helpful assistant for a course.  

Use the search tool to find relevant information from the course materials before answering questions.  

If you can find specific information through search, use it to provide accurate answers.

Always include references by citing the filename of the source material you used.  
When citing the reference, replace "faq-main" by the full path to the GitHub repository: "https://github.com/DataTalksClub/faq/blob/main/"
Format: [LINK TITLE](FULL_GITHUB_LINK)

If the search doesn't return relevant results, let the user know and provide general guidance.  
""".strip()

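# Create v2 agent.
# The name is used in log filenames and matched later ('faq_agent_v2').
# `instructions` (rather than `system_prompt`) is what log_entry() reads, and it
# keeps the user prompt as the first part of the first logged message.
agent = Agent(
    'openai:gpt-4o-mini',
    name='faq_agent_v2',
    instructions=system_prompt,
    tools=[text_search]
)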

## 3. LLM as Judge

Using one LLM to evaluate another's outputs through structured criteria.

In [None]:
evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent's answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met. 

Checklist:

- instructions_follow: The agent followed the user's instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do  
- answer_relevant: The response directly addresses the user's question  
- answer_clear: The answer is clear and correct  
- answer_citations: The response includes proper citations or sources when required  
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: Is the search tool invoked? 

Output true/false for each check and provide a short explanation for your judgment.
""".strip()

In [None]:
from pydantic import BaseModel

class EvaluationCheck(BaseModel):
    check_name: str
    justification: str
    check_pass: bool

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str

In [None]:
eval_agent = Agent(
    'openai:gpt-4o-mini',
    system_prompt=evaluation_prompt,
    output_type=EvaluationChecklist
)

In [None]:
user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

In [None]:
def load_log_file(log_file):
    log_file = Path(log_file)  # accept both str and Path
    with log_file.open('r', encoding='utf-8') as f_in:
        log_data = json.load(f_in)
    log_data['log_file'] = log_file
    return log_data

In [None]:
# Load and evaluate a log
log_record = load_log_file('./logs/faq_agent_v2_20250120_120000_abc123.json')

instructions = log_record['system_prompt']
question = log_record['messages'][0]['parts'][0]['content']
answer = log_record['messages'][-1]['parts'][0]['content']
log = json.dumps(log_record['messages'])

user_prompt = user_prompt_format.format(
    instructions=instructions,
    question=question,
    answer=answer,
    log=log
)

result = await eval_agent.run(user_prompt)

checklist = result.output
print(checklist.summary)

for check in checklist.checklist:
    print(check)

In [None]:
def simplify_log_messages(messages):
    log_simplified = []

    for m in messages:
        parts = []
    
        for original_part in m['parts']:
            part = original_part.copy()
            kind = part['part_kind']
    
            if kind == 'user-prompt':
                del part['timestamp']
            if kind == 'tool-call':
                del part['tool_call_id']
            if kind == 'tool-return':
                del part['tool_call_id']
                del part['metadata']
                del part['timestamp']
                part['content'] = 'RETURN_RESULTS_REDACTED'
            if kind == 'text':
                del part['id']
    
            parts.append(part)
    
        message = {
            'kind': m['kind'],
            'parts': parts
        }
    
        log_simplified.append(message)
    return log_simplified

In [None]:
async def evaluate_log_record(eval_agent, log_record):
    messages = log_record['messages']

    instructions = log_record['system_prompt']
    question = messages[0]['parts'][0]['content']
    answer = messages[-1]['parts'][0]['content']

    log_simplified = simplify_log_messages(messages)
    log = json.dumps(log_simplified)

    user_prompt = user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )

    result = await eval_agent.run(user_prompt)
    return result.output

## 4. Data Generation

Generating test questions using AI to expand evaluation dataset.

In [None]:
question_generation_prompt = """
You are helping to create test questions for an AI agent that answers questions about a data engineering course.

Based on the provided FAQ content, generate realistic questions that students might ask.

The questions should:

- Be natural and varied in style
- Range from simple to complex
- Include both specific technical questions and general course questions

Generate one question for each record.
""".strip()

class QuestionsList(BaseModel):
    questions: list[str]

question_generator = Agent(
    'openai:gpt-4o-mini',
    system_prompt=question_generation_prompt,
    output_type=QuestionsList
)

In [None]:
import random

sample = random.sample(de_dtc_faq, 10)
prompt_docs = [d['content'] for d in sample]
prompt = json.dumps(prompt_docs)

result = await question_generator.run(prompt)
questions = result.output.questions

In [None]:
from tqdm.auto import tqdm

for q in tqdm(questions):
    print(q)

    result = await agent.run(q)
    print(result.output)

    log_interaction_to_file(
        agent,
        result.all_messages(),
        source='ai-generated'
    )

    print()

## 5. Batch Evaluation

Evaluating multiple logs and calculating aggregate metrics.

In [None]:
eval_set = []

for log_file in LOG_DIR.glob('*.json'):
    if 'faq_agent_v2' not in log_file.name:
        continue

    log_record = load_log_file(log_file)
    if log_record['source'] != 'ai-generated':
        continue

    eval_set.append(log_record)

In [None]:
eval_results = []

for log_record in tqdm(eval_set):
    eval_result = await evaluate_log_record(eval_agent, log_record)
    eval_results.append((log_record, eval_result))

In [None]:
rows = []

for log_record, eval_result in eval_results:
    messages = log_record['messages']

    row = {
        'file': log_record['log_file'].name,
        'question': messages[0]['parts'][0]['content'],
        'answer': messages[-1]['parts'][0]['content'],
    }

    checks = {c.check_name: c.check_pass for c in eval_result.checklist}
    row.update(checks)

    rows.append(row)

In [None]:
import pandas as pd

df_evals = pd.DataFrame(rows)
df_evals.head()

In [119]:
# Calculate average pass rates
df_evals.mean(numeric_only=True)


## Evaluation Results

Average pass rates indicate agent performance across criteria. Key metric: `answer_relevant` shows how often the agent's response directly addresses the user's question. These metrics guide improvements to prompts, search methods, and chunking approaches.
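
For boolean check columns, the pass rate is the fraction of rows where the check is true. The sketch below computes this in plain Python on made-up rows, which is roughly what `df_evals.mean(numeric_only=True)` is expected to produce for those columns. `pass_rates` and `toy_rows` are hypothetical.

```python
def pass_rates(rows):
    """Average each boolean column across rows; other columns are skipped."""
    totals = {}
    counts = {}
    for row in rows:
        for key, value in row.items():
            if isinstance(value, bool):
                totals[key] = totals.get(key, 0) + int(value)
                counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical rows, shaped like the ones built in the batch evaluation.
toy_rows = [
    {'file': 'a.json', 'answer_relevant': True, 'answer_citations': True},
    {'file': 'b.json', 'answer_relevant': True, 'answer_citations': False},
    {'file': 'c.json', 'answer_relevant': False, 'answer_citations': False},
    {'file': 'd.json', 'answer_relevant': True, 'answer_citations': False},
]

rates = pass_rates(toy_rows)
print(rates)
```

With only a handful of logged interactions each rate moves in large steps, so small differences between prompt versions should be read with caution.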