In [1]:
import io
import zipfile
import requests
import frontmatter

# Day 1: Ingest and Index Your Data
Welcome to our crash course!

In this course, you'll learn how to build intelligent systems that can understand and interact with your data.

We'll create a conversational agent that can answer questions about any GitHub repository - think of it as your personal AI assistant for documentation and code. If you know DeepWiki, it's something similar, but tailored to your GitHub repo.

For that, we need to:
- Download and process data from the repo
- Put it inside a search engine
- Make the search engine available to our agent

In the first half of the course, we will focus on data preparation.

Today, we will do the first part: downloading the data.
## GitHub Repo Data
On the first day, we will learn how to download and process data from any GitHub repository. We will download the data as a zip archive, process all the text data from there, and make it available for later ingestion into a search engine.

Think of this as preparing a meal - we need to gather and prep all our ingredients (the data) before we can cook (build our AI agent).

Today, we will deal with simple cases, when documents are not large.

Tomorrow we will deal with more complex cases when documents are big and we also have code.
## Environment Setup
First, let's prepare the environment. We need Python 3.10 or higher.

We will use uv as the package manager. If you don't have uv, let's install it:
```
pip install uv
```
Next, create a folder aihero with two subfolders:
- course - here you will reproduce all the examples from this email course
- project - here you will create your own project
Now go to course and run:
```
uv init
uv add requests python-frontmatter
uv add --dev jupyter
```
This will initialize an empty Python project with uv and install several libraries:
- requests for downloading data from GitHub
- python-frontmatter for parsing structured metadata in markdown files
- jupyter (in dev mode)
We install jupyter as a dev dependency because it's only used for development and experimentation, not in the final production code.

Let's start Jupyter:
```
uv run jupyter notebook
```

## Understanding Frontmatter
We installed python-frontmatter for parsing frontmatter - a metadata block at the top of a Markdown file, commonly used by static site generators and frameworks like Jekyll, Hugo, and Next.js.

It looks like this:
```
---
title: "Getting Started with AI"
author: "John Doe"
date: "2024-01-15"
tags: ["ai", "machine-learning", "tutorial"]
difficulty: "beginner"
---

# Getting Started with AI

This is the main content of the document written in **Markdown**.

You can include code blocks, links, and other formatting here.

```

This format is called "frontmatter". The section between the --- markers contains YAML metadata that describes the document, while everything below is regular Markdown content. This is very useful because we can extract structured information (like title, tags, difficulty level) along with the content.

This is how we read it:


In [2]:
with open('example.md', 'r', encoding='utf-8') as f:
    post = frontmatter.load(f)

In [3]:
# Access metadata
print(post.metadata['title'])  # "Getting Started with AI"
print(post.metadata['tags'])   # ["ai", "machine-learning", "tutorial"]

Getting Started with AI
['ai', 'machine-learning', 'tutorial']


In [4]:
# Access content
print(post.content)  # The markdown content without frontmatter

# Getting Started with AI

This is the main content of the document written in **Markdown**.

You can include code blocks, links, and other formatting here.


We can also get all the metadata and content at the same time using the post.to_dict() method.
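
To make the format more concrete, here is a rough sketch of how the block between the --- markers can be separated from the body using only the standard library. The `split_frontmatter` helper is hypothetical and exists only for illustration: it does not parse the YAML and it is not how python-frontmatter is implemented.

```python
def split_frontmatter(text):
    """Split a document into (raw_metadata, body).

    Hypothetical helper for illustration only: it does not parse YAML,
    it just separates the block between the first two '---' lines.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != '---':
        return '', text  # no frontmatter block at all

    for i in range(1, len(lines)):
        if lines[i].strip() == '---':
            raw_metadata = '\n'.join(lines[1:i])
            body = '\n'.join(lines[i + 1:]).strip()
            return raw_metadata, body

    return '', text  # opening marker without a closing one


example = '''---
title: "Getting Started with AI"
difficulty: "beginner"
---

# Getting Started with AI

This is the main content.
'''

raw_metadata, body = split_frontmatter(example)
print(raw_metadata)  # the two YAML lines, still unparsed
print(body)          # the Markdown content without the frontmatter
```

In practice, prefer the library: it also turns the YAML block into a Python dictionary, which this sketch leaves as raw text.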
## Sample Repositories
Now that we know how to process a single markdown file, let's find a repo with multiple files that we will use as our knowledge base.

We will work with multiple repositories:
- https://github.com/DataTalksClub/faq (source for https://datatalks.club/faq/) - FAQ for DataTalks.Club courses
- https://github.com/evidentlyai/docs/ - docs for Evidently AI library

There are multiple ways you can download a GitHub repo.
First, we can clone it using git, then process each file and prepare it for ingestion into our search system.

Alternatively, we can download the entire repository as a zip file and process all the content.
## Working with Zip Archives
The second option is easier and more efficient for our use case.
We don't even need to save the zip archive - we can load it into our Python process memory and extract all the data we need from there.
So the plan:
- Use requests for downloading the zip archive from GitHub
- Open the archive using built-in zipfile and io modules
- Iterate over all .md and .mdx files in the repo
- Collect the results into a list

Let's implement it step by step.

First, we download the repository as a zip file. GitHub provides a convenient URL format for this:

In [5]:
url = 'https://codeload.github.com/DataTalksClub/faq/zip/refs/heads/main'
resp = requests.get(url)

Then we open the archive in memory, iterate over its files, and parse every markdown file:

In [6]:
repository_data = []

# Create a ZipFile object from the downloaded content
zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    # Only process markdown files
    if not filename.endswith('.md'):
        continue

    # Read and parse each file
    with zf.open(file_info) as f_in:
        content = f_in.read()
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)

zf.close()

Let's look at what we got:

In [7]:
print(repository_data[1])

{'content': '# DataTalks.Club FAQ\n\nA static site generator for DataTalks.Club course FAQs with automated AI-powered FAQ maintenance.\n\n## Features\n\n- **Static Site Generation**: Converts markdown FAQs to a beautiful, searchable HTML site\n- **Automated FAQ Management**: AI-powered bot that processes new FAQ proposals\n- **Intelligent Triage**: Automatically determines if proposals should create new entries, update existing ones, or are duplicates\n- **GitHub Integration**: Seamless workflow via GitHub Issues and Pull Requests\n\n## Project Structure\n\n```\nfaq/\n├── _questions/              # FAQ content organized by course\n│   ├── machine-learning-zoomcamp/\n│   │   ├── _metadata.yaml   # Course configuration\n│   │   ├── general/         # General course questions\n│   │   ├── module-1/        # Module-specific questions\n│   │   └── ...\n│   ├── data-engineering-zoomcamp/\n│   └── ...\n├── _layouts/                # Jinja2 HTML templates\n│   ├── base.html\n│   ├── course.htm
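
If you want to experiment with the in-memory zip handling without downloading anything, here is a small sketch that builds a tiny archive in memory and reads it back the same way. The file names and contents are made up for the example, and `zip_bytes` is a stand-in for `resp.content`.

```python
import io
import zipfile

# Build a tiny zip archive in memory (a stand-in for resp.content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('repo-main/README.md', '# Hello\n\nSome docs.')
    zf.writestr('repo-main/docs/guide.mdx', '# Guide')
    zf.writestr('repo-main/src/app.py', 'print("hi")')
zip_bytes = buffer.getvalue()

# Read it back the same way we read the GitHub archive
markdown_files = {}
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    for file_info in zf.infolist():
        # str.endswith accepts a tuple of suffixes
        if not file_info.filename.lower().endswith(('.md', '.mdx')):
            continue
        with zf.open(file_info) as f_in:
            markdown_files[file_info.filename] = f_in.read().decode('utf-8')

print(sorted(markdown_files))
# ['repo-main/README.md', 'repo-main/docs/guide.mdx']
```

Using `ZipFile` as a context manager closes the archive automatically, so no explicit `zf.close()` is needed.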

For processing Evidently docs we also need .mdx files (MDX - Markdown with embedded JSX components), so we can modify the code like this:

In [8]:
url = 'https://codeload.github.com/evidentlyai/docs/zip/refs/heads/main'
resp = requests.get(url)

In [9]:
repository_data = []

# Create a ZipFile object from the downloaded content
zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    if not (filename.endswith('.md') or filename.endswith('.mdx')):
        continue

    # Read and parse each file
    with zf.open(file_info) as f_in:
        content = f_in.read()
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)

zf.close()

In [10]:
print(repository_data[1])

{'title': 'Delete Plant', 'openapi': 'DELETE /plants/{id}', 'content': '', 'filename': 'docs-main/api-reference/endpoint/delete.mdx'}


## Complete Implementation
Let's now put everything together into a reusable function:

In [11]:
def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data

We can now use this function for different repositories:

In [12]:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")

FAQ documents: 1228
Evidently documents: 95
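
Note that `read_repo_data` assumes the repository's default branch is called main. For repositories that use a different branch name, one possible tweak is to make the branch a parameter when building the URL. This is only a sketch: `build_zip_url` and its `branch` parameter are hypothetical, and it assumes the URL pattern used above works the same way for other branch names.

```python
def build_zip_url(repo_owner, repo_name, branch='main'):
    """Build the zip download URL for a branch.

    Hypothetical helper: assumes the codeload URL pattern used
    in this lesson also applies to branches other than main.
    """
    prefix = 'https://codeload.github.com'
    return f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/{branch}'


print(build_zip_url('DataTalksClub', 'faq'))
# https://codeload.github.com/DataTalksClub/faq/zip/refs/heads/main
print(build_zip_url('some-owner', 'some-repo', branch='master'))
```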


## Data Processing Considerations
For FAQ, the data is ready to use. These are small records that we can index (put into a search engine) as is.

For Evidently docs, the documents are very large. We need extra processing called "chunking" - breaking large documents into smaller, manageable pieces. This is important because:
1. Search relevance: Smaller chunks are more specific and relevant to user queries
2. Performance: AI models work better with shorter text segments
3. Memory limits: Large documents might exceed token limits of language models

We will cover chunking techniques in tomorrow's lesson.

If you have any suggestions about the course content or want to improve something, let me know!
## Homework
- Create a new uv project in the project directory
- Select a GitHub repo with documentation (preferably with .md files)
- Download the data from there using the techniques we've learned
- Make a post on social media about what you're building
# Day 2: Chunking and Intelligent Processing for Data
Welcome to Day 2 of our 7-Day AI Agents Email Crash-Course.

In the first part of the course, we focus on data preparation – the process of properly preparing data before it can be used for AI agents.

**Small and Large Documents**

Yesterday (Day 1), we downloaded the data from a GitHub repository and processed it. For some use cases, like the FAQ database, this is sufficient. The questions and answers are small enough. We can put them directly into the search engine.

But it's different for the Evidently documentation. These documents are quite large. Let's take a look at this one: https://github.com/evidentlyai/docs/blob/main/docs/library/descriptors.mdx.

We could use it as is, but we risk overwhelming our LLMs.

**Why We Need to Prepare Large Documents Before Using Them**

Large documents create several problems:
- Token limits: Most LLMs have maximum input token limits
- Cost: Longer prompts cost more money
- Performance: LLMs perform worse with very long contexts
- Relevance: Not all parts of a long document are relevant to a specific question

So we need to split documents into smaller subdocuments. For AI applications like RAG (which we will discuss tomorrow), this process is referred to as "chunking."
## Today’s Tasks (Day 2)
Today, we will cover multiple ways of chunking data:
- Simple character-based chunking
- Paragraph and section-based chunking
- Intelligent chunking with LLM

Just so you know, for the last section, you will need an OpenAI account or an account from an alternative LLM provider such as Groq.
## 1. Simple Chunking
Let's start with simple chunking. This will be sufficient for most cases.

We can continue with the notebook from Day 1. We already downloaded the data from Evidently docs. We put them into the evidently_docs list.

This is what the document at index 45 looks like:
```
{'title': 'LLM regression testing',
 'description': 'How to run regression testing for LLM outputs.',
 'content': 'In this tutorial, you will learn...'
}
```
The content field is 21,712 characters long. The simplest thing we can do is cut it into pieces of equal length. For example, with a chunk size of 2000 characters, we will have:
- Chunk 1: 0..2000
- Chunk 2: 2000..4000
- Chunk 3: 4000..6000

And so on.
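
As a quick illustration, equal-length splitting without overlap can be sketched in a couple of lines. The `fixed_size_chunks` name is made up for this example, and the text is a dummy string with the same length as the document above.

```python
def fixed_size_chunks(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = 'x' * 21712  # dummy text, same length as the document above
chunks = fixed_size_chunks(text, 2000)

print(len(chunks))      # 11
print(len(chunks[-1]))  # 1712 - the last chunk is shorter
```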

However, this approach has disadvantages:
- Context loss: Important information might be split in the middle
- Incomplete sentences: Chunks might end mid-sentence
- Missing connections: Related information might end up in different chunks

That's why, in practice, we usually make sure there's overlap between chunks. For size 2000 and overlap 1000, we will have:
- Chunk 1: 0..2000
- Chunk 2: 1000..3000
- Chunk 3: 2000..4000
- ...

This is better for AI because:
- Continuity: Important information isn't lost at chunk boundaries
- Context preservation: Related sentences stay together in at least one chunk
- Better search: Queries can match information even if it spans chunk boundaries

This approach is known as the "sliding window" method. This is how we implement it in Python:

In [13]:
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result

Let's apply it to document 45. This gives us 21 chunks:
- 0..2000
- 1000..3000
- ...
- 19000..21000
- 20000..21712
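
To see the windows on something small, here is the same function applied to a short string, plus a check of the chunk count on a dummy text of the same length as document 45. The function is repeated so the snippet runs on its own.

```python
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break  # this window already reaches the end

    return result


for window in sliding_window('abcdefghij', size=4, step=2):
    print(window)
# {'start': 0, 'chunk': 'abcd'}
# {'start': 2, 'chunk': 'cdef'}
# {'start': 4, 'chunk': 'efgh'}
# {'start': 6, 'chunk': 'ghij'}

dummy = 'x' * 21712  # same length as document 45
windows = sliding_window(dummy, 2000, 1000)
print(len(windows))         # 21
print(windows[-1]['start'])  # 20000
```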

Let's process all the documents:

In [14]:
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)

Note that we use copy() and pop() operations:
- doc.copy() creates a shallow copy of the document dictionary
- doc_copy.pop('content') removes the 'content' key and returns its value
- This way we preserve the original dictionary keys that we can use later in the chunks.
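
Here is the same copy/pop/update pattern on a tiny made-up document, to show what ends up where:

```python
doc = {'title': 'Example', 'filename': 'example.md', 'content': 'abcdef'}

doc_copy = doc.copy()            # shallow copy, so the original stays intact
content = doc_copy.pop('content')  # doc_copy now holds only the metadata

chunk = {'start': 0, 'chunk': content[:3]}
chunk.update(doc_copy)           # attach the metadata to the chunk

print(chunk)
# {'start': 0, 'chunk': 'abc', 'title': 'Example', 'filename': 'example.md'}
print('content' in doc)  # True - the original document is unchanged
```

One thing to watch for: if a document's metadata happens to contain a key named start or chunk, update() will overwrite the chunk's own value.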

This way, we obtain 575 chunks from 95 documents.

We can play with the parameters by including more or less content. 2000 characters is usually good enough for RAG applications.

There are some alternative approaches:
- Token-based chunking: You first tokenize the content (turn it into a sequence of tokens, such as words or word pieces) and then do a sliding window over tokens
    - Advantages: More precise control over LLM input size
    - Disadvantages: Doesn't work well for documents with code
- Paragraph splitting: Split by paragraphs
- Section splitting: Split by sections
- AI-powered splitting: Let AI split the text intelligently

We won't cover token-based chunking here, as we're working with documents that contain code. But it's easy to implement - ask ChatGPT for help if you need it for text-only content.
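
For reference, if you do want to try it on text-only content, a rough sketch could look like this. It uses whitespace splitting as a crude stand-in for a tokenizer, which is an assumption made for simplicity: real LLM tokenizers split text differently, so the counts will not match model token counts. The `token_sliding_window` name is made up for this example.

```python
def token_sliding_window(text, size, step):
    """Sliding window over whitespace-separated tokens (a rough approximation)."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    tokens = text.split()  # crude whitespace "tokenizer"
    result = []
    for i in range(0, len(tokens), step):
        window = tokens[i:i + size]
        result.append({'start': i, 'chunk': ' '.join(window)})
        if i + size >= len(tokens):
            break

    return result


sample = 'one two three four five six seven'
for chunk in token_sliding_window(sample, size=4, step=2):
    print(chunk)
# {'start': 0, 'chunk': 'one two three four'}
# {'start': 2, 'chunk': 'three four five six'}
# {'start': 4, 'chunk': 'five six seven'}
```

Joining tokens with a single space loses the original whitespace and line breaks, which is one reason this approach is a poor fit for documents with code.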

We will implement the others.
## 2. Splitting by Paragraphs and Sections
Splitting by paragraphs is relatively easy:

In [15]:
import re
text = evidently_docs[45]['content']
paragraphs = re.split(r"\n\s*\n", text.strip())

We use \n\s*\n regex pattern for splitting:
- \n matches a newline
- \s* matches zero or more whitespace characters
- \n matches another newline
- So \n\s*\n matches two newlines with optional whitespace between them
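
A small made-up example shows the pattern at work: a blank line (even one containing only spaces) separates paragraphs, while a single newline does not.

```python
import re

sample = (
    "First paragraph.\n"
    "\n"
    "Second paragraph,\n"
    "still second.\n"
    "   \n"  # a "blank" line that contains only spaces
    "Third paragraph."
)

paragraphs = re.split(r"\n\s*\n", sample.strip())
print(paragraphs)
# ['First paragraph.', 'Second paragraph,\nstill second.', 'Third paragraph.']
```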

This works well for literature, but it doesn't work well for technical documentation, where most paragraphs are very short.

You can combine sliding window and paragraph splitting for more intelligent processing. We won't do it here, but it's a good exercise to try.
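
If you want a starting point for that exercise, one possible sketch is to run the sliding window over the list of paragraphs instead of over characters. The `paragraph_windows` function is hypothetical, and here size and step are counted in paragraphs, not characters.

```python
import re


def paragraph_windows(text, size, step):
    """Group paragraphs into overlapping windows of `size` paragraphs."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for i in range(0, len(paragraphs), step):
        window = paragraphs[i:i + size]
        result.append("\n\n".join(window))
        if i + size >= len(paragraphs):
            break

    return result


sample = "Intro.\n\nStep one.\n\nStep two.\n\nWrap up."
for chunk in paragraph_windows(sample, size=2, step=1):
    print(repr(chunk))
# 'Intro.\n\nStep one.'
# 'Step one.\n\nStep two.'
# 'Step two.\n\nWrap up.'
```

A variation worth trying is to keep adding paragraphs to a window until a character budget is reached, which gives chunks of more even length.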

Let's now look at section splitting. Here, we take advantage of the documents' structure. Markdown documents have this structure:

In [16]:
# Heading 1
## Heading 2  
### Heading 3

What we can do is split by headers.

For that we will use regex too:

In [17]:
import re

def split_markdown_by_level(text, level=2):
    """
    Split markdown text by a specific header level.
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """
    # This regex matches markdown headers
    # For level 2, it matches lines starting with "## "
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    # Split and keep the headers
    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        # We step by 3 because regex.split() with
        # capturing groups returns:
        # [before_match, group1, group2, after_match, ...]
        # here group1 is "## ", group2 is the header text
        header = parts[i] + parts[i+1]  # "## " + "Title"
        header = header.strip()

        # Get the content after this header
        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections

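Note: This code may not work perfectly if we want to split by level 1 headings and have Python code with # comments. But in general, this is not a big problem for documentation. Also keep in mind that any text before the first matching header is dropped, and a document without matching headers produces an empty list.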
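
To see why the loop steps by 3, it helps to look at what `split()` returns for a pattern with two capturing groups. The snippet below uses the same level-2 pattern on a small made-up document:

```python
import re

pattern = re.compile(r'^(#{2} )(.+)$', re.MULTILINE)

text = "Intro text\n\n## First\n\nBody one\n\n## Second\n\nBody two"
parts = pattern.split(text)

for part in parts:
    print(repr(part))
# 'Intro text\n\n'   <- text before the first header
# '## '              <- group 1
# 'First'            <- group 2
# '\n\nBody one\n\n' <- content of the first section
# '## '
# 'Second'
# '\n\nBody two'

# A level-3 header is not matched by the level-2 pattern
print(pattern.search("### Deeper heading"))  # None
```

The first element is the text before the first header, which `split_markdown_by_level` skips by starting its loop at index 1.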

If we want to split by second-level headers, that's what we do:

In [18]:
sections = split_markdown_by_level(text, level=2)

Now we iterate over all the docs to create the final result:

In [19]:
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)

Like previously, copy() creates a copy of the document metadata. pop('content') removes and returns the content. This way, each section gets the same metadata (title, description) as the original document.

This was more intelligent processing, but we can go even further and use LLMs for that.
## 3. Intelligent Chunking with LLM
In some cases, we want to be more intelligent with chunking. Instead of doing simple splits, we delegate this work to AI.

This makes sense when:
- Complex structure: Documents have complex, non-standard structure
- Semantic coherence: You want chunks that are semantically meaningful
- Custom logic: You need domain-specific splitting rules
- Quality over cost: You prioritize quality over processing cost

This costs money. In most cases, we don't need intelligent chunking.

Simple approaches are sufficient. Use intelligent chunking only when:
- You already evaluated simpler methods and you can confirm that they produce poor results
- You have complex, unstructured documents
- Quality is more important than cost
- You have the budget for LLM processing

Note: You can use any alternative LLM provider. One option is Groq, which is free with rate limits. You can replace the OpenAI library with the Groq library and it should work.

To continue, you need to get the API key from https://platform.openai.com/api-keys (assuming you have an account).

Let's stop Jupyter and create an environment variable with your key:
```
export OPENAI_API_KEY='your-api-key'
export GROQ_API_KEY='your-api-key'
```
Install the OpenAI/Groq SDK:
```
uv add openai
uv add groq
```
Then run jupyter notebook:
```
uv run jupyter notebook
```
It's cumbersome to set environment variables every time. I recommend using direnv, which works for Linux, Mac and Windows.

Note: if you use direnv, don't forget to add .envrc to .gitignore.

Warning: Never commit your API keys to git! Others can use your API key and you'll pay for it.

Now we're ready to use OpenAI/Groq:

In [36]:
'''
from openai import OpenAI

openai_client = OpenAI()


def llm(prompt, model='gpt-4o-mini'):
    messages = [
        {"role": "user", "content": prompt}
    ]

    response = openai_client.responses.create(
        model=model,
        input=messages
    )

    return response.output_text
'''

from groq import Groq
client = Groq()

def llm(prompt, model="llama-3.3-70b-versatile"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

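This code invokes an LLM with the provided prompt and returns the results. The active version uses Groq with llama-3.3-70b-versatile as the default model; the commented-out version at the top does the same with OpenAI and gpt-4o-mini. We will explain in more detail what this code does in the next lessons.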

Let's create a prompt:

In [37]:
prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.

Each section should be self-contained and cover
a specific topic or concept.

<DOCUMENT>
{document}
</DOCUMENT>

Use this format:

## Section Name

Section content with all relevant details

---

## Another Section Name

Another section content

---
""".strip()

The prompt asks the LLM to:
- Split the document logically (not just by length)
- Make sections self-contained
- Use a specific output format that's easy to parse

Let's create a function for intelligent chunking:

In [38]:
def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    response = llm(prompt)
    sections = response.split('---')
    sections = [s.strip() for s in sections if s.strip()]
    return sections
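
Since every real call uses tokens, it can be handy to check the parsing step offline first. The sketch below runs the same split-and-strip logic on a hard-coded response; `parse_sections` and `fake_response` are made up for this example.

```python
def parse_sections(response):
    """Same parsing logic as in intelligent_chunking, without the LLM call."""
    sections = response.split('---')
    return [s.strip() for s in sections if s.strip()]


# A hand-written response in the format the prompt asks for
fake_response = """
## Setup

Install the library.

---

## Usage

Run the report.

---
"""

sections = parse_sections(fake_response)
print(len(sections))  # 2
print(sections[0])
```

Keep in mind that split('---') cuts on every occurrence of three dashes, so a Markdown table separator inside a section would also be treated as a boundary.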

Now we apply this to every document:

In [39]:
from tqdm.auto import tqdm

evidently_chunks = []

for doc in tqdm(evidently_docs):
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)

  0%|          | 0/95 [00:00<?, ?it/s]

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.3-70b-versatile` in organization `org_01kc73gxf2es4rv6qb1873937p` service tier `on_demand` on tokens per day (TPD): Limit 100000, Used 99704, Requested 1289. Please try again in 14m17.952s. Need more tokens? Upgrade to Dev Tier today at https://console.groq.com/settings/billing', 'type': 'tokens', 'code': 'rate_limit_exceeded'}}

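tqdm is a library that shows progress bars. It helps you track progress when processing a large number of documents. If it's not installed yet, add it with uv add tqdm.

In the run shown above, the loop stopped at the first document with a RateLimitError (HTTP 429): the daily token limit of Groq's free tier had already been used up. Keep this in mind when you use a rate-limited provider.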

**Note: This process requires time and incurs costs. As mentioned before, use this only when really necessary. For most applications, you don't need intelligent chunking.**
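
For short-lived rate limits, one common pattern is to retry the call with a growing delay. Below is a minimal sketch using only the standard library; `call_with_retries` is a hypothetical helper, and the failing function is a stand-in for a real LLM call. For a daily limit like the one in the output above, short retries won't help: you need to wait, switch models, or use another provider.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,)):
    """Call func(*args), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on as error:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed: {error}. Retrying in {delay}s")
            time.sleep(delay)


# Stand-in for a real LLM call: fails twice, then succeeds
calls = {'count': 0}

def flaky_llm(prompt):
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError("rate limit reached")
    return f"response to: {prompt}"


result = call_with_retries(flaky_llm, "hello", base_delay=0)
print(result)  # response to: hello
```

With a real client, you would pass the library's rate limit exception class in retry_on instead of the catch-all Exception, and use a base_delay of a few seconds.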

Bonus: you can use this approach for processing the code in your GitHub repository. You can use a variation of the following prompt:

"Summarize the code in plain English. Briefly describe each class and function/method (their purpose and role), then give a short overall summary of how they work together. Avoid low-level details."

Then add both the source code and the summary to your documents.
## 4. How to Choose a Chunking Approach
You may wonder - which chunking should I use? The answer: start with the simplest one and gradually increase complexity. Start with simple chunking with overlaps. We will later talk about evaluations. You can use evaluations to make informed decisions about chunking strategies.
## Coming Up Tomorrow (Day 3)
Our data is ready. Now we can index it – insert it into a search engine and make it available for our (future) agent to use.

If you have suggestions about the course content or want to improve something, let me know! Just reply to the email with this lesson.
## Homework
- For the project you selected, apply chunking
- Experiment with simple chunking, paragraph chunking + sliding window, and section chunking
- Which approach makes sense for your application? Manually inspect the results and analyze what works best
- Make a post on social media about what you're building