In [1]:
import io
import zipfile
import requests
import frontmatter

## Understanding Frontmatter
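We also need a library for parsing frontmatter, a metadata format commonly used in Markdown files by static site generators and frameworks such as Jekyll, Hugo, and Next.js. The frontmatter module imported above comes from the python-frontmatter package.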

It looks like this:
```
---
title: "Getting Started with AI"
author: "John Doe"
date: "2024-01-15"
tags: ["ai", "machine-learning", "tutorial"]
difficulty: "beginner"
---

# Getting Started with AI

This is the main content of the document written in **Markdown**.

You can include code blocks, links, and other formatting here.

```

This format is called "frontmatter". The section between the --- markers contains YAML metadata that describes the document, while everything below is regular Markdown content. This is very useful because we can extract structured information (like title, tags, difficulty level) along with the content.
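To make the structure concrete, here is a minimal sketch that separates the two parts using only the standard library. The helper name split_frontmatter is hypothetical and exists only for illustration. It returns the YAML header as raw text and does not parse it, which is the part a real frontmatter parser handles for us.

```python
# Illustrative only: split a document into its raw YAML header and Markdown body.
# A real frontmatter parser also turns the YAML text into a dictionary.
def split_frontmatter(text):
    """Return (raw_yaml, body); raw_yaml is '' when there is no frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != '---':
        return '', text
    # Find the closing '---' marker after the opening one
    for i in range(1, len(lines)):
        if lines[i].strip() == '---':
            raw_yaml = '\n'.join(lines[1:i])
            body = '\n'.join(lines[i + 1:]).strip()
            return raw_yaml, body
    # No closing marker: treat everything as content
    return '', text


sample = '''---
title: "Getting Started with AI"
difficulty: "beginner"
---

# Getting Started with AI

This is the main content.
'''

raw_yaml, body = split_frontmatter(sample)
print(raw_yaml)
print(body)
```

In practice we rely on the library instead of a hand-written split like this, because it also handles the YAML parsing and edge cases.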

This is how we read such a file with the frontmatter library:


In [2]:
with open('example.md', 'r', encoding='utf-8') as f:
    post = frontmatter.load(f)

In [3]:
# Access metadata
print(post.metadata['title'])  # "Getting Started with AI"
print(post.metadata['tags'])   # ["ai", "machine-learning", "tutorial"]

Getting Started with AI
['ai', 'machine-learning', 'tutorial']


In [4]:
# Access content
print(post.content)  # The markdown content without frontmatter

# Getting Started with AI

This is the main content of the document written in **Markdown**.

You can include code blocks, links, and other formatting here.


We can also get all the metadata and the content at the same time using the post.to_dict() method. It returns a single dictionary in which the metadata fields sit next to a content key.

## Sample Repositories
Now that we know how to process a single markdown file, let's find a repo with multiple files that we will use as our knowledge base.

We will work with multiple repositories:
- https://github.com/DataTalksClub/faq (source for https://datatalks.club/faq/) - FAQ for DataTalks.Club courses
- https://github.com/evidentlyai/docs/ - docs for Evidently AI library

There are several ways to download a GitHub repository.
One option is to clone it with git and then process each file locally before ingesting it into our search system.
Alternatively, we can download the entire repository as a zip file and process its contents directly.
## Working with Zip Archives
The second option is easier and more efficient for our use case.
We don't even need to save the zip archive - we can load it into our Python process memory and extract all the data we need from there.
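
To see that a zip archive can live entirely in memory, here is a small sketch that needs no network access. It builds a toy archive with made-up file names (demo-main/... is an assumption for illustration, standing in for the bytes we would get from GitHub) and reads it back from raw bytes.

```python
import io
import zipfile

# Build a small zip archive in memory (a stand-in for the downloaded bytes)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf_out:
    zf_out.writestr('demo-main/README.md', '# Demo')
    zf_out.writestr('demo-main/docs/intro.mdx', '# Intro')
    zf_out.writestr('demo-main/logo.png', b'not really an image')
zip_bytes = buffer.getvalue()

# Read the archive back from the raw bytes, without touching the disk
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf_in:
    names = [info.filename for info in zf_in.infolist()]
    readme = zf_in.read('demo-main/README.md').decode('utf-8')

print(names)
print(readme)
```

io.BytesIO wraps the bytes in a file-like object, which is all zipfile.ZipFile needs in order to open the archive.
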
So the plan:
- Use requests for downloading the zip archive from GitHub
- Open the archive using built-in zipfile and io modules
- Iterate over all .md and .mdx files in the repo
- Collect the results into a list

Let's implement it step by step.

First, we download the repository as a zip file. GitHub provides a convenient URL format for this:

In [5]:
url = 'https://codeload.github.com/DataTalksClub/faq/zip/refs/heads/main'
resp = requests.get(url)

Then we open the archive in memory, keep only the .md files, and parse each one with frontmatter:

In [6]:
repository_data = []

# Create a ZipFile object from the downloaded content
zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    # Only process markdown files
    if not filename.endswith('.md'):
        continue

    # Read and parse each file
    with zf.open(file_info) as f_in:
        content = f_in.read()
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)

zf.close()

Let's look at what we got:

In [7]:
print(repository_data[1])

{'content': '# DataTalks.Club FAQ\n\nA static site generator for DataTalks.Club course FAQs with automated AI-powered FAQ maintenance.\n\n## Features\n\n- **Static Site Generation**: Converts markdown FAQs to a beautiful, searchable HTML site\n- **Automated FAQ Management**: AI-powered bot that processes new FAQ proposals\n- **Intelligent Triage**: Automatically determines if proposals should create new entries, update existing ones, or are duplicates\n- **GitHub Integration**: Seamless workflow via GitHub Issues and Pull Requests\n\n## Project Structure\n\n```\nfaq/\n├── _questions/              # FAQ content organized by course\n│   ├── machine-learning-zoomcamp/\n│   │   ├── _metadata.yaml   # Course configuration\n│   │   ├── general/         # General course questions\n│   │   ├── module-1/        # Module-specific questions\n│   │   └── ...\n│   ├── data-engineering-zoomcamp/\n│   └── ...\n├── _layouts/                # Jinja2 HTML templates\n│   ├── base.html\n│   ├── course.htm

To process the Evidently docs we also need .mdx files (Markdown that can embed JSX components), so we extend the filter like this:

In [8]:
url = 'https://codeload.github.com/evidentlyai/docs/zip/refs/heads/main'
resp = requests.get(url)

In [9]:
repository_data = []

# Create a ZipFile object from the downloaded content
zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    if not (filename.endswith('.md') or filename.endswith('.mdx')):
        continue

    # Read and parse each file
    with zf.open(file_info) as f_in:
        content = f_in.read()
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)

zf.close()

In [10]:
print(repository_data[1])

{'title': 'Delete Plant', 'openapi': 'DELETE /plants/{id}', 'content': '', 'filename': 'docs-main/api-reference/endpoint/delete.mdx'}
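
The extension check can also be written more compactly, because str.endswith accepts a tuple of suffixes. The helper name is_markdown below is hypothetical, and the file names are made up for illustration. Lowercasing only inside the check keeps the original casing of the file name available for storing.

```python
def is_markdown(filename):
    """Return True for .md and .mdx files, ignoring case."""
    # str.endswith accepts a tuple of suffixes and matches any of them
    return filename.lower().endswith(('.md', '.mdx'))


filenames = ['docs-main/README.md', 'docs-main/intro.MDX', 'docs-main/logo.png']
markdown_files = [name for name in filenames if is_markdown(name)]
print(markdown_files)
```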


## Complete Implementation
Let's now put everything together into a reusable function:

In [11]:
def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data

We can now use this function for different repositories:

In [12]:
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")

FAQ documents: 1228
Evidently documents: 95
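
Note that read_repo_data assumes the branch is called main, so the download would fail for a repository that uses a different branch name. One possible adjustment, sketched here with a hypothetical helper build_zip_url and a hypothetical branch parameter (neither is part of the function above), is to make the branch configurable while keeping the same URL pattern:

```python
def build_zip_url(repo_owner, repo_name, branch='main'):
    """Build the zip download URL for a given branch (assumed URL pattern)."""
    prefix = 'https://codeload.github.com'
    return f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/{branch}'


print(build_zip_url('DataTalksClub', 'faq'))
print(build_zip_url('evidentlyai', 'docs', branch='main'))
```

With the default argument, the helper produces exactly the URLs used earlier in this lesson.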


## Data Processing Considerations
For the FAQ, the data is ready to use. The records are small, so we can index them (put them into a search engine) as is.

For the Evidently docs, the documents are very large. They need an extra processing step called "chunking": breaking large documents into smaller, manageable pieces. This is important because:
1. Search relevance: smaller chunks are more specific, so they match user queries more precisely
2. Performance: AI models work better with shorter text segments
3. Memory limits: large documents might exceed the token limits of language models

We will cover chunking techniques in tomorrow's lesson.
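
As a preview of the idea, here is a minimal sketch of one simple approach: fixed-size chunks with overlap, measured in characters. The function name sliding_window and its size and step parameters are assumptions made for this illustration, not necessarily the technique the next lesson uses.

```python
def sliding_window(text, size, step):
    """Split text into chunks of `size` characters, starting every `step` characters."""
    if size <= 0 or step <= 0:
        raise ValueError('size and step must be positive')

    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        chunks.append({'start': start, 'chunk': chunk})
        # Stop once a chunk reaches the end of the text
        if start + size >= len(text):
            break
    return chunks


chunks = sliding_window('abcdefghij', size=4, step=2)
for item in chunks:
    print(item)
```

When step is smaller than size, neighbouring chunks overlap, so a sentence cut at one chunk boundary still appears whole in the adjacent chunk.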

If you have any suggestions about the course content or want to improve something, let me know!