## Step 1: Import Required Libraries


In [2]:
# Archive URL format: https://github.com/<owner>/<repository>/archive/refs/heads/<branch_name>.zip
import io
import zipfile
import requests
import frontmatter

## Step 2: Download the Repository
- GitHub's ZIP download URL format: `https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{branch}`
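The URL pattern above can be filled in with a small helper. This is a minimal sketch: `build_zip_url` is a hypothetical name that does not appear in the notebook, and the owner and repository values in the usage line are placeholders.

```python
def build_zip_url(owner, repo, branch="main"):
    """Build a codeload ZIP URL for one branch of a GitHub repository."""
    # Hypothetical helper: fills the {owner}/{repo}/{branch} slots of the pattern.
    return f"https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{branch}"


# Placeholder names, used only to show the resulting string.
print(build_zip_url("some-owner", "some-repo", branch="dev"))
```

Keeping the branch as a parameter makes it easy to point the same code at a repository whose default branch is not `main`.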


In [3]:
url = 'https://codeload.github.com/fsamura01/taREDACTED_OPENAI_KEY-app/zip/refs/heads/main'
resp = requests.get(url)
resp

<Response [200]>

## Step 3: Process the ZIP File in Memory


In [4]:
repository_data = []

# Create a ZipFile object from the downloaded content
zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    # Only process markdown files
    if not filename.endswith('.md'):
        continue

    # Read and parse each file
    with zf.open(file_info) as f_in:
        content = f_in.read()
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)

zf.close()
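
To try the in-memory approach without a network call, the sketch below builds a small ZIP archive in a `BytesIO` buffer and applies the same extension filter. The file names and contents are made up for illustration, and `list_markdown_files` is a hypothetical helper rather than part of the notebook.

```python
import io
import zipfile


def list_markdown_files(zip_bytes):
    """Return {filename: text} for every .md file in a ZIP held in memory."""
    found = {}
    # The with-block closes the archive even if reading a member fails.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for file_info in zf.infolist():
            if not file_info.filename.lower().endswith(".md"):
                continue
            with zf.open(file_info) as f_in:
                found[file_info.filename] = f_in.read().decode("utf-8")
    return found


# Build a throwaway archive in memory (made-up file names and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("demo-main/README.md", "# Demo\n")
    zf.writestr("demo-main/src/app.js", "console.log('hi');\n")

markdown_files = list_markdown_files(buffer.getvalue())
print(sorted(markdown_files))
```

Nothing is written to disk: the archive exists only as bytes, which is the same situation as `resp.content` after the download.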

In [5]:
print(f"Total documents extracted: {len(repository_data)}")

Total documents extracted: 3


In [7]:
# Look at multiple documents to find one with frontmatter
for i, doc in enumerate(repository_data[:5]):
    print(f"\n--- Document {i} ---")
    print(f"Filename: {doc.get('filename')}")
    print(f"Keys: {list(doc.keys())}")
    if 'question' in doc:
        print(f"Question: {doc.get('question')}")
        break


--- Document 0 ---
Filename: taREDACTED_OPENAI_KEY-app-main/readme.md
Keys: ['content', 'filename']

--- Document 1 ---
Filename: taREDACTED_OPENAI_KEY-app-main/client/readme.md
Keys: ['content', 'filename']

--- Document 2 ---
Filename: taREDACTED_OPENAI_KEY-app-main/server/readme.md
Keys: ['content', 'filename']
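
Every document above has only the `content` and `filename` keys, so none of these files carries a frontmatter block. For comparison, the sketch below shows what a frontmatter block looks like and how its fields turn into dictionary keys. It is a simplified stand-in that handles only flat `key: value` lines and keeps every value as a string; it is not the parser used by the `frontmatter` library. `split_frontmatter` and the sample text are hypothetical.

```python
def split_frontmatter(text):
    """Split a document into (metadata dict, body) using '---' delimiters."""
    lines = text.split("\n")
    if lines[0].strip() != "---":
        return {}, text  # no frontmatter block at the top
    try:
        # Index of the closing '---' line.
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return {}, text  # opening delimiter without a closing one
    metadata = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return metadata, body


# Made-up document with two frontmatter fields.
sample = (
    "---\n"
    "question: How do I reset my password?\n"
    "sort_order: 1\n"
    "---\n"
    "Open Settings and choose Reset.\n"
)
metadata, body = split_frontmatter(sample)
print(metadata)
print(body)
```

A file like `sample` would produce extra keys such as `question` alongside `content`, which is what the loop above was searching for.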


In [13]:
# No document has a 'question' key, so inspect the first one instead
faq_doc = repository_data[0]
print(faq_doc)

{'content': '# Task Manager App\n\nA React-based task management application that allows users to create, edit, delete, and track tasks with due dates and completion status.\n\n## Features\n\n### Task Management\n- **Create Tasks**: Add new tasks with title, description, and due date\n- **Edit Tasks**: Modify existing tasks with inline editing\n- **Delete Tasks**: Remove tasks with confirmation dialog\n- **Toggle Completion**: Mark tasks as complete/incomplete with one click\n\n### User Experience\n- **Task Statistics**: View total, incomplete, and completed task counts\n- **Visual Feedback**: Different styling for completed vs incomplete tasks\n- **Loading States**: Clear feedback during API operations\n- **Error Handling**: Comprehensive error messages and validation\n\n### Form Validation\n- **Title**: Required, minimum 3 characters\n- **Description**: Required\n- **Due Date**: Required, cannot be in the past (for incomplete tasks)\n- **Real-time Validation**: Clear errors as user t

## Step 5: Support Multiple Markdown Types
- To include .mdx files (Markdown with embedded JSX, common in React projects), extend the extension check:



In [25]:
for file_info in zf.infolist():
    filename = file_info.filename.lower()

    # Skip anything that is neither .md nor .mdx
    if not (filename.endswith('.md') or filename.endswith('.mdx')):
        continue

    # Read and parse the file as in Step 3
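
The same check can be written more compactly, because `str.endswith` accepts a tuple of suffixes. This is a minimal sketch; `MARKDOWN_EXTENSIONS` and `is_markdown` are hypothetical names introduced here for illustration.

```python
# Hypothetical constant: add more suffixes here to widen the filter.
MARKDOWN_EXTENSIONS = (".md", ".mdx")


def is_markdown(filename):
    """Return True if the file name ends with a known markdown extension."""
    # Lowercase only for the comparison so 'README.MD' also matches.
    return filename.lower().endswith(MARKDOWN_EXTENSIONS)


for name in ["README.MD", "docs/intro.mdx", "src/app.js"]:
    print(name, is_markdown(name))
```

Keeping the extensions in one tuple means supporting another format is a one-line change.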


## Step 6: Complete Reusable Function
- Here's a reusable version with basic error handling. It assumes the repository's default branch is `main` and skips files that fail to parse:

In [17]:
import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.

    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name

    Returns:
        List of dictionaries containing file content and metadata
    """
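    # The download URL assumes the default branch is named 'main'
    prefix = 'https://codeload.github.com'
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url, timeout=30)  # avoid hanging indefinitely on a stalled connection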

    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))

    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') or filename_lower.endswith('.mdx')):
            continue

        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue

    zf.close()
    return repository_data
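
`read_repo_data` always requests the `main` branch, so it raises for repositories whose default branch has another name. One possible extension, sketched below, is to try several branch names in order. `candidate_urls` and `download_first_available` are hypothetical helpers, the `("main", "master")` default is an assumption, and the download itself is passed in as a `fetch` callable so the logic can be exercised without network access.

```python
def candidate_urls(repo_owner, repo_name, branches=("main", "master")):
    """List codeload ZIP URLs to try, one per candidate branch name."""
    prefix = "https://codeload.github.com"
    return [
        f"{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/{branch}"
        for branch in branches
    ]


def download_first_available(urls, fetch):
    """Return the body of the first URL that answers with HTTP 200.

    `fetch` is any callable that takes a URL and returns an object with
    `status_code` and `content` attributes, such as a requests response.
    """
    tried = []
    for url in urls:
        resp = fetch(url)
        if resp.status_code == 200:
            return resp.content
        tried.append(f"{url} -> {resp.status_code}")
    # Every candidate failed: report each URL with its status code.
    raise RuntimeError("No branch could be downloaded: " + "; ".join(tried))
```

With `requests`, the call could look like `download_first_available(candidate_urls(owner, repo), lambda url: requests.get(url, timeout=30))`, and the returned bytes can be wrapped in `io.BytesIO` exactly as in the function above.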

## Step 7: Use the Function

In [26]:
# Download and process a repository
task_manager_app_docs = read_repo_data('fsamura01', 'taREDACTED_OPENAI_KEY-app')
print(task_manager_app_docs[0])

{'content': '# Task Manager App\n\nA React-based task management application that allows users to create, edit, delete, and track tasks with due dates and completion status.\n\n## Features\n\n### Task Management\n- **Create Tasks**: Add new tasks with title, description, and due date\n- **Edit Tasks**: Modify existing tasks with inline editing\n- **Delete Tasks**: Remove tasks with confirmation dialog\n- **Toggle Completion**: Mark tasks as complete/incomplete with one click\n\n### User Experience\n- **Task Statistics**: View total, incomplete, and completed task counts\n- **Visual Feedback**: Different styling for completed vs incomplete tasks\n- **Loading States**: Clear feedback during API operations\n- **Error Handling**: Comprehensive error messages and validation\n\n### Form Validation\n- **Title**: Required, minimum 3 characters\n- **Description**: Required\n- **Due Date**: Required, cannot be in the past (for incomplete tasks)\n- **Real-time Validation**: Clear errors as user t

## Step 8: Inspect the Data

In [27]:
# Look at the first document
print(task_manager_app_docs[0])

{'content': '# Task Manager App\n\nA React-based task management application that allows users to create, edit, delete, and track tasks with due dates and completion status.\n\n## Features\n\n### Task Management\n- **Create Tasks**: Add new tasks with title, description, and due date\n- **Edit Tasks**: Modify existing tasks with inline editing\n- **Delete Tasks**: Remove tasks with confirmation dialog\n- **Toggle Completion**: Mark tasks as complete/incomplete with one click\n\n### User Experience\n- **Task Statistics**: View total, incomplete, and completed task counts\n- **Visual Feedback**: Different styling for completed vs incomplete tasks\n- **Loading States**: Clear feedback during API operations\n- **Error Handling**: Comprehensive error messages and validation\n\n### Form Validation\n- **Title**: Required, minimum 3 characters\n- **Description**: Required\n- **Due Date**: Required, cannot be in the past (for incomplete tasks)\n- **Real-time Validation**: Clear errors as user t

## Today’s Tasks (Day 2)

### 1. Simple Chunking

In [24]:
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result
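
A short string makes the window behavior easy to check by hand. The block below repeats `sliding_window` from the cell above so that it runs on its own; the inputs and the `size` and `step` values are chosen only for illustration.

```python
def sliding_window(seq, size, step):
    # Same function as above, repeated so this example is standalone.
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result


# Windows of 4 characters that advance by 2, so neighbours share 2 characters.
even = sliding_window("abcdefghij", size=4, step=2)
print([c['chunk'] for c in even])   # ['abcd', 'cdef', 'efgh', 'ghij']

# When the length does not line up, the last chunk is simply shorter.
ragged = sliding_window("abcde", size=4, step=2)
print([c['chunk'] for c in ragged])  # ['abcd', 'cde']
```

The early `break` stops the loop as soon as a window reaches the end of the sequence, which avoids emitting trailing chunks that are fully contained in the previous one.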

In [30]:
task_manager_app_chunks = []

for doc in task_manager_app_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    task_manager_app_chunks.extend(chunks)
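
With a 2000-character window and a 1000-character step, consecutive chunks overlap by 1000 characters, and the number of chunks per document follows from its length. The sketch below works out that count by arithmetic. `expected_chunk_count` is a hypothetical helper, and the formula assumes `step <= size`, which holds for the values used here.

```python
def expected_chunk_count(n, size=2000, step=1000):
    """Number of chunks sliding_window yields for a sequence of length n.

    Assumes 0 < step <= size; the formula does not hold when step > size.
    """
    if size <= 0 or step <= 0 or step > size:
        raise ValueError("requires 0 < step <= size")
    if n == 0:
        return 0  # empty input produces no chunks
    if n <= size:
        return 1  # the first window already covers everything
    # Smallest k with k * step >= n - size, plus the window starting at 0.
    return (n - size + step - 1) // step + 1


for length in [0, 1500, 2000, 2001, 4500]:
    print(length, expected_chunk_count(length))
```

For example, a 4500-character document yields 4 chunks, starting at offsets 0, 1000, 2000, and 3000.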

In [31]:
task_manager_app_chunks

[{'start': 0,
  'chunk': '# Task Manager App\n\nA React-based task management application that allows users to create, edit, delete, and track tasks with due dates and completion status.\n\n## Features\n\n### Task Management\n- **Create Tasks**: Add new tasks with title, description, and due date\n- **Edit Tasks**: Modify existing tasks with inline editing\n- **Delete Tasks**: Remove tasks with confirmation dialog\n- **Toggle Completion**: Mark tasks as complete/incomplete with one click\n\n### User Experience\n- **Task Statistics**: View total, incomplete, and completed task counts\n- **Visual Feedback**: Different styling for completed vs incomplete tasks\n- **Loading States**: Clear feedback during API operations\n- **Error Handling**: Comprehensive error messages and validation\n\n### Form Validation\n- **Title**: Required, minimum 3 characters\n- **Description**: Required\n- **Due Date**: Required, cannot be in the past (for incomplete tasks)\n- **Real-time Validation**: Clear err