In [104]:
import io
import zipfile
import requests
import frontmatter
import re
import json
import os
from dotenv import load_dotenv
from pathlib import Path
from tqdm.auto import tqdm

# 1: Ingest and Index Data

In [105]:
def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
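    prefix = 'https://codeload.github.com'
    # Assumes the repository's default branch is named 'main'
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    # A timeout keeps the call from hanging indefinitely on a stalled connection
    resp = requests.get(url, timeout=30)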
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data
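
The filtering step inside `read_repo_data` can be tried offline. The following is a minimal sketch that builds a small zip archive in memory instead of downloading one; the helper name `list_markdown_members` and the archive contents are hypothetical and only serve the illustration.

```python
import io
import zipfile


def list_markdown_members(zip_bytes):
    """Return the names of .md/.mdx members of a zip archive given as bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            info.filename
            for info in zf.infolist()
            # Lower-casing makes the extension check case-insensitive
            if info.filename.lower().endswith(('.md', '.mdx'))
        ]


# Build a tiny archive in memory (hypothetical file names and contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('docs-main/README.md', '# Readme')
    zf.writestr('docs-main/content/docs/faq.MDX', '---\ntitle: FAQ\n---\nBody')
    zf.writestr('docs-main/package.json', '{}')

markdown_files = list_markdown_members(buffer.getvalue())
print(markdown_files)
```

Only the two markdown members are kept; the `.json` file is skipped, and the upper-case `.MDX` extension still matches because the check runs on the lower-cased name.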

In [106]:
zenbrowser_docs = read_repo_data('zen-browser', 'docs')

print(f"Zen browser documents: {len(zenbrowser_docs)}")

Zen browser documents: 35


In [107]:
for file in zenbrowser_docs:
    print(file['filename'])

docs-main/CODE_OF_CONDUCT.md
docs-main/README.md
docs-main/content/docs/contribute/code-of-conduct.mdx
docs-main/content/docs/contribute/desktop/building.mdx
docs-main/content/docs/contribute/desktop/code-structure-and-prefs.mdx
docs-main/content/docs/contribute/desktop/index.mdx
docs-main/content/docs/contribute/docs/editing-with-vscode.mdx
docs-main/content/docs/contribute/docs/index.mdx
docs-main/content/docs/contribute/index.mdx
docs-main/content/docs/contribute/translation.mdx
docs-main/content/docs/contribute/www.mdx
docs-main/content/docs/faq.mdx
docs-main/content/docs/guides/1password.mdx
docs-main/content/docs/guides/about-config-flags.mdx
docs-main/content/docs/guides/generic-optimized.mdx
docs-main/content/docs/guides/live-editing.mdx
docs-main/content/docs/guides/manage-profiles.mdx
docs-main/content/docs/index.mdx
docs-main/content/docs/security.mdx
docs-main/content/docs/themes-store/themes-marketplace-preferences.mdx
docs-main/content/docs/themes-store/themes-marketplace

In [108]:
md_count = 0
mdx_count = 0

for file in zenbrowser_docs:
    if file['filename'].endswith('.md'):
        md_count += 1
    elif file['filename'].endswith('.mdx'):
        mdx_count += 1

print(f"Number of .md files: {md_count}")
print(f"Number of .mdx files: {mdx_count}")

Number of .md files: 2
Number of .mdx files: 33


In [109]:
zenbrowser_docs[0]

 'filename': 'docs-main/CODE_OF_CONDUCT.md'}

In [110]:
zenbrowser_docs[25]

{'title': 'Compact Mode',
 'description': 'Minimalistic interface for focused browsing',
 'content': 'import KeyboardShortcut from \'@/components/KeyboardShortcut\';\n\nCompact Mode is one of Zen\'s main features. It lets you hide all browser toolbars and gives a wider view of the website you\'re currently visiting.\n\nYou can activate this feature by right-click on an empty area on the `toolbar > "Compact Mode" > "Enable compact mode"`, or use <KeyboardShortcut shortcut="Alt + Ctrl + C" /> keyboard shortcut.\n\n{\n<div align="center">\n  <video width="100%" loop autoPlay>\n    <source src="/assets/user-manual/compact-mode/compact-mode.webm" />\n    Your browser does not support the video tag.\n  </video>\n</div>\n}\n\nIn Single Toolbar mode, activating Compact Mode will hide the tab sidebar. You can access the tab sidebar by hovering the side edge of the browser (based on whether `"Tabs on the right"` is activated or not).\n\nIn Multiple Toolbar or Collapsed Toolbar mode, you can choo

# 2: Chunking and Processing Data

## 2.1: Split by Paragraphs

In [111]:
text = zenbrowser_docs[12]['content']
paragraphs = re.split(r"\n\s*\n", text.strip())

In [112]:
text

'This Guide is designed to help you integrate [1Password Desktop App](https://1password.com/downloads) with Zen Browser, for a more **straight forward workflow** when accessing your credentials using this password manager browser extension.\n\n<Callout type="warn">\nThis guide only applies for **Linux** and **MacOS** users.\n\n**Windows** users can still use the Browser Extension without integration with the Desktop App\n\nSee: [Adding another trusted browser - 1Password](https://support.1password.com/1password-browser-connection-security/#adding-another-trusted-browser)\n</Callout>\n\n1Password browser integrations follows a [list of well-known/trusted browser](https://support.1password.com/1password-browser-connection-security/), with this integration account information and encryption keys are transferred using this connection to allow the 1Password app and browser extension to share your vaults and lock state and allowing you to unlock your Browser Extension Vault with [bio-metric]

In [113]:
for idx, paragraph in enumerate(paragraphs, start=1):
    print(f"Index {idx}: {paragraph}\n")

Index 1: This Guide is designed to help you integrate [1Password Desktop App](https://1password.com/downloads) with Zen Browser, for a more **straight forward workflow** when accessing your credentials using this password manager browser extension.

Index 2: <Callout type="warn">
This guide only applies for **Linux** and **MacOS** users.

Index 3: **Windows** users can still use the Browser Extension without integration with the Desktop App

Index 4: See: [Adding another trusted browser - 1Password](https://support.1password.com/1password-browser-connection-security/#adding-another-trusted-browser)
</Callout>

Index 5: 1Password browser integrations follows a [list of well-known/trusted browser](https://support.1password.com/1password-browser-connection-security/), with this integration account information and encryption keys are transferred using this connection to allow the 1Password app and browser extension to share your vaults and lock state and allowing you to unlock your Browser

In [114]:
len(paragraphs)

28

*Notes*
- For this type of document, splitting by paragraphs or sections does not make sense: context is lost, especially because the text contains many markup symbols. For example, the `<Callout>` block above is split across paragraphs 2 to 4.
- If the document were plain text, this approach would make more sense.
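
The loss of context can be reproduced on a short sample. This is a small sketch using the same regular expression as above; the `sample` string is a shortened, hypothetical stand-in for the 1Password guide.

```python
import re

# Shortened stand-in for an MDX document with a multi-paragraph component
sample = (
    'Intro sentence.\n\n'
    '<Callout type="warn">\n'
    'This guide only applies for **Linux** and **MacOS** users.\n\n'
    '**Windows** users can still use the Browser Extension.\n'
    '</Callout>\n\n'
    'Closing sentence.'
)

paragraphs = re.split(r"\n\s*\n", sample.strip())

# Find which paragraphs hold the opening and the closing tag
opening = [i for i, p in enumerate(paragraphs) if '<Callout' in p]
closing = [i for i, p in enumerate(paragraphs) if '</Callout>' in p]

print(len(paragraphs), opening, closing)
```

The opening and closing tags end up in different paragraphs, so neither chunk is a well-formed component on its own.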

## 2.2: Splitting by Sections

In [132]:
def split_markdown_by_level(text, level=2):
    """
    Split markdown text by a specific header level.
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """
    # This regex matches markdown headers
    # For level 2, it matches lines starting with "## "
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    # Split and keep the headers
    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        # We step by 3 because regex.split() with
        # capturing groups returns:
        # [before_match, group1, group2, after_match, ...]
        # here group1 is "## ", group2 is the header text
        header = parts[i] + parts[i+1]  # "## " + "Title"
        header = header.strip()

        # Get the content after this header
        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections
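
To see why the loop steps by 3, it may help to look at what `split()` returns for a pattern with two capturing groups. A minimal sketch on a hypothetical three-part string:

```python
import re

sample = "intro\n\n## A\n\ntext a\n\n## B\n\ntext b"
pattern = re.compile(r'^(#{2} )(.+)$', re.MULTILINE)

# With capturing groups, split() interleaves the groups with the text between matches
parts = pattern.split(sample)
print(parts)

# Indices 1, 4, ... hold the "## " prefix; the header text follows at i + 1
headers = [parts[i] + parts[i + 1] for i in range(1, len(parts), 3)]
print(headers)
```

Note that `parts[0]` (the text before the first header) is never visited by a loop that starts at index 1, so any introduction preceding the first level-2 header is not part of the returned sections.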

In [133]:
zenbrowser_sections = split_markdown_by_level(text, level=2)


In [134]:
zenbrowser_sections

['## Workarounds\n\nThat being said, there are workaround methods to add Zen Browser to this _Trusted Browsers_ list for **Linux** and **MacOS**.\n\n### Linux\n\nYou can create a _Custom Allowed Browsers_ file that 1Password will use to allow Zen Browser -- or other non-officially supported browser-- to integrate with 1Password\'s desktop app.\n\n#### 1. Create 1Password\'s config directory\n\n```bash\nsudo mkdir /etc/1password\n```\n\n#### 2. Create the Custom Allowed Browsers file\n\n```bash\nsudo touch /etc/1password/custom_allowed_browsers\n```\n\n#### 3. Add Zen Browser to this custom list\n\n```bash\necho "zen-bin" | sudo tee -a /etc/1password/custom_allowed_browsers\n```\n\n---\n\nSpecial thanks to [u/xmansyx](https://www.reddit.com/user/xmansyx/) and [u/feelspeaceman](https://www.reddit.com/user/feelspeaceman/)\n\nSources:\n\n- [1Password Integration fix (Linux) - Reddit](https://www.reddit.com/r/zen_browser/comments/1gcm33v/1password_integration_fix_linux/)\n- [1Password Exten

In [130]:
for doc_idx, doc in enumerate(zenbrowser_docs, start=1):
    print(f"Document {doc_idx}: {doc.get('filename', 'Unknown filename')}")
    content = doc.get('content', '')
    sections = split_markdown_by_level(content, level=2)
    
    for idx, section in enumerate(sections, start=1):
        print(f"  Index {idx} - Section content:\n{section}\n")

Document 1: docs-main/CODE_OF_CONDUCT.md
  Index 1 - Section content:
## 1. Purpose

A primary goal of the Quartz community is to be inclusive to the largest number of contributors, with the most varied and diverse backgrounds possible. As such, we are committed to providing a friendly, safe and welcoming environment for all, regardless of gender, sexual orientation, ability, ethnicity, socioeconomic status, and religion (or lack thereof).

This code of conduct outlines our expectations for all those who participate in our community, as well as the consequences for unacceptable behavior.

We invite all those who participate in the Quartz community to help us create safe and positive experiences for everyone.

  Index 2 - Section content:
## 2. Open [Source/Culture/Tech] Citizenship

A supplemental goal of this Code of Conduct is to increase open [source/culture/tech] citizenship by encouraging participants to recognize and strengthen the relationships between our actions and their effe

In [135]:
print(f"Number of sections: {len(zenbrowser_sections)}")

Number of sections: 1
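
The single section starts at `## Workarounds`, so the text that precedes the first level-2 header is not included. If that introduction should be kept as its own section, one possible variant is sketched below. The function name `split_keep_preamble` and the sample string are hypothetical, and the sketch assumes Python 3.7 or later, where `re.split` accepts patterns that match an empty string.

```python
import re


def split_keep_preamble(text, level=2):
    """Split markdown at headers of the given level, keeping any text before the first one."""
    # Zero-width lookahead: split *before* each header line without consuming it
    boundary = re.compile(r'^(?=#{' + str(level) + r'} )', re.MULTILINE)
    return [part.strip() for part in boundary.split(text) if part.strip()]


sample = "Intro before any header.\n\n## Workarounds\n\nBody.\n\n### Linux\n\nSteps."
sections = split_keep_preamble(sample)

for idx, section in enumerate(sections, start=1):
    print(f"Index {idx}: {section!r}")
```

Deeper headers such as `### Linux` stay inside their parent section, as they do in `split_markdown_by_level`.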


## 2.3: Simple Chunking with Sliding Window

In [115]:
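def sliding_window(seq, size, step):
    """
    Split a sequence into overlapping chunks.

    :param seq: Sequence to split (here, a document string)
    :param size: Maximum length of each chunk
    :param step: Offset between the starts of consecutive chunks
    :return: List of dicts with the 'start' index and the 'chunk' content
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        # Stop once the current window reaches the end of the sequence
        if i + size >= n:
            break

    return result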
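
A short worked example may make the overlap easier to see. The function is repeated here so the snippet runs on its own; the ten-letter input is only for illustration.

```python
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        result.append({'start': i, 'chunk': seq[i:i+size]})
        if i + size >= n:
            break
    return result


# Window of 4 characters, advancing 2 characters at a time
chunks = sliding_window("abcdefghij", 4, 2)
for chunk in chunks:
    print(chunk)
```

Each chunk shares its first two characters with the end of the previous one, and the loop stops as soon as a window reaches the end of the input.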

In [116]:
zenbrowser_chunks = []

for doc in zenbrowser_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    zenbrowser_chunks.extend(chunks)

In [119]:
zenbrowser_chunks

[{'start': 0,
  'chunk': '# Citizen Code of Conduct\n\n## 1. Purpose\n\nA primary goal of the Quartz community is to be inclusive to the largest number of contributors, with the most varied and diverse backgrounds possible. As such, we are committed to providing a friendly, safe and welcoming environment for all, regardless of gender, sexual orientation, ability, ethnicity, socioeconomic status, and religion (or lack thereof).\n\nThis code of conduct outlines our expectations for all those who participate in our community, as well as the consequences for unacceptable behavior.\n\nWe invite all those who participate in the Quartz community to help us create safe and positive experiences for everyone.\n\n## 2. Open [Source/Culture/Tech] Citizenship\n\nA supplemental goal of this Code of Conduct is to increase open [source/culture/tech] citizenship by encouraging participants to recognize and strengthen the relationships between our actions and their effects on our community.\n\nCommuniti

In [121]:
for idx, chunk in enumerate(zenbrowser_chunks, start=1):
    print(f"Index {idx} - Chunk content:\n{chunk['chunk']}\n")

Index 1 - Chunk content:
# Citizen Code of Conduct

## 1. Purpose

A primary goal of the Quartz community is to be inclusive to the largest number of contributors, with the most varied and diverse backgrounds possible. As such, we are committed to providing a friendly, safe and welcoming environment for all, regardless of gender, sexual orientation, ability, ethnicity, socioeconomic status, and religion (or lack thereof).

This code of conduct outlines our expectations for all those who participate in our community, as well as the consequences for unacceptable behavior.

We invite all those who participate in the Quartz community to help us create safe and positive experiences for everyone.

## 2. Open [Source/Culture/Tech] Citizenship

A supplemental goal of this Code of Conduct is to increase open [source/culture/tech] citizenship by encouraging participants to recognize and strengthen the relationships between our actions and their effects on our community.

Communities mirror the s

In [120]:
len(zenbrowser_chunks)

129

*Notes*
- With overlapping windows, context is preserved better, though this approach raises two concerns:
- First, the overlap creates memory overhead: with a size of 2000 and a step of 1000, most characters are stored in two chunks. It is not a good approach if memory efficiency is a priority.
- Second, each chunk is cut after a fixed number of characters, regardless of word or sentence boundaries. Will coherence be maintained?
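
Both concerns can be checked on a small synthetic string. The sketch below is only an illustration: the repeated "lorem ipsum dolor " text is a hypothetical stand-in for a real document, and the mid-word check is a simple heuristic rather than a proper coherence measure.

```python
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        result.append({'start': i, 'chunk': seq[i:i+size]})
        if i + size >= n:
            break
    return result


# Synthetic document of 5400 characters
text = "lorem ipsum dolor " * 300
chunks = sliding_window(text, 2000, 1000)

# Concern 1: how much text is stored compared with the original?
stored = sum(len(c['chunk']) for c in chunks)
overhead = stored / len(text)
print(f"chunks: {len(chunks)}, stored chars: {stored}, overhead: {overhead:.2f}x")

# Concern 2: how many chunks begin in the middle of a word?
mid_word_cuts = sum(
    1
    for c in chunks
    if c['start'] > 0
    and text[c['start'] - 1].isalnum()
    and text[c['start']].isalnum()
)
print(f"chunks starting mid-word: {mid_word_cuts}")
```

For longer documents the overhead approaches 2x with these settings, since the step is half the window size.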