
In [1]:
!pip install minsearch sentence-transformers tqdm python-frontmatter


Collecting minsearch
  Downloading minsearch-0.0.7-py3-none-any.whl.metadata (8.3 kB)
Collecting python-frontmatter
  Downloading python_frontmatter-1.1.0-py3-none-any.whl.metadata (4.1 kB)
Downloading minsearch-0.0.7-py3-none-any.whl (11 kB)
Downloading python_frontmatter-1.1.0-py3-none-any.whl (9.8 kB)
Installing collected packages: python-frontmatter, minsearch
Successfully installed minsearch-0.0.7 python-frontmatter-1.1.0


In [2]:

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.

    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name

    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com'
    # Note: this assumes the repository's default branch is named 'main'.
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url, timeout=60)

    if resp.status_code != 200:
        raise RuntimeError(f"Failed to download repository: {resp.status_code}")

    repository_data = []

    # The context manager closes the archive even if parsing raises.
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        for file_info in zf.infolist():
            filename = file_info.filename

            # Keep only markdown files (.md and .mdx), case-insensitively.
            if not filename.lower().endswith(('.md', '.mdx')):
                continue

            try:
                with zf.open(file_info) as f_in:
                    content = f_in.read().decode('utf-8', errors='ignore')
                # Split the YAML frontmatter (metadata) from the markdown body.
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
            except Exception as e:
                print(f"Error processing {filename}: {e}")
                continue

    return repository_data
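
The filtering step can be tried in isolation, without any network access. The sketch below builds a small zip archive in memory and applies the same `.md`/`.mdx` suffix check. The helper name `list_markdown_files` and the archive contents are made up for illustration and are not part of the notebook.

```python
import io
import zipfile

def list_markdown_files(zip_bytes):
    """Return the names of .md/.mdx entries in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if info.filename.lower().endswith(('.md', '.mdx'))
        ]

# Build a toy archive (hypothetical file names, not the real repository).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('demo-main/README.md', '# Demo')
    zf.writestr('demo-main/docs/guide.MDX', '# Guide')
    zf.writestr('demo-main/setup.py', 'print("hi")')

markdown_files = list_markdown_files(buffer.getvalue())
print(markdown_files)
```

Lower-casing the name before the suffix check is what lets `guide.MDX` through while `setup.py` is skipped.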


In [3]:
docs = read_repo_data('letta-ai', 'letta')

print(f"Letta documents: {len(docs)}")

Letta documents: 17


# Text Search (Lexical Search)

In [4]:
from minsearch import Index

# docs[0].keys() shows no "question" field in these documents,
# so only the markdown body is indexed.
faq_index = Index(
    text_fields=["content"],
    keyword_fields=[]
)

faq_index.fit(docs)

<minsearch.minsearch.Index at 0x7dc3b4884530>

In [5]:
query = "Can I join the course now?"
text_results = faq_index.search(query)

text_results

[{'content': '# 🚀 How to Contribute to Letta\n\nThank you for investing time in contributing to our project! Here\'s a guide to get you started.\n\n## 1. 🚀 Getting Started\n\n### 🍴 Fork the Repository\n\nFirst things first, let\'s get you a personal copy of Letta to play with. Think of it as your very own playground. 🎪\n\n1. Head over to the Letta repository on GitHub.\n2. In the upper-right corner, hit the \'Fork\' button.\n\n### 🚀 Clone the Repository\n\nNow, let\'s bring your new playground to your local machine.\n\n```shell\ngit clone https://github.com/your-username/letta.git\n```\n\n### 🧩 Install dependencies & configure environment\n\n#### Install uv and dependencies\n\nFirst, install uv using [the official instructions here](https://docs.astral.sh/uv/getting-started/installation/).\n\nOnce uv is installed, navigate to the letta directory and install the Letta project with uv:\n```shell\ncd letta\neval $(uv env activate)\nuv sync --all-extras\n```\n#### Setup PostgreSQL environm
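
The query above is about joining a course, yet the search still returns Letta documents. A lexical index scores documents by the words they share with the query, so common words alone are enough to produce a match. The sketch below is a hand-rolled TF-IDF scorer in plain Python that shows this effect on toy sentences. It is not minsearch's implementation, and the function names and sample texts are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_scores(query, documents):
    """Score each document by summing tf * idf over the query terms it contains."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                # Smoothed idf: rarer terms get a larger weight.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1
                score += tf[term] * idf
        scores.append(score)
    return scores

# Toy documents (hypothetical, not taken from the repository).
toy_docs = [
    "You can fork the repository and open a pull request.",
    "Install the dependencies with uv.",
    "The course starts in September and you can still join.",
]
scores = tfidf_scores("Can I join the course now?", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(scores, best)
```

Every toy document gets a non-zero score because each contains "the", but the third one wins since it also shares the rarer terms "join" and "course".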

# Vector Search

In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np

embedding_model = SentenceTransformer("multi-qa-distilbert-cos-v1")


In [7]:
print(docs[0].keys())


dict_keys(['name', 'about', 'title', 'labels', 'assignees', 'content', 'filename'])


In [8]:
faq_embeddings = []

for d in docs:
    # Join whichever of these fields are present and non-empty
    # into a single string to embed.
    text_parts = [d[key] for key in ("title", "about", "content") if d.get(key)]
    text = " ".join(text_parts)

    emb = embedding_model.encode(text)
    faq_embeddings.append(emb)

faq_embeddings = np.array(faq_embeddings)

In [9]:
from minsearch import VectorSearch

faq_vindex = VectorSearch()
faq_vindex.fit(faq_embeddings, docs)


<minsearch.vector.VectorSearch at 0x7dc37711cfe0>

In [10]:
query = "I just found out about the course. Can I enroll?"
q = embedding_model.encode(query)

vector_results = faq_vindex.search(q)
vector_results


[{'content': '# About\nThese certs are used to set up a localhost https connection to the ADE.\n\n## Instructions\n1. Install [mkcert](https://github.com/FiloSottile/mkcert)\n2. Run `mkcert -install`\n3. Run letta with the environment variable `LOCAL_HTTPS=true`\n4. Access the app at [https://app.letta.com/development-servers/local/dashboard](https://app.letta.com/development-servers/local/dashboard)\n5. Click "Add remote server" and enter `https://localhost:8283` as the URL, leave password blank unless you have secured your ADE with a password.',
  'filename': 'letta-main/certs/README.md'},
 {'content': ...,
  'filename': 'letta-main/TERMS.md'},
 {'content': '# Letta + local LLMs\n\nSee [https://letta.readme.io/docs/local_llm](https://letta.readme.io/docs/local_llm) for documentation on running Letta with custom LLM backends.',
  'filename': 'letta-main/letta/local_llm/README.md'},
 {'content': ...,
  'filename': 'letta-main/PRIVACY.md'},
 {'content': '# 🚀 How to Contribute to Letta\n\nThank you for investing time in co
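
Vector search ranks documents by how close their embeddings are to the query embedding, so it always returns the nearest neighbours, even when no document is really about the query. The sketch below shows that ranking with cosine similarity in plain Python on two-dimensional toy vectors. It assumes cosine similarity as the distance measure (the model name ends in `cos`); the function names and vectors are illustrative and not taken from minsearch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs, num_results=5):
    """Return document indices ordered from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:num_results]]

# Toy 2-D "embeddings" (hypothetical; real embeddings have hundreds of dimensions).
toy_doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query_vec = [0.9, 0.1]

ranking = rank_by_similarity(toy_query_vec, toy_doc_vecs)
print(ranking)
```

Because the ranking is relative, a top hit with a low similarity score is still returned first. Checking the raw scores is one way to detect that no document is a good match.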

# Hybrid Search

In [11]:
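def text_search(query):
    """Lexical search over the text index (top 5 results)."""
    return faq_index.search(query, num_results=5)

def vector_search(query):
    """Embed the query and search the vector index (top 5 results)."""
    q = embedding_model.encode(query)
    return faq_vindex.search(q, num_results=5)

def hybrid_search(query):
    """Merge lexical and vector results, deduplicated by filename.

    Text results come first, so a document found by both searches
    keeps its text-search position in the final list.
    """
    text_results = text_search(query)
    vector_results = vector_search(query)

    seen_ids = set()
    final_results = []

    for r in text_results + vector_results:
        doc_id = r.get("filename")

        if doc_id not in seen_ids:
            seen_ids.add(doc_id)
            final_results.append(r)

    return final_results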
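
`hybrid_search` concatenates the two result lists, so every text result outranks every vector-only result. One common alternative is reciprocal rank fusion (RRF), which rewards documents that rank well in both lists. The sketch below is an illustration of that technique, not what this notebook does. The function name, the `id_field` parameter, and the toy results are assumptions, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(result_lists, k=60, id_field="filename"):
    """Fuse ranked result lists: each hit contributes 1 / (k + rank) to its document."""
    scores = {}
    docs_by_id = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            doc_id = doc[id_field]
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
            # Keep the first copy of each document we encounter.
            docs_by_id.setdefault(doc_id, doc)
    ordered_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_id[doc_id] for doc_id in ordered_ids]

# Toy ranked lists (hypothetical file names).
toy_text = [{"filename": "a.md"}, {"filename": "b.md"}, {"filename": "c.md"}]
toy_vector = [{"filename": "c.md"}, {"filename": "a.md"}, {"filename": "d.md"}]

fused = reciprocal_rank_fusion([toy_text, toy_vector])
fused_names = [doc["filename"] for doc in fused]
print(fused_names)
```

Here `a.md` and `c.md` appear in both lists and move ahead of `b.md`, which only the text search found.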

In [12]:
query = "Can I enroll now?"
results = hybrid_search(query)

results


[{'content': '# 🚀 How to Contribute to Letta\n\nThank you for investing time in contributing to our project! Here\'s a guide to get you started.\n\n## 1. 🚀 Getting Started\n\n### 🍴 Fork the Repository\n\nFirst things first, let\'s get you a personal copy of Letta to play with. Think of it as your very own playground. 🎪\n\n1. Head over to the Letta repository on GitHub.\n2. In the upper-right corner, hit the \'Fork\' button.\n\n### 🚀 Clone the Repository\n\nNow, let\'s bring your new playground to your local machine.\n\n```shell\ngit clone https://github.com/your-username/letta.git\n```\n\n### 🧩 Install dependencies & configure environment\n\n#### Install uv and dependencies\n\nFirst, install uv using [the official instructions here](https://docs.astral.sh/uv/getting-started/installation/).\n\nOnce uv is installed, navigate to the letta directory and install the Letta project with uv:\n```shell\ncd letta\neval $(uv env activate)\nuv sync --all-extras\n```\n#### Setup PostgreSQL environm

In [13]:
for i, r in enumerate(results[:3], 1):
    print(f"\n🔹 Result {i}")
    print("File:", r["filename"])
    print("Preview:")
    print(r["content"][:300])


🔹 Result 1
File: letta-main/CONTRIBUTING.md
Preview:
# 🚀 How to Contribute to Letta

Thank you for investing time in contributing to our project! Here's a guide to get you started.

## 1. 🚀 Getting Started

### 🍴 Fork the Repository

First things first, let's get you a personal copy of Letta to play with. Think of it as your very own playground. 🎪

1.

🔹 Result 2
File: letta-main/.github/pull_request_template.md
Preview:
**Please describe the purpose of this pull request.**
Is it to add a new feature? Is it to fix a bug?

**How to test**
How can we test your PR during review? What commands should we run? What outcomes should we expect?

**Have you tested this PR?**
Have you tested the latest commit on the PR? If so 

🔹 Result 3
File: letta-main/letta/plugins/README.md
Preview:
### Plugins

Plugins enable plug and play for various components.

Plugin configurations can be set in `letta.settings.settings`.

The plugins will take a delimited list of consisting of individual plugin config