# Lab 08: Mini RAG System - Tutorial

## Building a Complete Search Pipeline

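This lab combines what you have learned so far into a single pipeline: load documents, split them into chunks, index the chunks, and search them by keyword.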

## The RAG Pipeline

```
1. LOAD    → Read markdown files
2. CHUNK   → Split into smaller pieces  
3. INDEX   → Store chunks with metadata
4. SEARCH  → Find relevant chunks
5. RANK    → Sort by relevance
6. RETURN  → Give top results
```

## Step 1: Load Documents

```python
def load_document(filepath):
    """Read one file and return its contents as a string."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.read()


def load_all_documents(filepaths):
    """Load several files into a list of {'path', 'content'} dicts."""
    documents = []
    for path in filepaths:
        content = load_document(path)
        documents.append({'path': path, 'content': content})
    return documents
```
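
The loader above expects an explicit list of paths. As a minimal sketch, assuming your markdown files sit together in one folder, a helper like the one below could build that list. `find_markdown_files` is a hypothetical name and not part of the lab code.

```python
from pathlib import Path


def find_markdown_files(directory):
    """Return sorted paths (as strings) of all .md files directly inside directory.

    Hypothetical helper for illustration; it does not descend into subfolders.
    """
    return sorted(str(p) for p in Path(directory).glob('*.md'))
```

The result can be passed straight to the loader, for example `load_all_documents(find_markdown_files('data'))`.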

## Step 2: Chunk Documents

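```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunk_size-character pieces that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # An overlap >= chunk_size would never advance and loop forever
        raise ValueError('overlap must be >= 0 and smaller than chunk_size')
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # this chunk reached the end; skip the redundant tail chunk
        start = end - overlap  # step back so neighbouring chunks overlap
    return chunks


def chunk_documents(documents, chunk_size=200, overlap=50):
    """Chunk every document and tag each chunk with an id and its source path."""
    all_chunks = []
    for doc in documents:
        chunks = chunk_text(doc['content'], chunk_size, overlap)
        for i, chunk in enumerate(chunks):
            all_chunks.append({
                'id': f"{doc['path']}_{i}",
                'source': doc['path'],
                'text': chunk
            })
    return all_chunks
```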
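
With the default settings, each chunk starts `chunk_size - overlap = 150` characters after the previous one, so consecutive chunks share 50 characters. The sketch below makes that arithmetic visible by returning character offsets instead of text. `chunk_spans` is a hypothetical helper written for this illustration; it stops as soon as a chunk reaches the end of the text.

```python
def chunk_spans(text_len, chunk_size=200, overlap=50):
    """Return (start, end) character offsets for each chunk.

    Hypothetical helper for illustration only.
    """
    step = chunk_size - overlap  # distance between consecutive chunk starts
    if step <= 0:
        raise ValueError('overlap must be smaller than chunk_size')
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break  # the text is fully covered
        start += step
    return spans


print(chunk_spans(520))
# → [(0, 200), (150, 350), (300, 500), (450, 520)]
```

A larger overlap reduces the chance that a sentence is cut off at a chunk boundary, at the cost of more chunks to store and search.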

## Step 3: Search and Rank

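```python
def search_chunks(query, chunks, top_k=5):
    """Return the top_k chunks containing the query, most occurrences first."""
    query = query.strip().lower()
    if not query:
        return []  # str.count('') would otherwise give every chunk a positive score

    # Score each chunk by how many times the query appears in it
    scored = []
    for chunk in chunks:
        score = chunk['text'].lower().count(query)
        if score > 0:
            scored.append({'score': score, **chunk})

    # Sort by score, highest first
    scored.sort(key=lambda x: x['score'], reverse=True)

    return scored[:top_k]
```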
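
`search_chunks` counts the whole query as one substring, so a query such as `'fever rash'` only matches that exact phrase, and `'fever'` also matches inside `'feverish'`. As a hedged variation, not the lab's method, the sketch below scores a text by counting whole-word matches for each query word. `score_by_terms` and the sample chunks are made up for illustration.

```python
import re


def score_by_terms(query, text):
    """Count how many words in text match any word of the query.

    Hypothetical alternative scorer; matching is case-insensitive
    and only whole words count.
    """
    words = re.findall(r'\w+', text.lower())
    terms = set(re.findall(r'\w+', query.lower()))
    return sum(1 for w in words if w in terms)


# Made-up sample chunks
chunks = [
    {'id': 'a_0', 'text': 'Fever and rash are listed. Fever is listed twice.'},
    {'id': 'b_0', 'text': 'This chunk mentions fever once.'},
    {'id': 'c_0', 'text': 'The word feverish is not a whole-word match.'},
]
ranked = sorted(chunks, key=lambda c: score_by_terms('fever rash', c['text']), reverse=True)
print([c['id'] for c in ranked])
# → ['a_0', 'b_0', 'c_0']
```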

## Complete Mini RAG System

```python
class MiniRAG:
    def __init__(self):
        self.chunks = []

    def load(self, filepaths):
        """Load documents from files."""
        documents = load_all_documents(filepaths)
        print(f'Loaded {len(documents)} documents')
        return documents

    def index(self, documents, chunk_size=200):
        """Chunk and index documents."""
        self.chunks = chunk_documents(documents, chunk_size)
        print(f'Created {len(self.chunks)} chunks')

    def search(self, query, top_k=5):
        """Search for relevant chunks."""
        return search_chunks(query, self.chunks, top_k)

# Usage:
# rag = MiniRAG()
# docs = rag.load(['data/rubella.md', 'data/cholera.md'])
# rag.index(docs)
# results = rag.search('fever')
```
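
Each search result is a dict with `score`, `id`, `source`, and `text` keys. For the final RETURN step of the pipeline, a small formatter makes the results easier to read. This is a minimal sketch: `format_results` and its `preview_chars` parameter are hypothetical, and the sample result is made up.

```python
def format_results(results, preview_chars=60):
    """Format search results as one line each: rank, score, source, text preview.

    Hypothetical helper; expects dicts with 'score', 'source', and 'text' keys.
    """
    lines = []
    for rank, r in enumerate(results, start=1):
        # Collapse newlines and repeated spaces so each preview fits on one line
        preview = ' '.join(r['text'].split())[:preview_chars]
        lines.append(f"{rank}. [score={r['score']}] {r['source']}: {preview}")
    return '\n'.join(lines)


# Made-up result in the shape returned by the search step
sample = [
    {'score': 2, 'id': 'data/rubella.md_0', 'source': 'data/rubella.md',
     'text': 'fever first\nfever second'},
]
print(format_results(sample))
# → 1. [score=2] data/rubella.md: fever first fever second
```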

## Summary

You've built a mini RAG system that:
1. ✅ Loads markdown files
2. ✅ Chunks text into pieces
3. ✅ Searches by keyword
4. ✅ Ranks results by how often the query appears

**Next steps:** Add embeddings and vector search!
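
To get a feel for vector search before using a real embedding model, the sketch below ranks texts by cosine similarity between word-count vectors. This is an assumption-laden stand-in: word counts are not learned embeddings, and `to_vector`, `cosine_similarity`, and `vector_search` are hypothetical names. The ranking step would stay the same if `to_vector` were swapped for a real embedding function.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn text into a sparse word-count vector (a crude stand-in for an embedding)."""
    return Counter(re.findall(r'\w+', text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query, texts, top_k=5):
    """Rank texts by cosine similarity to the query, highest first."""
    q = to_vector(query)
    scored = [(cosine_similarity(q, to_vector(t)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Unlike the keyword counter, cosine similarity normalises by vector length, so a long chunk does not win simply because it contains more words.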