# 🔍 RAG (Retrieval Augmented Generation) - Interactive Lab

## Learn how to build AI systems that can search and understand massive document repositories!

**What you'll learn:**
- How RAG solves the challenge of connecting AI to large document collections
- Vector embeddings and semantic search
- Different chunking strategies and when to use them
- Building production-ready RAG systems
- Hands-on implementation with real documents

**Prerequisites:**
- Basic Python knowledge
- Understanding of machine learning concepts (helpful but not required)

**Time to complete:** 60-90 minutes

## 📚 Table of Contents

1. [Introduction to RAG](#intro)
2. [Setup and Dependencies](#setup)
3. [Loading Documents](#loading)
4. [Chunking Strategies](#chunking)
5. [Vector Embeddings](#embeddings)
6. [Vector Databases](#vectordb)
7. [Semantic Search](#search)
8. [Complete RAG Pipeline](#pipeline)
9. [Advanced Topics](#advanced)
10. [Exercises](#exercises)

<a id='intro'></a>
## 1. 🎯 Introduction to RAG

### The Problem

Imagine you have:
- **1000s of company documents** (policies, reports, documentation)
- **Questions** that need answers from these documents
- **AI assistants** (like ChatGPT) that can answer questions BUT...

❌ ChatGPT doesn't know about YOUR specific documents

❌ Can't feed 1000s of pages into a single prompt (context limits)

❌ Documents change frequently

### The Solution: RAG

**RAG = Retrieval Augmented Generation**

Think of it as giving AI a "search engine" for your documents:

```
User Question → [RAG System] → Relevant Docs → AI → Answer
```

**How it works:**

1. **📄 Index Phase** (done once, then refreshed whenever documents change):
   - Split documents into chunks
   - Convert chunks to vectors (embeddings)
   - Store in vector database

2. **🔍 Query Phase** (every question):
   - Convert question to vector
   - Find similar document chunks (semantic search)
   - Feed relevant chunks + question to AI
   - Get accurate, grounded answer
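
The two phases above can be sketched end to end in plain Python. This is a toy illustration only: the `embed` function below is a hypothetical stand-in that counts words instead of calling a real embedding model, the three policy sentences are invented, and the final step just builds the prompt rather than calling an LLM.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Index phase: chunk the documents, embed each chunk, store the pairs
toy_chunks = [
    "Employees may work remotely up to three days per week.",
    "Full-time employees accrue vacation days every month.",
    "The quarterly sales report is published in January.",
]
toy_index = [(chunk, embed(chunk)) for chunk in toy_chunks]

def retrieve(question: str, k: int = 1) -> list:
    """Query phase: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(toy_index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question: str, k: int = 1) -> str:
    """Combine the retrieved chunks and the question into an LLM prompt."""
    context = "\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("How many vacation days do employees get?"))
```

The rest of this lab replaces each toy piece with a real one: a neural embedding model for `embed`, a vector database for `toy_index`, and an LLM call after the prompt is built.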

### Real-World Applications

- **Customer Support**: Answer questions from product documentation
- **Legal**: Search through contracts and case law
- **Healthcare**: Query medical research papers
- **Enterprise**: Internal knowledge management systems
- **Education**: Intelligent tutoring systems

<a id='setup'></a>
## 2. 🛠️ Setup and Dependencies

Let's install the required libraries:

In [None]:
# Install required packages (run once)
!pip install -q sentence-transformers faiss-cpu chromadb nltk tiktoken openai anthropic

In [None]:
# Import libraries
import os
import re
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# For text processing
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
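nltk.download('punkt', quiet=True)
# Newer NLTK releases load the sentence tokenizer from 'punkt_tab' instead
nltk.download('punkt_tab', quiet=True)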

# For embeddings
from sentence_transformers import SentenceTransformer

# For vector databases
import faiss
import chromadb
from chromadb.config import Settings

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

print("✅ All libraries imported successfully!")

<a id='loading'></a>
## 3. 📄 Loading Documents

We'll work with realistic company documents:
- Employee policies
- Product documentation  
- Sales reports

These represent typical enterprise knowledge bases.

In [None]:
# Document loader class
class DocumentLoader:
    """Load and manage documents from various sources"""
    
    def __init__(self, documents_dir: str):
        self.documents_dir = documents_dir
        self.documents = {}
    
    def load_txt_files(self) -> Dict[str, str]:
        """Load all .txt files from the documents directory"""
        for filename in os.listdir(self.documents_dir):
            if filename.endswith('.txt'):
                filepath = os.path.join(self.documents_dir, filename)
                with open(filepath, 'r', encoding='utf-8') as f:
                    self.documents[filename] = f.read()
        
        print(f"✅ Loaded {len(self.documents)} documents")
        return self.documents
    
    def get_document_stats(self) -> pd.DataFrame:
        """Get statistics about loaded documents"""
        stats = []
        for name, content in self.documents.items():
            stats.append({
                'Document': name,
                'Characters': len(content),
                'Words': len(content.split()),
                'Lines': len(content.split('\n'))
            })
        return pd.DataFrame(stats)

# Load our sample documents
loader = DocumentLoader('data/sample_documents')
documents = loader.load_txt_files()

# Display stats
print("\n📊 Document Statistics:")
print(loader.get_document_stats().to_string(index=False))

# Preview first document
print("\n📄 Preview of first document:")
first_doc = list(documents.values())[0]
print(first_doc[:500] + "...\n")

<a id='chunking'></a>
## 4. ✂️ Chunking Strategies

### Why Chunk Documents?

Documents are often too large to process at once. We need to split them into **chunks**:

- **Too small**: Loses context ("the company" - which company?)
- **Too large**: Too much irrelevant info, exceeds AI context limits
- **Just right**: Balances context and precision

### Three Chunking Strategies

We'll implement and compare:

1. **Fixed-Size Chunking**: Split every N characters/tokens
2. **Sentence-Based Chunking**: Group complete sentences
3. **Semantic Chunking**: Split at topic boundaries

In [None]:
class TextChunker:
    """Different strategies for chunking text documents"""
    
    @staticmethod
    def fixed_size_chunking(
        text: str, 
        chunk_size: int = 500, 
        overlap: int = 50
    ) -> List[str]:
        """Split text into fixed-size chunks with overlap
        
        Args:
            text: Input text to chunk
            chunk_size: Number of characters per chunk
            overlap: Number of overlapping characters between chunks
        """
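        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        
        chunks = []
        start = 0
        
        while start < len(text):
            end = start + chunk_size
            chunk = text[start:end]
            
            # Try to break at a word boundary (the final chunk is kept whole)
            if end < len(text):
                last_space = chunk.rfind(' ')
                if last_space > overlap:
                    chunk = chunk[:last_space]
            
            if chunk.strip():
                chunks.append(chunk.strip())
            if end >= len(text):
                break
            # Advance from where this chunk actually ended, so trimmed text
            # is not skipped and the loop always moves forward
            start += len(chunk) - overlap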
        
        return chunks
    
    @staticmethod
    def sentence_based_chunking(
        text: str, 
        sentences_per_chunk: int = 5
    ) -> List[str]:
        """Group sentences into chunks
        
        Args:
            text: Input text to chunk
            sentences_per_chunk: Number of sentences per chunk
        """
        sentences = sent_tokenize(text)
        chunks = []
        
        for i in range(0, len(sentences), sentences_per_chunk):
            chunk = ' '.join(sentences[i:i + sentences_per_chunk])
            chunks.append(chunk)
        
        return chunks
    
    @staticmethod
    def semantic_chunking(
        text: str,
        section_pattern: str = r'\n\n+|SECTION \d+'
    ) -> List[str]:
        """Split text at semantic boundaries (sections, paragraphs)
        
        Args:
            text: Input text to chunk
            section_pattern: Regex pattern to identify sections
        """
        # Split on section markers or double newlines
        chunks = re.split(section_pattern, text)
        
        # Filter empty chunks and very short ones
        chunks = [c.strip() for c in chunks if len(c.strip()) > 100]
        
        return chunks

# Test different chunking strategies
print("🔬 Testing Chunking Strategies\n")

sample_doc = documents['company_policies.txt']
chunker = TextChunker()

# Strategy 1: Fixed-size
fixed_chunks = chunker.fixed_size_chunking(sample_doc, chunk_size=500, overlap=50)
print(f"✂️ Fixed-size chunking: {len(fixed_chunks)} chunks")
print(f"   Average length: {np.mean([len(c) for c in fixed_chunks]):.0f} chars\n")

# Strategy 2: Sentence-based
sentence_chunks = chunker.sentence_based_chunking(sample_doc, sentences_per_chunk=5)
print(f"📝 Sentence-based chunking: {len(sentence_chunks)} chunks")
print(f"   Average length: {np.mean([len(c) for c in sentence_chunks]):.0f} chars\n")

# Strategy 3: Semantic
semantic_chunks = chunker.semantic_chunking(sample_doc)
print(f"🎯 Semantic chunking: {len(semantic_chunks)} chunks")
print(f"   Average length: {np.mean([len(c) for c in semantic_chunks]):.0f} chars\n")

# Preview a chunk from each strategy
print("\n📋 Sample chunks:\n")
print("Fixed-size chunk:")
print(fixed_chunks[0][:200] + "...\n")
print("\nSentence-based chunk:")
print(sentence_chunks[0][:200] + "...\n")
print("\nSemantic chunk:")
print(semantic_chunks[0][:200] + "...\n")

### 🎓 Chunking Strategy Comparison

| Strategy | Pros | Cons | Best For |
|----------|------|------|----------|
| **Fixed-size** | Simple, predictable size | May break sentences mid-way | Uniform processing, simple documents |
| **Sentence-based** | Preserves sentence integrity | Variable chunk sizes | Q&A systems, natural language |
| **Semantic** | Preserves topic coherence | Requires good structure | Structured docs, technical docs |

**Pro Tip**: Start with sentence-based chunking - it's a good balance!
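
One refinement worth trying on top of sentence-based chunking is to let neighbouring chunks share a sentence, so that context at a chunk boundary is not lost. The sketch below is a minimal illustration, not part of the lab's `TextChunker`: the function name `sentence_chunks_with_overlap` is hypothetical, and it uses a naive regex splitter instead of NLTK so that it runs on its own.

```python
import re
from typing import List

def sentence_chunks_with_overlap(
    text: str,
    sentences_per_chunk: int = 3,
    overlap: int = 1,
) -> List[str]:
    """Group sentences into chunks that share `overlap` sentences with the previous chunk."""
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than sentences_per_chunk")

    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # the last sentence is already covered
    return chunks

demo_text = "One. Two. Three. Four. Five."
for chunk in sentence_chunks_with_overlap(demo_text, sentences_per_chunk=3, overlap=1):
    print(chunk)
```

With three sentences per chunk and an overlap of one, the sentence "Three." appears at the end of the first chunk and the start of the second. More overlap costs more storage and embedding time, so small values are a reasonable starting point.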

<a id='embeddings'></a>
## 5. 🧠 Vector Embeddings

### What are Embeddings?

**Embeddings** convert text into numbers (vectors) that capture **meaning**:

```
"remote work policy" → [0.23, -0.45, 0.67, ...] (384 numbers)
"working from home"  → [0.21, -0.43, 0.69, ...] (similar numbers!)
```

**Key insight**: Similar meaning = similar vectors!

This enables **semantic search**: Find documents by meaning, not just keywords.
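
"Similar vectors" is usually measured with cosine similarity, the cosine of the angle between two vectors. The sketch below computes it by hand. The first two vectors reuse the first three numbers from the illustration above (real embeddings have hundreds of dimensions), and the `sales_report` vector is a made-up contrast case.

```python
import math
from typing import Sequence

def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

remote_work = [0.23, -0.45, 0.67]      # "remote work policy" (truncated illustration)
work_from_home = [0.21, -0.43, 0.69]   # "working from home" (truncated illustration)
sales_report = [-0.60, 0.30, 0.10]     # hypothetical unrelated text

print(f"remote work vs. working from home: {cosine_sim(remote_work, work_from_home):.3f}")
print(f"remote work vs. sales report:      {cosine_sim(remote_work, sales_report):.3f}")
```

The first pair scores close to 1 because the vectors point in almost the same direction, while the unrelated pair scores below 0.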

### Popular Embedding Models

- **sentence-transformers**: Fast, runs locally
- **OpenAI embeddings** (e.g. `text-embedding-ada-002`): High quality, API-based
- **Cohere**: Great for multilingual

We'll use `sentence-transformers` - it's free and fast!

In [None]:
class EmbeddingGenerator:
    """Generate vector embeddings for text"""
    
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """Initialize embedding model
        
        Popular models:
        - all-MiniLM-L6-v2: Fast, 384 dimensions (default)
        - all-mpnet-base-v2: Higher quality, 768 dimensions
        - multi-qa-mpnet-base-dot-v1: Optimized for Q&A
        """
        print(f"Loading embedding model: {model_name}...")
        self.model = SentenceTransformer(model_name)
        print(f"✅ Model loaded! Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
    
    def embed_texts(self, texts: List[str]) -> np.ndarray:
        """Generate embeddings for a list of texts"""
        embeddings = self.model.encode(texts, show_progress_bar=True)
        return embeddings
    
    def embed_single(self, text: str) -> np.ndarray:
        """Generate embedding for a single text"""
        return self.model.encode([text])[0]

# Initialize embedding generator
embedding_gen = EmbeddingGenerator()

# Generate embeddings for our chunks
print("\n🔄 Generating embeddings for document chunks...")
chunks_to_embed = sentence_chunks[:10]  # Use first 10 chunks for demo
chunk_embeddings = embedding_gen.embed_texts(chunks_to_embed)

print(f"\n✅ Generated {len(chunk_embeddings)} embeddings")
print(f"   Shape: {chunk_embeddings.shape}")
print(f"   Each chunk is represented by {chunk_embeddings.shape[1]} numbers")

In [None]:
# Visualize embeddings in 2D using PCA
print("\n📊 Visualizing Embeddings\n")

# Reduce dimensions for visualization
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(chunk_embeddings)

# Plot
plt.figure(figsize=(12, 8))
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], s=100, alpha=0.6)

# Label each point with its chunk index
for i, (x, y) in enumerate(embeddings_2d):
    plt.annotate(f"Chunk {i}", (x, y), fontsize=9, alpha=0.7)

plt.title("Document Chunks in 2D Embedding Space\n(Similar chunks are closer together)", fontsize=14)
plt.xlabel("PCA Dimension 1")
plt.ylabel("PCA Dimension 2")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("💡 Interpretation: Chunks with similar content appear closer in this space!")

### 🧪 Experiment: Semantic Similarity

Let's test if embeddings truly capture meaning:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Test semantic similarity
test_queries = [
    "remote work from home",
    "salary and compensation",
    "vacation time off"
]

print("🔍 Testing Semantic Similarity\n")
print("We'll find which chunks are most similar to each query:\n")

for query in test_queries:
    # Embed query
    query_embedding = embedding_gen.embed_single(query)
    
    # Calculate similarity with all chunks
    similarities = cosine_similarity([query_embedding], chunk_embeddings)[0]
    
    # Get top match
    top_idx = np.argmax(similarities)
    top_score = similarities[top_idx]
    
    print(f"Query: '{query}'")
    print(f"Best match (score: {top_score:.3f}):")
    print(f"{chunks_to_embed[top_idx][:200]}...\n")
    print("-" * 80 + "\n")

<a id='vectordb'></a>
## 6. 🗄️ Vector Databases

### Why Vector Databases?

Once you have embeddings, you need **fast similarity search**:

- Traditional DB: "Find WHERE id = 123" ✅
- Vector DB: "Find most similar to [0.23, -0.45, ...]" ✅

**Challenge**: Comparing millions of high-dimensional vectors is slow!

**Solution**: Vector databases use clever algorithms (HNSW, IVF) for fast approximate search.
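
To see what those algorithms are avoiding, here is a minimal sketch of the exhaustive baseline: compare the query against every stored vector and keep the closest `k`. The function names and toy vectors are illustrative assumptions. The cost grows linearly with the number of vectors, which is fine for thousands of chunks but slow for millions.

```python
import heapq
from typing import List, Sequence, Tuple

def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(
    query: Sequence[float],
    vectors: List[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (index, squared distance) for the k nearest vectors, closest first."""
    scored = ((i, squared_l2(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
for index, distance in brute_force_search([1.0, 1.0], toy_vectors, k=2):
    print(f"vector {index}: squared distance {distance:.2f}")
```

Approximate indexes such as HNSW and IVF trade a small amount of accuracy for speed by examining only a subset of the vectors per query.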

### Popular Vector Databases

1. **FAISS** (Meta AI, formerly Facebook AI): A similarity-search library rather than a full database; very fast, in-memory, runs locally
2. **ChromaDB**: Easy to use, persistent storage
3. **Pinecone**: Managed cloud service
4. **Weaviate**: Full-featured, production-grade

We'll demo both FAISS and ChromaDB!

### Option 1: FAISS (Fast, In-Memory)

In [None]:
class FAISSVectorStore:
    """Vector store using FAISS for fast similarity search"""
    
    def __init__(self, dimension: int):
        """Initialize FAISS index
        
        Args:
            dimension: Embedding dimension (e.g., 384 for MiniLM)
        """
        # Create a flat (exact) index for small datasets
        self.index = faiss.IndexFlatL2(dimension)
        self.chunks = []
        self.metadata = []
    
    def add_documents(self, chunks: List[str], embeddings: np.ndarray, metadata: List[Dict] = None):
        """Add documents to the index"""
        # FAISS requires float32
        embeddings = embeddings.astype('float32')
        
        self.index.add(embeddings)
        self.chunks.extend(chunks)
        
        if metadata is not None:
            self.metadata.extend(metadata)
        else:
            # One fresh dict per chunk ([{}] * n would share a single dict)
            self.metadata.extend({} for _ in chunks)
        
        print(f"✅ Added {len(chunks)} documents. Total: {self.index.ntotal}")
    
    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Tuple[str, float, Dict]]:
        """Search for most similar documents
        
        Args:
            query_embedding: Query vector
            k: Number of results to return
            
        Returns:
            List of (chunk, distance, metadata) tuples. The distance is the
            squared L2 distance from IndexFlatL2; lower means more similar.
        """
        query_embedding = query_embedding.astype('float32').reshape(1, -1)
        
        # Search
        distances, indices = self.index.search(query_embedding, k)
        
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx == -1:
                continue  # FAISS pads with -1 when fewer than k vectors are indexed
            results.append((
                self.chunks[idx],
                float(dist),
                self.metadata[idx]
            ))
        
        return results

# Create FAISS vector store
print("🏗️ Building FAISS Vector Store\n")

# Take the dimension from the embeddings so it always matches the model
vector_store = FAISSVectorStore(dimension=chunk_embeddings.shape[1])

# Add our embedded chunks
metadata = [{'chunk_id': i, 'source': 'company_policies.txt'} for i in range(len(chunks_to_embed))]
vector_store.add_documents(chunks_to_embed, chunk_embeddings, metadata)

print(f"\n📊 Vector store stats:")
print(f"   Total documents: {vector_store.index.ntotal}")
print(f"   Dimension: {vector_store.index.d}")

In [None]:
# Test FAISS search
print("\n🔍 Testing FAISS Search\n")

query = "What is the remote work policy?"
query_embedding = embedding_gen.embed_single(query)

results = vector_store.search(query_embedding, k=3)

print(f"Query: '{query}'\n")
print("Top 3 Results:\n")

for i, (chunk, distance, meta) in enumerate(results, 1):
    print(f"{i}. [Distance: {distance:.4f}] Source: {meta['source']}")
    print(f"   {chunk[:200]}...\n")

### Option 2: ChromaDB (Persistent, Easy to Use)

In [None]:
class ChromaVectorStore:
    """Vector store using ChromaDB"""
    
    def __init__(self, collection_name: str = "documents", persist_directory: str = "./chroma_db"):
        """Initialize ChromaDB"""
        # chromadb >= 0.4 rejects the legacy Settings(chroma_db_impl=...) configuration;
        # PersistentClient stores the collection on disk under `path`
        self.client = chromadb.PersistentClient(path=persist_directory)
        
        # Create or get collection
        self.collection = self.client.get_or_create_collection(name=collection_name)
        print(f"✅ ChromaDB collection '{collection_name}' ready")
    
    def add_documents(self, chunks: List[str], embeddings: np.ndarray = None, metadata: List[Dict] = None):
        """Add documents to ChromaDB"""
        ids = [f"doc_{i}" for i in range(len(chunks))]
        
        if embeddings is not None:
            embeddings = embeddings.tolist()
        
        self.collection.add(
            documents=chunks,
            embeddings=embeddings,
            metadatas=metadata,
            ids=ids
        )
        
        print(f"✅ Added {len(chunks)} documents to ChromaDB")
    
    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Tuple[str, float, Dict]]:
        """Search for similar documents"""
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=k
        )
        
        output = []
        for doc, dist, meta in zip(
            results['documents'][0],
            results['distances'][0],
            results['metadatas'][0]
        ):
            output.append((doc, dist, meta))
        
        return output

# Create ChromaDB vector store
print("\n🏗️ Building ChromaDB Vector Store\n")

chroma_store = ChromaVectorStore(collection_name="company_docs")
chroma_store.add_documents(chunks_to_embed, chunk_embeddings, metadata)

# Test ChromaDB search
print("\n🔍 Testing ChromaDB Search\n")

query = "How much PTO do employees get?"
query_embedding = embedding_gen.embed_single(query)

results = chroma_store.search(query_embedding, k=3)

print(f"Query: '{query}'\n")
print("Top 3 Results:\n")

for i, (chunk, distance, meta) in enumerate(results, 1):
    print(f"{i}. [Distance: {distance:.4f}]")
    print(f"   {chunk[:200]}...\n")

### 🎯 FAISS vs ChromaDB Comparison

| Feature | FAISS | ChromaDB |
|---------|-------|----------|
| **Speed** | ⚡ Fastest | 🚀 Fast |
| **Persistence** | In-memory (can save) | Built-in persistence |
| **Ease of use** | More code | Very simple |
| **Filtering** | Manual | Built-in metadata filters |
| **Best for** | Maximum speed, large scale | Quick prototypes, small-medium scale |

**Recommendation**: 
- Prototyping? Use **ChromaDB**
- Production with millions of docs? Use **FAISS** or **Pinecone**

<a id='search'></a>
## 7. 🔎 Semantic Search

Now let's build a complete semantic search system!

In [None]:
class SemanticSearchEngine:
    """Complete semantic search system"""
    
    def __init__(self, embedding_model: EmbeddingGenerator, vector_store):
        self.embedding_model = embedding_model
        self.vector_store = vector_store
    
    def index_documents(self, documents: Dict[str, str], chunking_strategy: str = 'sentence'):
        """Index all documents"""
        print("📚 Indexing documents...\n")
        
        all_chunks = []
        all_metadata = []
        chunker = TextChunker()
        
        for doc_name, content in documents.items():
            print(f"Processing: {doc_name}")
            
            # Chunk document
            if chunking_strategy == 'sentence':
                chunks = chunker.sentence_based_chunking(content)
            elif chunking_strategy == 'fixed':
                chunks = chunker.fixed_size_chunking(content)
            else:
                chunks = chunker.semantic_chunking(content)
            
            # Create metadata
            metadata = [{'source': doc_name, 'chunk_id': i} for i in range(len(chunks))]
            
            all_chunks.extend(chunks)
            all_metadata.extend(metadata)
        
        # Generate embeddings
        print(f"\n🔄 Generating embeddings for {len(all_chunks)} chunks...")
        embeddings = self.embedding_model.embed_texts(all_chunks)
        
        # Add to vector store
        self.vector_store.add_documents(all_chunks, embeddings, all_metadata)
        
        print(f"\n✅ Indexing complete! {len(all_chunks)} chunks indexed.")
    
    def search(self, query: str, k: int = 5, verbose: bool = True) -> List[Dict]:
        """Search for relevant documents"""
        # Embed query
        query_embedding = self.embedding_model.embed_single(query)
        
        # Search
        results = self.vector_store.search(query_embedding, k=k)
        
        # Format results
        formatted_results = []
        for chunk, score, metadata in results:
            formatted_results.append({
                'text': chunk,
                'score': score,
                'source': metadata.get('source', 'unknown'),
                'chunk_id': metadata.get('chunk_id', -1)
            })
        
        if verbose:
            self._display_results(query, formatted_results)
        
        return formatted_results
    
    def _display_results(self, query: str, results: List[Dict]):
        """Pretty print search results"""
        print(f"\n🔍 Query: '{query}'\n")
        print("=" * 80)
        
        for i, result in enumerate(results, 1):
            print(f"\n{i}. [{result['source']}] Distance: {result['score']:.4f} (lower is closer)")
            print("-" * 80)
            print(result['text'][:300] + "..." if len(result['text']) > 300 else result['text'])
            print()

# Build complete search engine
print("🚀 Building Semantic Search Engine\n")

# Create fresh vector store for all documents
search_vector_store = FAISSVectorStore(
    dimension=embedding_gen.model.get_sentence_embedding_dimension()
)
search_engine = SemanticSearchEngine(embedding_gen, search_vector_store)

# Index all documents
search_engine.index_documents(documents, chunking_strategy='sentence')

In [None]:
# Test search with various queries
test_queries = [
    "What is the remote work policy?",
    "How do I install the software on Mac?",
    "What were the Q4 sales numbers?",
    "Tell me about parental leave benefits",
    "How do I fix sync issues?"
]

print("\n" + "=" * 80)
print("🧪 SEMANTIC SEARCH DEMO")
print("=" * 80)

for query in test_queries:
    results = search_engine.search(query, k=2)
    print("\n" + "=" * 80 + "\n")

<a id='pipeline'></a>
## 8. 🏗️ Complete RAG Pipeline

Now let's build the full RAG system: **Retrieval + Generation**

We'll integrate with an LLM to generate answers based on retrieved context.

In [None]:
class RAGSystem:
    """Complete Retrieval Augmented Generation system"""
    
    def __init__(self, search_engine: SemanticSearchEngine, llm_provider: str = 'mock'):
        """
        Args:
            search_engine: Semantic search engine for retrieval
            llm_provider: 'openai', 'anthropic', or 'mock' for demo
        """
        self.search_engine = search_engine
        self.llm_provider = llm_provider
    
    def query(self, question: str, k: int = 3, verbose: bool = True) -> Dict:
        """Process a question through the RAG pipeline"""
        
        # Step 1: Retrieve relevant documents
        if verbose:
            print(f"\n{'='*80}")
            print(f"📥 QUESTION: {question}")
            print(f"{'='*80}\n")
            print("🔍 Step 1: Retrieving relevant documents...\n")
        
        relevant_docs = self.search_engine.search(question, k=k, verbose=False)
        
        if verbose:
            print(f"✅ Found {len(relevant_docs)} relevant chunks\n")
            for i, doc in enumerate(relevant_docs, 1):
                print(f"{i}. [{doc['source']}] (distance: {doc['score']:.4f})")
            print()
        
        # Step 2: Build context
        context = self._build_context(relevant_docs)
        
        if verbose:
            print("📝 Step 2: Building context for LLM...")
            print(f"   Context length: {len(context)} characters\n")
        
        # Step 3: Generate answer
        if verbose:
            print("🤖 Step 3: Generating answer...\n")
        
        answer = self._generate_answer(question, context)
        
        # Display results
        if verbose:
            print(f"{'='*80}")
            print("💡 ANSWER:")
            print(f"{'='*80}\n")
            print(answer)
            print(f"\n{'='*80}")
            print("📚 SOURCES:")
            print(f"{'='*80}\n")
            sources = list(set([doc['source'] for doc in relevant_docs]))
            for source in sources:
                print(f"  - {source}")
            print()
        
        return {
            'question': question,
            'answer': answer,
            'sources': relevant_docs,
            'context': context
        }
    
    def _build_context(self, docs: List[Dict]) -> str:
        """Build context string from retrieved documents"""
        context_parts = []
        for i, doc in enumerate(docs, 1):
            context_parts.append(f"[Source {i}: {doc['source']}]\n{doc['text']}")
        return "\n\n".join(context_parts)
    
    def _generate_answer(self, question: str, context: str) -> str:
        """Generate answer using LLM"""
        
        prompt = f"""Answer the question based on the context below. If the answer cannot be found in the context, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""
        
        if self.llm_provider == 'mock':
            # Mock response for demo (in real system, call OpenAI/Anthropic)
            return self._mock_llm_response(question, context)
        
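        # Real LLM integration would go here:
        # elif self.llm_provider == 'openai':
        #     return self._call_openai(prompt)
        # elif self.llm_provider == 'anthropic':
        #     return self._call_anthropic(prompt)
        
        # Fail loudly instead of silently returning None for an unconfigured provider
        raise NotImplementedError(f"LLM provider '{self.llm_provider}' is not configured")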
    
    def _mock_llm_response(self, question: str, context: str) -> str:
        """Generate a mock response (for demo purposes)"""
        # In a real system, this would call an actual LLM
        # For now, we'll extract relevant parts of the context
        
        # Simple extraction: return the first 400 chars of the top-ranked chunk
        first_chunk = context.split("\n\n")[0]
        
        return f"""Based on the company documentation:

{first_chunk[:400]}...

[NOTE: This is a mock response. In production, connect to OpenAI/Anthropic for actual AI-generated answers.]"""

# Create RAG system
print("\n🏗️ Building Complete RAG System\n")
rag_system = RAGSystem(search_engine, llm_provider='mock')
print("✅ RAG System ready!\n")

In [None]:
# Test RAG system with sample questions
sample_questions = [
    "What is the company's remote work policy?",
    "How do I troubleshoot sync issues with CloudSync?",
    "What were the top performing regions in Q4?"
]

for question in sample_questions:
    result = rag_system.query(question, k=2)
    print("\n" + "*" * 80 + "\n")

### 🔌 Connecting to Real LLMs

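To use a real LLM for answer generation, add one of the methods below to `RAGSystem`, uncomment the matching branch in `_generate_answer`, and set the provider's API key: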

#### OpenAI Integration:
```python
from openai import OpenAI

def _call_openai(self, prompt):
    # openai>=1.0 client API; reads OPENAI_API_KEY from the environment
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )
    return response.choices[0].message.content
```

#### Anthropic (Claude) Integration:
```python
import os
import anthropic

def _call_anthropic(self, prompt):
    client = anthropic.Anthropic(api_key=os.environ['ANTHROPIC_API_KEY'])
    response = client.messages.create(
        model="claude-3-opus-20240229",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024
    )
    return response.content[0].text
```
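
Real LLMs have a finite context window, so once a real provider is connected it can help to cap how much retrieved text goes into the prompt. The sketch below is one possible approach, not part of the `RAGSystem` class above: the function name `build_context_within_budget` and the `max_chars` parameter are assumptions, and character count is only a rough proxy for tokens. It uses the same `[Source i: name]` format as `_build_context`.

```python
from typing import Dict, List

def build_context_within_budget(docs: List[Dict], max_chars: int = 2000) -> str:
    """Concatenate retrieved chunks in rank order until the character budget is used up."""
    parts = []
    used = 0
    for i, doc in enumerate(docs, 1):
        part = f"[Source {i}: {doc['source']}]\n{doc['text']}"
        separator = 2 if parts else 0  # length of the "\n\n" joiner
        if used + separator + len(part) > max_chars:
            break  # stop at the first chunk that does not fit
        parts.append(part)
        used += separator + len(part)
    return "\n\n".join(parts)

# Hypothetical retrieved chunks, ordered best match first
demo_docs = [
    {"source": "a.txt", "text": "x" * 50},
    {"source": "b.txt", "text": "y" * 50},
    {"source": "c.txt", "text": "z" * 50},
]
context = build_context_within_budget(demo_docs, max_chars=150)
print(f"{len(context)} characters of context")
```

Because results arrive best match first, dropping from the end discards the least relevant chunks. For a tighter bound, the character count could be replaced with a token count from a tokenizer such as `tiktoken`, which the setup cell installs.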

<a id='advanced'></a>
## 9. 🚀 Advanced Topics

### Hybrid Search (Keyword + Semantic)

Combine traditional keyword search with semantic search for best results:

In [None]:
class HybridSearch:
    """Combine keyword and semantic search"""
    
    def __init__(self, search_engine: SemanticSearchEngine):
        self.search_engine = search_engine
        self.chunks = []
    
    def keyword_search(self, query: str, chunks: List[str], k: int = 5) -> List[Tuple[str, float]]:
        """Simple keyword-based search using TF-IDF"""
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity
        
        # Create TF-IDF vectors
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(chunks + [query])
        
        # Calculate similarity
        query_vector = tfidf_matrix[-1]
        doc_vectors = tfidf_matrix[:-1]
        similarities = cosine_similarity(query_vector, doc_vectors)[0]
        
        # Get top k
        top_indices = np.argsort(similarities)[::-1][:k]
        
        return [(chunks[i], similarities[i]) for i in top_indices]
    
    def hybrid_search(self, query: str, chunks: List[str], k: int = 5, 
                     semantic_weight: float = 0.7) -> List[str]:
        """Combine keyword and semantic search
        
        Args:
            semantic_weight: Weight for semantic search (0-1)
                            keyword_weight = 1 - semantic_weight
        """
        # Get semantic results
        semantic_results = self.search_engine.search(query, k=k*2, verbose=False)
        
        # Get keyword results  
        keyword_results = self.keyword_search(query, chunks, k=k*2)
        
        # Combine scores (both fall between 0 and 1, but they are not normalized to a common scale)
        combined_scores = {}
        
        # Add semantic scores
        for result in semantic_results:
            text = result['text']
            # Convert distance to similarity (inverse)
            score = 1 / (1 + result['score'])
            combined_scores[text] = score * semantic_weight
        
        # Add keyword scores
        keyword_weight = 1 - semantic_weight
        for text, score in keyword_results:
            if text in combined_scores:
                combined_scores[text] += score * keyword_weight
            else:
                combined_scores[text] = score * keyword_weight
        
        # Sort by combined score
        sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        
        return [text for text, score in sorted_results[:k]]

print("🔬 Hybrid Search Demo\n")
print("Combining keyword matching with semantic understanding...\n")

hybrid = HybridSearch(search_engine)

# Test query
query = "PTO vacation days"
# Read the indexed chunks straight from the vector store
# (a "*" query is not a wildcard for vector search)
all_chunks = search_engine.vector_store.chunks

results = hybrid.hybrid_search(query, all_chunks[:50], k=3, semantic_weight=0.7)

print(f"Query: '{query}'\n")
print("Top 3 Hybrid Results:\n")
for i, result in enumerate(results, 1):
    print(f"{i}. {result[:200]}...\n")
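
The weighted sum above mixes two scores that live on different scales. A common alternative, not used in this lab's `HybridSearch`, is reciprocal rank fusion (RRF), which ignores the raw scores and combines only the rank positions. The sketch below is a minimal version under stated assumptions: the function name, the chunk IDs, and the two example rankings are made up for illustration, and `rrf_k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(
    rankings: List[List[str]],
    rrf_k: int = 60,
    top_n: int = 3,
) -> List[str]:
    """Fuse ranked lists: each item scores the sum of 1 / (rrf_k + rank) over all lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical result lists, best match first
semantic_ranking = ["chunk_a", "chunk_b", "chunk_c"]
keyword_ranking = ["chunk_b", "chunk_d", "chunk_a"]

print(reciprocal_rank_fusion([semantic_ranking, keyword_ranking]))
```

Items that rank well in both lists rise to the top, and no score normalization is needed because only positions matter.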

### Re-ranking for Better Results

In [None]:
class Reranker:
    """Re-rank search results for better relevance"""
    
    def __init__(self, model_name: str = 'cross-encoder/ms-marco-MiniLM-L-6-v2'):
        """Initialize cross-encoder for re-ranking"""
        from sentence_transformers import CrossEncoder
        print(f"Loading re-ranking model: {model_name}...")
        self.model = CrossEncoder(model_name)
        print("✅ Re-ranker ready!")
    
    def rerank(self, query: str, documents: List[str], top_k: int = 5) -> List[Tuple[str, float]]:
        """Re-rank documents by relevance to query"""
        # Create query-document pairs
        pairs = [[query, doc] for doc in documents]
        
        # Score all pairs
        scores = self.model.predict(pairs)
        
        # Sort by score
        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)
        
        return scored_docs[:top_k]

# Demo re-ranking
print("\n🎯 Re-ranking Demo\n")
print("Re-ranking improves precision by using more sophisticated relevance scoring...\n")

reranker = Reranker()

query = "What are the system requirements?"
candidates = [doc['text'] for doc in search_engine.search(query, k=10, verbose=False)]

reranked = reranker.rerank(query, candidates, top_k=3)

print(f"\nQuery: '{query}'\n")
print("Top 3 Re-ranked Results:\n")
for i, (doc, score) in enumerate(reranked, 1):
    print(f"{i}. [Score: {score:.4f}]")
    print(f"   {doc[:200]}...\n")

<a id='exercises'></a>
## 10. 🎓 Exercises

Try these challenges to deepen your understanding:

### Exercise 1: Custom Chunking Strategy
Create a chunking strategy that splits on section headers (e.g., "SECTION 1:", "###")

### Exercise 2: Add Metadata Filtering
Extend the search to filter by document source (e.g., only search in "product_documentation.txt")

### Exercise 3: Evaluation Metrics
Implement precision@k and recall@k to measure search quality

### Exercise 4: Multi-query RAG
Generate multiple variations of the user's question and retrieve docs for each

### Exercise 5: Add Your Own Documents
Add your own text files to the `data/sample_documents` folder and index them!

In [None]:
# Exercise 1: Custom Chunking Strategy
# YOUR CODE HERE

def custom_chunking(text: str) -> List[str]:
    """Split text on section headers"""
    # Hint: Use regex to find patterns like "SECTION 1:", "###", etc.
    pass

# Test your chunking strategy
# chunks = custom_chunking(documents['company_policies.txt'])
# print(f"Created {len(chunks)} chunks")

In [None]:
# Exercise 2: Metadata Filtering
# YOUR CODE HERE

def search_with_filter(search_engine, query: str, source_filter: str = None):
    """Search with optional source filtering"""
    # Hint: Retrieve more results than needed, then filter by metadata
    pass

# Test
# results = search_with_filter(search_engine, "sales", source_filter="sales_reports.txt")

## 🎉 Congratulations!

You've built a complete RAG system from scratch! You now understand:

✅ What RAG is and why it's important

✅ Different document chunking strategies

✅ How vector embeddings capture semantic meaning

✅ Vector databases for fast similarity search

✅ Building a complete RAG pipeline

✅ Advanced techniques (hybrid search, re-ranking)

### 📚 Next Steps

1. **Connect to real LLMs**: Integrate OpenAI or Anthropic APIs
2. **Scale up**: Try with larger document collections
3. **Production deployment**: Use managed services like Pinecone or Weaviate
4. **Add chat history**: Build a conversational RAG system
5. **Fine-tune**: Experiment with different embedding models

### 🔗 Resources

- [LangChain RAG Tutorial](https://python.langchain.com/docs/use_cases/question_answering/)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [FAISS Documentation](https://github.com/facebookresearch/faiss/wiki)
- [Sentence Transformers](https://www.sbert.net/)

Happy building! 🚀