# Module 9: RAG Foundations - Vector Databases & Retrieval

**Retrieval-Augmented Generation (RAG)** is one of the most impactful patterns in modern AI engineering. Instead of relying solely on what an LLM memorized during training, RAG allows us to **ground** model responses in real, up-to-date, domain-specific information.

## Why RAG?

Large Language Models have three fundamental limitations:

1. **Knowledge cutoff** - LLMs only know what was in their training data. They cannot answer questions about events or documents created after their training date.
2. **Hallucination** - When an LLM does not know an answer, it may confidently generate plausible-sounding but incorrect information.
3. **No domain-specific knowledge** - General-purpose LLMs lack deep expertise in your company's proprietary data, internal documents, or niche domains.

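RAG addresses all three limitations by **retrieving** relevant documents and **providing them as context** to the LLM before generation. It reduces hallucination rather than eliminating it, since the model can still misread or ignore the retrieved context.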

## RAG Architecture Overview

```
                         RAG Pipeline
                         ============

  User Query
      |
      v
 +-----------+     +-----------+     +----------------+     +-----------+
 |           |     |  Vector   |     |                |     |           |
 |   Embed   |---->|  Database |---->|   Augment      |---->|  Generate |
 |   Query   |     |  Search   |     |   Prompt with  |     |  Answer   |
 |           |     |  (Top-K)  |     |   Retrieved    |     |  (LLM)    |
 +-----------+     +-----------+     |   Documents    |     +-----------+
                         ^           +----------------+           |
                         |                                        v
                   +------------+                           Final Response
                   | Document   |                           (Grounded in
                   | Embeddings |                            real data)
                   | (offline)  |
                   +------------+
```

### What we will cover in this notebook:
1. Why RAG matters (with a simulated demo)
2. Document loading and chunking strategies
3. Embedding documents for semantic search
4. Vector databases with ChromaDB
5. Distance metrics explained
6. Metadata filtering
7. Landscape of vector databases

---
## 1. Setup

First, let's install the required packages and load our environment.

In [None]:
!pip install -q chromadb openai cohere python-dotenv datasets sentence-transformers matplotlib numpy

In [None]:
from dotenv import load_dotenv
import os
load_dotenv("/home/amir/source/.env")  # adjust this path to point at your own .env file

# Core libraries
import json
import numpy as np
import matplotlib.pyplot as plt

# Embeddings
from sentence_transformers import SentenceTransformer

# Vector database
import chromadb

print("All imports successful!")

---
## 2. Why RAG? Seeing the Problem First-Hand

Let's demonstrate the core problem RAG solves. We will:
1. Simulate asking an LLM about information it **cannot** know (fictional company data)
2. Show how providing context fixes the answer

The responses below are hard-coded, so no LLM API call is needed to illustrate the principle.

In [None]:
# Simulating the RAG problem
# Imagine asking an LLM: "What were NovaTech Solutions' Q3 2025 revenue figures?"
# The LLM would either hallucinate numbers or say it doesn't know.

# Without context (what a naive LLM might do):
query = "What were NovaTech Solutions' Q3 2025 revenue figures?"
naive_response = (
    "I don't have specific information about NovaTech Solutions' Q3 2025 "
    "revenue figures. My training data may not include this information. "
    "Please check their latest earnings report or SEC filings."
)
print("QUERY:", query)
print("\nNAIVE LLM RESPONSE (no RAG):")
print(naive_response)

# Now, with context provided (RAG approach):
context_document = """
NovaTech Solutions Q3 2025 Earnings Report:
- Total Revenue: $4.2 billion (up 23% YoY)
- Cloud Services Revenue: $2.8 billion (up 31% YoY)
- Enterprise Software Revenue: $1.1 billion (up 12% YoY)
- Professional Services Revenue: $0.3 billion (down 5% YoY)
- Net Income: $680 million
- Adjusted EBITDA: $1.1 billion
"""

augmented_response = (
    "Based on NovaTech Solutions' Q3 2025 earnings report, their total revenue "
    "was $4.2 billion, representing a 23% year-over-year increase. Cloud Services "
    "led the growth at $2.8 billion (up 31% YoY), followed by Enterprise Software "
    "at $1.1 billion (up 12% YoY). Net income was $680 million."
)

print("\n" + "="*60)
print("\nRAG-AUGMENTED RESPONSE (with retrieved context):")
print(augmented_response)
print("\n--- Retrieved Context ---")
print(context_document)

**Key insight:** The LLM's response quality is dramatically improved when we provide relevant context. The entire RAG pipeline exists to **automatically find and provide that context** for any given query.

The rest of this notebook focuses on building that retrieval mechanism step by step.
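The "augment" step itself is little more than string assembly. The following is a minimal sketch of it, not a prescribed template: the helper name `build_rag_prompt`, the instruction wording, and the snippet numbering are all assumptions made for illustration.

```python
def build_rag_prompt(query, retrieved_docs):
    """Assemble an augmented prompt from a query and retrieved text snippets.

    Hypothetical helper for illustration; the instruction wording is an assumption.
    """
    # Number each snippet so the model (and a human reader) can refer back to it
    context_block = "\n\n".join(
        f"[{i}] {doc.strip()}" for i, doc in enumerate(retrieved_docs, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_rag_prompt(
    "What were NovaTech Solutions' Q3 2025 revenue figures?",
    ["NovaTech Solutions Q3 2025 Earnings Report:\n- Total Revenue: $4.2 billion (up 23% YoY)"],
)
print(prompt)
```

In a full pipeline, `retrieved_docs` would be the top-K results returned by the vector database rather than a hand-written list.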

---
## 3. Document Loading

The first step in any RAG pipeline is loading the source documents. In production you would load from PDFs, web pages, databases, APIs, etc. Here we use a curated set of sample documents about AI/ML topics for portability.

In [None]:
# Sample knowledge base - 15 paragraphs about AI/ML topics
# In production, these would come from files, databases, APIs, etc.

documents = [
    {
        "id": "doc_01",
        "text": "Transformer architectures have revolutionized natural language processing since their introduction in the 2017 paper 'Attention Is All You Need' by Vaswani et al. The key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input when producing each part of the output. Unlike recurrent neural networks (RNNs) that process tokens sequentially, transformers can process all tokens in parallel, making them significantly faster to train on modern GPU hardware. The original transformer uses an encoder-decoder structure, but subsequent models have used encoder-only (BERT) or decoder-only (GPT) variants.",
        "category": "architectures",
        "source": "textbook",
        "year": 2023
    },
    {
        "id": "doc_02",
        "text": "Retrieval-Augmented Generation (RAG) combines the strengths of retrieval-based and generative AI systems. In a RAG pipeline, a user query is first used to search a knowledge base for relevant documents. These documents are then provided as context to a large language model, which generates a response grounded in the retrieved information. RAG significantly reduces hallucination compared to pure generative approaches, because the model can reference actual source documents rather than relying solely on parametric knowledge stored during training.",
        "category": "rag",
        "source": "research_paper",
        "year": 2024
    },
    {
        "id": "doc_03",
        "text": "Vector databases are purpose-built storage systems optimized for storing, indexing, and querying high-dimensional vector embeddings. Unlike traditional databases that search by exact matches or keyword overlap, vector databases find items by semantic similarity. Popular vector databases include Pinecone (cloud-native), ChromaDB (lightweight, open-source), FAISS (Facebook's library for efficient similarity search), Weaviate (open-source with hybrid search), and Qdrant (Rust-based, high performance). The choice of vector database depends on scale, deployment requirements, and feature needs.",
        "category": "infrastructure",
        "source": "blog_post",
        "year": 2024
    },
    {
        "id": "doc_04",
        "text": "Embedding models convert text into dense numerical vectors that capture semantic meaning. Words or sentences with similar meanings will have vectors that are close together in the embedding space. Modern embedding models like OpenAI's text-embedding-ada-002, Cohere's embed-v3, and open-source alternatives like all-MiniLM-L6-v2 from Sentence Transformers can produce high-quality embeddings for similarity search. The dimensionality of embeddings varies: MiniLM produces 384-dimensional vectors, while OpenAI's ada-002 produces 1536-dimensional vectors.",
        "category": "embeddings",
        "source": "textbook",
        "year": 2024
    },
    {
        "id": "doc_05",
        "text": "Fine-tuning is the process of taking a pre-trained model and further training it on a smaller, task-specific dataset. This allows the model to adapt its general knowledge to perform well on specific domains or tasks. Common fine-tuning approaches include full fine-tuning (updating all parameters), LoRA (Low-Rank Adaptation, which adds small trainable matrices), and QLoRA (quantized LoRA, which reduces memory requirements). Fine-tuning a 7B parameter model with LoRA can be done on a single consumer GPU with 16GB of VRAM, making it accessible to individual practitioners.",
        "category": "training",
        "source": "tutorial",
        "year": 2024
    },
    {
        "id": "doc_06",
        "text": "Chunking is a critical preprocessing step in RAG pipelines. Long documents must be split into smaller pieces (chunks) before embedding, because embedding models have token limits and because smaller chunks lead to more precise retrieval. Common chunking strategies include: fixed-size chunking (splitting every N characters with optional overlap), sentence-based chunking (splitting on sentence boundaries), and recursive chunking (trying larger separators first, then falling back to smaller ones). The optimal chunk size depends on the use case, but 256-512 tokens is a common starting point.",
        "category": "rag",
        "source": "tutorial",
        "year": 2024
    },
    {
        "id": "doc_07",
        "text": "Large Language Models (LLMs) are neural networks trained on massive text corpora to predict the next token in a sequence. Models like GPT-4, Claude, Llama, and Gemini have demonstrated remarkable capabilities in understanding and generating human language. These models typically have billions of parameters and are trained on trillions of tokens of text data. The training process involves two main phases: pre-training on general text data (unsupervised), and alignment through techniques like RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization).",
        "category": "architectures",
        "source": "textbook",
        "year": 2024
    },
    {
        "id": "doc_08",
        "text": "Cosine similarity is the most commonly used distance metric for comparing text embeddings. It measures the cosine of the angle between two vectors, ranging from -1 (opposite) to 1 (identical direction). Cosine similarity is preferred over Euclidean distance for text because it is invariant to vector magnitude, focusing purely on direction. This means two documents about the same topic will have high cosine similarity regardless of their length. Other useful metrics include dot product (which factors in magnitude) and Euclidean distance (L2 norm of the difference vector).",
        "category": "embeddings",
        "source": "textbook",
        "year": 2023
    },
    {
        "id": "doc_09",
        "text": "Prompt engineering is the practice of crafting effective prompts to guide LLM behavior. Key techniques include: zero-shot prompting (no examples), few-shot prompting (providing examples in the prompt), chain-of-thought (asking the model to reason step by step), and system prompts (setting the model's role and behavior guidelines). In RAG systems, prompt engineering is crucial for the generation step. The retrieved documents must be formatted clearly in the prompt, and the model should be instructed to base its answer only on the provided context to minimize hallucination.",
        "category": "techniques",
        "source": "tutorial",
        "year": 2024
    },
    {
        "id": "doc_10",
        "text": "Attention mechanisms allow neural networks to focus on the most relevant parts of the input when generating each part of the output. In the transformer architecture, multi-head attention splits the input into multiple parallel attention operations (heads), each learning to attend to different types of relationships. Self-attention computes attention scores between all pairs of positions in a sequence, creating a weighted representation where each token incorporates information from all other tokens. The computational cost of self-attention is O(n^2) with respect to sequence length, which has motivated research into efficient attention variants like sparse attention and linear attention.",
        "category": "architectures",
        "source": "research_paper",
        "year": 2023
    },
    {
        "id": "doc_11",
        "text": "Evaluation of RAG systems requires measuring both retrieval quality and generation quality. Retrieval metrics include precision@k (fraction of top-k results that are relevant), recall@k (fraction of relevant documents found in top-k), and Mean Reciprocal Rank (MRR). Generation metrics include faithfulness (does the answer stick to retrieved context?), relevance (does the answer address the question?), and completeness (does the answer cover all aspects?). Automated evaluation frameworks like RAGAS provide standardized metrics for RAG evaluation.",
        "category": "rag",
        "source": "research_paper",
        "year": 2024
    },
    {
        "id": "doc_12",
        "text": "Transfer learning is the technique of leveraging knowledge gained from one task to improve performance on a different but related task. In NLP, this typically involves using a model pre-trained on a large general corpus and then adapting it to a specific downstream task. BERT popularized this approach by showing that a single pre-trained model could be fine-tuned to achieve state-of-the-art results on a wide range of NLP benchmarks. The success of transfer learning depends on the similarity between the pre-training domain and the target domain.",
        "category": "training",
        "source": "textbook",
        "year": 2023
    },
    {
        "id": "doc_13",
        "text": "Semantic search goes beyond keyword matching to understand the intent and meaning behind a query. Traditional search engines rely on BM25 and TF-IDF, which match documents based on word overlap. Semantic search uses embedding models to represent both queries and documents as vectors, then finds the nearest neighbors in vector space. This means a search for 'automobile' will also find documents about 'cars' and 'vehicles', even if the word 'automobile' never appears in those documents. Hybrid search combines both approaches: semantic similarity for understanding meaning and keyword matching for precision on specific terms.",
        "category": "rag",
        "source": "blog_post",
        "year": 2024
    },
    {
        "id": "doc_14",
        "text": "Quantization reduces the memory footprint and computational cost of neural networks by representing weights with lower precision numbers. Common quantization levels include FP16 (16-bit floating point), INT8 (8-bit integer), and INT4 (4-bit integer). A 70B parameter model requires about 140GB in FP16 but only 35GB in INT4, making it possible to run on consumer hardware. Techniques like GPTQ and AWQ provide post-training quantization with minimal quality loss. The GGUF format, used by llama.cpp, has become a popular standard for distributing quantized models.",
        "category": "infrastructure",
        "source": "tutorial",
        "year": 2024
    },
    {
        "id": "doc_15",
        "text": "AI agents are systems that use LLMs as reasoning engines to autonomously plan and execute multi-step tasks. An agent typically has access to tools (APIs, databases, code execution) and uses a loop of observation, reasoning, and action. Popular agent frameworks include LangChain's AgentExecutor, AutoGPT, and CrewAI. The ReAct (Reasoning + Acting) pattern is a common approach where the agent alternates between thinking about what to do and taking actions. Key challenges in agent design include error recovery, cost control, and preventing infinite loops.",
        "category": "techniques",
        "source": "blog_post",
        "year": 2024
    }
]

print(f"Loaded {len(documents)} documents")
print(f"\nCategories: {sorted(set(d['category'] for d in documents))}")
print(f"Sources: {sorted(set(d['source'] for d in documents))}")
print(f"\nSample document (first 150 chars):")
print(f"  [{documents[0]['id']}] {documents[0]['text'][:150]}...")

**Note on real-world document loading:** In production, you would use document loaders to ingest content from various sources:

- **PDFs**: `PyPDFLoader`, `UnstructuredPDFLoader` (via LangChain or custom)
- **Web pages**: `BeautifulSoup`, `Trafilatura`, `WebBaseLoader`
- **Databases**: Direct SQL queries or ORM exports
- **APIs**: REST/GraphQL endpoints returning text data
- **Office docs**: `python-docx`, `openpyxl`

LangChain provides over 100 document loaders: https://python.langchain.com/docs/integrations/document_loaders/
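For plain text files, a loader can be written with the standard library alone. The sketch below reads every `.txt` file in a folder into a subset of the dictionary schema used for `documents` above. The function name `load_text_documents`, the `file_NN` ID format, and the `filename` field are assumptions for this example, not part of any library.

```python
import tempfile
from pathlib import Path


def load_text_documents(directory, source="local_file"):
    """Read every .txt file in `directory` into a list of document dicts.

    Hypothetical helper: the ID format and the `source` label are assumptions.
    """
    docs = []
    # sorted() keeps the ID assignment deterministic across runs
    for i, path in enumerate(sorted(Path(directory).glob("*.txt")), start=1):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip empty files rather than embedding blank strings
        docs.append({
            "id": f"file_{i:02d}",
            "text": text,
            "source": source,
            "filename": path.name,
        })
    return docs


# Demo on a throwaway directory so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("First note.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("", encoding="utf-8")  # empty, will be skipped
    loaded = load_text_documents(tmp)

print(loaded)
```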

---
## 4. Chunking Strategies

Before we can embed documents, we often need to split them into smaller **chunks**. Why?

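1. **Embedding models have token limits** (256-512 tokens is common for small open-source models such as `all-MiniLM-L6-v2`; input beyond the limit is typically truncated)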
2. **Smaller chunks = more precise retrieval** (a full 10-page document would dilute the relevant information)
3. **LLM context windows are limited** (we want to pack the most relevant information)

Let's implement three common chunking strategies from scratch.

In [None]:
# A longer sample document for demonstrating chunking
long_document = """Artificial intelligence has a long and fascinating history that spans several decades of research and development. The field was formally founded at the Dartmouth Conference in 1956, where researchers like John McCarthy, Marvin Minsky, Allen Newell, and Herbert Simon gathered to discuss the possibility of creating intelligent machines.

In the early years, AI research focused on symbolic approaches, where knowledge was represented using rules and logic. Expert systems became popular in the 1980s, encoding domain knowledge from human experts into rule-based systems. These systems found commercial success in medical diagnosis, financial analysis, and manufacturing.

The field experienced several periods of reduced funding and interest, known as AI winters. The first AI winter occurred in the 1970s when early promises of AI went unfulfilled. The second AI winter happened in the late 1980s and early 1990s when expert systems proved too expensive to maintain and too brittle to handle real-world complexity.

The modern era of AI began with the resurgence of neural networks, particularly deep learning. The breakthrough came in 2012 when AlexNet, a deep convolutional neural network, won the ImageNet competition by a large margin. This demonstrated that deep neural networks, trained on large datasets with GPU acceleration, could dramatically outperform earlier approaches on visual recognition tasks.

Natural language processing saw its own revolution with the introduction of the transformer architecture in 2017. The paper 'Attention Is All You Need' by Vaswani et al. introduced the self-attention mechanism, which allowed models to process entire sequences in parallel rather than sequentially. This led to a series of increasingly powerful language models.

BERT, introduced by Google in 2018, demonstrated the power of pre-training on large text corpora followed by fine-tuning on specific tasks. GPT-2 and GPT-3 from OpenAI showed that scaling up language models led to emergent capabilities, including few-shot learning where the model could perform tasks with just a few examples in the prompt.

The release of ChatGPT in November 2022 brought AI into the mainstream consciousness. Built on GPT-3.5 and later GPT-4, ChatGPT demonstrated that large language models could engage in natural conversations, write code, analyze data, and assist with a wide range of tasks. This triggered an explosion of interest and investment in AI.

Today, AI research is advancing rapidly across multiple fronts. Multi-modal models can process text, images, audio, and video. Retrieval-augmented generation grounds LLM responses in factual data. AI agents can autonomously plan and execute complex tasks using tools. The field continues to evolve at an unprecedented pace, raising both excitement about possibilities and important questions about safety and ethics."""

print(f"Document length: {len(long_document)} characters")
print(f"Approximate words: {len(long_document.split())}")

### Strategy 1: Fixed-Size Chunking

The simplest approach: split every N characters, with optional overlap to preserve context across chunk boundaries.

In [None]:
def fixed_size_chunk(text, chunk_size=300, overlap=50):
    """
    Split text into fixed-size chunks with overlap.
    
    Args:
        text: The input text to chunk
        chunk_size: Maximum characters per chunk
        overlap: Number of overlapping characters between consecutive chunks
    
    Returns:
        List of text chunks
    """
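    # Guard against an infinite loop: the window must advance on every step
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last chunk reached; avoids a trailing chunk made only of overlap
        start = end - overlap  # step back by overlap amount
    return chunks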

fixed_chunks = fixed_size_chunk(long_document, chunk_size=300, overlap=50)
print(f"Fixed-size chunking: {len(fixed_chunks)} chunks")
for i, chunk in enumerate(fixed_chunks[:3]):
    print(f"\n--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:200] + "..." if len(chunk) > 200 else chunk)

Notice how fixed-size chunks can cut right through the middle of a sentence or even a word. This is the main downside of this approach.
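One way to soften this downside is to back each cut up to the nearest space. The sketch below is an illustrative variant, not one of the strategies compared in this notebook. The name `fixed_size_chunk_words` and the decision to drop the overlap when no word boundary falls inside the overlap window are assumptions. A single word longer than `chunk_size` is still cut.

```python
def fixed_size_chunk_words(text, chunk_size=300, overlap=50):
    """Fixed-size chunking that cuts at spaces instead of mid-word (illustrative)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            # Back up to the last space inside the window, if there is one
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= n:
            break
        # Start the next chunk at a word boundary inside the overlap window;
        # if the window holds no space, continue from `end` with no overlap
        window_start = max(end - overlap, start + 1)
        space = text.find(" ", window_start, end)
        start = space + 1 if space != -1 else end
    return chunks


sample = "alpha beta gamma delta epsilon zeta eta theta"
word_chunks = fixed_size_chunk_words(sample, chunk_size=16, overlap=6)
for i, c in enumerate(word_chunks):
    print(f"Chunk {i} ({len(c)} chars): {c}")
```

The trade-off is that chunks are no longer exactly `chunk_size` characters long, and the effective overlap varies from chunk to chunk.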

### Strategy 2: Sentence-Based Chunking

Split on sentence boundaries to keep complete thoughts together.

In [None]:
import re

def sentence_chunk(text, max_sentences=3):
    """
    Split text into chunks of N sentences each.
    
    Args:
        text: The input text to chunk
        max_sentences: Maximum sentences per chunk
    
    Returns:
        List of text chunks
    """
    # Split on sentence-ending punctuation followed by whitespace (spaces or newlines)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = ' '.join(sentences[i:i + max_sentences])
        if chunk.strip():
            chunks.append(chunk.strip())
    return chunks

sentence_chunks = sentence_chunk(long_document, max_sentences=3)
print(f"Sentence-based chunking: {len(sentence_chunks)} chunks")
for i, chunk in enumerate(sentence_chunks[:3]):
    print(f"\n--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:200] + "..." if len(chunk) > 200 else chunk)
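One caveat of this regex: it treats every period followed by whitespace as a sentence boundary, so abbreviations such as "et al." cause spurious splits (the sample text contains "Vaswani et al. introduced ..."). The sketch below shows the problem and one possible mitigation that re-joins pieces ending in a known abbreviation. The `ABBREVIATIONS` list and the helper name `split_sentences` are assumptions for illustration; a sentence that genuinely ends in one of these abbreviations would be merged with the next one.

```python
import re

SENTENCE_SPLIT = re.compile(r'(?<=[.!?])\s+')
ABBREVIATIONS = ("et al.", "e.g.", "i.e.", "Dr.", "vs.")  # illustrative, not exhaustive


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Regex sentence split that re-joins pieces broken after an abbreviation."""
    sentences = []
    for piece in SENTENCE_SPLIT.split(text):
        # If the previous piece ended in an abbreviation, the split was spurious
        if sentences and sentences[-1].endswith(abbreviations):
            sentences[-1] = sentences[-1] + " " + piece
        else:
            sentences.append(piece)
    return sentences


sample_text = "The paper by Vaswani et al. introduced self-attention. It changed NLP."
naive = SENTENCE_SPLIT.split(sample_text)
fixed = split_sentences(sample_text)

print(f"Naive split: {len(naive)} pieces -> {naive}")
print(f"With abbreviation handling: {len(fixed)} pieces -> {fixed}")
```

For production use, a dedicated sentence tokenizer is usually a better choice than a hand-maintained abbreviation list.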

### Strategy 3: Recursive Character Chunking

This strategy tries to split on natural boundaries in order of preference: paragraphs first, then lines, then sentences, then words. LangChain's `RecursiveCharacterTextSplitter` follows the same general idea; the function below is a simplified version written from scratch.

In [None]:
def recursive_chunk(text, chunk_size=500, separators=None):
    """
    Recursively split text using a hierarchy of separators.
    Tries the most meaningful separator first (paragraph breaks),
    then falls back to smaller separators.
    
    Args:
        text: The input text to chunk
        chunk_size: Maximum characters per chunk
        separators: Ordered list of separators to try
    
    Returns:
        List of text chunks
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " "]
    
    # Base case: text is small enough
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    
    # Try each separator in order
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            chunks = []
            current_chunk = ""
            
            for part in parts:
                # If adding this part would exceed chunk_size
                test_chunk = current_chunk + sep + part if current_chunk else part
                
                if len(test_chunk) <= chunk_size:
                    current_chunk = test_chunk
                else:
                    # Save current chunk and start new one
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                    
                    # If the part itself is too large, recurse with next separator
                    if len(part) > chunk_size:
                        remaining_seps = separators[separators.index(sep) + 1:]
                        if remaining_seps:
                            sub_chunks = recursive_chunk(part, chunk_size, remaining_seps)
                            chunks.extend(sub_chunks)
                            current_chunk = ""
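                        else:
                            # Last resort: hard split into chunk_size pieces so that
                            # neither the emitted chunks nor the carried-over tail exceed the limit
                            pieces = [part[i:i + chunk_size] for i in range(0, len(part), chunk_size)]
                            chunks.extend(p.strip() for p in pieces[:-1] if p.strip())
                            current_chunk = pieces[-1]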
                    else:
                        current_chunk = part
            
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            
            return chunks
    
    # Fallback: hard split
    return [text[i:i+chunk_size].strip() for i in range(0, len(text), chunk_size) if text[i:i+chunk_size].strip()]

recursive_chunks = recursive_chunk(long_document, chunk_size=500)
print(f"Recursive chunking: {len(recursive_chunks)} chunks")
for i, chunk in enumerate(recursive_chunks[:3]):
    print(f"\n--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk[:200] + "..." if len(chunk) > 200 else chunk)

### Visualizing Chunk Size Distributions

Let's compare the three strategies by looking at their chunk size distributions.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

strategies = {
    "Fixed-Size (300 chars, 50 overlap)": fixed_chunks,
    "Sentence-Based (3 per chunk)": sentence_chunks,
    "Recursive (500 char limit)": recursive_chunks
}

colors = ["#2196F3", "#4CAF50", "#FF9800"]

for ax, (name, chunks), color in zip(axes, strategies.items(), colors):
    sizes = [len(c) for c in chunks]
    ax.bar(range(len(sizes)), sizes, color=color, alpha=0.7)
    ax.set_title(name, fontsize=10)
    ax.set_xlabel("Chunk Index")
    ax.set_ylabel("Characters")
    ax.axhline(y=np.mean(sizes), color='red', linestyle='--', alpha=0.7, label=f"Mean: {np.mean(sizes):.0f}")
    ax.legend(fontsize=8)

plt.suptitle("Chunk Size Distribution by Strategy", fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

# Summary statistics
print(f"{'Strategy':<40} {'Count':>6} {'Mean':>8} {'Std':>8} {'Min':>6} {'Max':>6}")
print("-" * 76)
for name, chunks in strategies.items():
    sizes = [len(c) for c in chunks]
    print(f"{name:<40} {len(sizes):>6} {np.mean(sizes):>8.1f} {np.std(sizes):>8.1f} {min(sizes):>6} {max(sizes):>6}")

### Exercise 1: Implement and Compare Chunking Strategies

Implement three chunking functions and compare their behavior on the same text.

In [None]:
# Exercise 1: Implement the three chunking strategies

exercise_text = """Machine learning is a subset of artificial intelligence that focuses on building systems that learn from data. Instead of being explicitly programmed with rules, these systems identify patterns in data and make decisions with minimal human intervention.

Supervised learning is the most common type of machine learning. In supervised learning, the model is trained on labeled data, where each input comes with the correct output. The model learns to map inputs to outputs and can then make predictions on new, unseen data. Common algorithms include linear regression, decision trees, and neural networks.

Unsupervised learning works with unlabeled data. The model tries to find hidden patterns or structures in the data without being told what to look for. Clustering algorithms like K-means group similar data points together. Dimensionality reduction techniques like PCA and t-SNE help visualize high-dimensional data.

Reinforcement learning is inspired by behavioral psychology. An agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize cumulative reward over time. This approach has achieved remarkable results in game playing, robotics, and resource optimization."""

# TODO: Implement fixed_size_chunk_ex that splits by character count with overlap
def fixed_size_chunk_ex(text, size=200, overlap=30):
    result = None
    return result

# TODO: Implement sentence_chunk_ex that splits on sentence boundaries
def sentence_chunk_ex(text, sentences_per_chunk=2):
    result = None
    return result

# TODO: Implement recursive_chunk_ex that tries \n\n, then \n, then '. ', then ' '
def recursive_chunk_ex(text, size=300):
    result = None
    return result

# Compare results
# for name, func in [("Fixed", fixed_size_chunk_ex), ("Sentence", sentence_chunk_ex), ("Recursive", recursive_chunk_ex)]:
#     chunks = func(exercise_text)
#     print(f"\n{name}: {len(chunks)} chunks")
#     for i, c in enumerate(chunks):
#         print(f"  Chunk {i} ({len(c)} chars): {c[:80]}...")

### Solution

In [None]:
# Solution: Exercise 1

def fixed_size_chunk_ex(text, size=200, overlap=30):
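    # The window must advance on every step, otherwise the loop never ends
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # stop so the final chunk is not followed by an overlap-only fragment
        start = end - overlap
    return chunks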

def sentence_chunk_ex(text, sentences_per_chunk=2):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk]).strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def recursive_chunk_ex(text, size=300):
    return recursive_chunk(text, chunk_size=size, separators=["\n\n", "\n", ". ", " "])

# Compare results
for name, func in [("Fixed", fixed_size_chunk_ex), ("Sentence", sentence_chunk_ex), ("Recursive", recursive_chunk_ex)]:
    chunks = func(exercise_text)
    print(f"\n{name}: {len(chunks)} chunks")
    for i, c in enumerate(chunks):
        preview = c[:80] + "..." if len(c) > 80 else c
        print(f"  Chunk {i} ({len(c)} chars): {preview}")

---
## 5. Embedding Documents

Now that we have our documents (and know how to chunk them), we need to convert them into numerical vectors that capture their semantic meaning. We will use **sentence-transformers** with the `all-MiniLM-L6-v2` model, which is:

- **Free** - no API key required
- **Fast** - small model (80MB), runs on CPU
- **Good quality** - 384-dimensional embeddings, strong performance on semantic similarity

> **Alternatives:** For production systems, you might use OpenAI's `text-embedding-3-small` (API, 1536 dims), Cohere's `embed-english-v3.0` (API, 1024 dims), or larger local models like `all-mpnet-base-v2` (768 dims).

In [None]:
# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print(f"Model loaded: all-MiniLM-L6-v2")
print(f"Max sequence length: {embedding_model.max_seq_length}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

In [None]:
# Embed our documents
texts = [doc["text"] for doc in documents]
embeddings = embedding_model.encode(texts, show_progress_bar=True)

print(f"\nEmbedded {len(embeddings)} documents")
print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dtype: {embeddings.dtype}")
print(f"\nFirst embedding (first 10 dimensions): {embeddings[0][:10]}")
print(f"Embedding norm (L2): {np.linalg.norm(embeddings[0]):.4f}")

In [None]:
# Explore embedding properties

# 1. Embeddings are normalized (unit vectors) for this model
norms = np.linalg.norm(embeddings, axis=1)
print("Embedding norms (should be ~1.0 for normalized models):")
print(f"  Mean: {norms.mean():.4f}, Std: {norms.std():.6f}")

# 2. Semantic similarity: compare two related documents
# doc_01 (transformers) vs doc_10 (attention mechanisms) - should be similar
sim_related = np.dot(embeddings[0], embeddings[9])  # dot product = cosine sim (vectors are unit length)
# doc_01 (transformers) vs doc_14 (quantization) - less related
sim_unrelated = np.dot(embeddings[0], embeddings[13])

print(f"\nSimilarity between 'Transformers' and 'Attention mechanisms': {sim_related:.4f}")
print(f"Similarity between 'Transformers' and 'Quantization':          {sim_unrelated:.4f}")
if sim_related > sim_unrelated:
    print("\nThe embedding space places transformers closer to attention than to quantization.")

---
## 6. Vector Databases: ChromaDB

A vector database stores embeddings and enables efficient similarity search. **ChromaDB** is an excellent choice for learning and prototyping because:

- Simple Python API
- Works in-memory or with persistent storage
- Built-in embedding support
- Metadata filtering
- Open source

### Core ChromaDB Concepts

| Concept | Description |
|---------|------------|
| **Client** | Connection to ChromaDB (in-memory or persistent) |
| **Collection** | A named group of documents + embeddings (like a table) |
| **Document** | The original text content |
| **Embedding** | The vector representation of a document |
| **Metadata** | Key-value pairs associated with each document |
| **Query** | Semantic search to find similar documents |
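
Conceptually, a query against a collection is a nearest-neighbor search over the stored embeddings. The sketch below does this by brute force in plain Python on made-up 3-dimensional vectors. The names `TinyCollection`, `add`, and `query` are hypothetical and only loosely mirror the concepts in the table; ChromaDB itself relies on an approximate HNSW index rather than scanning every vector.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class TinyCollection:
    """Hypothetical in-memory collection: ids, documents, embeddings, metadata."""

    def __init__(self, name):
        self.name = name
        self.records = []  # one (id, document, embedding, metadata) tuple per entry

    def add(self, ids, documents, embeddings, metadatas):
        self.records.extend(zip(ids, documents, embeddings, metadatas))

    def query(self, query_embedding, n_results=3):
        """Score every record against the query and return the closest ones."""
        scored = [
            (cosine_distance(query_embedding, emb), doc_id, doc, meta)
            for doc_id, doc, emb, meta in self.records
        ]
        scored.sort(key=lambda item: item[0])  # smallest distance first
        return scored[:n_results]


toy = TinyCollection("toy")
toy.add(
    ids=["a", "b", "c"],
    documents=["about cats", "about dogs", "about tax law"],
    embeddings=[[0.9, 0.1, 0.0], [0.8, 0.3, 0.0], [0.0, 0.1, 0.9]],  # made-up vectors
    metadatas=[{"topic": "pets"}, {"topic": "pets"}, {"topic": "law"}],
)

for distance, doc_id, doc, meta in toy.query([1.0, 0.0, 0.0], n_results=2):
    print(f"{doc_id}: {doc!r} distance={distance:.4f} topic={meta['topic']}")
```

A full scan like this costs time proportional to the number of stored vectors per query, which is why dedicated vector databases use approximate indexes once collections grow large.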

In [None]:
# Create a ChromaDB client (in-memory for this tutorial)
chroma_client = chromadb.Client()

# For persistent storage, you would use:
# chroma_client = chromadb.PersistentClient(path="./chroma_db")

print("ChromaDB client created (in-memory mode)")

In [None]:
# Create a collection
# ChromaDB supports squared L2 distance (the default), cosine distance, and inner product
collection = chroma_client.create_collection(
    name="ai_ml_knowledge_base",
    metadata={"hnsw:space": "cosine"}  # use cosine distance (1 - cosine similarity)
)

print(f"Collection created: {collection.name}")
print("Distance metric: cosine")

In [None]:
# Add documents to the collection
collection.add(
    ids=[doc["id"] for doc in documents],
    documents=[doc["text"] for doc in documents],
    embeddings=embeddings.tolist(),
    metadatas=[{"category": doc["category"], "source": doc["source"], "year": doc["year"]} for doc in documents]
)

print(f"Added {collection.count()} documents to the collection")

In [None]:
# Semantic search: query the collection
def search(query_text, n_results=3):
    """Search the collection for documents similar to the query."""
    # Embed the query
    query_embedding = embedding_model.encode([query_text]).tolist()
    
    # Search
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results,
        include=["documents", "metadatas", "distances"]
    )
    
    return results

# Test search
query = "How do embedding models work?"
results = search(query)

print(f"Query: '{query}'")
print(f"\nTop {len(results['ids'][0])} results:")
print("=" * 60)
for i in range(len(results['ids'][0])):
    doc_id = results['ids'][0][i]
    distance = results['distances'][0][i]
    similarity = 1 - distance  # cosine distance to similarity
    text = results['documents'][0][i]
    metadata = results['metadatas'][0][i]
    
    print(f"\n[{i+1}] ID: {doc_id} | Similarity: {similarity:.4f} | Category: {metadata['category']}")
    print(f"    {text[:150]}...")

In [None]:
# Let's try several different queries
test_queries = [
    "What is RAG and how does it work?",
    "How can I make a model smaller to run on my laptop?",
    "What are the best ways to evaluate a search system?",
    "How do transformer neural networks process language?",
    "What tools can AI agents use?"
]

for query in test_queries:
    results = search(query, n_results=2)
    top_id = results['ids'][0][0]
    top_sim = 1 - results['distances'][0][0]
    top_cat = results['metadatas'][0][0]['category']
    print(f"Query: '{query}'")
    print(f"  -> Top match: {top_id} (similarity: {top_sim:.4f}, category: {top_cat})")
    print(f"     {results['documents'][0][0][:100]}...")
    print()

### Exercise 2: Build a ChromaDB Collection and Perform Semantic Search

Create a new collection with a different set of documents and perform searches.

In [None]:
# Exercise 2: Build your own ChromaDB collection

# Here are some programming-related documents
programming_docs = [
    {"id": "prog_01", "text": "Python is a high-level, interpreted programming language known for its clear syntax and readability. It supports multiple paradigms including procedural, object-oriented, and functional programming. Python's extensive standard library and package ecosystem (via pip) make it popular for web development, data science, automation, and AI.", "topic": "python"},
    {"id": "prog_02", "text": "JavaScript is the language of the web, running in every modern browser. With Node.js, it can also run on servers. Key features include event-driven programming, closures, prototypal inheritance, and async/await for handling asynchronous operations. TypeScript adds static typing on top of JavaScript.", "topic": "javascript"},
    {"id": "prog_03", "text": "Rust is a systems programming language focused on safety, speed, and concurrency. Its ownership system prevents memory errors at compile time without needing a garbage collector. Rust is used for operating systems, game engines, web assembly, and performance-critical applications.", "topic": "rust"},
    {"id": "prog_04", "text": "Docker containers package applications with all their dependencies into standardized units. Unlike virtual machines, containers share the host OS kernel, making them lightweight and fast to start. Docker Compose allows defining multi-container applications, while Kubernetes orchestrates containers at scale.", "topic": "devops"},
    {"id": "prog_05", "text": "Git is a distributed version control system that tracks changes in source code. Key concepts include commits (snapshots), branches (parallel development lines), merges (combining branches), and remotes (shared repositories). GitHub, GitLab, and Bitbucket provide cloud hosting for Git repositories.", "topic": "tools"},
    {"id": "prog_06", "text": "SQL (Structured Query Language) is used to manage and query relational databases. Core operations include SELECT for reading, INSERT for adding, UPDATE for modifying, and DELETE for removing data. JOINs combine data from multiple tables. PostgreSQL and MySQL are popular open-source relational databases.", "topic": "databases"},
]

# TODO: Create a new collection called 'programming_kb'
prog_collection = None

# TODO: Embed the programming documents using embedding_model.encode()
prog_embeddings = None

# TODO: Add documents, embeddings, and metadata to the collection

# TODO: Search with these 3 queries and print results
search_queries = [
    "How do I manage code versions and collaborate with others?",
    "What language is best for building web applications?",
    "How can I deploy my application reliably?"
]

### Solution

In [None]:
# Solution: Exercise 2

# Create a new collection
prog_collection = chroma_client.create_collection(
    name="programming_kb",
    metadata={"hnsw:space": "cosine"}
)

# Embed the documents
prog_texts = [doc["text"] for doc in programming_docs]
prog_embeddings = embedding_model.encode(prog_texts)

# Add to collection
prog_collection.add(
    ids=[doc["id"] for doc in programming_docs],
    documents=prog_texts,
    embeddings=prog_embeddings.tolist(),
    metadatas=[{"topic": doc["topic"]} for doc in programming_docs]
)

print(f"Collection '{prog_collection.name}' created with {prog_collection.count()} documents\n")

# Search
search_queries = [
    "How do I manage code versions and collaborate with others?",
    "What language is best for building web applications?",
    "How can I deploy my application reliably?"
]

for query in search_queries:
    query_emb = embedding_model.encode([query]).tolist()
    results = prog_collection.query(
        query_embeddings=query_emb,
        n_results=2,
        include=["documents", "metadatas", "distances"]
    )
    
    print(f"Query: '{query}'")
    for i in range(len(results['ids'][0])):
        sim = 1 - results['distances'][0][i]
        topic = results['metadatas'][0][i]['topic']
        print(f"  [{i+1}] {results['ids'][0][i]} (sim: {sim:.4f}, topic: {topic})")
        print(f"      {results['documents'][0][i][:100]}...")
    print()

---
## 7. Distance Metrics Deep Dive

Understanding distance metrics is essential for building effective retrieval systems. Let's implement them from scratch and compare their behavior.

In [None]:
def cosine_similarity(a, b):
    """Cosine similarity: measures the angle between two vectors.
    Range: [-1, 1] where 1 = identical direction, 0 = orthogonal, -1 = opposite
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

def dot_product(a, b):
    """Dot product: combines direction AND magnitude.
    Range: (-inf, inf) - higher is more similar
    """
    return np.dot(a, b)

def euclidean_distance(a, b):
    """Euclidean distance (L2): straight-line distance between two points.
    Range: [0, inf) - lower is more similar
    """
    return np.linalg.norm(a - b)

# Compare metrics on our embeddings
# Select three documents: two related (transformers, attention) and one unrelated (quantization)
emb_transformers = embeddings[0]     # doc_01: Transformer architectures
emb_attention = embeddings[9]        # doc_10: Attention mechanisms
emb_quantization = embeddings[13]    # doc_14: Quantization

pairs = [
    ("Transformers vs Attention", emb_transformers, emb_attention),
    ("Transformers vs Quantization", emb_transformers, emb_quantization),
    ("Attention vs Quantization", emb_attention, emb_quantization),
]

print(f"{'Pair':<35} {'Cosine Sim':>12} {'Dot Product':>12} {'Euclidean':>12}")
print("-" * 73)
for name, a, b in pairs:
    cs = cosine_similarity(a, b)
    dp = dot_product(a, b)
    ed = euclidean_distance(a, b)
    print(f"{name:<35} {cs:>12.4f} {dp:>12.4f} {ed:>12.4f}")

In [None]:
# Visualize: Full similarity matrix using cosine similarity
n = len(embeddings)
sim_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim_matrix[i][j] = cosine_similarity(embeddings[i], embeddings[j])

# Plot heatmap
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(sim_matrix, cmap='YlOrRd', vmin=0, vmax=1)

# Labels (hardcoded short names; assumes the 15 documents defined earlier, in order)
short_labels = [
    "Transformers", "RAG", "Vector DBs", "Embeddings", "Fine-tuning",
    "Chunking", "LLMs", "Cosine Sim", "Prompting", "Attention",
    "RAG Eval", "Transfer Learn", "Semantic Search", "Quantization", "AI Agents"
]

ax.set_xticks(range(n))
ax.set_yticks(range(n))
ax.set_xticklabels(short_labels, rotation=45, ha='right', fontsize=8)
ax.set_yticklabels(short_labels, fontsize=8)

plt.colorbar(im, label='Cosine Similarity')
ax.set_title('Document Similarity Matrix', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

### When to Use Each Metric

| Metric | Use When | Key Property |
|--------|----------|-------------|
| **Cosine Similarity** | Text embeddings, when you care about direction not magnitude | Invariant to vector length |
| **Dot Product** | When embeddings encode importance in their magnitude | Faster than cosine (no normalization) |
| **Euclidean Distance** | When absolute position in space matters (e.g., spatial data) | Sensitive to magnitude |

**Rule of thumb:** For text similarity search, **cosine similarity** is almost always the right choice. If your embeddings are already normalized (unit vectors), cosine similarity equals the dot product.
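
The equivalence in the rule of thumb can be checked on a small example. The sketch below uses made-up 3-dimensional vectors and plain Python. It also shows that for unit vectors the squared Euclidean distance equals 2 - 2 * cosine, so all three metrics produce the same ranking once embeddings are normalized.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a vector to unit length."""
    length = norm(v)
    return [x / length for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


a = [3.0, 4.0, 0.0]  # made-up vectors with different magnitudes
b = [1.0, 2.0, 2.0]

a_unit, b_unit = normalize(a), normalize(b)

cos_ab = cosine(a, b)            # unaffected by magnitude
dot_units = dot(a_unit, b_unit)  # equals cosine after normalization
sq_l2_units = sum((x - y) ** 2 for x, y in zip(a_unit, b_unit))

print(f"cosine(a, b)            = {cos_ab:.4f}")
print(f"dot(a_unit, b_unit)     = {dot_units:.4f}")
print(f"dot(a, b) unnormalized  = {dot(a, b):.4f}")
print(f"2 - 2 * cosine          = {2 - 2 * cos_ab:.4f}")
print(f"squared L2 (unit vecs)  = {sq_l2_units:.4f}")
```

The unnormalized dot product differs from the cosine because it also reflects vector length, which is the property the table attributes to the dot product.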

---
## 8. Metadata Filtering

Real-world RAG systems need more than just semantic similarity. **Metadata filtering** lets you constrain search results based on structured attributes like category, date, source, or access permissions.

ChromaDB supports `where` clauses for metadata filtering and `where_document` for text content filtering.
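
To make the filter semantics concrete, here is a minimal plain-Python sketch of how a `where` clause could be evaluated against one document's metadata. The function `matches` is hypothetical and is not ChromaDB's implementation. It covers only the operators listed in this notebook and assumes that a missing metadata field never matches.

```python
import operator

# Comparison operators, keyed by the filter syntax used in this notebook.
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: dict, where: dict) -> bool:
    """Return True if `metadata` satisfies the `where` filter (illustrative only)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif key not in metadata:
            return False  # assumption: a missing field never matches
        elif isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2024}}
            for op_name, operand in condition.items():
                if not COMPARATORS[op_name](metadata[key], operand):
                    return False
        elif metadata[key] != condition:
            # A bare value such as {"category": "rag"} is shorthand for $eq
            return False
    return True


doc_meta = {"category": "rag", "source": "tutorial", "year": 2024}  # made-up record

print(matches(doc_meta, {"category": "rag"}))
print(matches(doc_meta, {"year": {"$gte": 2025}}))
print(matches(doc_meta, {"$and": [{"source": "tutorial"}, {"year": {"$gte": 2024}}]}))
print(matches(doc_meta, {"source": {"$in": ["textbook", "blog_post"]}}))
```

In a real vector database the filter is applied together with the similarity search, so only documents that pass the filter can appear in the top-k results.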

In [None]:
# Search with metadata filters
query = "How do neural networks learn?"
query_emb = embedding_model.encode([query]).tolist()

# 1. No filter (baseline)
results_all = collection.query(
    query_embeddings=query_emb,
    n_results=3,
    include=["documents", "metadatas", "distances"]
)

print("=== No filter ===")
for i in range(len(results_all['ids'][0])):
    sim = 1 - results_all['distances'][0][i]
    cat = results_all['metadatas'][0][i]['category']
    print(f"  [{results_all['ids'][0][i]}] sim: {sim:.4f}, category: {cat}")

# 2. Filter by category
results_arch = collection.query(
    query_embeddings=query_emb,
    n_results=3,
    where={"category": "architectures"},
    include=["documents", "metadatas", "distances"]
)

print("\n=== Filter: category = 'architectures' ===")
for i in range(len(results_arch['ids'][0])):
    sim = 1 - results_arch['distances'][0][i]
    cat = results_arch['metadatas'][0][i]['category']
    print(f"  [{results_arch['ids'][0][i]}] sim: {sim:.4f}, category: {cat}")

# 3. Filter by year
results_2024 = collection.query(
    query_embeddings=query_emb,
    n_results=3,
    where={"year": {"$gte": 2024}},
    include=["documents", "metadatas", "distances"]
)

print("\n=== Filter: year >= 2024 ===")
for i in range(len(results_2024['ids'][0])):
    sim = 1 - results_2024['distances'][0][i]
    year = results_2024['metadatas'][0][i]['year']
    cat = results_2024['metadatas'][0][i]['category']
    print(f"  [{results_2024['ids'][0][i]}] sim: {sim:.4f}, year: {year}, category: {cat}")

In [None]:
# Advanced filtering: combining conditions
query = "Best practices and tutorials"
query_emb = embedding_model.encode([query]).tolist()

# Filter: source is tutorial AND year >= 2024
results_filtered = collection.query(
    query_embeddings=query_emb,
    n_results=5,
    where={
        "$and": [
            {"source": "tutorial"},
            {"year": {"$gte": 2024}}
        ]
    },
    include=["documents", "metadatas", "distances"]
)

print("=== Filter: source='tutorial' AND year >= 2024 ===")
for i in range(len(results_filtered['ids'][0])):
    sim = 1 - results_filtered['distances'][0][i]
    meta = results_filtered['metadatas'][0][i]
    print(f"  [{results_filtered['ids'][0][i]}] sim: {sim:.4f}, source: {meta['source']}, year: {meta['year']}")
    print(f"    {results_filtered['documents'][0][i][:100]}...")
    print()

# Also show what ChromaDB filter operators are available
print("\nChromaDB filter operators:")
print("  $eq   - equals (default)")
print("  $ne   - not equals")
print("  $gt   - greater than")
print("  $gte  - greater than or equal")
print("  $lt   - less than")
print("  $lte  - less than or equal")
print("  $in   - in list")
print("  $nin  - not in list")
print("  $and  - logical AND")
print("  $or   - logical OR")

### Exercise 3: Compare Retrieval Quality Across Queries

Try several queries and manually assess whether the top results are relevant.

In [None]:
# Exercise 3: Evaluate retrieval quality

evaluation_queries = [
    "What is the difference between RAG and fine-tuning?",
    "How do I choose a vector database?",
    "What are the latest developments in language models?",
    "How can I reduce hallucination in AI systems?",
    "What hardware do I need to run large models?"
]

# TODO: For each query, retrieve top-3 results and assess relevance
# Print the query, the top-3 document IDs, their similarity scores,
# and a brief note on whether they seem relevant

relevance_scores = None  # TODO: store your relevance assessments

# Expected output format:
# Query: "What is the difference between RAG and fine-tuning?"
#   [1] doc_02 (sim: 0.65) - RELEVANT (directly about RAG)
#   [2] doc_05 (sim: 0.52) - RELEVANT (directly about fine-tuning)
#   [3] doc_06 (sim: 0.48) - RELEVANT (about chunking, a RAG component)

### Solution

In [None]:
# Solution: Exercise 3

evaluation_queries = [
    "What is the difference between RAG and fine-tuning?",
    "How do I choose a vector database?",
    "What are the latest developments in language models?",
    "How can I reduce hallucination in AI systems?",
    "What hardware do I need to run large models?"
]

# Ground truth: manually identified relevant doc IDs for each query
expected_relevant = {
    0: ["doc_02", "doc_05", "doc_06"],       # RAG vs fine-tuning
    1: ["doc_03"],                              # vector databases
    2: ["doc_07", "doc_01"],                    # language models
    3: ["doc_02", "doc_09", "doc_11"],          # hallucination
    4: ["doc_14", "doc_05"],                    # hardware / quantization
}

total_relevant_found = 0
total_results = 0

for idx, query in enumerate(evaluation_queries):
    query_emb = embedding_model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_emb,
        n_results=3,
        include=["documents", "metadatas", "distances"]
    )
    
    print(f"Query: '{query}'")
    for i in range(len(results['ids'][0])):
        doc_id = results['ids'][0][i]
        sim = 1 - results['distances'][0][i]
        cat = results['metadatas'][0][i]['category']
        is_relevant = doc_id in expected_relevant.get(idx, [])
        relevance_tag = "RELEVANT" if is_relevant else "NOT RELEVANT"
        
        if is_relevant:
            total_relevant_found += 1
        total_results += 1
        
        print(f"  [{i+1}] {doc_id} (sim: {sim:.4f}, cat: {cat}) - {relevance_tag}")
    print()

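# Micro-averaged precision@3: relevant results / all returned results
precision = total_relevant_found / total_results if total_results else 0.0
print(f"Overall precision@3: {total_relevant_found}/{total_results} = {precision:.1%}")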
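
The precision figure above counts relevant hits among everything returned. A complementary view is recall: how many of the known-relevant documents were actually retrieved. The sketch below computes both for a single query in plain Python. The function names and the sample ranking are illustrative and are not output from the notebook's collection.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)


# Hypothetical ranking for a single query (not real retrieval output).
retrieved = ["doc_02", "doc_11", "doc_03"]
relevant = ["doc_02", "doc_09", "doc_11"]

print(f"precision@3 = {precision_at_k(retrieved, relevant, 3):.2f}")
print(f"recall@3    = {recall_at_k(retrieved, relevant, 3):.2f}")
```

A query with only one known-relevant document can never exceed a precision@3 of 1/3, which is one reason to report recall alongside precision.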

### Exercise 4: Metadata Filtering

Practice using metadata filters to narrow down search results.

In [None]:
# Exercise 4: Metadata filtering

# TODO 1: Retrieve the top documents in the 'rag' category for a query about search or retrieval
# Hint: use where={"category": "rag"}
rag_results = None

# TODO 2: Retrieve only documents from 'textbook' sources published in 2023
textbook_2023 = None

# TODO 3: Retrieve documents from either 'tutorial' or 'blog_post' sources
# Hint: use $in operator: {"source": {"$in": ["tutorial", "blog_post"]}}
informal_sources = None

# Print results for each

### Solution

In [None]:
# Solution: Exercise 4

query = "search and retrieval techniques"
query_emb = embedding_model.encode([query]).tolist()

# 1. RAG category only
rag_results = collection.query(
    query_embeddings=query_emb,
    n_results=5,
    where={"category": "rag"},
    include=["documents", "metadatas", "distances"]
)

print("=== RAG category results ===")
for i in range(len(rag_results['ids'][0])):
    sim = 1 - rag_results['distances'][0][i]
    print(f"  [{rag_results['ids'][0][i]}] sim: {sim:.4f} - {rag_results['documents'][0][i][:80]}...")

# 2. Textbook sources from 2023
textbook_query = "fundamental concepts in AI"
textbook_emb = embedding_model.encode([textbook_query]).tolist()

textbook_2023 = collection.query(
    query_embeddings=textbook_emb,
    n_results=5,
    where={
        "$and": [
            {"source": "textbook"},
            {"year": 2023}
        ]
    },
    include=["documents", "metadatas", "distances"]
)

print("\n=== Textbook sources from 2023 ===")
for i in range(len(textbook_2023['ids'][0])):
    meta = textbook_2023['metadatas'][0][i]
    sim = 1 - textbook_2023['distances'][0][i]
    print(f"  [{textbook_2023['ids'][0][i]}] sim: {sim:.4f}, source: {meta['source']}, year: {meta['year']}")

# 3. Informal sources (tutorial or blog_post)
informal_sources = collection.query(
    query_embeddings=query_emb,
    n_results=5,
    where={"source": {"$in": ["tutorial", "blog_post"]}},
    include=["documents", "metadatas", "distances"]
)

print("\n=== Tutorial & Blog Post sources ===")
for i in range(len(informal_sources['ids'][0])):
    meta = informal_sources['metadatas'][0][i]
    sim = 1 - informal_sources['distances'][0][i]
    print(f"  [{informal_sources['ids'][0][i]}] sim: {sim:.4f}, source: {meta['source']}, category: {meta['category']}")

---
## 9. Other Vector Databases

ChromaDB is excellent for prototyping and small-to-medium workloads. For production at scale, you may need specialized solutions:

| Database | Type | Best For | Key Features |
|----------|------|----------|--------------|
| **ChromaDB** | Open-source, embedded | Prototyping, small apps | Simple API, Python-native, in-memory or persistent |
| **Pinecone** | Cloud-managed | Production SaaS | Fully managed, auto-scaling, hybrid search |
| **FAISS** | Library (Facebook) | High-performance local | Extremely fast, GPU support, many index types |
| **Weaviate** | Open-source, self-hosted | Hybrid search | GraphQL API, built-in vectorization, BM25 + vector |
| **Qdrant** | Open-source, self-hosted | Performance-critical | Written in Rust, advanced filtering, gRPC API |
| **Milvus** | Open-source, distributed | Large-scale production | Distributed architecture, billion-scale vectors |
| **pgvector** | PostgreSQL extension | Existing Postgres users | Integrates with existing SQL database, familiar tooling |

### Decision Guide

- **Just learning/prototyping?** Use **ChromaDB** (simplest setup)
- **Already using PostgreSQL?** Add **pgvector** extension
- **Need a managed service?** Use **Pinecone** (zero ops)
- **Need maximum local speed?** Use **FAISS** (raw performance)
- **Need hybrid search (keyword + semantic)?** Use **Weaviate** or **Qdrant**
- **Billions of vectors?** Use **Milvus** (distributed)

---
## 10. Bridge to the RAG Pipeline

We have now built the **retrieval** half of RAG. Here is where we stand in the full pipeline:

```
                    What we built in this notebook
                    ==============================

 Documents  -->  Chunk  -->  Embed  -->  Store in     -->  Query &
                                         Vector DB         Retrieve
 [done]         [done]      [done]       [done]           [done]


                    What comes next (Module 10)
                    ============================

 Retrieved   -->  Format      -->  Send to    -->  Generate
 Documents        as Context       LLM             Grounded Answer
 [from above]     [prompt eng]     [API call]      [final output]
```

In Module 10, we will:
1. Build a complete RAG pipeline that connects retrieval to generation
2. Use prompt engineering to instruct the LLM to answer based on retrieved context
3. Handle edge cases (no relevant documents found, conflicting information)
4. Evaluate the full system end-to-end

The key insight is that **retrieval quality directly determines generation quality**. If we retrieve the wrong documents, even the best LLM will produce poor answers. That is why this module focused deeply on the retrieval foundations.
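
As a preview of the "Format as Context" step, the sketch below turns retrieved documents into a prompt string. The function `build_prompt`, the template wording, and the `min_similarity` threshold of 0.3 are all assumptions made for illustration; Module 10 may structure this differently. The input mirrors the nested-list shape of the `results` dictionaries returned by `collection.query` in this notebook.

```python
def build_prompt(question, results, min_similarity=0.3):
    """Format retrieved documents as context for an LLM prompt.

    `results` is assumed to have the same shape as this notebook's query
    results: lists nested one level per query, with cosine distances.
    """
    context_lines = []
    for doc_id, text, distance in zip(
        results["ids"][0], results["documents"][0], results["distances"][0]
    ):
        similarity = 1 - distance  # cosine distance to similarity
        if similarity >= min_similarity:  # hypothetical cutoff for weak matches
            context_lines.append(f"[{doc_id}] {text}")

    if context_lines:
        context = "\n".join(context_lines)
    else:
        context = "(no relevant documents found)"

    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up results in the same shape as collection.query output.
fake_results = {
    "ids": [["doc_02", "doc_14"]],
    "documents": [["RAG retrieves documents before generation.", "Quantization shrinks models."]],
    "distances": [[0.35, 0.90]],
}

print(build_prompt("What is RAG?", fake_results))
```

Dropping weak matches before building the prompt is one simple way to handle the case where nothing relevant was retrieved, instead of passing unrelated text to the LLM.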

---
## 11. Summary

### Key Takeaways

1. **RAG mitigates critical LLM limitations**: knowledge cutoff, hallucination, and lack of domain expertise
2. **Chunking matters**: How you split documents significantly affects retrieval quality. Recursive chunking is a sensible default for most text, but compare strategies on your own documents.
3. **Embedding models convert text to vectors**: Sentence-transformers (e.g., all-MiniLM-L6-v2) provide free, local, high-quality embeddings
4. **Vector databases enable semantic search**: ChromaDB is excellent for learning and prototyping; production systems may need Pinecone, FAISS, or Weaviate
5. **Cosine similarity is the standard metric** for text embedding comparison
6. **Metadata filtering** combines semantic search with structured constraints for more precise retrieval

### References

- **Paper**: Lewis et al. ["Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"](https://arxiv.org/abs/2005.11401) (2020) - The foundational RAG paper
- **Paper**: Vaswani et al. ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762) (2017) - The transformer paper
- **Docs**: [ChromaDB Documentation](https://docs.trychroma.com/) - Vector database used in this notebook
- **Docs**: [Pinecone Documentation](https://docs.pinecone.io/) - Cloud vector database
- **Docs**: [FAISS Wiki](https://github.com/facebookresearch/faiss/wiki) - Facebook's similarity search library
- **Docs**: [Sentence Transformers](https://www.sbert.net/) - Embedding models used in this notebook
- **Course**: DeepLearning.AI ["LangChain: Chat with Your Data"](https://www.deeplearning.ai/short-courses/langchain-chat-with-your-data/) - Covers RAG with LangChain
- **Course**: DeepLearning.AI ["Building and Evaluating Advanced RAG Applications"](https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/) - Advanced RAG techniques