# Notebook 17: RAG with Local LLMs

**Learning Objectives:**
- Understand Retrieval-Augmented Generation (RAG)
- Build vector databases with FAISS and ChromaDB
- Create embeddings with sentence transformers
- Implement semantic search and context injection
- Combine retrieval with local LLMs

## Prerequisites

### Hardware Requirements

| Component | Small (CPU) | Large (GPU) | SOTA (Reference) |
|-----------|-------------|-------------|------------------|
| **Embedding Model** | all-MiniLM-L6-v2 | all-mpnet-base-v2 | OpenAI text-embedding-3 |
| **Embedding Model Size** | 80MB | 420MB | API |
| **LLM** | llama3.2:1b | llama3.1:8b | Claude 3.5 Sonnet |
| **LLM Size** | 1.3GB | 4.7GB | API |
| **Min RAM** | 8GB | 12GB | N/A |
| **Min VRAM** | N/A (CPU) | 8GB | N/A |
| **Performance** | 2-4s per query | 1-2s per query | <1s per query |

### Software Requirements
- Python 3.10+
- Ollama installed with models pulled
- Libraries: `sentence-transformers`, `faiss-cpu`, `chromadb`, `ollama`

### Installation

```bash
# Install embedding and vector DB libraries
pip install sentence-transformers
pip install faiss-cpu  # or faiss-gpu if you have CUDA
pip install chromadb
pip install ollama

# Pull Ollama models
ollama pull llama3.2:1b
ollama pull llama3.1:8b  # Optional, if you have resources
```

## Expected Behaviors

### First Time Running
- **Embedding Model Download**: 80MB (small) or 420MB (large)
- **Ollama Model**: Already downloaded if you completed notebooks 14-16; otherwise pull it as shown under Installation
- Models cached in `~/.cache/huggingface/` and `~/.ollama/models/`

### Vector Database Creation
```
Creating embeddings for 15 documents...
FAISS index created: 15 vectors, 384 dimensions
```

### RAG Query Execution
- **Embedding query**: 50-100ms
- **Semantic search**: 10-50ms for 1000 documents
- **LLM generation**: 1-4 seconds depending on model
- **Total**: 2-5 seconds per query

### Common Observations
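- Retrieval quality often matters more than LLM size: if the relevant document is not retrieved, even a large model cannot ground its answer in it
- A Top-K of 3-5 documents is usually sufficient for context
- Larger models (llama3.1:8b) tend to use the retrieved context more reliably and give better answers
- ChromaDB is easier for persistence; FAISS is faster for search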

## What is RAG?

**Retrieval-Augmented Generation (RAG)** combines information retrieval with text generation to answer questions using external knowledge.

### Why RAG?

**Problem with LLMs alone:**
- Knowledge cutoff dates (outdated information)
- Hallucinations (making up facts)
- No access to private/proprietary data
- Cannot cite sources

**RAG Solution:**
1. **Retrieve** relevant documents from a knowledge base
2. **Inject** documents as context into LLM prompt
3. **Generate** answer grounded in retrieved facts
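
The three steps can be sketched end to end without any models. This toy version assumes a hypothetical word-overlap retriever (`retrieve`) and a stand-in `generate` function in place of an embedding model and an LLM. All names here are illustrative only and are not the notebook's implementation.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, top_k=2):
    """Step 1: rank documents by word overlap with the query (toy retriever)."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_docs):
    """Step 2: inject the retrieved documents into the prompt as context."""
    context = "\n\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"

def generate(prompt):
    """Step 3: placeholder for the LLM call; it only reports the prompt size."""
    return f"[an LLM would answer here, given a {len(prompt)}-character prompt]"

toy_docs = [
    "LoRA adds trainable low-rank matrices to model layers.",
    "FAISS is a library for similarity search.",
    "BERT is a transformer-based model.",
]
question = "How does LoRA use low-rank matrices?"
top_docs = retrieve(question, toy_docs, top_k=1)
print(generate(build_prompt(question, top_docs)))
```

Swapping `retrieve` for a vector search and `generate` for a real model call leaves the overall flow unchanged.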

### RAG Architecture

```
User Query
    ↓
[Embedding Model] → Query Vector
    ↓
[Vector Database] → Semantic Search
    ↓
Top-K Relevant Documents
    ↓
[LLM] ← Query + Documents
    ↓
Grounded Answer
```

### Key Components

1. **Embedding Model** - Converts text to vectors (sentence-transformers)
2. **Vector Database** - Stores and searches embeddings (FAISS, ChromaDB)
3. **LLM** - Generates answers using retrieved context (Ollama)
4. **Knowledge Base** - Your documents/data to search
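
To make the first two components concrete, the sketch below compares hand-written 3-dimensional vectors using cosine similarity. The vectors are made up for illustration (a real embedding model produces hundreds of dimensions), and `cosine_similarity` is a hypothetical helper rather than a function from any library used in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings": the first two point in a similar direction.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

query_vec = toy_vectors["cat"]
for word, vec in toy_vectors.items():
    print(f"cat vs {word}: {cosine_similarity(query_vec, vec):.3f}")
```

A vector database performs the same kind of comparison, only against many stored vectors at once.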

In [None]:
import numpy as np
import random
import torch
import ollama
from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
random.seed(1103)
np.random.seed(1103)
torch.manual_seed(1103)

print("RAG Tutorial - Setup Complete")

## Model Selection

In [None]:
# CHOOSE YOUR MODELS:

# Embedding model (for converting text to vectors)
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # small: 80MB, 384 dimensions
# EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"  # large: 420MB, 768 dimensions

# LLM model (for generating answers)
LLM_MODEL = "llama3.2:1b"  # small: 1.3GB, CPU-friendly
# LLM_MODEL = "llama3.1:8b"  # large: 4.7GB, GPU-optimized

print(f"Embedding model: {EMBEDDING_MODEL}")
print(f"LLM model: {LLM_MODEL}")

## Sample Knowledge Base

Let's create a sample knowledge base about machine learning.

In [None]:
# Sample documents for our knowledge base
documents = [
    "Transformers are neural network architectures that use self-attention mechanisms. They were introduced in the paper 'Attention is All You Need' in 2017.",
    "BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model pre-trained on large text corpora. It excels at understanding context in both directions.",
    "GPT (Generative Pre-trained Transformer) is an autoregressive language model that generates text by predicting the next token. GPT-3 has 175 billion parameters.",
    "Fine-tuning is the process of adapting a pre-trained model to a specific task by training it on task-specific data. This is more efficient than training from scratch.",
    "LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adds trainable low-rank matrices to model layers, reducing memory requirements.",
    "Vector databases store embeddings and enable fast similarity search. Popular options include FAISS, Pinecone, Weaviate, and ChromaDB.",
    "Semantic search finds documents based on meaning rather than keyword matching. It uses embeddings to represent text as vectors in a high-dimensional space.",
    "RAG (Retrieval-Augmented Generation) combines information retrieval with text generation. It retrieves relevant documents and uses them as context for generating answers.",
    "Ollama is a tool for running large language models locally on your machine. It supports models like Llama 3, Mistral, and Phi-3.",
    "Embeddings are dense vector representations of text that capture semantic meaning. Similar texts have similar embeddings in vector space.",
    "Attention mechanisms allow models to focus on different parts of the input when generating output. Self-attention computes attention within a single sequence.",
    "The Model Context Protocol (MCP) standardizes how AI assistants connect to external tools and data sources, enabling reusable integrations.",
    "Zero-shot learning is when a model performs tasks it wasn't explicitly trained for, using only the task description in the prompt.",
    "Few-shot learning provides a few examples of a task in the prompt to guide the model's behavior, improving performance without fine-tuning.",
    "Prompt engineering is the practice of crafting effective prompts to elicit desired behaviors from language models. It's crucial for maximizing model performance."
]

print(f"Knowledge base: {len(documents)} documents")
print(f"Sample document: {documents[0][:80]}...")

## Method 1: RAG with FAISS

FAISS (Facebook AI Similarity Search) is a fast library for similarity search.
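
The flat index used in this section (`IndexFlatL2`) performs exact search: it compares the query against every stored vector and returns the smallest squared Euclidean distances. A rough pure-Python sketch of that computation, using a hypothetical `flat_l2_search` helper and made-up 2-dimensional vectors, might look like this:

```python
def flat_l2_search(query, vectors, top_k=3):
    """Return (index, squared L2 distance) pairs for the top_k nearest vectors."""
    scored = []
    for i, vec in enumerate(vectors):
        dist_sq = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((i, dist_sq))
    # Smaller distance means more similar, so sort ascending.
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]

toy_index = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
for idx, dist_sq in flat_l2_search([1.0, 1.0], toy_index, top_k=2):
    print(f"vector {idx}: squared distance {dist_sq:.1f}")
```

Because every vector is scanned, the cost grows linearly with the collection size. That is fine for a small knowledge base like the one in this notebook.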

In [None]:
import faiss

print("Loading embedding model...")
embedding_model = SentenceTransformer(EMBEDDING_MODEL)
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

In [None]:
# Create embeddings for all documents
print(f"Creating embeddings for {len(documents)} documents...")
embeddings = embedding_model.encode(documents, show_progress_bar=True)
print(f"Embeddings shape: {embeddings.shape}")

In [None]:
# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # Exact search; returns squared L2 (Euclidean) distances
index.add(embeddings.astype('float32'))

print(f"FAISS index created: {index.ntotal} vectors, {dimension} dimensions")

In [None]:
def search_documents(query, top_k=3):
    """
    Search for relevant documents using semantic similarity.
    
    Args:
        query: Search query
        top_k: Number of documents to retrieve
    
    Returns:
        List of (document, distance) tuples, nearest first (lower distance = more similar)
    """
    query_embedding = embedding_model.encode([query])
    distances, indices = index.search(query_embedding.astype('float32'), top_k)
    
    results = []
    for idx, distance in zip(indices[0], distances[0]):
        if idx == -1:  # FAISS pads with -1 when fewer than top_k vectors exist
            continue
        results.append((documents[idx], float(distance)))
    
    return results

print("Search function ready")

In [None]:
# Test semantic search
query = "What are transformers in machine learning?"
results = search_documents(query, top_k=3)

print(f"Query: {query}\n")
print("Top 3 relevant documents:\n")
for i, (doc, distance) in enumerate(results, 1):
    print(f"{i}. [L2 distance: {distance:.4f}, lower is closer]")
    print(f"   {doc}\n")

## RAG Query Function

Combine retrieval with generation.
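
The context-injection step can be looked at in isolation. The sketch below assumes a hypothetical `build_rag_prompt` helper that numbers each retrieved document, which also makes it easier to ask the model for citations. The prompt wording is an illustration, not a tested recommendation.

```python
def build_rag_prompt(question, retrieved_docs):
    """Format retrieved documents as a numbered context block for the LLM."""
    numbered = [f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, 1)]
    context = "\n".join(numbered)
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source numbers you used.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

example_prompt = build_rag_prompt(
    "What is BERT?",
    ["BERT is a transformer-based model.", "GPT is an autoregressive model."],
)
print(example_prompt)
```

The resulting string can be sent as the `content` of a user message, in the same way as the prompt in the RAG query function.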

In [None]:
def rag_query(question, top_k=3, verbose=True):
    """
    Answer a question using RAG.
    
    Args:
        question: User question
        top_k: Number of documents to retrieve
        verbose: Print retrieval details
    
    Returns:
        Generated answer
    """
    if verbose:
        print(f"Question: {question}\n")
        print("Step 1: Retrieving relevant documents...")
    
    results = search_documents(question, top_k=top_k)
    
    if verbose:
        print(f"Retrieved {len(results)} documents\n")
    
    context = "\n\n".join([doc for doc, _ in results])
    
    if verbose:
        print("Step 2: Generating answer with LLM...\n")
    
    prompt = f"""You are a helpful AI assistant. Answer the question based on the provided context.

Context:
{context}

Question: {question}

Answer:"""
    
    response = ollama.chat(
        model=LLM_MODEL,
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    answer = response['message']['content']
    
    if verbose:
        print("="*70)
        print("ANSWER")
        print("="*70)
        print(answer)
        print("="*70)
    
    return answer

print("RAG query function ready")

In [None]:
# Example 1: Simple question
answer = rag_query("What is BERT?")

In [None]:
# Example 2: More complex question
answer = rag_query("How does LoRA help with fine-tuning?")

In [None]:
# Example 3: Comparison question
answer = rag_query("What's the difference between zero-shot and few-shot learning?")

## Method 2: RAG with ChromaDB

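ChromaDB is a more feature-rich vector database with built-in persistence. It also embeds documents itself: unless you pass an embedding function, it uses its own default model (all-MiniLM-L6-v2), independent of `EMBEDDING_MODEL` above.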

In [None]:
import chromadb
from chromadb.config import Settings

print("Initializing ChromaDB...")
# PersistentClient stores data on disk; chromadb.Client() is in-memory only
# in chromadb 0.4 and later.
chroma_client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False)
)

# Create the collection on first run, or load it on later runs
collection = chroma_client.get_or_create_collection(
    name="ml_knowledge",
    metadata={"description": "Machine learning knowledge base"}
)
print(f"Collection ready: {collection.count()} documents")

In [None]:
# Add documents to ChromaDB
if collection.count() == 0:
    print(f"Adding {len(documents)} documents to ChromaDB...")
    
    collection.add(
        documents=documents,
        ids=[f"doc_{i}" for i in range(len(documents))],
        metadatas=[{"source": "tutorial", "index": i} for i in range(len(documents))]
    )
    
    print(f"Added {collection.count()} documents")
else:
    print(f"Collection already contains {collection.count()} documents")

In [None]:
def rag_query_chroma(question, top_k=3, verbose=True):
    """
    Answer a question using RAG with ChromaDB.
    
    Args:
        question: User question
        top_k: Number of documents to retrieve
        verbose: Print retrieval details
    
    Returns:
        Generated answer
    """
    if verbose:
        print(f"Question: {question}\n")
        print("Step 1: Querying ChromaDB...")
    
    results = collection.query(
        query_texts=[question],
        n_results=top_k
    )
    
    retrieved_docs = results['documents'][0]
    
    if verbose:
        print(f"Retrieved {len(retrieved_docs)} documents\n")
    
    context = "\n\n".join(retrieved_docs)
    
    if verbose:
        print("Step 2: Generating answer with LLM...\n")
    
    prompt = f"""You are a helpful AI assistant. Answer the question based on the provided context.

Context:
{context}

Question: {question}

Answer:"""
    
    response = ollama.chat(
        model=LLM_MODEL,
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    answer = response['message']['content']
    
    if verbose:
        print("="*70)
        print("ANSWER")
        print("="*70)
        print(answer)
        print("="*70)
    
    return answer

print("ChromaDB RAG function ready")

In [None]:
# Test ChromaDB RAG
answer = rag_query_chroma("What is RAG and how does it work?")

## Comparison: With vs Without RAG

In [None]:
test_question = "What is the Model Context Protocol?"

print("WITHOUT RAG (LLM alone):")
print("="*70)
response_no_rag = ollama.chat(
    model=LLM_MODEL,
    messages=[{'role': 'user', 'content': test_question}]
)
print(response_no_rag['message']['content'])
print("\n" + "="*70 + "\n")

print("WITH RAG (retrieval + LLM):")
print("="*70)
answer_with_rag = rag_query(test_question, verbose=False)
print(answer_with_rag)
print("="*70)

## Practical Application: Building a Custom Knowledge Base

Let's build a RAG system for a different domain: company documentation.

In [None]:
# Example: Company policy documents
company_docs = [
    "Our company offers 20 days of paid vacation per year for all full-time employees. Part-time employees receive prorated vacation days.",
    "Remote work is available up to 3 days per week for eligible positions. Employees must coordinate with their manager and maintain core hours of 10am-3pm.",
    "Health insurance coverage begins on the first day of the month following your start date. We offer medical, dental, and vision plans.",
    "The 401(k) retirement plan has a 4% company match. Employees are eligible after 90 days of employment.",
    "Performance reviews are conducted twice per year in June and December. Salary adjustments are made following the December review cycle.",
    "Our parental leave policy provides 16 weeks of paid leave for primary caregivers and 8 weeks for secondary caregivers.",
    "Professional development budget of $2000 per year is available for conferences, courses, and certifications after 6 months of employment.",
    "The employee referral program offers a $3000 bonus for successful hires who remain with the company for at least 6 months."
]

print(f"Company knowledge base: {len(company_docs)} policies")

In [None]:
# Create embeddings and index for company docs
company_embeddings = embedding_model.encode(company_docs)
company_index = faiss.IndexFlatL2(company_embeddings.shape[1])
company_index.add(company_embeddings.astype('float32'))

print(f"Company document index created: {company_index.ntotal} documents")

In [None]:
def company_rag_query(question, top_k=2):
    """RAG for company policy questions."""
    query_embedding = embedding_model.encode([question])
    distances, indices = company_index.search(query_embedding.astype('float32'), top_k)
    
    context = "\n\n".join([company_docs[idx] for idx in indices[0]])
    
    prompt = f"""You are a helpful HR assistant. Answer the employee's question based on company policies.

Company Policies:
{context}

Question: {question}

Answer:"""
    
    response = ollama.chat(
        model=LLM_MODEL,
        messages=[{'role': 'user', 'content': prompt}]
    )
    
    return response['message']['content']

print("Company RAG system ready")

In [None]:
# Test company RAG
questions = [
    "How many vacation days do I get?",
    "When does health insurance start?",
    "What is the 401k match?"
]

for q in questions:
    print(f"Q: {q}")
    answer = company_rag_query(q)
    print(f"A: {answer}\n")

## Performance Benchmarking

In [None]:
import time

def benchmark_rag(query, num_runs=3):
    """Benchmark RAG performance."""
    times = {'embedding': [], 'search': [], 'generation': [], 'total': []}
    
    for _ in range(num_runs):
        # perf_counter is a monotonic, high-resolution clock suited to timing
        start_total = time.perf_counter()
        
        start = time.perf_counter()
        query_embedding = embedding_model.encode([query])
        times['embedding'].append(time.perf_counter() - start)
        
        start = time.perf_counter()
        distances, indices = index.search(query_embedding.astype('float32'), 3)
        times['search'].append(time.perf_counter() - start)
        
        context = "\n\n".join([documents[idx] for idx in indices[0]])
        prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"
        
        start = time.perf_counter()
        response = ollama.chat(
            model=LLM_MODEL,
            messages=[{'role': 'user', 'content': prompt}]
        )
        times['generation'].append(time.perf_counter() - start)
        
        times['total'].append(time.perf_counter() - start_total)
    
    print("Performance Breakdown (averages):")
    print(f"  Embedding:  {np.mean(times['embedding'])*1000:.1f}ms")
    print(f"  Search:     {np.mean(times['search'])*1000:.1f}ms")
    print(f"  Generation: {np.mean(times['generation'])*1000:.1f}ms")
    print(f"  Total:      {np.mean(times['total'])*1000:.1f}ms")

benchmark_rag("What is RAG?")
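
Single measurements can be noisy, and the first call often includes one-time costs such as loading a model into memory. A small sketch of a more robust timer follows, assuming a hypothetical `time_call` helper that discards warm-up runs and reports the median. The workload here is a stand-in, not a RAG call.

```python
import statistics
import time

def time_call(fn, *args, warmup=1, runs=5):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # Warm-up runs are executed but not timed.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Stand-in workload; in the notebook this could be a search or LLM call.
elapsed_ms = time_call(sorted, list(range(10_000, 0, -1)))
print(f"median: {elapsed_ms:.3f} ms")
```

The median is less sensitive than the mean to a single slow run, which matters when only a handful of runs are affordable.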

## Exercises

1. **Custom Knowledge Base**: Create a RAG system with your own documents (recipes, study notes, etc.)
2. **Tune Top-K**: Experiment with different top_k values (1, 3, 5, 10) and observe answer quality
3. **Hybrid Search**: Combine keyword matching with semantic search for better retrieval
4. **Citation**: Modify the prompt to make the LLM cite which documents it used
5. **Multi-Query**: Implement query expansion (generate multiple related queries for better coverage)
6. **Re-ranking**: Add a re-ranking step after initial retrieval to improve relevance
7. **Larger Dataset**: Test with 100+ documents and measure performance scaling
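
For the hybrid search exercise, one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below is a starting point under stated assumptions: `reciprocal_rank_fusion` is a hypothetical helper, the two input rankings are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids; each id scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_ranking = ["doc_4", "doc_7", "doc_1"]   # e.g. from keyword matching
semantic_ranking = ["doc_7", "doc_2", "doc_4"]  # e.g. from vector search
print(reciprocal_rank_fusion([keyword_ranking, semantic_ranking]))
```

Documents that appear near the top of both lists rise to the top of the fused list, while documents found by only one method are kept but ranked lower.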

In [None]:
# Your code here for exercises


## Key Takeaways

✅ **RAG combines retrieval and generation** to answer questions using external knowledge

✅ **Embedding models** convert text to vectors for semantic similarity

✅ **Vector databases** (FAISS, ChromaDB) enable fast similarity search

✅ **Context injection** grounds LLM responses in retrieved facts

✅ **Local LLMs** (Ollama) work well for RAG with proper context

✅ **RAG can reduce hallucinations** by grounding answers in retrieved facts

## Next Steps

- Try **Advanced RAG Techniques**: HyDE, query expansion, re-ranking
- Explore **Production RAG**: LangChain, LlamaIndex frameworks
- Learn **Evaluation**: RAGAS, TruLens for RAG quality metrics
- Combine with **MCP**: Use RAG as a tool in agentic workflows

## Resources

- [FAISS Documentation](https://github.com/facebookresearch/faiss)
- [ChromaDB Documentation](https://docs.trychroma.com/)
- [Sentence Transformers](https://www.sbert.net/)
- [RAG Paper (Lewis et al., 2020)](https://arxiv.org/abs/2005.11401)
- [Ollama Documentation](https://ollama.ai/)