# ML Lab 10: Build a RAG System

Everyone is building RAG systems, but most people treat them as black boxes. In this lab,
you'll poke inside one: send queries, examine retrieved chunks, experiment with chunking
strategies, measure retrieval quality, and discover exactly where RAG systems break.

---
## Section 1: Query the RAG System

The RAG API is running at `http://localhost:8000`. Documents about distributed systems,
ML basics, and Python tips have already been ingested. Let's send queries and examine
what comes back: the retrieved chunks, the generated answer, and the latency breakdown.

In [None]:
import requests
import json

RAG_API = "http://localhost:8000"

# Verify the API is running
health = requests.get(f"{RAG_API}/health").json()
print(f"API status: {health}")

# Check how many document chunks are stored
stats = requests.get(f"{RAG_API}/collection/stats").json()
print(f"Collection: {stats['name']}, Chunks stored: {stats['count']}")

In [None]:
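def query_rag(question, n_results=3):
    """Send a question to the RAG API and display results."""
    resp = requests.post(f"{RAG_API}/query", json={
        "question": question,
        "n_results": n_results,
        "model": "tinyllama"
    }, timeout=120)  # generation on a small local model can take a while
    resp.raise_for_status()  # fail loudly on HTTP errors instead of a confusing KeyError
    result = resp.json()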
    
    print(f"Question: {question}")
    print(f"{'=' * 60}")
    print(f"Answer: {result['answer'][:500]}")
    print(f"\n--- Timing ---")
    print(f"  Retrieval: {result['retrieval_time_ms']:.1f} ms")
    print(f"  Generation: {result['generation_time_ms']:.1f} ms")
    print(f"  Total: {result['total_time_ms']:.1f} ms")
    print(f"\n--- Sources ({len(result['sources'])}) ---")
    for i, src in enumerate(result['sources']):
        print(f"  [{i+1}] {src['metadata'].get('source', 'unknown')}: {src['chunk'][:100]}...")
    print()
    return result

In [None]:
# Query 1: A topic well-covered by the documents
result_cap = query_rag("What is the CAP theorem and what tradeoffs does it describe?")

In [None]:
# Query 2: Another well-covered topic
result_gd = query_rag("How does gradient descent work in machine learning?")

In [None]:
# Query 3: A topic that spans multiple documents
result_cross = query_rag("What are ensemble methods and how do they relate to overfitting?")

In [None]:
# Query 4: Something NOT in the documents at all
result_miss = query_rag("What is the current stock price of NVIDIA?")

**What you should see:** For topics covered in the documents (CAP theorem, gradient descent),
the system retrieves relevant chunks and generates a grounded answer. For the stock price
question, the system retrieves unrelated chunks and either hallucinates or says it doesn't know.

Notice how generation time dominates total latency -- retrieval is fast, LLM generation is slow.
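
To put a number on that observation, a small helper can turn the timing fields into shares of the total. This is a minimal sketch: `latency_share` is a hypothetical helper name, it assumes the response carries the same `retrieval_time_ms`, `generation_time_ms`, and `total_time_ms` fields that `query_rag` prints, and the sample values are made up rather than measured.

```python
def latency_share(result):
    """Return each stage's share of total latency as a fraction between 0 and 1."""
    total = result["total_time_ms"]
    if total <= 0:
        raise ValueError("total_time_ms must be positive")
    retrieval = result["retrieval_time_ms"] / total
    generation = result["generation_time_ms"] / total
    # Whatever is left over is overhead such as prompt assembly or networking
    other = max(0.0, 1.0 - retrieval - generation)
    return {"retrieval": retrieval, "generation": generation, "other": other}


# Illustrative numbers only, not measurements from the lab API
sample = {"retrieval_time_ms": 20.0, "generation_time_ms": 1900.0, "total_time_ms": 2000.0}
shares = latency_share(sample)
for stage, frac in shares.items():
    print(f"{stage:>10}: {frac:.1%}")
```

In the notebook you could call `latency_share(result_cap)` on any of the results returned above and compare the shares across queries.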

---

## Section 2: Chunking Strategies

Chunk size is the single most impactful parameter in a RAG system. Too small, and chunks
lose context. Too large, and chunks dilute the relevant information with irrelevant text.

Let's re-ingest the same document with different chunk sizes and see how it affects retrieval.

In [None]:
# A short excerpt on distributed systems, inlined as a string so this cell
# runs without access to the container's file system
doc_text = """
The CAP theorem, formulated by Eric Brewer in 2000, states that a distributed data store 
can provide at most two out of three guarantees simultaneously: Consistency, Availability, 
and Partition Tolerance. In practice, since network partitions are unavoidable in distributed 
systems, the real choice is between consistency and availability during a partition.

Consensus algorithms allow distributed systems to agree on a single value even when some 
nodes fail. Paxos reaches agreement in two phases: Prepare/Promise and Accept/Accepted. Raft was 
designed as an understandable alternative to Paxos, decomposing consensus into leader 
election, log replication, and safety.

Eventual consistency is a model in which, if no new updates are made, eventually all 
accesses will return the last updated value. This is weaker than strong consistency but 
allows for higher availability and lower latency.

Vector clocks track causality in distributed systems. Each node maintains a vector of 
logical timestamps. CRDTs are data structures that can be replicated and updated 
independently without coordination, guaranteed to converge.
"""

print(f"Document length: {len(doc_text.split())} words")

In [None]:
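# Helper: chunk text locally so we can inspect the chunks
def chunk_text(text, chunk_size, overlap):
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance and loop forever
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # last window reached; avoids a trailing chunk made only of overlap
        start = end - overlap  # step forward, reusing the last `overlap` words
    return chunks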

# Compare chunk counts at different sizes
for size in [50, 100, 200, 500]:
    chunks = chunk_text(doc_text, chunk_size=size, overlap=max(10, size // 10))
    print(f"Chunk size {size:>4d} words -> {len(chunks):>2d} chunks, "
          f"avg {sum(len(c.split()) for c in chunks) / len(chunks):.0f} words/chunk")
    print(f"  First chunk preview: {chunks[0][:80]}...")
    print()

In [None]:
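# Ingest with different chunk sizes and compare retrieval.
# Caveat: assuming /ingest appends to the existing collection, chunks from
# earlier iterations (and from the original documents) stay searchable, so
# the results for one size can include chunks created at another size.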
test_question = "What is the CAP theorem?"
chunk_sizes = [50, 100, 200, 500]

print(f"Question: {test_question}\n")

for size in chunk_sizes:
    # Ingest with this chunk size
    resp = requests.post(f"{RAG_API}/ingest", json={
        "text": doc_text,
        "metadata": {"source": f"test_chunk_{size}"},
        "chunk_size": size,
        "chunk_overlap": max(10, size // 10)
    })
    ingest_result = resp.json()
    
    # Query
    resp = requests.post(f"{RAG_API}/query", json={
        "question": test_question,
        "n_results": 2,
        "model": "tinyllama"
    })
    result = resp.json()
    
    print(f"--- Chunk size: {size} words ({ingest_result['chunks']} chunks) ---")
    print(f"  Retrieval time: {result['retrieval_time_ms']:.1f} ms")
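    for i, src in enumerate(result['sources']):
        # Show the source label so you can tell which ingest a chunk came from
        source = src['metadata'].get('source', 'unknown')
        print(f"  Source [{i+1}] ({source}): {src['chunk'][:120]}...")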
    print()

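**Key insight:** With very small chunks (50 words), each chunk lacks context -- you get a
fragment that mentions "CAP theorem" but not the full explanation. With very large
chunks (500+ words), the relevant information is buried among unrelated content. Note that
the sample text here is well under 200 words, so at sizes 200 and 500 it collapses into a
single chunk; the dilution effect shows up more clearly on longer documents.

A common starting point for prose is 150-300 words per chunk with 10-20% overlap, though
the best value depends on your documents and embedding model, so measure it. This is why
**chunk size is often the most impactful parameter in RAG** -- it can matter more than the
embedding model, the LLM, or the number of retrieved chunks.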
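
The size and overlap figures translate directly into how many chunks get stored. As a rough sketch, assuming a word-based sliding window that stops once a window reaches the end of the text (the API's own chunker may behave differently), the count follows from the step size `chunk_size - overlap`. Both function names below are hypothetical, and the second exists only to check the formula.

```python
import math


def expected_chunk_count(n_words, chunk_size, overlap):
    """Chunks produced by a sliding window that stops once it reaches the end."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_words <= 0:
        return 0
    if n_words <= chunk_size:
        return 1
    step = chunk_size - overlap  # each new chunk advances by this many words
    return 1 + math.ceil((n_words - chunk_size) / step)


def sliding_chunks(words, chunk_size, overlap):
    """Reference chunker, used here only to verify the formula."""
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
        start += step
    return chunks


# A synthetic 1000-word document, for illustration only
words = [f"w{i}" for i in range(1000)]
for size, overlap in [(150, 15), (200, 40), (300, 60)]:
    n = expected_chunk_count(len(words), size, overlap)
    assert n == len(sliding_chunks(words, size, overlap))
    print(f"size={size}, overlap={overlap} ({overlap / size:.0%}) -> {n} chunks")
```

Overlap is not free: with 20% overlap each step covers only 80% new words, so the store holds roughly 1.25x the words of the original text.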

---

## Section 3: Evaluate Retrieval Quality

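The LLM can only generate good answers if the retrieval step finds the right chunks.
Let's create a golden test set of questions, each labeled with the document that should
answer it, and measure retrieval precision: for each question, does at least one of the
top-3 retrieved chunks come from the expected source document? (Strictly speaking, this
is a hit rate@3; this lab refers to it as precision@3.)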

In [None]:
# Golden test set: questions with the expected source document
golden_qa = [
    {"question": "What is the CAP theorem?",
     "expected_source": "distributed_systems.md"},
    {"question": "How does the Raft consensus algorithm work?",
     "expected_source": "distributed_systems.md"},
    {"question": "What are CRDTs and how do they work?",
     "expected_source": "distributed_systems.md"},
    {"question": "What is consistent hashing used for?",
     "expected_source": "distributed_systems.md"},
    {"question": "What is the difference between supervised and unsupervised learning?",
     "expected_source": "ml_basics.md"},
    {"question": "What is the bias-variance tradeoff?",
     "expected_source": "ml_basics.md"},
    {"question": "How does cross-validation work?",
     "expected_source": "ml_basics.md"},
    {"question": "What are Python generators and when should you use them?",
     "expected_source": "python_tips.md"},
    {"question": "How do Python decorators work?",
     "expected_source": "python_tips.md"},
    {"question": "What is NumPy used for in Python?",
     "expected_source": "python_tips.md"},
]

print(f"Golden test set: {len(golden_qa)} questions")

In [None]:
# Evaluate retrieval precision@3
correct = 0
results_log = []

for qa in golden_qa:
    resp = requests.post(f"{RAG_API}/query", json={
        "question": qa["question"],
        "n_results": 3,
        "model": "tinyllama"
    })
    result = resp.json()
    
    # Check if any retrieved source matches the expected document
    retrieved_sources = [s["metadata"].get("source", "") for s in result["sources"]]
    hit = qa["expected_source"] in retrieved_sources
    if hit:
        correct += 1
    
    results_log.append({
        "question": qa["question"][:50],
        "expected": qa["expected_source"],
        "retrieved": retrieved_sources,
        "hit": hit
    })
    
    status = "HIT" if hit else "MISS"
    print(f"[{status}] {qa['question'][:60]}")
    print(f"       Expected: {qa['expected_source']}")
    print(f"       Got: {retrieved_sources}")
    print()

precision = correct / len(golden_qa)
print(f"{'=' * 50}")
print(f"Retrieval Precision@3: {precision:.1%} ({correct}/{len(golden_qa)})")
print(f"{'=' * 50}")

In [None]:
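# Visualize results
# The notebook's default backend renders the figure inline; forcing the
# non-interactive 'Agg' backend would turn plt.show() into a no-op.
import matplotlib.pyplot as plt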

questions = [r["question"][:30] + "..." for r in results_log]
hits = [1 if r["hit"] else 0 for r in results_log]
colors = ["#4CAF50" if h else "#F44336" for h in hits]

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.barh(range(len(questions)), hits, color=colors)
ax.set_yticks(range(len(questions)))
ax.set_yticklabels(questions, fontsize=9)
ax.set_xlabel("Retrieved Correct Source (1=yes, 0=no)")
ax.set_title(f"Retrieval Precision@3: {precision:.0%}")
ax.set_xlim(-0.1, 1.1)
ax.invert_yaxis()
plt.tight_layout()
plt.savefig("retrieval_precision.png", dpi=100, bbox_inches="tight")
plt.show()
print("Saved to retrieval_precision.png")

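**What you should see:** Most questions should retrieve the correct source document in the
top 3 results. Questions that fail are usually those where the wording doesn't closely
match the document text, or where the topic spans multiple documents. If you ran Section 2
first, the re-ingested test chunks (labeled `test_chunk_50`, `test_chunk_100`, and so on)
also cover CAP, Raft, and CRDTs, so they can outrank chunks from `distributed_systems.md`
and turn those questions into misses.

This metric -- retrieval precision -- is the most important metric to track in a RAG system.
If retrieval is wrong, no amount of LLM quality can save the answer.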
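
The evaluation loop above counts a question as correct when the expected document appears anywhere in the top 3. That is usually called hit rate@k, and it is worth seeing next to two related metrics. The sketch below is illustrative: the function names are hypothetical, the sample results are made up, and "relevant" is simplified to mean "comes from the expected document".

```python
def hit_rate_at_k(retrieved_lists, expected, k=3):
    """Fraction of questions with at least one expected source in the top k."""
    hits = sum(exp in got[:k] for got, exp in zip(retrieved_lists, expected))
    return hits / len(expected)


def precision_at_k(retrieved_lists, expected, k=3):
    """Average fraction of the top-k slots filled by the expected source."""
    per_question = [
        sum(src == exp for src in got[:k]) / k
        for got, exp in zip(retrieved_lists, expected)
    ]
    return sum(per_question) / len(per_question)


def mean_reciprocal_rank(retrieved_lists, expected):
    """Average of 1/rank of the first expected source (0 when it is missing)."""
    total = 0.0
    for got, exp in zip(retrieved_lists, expected):
        if exp in got:
            total += 1.0 / (got.index(exp) + 1)
    return total / len(expected)


# Made-up retrieval results for three questions, for illustration only
retrieved = [
    ["ml_basics.md", "ml_basics.md", "python_tips.md"],
    ["python_tips.md", "distributed_systems.md", "ml_basics.md"],
    ["python_tips.md", "python_tips.md", "ml_basics.md"],
]
expected = ["ml_basics.md", "distributed_systems.md", "distributed_systems.md"]

print(f"hit rate@3:  {hit_rate_at_k(retrieved, expected):.2f}")
print(f"precision@3: {precision_at_k(retrieved, expected):.2f}")
print(f"MRR:         {mean_reciprocal_rank(retrieved, expected):.2f}")
```

Hit rate tells you whether the right document showed up at all, precision tells you how much of the context window it occupies, and MRR rewards ranking it first. With the `results_log` built above, you could pass `[r["retrieved"] for r in results_log]` and `[r["expected"] for r in results_log]`.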

---

## Section 4: Breaking the System

RAG systems fail in interesting ways. Let's stress-test ours with adversarial inputs
and see how it handles each failure mode.

In [None]:
# Failure mode 1: Question not in the documents at all
print("=" * 60)
print("FAILURE MODE 1: Out-of-scope question")
print("=" * 60)
result = query_rag("What is the recipe for chocolate cake?")
print("Observation: The system retrieves whatever is 'closest' in vector space,")
print("even though none of the chunks are actually relevant.")
print("The LLM may hallucinate an answer or (if we're lucky) say it doesn't know.")

In [None]:
# Failure mode 2: Ambiguous question
print("=" * 60)
print("FAILURE MODE 2: Ambiguous question")
print("=" * 60)
result = query_rag("What is consistency?")
print("Observation: 'Consistency' appears in both distributed systems (CAP theorem)")
print("and ML (consistent models). The system may mix concepts from different domains.")

In [None]:
# Failure mode 3: Very short query
print("=" * 60)
print("FAILURE MODE 3: Very short query")
print("=" * 60)
result = query_rag("Raft")
print("Observation: Single-word queries have less semantic signal for retrieval.")
print("The embedding of 'Raft' may or may not match the chunk about Raft consensus.")

In [None]:
# Failure mode 4: Very long query
print("=" * 60)
print("FAILURE MODE 4: Very long query")
print("=" * 60)
long_question = (
    "I am building a distributed system and I need to understand all the different "
    "consistency models and how they relate to the CAP theorem and also I want to know "
    "about consensus algorithms like Paxos and Raft and how they handle leader election "
    "and log replication and also what about CRDTs and vector clocks and gossip protocols "
    "and how do all of these things fit together in a real system?"
)
result = query_rag(long_question)
print("Observation: Long queries dilute the semantic signal. The embedding tries to")
print("represent too many concepts at once, and retrieval may miss the most relevant chunk.")

In [None]:
# Failure mode 5: Trick question (looks relevant but is adversarial)
print("=" * 60)
print("FAILURE MODE 5: Trick question")
print("=" * 60)
result = query_rag("Why is the CAP theorem wrong and outdated?")
print("Observation: The documents describe CAP factually. But the question frames it")
print("as 'wrong'. The LLM might agree with the premise (hallucination) or correctly")
print("note that the documents don't say it's wrong.")

**Key takeaways from breaking the system:**

1. **RAG systems always retrieve something** -- even when nothing is relevant. There is no built-in "I don't know" from the retrieval step.
2. **Ambiguous queries retrieve mixed results** -- the system can't ask for clarification.
3. **Query length affects retrieval** -- too short lacks signal, too long dilutes it.
4. **The LLM trusts its context** -- if irrelevant chunks are retrieved, the LLM will try to use them.
5. **Adversarial framing can fool the LLM** -- the retrieval finds the right content, but the question's framing can still lead to wrong answers.
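
Takeaway 1 is easy to reproduce without the API. The sketch below uses a toy bag-of-words cosine similarity as a stand-in for a real embedding model (an assumption made purely for illustration; the lab's API uses its own embeddings), and all helper names are hypothetical. Top-k search ranks whatever is stored and returns the first k entries, regardless of how low the scores are.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k highest-scoring (score, chunk) pairs, however low the scores."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "the cap theorem trades consistency against availability",
    "raft splits consensus into leader election and log replication",
    "gradient descent updates weights along the negative gradient",
]

for question in ["what is the cap theorem", "recipe for chocolate cake"]:
    print(question)
    for score, chunk in top_k(question, chunks):
        print(f"  {score:.2f}  {chunk}")
```

The chocolate cake query shares no words with any chunk, so every score is zero, yet two chunks still come back. Real embedding models rarely produce exact zeros, which makes irrelevant matches harder to spot without an explicit threshold.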

In production RAG systems, you need:
- Retrieval score thresholds to filter out low-confidence results
- Query classification to detect out-of-scope questions
- Answer grounding checks to verify the answer is supported by the retrieved text
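
Two of these safeguards can be sketched in a few lines. Everything here is an assumption for illustration: the `score` field is hypothetical (the lab API may expose a distance, or no score at all), the thresholds are arbitrary, and word overlap is a crude stand-in for a real grounding check. Query classification is not shown.

```python
import re


def filter_by_score(sources, min_score=0.5):
    """Keep only sources whose (hypothetical) similarity score clears the threshold."""
    return [s for s in sources if s.get("score", 0.0) >= min_score]


def grounding_ratio(answer, sources):
    """Fraction of distinct answer words that also appear in the retrieved chunks."""
    answer_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    context_words = set()
    for s in sources:
        context_words |= set(re.findall(r"[a-z0-9]+", s["chunk"].lower()))
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)


def guarded_answer(answer, sources, min_score=0.5, min_grounding=0.6):
    """Return the answer only if retrieval and grounding both look acceptable."""
    kept = filter_by_score(sources, min_score)
    if not kept:
        return "I don't know: no retrieved chunk was relevant enough."
    if grounding_ratio(answer, kept) < min_grounding:
        return "I don't know: the draft answer is not supported by the sources."
    return answer


# Made-up sources and answers, for illustration only
sources = [
    {"chunk": "The CAP theorem states that a distributed data store can provide "
              "at most two of consistency, availability and partition tolerance.",
     "score": 0.82},
    {"chunk": "Python generators yield values lazily.", "score": 0.12},
]
good = "The CAP theorem states a distributed data store can provide at most two of three guarantees."
bad = "NVIDIA stock closed at a record high today."

print(guarded_answer(good, sources))
print(guarded_answer(bad, sources))
```

A production system would more likely calibrate the threshold on a labeled set and use a stronger grounding check, such as an entailment model or a second LLM pass, but the control flow stays the same: filter first, then verify, and fall back to "I don't know".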

---

## Summary

You've queried, measured, and stress-tested a RAG system. Here's what you now know:

| Concept | What You Learned |
|---------|------------------|
| **RAG Pipeline** | Ingest -> Chunk -> Embed -> Store -> Retrieve -> Generate |
| **Chunk Size** | The most impactful parameter -- 150-300 words is the sweet spot |
| **Retrieval Precision** | The metric that matters most -- bad retrieval means bad answers |
| **Failure Modes** | RAG always retrieves something, even when nothing is relevant |
| **Latency** | Retrieval is fast (ms), generation is slow (seconds) |

### What's Next?

In **ML Lab 11**, you'll build an AI agent that uses this RAG system as one of its tools,
along with a calculator, search, and time tool -- with guardrails to prevent misuse.