# 🧪 Lab 3: Retrieval-Augmented Generation (RAG) Basics — Google Colab

In this lab, you’ll build a **minimal RAG pipeline** end to end.

**What you'll learn**
1. Chunking a small knowledge base (KB) of documents
2. Embedding texts with the OpenAI Embeddings API
3. Vector search (cosine similarity) for top-k retrieval
4. Assembling a prompt with **context + question**
5. Answering with **citations**

> **Why are we doing this?**
> LLMs are strong generalists, but they don’t know your *local* facts. RAG searches your KB and feeds the matching passages into the model so the answer is grounded in them. It is one of the most widely used patterns for LLM apps.

## ✅ Step 0 — Colab Runtime Check

In [None]:
import sys
print('Python', sys.version)
try:
    import google.colab  # type: ignore
    print('✅ Running in Google Colab')
except Exception:
    print('ℹ️ Not in Colab (that is okay for local runs).')

## 🔐 Step 1 — Install SDK & Set API Key

We’ll use the official **OpenAI Python SDK** for both **embeddings** and **chat completions**. You need an API key.

**How to use in Colab**
1. Create a key at: https://platform.openai.com/account/api-keys
2. Run the cell below (you’ll be prompted to paste the key).

In [None]:
# Quote the version specifier: unquoted, the shell treats ">" as an output redirect
!pip -q install --upgrade "openai>=1.40" numpy

import os
from getpass import getpass

# Prompt only if the key is missing or empty
if 'OPENAI_API_KEY' not in os.environ or not os.environ['OPENAI_API_KEY']:
    print('Enter your OpenAI API key (hidden):')
    os.environ['OPENAI_API_KEY'] = getpass()
print('✅ API key set in environment.')

## 📚 Step 2 — Create a Tiny Knowledge Base (Toy Docs)

We’ll start with a few short, engineer-friendly snippets. You can replace these with your own content later.

In [None]:
toy_docs = [
    ("microservices.txt", """
Microservices split a system into independently deployable services. They communicate via APIs. Each service owns its data and scales separately. Trade-offs include operational complexity and eventual consistency.
"""),
    ("kafka.txt", """
Apache Kafka is a distributed event streaming platform used for high-throughput, low-latency pipelines. Producers write to topics; consumers read from partitions. Kafka is durable and horizontally scalable.
"""),
    ("docker.txt", """
Docker packages applications into containers with all dependencies. Images are immutable; containers are ephemeral. Use tags, registries, and multi-stage builds for efficient delivery.
"""),
    ("kubernetes.txt", """
Kubernetes orchestrates containers across nodes. Core objects: Pods, Deployments, Services, ConfigMaps, Secrets. It supports rolling updates, health checks, and autoscaling.
"""),
    ("python.txt", """
Python is a general-purpose language with strong ecosystems in data, web, and automation. Virtual environments isolate dependencies. Package with pip and pyproject.toml for reproducible builds.
"""),
    ("ci_cd.txt", """
CI/CD automates building, testing, and deploying. Use pipelines to run unit tests, linting, security scans, and to promote artifacts across environments with approvals.
"""),
    ("rest.txt", """
REST uses HTTP verbs and resources. Prefer stateless servers and consistent status codes. Document with OpenAPI. Support pagination, filtering, and idempotency for PUT.
"""),
    ("grpc.txt", """
gRPC uses Protocol Buffers and HTTP/2 for fast, type-safe RPC. It supports streaming and bi-directional communication. Good for internal service-to-service calls.
""")
]

len(toy_docs), toy_docs[0][0]

## ✂️ Step 3 — Chunking

We split long documents into smaller **chunks** so retrieval can match relevant parts. We’ll use a simple character-length chunker with overlap.

In [None]:
from typing import List, Dict

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 80) -> List[str]:
    # Guard: the window must advance, otherwise the loop below never terminates
    if not 0 <= overlap < chunk_size:
        raise ValueError('overlap must satisfy 0 <= overlap < chunk_size')
    text = ' '.join(text.split())  # normalize whitespace
    chunks = []
    start = 0
    while start < len(text):
        end = min(len(text), start + chunk_size)
        chunk = text[start:end]
        chunks.append(chunk)
        if end == len(text):
            break
        start = end - overlap  # step back so neighboring chunks share `overlap` characters
    return chunks

# Flatten every document into chunk records that keep their origin for citations
kb_chunks: List[Dict] = []
for doc_id, (fname, content) in enumerate(toy_docs):
    for i, chunk in enumerate(chunk_text(content, 500, 80)):
        kb_chunks.append({
            'doc_id': doc_id,
            'doc_name': fname,
            'chunk_id': i,
            'text': chunk
        })
len(kb_chunks), kb_chunks[0]['doc_name'], kb_chunks[0]['text'][:80] + '...'
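
To make the overlap visible, here is a minimal sketch that runs the same windowing logic on a made-up 26-character string with a tiny window. The helper name `sliding_chunks` and the sample text are assumptions for illustration only; they are not part of the lab's code.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Hypothetical stand-in for chunk_text: fixed-size windows with overlap."""
    text = " ".join(text.split())  # normalize whitespace
    chunks, start = [], 0
    while start < len(text):
        end = min(len(text), start + chunk_size)
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so windows share `overlap` characters
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"  # made-up text for the demo
demo_chunks = sliding_chunks(sample, chunk_size=10, overlap=3)
for i, ch in enumerate(demo_chunks):
    print(i, ch)
```

Each chunk after the first starts with the last three characters of the previous one, which is what lets a sentence that straddles a boundary still appear whole in at least one chunk.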

## 🔤 Step 4 — Embeddings

Create vector embeddings for each chunk using **text-embedding-3-small** (cheap & solid). We’ll store normalized vectors for cosine similarity search.

In [None]:
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = 'text-embedding-3-small'

texts = [c['text'] for c in kb_chunks]
emb_resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
E = np.array([item.embedding for item in emb_resp.data], dtype='float32')

# Normalize for cosine similarity via dot product
E_norms = np.linalg.norm(E, axis=1, keepdims=True) + 1e-12
E_unit = E / E_norms
E_unit.shape
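
Why normalize first? Once two vectors have unit length, their dot product is their cosine similarity. The sketch below checks this with plain Python on two made-up 2-d vectors; `l2_normalize` and `cosine` are hypothetical helper names, and real embeddings have many more dimensions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (small epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]

def cosine(a, b):
    """Textbook cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a = [3.0, 4.0]  # made-up vectors for the demo
b = [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)
dot_of_units = sum(x * y for x, y in zip(ua, ub))
print(round(cosine(a, b), 6), round(dot_of_units, 6))
```

Both numbers agree, so storing `E_unit` once means each query needs only a single matrix-vector product.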

## 🔎 Step 5 — Vector Search (Cosine Similarity)

We embed the **query** and compute dot products against chunk vectors, then rank top-k results.

In [None]:
def embed_query(q: str) -> np.ndarray:
    v = client.embeddings.create(model=EMBED_MODEL, input=[q]).data[0].embedding
    v = np.array(v, dtype='float32')
    return v / (np.linalg.norm(v) + 1e-12)

def search(query: str, top_k: int = 4):
    qv = embed_query(query)
    sims = (E_unit @ qv)  # cosine similarity
    idx = sims.argsort()[-top_k:][::-1]  # indices of the top_k scores, best first
    results = []
    for i in idx:
        r = kb_chunks[i].copy()
        r['score'] = float(sims[i])
        results.append(r)
    return results

for r in search('What is Kafka used for?'):
    print(f"{r['doc_name']}#{r['chunk_id']}  score={r['score']:.3f}\n→ {r['text'][:120]}...\n")
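
The `argsort()[-top_k:][::-1]` idiom sorts indices by ascending score, keeps the last `top_k`, and reverses them so the best match comes first. A minimal pure-Python sketch of the same ranking, using made-up similarity scores:

```python
import heapq

sims = [0.12, 0.87, 0.45, 0.91, 0.30]  # made-up similarity scores, one per chunk
top_k = 3

# Sort all indices by score, best first, then keep the first top_k
ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]

# Same result without a full sort, which can help when the index is large
via_heap = heapq.nlargest(top_k, range(len(sims)), key=lambda i: sims[i])

print(ranked, via_heap)
```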

## 🧠 Step 6 — Answer with Context + Citations

We’ll **stuff** the top chunks into the prompt and ask the LLM to answer using only that context. The answer should include inline citations like `[source: filename#chunk_id]`.

In [None]:
CHAT_MODEL = 'gpt-4o-mini'

def format_context(chunks):
    lines = []
    for c in chunks:
        tag = f"{c['doc_name']}#{c['chunk_id']}"
        # Tag each chunk in the same format the model is asked to cite
        lines.append(f"[source: {tag}]\n{c['text']}")
    return "\n\n".join(lines)

def answer_with_context(question: str, top_k: int = 4, temperature: float = 0.2, max_tokens: int = 450):
    retrieved = search(question, top_k=top_k)
    ctx = format_context(retrieved)
    sys_prompt = (
        "You are a precise assistant for software engineers. Answer ONLY from the provided context. "
        "If the answer is not in the context, say 'I don't know from the provided context.' Always include citations as [source: filename#chunk_id]."
    )
    user_prompt = (
        f"Context:\n{ctx}\n\n"
        f"Question: {question}\n"
        "Answer with citations."
    )
    resp = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=temperature,
        max_tokens=max_tokens
    )
    return resp.choices[0].message.content, retrieved

ans, used = answer_with_context("Why do teams adopt microservices and what are the trade-offs?")
print(ans)
# These are the chunks that were retrieved; the model may cite only some of them
print('\nRETRIEVED SOURCES:')
for u in used:
    print(f"- {u['doc_name']}#{u['chunk_id']} (score={u['score']:.3f})")

## ⬆️ Step 7 — (Optional) Upload Your Own `.txt` Files

You can upload one or more `.txt` files to extend the knowledge base, then re-embed and search again.

➡️ **Tip**: Keep each file reasonably small for the lab (~10–50 KB). For PDFs or HTML you’d normally add a text extraction step.

In [None]:
try:
    from google.colab import files  # type: ignore
    uploaded = files.upload()
    for name, content in uploaded.items():
        txt = content.decode('utf-8', errors='ignore')
        # Add chunks from uploaded file
        new_chunks = chunk_text(txt, 500, 80)
        start_doc_id = max([c['doc_id'] for c in kb_chunks]) + 1 if kb_chunks else 0
        for i, ch in enumerate(new_chunks):
            kb_chunks.append({'doc_id': start_doc_id, 'doc_name': name, 'chunk_id': i, 'text': ch})
    # Re-embed full KB (simple for lab; in production, embed only new chunks and concatenate)
    texts = [c['text'] for c in kb_chunks]
    emb_resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    E = np.array([item.embedding for item in emb_resp.data], dtype='float32')
    E_unit = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    print(f'✅ Uploaded and re-embedded {len(kb_chunks)} chunks.')
except Exception as e:
    print('Upload step skipped or failed:', e)
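
The comment in the cell above mentions embedding only the new chunks in production. A rough sketch of that idea follows, with a fake embedder standing in for the API call so it runs offline. `fake_embed`, `add_chunks`, and the sample texts are hypothetical; with NumPy arrays as in the lab, `np.vstack` would play the role of `list.extend` here.

```python
def fake_embed(texts):
    """Hypothetical stand-in for the embeddings API: one 2-d vector per text."""
    fake_embed.calls.append(list(texts))  # record what was sent, to show the saving
    return [[float(len(t)), float(len(t.split()))] for t in texts]

fake_embed.calls = []

# Existing index: texts and their vectors stay aligned by position
index_texts = ["docker packages apps", "kafka streams events"]
index_vecs = fake_embed(index_texts)

def add_chunks(new_texts, embed_fn=fake_embed):
    """Embed only the new texts and append them to the existing index."""
    new_vecs = embed_fn(new_texts)
    index_texts.extend(new_texts)
    index_vecs.extend(new_vecs)

add_chunks(["grpc uses http/2"])
print(len(index_vecs), fake_embed.calls[-1])
```

The second call to the embedder receives one text instead of three, which is the whole point: cost grows with what is new, not with the size of the KB.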

## ♻️ Step 8 — (Optional) Diversity via MMR Re-ranking

**MMR (Maximal Marginal Relevance)** balances relevance with diversity to avoid near-duplicate chunks. This is a simple, greedy demo.

In [None]:
def mmr_search(query: str, fetch_k: int = 12, final_k: int = 5, lambda_mul: float = 0.7):
    qv = embed_query(query)
    sims = (E_unit @ qv)
    # Start from the fetch_k most relevant chunks, best first
    candidates = sims.argsort()[-fetch_k:][::-1].tolist()
    selected = []
    selected_vs = []
    for _ in range(min(final_k, len(candidates))):
        best_idx, best_score = None, -1e9
        for i in candidates:
            rel = sims[i]
            div = 0.0
            if selected_vs:
                # Highest similarity to anything already picked (redundancy penalty)
                div = max((E_unit[i] @ v for v in selected_vs))
            score = lambda_mul * rel - (1 - lambda_mul) * div
            if score > best_score:
                best_score, best_idx = score, i
        selected.append(best_idx)
        selected_vs.append(E_unit[best_idx])
        candidates.remove(best_idx)
    results = []
    for i in selected:
        r = kb_chunks[i].copy()
        r['score'] = float(sims[i])
        results.append(r)
    return results

def answer_with_mmr(question: str, final_k: int = 5):
    retrieved = mmr_search(question, fetch_k=12, final_k=final_k, lambda_mul=0.7)
    ctx = format_context(retrieved)
    sys_prompt = ("Answer ONLY from the provided context. If unknown, say so. Include citations [source: filename#chunk_id].")
    user_prompt = f"Context:\n{ctx}\n\nQuestion: {question}\nAnswer with citations."
    resp = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.2,
        max_tokens=450
    )
    return resp.choices[0].message.content, retrieved

ans, used = answer_with_mmr('When should I prefer gRPC over REST?')
print(ans)
# These are the chunks that were retrieved; the model may cite only some of them
print('\nRETRIEVED SOURCES:')
for u in used:
    print(f"- {u['doc_name']}#{u['chunk_id']} (score={u['score']:.3f})")
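
To see what MMR changes, the sketch below applies the same greedy rule, `lambda_mul * relevance - (1 - lambda_mul) * redundancy`, to three made-up 2-d unit vectors: two near-duplicates close to the query and one less relevant but different candidate. The helper names and angles are assumptions for illustration, and `lambda_mul=0.5` is chosen here (rather than the lab's 0.7) to make the effect obvious.

```python
import math

def unit(deg):
    """Unit vector at the given angle in degrees."""
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mmr_select(query_vec, cand_vecs, k, lambda_mul):
    """Greedy MMR over unit vectors; returns candidate indices in pick order."""
    rel = [dot(v, query_vec) for v in cand_vecs]
    remaining = list(range(len(cand_vecs)))
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Redundancy = highest similarity to anything already selected
            redundancy = max((dot(cand_vecs[i], cand_vecs[j]) for j in selected), default=0.0)
            return lambda_mul * rel[i] - (1 - lambda_mul) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = unit(0)
cands = [unit(10), unit(12), unit(-40)]  # 0 and 1 are near-duplicates
plain_top2 = sorted(range(len(cands)), key=lambda i: dot(cands[i], query), reverse=True)[:2]
mmr_top2 = mmr_select(query, cands, k=2, lambda_mul=0.5)
print(plain_top2, mmr_top2)
```

Plain top-k returns both near-duplicates, while MMR keeps the best one and then prefers the candidate that adds something new. Raising `lambda_mul` toward 1.0 moves the result back toward plain relevance ranking.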

## 🧭 What to try next

1. Replace the toy docs with your project docs (markdown or txt) and rerun embedding.
2. Add **metadata** (e.g., section headings) and include them in citations.
3. Experiment with chunk sizes and overlaps for better retrieval.
4. Try prompts that ask for a **short, bullet-point rationale** in the output (a summarized form of chain-of-thought) while respecting your data-handling rules.
5. For production, consider: FAISS/PGVector for storage, streaming responses, tracing, caching, and access control.

**You’ve completed Lab 3.** 🎉