<div style='text-align:center; padding: 20px 0;'>
<h1 style='color:#A31F34; margin-bottom:0;'>MIT AI STUDIO COURSE</h1>
<h3 style='color:#8A8B8C; margin-top:5px;'>MAS.664 / MAS.665 / EC.731 / IDS.865</h3>
<h2>RAG Workshop: Chat with the Course Website</h2>
<p style='font-size:14px; color:#666;'>Lead Professor: Ramesh Raskar, MIT Media Lab</p>
</div>

---

**Workshop Instructor:** [Brandon Sneider](https://linkedin.com/in/brandonsneider) — AI Manager at a defense technology startup, building AI systems for highly regulated environments. Brandon brings practical experience deploying LLMs and RAG pipelines where accuracy, compliance, and auditability are non-negotiable.

---

In this **30-minute hands-on workshop**, you'll build a **Retrieval-Augmented Generation (RAG)** system that crawls the [AI Studio course website](https://aiforimpact.github.io/), stores it in a vector database, and lets you **chat with the data** — all inside this notebook.

### What is RAG?

Large Language Models (LLMs) are powerful but have two critical limitations:
1. **Knowledge cutoff** — they don't know about data after their training date
2. **Hallucination** — they confidently generate plausible but incorrect information

**Retrieval-Augmented Generation (RAG)** solves both by retrieving relevant documents from a knowledge base and injecting them into the LLM's prompt as context before generating an answer.

```
User Question → [Retrieve relevant docs] → [Inject into prompt] → LLM → Grounded Answer
```

### Key Academic References

| Paper | Key Contribution |
|-------|------------------|
| Lewis et al. (2020) "[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)" — *NeurIPS 2020* | Introduced RAG: combines retriever + generator, reducing hallucination and improving factual accuracy |
| Karpukhin et al. (2020) "[Dense Passage Retrieval for Open-Domain QA](https://arxiv.org/abs/2004.04906)" — *EMNLP 2020* | Dense embeddings outperform BM25 keyword search for passage retrieval |
| Robertson & Zaragoza (2009) "[The Probabilistic Relevance Framework: BM25 and Beyond](https://doi.org/10.1561/1500000019)" | Foundational work on BM25 scoring |
| Gao et al. (2024) "[Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2312.10997)" | Comprehensive survey: Naive RAG, Advanced RAG, Modular RAG |
| Muennighoff et al. (2023) "[MTEB: Massive Text Embedding Benchmark](https://arxiv.org/abs/2210.07316)" | Benchmark for comparing embedding models across 56 datasets |

### Learning Outcomes

By the end of this workshop, you will be able to:

1. **Explain** what RAG is and why it matters for business (reducing hallucination, grounding LLMs in private data)
2. **Compare** embedding models and understand when to fine-tune vs use off-the-shelf
3. **Build** a document ingestion pipeline that crawls a website, chunks content, and generates embeddings
4. **Understand** the difference between BM25 keyword search and semantic vector search, and why hybrid search outperforms either alone
5. **Implement** a working RAG chatbot backed by SQLite + sqlite-vec
6. **Evaluate** RAG quality and use AI to tune hyperparameters

### Tech Stack

| Component | Technology | Why |
|-----------|-----------|-----|
| Database | **SQLite + sqlite-vec** | Zero infrastructure, runs anywhere, scales to millions of vectors |
| Keyword Search | **FTS5 (BM25)** | Built into SQLite, great for exact term matching |
| Vector Search | **sqlite-vec** | Native cosine similarity, compact binary storage |
| Embeddings | **MiniLM-L6-v2** (local) | No API key needed, runs on CPU, 384 dimensions |
| LLM | **OpenRouter** | Access to frontier models, OpenAI-compatible API |
| Web Scraping | **BeautifulSoup** | Parse the AI Studio website HTML |
| Chat UI | **IPython widgets** | Interactive chat right in the notebook |

---
## Part 1: Setup & Installation (2 min)

We install everything we need in one cell. This takes ~30 seconds on Colab.

In [1]:
%%capture
# Install all dependencies (run once)
!pip install -q sqlite-vec fastembed numpy openai requests beautifulsoup4 lxml yt-dlp

In [2]:
import json
import os
import re
import sqlite3
import struct
from datetime import datetime
from pathlib import Path
from urllib.parse import urljoin, urlparse

import numpy as np
import requests
import sqlite_vec
from bs4 import BeautifulSoup
from fastembed import TextEmbedding

print("All libraries loaded!")

All libraries loaded!


---
## Part 2: Initialize the Vector Database (3 min)

We use **SQLite** with two extensions:
- **FTS5** — Full-Text Search for BM25 keyword matching (built into SQLite)
- **sqlite-vec** — Vector similarity search using cosine distance

This gives us a **hybrid search** system in a single file — no external database servers needed.

> **Why hybrid?** Karpukhin et al. (2020) showed dense retrieval beats BM25 for semantic queries,
> but BM25 still wins for exact name/term lookups. Combining both gets the best of each.

| Query Example | BM25 (keyword) | Semantic (vector) | Best Approach |
|--------------|:-:|:-:|---|
| "Ramesh Raskar" | Exact match | May miss | **BM25** |
| "Who teaches AI ethics?" | No exact terms | Understands meaning | **Semantic** |
| "MIT Media Lab AI course" | Partial match | Related concepts | **Hybrid** |

In [3]:
DATABASE_PATH = Path("ai_studio_rag.db")

def init_database(db_path: Path) -> sqlite3.Connection:
    """Create database with embeddings table, FTS5 for BM25, and vec0 for vectors."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row

    # Load sqlite-vec extension
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)
    conn.enable_load_extension(False)

    cursor = conn.cursor()

    # Metadata table — stores the actual content and source info
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL,
            page_title TEXT,
            section_title TEXT,
            content TEXT NOT NULL,
            content_type TEXT DEFAULT 'text',
            metadata TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)

    # Vector table — 384 dimensions for MiniLM-L6-v2 embeddings
    cursor.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS vec_documents USING vec0(
            embedding float[384] distance_metric=cosine
        )
    """)

    # FTS5 table — for BM25 keyword search
    cursor.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts USING fts5(
            content,
            page_title,
            section_title,
            content='documents',
            content_rowid='id'
        )
    """)

    # Triggers to keep FTS5 in sync automatically
    cursor.execute("""
        CREATE TRIGGER IF NOT EXISTS docs_ai AFTER INSERT ON documents BEGIN
            INSERT INTO documents_fts(rowid, content, page_title, section_title)
            VALUES (new.id, new.content, new.page_title, new.section_title);
        END
    """)
    cursor.execute("""
        CREATE TRIGGER IF NOT EXISTS docs_ad AFTER DELETE ON documents BEGIN
            INSERT INTO documents_fts(documents_fts, rowid, content, page_title, section_title)
            VALUES ('delete', old.id, old.content, old.page_title, old.section_title);
        END
    """)

    conn.commit()
    return conn

# Initialize!
db = init_database(DATABASE_PATH)
print(f"Database created at: {DATABASE_PATH}")
print("sqlite-vec loaded — ready for vector search!")

Database created at: ai_studio_rag.db
sqlite-vec loaded — ready for vector search!


---
## Part 3: Crawl the AI Studio Website & Video Transcripts (5 min)

We'll scrape every page of [aiforimpact.github.io](https://aiforimpact.github.io/) including:
- Course overview and structure
- Speaker & mentor bios (40+ people across 7 semesters)
- Schedule and registration info
- Past semester archives (Fall 2023 → Spring 2026)
- **YouTube video transcripts** (Demo Days, NANDA talks, lectures — ~97K words!)

### The RAG Ingestion Pipeline

```
Website HTML → Crawl & Clean → Chunk by Section ─┐
                                                   ├→ Embed → Store in SQLite
YouTube Videos → Download Captions → Chunk ────────┘
```

**Chunking strategy**: We split HTML by sections (headers) and transcripts by ~500-word windows.
This matters — Gao et al. (2024) show that chunk size and boundary selection significantly impact retrieval quality.

**Why transcripts matter**: The course has 9 YouTube videos (Demo Days, Raskar's NANDA talks) with
auto-generated captions. Without transcripts, questions like "What projects were at Demo Day?"
would return nothing — the answers only exist in what speakers *said*.

In [None]:
# ── Website Crawler ──────────────────────────────────────────────────────────

SITE_PAGES = [
    "https://aiforimpact.github.io/",
    "https://aiforimpact.github.io/spring26.html",
    "https://aiforimpact.github.io/fall25.html",
    "https://aiforimpact.github.io/spring25.html",
    "https://aiforimpact.github.io/fall24.html",
    "https://aiforimpact.github.io/spring24.html",
    "https://aiforimpact.github.io/fall23.html",
]

# Labels that appear before person names in HTML cards
BIO_LABELS = {"lead professor:", "co-instructor:", "instructor:", "course ta:",
              "ta:", "course instructor:", "professor:", "guest speaker:",
              "speaker:", "mentor:", "judge:", "panelist:", "moderator:"}


def get_youtube_title(video_id: str) -> tuple[str, str]:
    """Get real YouTube title + author via oEmbed (no API key needed)."""
    try:
        resp = requests.get(
            f"https://www.youtube.com/oembed?url=https://www.youtube.com/watch?v={video_id}&format=json",
            timeout=10)
        if resp.status_code == 200:
            data = resp.json()
            return data.get("title", ""), data.get("author_name", "")
    except Exception:
        pass
    return "", ""


def crawl_page(url: str) -> dict:
    """Fetch and parse a single page, extracting structured content."""
    resp = requests.get(url, timeout=15)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "lxml")

    title = soup.title.string.strip() if soup.title else url.split('/')[-1]

    for tag in soup(["script", "style", "nav"]):
        tag.decompose()

    chunks = []
    seen_names = set()

    # ── Extract speaker/mentor bios (with fixed name parsing + dedup) ──
    for card in soup.select(".card, .speaker-card, .col-md-3, .col-lg-3"):
        img = card.find("img")
        text_parts = [t.strip() for t in card.stripped_strings]
        if not text_parts or len(text_parts) < 2:
            continue

        # Skip label prefixes like "Lead Professor:"
        name_idx = 0
        if text_parts[0].lower().rstrip(":") + ":" in BIO_LABELS or text_parts[0].lower() in BIO_LABELS:
            name_idx = 1
            if len(text_parts) < 3:
                continue

        name = text_parts[name_idx]
        label = text_parts[0] if name_idx > 0 else ""
        role = " ".join(text_parts[name_idx + 1:])
        img_url = urljoin(url, img["src"]) if img and img.get("src") else None

        # Deduplicate within page
        name_key = name.lower().strip()
        if name_key in seen_names or len(name) < 2:
            continue
        seen_names.add(name_key)

        bio_text = f"Speaker/Mentor: {name}."
        if label:
            bio_text += f" Position: {label.rstrip(':').strip()}."
        bio_text += f" Role: {role}."
        if img_url:
            bio_text += f" Photo: {img_url}"

        chunks.append({
            "content": bio_text,
            "section_title": f"Bio: {name}",
            "content_type": "bio",
            "metadata": {"name": name, "role": role, "label": label, "image_url": img_url}
        })

    # ── Extract YouTube videos (with real titles via oEmbed + dedup) ──
    seen_videos = set()
    for iframe in soup.find_all("iframe"):
        src = iframe.get("src", "")
        if "youtube" not in src and "youtu.be" not in src:
            continue
        match = re.search(r'(?:embed/|v=|youtu\.be/)([a-zA-Z0-9_-]{11})', src)
        if not match:
            continue
        vid_id = match.group(1)
        if vid_id in seen_videos:
            continue
        seen_videos.add(vid_id)

        real_title, author = get_youtube_title(vid_id)
        video_title = real_title or iframe.get("title", "Course Video")

        content = f"Video: {video_title}."
        if author:
            content += f" By: {author}."
        content += f" YouTube URL: https://www.youtube.com/watch?v={vid_id}."

        chunks.append({
            "content": content,
            "section_title": f"Video: {video_title[:60]}",
            "content_type": "video",
            "metadata": {"video_id": vid_id, "title": video_title, "author": author}
        })

    # ── Extract images (skip bio photos to avoid duplication) ──
    bio_img_urls = {
        c["metadata"]["image_url"] for c in chunks
        if c["content_type"] == "bio" and c["metadata"].get("image_url")
    }
    for img in soup.find_all("img"):
        src = img.get("src", "")
        alt = img.get("alt", "")
        if not src or src.startswith("data:") or not alt:
            continue
        img_url = urljoin(url, src)
        if img_url in bio_img_urls:
            continue
        parent_text = img.parent.get_text(strip=True)[:200] if img.parent else ""
        chunks.append({
            "content": f"Image: {alt}. URL: {img_url}. Context: {parent_text}",
            "section_title": f"Image: {alt[:50]}",
            "content_type": "image",
            "metadata": {"image_url": img_url, "alt_text": alt}
        })

    # ── Extract main text content by sections ──
    main = soup.find("main") or soup.find("body")
    if main:
        current_section = "Overview"
        current_text = []

        for element in main.find_all(["h1", "h2", "h3", "h4", "p", "li", "td", "th"]):
            if element.name in ["h1", "h2", "h3"]:
                if current_text:
                    text = "\n".join(current_text).strip()
                    if len(text) > 30:
                        chunks.append({
                            "content": f"{current_section}\n\n{text}",
                            "section_title": current_section,
                            "content_type": "text",
                            "metadata": {}
                        })
                current_section = element.get_text(strip=True)
                current_text = []
            else:
                text = element.get_text(strip=True)
                if text and len(text) > 5:
                    current_text.append(text)

        if current_text:
            text = "\n".join(current_text).strip()
            if len(text) > 30:
                chunks.append({
                    "content": f"{current_section}\n\n{text}",
                    "section_title": current_section,
                    "content_type": "text",
                    "metadata": {}
                })

    return {"url": url, "title": title, "chunks": chunks}


def download_transcript(video_id: str) -> str:
    """Download YouTube auto-captions as plain text via yt-dlp."""
    import tempfile

    import yt_dlp

    with tempfile.TemporaryDirectory() as tmpdir:
        out_path = os.path.join(tmpdir, video_id)
        ydl_opts = {
            "skip_download": True,
            "writeautomaticsub": True,
            "subtitleslangs": ["en"],
            "subtitlesformat": "json3",
            "outtmpl": out_path,
            "quiet": True,
            "no_warnings": True,
        }
        try:
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                ydl.download([f"https://www.youtube.com/watch?v={video_id}"])

            # Parse the json3 file
            json3_path = f"{out_path}.en.json3"
            if not os.path.exists(json3_path):
                return ""

            with open(json3_path) as f:
                data = json.load(f)

            texts = []
            for event in data.get("events", []):
                segs = event.get("segs", [])
                text = "".join(s.get("utf8", "") for s in segs).strip().replace("\n", " ")
                if text and text not in texts[-1:]:
                    texts.append(text)
            return " ".join(texts)
        except Exception as e:
            print(f"    Transcript error for {video_id}: {e}")
            return ""


def ingest_transcripts(all_chunks: list[dict]) -> list[dict]:
    """Download transcripts for all video chunks and add as transcript chunks."""
    transcript_chunks = []
    seen_ids = set()
    videos = []

    for c in all_chunks:
        if c["content_type"] == "video":
            meta = c["metadata"]
            vid_id = meta.get("video_id", "")
            if vid_id and vid_id not in seen_ids:
                seen_ids.add(vid_id)
                videos.append((vid_id, meta.get("title", "Video")))

    print(f"\n  Downloading transcripts for {len(videos)} unique videos...")
    for vid_id, title in videos:
        print(f"    {title[:50]}...", end=" ")
        text = download_transcript(vid_id)
        if not text:
            print("(no captions)")
            continue

        words = text.split()
        print(f"({len(words):,} words)")

        # Chunk into ~500-word pieces
        chunk_num = 0
        for i in range(0, len(words), 450):
            chunk_words = words[i:i + 500]
            chunk_text = " ".join(chunk_words)
            chunk_num += 1
            transcript_chunks.append({
                "content": f"Video transcript: {title}\nhttps://www.youtube.com/watch?v={vid_id}\n\n{chunk_text}",
                "section_title": f"Transcript: {title[:45]} (part {chunk_num})",
                "content_type": "transcript",
                "url": f"https://www.youtube.com/watch?v={vid_id}",
                "page_title": title,
                "metadata": {"video_id": vid_id, "title": title, "chunk": chunk_num},
            })

    return transcript_chunks


def crawl_site(pages: list[str]) -> list[dict]:
    """Crawl all pages, fetch video transcripts, return flat list of chunks."""
    all_chunks = []
    for url in pages:
        print(f"  Crawling: {url}")
        try:
            page = crawl_page(url)
            for chunk in page["chunks"]:
                chunk["url"] = url
                chunk["page_title"] = page["title"]
            all_chunks.extend(page["chunks"])
            print(f"    -> {len(page['chunks'])} chunks extracted")
        except Exception as e:
            print(f"    Error: {e}")

    # Download and chunk video transcripts
    transcript_chunks = ingest_transcripts(all_chunks)
    all_chunks.extend(transcript_chunks)

    return all_chunks


print("Crawling the AI Studio website...\n")
all_chunks = crawl_site(SITE_PAGES)

# Summary
types = {}
for c in all_chunks:
    types[c["content_type"]] = types.get(c["content_type"], 0) + 1

print(f"\n{'='*50}")
print(f"Total chunks extracted: {len(all_chunks)}")
for t, count in sorted(types.items()):
    print(f"  {t}: {count}")

---
## Part 4: Embedding Models Deep Dive & Store (5 min)

Now we convert each text chunk into a **384-dimensional vector** using the MiniLM-L6-v2 model.

**How embeddings work:**
- Text → Neural network → Dense vector (array of 384 numbers)
- Semantically similar texts produce vectors that are **close together** in vector space
- "Who teaches the course?" and "Professor Ramesh Raskar leads the class" will have similar embeddings, even though they share few words

### Why MiniLM-L6-v2? Embedding Model Comparison

Choosing an embedding model is one of the most impactful decisions in a RAG system. Here's how the major options compare:

| Model | Dims | Size | Cost | Speed | Multilingual | Best For |
|-------|------|------|------|-------|-------------|----------|
| **all-MiniLM-L6-v2** (ours) | 384 | 80 MB | Free (local) | ~50ms | English only | Prototyping, workshops, CPU |
| **BGE-M3** (BAAI) | 1024 | 2.2 GB | Free (local) | ~200ms | 100+ languages | Production multilingual RAG |
| **Amazon Titan Embeddings** | 1024 | API | $0.0001/1K tok | ~100ms | 25+ languages | AWS-integrated pipelines |
| **OpenAI text-embedding-3-large** | 3072 | API | $0.00013/1K tok | ~80ms | Good | Highest quality, cost-tolerant |
| **Cohere embed-v3** | 1024 | API | $0.0001/1K tok | ~90ms | 100+ languages | Search-optimized |

**We chose MiniLM-L6-v2 because:**
1. **Zero cost** — no API key needed for embeddings, runs entirely on CPU
2. **Fast** — smallest model, perfect for a 30-min workshop
3. **Good enough** — for English-only <100K docs, quality is comparable to larger models
4. **384 dimensions** — smaller vectors = faster search, less storage (vs 1024 or 3072)

> **When to upgrade:** For multilingual, use BGE-M3. For max quality, use OpenAI's `text-embedding-3-large`.
> See Muennighoff et al. (2023) "[MTEB: Massive Text Embedding Benchmark](https://arxiv.org/abs/2210.07316)" for comprehensive benchmarks.

### When and How to Fine-Tune an Embedding Model

**When fine-tuning helps** (Wang et al. 2024, "[Improving Text Embeddings with LLMs](https://arxiv.org/abs/2401.00368)"):
- Your domain has **specialized vocabulary** (legal, medical, finance) that general models don't capture
- You have **query-document pairs** showing what users search for vs what they should find
- General models score below ~0.7 accuracy on your test queries
- Example: A legal RAG where "consideration" means contract terms, not "thinking about"

**When fine-tuning is NOT worth it:**
- Your content is general English (like our course website)
- You have fewer than ~1,000 labeled query-document pairs
- Switching to a larger pre-trained model would solve the problem
- You're still iterating on chunking strategy or retrieval logic

**How fine-tuning works (conceptual):**
```
1. Collect pairs:    (query, relevant_document, irrelevant_document)
2. Contrastive loss: Push relevant pairs closer, irrelevant pairs apart
3. Result:           Model now "understands" your domain's similarity
```

Tools: [Sentence-Transformers fine-tuning](https://www.sbert.net/docs/training/overview.html), or generate synthetic pairs with LLMs (Wang et al. 2024).

In [5]:
# Load the embedding model (downloads ~80MB on first run)
print("Loading MiniLM-L6-v2 embedding model (384 dimensions, ONNX runtime)...")
embedding_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded!\n")


def generate_embedding(text: str) -> list[float]:
    """Generate a 384-dim embedding for a single text."""
    return list(embedding_model.embed([text]))[0].tolist()


def generate_embeddings_batch(texts: list[str]) -> list[list[float]]:
    """Batch embed for efficiency."""
    if not texts:
        return []
    return [emb.tolist() for emb in embedding_model.embed(texts)]


def serialize_embedding(embedding: list[float]) -> bytes:
    """Pack embedding as binary for sqlite-vec."""
    return struct.pack(f'{len(embedding)}f', *embedding)


# ── Quick demo: see how embeddings capture meaning ──
demo_texts = [
    "Professor Raskar teaches AI at MIT",
    "The lead instructor of the course is from Media Lab",
    "I had pizza for lunch today",
]
demo_embs = generate_embeddings_batch(demo_texts)

print("Embedding similarity demo:")
print(f"  Text A: '{demo_texts[0]}'")
print(f"  Text B: '{demo_texts[1]}'  (related)")
print(f"  Text C: '{demo_texts[2]}'  (unrelated)")

from numpy import dot
from numpy.linalg import norm


def cosine_sim(a, b): return dot(a, b) / (norm(a) * norm(b))

print(f"\n  Similarity A<->B (related):   {cosine_sim(demo_embs[0], demo_embs[1]):.3f}")
print(f"  Similarity A<->C (unrelated): {cosine_sim(demo_embs[0], demo_embs[2]):.3f}")
print("  (Higher = more similar. Related texts cluster together!)")

Loading MiniLM-L6-v2 embedding model (384 dimensions, ONNX runtime)...
Model loaded!

Embedding similarity demo:
  Text A: 'Professor Raskar teaches AI at MIT'
  Text B: 'The lead instructor of the course is from Media Lab'  (related)
  Text C: 'I had pizza for lunch today'  (unrelated)

  Similarity A<->B (related):   0.344
  Similarity A<->C (unrelated): -0.009
  (Higher = more similar. Related texts cluster together!)


In [6]:
# ── Ingest all chunks into the database ──
print("Embedding and storing all chunks...")

BATCH_SIZE = 32
total_stored = 0

for i in range(0, len(all_chunks), BATCH_SIZE):
    batch = all_chunks[i:i + BATCH_SIZE]
    texts = [c["content"] for c in batch]
    embeddings = generate_embeddings_batch(texts)

    cursor = db.cursor()
    for chunk, emb in zip(batch, embeddings):
        cursor.execute("""
            INSERT INTO documents (url, page_title, section_title, content, content_type, metadata)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (
            chunk["url"],
            chunk["page_title"],
            chunk["section_title"],
            chunk["content"],
            chunk["content_type"],
            json.dumps(chunk["metadata"]),
        ))
        rowid = cursor.lastrowid
        cursor.execute("""
            INSERT INTO vec_documents (rowid, embedding) VALUES (?, ?)
        """, (rowid, serialize_embedding(emb)))

    db.commit()
    total_stored += len(batch)
    print(f"  Stored {total_stored}/{len(all_chunks)} chunks...")

print(f"\nDone! {total_stored} chunks embedded and stored in {DATABASE_PATH}")
print(f"Database size: {DATABASE_PATH.stat().st_size / 1024:.0f} KB")

Embedding and storing all chunks...
  Stored 32/748 chunks...
  Stored 64/748 chunks...
  Stored 96/748 chunks...
  Stored 128/748 chunks...
  Stored 160/748 chunks...
  Stored 192/748 chunks...
  Stored 224/748 chunks...
  Stored 256/748 chunks...
  Stored 288/748 chunks...
  Stored 320/748 chunks...
  Stored 352/748 chunks...
  Stored 384/748 chunks...
  Stored 416/748 chunks...
  Stored 448/748 chunks...
  Stored 480/748 chunks...
  Stored 512/748 chunks...
  Stored 544/748 chunks...
  Stored 576/748 chunks...
  Stored 608/748 chunks...
  Stored 640/748 chunks...
  Stored 672/748 chunks...
  Stored 704/748 chunks...
  Stored 736/748 chunks...
  Stored 748/748 chunks...

Done! 748 chunks embedded and stored in ai_studio_rag.db
Database size: 4096 KB


---
## Part 5: Hybrid Search — BM25 + Semantic (5 min)

This is the **retrieval** core of RAG. We implement two search strategies and combine them:

### BM25 (Keyword Search)
- Uses SQLite's **FTS5** extension
- Scores based on term frequency / inverse document frequency
- Great for exact names, acronyms, specific terms
- Based on Robertson & Zaragoza (2009)

### Semantic Search (Vector Similarity)
- Uses **sqlite-vec**'s native cosine distance
- Finds conceptually similar content even with different wording
- Based on Karpukhin et al. (2020) Dense Passage Retrieval

### Hybrid Fusion
We normalize both scores to [0, 1] and combine with configurable weights:

```
final_score = keyword_weight * BM25_score + semantic_weight * cosine_similarity
```

In [None]:
# ── Search Functions ──────────────────────────────────────────────────────────

def bm25_search(conn, query: str, limit: int = 50) -> dict[int, float]:
    """BM25 keyword search via FTS5. Returns {doc_id: score}."""
    cursor = conn.cursor()
    safe_query = query.replace('"', '""')
    try:
        cursor.execute("""
            SELECT rowid, bm25(documents_fts) as score
            FROM documents_fts
            WHERE documents_fts MATCH ?
            LIMIT ?
        """, (safe_query, limit))
        return {row[0]: row[1] for row in cursor.fetchall()}
    except sqlite3.OperationalError:
        return {}


def semantic_search(conn, query_embedding: list[float], limit: int = 50) -> dict[int, float]:
    """Cosine similarity search via sqlite-vec. Returns {doc_id: distance}."""
    cursor = conn.cursor()
    cursor.execute("""
        SELECT rowid, distance
        FROM vec_documents
        WHERE embedding MATCH ? AND k = ?
        ORDER BY distance
    """, (serialize_embedding(query_embedding), limit))
    return {row[0]: row[1] for row in cursor.fetchall()}


def normalize_scores(scores: dict[int, float], higher_is_better: bool = True) -> dict[int, float]:
    """Normalize scores to [0, 1]."""
    if not scores:
        return {}
    vals = list(scores.values())
    min_v, max_v = min(vals), max(vals)
    if min_v == max_v:
        return {k: 1.0 for k in scores}
    if higher_is_better:
        return {k: (v - min_v) / (max_v - min_v) for k, v in scores.items()}
    else:
        return {k: (max_v - v) / (max_v - min_v) for k, v in scores.items()}


def hybrid_search(
    conn,
    query: str,
    query_embedding: list[float],
    keyword_weight: float = 0.3,
    semantic_weight: float = 0.7,
    top_k: int = 5,
) -> list[dict]:
    """Combine BM25 + semantic search with weighted fusion and content diversity.

    Diversity: limits any single content_type to at most (top_k - 1) results,
    ensuring compound queries surface different types (bios, text, transcripts).
    """
    bm25_raw = bm25_search(conn, query)
    bm25_norm = normalize_scores(bm25_raw, higher_is_better=False)

    sem_raw = semantic_search(conn, query_embedding)
    sem_norm = normalize_scores(sem_raw, higher_is_better=False)

    all_ids = set(bm25_norm.keys()) | set(sem_norm.keys())
    if not all_ids:
        return []

    cursor = conn.cursor()
    placeholders = ",".join("?" * len(all_ids))
    cursor.execute(f"""
        SELECT id, url, page_title, section_title, content, content_type, metadata
        FROM documents WHERE id IN ({placeholders})
    """, list(all_ids))
    cols = ["id","url","page_title","section_title","content","content_type","metadata"]
    docs = {row[0]: dict(zip(cols, row)) for row in cursor.fetchall()}

    results = []
    for doc_id in all_ids:
        bm25_s = bm25_norm.get(doc_id, 0.0)
        sem_s = sem_norm.get(doc_id, 0.0)
        final = keyword_weight * bm25_s + semantic_weight * sem_s
        doc = docs.get(doc_id, {})
        results.append({**doc, "bm25_score": bm25_s, "semantic_score": sem_s, "final_score": final})

    results.sort(key=lambda x: x["final_score"], reverse=True)

    # ── Content-type diversity: cap any single type at (top_k - 1) ──
    # This prevents e.g. 5 bios drowning out course description text chunks.
    max_per_type = max(top_k - 1, 1)
    diverse = []
    type_counts = {}
    for r in results:
        ct = r.get("content_type", "text")
        type_counts[ct] = type_counts.get(ct, 0) + 1
        if type_counts[ct] <= max_per_type:
            diverse.append(r)
        if len(diverse) >= top_k:
            break

    return diverse[:top_k]


# ── Test it! ──
test_queries = [
    "Who is the lead professor of the course?",
    "What are the course pillars?",
    "Tell me about the guest speakers from venture capital",
]

for q in test_queries:
    print(f"\n{'='*60}")
    print(f"Query: {q}")
    print(f"{'='*60}")
    emb = generate_embedding(q)
    results = hybrid_search(db, q, emb, top_k=3)
    for i, r in enumerate(results, 1):
        print(f"  {i}. [{r.get('content_type','?')}] Score: {r['final_score']:.3f} "
              f"(BM25: {r['bm25_score']:.2f}, Semantic: {r['semantic_score']:.2f})")
        content_preview = r.get('content', '')[:120].replace('\n', ' ')
        print(f"     {content_preview}...")

---
## Part 6: Connect the LLM for Generation (3 min)

Now we add the **G** in RA**G** — we take retrieved context and use an LLM to generate a grounded answer.

### How to Get an API Key

We support **two providers** — use whichever you already have:

**Option A: OpenRouter (recommended for this workshop)**
1. Go to [openrouter.ai/keys](https://openrouter.ai/keys)
2. Sign in with Google/GitHub (free, no credit card)
3. Click **"Create Key"** — copy the key (starts with `sk-or-`)
4. In Colab: click the **key icon** in the left sidebar → add secret named `OPENROUTER_API_KEY`

*OpenRouter gives you access to many models (Gemini, Llama, Mistral) through one API.*

**Option B: OpenAI API key**
1. Go to [platform.openai.com/api-keys](https://platform.openai.com/api-keys)
2. Create a new key (starts with `sk-proj-` or `sk-`)
3. In Colab: add secret named `OPENAI_API_KEY`

*Uses GPT-4o-mini by default. Requires billing enabled on your OpenAI account.*

> **Note on Anthropic/Claude keys:** Claude's API uses a different message format (not OpenAI-compatible).
> Keys starting with `sk-ant-` will **not work** here. Use OpenRouter instead — it routes to Claude
> models via the OpenAI-compatible format if you want Claude.

### The RAG Prompt Pattern

```
System: You are a helpful assistant for the MIT AI Studio course.
        Answer ONLY using the provided context.

User:   Context: [retrieved documents]
        Question: [user's question]
```

This is the key insight from Lewis et al. (2020): by conditioning generation on retrieved evidence, the model produces **factual, verifiable answers** rather than hallucinating.

In [8]:
from openai import OpenAI

# ── Auto-detect API key type and configure provider ──
# Supports: OpenRouter (sk-or-), OpenAI (sk-proj- / sk-), rejects Anthropic (sk-ant-)

api_key = None
provider = None

# Try Colab Secrets first (key icon in left sidebar)
try:
    from google.colab import userdata
    for secret_name in ["OPENROUTER_API_KEY", "OPENAI_API_KEY"]:
        try:
            key = userdata.get(secret_name)
            if key:
                api_key = key
                break
        except Exception:
            continue
except ImportError:
    # Not running in Colab — check environment variables
    import os
    api_key = os.environ.get("OPENROUTER_API_KEY") or os.environ.get("OPENAI_API_KEY")

# Fall back to manual input
if not api_key:
    api_key = input(
        "Enter your API key (OpenRouter or OpenAI):\n"
        "  OpenRouter: https://openrouter.ai/keys (free, recommended)\n"
        "  OpenAI:     https://platform.openai.com/api-keys\n> "
    ).strip()

# ── Detect provider from key prefix ──
if api_key.startswith("sk-ant-"):
    raise ValueError(
        "\n\nAnthropic/Claude key detected (sk-ant-...).\n"
        "Claude uses a different API message format and is NOT compatible with the OpenAI SDK.\n\n"
        "Options:\n"
        "  1. Get a free OpenRouter key at https://openrouter.ai/keys\n"
        "     (OpenRouter can route to Claude models via OpenAI-compatible format)\n"
        "  2. Use an OpenAI key from https://platform.openai.com/api-keys\n"
    )
elif api_key.startswith("sk-or-"):
    provider = "openrouter"
    base_url = "https://openrouter.ai/api/v1"
    LLM_MODEL = "google/gemini-2.0-flash-001"
    print(f"OpenRouter key detected -> using model: {LLM_MODEL}")
elif api_key.startswith("sk-proj-") or api_key.startswith("sk-"):
    provider = "openai"
    base_url = "https://api.openai.com/v1"
    LLM_MODEL = "gpt-4o-mini"
    print(f"OpenAI key detected -> using model: {LLM_MODEL}")
else:
    # Unknown prefix — assume OpenRouter (most permissive)
    provider = "openrouter"
    base_url = "https://openrouter.ai/api/v1"
    LLM_MODEL = "google/gemini-2.0-flash-001"
    print(f"Unknown key format — defaulting to OpenRouter with model: {LLM_MODEL}")

client = OpenAI(base_url=base_url, api_key=api_key)
print(f"\nLLM client connected! Provider: {provider}")

OpenRouter key detected -> using model: google/gemini-2.0-flash-001

LLM client connected! Provider: openrouter


In [9]:
# ── RAG Answer Function ───────────────────────────────────────────────────────

SYSTEM_PROMPT = """You are a helpful AI assistant for the MIT AI Studio course
(MAS.664 / MAS.665 / EC.731 / IDS.865), taught at MIT Media Lab by Professor Ramesh Raskar.

Rules:
- Answer ONLY using the provided context. Do not use outside knowledge.
- If the context mentions images or videos, include the URLs so the user can view them.
- If the context doesn't contain enough information, say so clearly.
- Be concise but thorough. Use bullet points for lists.
- When mentioning people, include their role/affiliation if available."""


def format_context(results: list[dict], max_chars: int = 4000) -> str:
    """Format retrieved documents into context for the LLM."""
    if not results:
        return "No relevant documents found."
    parts = []
    chars = 0
    for i, r in enumerate(results, 1):
        entry = f"[Source {i}: {r.get('page_title', '?')} > {r.get('section_title', '?')}]\n{r.get('content', '')}"
        if chars + len(entry) > max_chars:
            break
        parts.append(entry)
        chars += len(entry)
    return "\n\n".join(parts)


def rag_answer(question: str, conn=db, top_k: int = 5) -> tuple[str, list[dict]]:
    """Full RAG pipeline: retrieve -> format -> generate."""
    query_emb = generate_embedding(question)
    results = hybrid_search(conn, question, query_emb, top_k=top_k)
    context = format_context(results)

    response = client.chat.completions.create(
        model=LLM_MODEL,
        max_tokens=800,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    answer = response.choices[0].message.content.strip()
    return answer, results


# ── Quick test ──
question = "Who are the lead instructors and what is the course about?"
answer, sources = rag_answer(question)
print(f"Q: {question}\n")
print(f"A: {answer}\n")
print(f"Sources used: {len(sources)} documents")
for s in sources:
    print(f"  - [{s.get('content_type','')}] {s.get('section_title', '')[:60]} (score: {s['final_score']:.2f})")

Q: Who are the lead instructors and what is the course about?

A: Based on the context, Ramesh Raskar (Lead Professor, MIT Media Lab) is a lead instructor for the AI Studio course. The context does not provide information about what the course is about.

*   **Ramesh Raskar:** Lead Professor, MIT Media Lab. Photo: [https://aiforimpact.github.io/assets/img/speakers/ramesh.png](https://aiforimpact.github.io/assets/img/speakers/ramesh.png)

Sources used: 5 documents
  - [bio] Bio: David Shrier (score: 0.70)
  - [bio] Bio: Nikolay Vyahhi (score: 0.63)
  - [bio] Bio: Nikolay Vyahhi (score: 0.63)
  - [bio] Bio: Ramesh Raskar (score: 0.61)
  - [bio] Bio: John Werner (score: 0.56)


---
## Part 7: Interactive Chat Interface (5 min)

Now let's build an interactive chat — right here in the notebook!

Try asking:
- "Who are the venture capital speakers?"
- "What's the difference between Spring 2025 and Fall 2025?"
- "What are the course pillars?"
- "Tell me about the demo day"
- "Are there any videos I can watch?"

In [None]:
# ── In-Notebook Chat UI ───────────────────────────────────────────────────────

import ipywidgets as widgets
from IPython.display import HTML, clear_output, display

chat_output = widgets.Output(layout=widgets.Layout(
    width='100%', min_height='300px', max_height='500px',
    overflow_y='auto', border='1px solid #444', padding='10px',
))
input_box = widgets.Text(
    placeholder='Ask anything about the MIT AI Studio course...',
    layout=widgets.Layout(width='85%'),
)
send_btn = widgets.Button(description='Send', button_style='primary',
                          layout=widgets.Layout(width='14%'))
show_sources = widgets.Checkbox(value=False, description='Show sources', indent=False)

chat_history = []


def _extract_source_images(sources: list[dict], max_images: int = 3) -> str:
    """Extract speaker photos from bio sources and return as HTML img tags."""
    if not sources:
        return ""
    imgs = []
    seen_urls = set()
    for s in sources:
        meta = s.get("metadata", "")
        if isinstance(meta, str):
            try:
                meta = json.loads(meta)
            except (json.JSONDecodeError, TypeError):
                meta = {}
        img_url = meta.get("image_url", "")
        if img_url and img_url not in seen_urls and s.get("content_type") == "bio":
            seen_urls.add(img_url)
            name = meta.get("name", "Speaker")
            imgs.append(
                f"<img src='{img_url}' alt='{name}' title='{name}' "
                f"style='width:60px;height:60px;border-radius:50%;object-fit:cover;"
                f"margin:2px 4px;border:2px solid #555;'>"
            )
        if len(imgs) >= max_images:
            break
    if imgs:
        return f"<div style='margin:6px 0;'>{''.join(imgs)}</div>"
    return ""


def render_chat():
    with chat_output:
        clear_output()
        for msg in chat_history:
            if msg["role"] == "user":
                display(HTML(f"<div style='margin:8px 0;padding:10px 14px;background:#1a3a5c;"
                    f"border-radius:12px;color:white;max-width:80%;margin-left:auto;"
                    f"text-align:right;font-size:14px;'><b>You:</b> {msg['content']}</div>"))
            else:
                # Render speaker photos from bio sources
                images_html = _extract_source_images(msg.get("sources", []))
                sources_html = ""
                if show_sources.value and msg.get("sources"):
                    src_items = "".join(
                        f"<br>- [{s.get('content_type','')}] {s.get('section_title','')[:50]} (score: {s['final_score']:.2f})"
                        for s in msg["sources"]
                    )
                    sources_html = f"<br><details><summary><small>Sources</small></summary><small>{src_items}</small></details>"
                display(HTML(f"<div style='margin:8px 0;padding:10px 14px;background:#2d2d2d;"
                    f"border-radius:12px;color:#e0e0e0;max-width:90%;font-size:14px;"
                    f"line-height:1.5;'><b>AI Studio Assistant:</b><br>{msg['content']}"
                    f"{images_html}{sources_html}</div>"))

def on_send(btn):
    question = input_box.value.strip()
    if not question:
        return
    input_box.value = ""
    chat_history.append({"role": "user", "content": question})
    render_chat()
    with chat_output:
        display(HTML("<div style='color:#888;padding:5px;font-style:italic;'>Searching & generating...</div>"))
    try:
        answer, sources = rag_answer(question)
        chat_history.append({"role": "assistant", "content": answer, "sources": sources})
    except Exception as e:
        chat_history.append({"role": "assistant", "content": f"Error: {e}", "sources": []})
    render_chat()

send_btn.on_click(on_send)
input_box.on_submit(lambda _: on_send(None))
show_sources.observe(lambda _: render_chat(), names='value')

header = widgets.HTML("<h3 style='margin:0;padding:8px 0;'>Chat with the AI Studio Course</h3>")
display(widgets.VBox([header, chat_output, widgets.HBox([input_box, send_btn]), widgets.HBox([show_sources])]))

chat_history.append({
    "role": "assistant",
    "content": "Welcome! I've ingested the entire <b>MIT AI Studio course website</b> "
               "(7 pages, all semesters, 40+ speaker bios, videos, and images). "
               "Ask me anything!<br><br>"
               "<b>Try:</b> 'Who are the VC speakers?' or 'What is the course about?' "
               "or 'How do I register for Spring 2026?'",
    "sources": []
})
render_chat()

---
## Part 8: Understanding What Just Happened — RAG Recap

Let's trace what happens when you type a question in the chat:

```
 1. EMBED        Your question -> 384-dim vector
                 "Who teaches AI ethics?" -> [0.03, -0.12, ...]

 2. RETRIEVE     Hybrid search over the database
                 BM25:     keyword match on "teaches", "AI"
                 Semantic: cosine similarity to find related
                 -> Top 5 most relevant chunks

 3. AUGMENT      Inject retrieved chunks into the LLM prompt
                 "Context: [Source 1: ...] [Source 2: ...]"

 4. GENERATE     LLM produces answer grounded in evidence
                 -> Factual, verifiable, no hallucination
```

### Why RAG Matters for Business (MBA Perspective)

| Use Case | Without RAG | With RAG |
|----------|-------------|----------|
| Customer support chatbot | Generic answers, can't access your docs | Answers from your actual knowledge base |
| Legal document review | May hallucinate case law | Grounded in your actual contracts |
| Internal company wiki Q&A | Outdated training data | Always current with your docs |
| Due diligence research | May confuse companies | Grounded in actual filings |

### Key Takeaways

1. **RAG = Retrieve + Augment + Generate** — give the LLM the right context, get grounded answers
2. **Hybrid search > either alone** — combine keyword (BM25) and semantic (vector) for best results
3. **Embeddings are the bridge** — they convert text to numbers so we can measure similarity
4. **Model choice matters** — MiniLM for prototypes, BGE-M3/OpenAI for production
5. **The entire system runs locally** — embeddings are local, only the LLM call needs an API

---
## Exercises

### Exercise 1: Compare Search Strategies
Run the cell below to see how BM25-only, semantic-only, and hybrid search differ.

In [None]:
# Exercise 1: Compare search strategies for different query types

test_cases = [
    ("Ramesh Raskar", "Exact name lookup - BM25 should excel"),
    ("venture capital investors in AI", "Conceptual query - semantic should excel"),
    ("MIT Media Lab course spring 2026", "Mix of specific + conceptual - hybrid wins"),
]

for query, description in test_cases:
    print(f"\n{'='*70}")
    print(f"Query: \"{query}\"")
    print(f"Expected: {description}")
    print(f"{'='*70}")

    emb = generate_embedding(query)

    r_bm25 = hybrid_search(db, query, emb, keyword_weight=1.0, semantic_weight=0.0, top_k=3)
    r_sem = hybrid_search(db, query, emb, keyword_weight=0.0, semantic_weight=1.0, top_k=3)
    r_hyb = hybrid_search(db, query, emb, keyword_weight=0.3, semantic_weight=0.7, top_k=3)

    for label, results in [("BM25 Only", r_bm25), ("Semantic Only", r_sem), ("Hybrid", r_hyb)]:
        print(f"\n  {label}:")
        if not results:
            print("    (no results)")
        for r in results[:2]:
            print(f"    - {r.get('section_title', '?')[:50]} (score: {r['final_score']:.3f})")

### Exercise 2: Add Your Own Content
Add a new document to the knowledge base and query it.

In [None]:
# Exercise 2: Add your own content to the RAG system

my_doc = """
Replace this with your own text! For example, paste your startup idea,
a paragraph from a paper you're reading, or notes from a lecture.
"""

emb = generate_embedding(my_doc)
cursor = db.cursor()
cursor.execute("""
    INSERT INTO documents (url, page_title, section_title, content, content_type, metadata)
    VALUES (?, ?, ?, ?, ?, ?)
""", ("user://custom", "My Document", "Custom Content", my_doc, "text", "{}"))
rowid = cursor.lastrowid
cursor.execute("INSERT INTO vec_documents (rowid, embedding) VALUES (?, ?)",
               (rowid, serialize_embedding(emb)))
db.commit()
print(f"Added your document (id={rowid}). Now try asking about it in the chat above!")

### Exercise 3: Let AI Tune Your RAG Hyperparameters

Our RAG system has several "knobs" (hyperparameters) that affect quality. Right now they're set to reasonable defaults — but are they optimal for *this* dataset?

**The hyperparameters we can tune:**
- `keyword_weight` / `semantic_weight` — balance between BM25 and vector search
- `top_k` — how many documents to retrieve

Below, we first test with **untuned defaults**, then ask the LLM to analyze results and suggest better values. This is a real technique used in production RAG systems!

In [None]:
# ── Exercise 3: Untuned vs AI-Tuned Hyperparameters ─────────────────────────

EVAL_QUERIES = [
    {"query": "Who is the lead professor?",
     "expected_keywords": ["ramesh", "raskar", "media lab"]},
    {"query": "What are the three course pillars?",
     "expected_keywords": ["innovation", "human centered", "technical"]},
    {"query": "Which venture capitalists are speakers?",
     "expected_keywords": ["khosla", "lux", "pillar"]},
    {"query": "How do I register for Spring 2026?",
     "expected_keywords": ["questionnaire", "register", "step"]},
    {"query": "What happens at demo day?",
     "expected_keywords": ["demo", "presentations", "investors"]},
]

def score_results(results: list[dict], expected_keywords: list[str]) -> float:
    """Score retrieval: what fraction of expected keywords appear in top results?"""
    if not results:
        return 0.0
    all_content = " ".join(r.get("content", "") for r in results).lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in all_content)
    return hits / len(expected_keywords)

def evaluate_config(kw_w, sem_w, top_k):
    """Run all eval queries and return average retrieval score."""
    scores = []
    for eq in EVAL_QUERIES:
        emb = generate_embedding(eq["query"])
        results = hybrid_search(db, eq["query"], emb,
                               keyword_weight=kw_w, semantic_weight=sem_w, top_k=top_k)
        scores.append(score_results(results, eq["expected_keywords"]))
    return sum(scores) / len(scores)

# ── Step 1: Test with UNTUNED defaults ──
print("=" * 60)
print("STEP 1: Testing with UNTUNED defaults")
print("  keyword_weight=0.3, semantic_weight=0.7, top_k=5")
print("=" * 60)

untuned_score = evaluate_config(0.3, 0.7, 5)
print(f"\n  Overall retrieval score: {untuned_score:.1%}\n")

for eq in EVAL_QUERIES:
    emb = generate_embedding(eq["query"])
    results = hybrid_search(db, eq["query"], emb, keyword_weight=0.3, semantic_weight=0.7, top_k=5)
    s = score_results(results, eq["expected_keywords"])
    status = "PASS" if s >= 0.66 else "WEAK" if s > 0 else "FAIL"
    print(f"  [{status}] \"{eq['query']}\" -> {s:.0%} keywords found")

# ── Step 2: Ask AI to suggest better hyperparameters ──
print(f"\n{'=' * 60}")
print("STEP 2: Asking AI to analyze and suggest better config...")
print("=" * 60)

tuning_prompt = f"""You are optimizing a RAG system's hyperparameters.
The system uses hybrid search combining BM25 keyword search and semantic vector search.

Current config: keyword_weight=0.3, semantic_weight=0.7, top_k=5
Current retrieval score: {untuned_score:.1%}

Eval results:
"""
for eq in EVAL_QUERIES:
    emb = generate_embedding(eq["query"])
    results = hybrid_search(db, eq["query"], emb, keyword_weight=0.3, semantic_weight=0.7, top_k=5)
    s = score_results(results, eq["expected_keywords"])
    top_sections = [r.get("section_title", "?")[:40] for r in results[:3]]
    tuning_prompt += (f"\n- Query: \"{eq['query']}\"\n"
                      f"  Expected keywords: {eq['expected_keywords']}\n"
                      f"  Score: {s:.0%}, Top results: {top_sections}\n")

tuning_prompt += """\nSuggest improved hyperparameters. Consider:
- If name lookups fail, increase keyword_weight
- If conceptual queries fail, increase semantic_weight
- If not enough context, increase top_k (max 10)

Respond ONLY with JSON: {"keyword_weight": 0.4, "semantic_weight": 0.6, "top_k": 7, "reasoning": "brief"}"""

response = client.chat.completions.create(
    model=LLM_MODEL,
    max_tokens=200,
    messages=[{"role": "user", "content": tuning_prompt}],
)
ai_response = response.choices[0].message.content.strip()
print(f"\n  AI suggestion: {ai_response}\n")

# Parse and test
json_match = re.search(r'\{[^}]+\}', ai_response)
if json_match:
    suggestion = json.loads(json_match.group())
    new_kw = suggestion.get("keyword_weight", 0.4)
    new_sem = suggestion.get("semantic_weight", 0.6)
    new_topk = min(suggestion.get("top_k", 7), 10)

    print(f"{'=' * 60}")
    print("STEP 3: Testing with AI-TUNED config")
    print(f"  keyword_weight={new_kw}, semantic_weight={new_sem}, top_k={new_topk}")
    print(f"{'=' * 60}")

    tuned_score = evaluate_config(new_kw, new_sem, new_topk)
    print(f"\n  Overall retrieval score: {tuned_score:.1%}\n")

    for eq in EVAL_QUERIES:
        emb = generate_embedding(eq["query"])
        results = hybrid_search(db, eq["query"], emb,
                               keyword_weight=new_kw, semantic_weight=new_sem, top_k=new_topk)
        s = score_results(results, eq["expected_keywords"])
        status = "PASS" if s >= 0.66 else "WEAK" if s > 0 else "FAIL"
        print(f"  [{status}] \"{eq['query']}\" -> {s:.0%} keywords found")

    # Summary
    print(f"\n{'=' * 60}")
    print("SUMMARY")
    print(f"  Untuned score:  {untuned_score:.1%}")
    print(f"  AI-tuned score: {tuned_score:.1%}")
    delta = tuned_score - untuned_score
    if delta > 0:
        print(f"  Improvement:    +{delta:.1%}")
    elif delta == 0:
        print("  No change (defaults were already good for this dataset)")
    else:
        print(f"  Change:         {delta:.1%} (AI suggestion didn't help here)")
    print(f"{'=' * 60}")
else:
    print("  Could not parse AI suggestion. Try running the cell again.")

---
## What's Next? Taking RAG to Production

What we built today is a **Naive RAG** system (per Gao et al. 2024 taxonomy). Here's how real products level up:

| Level | Technique | Why |
|-------|-----------|-----|
| **Advanced RAG** | Query rewriting, reranking, HyDE | Better retrieval quality |
| **Agentic RAG** | Multi-step retrieval, tool use | Handle complex multi-hop questions |
| **Modular RAG** | Routing, adaptive retrieval | Only retrieve when needed |
| **Evaluation** | RAGAS, faithfulness metrics | Measure and improve systematically |

### Production Considerations for MBAs

- **Cost**: Embedding is free (local model). LLM costs ~$0.001-0.01 per query.
- **Latency**: Embedding ~50ms, search ~5ms, LLM ~1-3s. Total: ~2-4 seconds.
- **Scale**: sqlite-vec handles millions of vectors. For billions, consider pgvector or Pinecone.
- **Privacy**: Everything except the LLM call stays local. For full privacy, use local LLMs.

### Further Reading

- Lewis et al. (2020) — [RAG for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
- Gao et al. (2024) — [RAG for LLMs: A Survey](https://arxiv.org/abs/2312.10997)
- Wang et al. (2024) — [Improving Text Embeddings with LLMs](https://arxiv.org/abs/2401.00368)
- Muennighoff et al. (2023) — [MTEB: Massive Text Embedding Benchmark](https://arxiv.org/abs/2210.07316)
- [sqlite-vec documentation](https://alexgarcia.xyz/sqlite-vec/)
- [fastembed documentation](https://qdrant.github.io/fastembed/)

---
*Workshop created for [MIT AI Studio](https://aiforimpact.github.io/) (MAS.664/665, EC.731, IDS.865) — Spring 2026*

*Instructor: [Brandon Sneider](https://linkedin.com/in/brandonsneider)*