# Generative Search (RAG) with OpenAI + Local Vector DB (Chroma)

This notebook mirrors the lesson flow:

1. Build a text archive
2. Chunk it
3. Create embeddings
4. Store & search with a lightweight local vector DB
5. Add a generation step to answer questions using retrieved context

> Requirements: the `OPENAI_API_KEY` environment variable must be set, either in your shell or in a `.env` file.


In [None]:
# (Optional) If you run this notebook locally, you may need:
# !pip -q install -U openai chromadb python-dotenv tiktoken


## 1) Define a question and build a small text archive

Replace `text` with your own documents (articles, chapters, PDFs converted to text, etc.).
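
If your documents already exist as plain-text files, one way to build `text` is to concatenate them. The sketch below is only an illustration: `load_corpus`, the `*.txt` pattern, and the `./my_docs` folder are hypothetical names, not part of the lesson.

```python
from pathlib import Path


def load_corpus(folder: str, pattern: str = "*.txt") -> str:
    """Concatenate all matching text files, separated by blank lines."""
    parts = []
    # Sort the paths so the corpus order is stable across runs.
    for path in sorted(Path(folder).glob(pattern)):
        content = path.read_text(encoding="utf-8").strip()
        if content:  # skip empty files
            parts.append(content)
    return "\n\n".join(parts)


# Hypothetical usage (the folder name is an assumption):
# text = load_corpus("./my_docs")
```

Joining files with a blank line keeps the paragraph-based chunking in the next section working unchanged.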


In [None]:
question = "Are side projects important when you are starting to learn about AI?"

# Example mini-archive (replace with your own corpus)
text = """
The rapid rise of AI has led to a rapid rise in AI jobs, and many people are building exciting careers in this field. A career is a decades-long journey, and the path is not always straight. In the early stages, focus on building skills, shipping small projects, and learning to learn.

Side projects can be a powerful way to explore ideas and practice new skills. Even if you have a full-time job, a side hustle or a fun project can stir the creative juices and help you grow.

As you build your portfolio, be mindful of your employer's policies. Avoid conflicts of interest and respect confidentiality.
"""
print(question)


## 2) Setup

Load the API key and initialize the OpenAI client.


In [None]:
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # reads .env if present

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("Missing OPENAI_API_KEY. Set it as an environment variable or in a .env file.")

from openai import OpenAI
client = OpenAI()


## 3) Chunking

Split the archive into chunks that each contain one coherent idea.

Tip: For real corpora, consider chunk sizes of ~200–800 tokens with overlap.
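
As a rough sketch of the tip above, here is a sliding-window chunker with overlap. It assumes whitespace-separated words are an acceptable stand-in for tokens (a real tokenizer such as `tiktoken` would count differently). `chunk_words` and its default sizes are illustrative, not part of the lesson.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk repeats the last `overlap` words of the previous one, so an idea that straddles a boundary still appears whole in at least one chunk.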


In [None]:
import re

# Simple paragraph chunking (works well for clean text)
chunks = [c.strip() for c in text.split("\n\n") if c.strip()]

# Optional: remove excessive whitespace
chunks = [re.sub(r"\s+", " ", c).strip() for c in chunks]

print("Num chunks:", len(chunks))
chunks[:3]


## 4) Embeddings (OpenAI)

We embed each chunk so that semantic search can work by similarity in vector space.
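
To make "similarity in vector space" concrete, here is a minimal pure-Python sketch of cosine similarity. The toy 2-D vectors are made up for illustration; real embeddings have many more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```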


In [None]:
from typing import List

EMBED_MODEL = "text-embedding-3-small"  # lightweight + solid default

def embed_texts(texts: List[str], batch_size: int = 64) -> List[List[float]]:
    """Embed `texts` in batches and return one vector per input text."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        resp = client.embeddings.create(model=EMBED_MODEL, input=batch)
        vectors.extend([d.embedding for d in resp.data])
    return vectors

chunk_embeddings = embed_texts(chunks)
len(chunk_embeddings), len(chunk_embeddings[0])


## 5) Store vectors in a lightweight local Vector DB (Chroma)

Chroma is easy to run locally (no server required) and supports metadata and filtering on it.

We will store:
- `documents`: the text chunks
- `embeddings`: their vectors
- `metadatas`: optional info (source, paragraph id, etc.)


In [None]:
import chromadb

# Use persistent storage on disk (change the path if you want)
CHROMA_PATH = "./chroma_rag_db"

chroma_client = chromadb.PersistentClient(path=CHROMA_PATH)

COLLECTION_NAME = "lesson5_rag"
# Recreate collection to keep this notebook idempotent
try:
    chroma_client.delete_collection(COLLECTION_NAME)
except Exception:
    pass

collection = chroma_client.create_collection(name=COLLECTION_NAME, metadata={"hnsw:space": "cosine"})

ids = [f"chunk-{i}" for i in range(len(chunks))]
metadatas = [{"chunk_id": i} for i in range(len(chunks))]

collection.add(
    ids=ids,
    documents=chunks,
    embeddings=chunk_embeddings,
    metadatas=metadatas
)

collection.count()


## 6) Semantic search (Dense Retrieval)

We embed the query and retrieve the top-$k$ most similar chunks.
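
Conceptually, top-$k$ retrieval ranks every stored vector by its distance to the query vector and keeps the $k$ closest. The brute-force sketch below illustrates that idea with toy vectors. It is not how Chroma searches internally (Chroma uses an approximate HNSW index), and `cosine_distance` and `top_k` are illustrative names. With the `cosine` space configured above, the reported distance is 1 minus the cosine similarity, so smaller means closer.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, distance) for the k vectors closest to the query."""
    scored = [(cosine_distance(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    # nsmallest avoids sorting the full list when k is small.
    return [(i, d) for d, i in heapq.nsmallest(k, scored)]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy "chunk embeddings"
query = [1.0, 0.1]                           # toy "query embedding"
print([i for i, _ in top_k(query, docs, k=2)])  # [0, 2]
```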


In [None]:
def retrieve(query: str, k: int = 3):
    """Return the k nearest chunks as (document, metadata, distance) tuples."""
    q_emb = embed_texts([query])[0]
    res = collection.query(
        query_embeddings=[q_emb],
        n_results=k,
        include=["documents", "metadatas", "distances"]
    )
    # Flatten one-query result
    docs = res["documents"][0]
    metas = res["metadatas"][0]
    dists = res["distances"][0]
    return list(zip(docs, metas, dists))

top = retrieve(question, k=3)
for i,(doc, meta, dist) in enumerate(top, 1):
    print(f"#{i} | chunk_id={meta['chunk_id']} | distance={dist:.4f}")
    print(doc[:200] + ("..." if len(doc) > 200 else ""))
    print()


## 7) Generation step (RAG)

Now we **inject retrieved context** into the prompt and ask an LLM to answer.

Design choices you can tune:
- number of retrieved chunks ($k$)
- prompt format (instructions, citations, style)
- model and max output tokens
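
To experiment with these choices without calling the API, the prompt can be assembled by a separate helper. The sketch below is one possible shape, not the lesson's method. `build_prompt`, the `max_context_chars` budget, and the extra citation instruction are assumptions added for illustration. It expects `(document, metadata, distance)` tuples like those returned by `retrieve`.

```python
from typing import Dict, List, Tuple

Retrieved = Tuple[str, Dict, float]  # (document, metadata, distance)


def build_prompt(question: str,
                 retrieved: List[Retrieved],
                 max_context_chars: int = 2000) -> str:
    """Format retrieved chunks as numbered context, stopping at a size budget."""
    blocks = []
    used = 0
    for rank, (doc, meta, _dist) in enumerate(retrieved, 1):
        block = f"[{rank}] (chunk_id={meta.get('chunk_id')}) {doc}"
        # Always keep the best-ranked chunk; stop once the budget would be exceeded.
        if blocks and used + len(block) > max_context_chars:
            break
        blocks.append(block)
        used += len(block)
    context = "\n\n".join(blocks)
    return (
        "You are answering using ONLY the provided context. "
        "Cite the bracketed source numbers you relied on.\n"
        "If the answer is not contained in the context, say: "
        "\"I don't know based on the provided text.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer (concise):\n"
    )


demo = [("Side projects help you practice.", {"chunk_id": 1}, 0.21)]
print(build_prompt("Are side projects useful?", demo))
```

Keeping prompt construction separate from the model call makes it easy to print and compare prompt variants before spending tokens.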


In [None]:
LLM_MODEL = "gpt-4.1-mini"  # choose any model available in your account

def answer_with_rag(question: str, k: int = 3, max_output_tokens: int = 200):
    retrieved = retrieve(question, k=k)
    context_blocks = []
    for rank,(doc, meta, dist) in enumerate(retrieved, 1):
        context_blocks.append(f"[{rank}] (chunk_id={meta['chunk_id']}) {doc}")
    context = "\n\n".join(context_blocks)

    prompt = f"""You are answering using ONLY the provided context.
If the answer is not contained in the context, say: "I don't know based on the provided text."

Context:
{context}

Question: {question}

Answer (concise):
"""

    # Preferred: Responses API
    resp = client.responses.create(
        model=LLM_MODEL,
        input=prompt,
        max_output_tokens=max_output_tokens,
    )
    return resp.output_text, retrieved

final_answer, retrieved = answer_with_rag(question, k=3, max_output_tokens=200)
print(final_answer)


## 8) Optional: multiple generations for quick prompt evaluation

Generate multiple answers to see variability and test prompt robustness.
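
One rough way to quantify variability is the average word overlap between every pair of answers. The sketch below uses Jaccard similarity over lowercase word sets. This is a crude lexical proxy chosen for illustration, not a semantic metric, and `mean_pairwise_jaccard` is a hypothetical helper. The sample strings are hand-written, not model output.

```python
import re
from itertools import combinations
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase word set used for the overlap comparison."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def mean_pairwise_jaccard(answers: List[str]) -> float:
    """Average Jaccard overlap across all answer pairs (1.0 = identical word sets)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    total = 0.0
    for a, b in pairs:
        ta, tb = _tokens(a), _tokens(b)
        union = ta | tb
        total += len(ta & tb) / len(union) if union else 1.0
    return total / len(pairs)


samples = [
    "Yes, side projects help you practice new skills.",
    "Yes, side projects help you practice new skills.",
    "I don't know based on the provided text.",
]
print(f"agreement: {mean_pairwise_jaccard(samples):.2f}")
```

A low score suggests the prompt leaves the model too much freedom, or that the retrieved context only weakly supports an answer.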


In [None]:
def answer_with_rag_n(question: str, k: int = 3, n: int = 3, max_output_tokens: int = 200):
    retrieved = retrieve(question, k=k)
    context = "\n\n".join([f"[{i+1}] {doc}" for i,(doc,meta,dist) in enumerate(retrieved)])

    prompt = f"""You are answering using ONLY the provided context.
If the answer is not contained in the context, say: "I don't know based on the provided text."

Context:
{context}

Question: {question}

Answer (concise):
"""

    answers = []
    for _ in range(n):
        r = client.responses.create(model=LLM_MODEL, input=prompt, max_output_tokens=max_output_tokens)
        answers.append(r.output_text)
    return answers

answers = answer_with_rag_n(question, k=3, n=3, max_output_tokens=200)
for i,a in enumerate(answers, 1):
    print(f"--- Generation {i} ---")
    print(a)
    print()
