# DPI681 — Vector DB & RAG Activities (Colab Notebook)

This notebook consolidates three activities into a single, reproducible workflow:
1. **Bulk Image Analysis** — multimodal prompts over a set of image URLs, results saved to CSV.
2. **RAG (One-Shot)** — retrieve context from a FAISS index and answer a single legal question.
3. **RAG (Chat)** — iterative chat with retrieval for Massachusetts real-estate law.

**Key decisions**  
- Uses **OpenAI Responses API** (not Chat Completions).  
- All tabular work is done with **Polars** (no pandas).  
- API key is stored and read from Colab **Secrets** (`google.colab.userdata`) — not `os.environ`.

## Setup — API Key in Colab Secrets

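We read your `OPENAI_API_KEY` from Colab **Secrets** through `google.colab.userdata`.  
Add the key in the Secrets panel (key icon in the left sidebar) and grant this notebook access to it; `userdata` reads secrets but does not write them. If the key is missing or access has not been granted, the cell below stops with instructions.

In [None]:
from google.colab import userdata

# userdata.get raises (rather than returning None) when the secret is absent
# or when this notebook has not been granted access to it.
try:
    key = userdata.get('OPENAI_API_KEY')
except userdata.SecretNotFoundError:
    raise RuntimeError(
        "No OPENAI_API_KEY found in Colab Secrets. Add it in the Secrets panel, then rerun this cell."
    ) from None
except userdata.NotebookAccessError:
    raise RuntimeError(
        "This notebook cannot read OPENAI_API_KEY. Enable notebook access in the Secrets panel, then rerun this cell."
    ) from None

if not key or not key.strip():
    raise ValueError("OPENAI_API_KEY is empty. Set a valid key in the Secrets panel and rerun this cell.")
print("OPENAI_API_KEY is set in Colab Secrets.")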

## Setup — Install Dependencies

Installs the runtime libraries. We use the CPU build of FAISS (`faiss-cpu`) for portability, and `gdown` to fetch the shared folder by ID if needed.

In [None]:
# Quote the version specifiers so the shell cannot expand `*` as a filename glob.
!pip -q install "polars==1.*" faiss-cpu "openai==1.*" tqdm pillow gdown

## Setup — Mount Drive or Auto-Download Shared Folder

The class folder is shared at:  
`https://drive.google.com/drive/folders/1XTbuR5vcz7FsikBRPh1SXwFjbUY4mx46`

- We **first** try to use the mounted Drive path:  
  `/content/drive/MyDrive/Teaching/DPI681 Materials/RAG Materials`  
- If it’s not found (e.g., students didn’t add a shortcut to Drive), we **auto-download** the folder contents using its ID into `/content/class_materials`.

In [None]:
import os
from google.colab import drive
drive.mount('/content/drive')

# Preferred path if students added the shared folder (or a shortcut) to their MyDrive:
CLASS_FOLDER = "/content/drive/MyDrive/Teaching/DPI681 Materials/RAG Materials"

# Fallback: download shared folder by ID (read-only) into local runtime.
FOLDER_ID = "1XTbuR5vcz7FsikBRPh1SXwFjbUY4mx46"
LOCAL_FALLBACK = "/content/class_materials"

def ensure_class_folder():
    global CLASS_FOLDER
    if os.path.exists(CLASS_FOLDER):
        print("Using mounted class folder:", CLASS_FOLDER)
        return CLASS_FOLDER
    print("Mounted path not found. Downloading shared folder by ID...")
    os.makedirs(LOCAL_FALLBACK, exist_ok=True)
    # gdown will create a subdir named by the folder; normalize to the first/only dir if needed
    !gdown --folder "$FOLDER_ID" -O "$LOCAL_FALLBACK" -q
    # Try to detect the actual directory with our expected files.
    candidates = []
    for root, dirs, files in os.walk(LOCAL_FALLBACK):
        if "images.csv" in files or ("faiss_index.bin" in files and "metadata.json" in files):
            candidates.append(root)
    if candidates:
        candidates.sort(key=len)
        CLASS_FOLDER = candidates[0]
        print("Using downloaded class folder:", CLASS_FOLDER)
        return CLASS_FOLDER
    else:
        raise FileNotFoundError("Downloaded folder does not contain expected files. Please add the shared folder to your Drive or adjust paths.")

CLASS_FOLDER = ensure_class_folder()

# Student output location
STUDENT_WORKDIR = "/content/drive/MyDrive/dpi681_student_work"
os.makedirs(STUDENT_WORKDIR, exist_ok=True)
print("STUDENT_WORKDIR:", STUDENT_WORKDIR)
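
The fallback search in `ensure_class_folder` picks the shortest directory path that holds the expected files. The sketch below isolates that step so it can be tried on a throwaway directory tree without Drive or `gdown`. `find_materials_dir` is a hypothetical helper name used only for illustration; it is not part of the notebook.

```python
import os
import tempfile
from typing import Optional

EXPECTED_IMAGE_FILE = "images.csv"
EXPECTED_RAG_FILES = {"faiss_index.bin", "metadata.json"}

def find_materials_dir(root: str) -> Optional[str]:
    """Return the shortest directory path under root that holds the expected files."""
    candidates = []
    for dirpath, _dirs, files in os.walk(root):
        names = set(files)
        # Same rule as the notebook: images.csv alone, or both RAG files together.
        if EXPECTED_IMAGE_FILE in names or EXPECTED_RAG_FILES <= names:
            candidates.append(dirpath)
    # Shortest path wins, mirroring candidates.sort(key=len) in ensure_class_folder.
    return min(candidates, key=len) if candidates else None

# Demo on a temporary tree that imitates a downloaded folder with one subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "RAG Materials")
    os.makedirs(nested)
    open(os.path.join(nested, "images.csv"), "w").close()
    found = find_materials_dir(tmp)
    print("Detected:", found)
```

A return value of `None` corresponds to the `FileNotFoundError` branch in the cell above.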

In [None]:
# Common imports
import os, json
from typing import List, Dict, Any
import polars as pl
import numpy as np
import faiss
from tqdm import tqdm
from openai import OpenAI

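from google.colab import userdata
# userdata.get raises when the secret is missing or notebook access is not granted.
try:
    OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
except (userdata.SecretNotFoundError, userdata.NotebookAccessError) as e:
    raise RuntimeError("OPENAI_API_KEY is not readable from Colab Secrets. Rerun the secrets cell.") from e
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY missing from Colab Secrets. Rerun the secrets cell.")
client = OpenAI(api_key=OPENAI_API_KEY)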

---
## Activity 1 — Bulk Image Analysis (Multimodal)

**Goal:** send a text+image prompt to the OpenAI Responses API for each image in `images.csv` and save results.

**Iterate on the prompt** by editing `PROMPT_TEXT` and re-running the processing cell.

In [None]:
IMAGES_CSV = os.path.join(CLASS_FOLDER, "images.csv")       # must include columns: image_id, url
RESULTS_CSV = os.path.join(STUDENT_WORKDIR, "images_analysis_results.csv")
MODEL_VISION = "gpt-4o-mini"

# Your prompt: edit freely and rerun
PROMPT_TEXT = "Return one word capturing the sentiment of the image."

### Load the dataset (Polars) and validate columns

In [None]:
if not os.path.exists(IMAGES_CSV):
    raise FileNotFoundError(f"images.csv not found at {IMAGES_CSV}. Check the shared folder or fallback download.")

df_images = pl.read_csv(IMAGES_CSV)
required_cols = {"image_id", "url"}
missing = required_cols - set(df_images.columns)
if missing:
    raise ValueError(f"images.csv must have columns {required_cols}, missing: {missing}")

df_images.head(3)

### Process images with the Responses API
Sends a single user message containing text **and** an image URL per row. Saves a Polars DataFrame to CSV in your `STUDENT_WORKDIR`.

In [None]:
results: List[Dict[str, Any]] = []

for row in tqdm(df_images.iter_rows(named=True), total=df_images.height, desc="Processing Images"):
    image_id = row["image_id"]
    image_url = row["url"]

    try:
        resp = client.responses.create(
            model=MODEL_VISION,
            input=[{
                "role": "user",
                "content": [
                    {"type": "input_text", "text": PROMPT_TEXT},
                    {"type": "input_image", "image_url": str(image_url)},
                ],
            }],
        )
        output_text = (resp.output_text or "").strip()
    except Exception as e:
        output_text = f"[ERROR] {e}"

    results.append({"image_id": image_id, "url": image_url, "output_text": output_text})

pl.DataFrame(results).write_csv(RESULTS_CSV)
print(f"Image analysis complete. Results saved to: {RESULTS_CSV}")
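
The loop above records any failure as an `[ERROR]` row and moves on. For transient failures such as rate limits or timeouts, one option is to retry with exponential backoff before giving up. The helper below is a minimal sketch in plain Python; `with_retries` and its parameters are hypothetical names, not part of the OpenAI SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on any exception with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller record the error
            # Delay doubles each attempt (1x, 2x, 4x of base_delay), plus up to 10% jitter.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable")

# Demo with a function that fails twice and then succeeds.
attempts = {"count": 0}

def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(with_retries(flaky, base_delay=0.01))
```

In the processing loop, the API call could be wrapped as `with_retries(lambda: client.responses.create(...))` inside the existing `try`. The `openai` client also retries some failures on its own (its `max_retries` setting), so a wrapper like this mainly helps with errors that outlast those built-in retries.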

---
## Activity 2 — Legal Assistant with RAG (One-Shot)

**Pipeline**
1. Load FAISS index and `metadata.json` (prebuilt).
2. Embed the query (`text-embedding-3-small`).
3. Retrieve top-K documents and format brief citations.
4. Call the **Responses API** with a system prompt + retrieved context + user question.  
The system prompt instructs the model to **reply in plain text (no markdown)** and to cite sources as `Chapter [X] Section [Y]` with a link.
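
Step 3 can be pictured as a nearest-neighbor search over vectors. The sketch below is a toy version in plain Python, assuming squared L2 distance (which is what a flat L2 FAISS index reports). The metric of the prebuilt class index is not stated here, and `top_k_neighbors` and the two-dimensional vectors are made up for illustration.

```python
from typing import List, Sequence, Tuple

def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k_neighbors(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (row_index, distance) for the k vectors closest to the query."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Three toy "document embeddings"; real ones have many more dimensions.
toy_vectors = [
    [1.0, 0.0],  # row 0
    [0.0, 1.0],  # row 1
    [0.9, 0.1],  # row 2
]
nearest = top_k_neighbors([1.0, 0.0], toy_vectors, k=2)
print([row for row, _ in nearest])  # rows 0 and 2 are closest to the query
```

The row indices returned by the search are what link a vector back to its entry in `metadata.json`.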

In [None]:
FAISS_INDEX_FILE = os.path.join(CLASS_FOLDER, "faiss_index.bin")
METADATA_FILE = os.path.join(CLASS_FOLDER, "metadata.json")
EMBEDDING_MODEL = "text-embedding-3-small"
GEN_MODEL = "gpt-4o"
TOP_K = 3

if not os.path.exists(FAISS_INDEX_FILE) or not os.path.exists(METADATA_FILE):
    raise FileNotFoundError("Missing FAISS index or metadata.json in CLASS_FOLDER.")

faiss_index = faiss.read_index(FAISS_INDEX_FILE)
with open(METADATA_FILE, "r", encoding="utf-8") as f:
    metadata = json.load(f)

def get_embedding(text: str) -> np.ndarray:
    txt = text if isinstance(text, str) else str(text)
    # Rough guard against over-long inputs: this truncates by characters, not tokens.
    txt = txt[:8150]
    r = client.embeddings.create(model=EMBEDDING_MODEL, input=txt)
    emb = r.data[0].embedding
    # FAISS expects float32 vectors.
    return np.asarray(emb, dtype=np.float32)

def retrieve_context(query: str, top_k: int = TOP_K) -> str:
    q = get_embedding(query)
    q = np.expand_dims(q, axis=0)
    # D holds distances and I holds row indices, each of shape (1, top_k);
    # FAISS reports a missing neighbor as index -1.
    D, I = faiss_index.search(q, top_k)
    lines = []
    for idx in I[0]:
        if 0 <= idx < len(metadata):
            doc = metadata[idx]
            citation = f"(Chapter {doc.get('chapter','?')} Section {doc.get('section','?')}, {doc.get('link','No link')})"
            snippet = (doc.get('full_text','').replace('\n',' ')[:200]).strip()
            lines.append(f"{citation}: {snippet}")
    if not lines:
        return ""
    return "Retrieved context:\n" + "\n".join(lines) + "\n"

BASE_SYSTEM_PROMPT = (
    "You are a legal assistant helping non-lawyers understand Massachusetts real-estate law. "
    "You are not a lawyer and cannot provide legal advice. "
    "Point users to the relevant section of the law and explain how it applies in plain English. "
    "Do not return your replies in markdown, only plain text. "
    "Cite sources as 'Chapter [Chapter] Section [Section]' with a link at the end."
)

def ask_once(question: str) -> str:
    ctx = retrieve_context(question)
    sys_prompt = BASE_SYSTEM_PROMPT + "\n" + ctx
    resp = client.responses.create(
        model=GEN_MODEL,
        input=[
            {"role":"system","content":sys_prompt},
            {"role":"user","content":question}
        ],
    )
    return (resp.output_text or "").strip()
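
`retrieve_context` silently skips any index row that has no matching entry in `metadata`, so a mismatch between the two files shows up only as missing context. The check below is a minimal sketch. It assumes the index was built with the default 1536-dimensional output of `text-embedding-3-small`, which is an assumption about the prebuilt files, and `index_alignment_problems` is a hypothetical helper.

```python
from typing import List

def index_alignment_problems(
    n_vectors: int, index_dim: int, n_metadata: int, embedding_dim: int = 1536
) -> List[str]:
    """Describe any mismatch between the vector index, the metadata, and the query embeddings."""
    problems = []
    if n_vectors != n_metadata:
        problems.append(
            f"index holds {n_vectors} vectors but metadata has {n_metadata} entries"
        )
    if index_dim != embedding_dim:
        problems.append(
            f"index dimension is {index_dim} but query embeddings have {embedding_dim} dimensions"
        )
    return problems

# Example with made-up numbers: one metadata entry is missing.
for problem in index_alignment_problems(n_vectors=100, index_dim=1536, n_metadata=99):
    print("WARNING:", problem)
```

In the notebook this could be called as `index_alignment_problems(faiss_index.ntotal, faiss_index.d, len(metadata))`; an empty list means the counts and dimensions agree.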

### Try it (one-shot)

In [None]:
q = "What are the notice requirements for eviction in Massachusetts?"
print(ask_once(q))
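
Before spending API calls on prompt changes, it can help to look at exactly what `ask_once` sends. The sketch below rebuilds the same two-message input without calling the API. `build_messages` is a hypothetical helper, and the system prompt and context strings here are placeholders rather than the notebook's real prompt or real retrieved text.

```python
from typing import Dict, List

def build_messages(system_prompt: str, context: str, question: str) -> List[Dict[str, str]]:
    """Assemble system + user messages the same way ask_once does."""
    return [
        {"role": "system", "content": system_prompt + "\n" + context},
        {"role": "user", "content": question},
    ]

example = build_messages(
    "You are a legal assistant.",  # placeholder system prompt
    "Retrieved context:\n(Chapter ? Section ?, No link): placeholder snippet\n",
    "What are the notice requirements for eviction in Massachusetts?",
)
for message in example:
    print(f"[{message['role']}]")
    print(message["content"])
```

In the notebook, `build_messages(BASE_SYSTEM_PROMPT, retrieve_context(q), q)` would show the actual prompt for a question `q` (the retrieval step itself still makes one embeddings call).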

---
## Activity 3 — Legal Assistant with RAG (Chat)

Maintains a short conversation history and augments each turn with context retrieved for the latest user message.  
Streaming is omitted for simplicity; each call returns the final text.

In [None]:
from typing import List, Dict

conversation_history: List[Dict[str,str]] = []

CHAT_SYSTEM_PROMPT = (
    "You are a legal assistant tasked with helping non-lawyers understand their questions about Massachusetts real-estate law. "
    "You do not help with any other requests. "
    "You are not a lawyer and cannot provide legal advice. "
    "Use retrieved context when available. "
    "Reply in plain text (no markdown). "
    "Cite as 'Chapter [Chapter] Section [Section]' with the link at the end."
)

def chat_once(user_text: str, max_context_msgs: int = 10) -> str:
    ctx = retrieve_context(user_text)
    sys_prompt = CHAT_SYSTEM_PROMPT + "\n" + ctx
    msgs = [{"role":"system","content":sys_prompt}] + conversation_history[-max_context_msgs:] + [
        {"role":"user","content":user_text}
    ]
    r = client.responses.create(model=GEN_MODEL, input=msgs)
    out = (r.output_text or "").strip()
    conversation_history.append({"role":"user","content":user_text})
    conversation_history.append({"role":"assistant","content":out})
    return out

def chat():
    print("RAG Chat — type 'exit' to quit.\n")
    while True:
        try:
            u = input("> ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if not u:
            continue  # skip empty input rather than sending an empty query
        if u.lower() in ("exit","quit"):
            print("Goodbye.")
            break
        reply = chat_once(u)
        print("\n" + reply + "\n")

# Uncomment to start an interactive chat session:
# chat()
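
`chat_once` limits what it sends with the slice `conversation_history[-max_context_msgs:]`. The sketch below shows how that window behaves on a toy history, including one edge case: a limit of 0 would slice as `[-0:]`, which returns the whole list rather than nothing. `recent_history` is a hypothetical helper that guards against that case.

```python
from typing import Dict, List

Message = Dict[str, str]

def recent_history(history: List[Message], max_msgs: int) -> List[Message]:
    """Return the last max_msgs messages; a limit of 0 or less returns none."""
    if max_msgs <= 0:
        return []  # history[-0:] would return every message
    return history[-max_msgs:]

# Toy history of three user/assistant exchanges (six messages).
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(6)
]

window = recent_history(history, 4)
print([m["content"] for m in window])   # the last four messages
print(len(history[-0:]), len(recent_history(history, 0)))  # full list vs. empty list
```

Because messages are stored in user/assistant pairs, an odd limit can start the window on an assistant reply whose question has been cut off, so an even `max_context_msgs` keeps exchanges intact.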