
# Semantic Search Engine (OpenAI Embeddings + FAISS)

This notebook implements a **semantic search engine** that encodes documents and queries with **OpenAI embeddings** and performs fast nearest-neighbor retrieval with **FAISS**.

**What you'll get:**
- Dataset loader for 200+ documents (a ready-made sample corpus is included, or plug in your own).
- Batched embedding of documents with OpenAI (retries failed requests with exponential backoff; averages chunk embeddings per document).
- FAISS index creation (cosine-similarity equivalent via inner product on L2-normalized vectors).
- A clean search interface that returns the most semantically relevant documents.
- Five diverse sample queries with ranked outputs.
- Clear comments and explanations of the full flow from input → embeddings → index → search results.

> **Tip:** If you do not yet have an OpenAI API key or FAISS installed, the notebook can fall back to a pure-NumPy demo embedding/index so you can still run the end-to-end flow. For your submission, run the cells in the **OpenAI + FAISS** configuration.



## Prerequisites

**Python packages** (install in a terminal or first notebook cell):
```bash
pip install -U openai faiss-cpu numpy pandas tqdm pyarrow tiktoken
# If you use conda/mamba, on some platforms:
# mamba install -c conda-forge -c pytorch faiss-cpu
```

**OpenAI API key**
- Set an environment variable before launching the notebook:
  - macOS/Linux: `export OPENAI_API_KEY="sk-..."`
  - Windows (PowerShell): `$Env:OPENAI_API_KEY="sk-..."`
- Or, you can paste it into the notebook when prompted (not recommended for shared environments).

> **Costs:** Embedding 200–500 short documents with `text-embedding-3-small` is typically inexpensive. Always review current pricing.


In [None]:

# OPTIONAL: Install dependencies in this environment (uncomment if needed).
# In some hosted notebooks (like Colab), you may need this.
# %pip -q install -U openai faiss-cpu numpy pandas tqdm pyarrow tiktoken


In [None]:

import os, sys, re, json, math, time, textwrap, glob
import hashlib  # used by the fallback hashing embedding (embed_hashing)
from pathlib import Path
from typing import List, Dict, Tuple
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# Try OpenAI SDK (new style, >= 1.0)
try:
    from openai import OpenAI
    _openai_available = True
except Exception:
    _openai_available = False
    OpenAI = None

# Try FAISS
try:
    import faiss  # type: ignore
    _faiss_available = True
except Exception:
    _faiss_available = False
    faiss = None  # type: ignore

# ---------------- Configuration ----------------
DATA_DIR = Path("./sample_corpus")           # Folder of .txt files (each file = one document)
EMBED_CACHE = Path("./embeddings.npy")       # Optional cache of document embeddings
META_CACHE = Path("./doc_meta.parquet")      # Optional cache of document metadata
INDEX_PATH = Path("./faiss.index")           # Optional saved FAISS index

# Embedding model (OpenAI). "text-embedding-3-small" (1536 dims) is a solid default.
EMBEDDING_MODEL = "text-embedding-3-small"

# Fallback embedding dimension for local demo embedding (hashing trick)
FALLBACK_DIM = 1024

# Toggles (the notebook will auto-detect what is possible):
USE_OPENAI = True        # will flip to False if no key or SDK unavailable
USE_FAISS  = True        # will flip to False if FAISS not installed

# Detect API key
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", None)
if not _openai_available or not OPENAI_API_KEY:
    USE_OPENAI = False

if not _faiss_available:
    USE_FAISS = False

print(f"OpenAI available: {_openai_available}, key found: {bool(OPENAI_API_KEY)} -> USE_OPENAI={USE_OPENAI}")
print(f"FAISS available: {_faiss_available} -> USE_FAISS={USE_FAISS}")

# DATA_DIR is resolved relative to the current working directory (normally the notebook's folder)
DATA_DIR = Path(DATA_DIR)


In [None]:

def l2_normalize(mat: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise L2 normalization."""
    norms = np.linalg.norm(mat, axis=1, keepdims=True) + eps
    return mat / norms

def batched(iterable, n: int):
    """Yield successive n-sized batches from an iterable."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch

def read_text_file(path: Path) -> str:
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()

def preview(text: str, n: int = 240) -> str:
    t = re.sub(r"\s+", " ", text.strip())
    return (t[:n] + "…") if len(t) > n else t



## 1) Data loading & preprocessing

This section loads a folder of `.txt` files (each treated as an independent document).  
The included **`sample_corpus/`** has 240+ documents across many topics.

Feel free to replace `DATA_DIR` with your own directory, or adapt the loader to read from CSV/JSON.
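As a rough sketch of the CSV adaptation mentioned above, the helper below builds rows with the same keys the `.txt` loader produces. The function name `rows_from_csv`, the column names `title`, `category`, and `text`, and the `csv-record-N` placeholder for `path` are all assumptions for illustration; adjust them to your file.

```python
import csv
import io
import re

def rows_from_csv(handle, text_col="text", title_col="title", category_col="category"):
    """Build document rows shaped like the .txt loader's output from a CSV file object.

    Column names are hypothetical defaults; pass your own if they differ.
    """
    rows = []
    for record_no, rec in enumerate(csv.DictReader(handle), start=1):
        txt = (rec.get(text_col) or "").strip()
        if not txt:
            continue  # skip records with no text
        rows.append({
            "doc_id": len(rows),                  # contiguous, so it matches the row position
            "path": f"csv-record-{record_no}",    # placeholder; there is no file per document
            "title": (rec.get(title_col) or "").strip() or txt.splitlines()[0][:80],
            "text": txt,
            "n_chars": len(txt),
            "n_words": len(re.findall(r"\w+", txt)),
            "category": (rec.get(category_col) or "").strip() or "unknown",
        })
    return rows

# Tiny in-memory CSV standing in for a real file opened with open(path, newline="")
sample = io.StringIO(
    "title,category,text\n"
    "Kyoto in spring,travel,Cherry blossoms peak in early April.\n"
    ",finance,ETFs trade intraday like stocks.\n"
    "Empty row,misc,\n"
)
rows = rows_from_csv(sample)
for r in rows:
    print(r["doc_id"], r["title"], r["category"], r["n_words"])
```

Passing the resulting list to `pd.DataFrame(rows)` should give a frame with the same columns as `docs`. Keeping `doc_id` equal to the row position matters because the search code looks documents up by position.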


In [None]:

# Ensure the sample corpus exists (if you downloaded the zip next to this notebook, unzip it here).
# If you cloned this notebook with the provided zip, run the next cell to confirm the file count.
DATA_DIR = Path("./sample_corpus")
if not DATA_DIR.exists():
    # If running this notebook in the same folder as the zip provided,
    # you can unzip it like so (uncomment):
    # import zipfile
    # with zipfile.ZipFile("sample_corpus.zip", "r") as zf:
    #     zf.extractall(".")
    pass

all_files = sorted([Path(p) for p in glob.glob(str(DATA_DIR / "**" / "*.txt"), recursive=True)])
print(f"Found {len(all_files)} text files in {DATA_DIR!s}")
assert len(all_files) >= 200, "Please provide at least 200 documents."

# Read the documents into a DataFrame with columns: doc_id, path, title, text, n_chars, n_words, category
rows = []
for doc_id, p in enumerate(all_files):
    txt = read_text_file(p)
    # title = the "Title:" header line if present, else the first line, else the file stem
    m = re.search(r"^Title:\s*(.*?)$", txt, flags=re.IGNORECASE | re.MULTILINE)
    title = m.group(1).strip() if m else (txt.splitlines()[0].strip() if txt.strip() else p.stem)
    m2 = re.search(r"^Category:\s*(.*?)$", txt, flags=re.IGNORECASE | re.MULTILINE)
    category = m2.group(1).strip() if m2 else (p.parent.name if p.parent.name else "unknown")
    n_chars = len(txt)
    n_words = len(re.findall(r"\w+", txt))
    rows.append({
        "doc_id": doc_id,
        "path": str(p),
        "title": title,
        "text": txt,
        "n_chars": n_chars,
        "n_words": n_words,
        "category": category
    })

docs = pd.DataFrame(rows)
docs.head(3)



## 2) Embedding generation

We convert each document into a dense vector so we can search by **meaning** rather than keywords.

**OpenAI path (recommended):**
- We chunk long documents by characters (2,000 per chunk with a 200-character overlap by default) so each embedded input stays small.
- Embed each chunk with `text-embedding-3-small` (or switch to `text-embedding-3-large`).
- Aggregate a **single vector per document** by averaging its chunk embeddings.
- Cache the resulting matrix and metadata to speed up re-runs.
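
To make the chunking arithmetic concrete, here is a minimal sketch that computes only the chunk boundaries, using the same defaults as the notebook (2,000 characters, 200 overlap). The helper name `chunk_spans` and the `overlap < max_chars` guard are additions for illustration, not part of the notebook's `chunk_text_by_chars`.

```python
from typing import List, Tuple

def chunk_spans(n_chars: int, max_chars: int = 2000, overlap: int = 200) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets of each overlapping chunk."""
    if overlap >= max_chars:
        # Without this guard the window would never advance.
        raise ValueError("overlap must be smaller than max_chars")
    if n_chars <= max_chars:
        return [(0, n_chars)]
    spans = []
    start = 0
    while True:
        end = min(n_chars, start + max_chars)
        spans.append((start, end))
        if end == n_chars:
            return spans
        start = end - overlap  # each step advances by max_chars - overlap

spans = chunk_spans(5000)
print(spans)  # [(0, 2000), (1800, 3800), (3600, 5000)]
```

Each step advances by `max_chars - overlap` characters, so a 5,000-character document yields three chunks. Because the document vector is an unweighted mean, the shorter final chunk counts as much as the full-length ones.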

**Fallback (demo mode, no API):**
- A lightweight **hashing trick** turns tokens into a fixed-size vector (1024 dims by default).
- This is not as accurate as OpenAI embeddings but keeps the end-to-end flow runnable anywhere.
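
The hashing trick can be illustrated in isolation. This is a simplified sketch under stated assumptions: the name `hash_embed` is made up for the example, it handles one text at a time, and it normalizes inside the function, whereas the notebook's `embed_hashing` works on a batch and leaves normalization to the caller.

```python
import hashlib
import re
import numpy as np

def hash_embed(text: str, dim: int = 1024) -> np.ndarray:
    """Map a text to a fixed-size vector by hashing each token into a bucket."""
    vec = np.zeros(dim, dtype=np.float32)
    for tok in re.findall(r"[a-z0-9_]+", text.lower()):
        # md5 is used only as a stable hash (Python's built-in hash() varies between runs)
        bucket = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    vec = np.log1p(vec)            # dampen very frequent tokens
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

a = hash_embed("faiss index search")
b = hash_embed("search index faiss")      # same tokens, different order
c = hash_embed("kyoto cherry blossoms")   # no tokens shared with a
print(np.allclose(a, b))                  # True: word order is ignored
print(float(a @ b), float(a @ c))
```

Only exact token matches contribute to similarity, apart from accidental bucket collisions, so synonyms and paraphrases score near zero. That is the main reason this mode is suitable for demos rather than real retrieval.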


In [None]:

def chunk_text_by_chars(text: str, max_chars: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into overlapping chunks by character count."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        chunk = text[start:end]
        chunks.append(chunk)
        if end == len(text):
            break
        start = end - overlap
        if start < 0:
            start = 0
    return chunks

# ---------- OpenAI embedding (if enabled) ----------
_openai_client = None
if USE_OPENAI:
    _openai_client = OpenAI(api_key=OPENAI_API_KEY)

def embed_openai(texts: List[str], model: str = EMBEDDING_MODEL, batch_size: int = 96, max_retries: int = 5) -> np.ndarray:
    """Embed a list of strings with OpenAI, returning a float32 matrix of shape [N, D]."""
    if not USE_OPENAI or _openai_client is None:
        raise RuntimeError("OpenAI embedding requested but client not available.")
    out = []
    for batch in tqdm(list(batched(texts, batch_size)), desc="Embedding (OpenAI)"):
        for attempt in range(max_retries):
            try:
                resp = _openai_client.embeddings.create(model=model, input=batch)
                # The Embeddings API returns a list in resp.data mirroring input order
                vecs = [np.array(d.embedding, dtype=np.float32) for d in resp.data]
                out.extend(vecs)
                break
            except Exception as e:
                wait = min(60, 2 ** attempt)
                print(f"OpenAI error: {e}. Retrying in {wait}s (attempt {attempt+1}/{max_retries})...")
                time.sleep(wait)
        else:
            raise RuntimeError("OpenAI embedding failed after retries.")
    return np.vstack(out)

def embed_documents_openai(docs_df: pd.DataFrame) -> Tuple[np.ndarray, int]:
    """Compute one embedding per document by averaging chunk embeddings."""
    all_chunks = []
    doc_ptrs = []
    for _, row in docs_df.iterrows():
        chunks = chunk_text_by_chars(row["text"], max_chars=2000, overlap=200)
        doc_ptrs.append((len(all_chunks), len(all_chunks) + len(chunks)))
        all_chunks.extend(chunks)
    chunk_matrix = embed_openai(all_chunks, model=EMBEDDING_MODEL)
    # Average chunks per doc
    doc_vecs = []
    for start, end in doc_ptrs:
        vec = chunk_matrix[start:end].mean(axis=0)
        doc_vecs.append(vec.astype(np.float32))
    mat = np.vstack(doc_vecs)
    return mat, mat.shape[1]

# ---------- Fallback: hashing-trick embedding (no API needed) ----------
def _tokenize(text: str) -> List[str]:
    return re.findall(r"[a-zA-Z0-9_]+", text.lower())

def embed_hashing(texts: List[str], dim: int = FALLBACK_DIM, salt: int = 13) -> np.ndarray:
    """A simple, deterministic feature-hashing embedding for demo mode."""
    mat = np.zeros((len(texts), dim), dtype=np.float32)
    for i, t in enumerate(texts):
        tokens = _tokenize(t)
        for tok in tokens:
            h = int(hashlib.md5((tok + str(salt)).encode("utf-8")).hexdigest(), 16)
            idx = h % dim
            mat[i, idx] += 1.0
        # Log-scaling dampens very frequent tokens; L2 normalization is applied later by the caller
        mat[i, :] = np.log1p(mat[i, :])
    return mat

def embed_documents_fallback(docs_df: pd.DataFrame, dim: int = FALLBACK_DIM) -> Tuple[np.ndarray, int]:
    texts = docs_df["text"].tolist()
    mat = embed_hashing(texts, dim=dim)
    return mat, dim



## 3) Build (or load) document embeddings

This will compute one vector per document and cache it to speed up future runs.
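
A cached matrix is only useful if it still matches the corpus and the embedding model. The sketch below shows one possible validity check based on shape alone; the helper `load_if_valid` is hypothetical and not part of the notebook, and a shape check cannot detect edited documents or a different model with the same dimension.

```python
import tempfile
from pathlib import Path
import numpy as np

def load_if_valid(path: Path, n_docs: int, dim: int):
    """Return the cached matrix, or None when it is missing or has the wrong shape."""
    if not path.exists():
        return None
    mat = np.load(path)
    if mat.shape != (n_docs, dim):
        return None  # stale: corpus size or embedding dimension changed
    return mat

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "embeddings.npy"
    np.save(cache, np.ones((3, 4), dtype=np.float32))
    hit = load_if_valid(cache, n_docs=3, dim=4)       # shape matches
    miss = load_if_valid(cache, n_docs=3, dim=8)      # wrong dimension
    absent = load_if_valid(Path(tmp) / "missing.npy", n_docs=3, dim=4)

print(hit is not None, miss is None, absent is None)  # True True True
```

When in doubt, deleting `embeddings.npy` and `doc_meta.parquet` forces a clean rebuild.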


In [None]:

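if EMBED_CACHE.exists() and META_CACHE.exists():
    # NOTE: the cache does not record which embedder produced it. Delete both files when
    # switching between OpenAI and fallback mode, or when the corpus changes.
    print(f"Loading cached embeddings from {EMBED_CACHE} and metadata from {META_CACHE}")
    doc_embeddings = np.load(EMBED_CACHE)
    docs = pd.read_parquet(META_CACHE)
    assert len(docs) == doc_embeddings.shape[0], "Cache mismatch: delete the cache files and re-run."
    vector_dim = doc_embeddings.shape[1]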
else:
    if USE_OPENAI:
        print("Using OpenAI embeddings...")
        doc_embeddings, vector_dim = embed_documents_openai(docs)
    else:
        print("Using fallback hashing embeddings (demo mode)...")
        doc_embeddings, vector_dim = embed_documents_fallback(docs, dim=FALLBACK_DIM)

    # Normalize for cosine similarity via inner product in FAISS
    doc_embeddings = l2_normalize(doc_embeddings.astype(np.float32))

    # Cache
    np.save(EMBED_CACHE, doc_embeddings)
    docs.to_parquet(META_CACHE, index=False)

doc_embeddings.shape, vector_dim



## 4) FAISS index creation

We build an **inner-product (IP)** index over **L2-normalized** vectors. This makes IP scores equal to cosine similarity.
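
The equivalence is easy to check numerically. The following sketch uses random vectors as stand-ins for real embeddings (an assumption made purely for illustration) and re-declares a local `l2_normalize` so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
docs_raw = rng.normal(size=(5, 8)).astype(np.float32)   # 5 stand-in document vectors
query_raw = rng.normal(size=(1, 8)).astype(np.float32)  # 1 stand-in query vector

def l2_normalize(mat, eps=1e-12):
    """Row-wise L2 normalization."""
    return mat / (np.linalg.norm(mat, axis=1, keepdims=True) + eps)

# Cosine similarity computed from its definition: dot product over the product of norms
cosine = (query_raw @ docs_raw.T) / (
    np.linalg.norm(query_raw, axis=1, keepdims=True) * np.linalg.norm(docs_raw, axis=1)
)

# Plain inner product after normalizing both sides
inner = l2_normalize(query_raw) @ l2_normalize(docs_raw).T

print(np.allclose(cosine, inner, atol=1e-5))  # True
```

Normalizing once at index time and once per query therefore lets a plain inner-product index return cosine scores.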


In [None]:

class NumpyFlatIndex:
    """Minimal in-memory index as a fallback when FAISS is unavailable."""
    def __init__(self, xb: np.ndarray):
        self.xb = xb.astype(np.float32)
    def search(self, q: np.ndarray, k: int) -> Tuple[np.ndarray, np.ndarray]:
        # q: [B, D], xb: [N, D] -> scores [B, N] via dot product (cosine if vectors are normalized)
        scores = q @ self.xb.T
        # Top-k
        idx = np.argpartition(-scores, kth=np.minimum(k, scores.shape[1]-1), axis=1)[:, :k]
        # sort within the top-k for each query
        sorted_idx = np.take_along_axis(idx, np.argsort(-np.take_along_axis(scores, idx, axis=1)), axis=1)
        sorted_scores = np.take_along_axis(scores, sorted_idx, axis=1)
        return sorted_scores, sorted_idx

# Build the index
if USE_FAISS:
    index = faiss.IndexFlatIP(vector_dim)
    index.add(doc_embeddings)  # doc_ids correspond to row indices
    print("FAISS index built:", type(index), "| size:", index.ntotal)
    # Optionally persist
    try:
        faiss.write_index(index, str(INDEX_PATH))
        print(f"Saved FAISS index to {INDEX_PATH}")
    except Exception as e:
        print("Could not save FAISS index:", e)
else:
    index = NumpyFlatIndex(doc_embeddings)
    print("Using NumpyFlatIndex fallback (FAISS not available). Size:", doc_embeddings.shape[0])



## 5) Query interface

`search(query, k)`:
- Embed the query (OpenAI or fallback)
- L2-normalize the query vector
- Look up the top-`k` nearest neighbors in the index (FAISS, or the NumPy fallback)
- Return a tidy pandas DataFrame with `rank`, `score`, `title`, `category`, `path`, and a preview snippet


In [None]:

def embed_query(text: str) -> np.ndarray:
    if USE_OPENAI:
        q = embed_openai([text], model=EMBEDDING_MODEL)
    else:
        q = embed_hashing([text], dim=doc_embeddings.shape[1])
    q = l2_normalize(q.astype(np.float32))
    return q

def search(query: str, k: int = 5) -> pd.DataFrame:
    qv = embed_query(query)
    D, I = index.search(qv, k)
    # single-query case
    scores = D[0].tolist()
    ids = I[0].tolist()
    rows = []
    for rank, (score, idx_) in enumerate(zip(scores, ids), 1):
        if idx_ < 0:
            continue  # FAISS pads results with -1 when fewer than k neighbors exist
        row = docs.iloc[idx_]
        rows.append({
            "rank": rank,
            "score": float(score),
            "doc_id": int(row.doc_id),
            "title": row.title,
            "category": row.category,
            "path": row.path,
            "preview": preview(row.text, 280)
        })
    return pd.DataFrame(rows)

# Quick smoke test
search("How do I build a FAISS index for semantic search?", k=5)



## 6) Demonstration — Five diverse queries

Run the cell below to see the ranked outputs.


In [None]:

example_queries = [
    "How do I vectorize text for semantic search?",
    "Beginner tips for visiting Kyoto in spring",
    "Common early symptoms of type 2 diabetes",
    "ETF vs mutual fund: what are the key differences?",
    "Troubleshooting slow SQL queries and index usage"
]

results = {}
for q in example_queries:
    print("\n=== Query:", q)
    df = search(q, k=5)
    display(df)
    results[q] = df

# Optionally: save a CSV of all demo results
pd.concat([df.assign(query=q) for q, df in results.items()]).to_csv("demo_results.csv", index=False)
print("Saved demo results to demo_results.csv")



## 7) Interactive search (optional)

Run this cell and type queries. Press Enter on an empty line to stop.


In [None]:

try:
    while True:
        q = input("\nEnter a query (blank to stop): ").strip()
        if not q:
            break
        df = search(q, k=5)
        display(df)
except EOFError:
    # Some environments don't support interactive input
    pass



## Appendix — Save / Load

This shows how to reload the saved artifacts later without recomputing everything.


In [None]:

def load_cached() -> Tuple[pd.DataFrame, np.ndarray, object]:
    docs_df = pd.read_parquet(META_CACHE)
    embs = np.load(EMBED_CACHE)
    if _faiss_available and INDEX_PATH.exists():
        idx = faiss.read_index(str(INDEX_PATH))
    else:
        idx = NumpyFlatIndex(embs)
    return docs_df, embs, idx

# Example usage:
# docs2, embs2, idx2 = load_cached()
# display(docs2.head(2))
# print(embs2.shape, type(idx2))



---

### Notes
- **Cosine vs inner product**: If vectors are **L2-normalized**, maximizing inner product is equivalent to maximizing cosine similarity.
- **Chunking**: Long documents are split into overlapping chunks and averaged so each doc yields a single vector.
- **Quality**: For best relevance, use `text-embedding-3-small` or `text-embedding-3-large`. The fallback hashing embedding is for offline demos only.
- **Scaling up**: For millions of docs, consider FAISS IVF/IVF+PQ or HNSW indexes and persist them to disk.
