In [None]:
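import sys, subprocess
# In Colab, install into the running kernel's interpreter and fail loudly on errors.
if "google.colab" in sys.modules:
    packages = ["pandas", "numpy", "scikit-learn", "requests", "pydantic", "jsonschema"]
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *packages], check=True)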


# Build a TF-IDF Retrieval Index

**What:** Fit a TF-IDF vectorizer on synthetic abstracts and save the index for later querying.

**Why:** Retrieval-augmented workflows rely on a consistent index; because this practice run uses only synthetic data, it is safe to repeat.

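**How:** Run the install cell in Colab if needed, confirm that the synthetic data has been generated (scripts/generate_synthetic_data.py), then execute the cells in order. TF-IDF turns each abstract into a sparse numeric vector in which terms that are frequent in that abstract but rare across the corpus receive higher weights; cosine similarity (used later) measures how closely two such vectors point in the same direction.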
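
As a rough illustration of the two ideas in the **How** step, the sketch below fits TF-IDF on three toy sentences and compares them with cosine similarity. The sentences and the `toy_` names are invented for this example and are not part of the course data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents, invented for illustration (not the course data).
docs = [
    "gene expression in yeast cells",
    "yeast gene regulation and expression",
    "stock market volatility forecast",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Pairwise cosine similarity between all rows of the TF-IDF matrix.
toy_sims = cosine_similarity(toy_matrix)

print(toy_matrix.shape)
print(toy_sims.round(2))
```

The first two sentences share several terms, so their similarity should be well above zero, while the third shares no terms with the others and should score zero against both.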

**You will learn:** How to build and persist a text index suitable for downstream querying and evaluation.

By the end of this notebook, you will have a fitted TF-IDF vectorizer and a saved index file that meet the success criteria below.

### Success criteria
- You fit a TF-IDF vectorizer on abstracts.
- You saved an index file (vector_index.pkl).
- You know the document and feature counts.

In [None]:
from pathlib import Path


def find_data_dir() -> Path:
    candidates = [Path.cwd() / "data", Path.cwd().parent / "data", Path.cwd().parent.parent / "data"]
    for candidate in candidates:
        if (candidate / "sample_texts" / "articles_sample.csv").exists():
            return candidate
    raise FileNotFoundError("data directory not found. Run scripts/generate_synthetic_data.py.")

DATA_DIR = find_data_dir()


In [None]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

articles = pd.read_csv(DATA_DIR / "sample_texts" / "articles_sample.csv")
# Drop common English stop words; treat missing abstracts as empty strings.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(articles["abstract"].fillna(""))

index_payload = {
    "vectorizer": vectorizer,
    "tfidf_matrix": tfidf_matrix,
    "article_ids": articles["article_id"].tolist(),
}
index_path = DATA_DIR / "vector_index.pkl"
with open(index_path, "wb") as handle:
    pickle.dump(index_payload, handle)
print(f"Index saved to {index_path}: {tfidf_matrix.shape[0]} documents, {tfidf_matrix.shape[1]} features")


### If you get stuck / What to try next

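**If you get stuck:** confirm that the synthetic data exists (run scripts/generate_synthetic_data.py if `find_data_dir()` raises `FileNotFoundError`), rerun the install cell, and, if memory is tight, shrink the vocabulary with `TfidfVectorizer` parameters such as `max_features` or `min_df`.

**What to try next:** query the index in pipelines/rag/rag_query.ipynb, then compare queries in pipelines/rag/rag_evaluation.ipynb.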
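
If memory is the problem, one option is to limit the vocabulary size, since the number of features sets the width of the TF-IDF matrix. The sketch below compares an unrestricted vectorizer with two restricted ones on an invented toy corpus; the thresholds (`max_features=5`, `min_df=2`) are illustrative values, not recommendations for the course data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
corpus = [
    "neural network training data",
    "neural network inference latency",
    "training data quality checks",
    "rare term appears once",
]

# No restriction: every non-stop-word term becomes a feature.
full = TfidfVectorizer(stop_words="english").fit(corpus)
# Keep only the 5 terms with the highest frequency across the corpus.
capped = TfidfVectorizer(stop_words="english", max_features=5).fit(corpus)
# Drop terms that occur in fewer than 2 documents.
pruned = TfidfVectorizer(stop_words="english", min_df=2).fit(corpus)

for name, vec in [("full", full), ("max_features=5", capped), ("min_df=2", pruned)]:
    print(name, len(vec.vocabulary_))
```

A smaller vocabulary reduces memory use but can discard rare terms that matter for retrieval, so any restriction is worth re-checking against the evaluation notebook.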