In [None]:
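import subprocess
import sys

# In Colab, install dependencies into the environment of the running kernel.
if "google.colab" in sys.modules:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "pandas", "numpy", "scikit-learn", "requests", "pydantic", "jsonschema"], check=True)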


# Cluster and Explore Topics

**What:** Turn synthetic abstracts into TF-IDF vectors and group them with k-means.

**Why:** Clustering reveals themes and helps you triage documents before deeper modeling or manual review.

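**How:** If you are running in Colab, run the install cell first. Confirm that the synthetic data has been generated, then execute the cells in order. TF-IDF converts each abstract into a numeric vector, weighting a term more heavily when it is frequent in that abstract and rare across the collection. K-means then partitions those vectors into a fixed number of clusters by assigning each one to its nearest cluster center.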

**You will learn:** How to vectorize text, cluster documents, and read cluster summaries in either Colab or a local Jupyter session.

By the end of this notebook, every abstract will carry a cluster label and you will have a per-cluster count to inspect.

### Success criteria
- You built a TF-IDF matrix and k-means clusters.
- You inspected cluster counts.
- You have cluster labels for each abstract.
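
To make the TF-IDF and distance ideas above concrete, here is a minimal standard-library sketch on a toy corpus. The three sentences, the whitespace tokenizer, and the helper names `tfidf_vector` and `distance` are illustrative assumptions, not part of this notebook's pipeline. The weighting follows the smoothed IDF form with L2 normalization that scikit-learn documents as its default, but unlike the notebook it does not remove stop words.

```python
import math
from collections import Counter

# Hypothetical toy corpus: two documents on a shared theme, one unrelated.
docs = [
    "gene expression in cancer cells",
    "cancer cells and gene mutation",
    "neural network training methods",
]

tokenized = [doc.split() for doc in docs]  # naive whitespace tokenizer
n_docs = len(tokenized)
vocab = sorted({term for tokens in tokenized for term in tokens})

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Return an L2-normalized TF-IDF vector over the shared vocabulary."""
    counts = Counter(tokens)
    weights = [
        # term count times smoothed inverse document frequency
        counts[term] * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in vocab
    ]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


def distance(a, b):
    """Euclidean distance, the measure k-means uses to compare points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


vectors = [tfidf_vector(tokens) for tokens in tokenized]

# Documents that share terms end up closer than documents that share none.
print("doc 0 vs doc 1:", distance(vectors[0], vectors[1]))
print("doc 0 vs doc 2:", distance(vectors[0], vectors[2]))
```

The first two documents share three terms, so their vectors sit closer together than the first and third, which share none. K-means relies on exactly this kind of distance when it assigns each abstract to its nearest cluster center.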

In [None]:
from pathlib import Path


def find_data_dir() -> Path:
    """Locate the data directory in the working directory or up to two levels above it."""
    candidates = [Path.cwd() / "data", Path.cwd().parent / "data", Path.cwd().parent.parent / "data"]
    for candidate in candidates:
        if (candidate / "sample_texts" / "articles_sample.csv").exists():
            return candidate
    raise FileNotFoundError("data directory not found. Run scripts/generate_synthetic_data.py.")

DATA_DIR = find_data_dir()


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

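# Load the synthetic abstracts.
articles = pd.read_csv(DATA_DIR / "sample_texts" / "articles_sample.csv")

# Build TF-IDF vectors, dropping English stop words; missing abstracts become empty strings.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(articles["abstract"].fillna(""))

# Fit k-means with 4 clusters; the fixed random_state makes the labels reproducible.
model = KMeans(n_clusters=4, random_state=42, n_init=10)
articles["cluster"] = model.fit_predict(tfidf_matrix)
articles[["title", "cluster"]].head()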


## Inspect cluster composition

In [None]:
cluster_counts = articles.groupby("cluster").size().reset_index(name="count")
cluster_counts
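
Counts show how large each cluster is, but not what it is about. One common way to summarize a cluster is to list the terms with the largest weights in its center. The helper below is a sketch of that idea; the name `top_terms_per_cluster` and the toy values in the demo are hypothetical and not part of this notebook.

```python
def top_terms_per_cluster(centers, feature_names, n_terms=5):
    """Map each cluster index to its highest-weighted terms.

    centers: one sequence of term weights per cluster (rows of a centroid matrix).
    feature_names: the term for each column, in the same order as the weights.
    """
    summaries = {}
    for cluster_id, center in enumerate(centers):
        # Rank column indices by weight, largest first.
        ranked = sorted(range(len(center)), key=lambda i: center[i], reverse=True)
        summaries[cluster_id] = [str(feature_names[i]) for i in ranked[:n_terms]]
    return summaries


# Toy demo with made-up weights for two clusters over three terms.
demo_centers = [[0.1, 0.9, 0.0], [0.5, 0.0, 0.4]]
demo_terms = ["alpha", "beta", "gamma"]
print(top_terms_per_cluster(demo_centers, demo_terms, n_terms=2))
```

With the objects fitted earlier in this notebook, a call along the lines of `top_terms_per_cluster(model.cluster_centers_, vectorizer.get_feature_names_out())` should return the leading terms for each of the four clusters, assuming a scikit-learn version that provides `get_feature_names_out`.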


### If you get stuck / What to try next

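**If you get stuck:** Confirm that the synthetic data exists. The notebook looks for `data/sample_texts/articles_sample.csv` in the working directory or one of its two parent directories; if the file is missing, run `scripts/generate_synthetic_data.py`. In Colab, rerun the first cell to install dependencies.

**What to try next:** Evaluate summaries in `pipelines/text/batch_summarization.ipynb` or explore retrieval in `pipelines/rag/build_index.ipynb`.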