## Unsupervised Thematic Analysis with K-Means 
__(on interview utterances)__
1. Load transcripts (expects heritageRoots-style JSON).
2. Vectorize text with TF-IDF (unigrams + bigrams).
3. Choose k via silhouette score.
4. Fit KMeans and interpret clusters by top terms.
5. Inspect sample quotes.
6. (optional) visualize with PCA.

In [12]:
import json
from pathlib import Path
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt


### Step 1: Load participant utterances

In [13]:
DATA = Path("data/heritageroots_ux_transcripts.json")  # adjust path if needed
with open(DATA, "r", encoding="utf-8") as f:
    raw = json.load(f)

rows = []
for p in raw["participants"]:
    pid = p["id"]
    for t in p["transcript"]:
        if t["speaker"] == "Participant":
            rows.append({"participant_id": pid, "time": t["time"], "text": t["text"]})

df = pd.DataFrame(rows)
print(f"Loaded {len(df)} participant utterances.")

Loaded 120 participant utterances.


In [14]:
df.head()

|   | participant_id | time | text |
|---|---|---|---|
| 0 | P01 | [00:00:48] | The lighting and spatial sound immediately mak... |
| 1 | P01 | [00:01:20] | Teleportation is smooth, but I’d like a quick ... |
| 2 | P01 | [00:01:54] | The icons are intuitive but slightly small. I ... |
| 3 | P01 | [00:02:16] | Picking up artifacts feels satisfying. However... |
| 4 | P01 | [00:03:08] | No lag, though the ambient sound loop is a lit... |


In [15]:
# Drop empty or duplicate lines for cleanliness
df["text"] = df["text"].fillna("").str.strip()
df = df[df["text"] != ""].drop_duplicates(subset=["participant_id", "time", "text"]).reset_index(drop=True)

In [16]:
df.head()


|   | participant_id | time | text |
|---|---|---|---|
| 0 | P01 | [00:00:48] | The lighting and spatial sound immediately mak... |
| 1 | P01 | [00:01:20] | Teleportation is smooth, but I’d like a quick ... |
| 2 | P01 | [00:01:54] | The icons are intuitive but slightly small. I ... |
| 3 | P01 | [00:02:16] | Picking up artifacts feels satisfying. However... |
| 4 | P01 | [00:03:08] | No lag, though the ambient sound loop is a lit... |


### Step 2: TF-IDF vectorization (uni + bigrams)
__Tune min_df/max_df to control noise.__


## TF–IDF (Term Frequency–Inverse Document Frequency)

### Purpose:
*   TF–IDF is a way to measure how important a word is in a document, compared to how often it appears across all documents in a dataset.
*	It’s widely used in text mining, NLP, and machine learning to turn words into numerical features.

### Term Frequency (TF)
*	TF measures how often a word appears in a single document.
*	Formula: _TF(word, document) = Number of times the word appears / Total words in the document_
*	Example: 
	*	In “VR is immersive and VR is fun,” (7 words)
	*	TF(VR) = 2/7 ≈ 0.29
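
The term-frequency ratio can be checked with a few lines of Python. This is a minimal sketch, assuming simple lowercasing and whitespace tokenization; the helper name `term_frequency` is hypothetical and is not part of the notebook or of scikit-learn.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Relative frequency of each lowercased, whitespace-split token."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # Occurrences of the word divided by the total number of tokens.
    return {word: count / total for word, count in counts.items()}


tf = term_frequency("VR is immersive and VR is fun")
print(round(tf["vr"], 2))  # "vr" occurs 2 times among 7 tokens
```

The example sentence splits into seven tokens, so the ratio for "vr" is 2/7.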

### Inverse Document Frequency (IDF)
*	IDF measures how unique or rare a word is across all documents.
*	Common words (like “the” or “and”) get lower weight
*	Formula: IDF(word) = log(Total number of documents / Number of documents containing the word)
*	Rare words get high IDF; frequent ones get low IDF.
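
A small sketch of the textbook IDF, log(total documents / documents containing the word), shows how a rare word outweighs a common one. The four toy documents and the helper name `inverse_document_frequency` are illustrative assumptions, not data or code from the study.

```python
import math


def inverse_document_frequency(word: str, documents: list[str]) -> float:
    """Textbook IDF: log(total documents / documents containing the word)."""
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if word in doc.lower().split())
    if n_containing == 0:
        raise ValueError(f"{word!r} does not occur in any document")
    return math.log(n_docs / n_containing)


# Hypothetical toy corpus, made up for illustration only.
toy_docs = [
    "teleport feels smooth",
    "the menus are small",
    "the sound loop is short",
    "the icons are intuitive",
]

idf_the = inverse_document_frequency("the", toy_docs)            # in 3 of 4 docs
idf_teleport = inverse_document_frequency("teleport", toy_docs)  # in 1 of 4 docs
print(f"the: {idf_the:.3f}  teleport: {idf_teleport:.3f}")
```

Here "teleport" appears in one document out of four and "the" in three, so "teleport" receives the larger weight.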

### Combine Them: TF × IDF
*	Multiply the two values to get the TF–IDF score.
*	Words that are frequent in one document but rare across the dataset get the highest scores.
*	Example: If “teleport” appears often only in VR usability interviews, it becomes a strong thematic indicator.
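
Note that scikit-learn's `TfidfVectorizer`, which this notebook uses, does not apply the textbook formula verbatim. With default settings it smooths the IDF as ln((1 + n) / (1 + df)) + 1 and then scales each document vector to unit L2 length. The sketch below checks both properties on a toy corpus; the three sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the transcripts.
toy_docs = [
    "teleport feels smooth",
    "menus feel small",
    "sound loop feels short",
]

vec = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
X_toy = vec.fit_transform(toy_docs)

# Document frequency: number of documents in which each term occurs.
n_docs = len(toy_docs)
doc_freq = np.bincount(X_toy.nonzero()[1], minlength=X_toy.shape[1])

# Smoothed IDF used by scikit-learn when smooth_idf=True.
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(np.allclose(vec.idf_, expected_idf))

# Each row is L2-normalized, so its Euclidean length is 1.
row_norms = np.sqrt(np.asarray(X_toy.multiply(X_toy).sum(axis=1)).ravel())
print(np.allclose(row_norms, 1.0))
```

Because rows have unit length, Euclidean distance between two TF-IDF rows is a monotonic function of their cosine similarity, which is one reason k-means on TF-IDF vectors is commonly paired with a cosine silhouette.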

### Why It’s Useful
*	Reduces noise from common filler words.
*	Highlights topic-specific terms that distinguish documents.
*	Used in:
	*	Keyword extraction
	*	Thematic clustering (like k-means)
	*	Search and information retrieval
	*	Text classification and sentiment analysis
*	Quick Analogy
	*	Imagine each document is a conversation.
	*	TF–IDF finds the words people say a lot in one conversation that others rarely mention — those are your unique themes.

In [17]:
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    min_df=2,          # ignore very rare terms; tweak as needed
    max_df=0.85        # drop very common terms
)
X = vectorizer.fit_transform(df["text"])
terms = np.array(vectorizer.get_feature_names_out())
print(f"TF-IDF matrix shape: {X.shape}")

TF-IDF matrix shape: (120, 130)


### Pick k with silhouette score (coarse scan over k = 2 to 8)

In [19]:
def choose_k(X, k_values=(2,3,4,5,6,7,8)):
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
        labels = km.fit_predict(X)
        score = silhouette_score(X, labels, metric="cosine")
        scores[k] = score
        print(f"k={k:<2}  silhouette={score:.4f}")
    best_k = max(scores, key=scores.get)
    print(f"\nChosen k = {best_k} (max silhouette)")
    return best_k, scores

best_k, scores = choose_k(X, k_values=range(2,9))

k=2   silhouette=0.3412
k=3   silhouette=0.5044
k=4   silhouette=0.6683
k=5   silhouette=0.8386
k=6   silhouette=1.0000
k=7   silhouette=1.0000
k=8   silhouette=1.0000

Chosen k = 6 (max silhouette)
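
A silhouette of exactly 1.0000 usually means every point in a cluster has the same vector, which would be consistent with the same utterance text recurring across participants (as the sample quotes further down suggest). One way to check is to count distinct utterances before trusting the chosen k. The sketch below is a hedged illustration: the helper name `duplicate_text_report` and the demo rows are hypothetical stand-ins, not the study data.

```python
import pandas as pd


def duplicate_text_report(frame: pd.DataFrame, text_col: str = "text") -> pd.Series:
    """Count how many rows share each distinct utterance, most repeated first."""
    return frame[text_col].value_counts()


# Hypothetical demo rows; shortened stand-in sentences, not real quotes.
demo = pd.DataFrame(
    {
        "participant_id": ["P01", "P18", "P16", "P01"],
        "text": [
            "icons are small",
            "icons are small",
            "icons are small",
            "sound loop is short",
        ],
    }
)

counts = duplicate_text_report(demo)
n_unique = demo["text"].nunique()
print(counts)
print(f"{len(demo)} rows, {n_unique} distinct utterances")
```

In the notebook, comparing `df["text"].nunique()` with `best_k` would show whether k-means is merely grouping exact duplicates; if the two numbers match, the clusters reflect repeated text rather than discovered themes.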




### Fit KMeans with chosen k

In [22]:
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init="auto")
labels = kmeans.fit_predict(X)
df["cluster"] = labels

### Inspect clusters: top terms + sample utterances

In [23]:
def top_terms_per_cluster(kmeans, terms, topn=12):
    centers = kmeans.cluster_centers_
    for k in range(centers.shape[0]):
        idx = np.argsort(centers[k])[::-1][:topn]
        print(f"\n=== Cluster {k} : top {topn} terms ===")
        print(", ".join(terms[idx]))

top_terms_per_cluster(kmeans, terms, topn=12)


=== Cluster 0 : top 12 terms ===
flat, staying, prefer menus, icons, icons intuitive, menus curve, menus, slightly, slightly small, small prefer, curve view, curve

=== Cluster 1 : top 12 terms ===
wasn sure, immersive, noticed floating, make feel, make, wasn, lighting spatial, lighting, immersive noticed, immediately make, panels wasn, immediately

=== Cluster 2 : top 12 terms ===
preview, smooth like, fade, fade helps, quick preview, quick, preview ll, land fade, help, help directionally, helps, helps maybe

=== Cluster 3 : top 12 terms ===
small vibration, artifacts feels, feedback grabbed, feedback, satisfying expected, satisfying, feels satisfying, expected haptic, expected, maybe small, feels, artifacts

=== Cluster 4 : top 12 terms ===
intuitive overall, overlay launch, launch, interaction, interaction cues, cues maybe, cues, overlay, overall just, just, just tweak, pretty intuitive

=== Cluster 5 : top 12 terms ===
little short, sound loop, repetition, lag ambient, ambient, am

In [24]:
# Show a few sample quotes per cluster
def show_samples(df, cluster_id, n=5):
    examples = df[df["cluster"] == cluster_id].sample(min(n, (df["cluster"] == cluster_id).sum()), random_state=42)
    print(f"\n--- Sample quotes from Cluster {cluster_id} ---")
    for _, r in examples.iterrows():
        print(f"[{r['participant_id']} {r['time']}] {r['text']}")

for k in range(best_k):
    show_samples(df, k, n=3)


--- Sample quotes from Cluster 0 ---
[P01 [00:01:54]] The icons are intuitive but slightly small. I prefer when menus curve around my view instead of staying flat in front of me.
[P18 [00:01:56]] The icons are intuitive but slightly small. I prefer when menus curve around my view instead of staying flat in front of me.
[P16 [00:01:38]] The icons are intuitive but slightly small. I prefer when menus curve around my view instead of staying flat in front of me.

--- Sample quotes from Cluster 1 ---
[P01 [00:00:48]] The lighting and spatial sound immediately make it feel immersive. I noticed the floating panels, but wasn’t sure which gesture activates them.
[P18 [00:00:43]] The lighting and spatial sound immediately make it feel immersive. I noticed the floating panels, but wasn’t sure which gesture activates them.
[P16 [00:00:46]] The lighting and spatial sound immediately make it feel immersive. I noticed the floating panels, but wasn’t sure which gesture activates them.

--- Sample quo