# This Notebook is concerned with clustering the preprocessed Item Names (Interpretable column)

# Table of contents
- [Overview](#overview)
    - [We'll be dealing with](#well-be-dealing-with)
    - [Approach](#approach)
- [Setup and load data](#setup-and-load-data)
    - [Initialize embedder wiki.trimmed.align.vec](#initialize-embedder-wikitrimmedalignvec)

- [🚨 Find clusters](#find-clusters)
- [Manual renaming of topics](#manual-renaming-of-topics)
- [🚨 Analysis ready dataset](#analysis-ready-dataset)
- [Appendix](#appendix)
    - [Testing BERTopic](#testing-bertopic)

# Overview:

## We'll be dealing with:

- The dataset has been cleaned in the two previous notebooks (EDA -> NLP)
    - Numerical values handled in EDA.ipynb, and Item Name with most of its underlying issues was handled in NLP.ipynb
- Now we'll focus on clustering and categorizing the embedded tokens and entries in general.

## Approach:

1. Find clusters in embedded tokens
    - Find each row's membership to the clusters
    - Assign rows to clusters based on token voting
2. Manually rename clusters and row level cluster membership to interpretable categories
    - Could call an API here or analyze myself if clear enough
3. Produce a dataset that is ready for spend analysis
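
Step 1's token voting can be sketched on toy data as follows. This is a minimal illustration, not the notebook's actual pipeline: the helper name `vote_row_clusters` and the toy arrays are hypothetical, and it assumes each token instance already carries a row id and a hard cluster id.

```python
import numpy as np

def vote_row_clusters(row_ids, token_clusters, n_rows, n_clusters):
    """Return per-row vote shares (n_rows x n_clusters) and the winning cluster per row."""
    counts = np.zeros((n_rows, n_clusters), dtype=float)
    # Each token instance casts one vote for its own cluster in its source row.
    np.add.at(counts, (row_ids, token_clusters), 1.0)
    totals = counts.sum(axis=1, keepdims=True)
    # Normalize votes to shares; rows without tokens keep all-zero shares.
    shares = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    # -1 marks rows that had no tokens to vote with.
    winners = np.where(totals[:, 0] > 0, shares.argmax(axis=1), -1)
    return shares, winners

# Toy data (hypothetical): 3 rows, 5 token instances, 3 token clusters.
row_ids = np.array([0, 0, 0, 1, 1])
token_clusters = np.array([2, 2, 0, 1, 1])
shares, winners = vote_row_clusters(row_ids, token_clusters, n_rows=3, n_clusters=3)
print(winners)
```

Row 0 has two tokens in cluster 2 and one in cluster 0, so cluster 2 wins; row 2 has no tokens and is left unassigned.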

# Setup and load data

In [1]:
# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

import re
from collections import Counter
from collections import defaultdict

from gensim.models import KeyedVectors
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from umap import UMAP
import umap
import hdbscan



In [10]:
data_path = "../data/checkpoints/fully_preprocessed_item_names.xlsx"
df_original = pd.read_excel(data_path)
df = df_original.copy()

## Initialize embedder wiki.trimmed.align.vec

In [11]:
# Load the trimmed aligned vectors (300-D, word2vec text format)
kv = KeyedVectors.load_word2vec_format("../assets/wiki.trimmed.align.vec", binary=False)
dim = kv.vector_size  # Should be 300
dim

300

# Find clusters

Now that we have preprocessed Item Names (under the Interpretable column) and an AR-EN aligned embedder, we can embed the tokens and look for clusters within the Item Names. This was the goal all along: once we find the clusters, we can derive appropriate categories for analysis without assuming the categories beforehand.
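
The soft side of cluster membership may be easier to follow on toy data. The sketch below assumes unit-normalized token vectors and cluster centroids; the function name `soft_memberships`, the 2-D vectors, and the centroids are illustrative only. It turns cosine similarities into a per-token distribution with a temperature softmax, then averages the token distributions to get one row's membership.

```python
import numpy as np

def soft_memberships(token_vecs, centroids, temp=0.05):
    """Cosine similarity to each centroid, turned into a distribution by a temperature softmax."""
    X = token_vecs / np.linalg.norm(token_vecs, axis=1, keepdims=True)
    C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cos = X @ C.T                                        # (n_tokens, k)
    z = (cos - cos.max(axis=1, keepdims=True)) / temp    # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy data (hypothetical): two centroids and the three tokens of a single row.
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
token_vecs = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]])
P = soft_memberships(token_vecs, centroids)
row_membership = P.mean(axis=0)   # average the token distributions for the row
print(P.round(3), row_membership.round(3))
```

A low temperature such as 0.05 makes each token's distribution nearly one-hot unless the token sits almost exactly between centroids, as the middle token does here.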

In [20]:
# ===== Processing: tokens → token clusters → row memberships → entry clusters ===== NAIVE Approach - Frequency Matters

df = pd.read_excel("../data/checkpoints/fully_preprocessed_item_names.xlsx")

# ----- Hyperparameters -----
mixnlp_n_clusters     = 20      # Total clusters for tokens
mixnlp_random_state   = 42      # For reproducibility
mixnlp_umap_neighbors = 30      # More neighbors -> care more about global alignment of the token
mixnlp_umap_min_dist  = 0.05    # UMAP param
mixnlp_remove_top_pcs = 5       # No. of highest-variance principal components to remove - reduces noise in embeddings
mixnlp_softmax_temp   = 0.05    # Param for softmax stability

# ----- Entry to string tokens -----
def _tokens(s):
    return [t for t in s.split() if t] if isinstance(s, str) else []
k2i = kv.key_to_index # Dict = {word: row index into kv.vectors}

# Now we'll record every instance of the tokens in a new df (totaling 10824 instances from the 3150 entries = ~3 tokens/row)
rows = []
for rid, text in enumerate(df["Interpretable"].to_numpy()): # Get row id (rid) and the text -> break into tokens
    for tok in _tokens(text):
        if tok in k2i: # All tokens are found in kv; only a sanity check
            rows.append((rid, tok, kv.get_vector(tok)))     # Append every token seen as the (rid, token, and its embedding)
mixnlp_tok_df = pd.DataFrame(rows, columns=["row_id", "token", "vec"]) # Create a new df with logging every instance for every token 

# ----- Cosine geometry + optional PCA denoise -----
Xt = np.vstack(mixnlp_tok_df["vec"].to_numpy()).astype(np.float32) # Shape: 10824X300
Xn = normalize(Xt, norm="l2", axis=1)               # Normalize across row (row-wise)
Xc = Xn - Xn.mean(axis=0, keepdims=True)            # Center around mean for PCA

if mixnlp_remove_top_pcs > 0:
    pca = PCA(n_components=min(256, Xc.shape[1]), random_state=mixnlp_random_state)
    Z = pca.fit_transform(Xc)                       # Shape: 10824X256
    Z[:, :mixnlp_remove_top_pcs] = 0                # Zero out the top PCs; shape stays 10824X256
    Xproc = Z @ pca.components_             # (10824X256) @ (256X300) = 10824X300, back in embedding space
    Xproc = normalize(Xproc, axis=1)                # Restore unit norm row-wise
else:
    Xproc = Xn                                      # No modification by PCA

# ----- Token clustering (KMeans) -----
tok_kmeans = KMeans(n_clusters=mixnlp_n_clusters, random_state=mixnlp_random_state, n_init='auto')
mixnlp_tok_df["token_cluster"] = tok_kmeans.fit_predict(Xproc) # Fit the 10824 tokens by k means and record the cluster id

# ----- 2D coords for tokens (for plotting later) -----
um = umap.UMAP(n_neighbors=mixnlp_umap_neighbors, min_dist=mixnlp_umap_min_dist,
               metric="cosine", random_state=mixnlp_random_state)
X2 = um.fit_transform(Xproc)
mixnlp_tok_df["x"] = X2[:, 0]
mixnlp_tok_df["y"] = X2[:, 1]

# ----- Row-level soft/vote memberships -----
def _softmax(z, temp=mixnlp_softmax_temp): # A vectorized [0, 1] cluster vote for each token based on its cluster similarities (z)
    z = z - z.max()
    e = np.exp(z / max(temp, 1e-6))
    return e / (e.sum() + 1e-9)

C  = normalize(tok_kmeans.cluster_centers_, axis=1)  # (k, d) normalized centroids in kX300
Xp = normalize(Xproc, axis=1)                        # (n_tok, d) normalized embedded tokens in 10824X300
cos_tok_cent = Xp @ C.T                              # (n_tok, k) the closeness of each of the tokens to each of the k centroids

tok_soft = np.apply_along_axis(_softmax, 1, cos_tok_cent) # Cluster similarity -> soft vote / as opposed to token_cluster (hard vote)
soft_mat = pd.DataFrame(tok_soft, columns=[f"mixnlp_Psoft_c{c}" for c in range(mixnlp_n_clusters)]) # Store each soft vote in a col.
soft_mat["row_id"] = mixnlp_tok_df["row_id"].values # Add a col to indicate the row the token belongs to
soft_rows = soft_mat.groupby("row_id").mean().reindex(range(len(df)), fill_value=0.0) # Group all the soft votes by the 3150 entries

vote_counts = ( # Hard vote counts for each of the original 3150 for each cluster shape: 3150Xk 
    mixnlp_tok_df.groupby(["row_id","token_cluster"]).size()
    .unstack(fill_value=0).reindex(range(len(df)), fill_value=0) # How many tokens were assigned to cluster k per row?
)
vote_rows = vote_counts.div(vote_counts.sum(axis=1).replace(0,1), axis=0) # Value at (row_i, k_i) / total votes at (row_i, k)
vote_rows.columns = [f"mixnlp_Pvote_c{c}" for c in vote_rows.columns] # Again, votes labeled entry (row) per each cluster (col)

# ----- Write memberships into df (overwrite to avoid duplicates) -----
soft_cols = [f"mixnlp_Psoft_c{c}" for c in range(mixnlp_n_clusters)]
vote_cols = [f"mixnlp_Pvote_c{c}" for c in range(mixnlp_n_clusters)]
df.drop(columns=[c for c in df.columns if c in soft_cols + vote_cols], errors="ignore", inplace=True)

for c in soft_cols:
    df[c] = soft_rows[c].values # Soft memberships: each token contributes to every cluster in proportion to its similarity
for c in vote_cols:
    if c in vote_rows.columns:
        df[c] = vote_rows[c].values # Hard-vote shares - tested but not used; soft memberships delivered better entry clusters
    else:
        df[c] = 0.0  # Ensure shape is 3150Xk

# ----- Entry-level clustering on row features (prefer soft) -----
feat_cols = soft_cols if set(soft_cols).issubset(df.columns) else vote_cols
X_entry = df[feat_cols].to_numpy()

entry_k = max(5, min(40, mixnlp_n_clusters // 2)) # Reduce number of clusters for entry (generalize the clusters)
entry_kmeans = KMeans(n_clusters=entry_k, random_state=mixnlp_random_state, n_init='auto')
df["entry_cluster"] = entry_kmeans.fit_predict(X_entry) # Finally, cluster the entries based on their row-level membership features

# Keep row text handy for plotting hovers
df["row_text"] = df["Interpretable"].astype(str)

# Map entry_cluster back to tokens for hover
row_to_entry = df["entry_cluster"].to_dict()
mixnlp_tok_df["entry_cluster"] = mixnlp_tok_df["row_id"].map(row_to_entry)
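
The PCA denoising step in the cell above can be illustrated in isolation. This sketch uses a plain NumPy SVD and synthetic data, and the helper name `remove_top_components` is hypothetical. Unlike the cell above, which keeps only the first 256 components, it retains every direction except the removed ones, so it approximates the same idea rather than reproducing it.

```python
import numpy as np

def remove_top_components(X, n_remove):
    """Project out the n_remove highest-variance directions of X and re-normalize rows."""
    Xc = X - X.mean(axis=0, keepdims=True)              # center before PCA
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_remove]                                  # (n_remove, d)
    X_clean = Xc - (Xc @ top.T) @ top                    # subtract the projection onto the top directions
    norms = np.linalg.norm(X_clean, axis=1, keepdims=True)
    return X_clean / np.maximum(norms, 1e-12), top       # restore unit norm row-wise

# Synthetic data (hypothetical): 200 vectors in 10-D with one dominant direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] *= 20.0
X_clean, top = remove_top_components(X, n_remove=1)
print(X_clean.shape)
```

After the removal, the cleaned vectors have no component along the dominant direction, which is why a few very common directions stop dominating the cosine similarities.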

In [None]:
# ===== Plotting: token scatter + token table + row table =====


# ---- Token scatter (color = token_cluster), show row + categories ----
tok_plot_df = mixnlp_tok_df.merge(
    df[["row_text", "entry_cluster"]].reset_index(drop=True).rename_axis("row_id").reset_index(),
    on=["row_id", "entry_cluster"],
    how="left"
)

fig_tok = px.scatter(
    tok_plot_df, x="x", y="y",
    color=tok_plot_df["token_cluster"].astype(str),
    hover_data={
        "token": True,
        "token_cluster": True,
        "row_id": True,
        "entry_cluster": True,
        "row_text": True,
        "x": False, "y": False
    },
    title="Tokens: clusters (color) with source row and entry-cluster"
)
fig_tok.update_traces(marker=dict(size=7, opacity=0.85))
fig_tok.update_layout(legend_title_text="Token Cluster")
fig_tok.show()

# ---- Table 1: token-cluster representatives (freq + TF-IDF) ----
tokens_by_cluster = mixnlp_tok_df.groupby("token_cluster")["token"].apply(list)
num_clusters = tokens_by_cluster.shape[0]

# frequency + unique
rep_rows = []
for c in range(num_clusters):
    toks = pd.Series(tokens_by_cluster.get(c, []))
    vc = toks.value_counts()
    rep_rows.append({
        "Cluster": c,
        "Count": int(len(toks)),
        "Top Frequent": ", ".join(vc.index[:8].tolist()),
        "Unique/Representative": ", ".join(vc[vc == 1].index[:6].tolist())
    })
rep_freq = pd.DataFrame(rep_rows).sort_values("Cluster").reset_index(drop=True)

# TF-IDF across clusters
dfreq = Counter()
for toks in tokens_by_cluster:
    for t in set(toks):
        dfreq[t] += 1
Ndocs = len(tokens_by_cluster)

def _tfidf_top(tokens, k=8):
    tf = Counter(tokens); total = sum(tf.values()) or 1
    scored = [(t, (f/total) * (np.log((Ndocs+1)/(1+dfreq[t])) + 1.0)) for t, f in tf.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [t for t, _ in scored[:k]]

rep_tfidf = pd.DataFrame(
    [{"Cluster": c, "TFIDF Representatives": ", ".join(_tfidf_top(tokens_by_cluster.get(c, []), 8))}
     for c in range(num_clusters)]
).sort_values("Cluster").reset_index(drop=True)

token_table = rep_freq.merge(rep_tfidf, on="Cluster")
token_table["Auto Label (TF-IDF top3)"] = token_table["TFIDF Representatives"].apply(
    lambda s: " / ".join(s.split(", ")[:3]) if isinstance(s, str) else ""
)
token_table = token_table[
    ["Cluster","Auto Label (TF-IDF top3)","Count","Top Frequent","Unique/Representative","TFIDF Representatives"]
]

fig_token_tbl = go.Figure(data=[go.Table(
    header=dict(values=list(token_table.columns), fill_color='lightgrey', align='left'),
    cells=dict(values=[token_table[c] for c in token_table.columns], fill_color='white', align='left')
)])
fig_token_tbl.update_layout(title="Token-Cluster Representatives")
fig_token_tbl.show()

# ---- Table 2: row (entry-cluster) summary ----
soft_cols = [c for c in df.columns if c.startswith("mixnlp_Psoft_c")]
entry_means = df.groupby("entry_cluster")[soft_cols].mean()

def _topk_token_clusters(row, k=5):
    vals = row.values
    idx = np.argsort(vals)[::-1][:k]
    labs = [f"c{int(row.index[j].split('mixnlp_Psoft_c')[1])}" for j in idx]
    return ", ".join(labs)

# tokens per entry-cluster for TF-IDF
tokens_by_entry = (
    mixnlp_tok_df
    .groupby("entry_cluster")["token"]
    .apply(list)
    .reindex(sorted(df["entry_cluster"].unique()), fill_value=[])
)
N_docs_ec = len(tokens_by_entry)
dfreq_ec = Counter()
for toks in tokens_by_entry:
    for t in set(toks):
        dfreq_ec[t] += 1

def _tfidf_top_tokens(tokens, k=10):
    tf = Counter(tokens); total = sum(tf.values()) or 1
    scored = [(t, (f/total) * (np.log((N_docs_ec+1)/(1+dfreq_ec[t])) + 1.0)) for t, f in tf.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return ", ".join([t for t, _ in scored[:k]])

entry_sizes = df["entry_cluster"].value_counts().sort_index()
examples = {
    c: " | ".join(df.loc[df["entry_cluster"] == c, "row_text"].head(3).tolist())
    for c in entry_sizes.index
}

row_table = pd.DataFrame({
    "Entry Cluster": entry_sizes.index,
    "Count": entry_sizes.values,
    "Top Token-Clusters (mean soft)": entry_means.apply(_topk_token_clusters, axis=1).reindex(entry_sizes.index).fillna(""),
    "Top Tokens (TF-IDF)": tokens_by_entry.apply(_tfidf_top_tokens).reindex(entry_sizes.index).fillna(""),
    "Example Rows": pd.Series(examples).reindex(entry_sizes.index).fillna("")
}).reset_index(drop=True)

fig_row_tbl = go.Figure(data=[go.Table(
    header=dict(values=list(row_table.columns), fill_color='lightgrey', align='left'),
    cells=dict(values=[row_table[c] for c in row_table.columns], fill_color='white', align='left')
)])
fig_row_tbl.update_layout(title="Row (Entry-Cluster) Summary")
fig_row_tbl.show()

In [21]:
# Approach 2: Downweighting
# =========================

# ---- Imports (safe re-imports) ----
import numpy as np, pandas as pd
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
import umap
import plotly.express as px
import plotly.graph_objects as go

# ---- Config / Hyperparameters ----
mixnlp_n_clusters     = 20      # number of token clusters
mixnlp_random_state   = 42
mixnlp_umap_neighbors = 30
mixnlp_umap_min_dist  = 0.05
mixnlp_softmax_temp   = 0.05

# Optional: pruning toggles (keep False for pure downweighting)
PRUNE_COMMON  = False  # drop tokens appearing in > pr_common_frac of rows
PRUNE_RARE    = False  # drop tokens with df < pr_rare_min
pr_common_frac = 0.60
pr_rare_min    = 2

# ---- Tokenization to token-instance dataframe ----
def _tokens(s):
    return [t for t in s.split() if t] if isinstance(s, str) else []

# `kv` = embedding lookup (e.g., gensim KeyedVectors); `df["Interpretable"]` exists
# Build (row_id, token, vec) for each token instance
rows = []
for rid, text in enumerate(df["Interpretable"].to_numpy()):
    for tok in _tokens(text):
        if tok in kv:  # membership works with gensim KeyedVectors; for dict, same
            rows.append((rid, tok, kv.get_vector(tok)))
mixnlp_tok_df = pd.DataFrame(rows, columns=["row_id", "token", "vec"])

# ---- (Optional) pruning by document frequency ----
if PRUNE_COMMON or PRUNE_RARE:
    df_per_tok = mixnlp_tok_df[["token","row_id"]].drop_duplicates().groupby("token")["row_id"].size()
    keep = pd.Series(True, index=df_per_tok.index)
    if PRUNE_COMMON:
        keep &= (df_per_tok <= pr_common_frac * len(df))
    if PRUNE_RARE:
        keep &= (df_per_tok >= pr_rare_min)
    keep = keep[keep].index
    mixnlp_tok_df = mixnlp_tok_df[mixnlp_tok_df["token"].isin(keep)].reset_index(drop=True)

# ---- Cosine geometry (no PCA) ----
Xt = np.vstack(mixnlp_tok_df["vec"].to_numpy()).astype(np.float32)  # (n_tok, d)
Xproc = normalize(Xt, norm="l2", axis=1)                             # unit-norm tokens

# ---- Token clustering (KMeans on instances) ----
tok_kmeans = KMeans(n_clusters=mixnlp_n_clusters, random_state=mixnlp_random_state, n_init='auto')
mixnlp_tok_df["token_cluster"] = tok_kmeans.fit_predict(Xproc)

# ---- 2D coords for tokens (UMAP, cosine) ----
um = umap.UMAP(
    n_neighbors=mixnlp_umap_neighbors,
    min_dist=mixnlp_umap_min_dist,
    metric="cosine",
    random_state=mixnlp_random_state
)
X2 = um.fit_transform(Xproc)
mixnlp_tok_df["x"] = X2[:, 0]
mixnlp_tok_df["y"] = X2[:, 1]

# ---- Row-level memberships with DOWNWEIGHTING ----
# Vectorized cosine similarities token->centroids
C  = normalize(tok_kmeans.cluster_centers_, axis=1)   # (k, d)
Xp = normalize(Xproc, axis=1)                         # (n_tok, d) (redundant normalize but safe)
cos_tok_cent = Xp @ C.T                               # (n_tok, k)

# Vectorized softmax with temperature
temp = max(mixnlp_softmax_temp, 1e-6)
S = cos_tok_cent - cos_tok_cent.max(axis=1, keepdims=True)
E = np.exp(S / temp)
tok_soft = E / (E.sum(axis=1, keepdims=True) + 1e-9)  # (n_tok, k)

# Per-instance weights: IDF by default (rows as documents)
df_per_tok = (
    mixnlp_tok_df[["token","row_id"]]
    .drop_duplicates()
    .groupby("token")["row_id"].size()
)
idf_map = np.log((len(df) + 1) / (df_per_tok + 1)) + 1.0
w_tok = mixnlp_tok_df["token"].map(idf_map).fillna(1.0).to_numpy()

# Weighted SOFT memberships (row-wise weighted mean of token soft distributions)
soft_cols = [f"mixnlp_Psoft_c{c}" for c in range(mixnlp_n_clusters)]
soft_df = pd.DataFrame(tok_soft, columns=soft_cols)
soft_df["row_id"] = mixnlp_tok_df["row_id"].values
soft_df["w"] = w_tok

num_soft = soft_df.groupby("row_id").apply(
    lambda g: (g[soft_cols].to_numpy() * g["w"].to_numpy()[:, None]).sum(axis=0)
)
den_soft = soft_df.groupby("row_id")["w"].sum().replace(0, 1.0)

soft_rows = (
    pd.DataFrame(np.vstack(num_soft.values) / den_soft.to_numpy()[:, None],
                 index=num_soft.index, columns=soft_cols)
    .reindex(range(len(df)), fill_value=0.0)
)

# Weighted HARD (vote) memberships (row-wise weighted histogram of token_cluster)
mixnlp_tok_df["_w"] = w_tok
vote_weights = mixnlp_tok_df.pivot_table(
    index="row_id", columns="token_cluster", values="_w", aggfunc="sum", fill_value=0.0
).reindex(range(len(df)), fill_value=0.0)
vote_rows = vote_weights.div(vote_weights.sum(axis=1).replace(0,1), axis=0)
vote_rows.columns = [f"mixnlp_Pvote_c{c}" for c in vote_rows.columns]

# ---- Write memberships into df (prefer soft) ----
vote_cols = [f"mixnlp_Pvote_c{c}" for c in range(mixnlp_n_clusters)]
df.drop(columns=[c for c in df.columns if c in soft_cols + vote_cols], errors="ignore", inplace=True)

for c in soft_cols:
    df[c] = soft_rows[c].values
for c in vote_cols:
    if c in vote_rows.columns:
        df[c] = vote_rows[c].values
    else:
        df[c] = 0.0  # ensure rectangular shape

# ---- Entry-level clustering on row features (prefer soft) ----
feat_cols = soft_cols if set(soft_cols).issubset(df.columns) else vote_cols
X_entry = df[feat_cols].to_numpy()
entry_k = max(5, min(40, mixnlp_n_clusters // 2))
entry_kmeans = KMeans(n_clusters=entry_k, random_state=mixnlp_random_state, n_init='auto')
df["entry_cluster"] = entry_kmeans.fit_predict(X_entry)

# Keep row text for hover
df["row_text"] = df["Interpretable"].astype(str)

# Map entry_cluster back to tokens for hover
row_to_entry = df["entry_cluster"].to_dict()
mixnlp_tok_df["entry_cluster"] = mixnlp_tok_df["row_id"].map(row_to_entry)

# =======================
# Visualization section
# =======================

# ---- Token scatter (color = token_cluster), with row hovers ----
tok_plot_df = mixnlp_tok_df.merge(
    df[["row_text", "entry_cluster"]].reset_index(drop=True).rename_axis("row_id").reset_index(),
    on=["row_id", "entry_cluster"],
    how="left"
)
fig_tok = px.scatter(
    tok_plot_df, x="x", y="y",
    color=tok_plot_df["token_cluster"].astype(str),
    hover_data={
        "token": True,
        "token_cluster": True,
        "row_id": True,
        "entry_cluster": True,
        "row_text": True,
        "x": False, "y": False
    },
    title="Tokens: clusters (color) with source row and entry-cluster (IDF-weighted row memberships)"
)
fig_tok.update_traces(marker=dict(size=7, opacity=0.85))
fig_tok.update_layout(legend_title_text="Token Cluster")
fig_tok.show()

# ---- Table 1: token-cluster representatives (freq + TF-IDF across clusters) ----
tokens_by_cluster = mixnlp_tok_df.groupby("token_cluster")["token"].apply(list)
num_clusters = tokens_by_cluster.shape[0]

# frequency + unique
rep_rows = []
for c in range(num_clusters):
    toks = pd.Series(tokens_by_cluster.get(c, []))
    vc = toks.value_counts()
    rep_rows.append({
        "Cluster": c,
        "Count": int(len(toks)),
        "Top Frequent": ", ".join(vc.index[:8].tolist()),
        "Unique/Representative": ", ".join(vc[vc == 1].index[:6].tolist())
    })
rep_freq = pd.DataFrame(rep_rows).sort_values("Cluster").reset_index(drop=True)

# TF-IDF across clusters (treat each cluster as a "doc")
dfreq = Counter()
for toks in tokens_by_cluster:
    for t in set(toks):
        dfreq[t] += 1
Ndocs = len(tokens_by_cluster)

def _tfidf_top(tokens, k=8):
    tf = Counter(tokens); total = sum(tf.values()) or 1
    scored = [(t, (f/total) * (np.log((Ndocs+1)/(1+dfreq[t])) + 1.0)) for t, f in tf.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [t for t, _ in scored[:k]]

rep_tfidf = pd.DataFrame(
    [{"Cluster": c, "TFIDF Representatives": ", ".join(_tfidf_top(tokens_by_cluster.get(c, []), 8))}
     for c in range(num_clusters)]
).sort_values("Cluster").reset_index(drop=True)

token_table = rep_freq.merge(rep_tfidf, on="Cluster")
token_table["Auto Label (TF-IDF top3)"] = token_table["TFIDF Representatives"].apply(
    lambda s: " / ".join(s.split(", ")[:3]) if isinstance(s, str) else ""
)
token_table = token_table[
    ["Cluster","Auto Label (TF-IDF top3)","Count","Top Frequent","Unique/Representative","TFIDF Representatives"]
]

fig_token_tbl = go.Figure(data=[go.Table(
    header=dict(values=list(token_table.columns), fill_color='lightgrey', align='left'),
    cells=dict(values=[token_table[c] for c in token_table.columns], fill_color='white', align='left')
)])
fig_token_tbl.update_layout(title="Token-Cluster Representatives (Instance KMeans; Row memberships are IDF-weighted)")
fig_token_tbl.show()

# ---- Table 2: row (entry-cluster) summary ----
soft_cols = [c for c in df.columns if c.startswith("mixnlp_Psoft_c")]
entry_means = df.groupby("entry_cluster")[soft_cols].mean()

def _topk_token_clusters(row, k=5):
    vals = row.values
    idx = np.argsort(vals)[::-1][:k]
    labs = [f"c{int(row.index[j].split('mixnlp_Psoft_c')[1])}" for j in idx]
    return ", ".join(labs)

# tokens per entry-cluster for TF-IDF
tokens_by_entry = (
    mixnlp_tok_df
    .groupby("entry_cluster")["token"]
    .apply(list)
    .reindex(sorted(df["entry_cluster"].unique()), fill_value=[])
)
N_docs_ec = len(tokens_by_entry)
dfreq_ec = Counter()
for toks in tokens_by_entry:
    for t in set(toks):
        dfreq_ec[t] += 1

def _tfidf_top_tokens(tokens, k=10):
    tf = Counter(tokens); total = sum(tf.values()) or 1
    scored = [(t, (f/total) * (np.log((N_docs_ec+1)/(1+dfreq_ec[t])) + 1.0)) for t, f in tf.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return ", ".join([t for t, _ in scored[:k]])

entry_sizes = df["entry_cluster"].value_counts().sort_index()
examples = {
    c: " | ".join(df.loc[df["entry_cluster"] == c, "row_text"].head(3).tolist())
    for c in entry_sizes.index
}

row_table = pd.DataFrame({
    "Entry Cluster": entry_sizes.index,
    "Count": entry_sizes.values,
    "Top Token-Clusters (mean soft)": entry_means.apply(_topk_token_clusters, axis=1).reindex(entry_sizes.index).fillna(""),
    "Top Tokens (TF-IDF)": tokens_by_entry.apply(_tfidf_top_tokens).reindex(entry_sizes.index).fillna(""),
    "Example Rows": pd.Series(examples).reindex(entry_sizes.index).fillna("")
}).reset_index(drop=True)

fig_row_tbl = go.Figure(data=[go.Table(
    header=dict(values=list(row_table.columns), fill_color='lightgrey', align='left'),
    cells=dict(values=[row_table[c] for c in row_table.columns], fill_color='white', align='left')
)])
fig_row_tbl.update_layout(title="Row (Entry-Cluster) Summary (Row memberships are IDF-weighted)")
fig_row_tbl.show()
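
To see what the IDF downweighting in Approach 2 does to a single row, here is a small sketch on toy data. The helper names, tokens, and soft distributions are illustrative assumptions; only the smoothed IDF formula `log((N + 1) / (df + 1)) + 1` mirrors the cell above.

```python
import numpy as np
from collections import Counter

def idf_weights(row_ids, tokens, n_rows):
    """Smoothed IDF per token instance, treating each row as a document."""
    # Count each token once per row it appears in.
    doc_freq = Counter(tok for _, tok in set(zip(row_ids, tokens)))
    return np.array([np.log((n_rows + 1) / (doc_freq[t] + 1)) + 1.0 for t in tokens])

def weighted_row_memberships(row_ids, token_soft, weights, n_rows):
    """Weighted mean of token-level soft memberships per row."""
    num = np.zeros((n_rows, token_soft.shape[1]))
    den = np.zeros(n_rows)
    np.add.at(num, row_ids, token_soft * weights[:, None])
    np.add.at(den, row_ids, weights)
    return num / np.where(den > 0, den, 1.0)[:, None]

# Toy data (hypothetical): "pipe" appears in all three rows, "elbow" in only one.
row_ids = [0, 0, 1, 2]
tokens = ["pipe", "elbow", "pipe", "pipe"]
token_soft = np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.9, 0.1]])
w = idf_weights(row_ids, tokens, n_rows=3)
rows = weighted_row_memberships(np.array(row_ids), token_soft, w, n_rows=3)
print(w.round(3), rows.round(3))
```

With an unweighted mean, row 0 would split evenly between the two clusters; the IDF weights tilt it toward the cluster of the rarer token.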


# Manual renaming of topics

# Analysis ready dataset

# Appendix

## Testing BERTopic

In [12]:
en_stop = set(ENGLISH_STOP_WORDS)  # imported from sklearn.feature_extraction.text in the setup cell
with open('../assets/AR_stop_words.txt', encoding='utf-8') as f:
    AR_STOP = set(line.strip() for line in f if line.strip())
stop_words = list(en_stop | AR_STOP)

# 1) Load preprocessed texts
texts = df["Combined"].astype(str).tolist() 

# 2) Multilingual embeddings (CPU okay)
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# 3) Vectorizer tuned for short text
vectorizer_model = CountVectorizer(
    ngram_range=(1,3), min_df=2, stop_words=stop_words
)

# 4) UMAP + HDBSCAN
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=25, metric='euclidean',
                                cluster_selection_method='eom', prediction_data=True)

# 5) Build BERTopic
topic_model = BERTopic(
    embedding_model=embedder,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=True,
    verbose=False
)

topics, probs = topic_model.fit_transform(texts)

# Inspect
topic_info = topic_model.get_topic_info()     # topic sizes + labels
top_words_0 = topic_model.get_topic(0)        # [(word, weight), ...]
topic_info

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,837,-1_توريد_bend_model_cut,"[توريد, bend, model, cut, weicco, wall, سلك, c...","[cut bend rebar, cut bend rebar, cut bend rebar]"
1,0,241,0____,"[, , , , , , , , , ]","[unknown, unknown, unknown]"
2,1,162,1_cu_kv_cable_pvc,"[cu, kv, cable, pvc, cu xlpe, xlpe, earth, len...","[riyadh cable cu xlpe pvc kv black, riyadh cab..."
3,2,114,2_كيبل رياض_رياض_ac_iec,"[كيبل رياض, رياض, ac, iec, fp, كيبل, سرايا, dc...","[كيبل رياض مسلح sta xlpe pvc, كيبل رياض مسلح s..."
4,3,84,3_سابك حديد تسليح_حديد تسليح سابك_تسليح سابك_س...,"[سابك حديد تسليح, حديد تسليح سابك, تسليح سابك,...","[حديد تسليح سابك, حديد تسليح سابك, حديد تسليح ..."
5,4,81,4_ppr_pipe_tahweeltm_ppr tahweeltm,"[ppr, pipe, tahweeltm, ppr tahweeltm, tee ppr,...","[equal tee ppr tahweeltm, ppr pipe, ppr pipe]"
6,5,80,5_صاج حديد_اسود صاج_صاج_اسود,"[صاج حديد, اسود صاج, صاج, اسود, حديد مجلفن, حد...","[صاج حديد اسود, صاج حديد اسود, صاج حديد اسود]"
7,6,80,6_اتفاق حديد_اتفاق_حديد تسليح_تسليح,"[اتفاق حديد, اتفاق, حديد تسليح, تسليح, حديد, ض...","[حديد تسليح اتفاق, حديد تسليح اتفاق, حديد تسلي..."
8,7,78,7_صاج اسود_اسود صاج_صاج_اسود,"[صاج اسود, اسود صاج, صاج, اسود, املس, ويل املس...","[صاج اسود, صاج اسود, صاج اسود]"
9,8,77,8_aquaterra_elbow_ppr_aquaterra ppr,"[aquaterra, elbow, ppr, aquaterra ppr, adaptor...","[aquaterra ppr elbow, aquaterra ppr female elb..."


In [13]:
topic_model.visualize_topics()

In [14]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic._utils import MyLogger
from bertopic.backend import BaseEmbedder
from bertopic.dimensionality import BaseDimensionalityReduction

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans as SKKMeans
import numpy as np
import re, math

# --- Your corpus ---
texts = df["Interpretable"].astype(str).tolist() 

# ---- Multilingual embedding model (good for Arabic + English + short phrases) ----
# Alternatives to try: "sentence-transformers/LaBSE", "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
EMBED_MODEL_NAME = "sentence-transformers/distiluse-base-multilingual-cased-v2"
embedder = SentenceTransformer(EMBED_MODEL_NAME)

def embed(texts):
    return embedder.encode(texts, batch_size=64, show_progress_bar=True, normalize_embeddings=True)

# ---- Vectorizer tuned for Arabic/English tokens and short docs ----
# Token pattern keeps Arabic letters, English, digits, hyphen/dot (e.g., D3-PANEL, M12.5).
token_pattern = r"(?u)\b[\w\u0600-\u06FF][\w\-\.\u0600-\u06FF]{0,}\b"

vectorizer = CountVectorizer(
    token_pattern=token_pattern,
    ngram_range=(1, 2),   # bi-grams help short docs
    min_df=2,             # cut pure singletons noise
    lowercase=True
)

# ---- Embedding-based labeler (far better than c-TF-IDF for tiny docs) ----
repr_model = KeyBERTInspired()  # uses the same embedding backend internally

# (Optional) Encourage diversity in labels:
# repr_model = MaximalMarginalRelevance(diversity=0.3)


import hdbscan

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=15,          # tune up/down to change granularity
    min_samples=None,             # None => equal to min_cluster_size by default
    metric='euclidean',
    prediction_data=True,         # enables soft (membership) probabilities
    cluster_selection_method='eom'
)

topic_model = BERTopic(
    embedding_model=embedder,
    vectorizer_model=vectorizer,
    representation_model=repr_model,
    hdbscan_model=hdbscan_model,
    verbose=True
)

embeddings = embed(texts)
topics, probs = topic_model.fit_transform(texts, embeddings=embeddings)

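# Reassign outliers (-1) using the topic probabilities.
# Strategy "probabilities" uses membership strengths; threshold ~0.25–0.35 works well on short docs,
# but threshold=0.0 (used here) reassigns every outlier regardless of strength.
# NOTE: this strategy expects the full document-topic matrix, which BERTopic only returns
# when the model is built with calculate_probabilities=True.
topics = topic_model.reduce_outliers(
    documents=texts,
    topics=topics,
    probabilities=probs,
    strategy="probabilities",
    threshold=0.0,
    embeddings=embeddings
)

# Refresh the topic representations.
# NOTE: reduce_outliers only returns the new assignments; they take effect only when passed as
# update_topics(texts, topics=topics, ...). Without that, topic -1 keeps its rows (see the table below).
topic_model.update_topics(texts, vectorizer_model=vectorizer)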
topic_model.generate_topic_labels(nr_words=4, topic_prefix=False)
info = topic_model.get_topic_info()
info

Batches: 100%|██████████| 50/50 [00:14<00:00,  3.38it/s]
2025-09-10 19:47:42,648 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-10 19:47:54,463 - BERTopic - Dimensionality - Completed ✓
2025-09-10 19:47:54,465 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-10 19:47:54,621 - BERTopic - Cluster - Completed ✓
2025-09-10 19:47:54,624 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-10 19:47:56,994 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,552,-1_tee_ppr_tray_reducer,"[tee, ppr, tray, reducer, pcs, motor, class, s...",[cable tray type tr new design mmw hx thk ral ...
1,0,244,0_unknown_قفي_اسمنت_ارضي,"[unknown, قفي, اسمنت, ارضي, , , , , , ]","[unknown, unknown, unknown]"
2,1,90,1_تسليح سابك_سابك حديد_سابك_حديد تسليح,"[تسليح سابك, سابك حديد, سابك, حديد تسليح, تسلي...","[حديد تسليح سابك, حديد تسليح سابك, حديد تسليح ..."
3,2,89,2_سيراميك_ارضي_iec_ac,"[سيراميك, ارضي, iec, ac, مقاوم, turner, غراء, ...","[ac iec, ac iec, ac iec]"
4,3,84,3_اتفاق_حديد تسليح_تسليح_حديد,"[اتفاق, حديد تسليح, تسليح, حديد, حديد اتفاق, م...","[حديد تسليح اتفاق, حديد تسليح اتفاق, حديد تسلي..."
...,...,...,...,...,...
66,65,16,65_سابك black_bar سابك_black deformed_black,"[سابك black, bar سابك, black deformed, black, ...","[black deformed straight bar سابك, black defor..."
67,66,16,66_model_weic_hanger_adjustable,"[model, weic, hanger, adjustable, strap, pipe,...","[weic sprinkle pipe hanger model wpcs, weic sp..."
68,67,16,67_ثلاثي_الخرسانيه_الواح_panel,"[ثلاثي, الخرسانيه, الواح, panel, توريد, اكسسوا...",[فلل شرف توريد اكسسوارات تقني الواح الخرسانيه ...
69,68,16,68_حديد مجدول_مربع_مجدول_md,"[حديد مجدول, مربع, مجدول, md, فارغ, مستطيل, حد...","[حديد مجدول, حديد مجدول, حديد مجدول]"


In [15]:
topic_model.visualize_topics()

In [16]:
from bertopic import BERTopic
from bertopic.representation import MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
import numpy as np


# ---- 1) Sentence embeddings (multilingual) ----
# LaBSE = strong AR/EN alignment; swap to paraphrase-multilingual-MiniLM-L12-v2 for speed
sbert = SentenceTransformer("sentence-transformers/LaBSE")
sbert_vecs = sbert.encode(texts, batch_size=128, show_progress_bar=True, normalize_embeddings=True)

# ---- 2) Char n-gram TF-IDF + SVD (subword-ish) ----
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3,5), min_df=2)
X_char = char_tfidf.fit_transform(texts)

svd = TruncatedSVD(n_components=256, random_state=42)
X_char_svd = svd.fit_transform(X_char)
X_char_svd = normalize(X_char_svd)

# ---- 3) Hybrid embedding: concat (dims: 768 + 256 = 1024) ----
emb = np.hstack([sbert_vecs, X_char_svd])

# ---- 4) K-Means (no UMAP) ----
def estimate_k(n):
    # Rule of thumb: K ≈ sqrt(n), with a floor of 12 clusters
    return max(12, int(round(np.sqrt(n))))

K = estimate_k(len(texts))
kmeans = KMeans(n_clusters=K, n_init=20, random_state=42)

# ---- 5) BERTopic: c-TF-IDF/MMR labels; pass our embeddings ----
word_vect = CountVectorizer(ngram_range=(1,3), min_df=2, token_pattern=r"(?u)\b\w+\b")
mmr = MaximalMarginalRelevance(diversity=0.3)

topic_model = BERTopic(
    embedding_model=None,           # we supply embeddings directly
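    # NOTE: None does not skip UMAP; BERTopic falls back to its default UMAP (see the log below).
    # To really skip it, pass bertopic.dimensionality.BaseDimensionalityReduction().
    umap_model=None,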
    hdbscan_model=kmeans,           # sklearn clusterer accepted in 0.17.x
    vectorizer_model=word_vect,     # drives c-TF-IDF vocab
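    # MMR aims at diversified labels, but it embeds words via embedding_model;
    # with embedding_model=None it likely has no effect, leaving plain c-TF-IDF labels.
    representation_model=mmr,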
    calculate_probabilities=False,
    verbose=True
)

topics, _ = topic_model.fit_transform(texts, embeddings=emb)

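# ---- 6) (Optional) Auto-merge similar topics inside BERTopic ----
# After reduce_topics, the `topics` list returned above is stale; read topic_model.topics_ instead.
topic_model.reduce_topics(texts, nr_topics="auto")
# Passing only n_gram_range appears to rebuild the vectorizer with CountVectorizer defaults,
# so word_vect's min_df and token_pattern are not carried over; pass vectorizer_model=word_vect to keep them.
topic_model.update_topics(texts, n_gram_range=(1,3))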

info = topic_model.get_topic_info()
info


Batches: 100%|██████████| 25/25 [00:23<00:00,  1.05it/s]
2025-09-10 19:50:34,077 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-10 19:50:48,225 - BERTopic - Dimensionality - Completed ✓
2025-09-10 19:50:48,227 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-10 19:50:51,532 - BERTopic - Cluster - Completed ✓
2025-09-10 19:50:51,549 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-10 19:50:51,611 - BERTopic - Representation - Completed ✓
2025-09-10 19:50:51,706 - BERTopic - Topic reduction - Reducing number of topics
2025-09-10 19:50:51,732 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-10 19:50:51,753 - BERTopic - Representation - Completed ✓
2025-09-10 19:50:51,753 - BERTopic - Topic reduction - Reduced number of topics from 56 to 26


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,1529,0_ppr_pvc_hdg_cable,"[ppr, pvc, hdg, cable, sheet, hr, tray, panel,...","[hr sheet, hr sheet سابك, hr sheet سابك]"
1,1,407,1_md_md md_md md md_,"[md, md md, md md md, , , , , , , ]","[حديد تسليح اتفاق, حديد تسليح سابك, سابك]"
2,2,244,2_unknown unknown unknown_unknown unknown_unkn...,"[unknown unknown unknown, unknown unknown, unk...","[unknown, unknown, unknown]"
3,3,169,3_unknown___,"[unknown, , , , , , , , , ]","[صاج اسود, صاج حديد اسود, ماسور اسود كهرباء un..."
4,4,136,4_xt_ekip_xt tmd_dip,"[xt, ekip, xt tmd, dip, ekip dip, tmd, tmd xt,...","[xt tma, xt tma, xt ekip dip lsi]"
5,5,130,5_deformed straight_deformed straight bar_stra...,"[deformed straight, deformed straight bar, str...","[black deformed straight bar سابك, black defor..."
6,6,83,6_nya_aw nya nya_upright_pendin upright,"[nya, aw nya nya, upright, pendin upright, pen...","[سلك سويدي aw, سلك مفرد nya, سلك مفرد nya]"
7,7,64,7_rebar_steel_saudi_ittefaq,"[rebar, steel, saudi, ittefaq, rebar saudi, re...","[saudi deformed rebar fabricated, rebar saudi ..."
8,8,63,8____,"[, , , , , , , , , ]","[ماسور حديد, ماسور حديد, ماسور حديد]"
9,9,34,9_configuration_size_color_size color,"[configuration, size, color, size color, style...",[brake chamber mx configuration assy lh size c...
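
The log above still shows a "Dimensionality - Fitting the dimensionality reduction algorithm" step of about 14 seconds, which suggests that `umap_model=None` fell back to BERTopic's default UMAP instead of skipping it. BERTopic ships `bertopic.dimensionality.BaseDimensionalityReduction` as a pass-through for this case. The sketch below shows the idea with a stand-in class, `IdentityReducer` (a hypothetical name), that exposes the same `fit`/`transform` interface and returns the embeddings unchanged. Whether a custom object like this is accepted should be checked against the installed BERTopic version.

```python
class IdentityReducer:
    """Pass-through 'dimensionality reduction': returns embeddings as-is."""

    def fit(self, X=None, y=None):
        # Nothing to learn; return self so calls can be chained
        return self

    def transform(self, X):
        # Hand the embeddings back untouched
        return X


reducer = IdentityReducer()
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy stand-in for `emb`
reduced = reducer.fit(embeddings).transform(embeddings)
print(reduced is embeddings)

# With BERTopic installed, the built-in equivalent would be passed like this:
#   from bertopic.dimensionality import BaseDimensionalityReduction
#   BERTopic(umap_model=BaseDimensionalityReduction(), hdbscan_model=kmeans)
```

With a pass-through in place, K-Means would cluster the full 1024-d hybrid vectors directly, so the resulting topics may differ from the table above.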


In [17]:
topic_model.visualize_topics()
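
Step 3 of the cell above concatenates two L2-normalized blocks (LaBSE and char n-gram SVD). Under Euclidean distance, which K-Means uses, each block contributes its own squared distance to the total, so both blocks carry equal weight by default. The dependency-free sketch below illustrates this and shows one way to re-weight the blocks. `weighted_concat` and its weights are assumptions for illustration, not something this notebook does.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def weighted_concat(sem_vec, char_vec, w_sem=1.0, w_char=1.0):
    """Normalize each block, scale it by its weight, then concatenate."""
    sem = [w_sem * x for x in l2_normalize(sem_vec)]
    char = [w_char * x for x in l2_normalize(char_vec)]
    return sem + char


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 2-d blocks: semantic parts are orthogonal, char parts point the same way
doc_a = weighted_concat([1.0, 0.0], [0.0, 2.0])
doc_b = weighted_concat([0.0, 3.0], [0.0, 5.0])
equal_weight = sq_dist(doc_a, doc_b)

# Doubling the semantic weight multiplies that block's squared distance by 4
doc_a2 = weighted_concat([1.0, 0.0], [0.0, 2.0], w_sem=2.0)
doc_b2 = weighted_concat([0.0, 3.0], [0.0, 5.0], w_sem=2.0)
semantic_heavy = sq_dist(doc_a2, doc_b2)

print(equal_weight, semantic_heavy)
```

If the spelling-level signal from the char n-grams should matter less than the semantic signal, lowering `w_char` before concatenation is one simple knob to try.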

In [18]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

# texts: the ~3,100 very short item-name strings defined earlier in the notebook

# ---- 1) Sentence embeddings only (LaBSE) ----
sbert = SentenceTransformer("sentence-transformers/LaBSE")
sbert_vecs = sbert.encode(texts, batch_size=128, show_progress_bar=True, normalize_embeddings=True)  # (N, 768)

# ---- 2) K-Means (no UMAP) ----
def estimate_k(n):
    return max(12, int(round(np.sqrt(n))))
K = estimate_k(len(texts))
kmeans = KMeans(n_clusters=K, n_init=20, random_state=42)

# ---- 3) BERTopic with KeyBERTInspired (doc & word embeddings both from LaBSE) ----
topic_model = BERTopic(
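    embedding_model=sbert,          # embeds candidate words for KeyBERTInspired (768-d); doc embeddings are passed in below
    # NOTE: None does not skip UMAP; BERTopic falls back to its default UMAP (see the log below).
    # To really skip it, pass bertopic.dimensionality.BaseDimensionalityReduction().
    umap_model=None,
    hdbscan_model=kmeans,           # sklearn clusterer accepted in 0.17.x
    vectorizer_model=None,          # None = default CountVectorizer; c-TF-IDF still proposes candidate words, KeyBERTInspired re-ranks them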
    representation_model=KeyBERTInspired(),
    calculate_probabilities=False,
    verbose=True
)

topics, _ = topic_model.fit_transform(texts, embeddings=sbert_vecs)

# ---- 4) (Optional) Auto-merge similar topics ----
# topic_model.reduce_topics(texts, nr_topics="auto")

info = topic_model.get_topic_info()
info

Batches: 100%|██████████| 25/25 [00:26<00:00,  1.07s/it]
2025-09-10 19:51:24,658 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-10 19:51:38,083 - BERTopic - Dimensionality - Completed ✓
2025-09-10 19:51:38,085 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-10 19:51:38,897 - BERTopic - Cluster - Completed ✓
2025-09-10 19:51:38,901 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-10 19:51:45,843 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,260,0_forged_hollow_fixed_مفرد,"[forged, hollow, fixed, مفرد, british, رقم, ch...","[bolt nut washer, bolt nut washer, nut bolt wa..."
1,1,244,1_unknown_نجران_قفي_فيز,"[unknown, نجران, قفي, فيز, , , , , , ]","[unknown, unknown, unknown]"
2,2,182,2_معنون_متنوع_fresh_recessed,"[معنون, متنوع, fresh, recessed, tinted, approv...","[wall light ip led lamp, square floor drain ga..."
3,3,164,3_unequal_new_معزوله_single,"[unequal, new, معزوله, single, qha, غامق, hung...",[توريد مواد تقني الواح الخرسانيه معزوله ثلاثي ...
4,4,141,4_نمساوي_جاف_محبس_dn,"[نمساوي, جاف, محبس, dn, white, dag, normal, st...","[bend pvc aplac, bend pvc aplac, dn grp pipe pn]"
5,5,138,5_slitted_heb_plain_ogn,"[slitted, heb, plain, ogn, nbtc, مدهون, ipe, r...","[ms plain sheet astm, ms plain sheet astm, ms ..."
6,6,108,6_colored_change_inland_eye,"[colored, change, inland, eye, water, ext, مشت...","[ماسور حراري تحويل, ماسور حراري تحويل, ماسور ح..."
7,7,102,7_different_مختلف_عجم_فارغ,"[different, مختلف, عجم, فارغ, نجران, shadeed, ...","[aluminium profile bars pcs, aluminium profile..."
8,8,96,8_wireless_moe_heb_af,"[wireless, moe, heb, af, inn, uni, managed, in...","[af hz dc contactor, af hz dc contactor, af hz..."
9,9,87,9_مصمت_مسحوب_bare_small,"[مصمت, مسحوب, bare, small, wet, safe, dim, است...","[خرسان جاهز, خرسان جاهز, خرسان جاهز مقاوم src]"
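
The labels in this table come from `KeyBERTInspired`. As far as the BERTopic documentation describes it, this takes candidate words per topic and re-ranks them by cosine similarity between each word's embedding and an embedding of the topic's representative documents. A toy sketch of that re-ranking step follows. The 2-d vectors, the words, and the function names are made up for illustration and do not come from LaBSE or from this dataset.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(candidate_vecs, topic_vec, top_n=3):
    """Order candidate words by similarity to the topic vector, best first."""
    scored = [(word, cosine(vec, topic_vec)) for word, vec in candidate_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]


# Hypothetical word embeddings and topic embedding
candidate_vecs = {
    "rebar": [0.9, 0.1],
    "steel": [0.8, 0.3],
    "new": [0.1, 0.9],
    "size": [0.2, 0.8],
}
topic_vec = [1.0, 0.2]

top_words = rerank(candidate_vecs, topic_vec, top_n=2)
print(top_words)
```

If generic descriptors such as `new` or `different` dominate the labels, as in the table above, comparing them against the c-TF-IDF labels from the earlier runs may help judge which representation is easier to interpret.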


In [19]:
topic_model.visualize_topics()
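
Both K-Means cells size the number of clusters with `estimate_k`, a square-root rule of thumb with a floor of 12. A small worked example using only the standard library follows; the notebook itself uses `np.sqrt`, which gives the same values here, and the `floor` parameter is added for illustration. With roughly 3,100 documents, as the cell comment indicates, the rule gives K = 56, which matches the "Reduced number of topics from 56 to 26" log line of the hybrid run.

```python
import math


def estimate_k(n, floor=12):
    """Square-root heuristic for the number of clusters, never below `floor`."""
    return max(floor, int(round(math.sqrt(n))))


for n in (100, 400, 3100):
    print(n, estimate_k(n))
```

The floor only matters for small inputs: below about 150 documents the square root drops under 12 and the floor takes over.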

In [None]:
# Export the topic table for manual review (writing .xlsx needs openpyxl installed)
info.to_excel("temp.xlsx")