# This Notebook is concerned with clustering the preprocessed Item Names (Interpretable column)

# Table of contents
- [Overview](#overview)
    - [We'll be dealing with](#well-be-dealing-with)
    - [Approach](#approach)
- [Setup and load data](#setup-and-load-data)
    - [Initialize embedder wiki.trimmed.align.vec](#initialize-embedder-wikitrimmedalignvec)

- [🚨 Find clusters](#find-clusters)
    - [Kmeans and UMAP](#kmeans-and-umap)
    - [Save entry and token level clusters](#save-entry-and-token-level-clusters)
    - [Grid search on clustering hyperparameters](#grid-search-on-clustering-hyperparameters)
        - [Save models and cluster previews](#save-models-and-cluster-previews)
- [Interpreting clusters into categories](#interpreting-clusters-into-categories)
- [🚨 Analysis ready dataset](#analysis-ready-dataset)
- [Appendix](#appendix)
    - [Testing BERTopic](#testing-bertopic)

# Overview:

## We'll be dealing with:

- The dataset has been cleaned in the two previous notebooks (EDA -> NLP)
    - Numerical values handled in EDA.ipynb, and Item Name with most of its underlying issues was handled in NLP.ipynb
- Now we'll focus on clustering and categorizing the embedded tokens and entries in general.

## Approach:

1. Find clusters in embedded tokens
    - Find each row's membership to the clusters
    - Assign rows to clusters based on token voting
2. Manually rename clusters and row level cluster membership to interpretable categories
    - Could call an API here or analyze myself if clear enough
3. Produce the spend-analysis-ready dataset
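
A minimal sketch of the token voting in step 1, assuming each token has already been assigned a cluster id. The function name `assign_row_by_vote` and the toy cluster ids are illustrative and not part of the notebook's pipeline, which relies on soft votes further down.

```python
from collections import Counter

def assign_row_by_vote(token_clusters):
    """Return (winning_cluster, vote_shares) for one row's token cluster ids."""
    if not token_clusters:
        return None, {}
    counts = Counter(token_clusters)
    total = sum(counts.values())
    shares = {c: n / total for c, n in counts.items()}
    # Ties are broken by the smallest cluster id to keep the result deterministic
    winner = min(counts, key=lambda c: (-counts[c], c))
    return winner, shares

# Toy row: three tokens, two of them in cluster 5
winner, shares = assign_row_by_vote([5, 8, 5])
print(winner, shares)
```

A row with no usable tokens returns `None`, so such rows can be routed to a catch-all category later.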

# Setup and load data

In [106]:
# Import relevant libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import re

from pathlib import Path
from collections import Counter
import joblib
import os

import gensim
from gensim.models import KeyedVectors

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

from umap import UMAP
import umap
import hdbscan

In [76]:
data_path = "../data/checkpoints/fully_preprocessed_item_names.xlsx"
df_original = pd.read_excel(data_path)
df = df_original.copy()

num_path = "../data/checkpoints/cleaned_num.xlsx"
df_num = pd.read_excel(num_path)

## Initialize embedder wiki.trimmed.align.vec

In [25]:
# Load the trimmed aligned vectors (300-D, word2vec text format)
kv = KeyedVectors.load_word2vec_format("../assets/wiki.trimmed.align.vec", binary=False)
dim = kv.vector_size  # Should be 300
dim

300

# Find clusters

Now that we have the preprocessed Item Names (in the `Interpretable` column) and an AR-EN aligned embedder, we can embed the tokens and look for clusters within the Item Names. This was the goal all along: once we find the clusters, we can derive appropriate categories for analysis without assuming the categories beforehand.

## Kmeans and UMAP

In short, the approach is: log every token occurrence in the corpus (including repeats) -> embed them -> cluster them -> derive each row's membership in every cluster by soft voting -> cluster the entries. Interactive visualizations on a 2D UMAP projection help us inspect the distribution and sanity-check the approach.

In [None]:
#  Processing: tokens → token clusters → row memberships → entry clusters
#  NAIVE approach - frequency imbalance not mitigated


# ----- Hyperparameters -----
mixnlp_n_clusters     = 30      # Total clusters for tokens 100, 50
mixnlp_random_state   = 42      # For reproducibility
mixnlp_umap_neighbors = 50      # More neighbors -> care more about global neighborhood
mixnlp_umap_min_dist  = 0.05    # UMAP param
mixnlp_remove_top_pcs = 0       # No. of top-variance principal components to remove (reduces noise in embeddings); better at 0
mixnlp_softmax_temp   = 0.05    # Softmax temperature (tau)

# ----- Entry to string tokens -----
def _tokens(s):
    return [t for t in s.split() if t] if isinstance(s, str) else []
k2i = kv.key_to_index # Dict = {word: row index in kv.vectors}; used here only for membership checks

# Now we'll record every instance of the tokens in a new df (totaling 10824 instances from the 3150 entries = ~3 tokens/row)
rows = []
for rid, text in enumerate(df["Interpretable"].to_numpy()): # Get row id (rid) and the text -> break into tokens
    for tok in _tokens(text):
        if tok in k2i: # All tokens are found in kv; only a sanity check
            rows.append((rid, tok, kv.get_vector(tok)))     # Append every token seen as the (rid, token, and its embedding)
mixnlp_tok_df = pd.DataFrame(rows, columns=["row_id", "token", "vec"]) # Create a new df with logging every instance for every token 

# ----- Cosine geometry + optional PCA denoise -----
Xt = np.vstack(mixnlp_tok_df["vec"].to_numpy()).astype(np.float32) # Shape: 10824X300
Xn = normalize(Xt, norm="l2", axis=1)               # Normalize across row (row-wise)
Xc = Xn - Xn.mean(axis=0, keepdims=True)            # Center around mean for PCA

if mixnlp_remove_top_pcs > 0:
    pca = PCA(n_components=min(256, Xc.shape[1]), random_state=mixnlp_random_state)
    Z = pca.fit_transform(Xc)                       # Shape: 10824X256
    Z[:, :mixnlp_remove_top_pcs] = 0                # Zero out the top PCs; shape stays 10824X256
    Xproc = Z @ pca.components_                     # Project back: (10824X256) @ (256X300) = 10824X300
    Xproc = normalize(Xproc, axis=1)                # Restore unit norm row-wise
else:
    Xproc = Xn                                      # No modification by PCA

# ----- Token clustering (KMeans) -----
tok_kmeans = KMeans(n_clusters=mixnlp_n_clusters, random_state=mixnlp_random_state, n_init='auto')
mixnlp_tok_df["token_cluster"] = tok_kmeans.fit_predict(Xproc) # Fit the 10824 tokens by k means and record the cluster id

# ----- 2D coords for tokens (for plotting later) -----
um = umap.UMAP(n_neighbors=mixnlp_umap_neighbors, min_dist=mixnlp_umap_min_dist,
               metric="cosine", random_state=mixnlp_random_state)
X2 = um.fit_transform(Xproc)
mixnlp_tok_df["x"] = X2[:, 0]
mixnlp_tok_df["y"] = X2[:, 1]

# ----- Row-level soft/vote memberships -----
def _softmax(z, temp=mixnlp_softmax_temp): # A vectorized [0, 1] cluster vote for each token based on its cluster similarities (z)
    z = z - z.max()
    e = np.exp(z / max(temp, 1e-6))
    return e / (e.sum() + 1e-9)

C  = normalize(tok_kmeans.cluster_centers_, axis=1)  # (k, d) normalized centroids in kX300
Xp = normalize(Xproc, axis=1)                        # (n_tok, d) normalized embedded tokens in 10824X300
cos_tok_cent = Xp @ C.T                              # (n_tok, k) the closeness of each of the tokens to each of the k centroids

tok_soft = np.apply_along_axis(_softmax, 1, cos_tok_cent) # Cluster similarity -> soft vote / as opposed to token_cluster (hard vote)
soft_mat = pd.DataFrame(tok_soft, columns=[f"mixnlp_Psoft_c{c}" for c in range(mixnlp_n_clusters)]) # Store each soft vote in a col.
soft_mat["row_id"] = mixnlp_tok_df["row_id"].values # Add a col to indicate the row the token belongs to
soft_rows = soft_mat.groupby("row_id").mean().reindex(range(len(df)), fill_value=0.0) # Group all the soft votes by the 3150 entries

vote_counts = ( # Hard vote counts for each of the original 3150 for each cluster shape: 3150Xk 
    mixnlp_tok_df.groupby(["row_id","token_cluster"]).size()
    .unstack(fill_value=0).reindex(range(len(df)), fill_value=0) # How many tokens were assigned to cluster k per row?
)
vote_rows = vote_counts.div(vote_counts.sum(axis=1).replace(0,1), axis=0) # Value at (row_i, k_i) / total votes at (row_i, k)
vote_rows.columns = [f"mixnlp_Pvote_c{c}" for c in vote_rows.columns] # Again, votes labeled entry (row) per each cluster (col)

# ----- Write memberships into df (overwrite to avoid duplicates) -----
soft_cols = [f"mixnlp_Psoft_c{c}" for c in range(mixnlp_n_clusters)]
vote_cols = [f"mixnlp_Pvote_c{c}" for c in range(mixnlp_n_clusters)]
df.drop(columns=[c for c in df.columns if c in soft_cols + vote_cols], errors="ignore", inplace=True)

for c in soft_cols:
    df[c] = soft_rows[c].values # Mean soft vote per row; keeps partial membership in every cluster
for c in vote_cols:
    if c in vote_rows.columns:
        df[c] = vote_rows[c].values # Hard-vote shares; tested but not used, soft votes delivered better entry clusters
    else:
        df[c] = 0.0  # No token landed in this cluster; ensure shape is 3150Xk

# ----- Entry-level clustering on row features (prefer soft) -----
feat_cols = soft_cols if set(soft_cols).issubset(df.columns) else vote_cols
X_entry = df[feat_cols].to_numpy()

entry_k = max(5, min(40, mixnlp_n_clusters)) # Clamp the entry-level k to [5, 40]; equals mixnlp_n_clusters (30) here
entry_kmeans = KMeans(n_clusters=entry_k, random_state=mixnlp_random_state, n_init='auto')
df["entry_cluster"] = entry_kmeans.fit_predict(X_entry) # Finally, cluster the entries based on their cluster-membership features (soft votes)

# Keep row text handy for plotting hovers
df["row_text"] = df["Interpretable"].astype(str)

# Map entry_cluster back to tokens for hover
row_to_entry = df["entry_cluster"].to_dict()
mixnlp_tok_df["entry_cluster"] = mixnlp_tok_df["row_id"].map(row_to_entry)
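
The effect of `mixnlp_softmax_temp` on the soft votes can be checked in isolation. The sketch below is a pure-Python stand-in for `_softmax` (the name `softmax_with_temp` and the similarity values are made up for illustration): a low temperature pushes nearly all of a token's vote onto its closest centroid, while a high temperature spreads it out.

```python
import math

def softmax_with_temp(sims, temp):
    """Softmax over similarities; a smaller temp gives a sharper distribution."""
    temp = max(temp, 1e-6)
    m = max(sims)                      # subtract the max for numerical stability
    exps = [math.exp((s - m) / temp) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

sims = [0.80, 0.70, 0.20]              # hypothetical cosine similarities to 3 centroids
sharp = softmax_with_temp(sims, 0.05)  # temperature used in this notebook
flat = softmax_with_temp(sims, 1.0)    # much softer votes, for comparison
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

With cosine similarities confined to [-1, 1], differences between centroids are small, which is one reason a temperature well below 1 is needed for the votes to discriminate at all.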

In [59]:
# ===== Plotting: token scatter + token table + row table =====


# ---- Token scatter (color = token_cluster), show row + categories ----
tok_plot_df = mixnlp_tok_df.merge(
    df[["row_text", "entry_cluster"]].reset_index(drop=True).rename_axis("row_id").reset_index(),
    on=["row_id", "entry_cluster"],
    how="left"
)

fig_tok = px.scatter(
    tok_plot_df, x="x", y="y",
    color=tok_plot_df["token_cluster"].astype(str),
    hover_data={
        "token": True,
        "token_cluster": True,
        "row_id": True,
        "entry_cluster": True,
        "row_text": True,
        "x": False, "y": False
    },
    title="Tokens: clusters (color) with source row and entry-cluster"
)
fig_tok.update_traces(marker=dict(size=7, opacity=0.85))
fig_tok.update_layout(legend_title_text="Token Cluster")
fig_tok.show()

# ---- Table 1: token-cluster representatives (freq + TF-IDF) ----
tokens_by_cluster = mixnlp_tok_df.groupby("token_cluster")["token"].apply(list)
num_clusters = tokens_by_cluster.shape[0]

# frequency + unique
rep_rows = []
for c in range(num_clusters):
    toks = pd.Series(tokens_by_cluster.get(c, []))
    vc = toks.value_counts()
    rep_rows.append({
        "Cluster": c,
        "Count": int(len(toks)),
        "Top Frequent": ", ".join(vc.index[:8].tolist()),
        "Unique/Representative": ", ".join(vc[vc == 1].index[:6].tolist())
    })
rep_freq = pd.DataFrame(rep_rows).sort_values("Cluster").reset_index(drop=True)

# TF-IDF across clusters
dfreq = Counter()
for toks in tokens_by_cluster:
    for t in set(toks):
        dfreq[t] += 1
Ndocs = len(tokens_by_cluster)

def _tfidf_top(tokens, k=8):
    tf = Counter(tokens); total = sum(tf.values()) or 1
    scored = [(t, (f/total) * (np.log((Ndocs+1)/(1+dfreq[t])) + 1.0)) for t, f in tf.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return [t for t, _ in scored[:k]]

rep_tfidf = pd.DataFrame(
    [{"Cluster": c, "TFIDF Representatives": ", ".join(_tfidf_top(tokens_by_cluster.get(c, []), 8))}
     for c in range(num_clusters)]
).sort_values("Cluster").reset_index(drop=True)

token_table = rep_freq.merge(rep_tfidf, on="Cluster")
token_table["Auto Label (TF-IDF top3)"] = token_table["TFIDF Representatives"].apply(
    lambda s: " / ".join(s.split(", ")[:3]) if isinstance(s, str) else ""
)
token_table = token_table[
    ["Cluster","Auto Label (TF-IDF top3)","Count","Top Frequent","Unique/Representative","TFIDF Representatives"]
]

fig_token_tbl = go.Figure(data=[go.Table(
    header=dict(values=list(token_table.columns), fill_color='lightgrey', align='left'),
    cells=dict(values=[token_table[c] for c in token_table.columns], fill_color='white', align='left')
)])
fig_token_tbl.update_layout(title="Token-Cluster Representatives")
fig_token_tbl.show()

# ---- Table 2: row (entry-cluster) summary ----
soft_cols = [c for c in df.columns if c.startswith("mixnlp_Psoft_c")]
entry_means = df.groupby("entry_cluster")[soft_cols].mean()

def _topk_token_clusters(row, k=5):
    vals = row.values
    idx = np.argsort(vals)[::-1][:k]
    labs = [f"c{int(row.index[j].split('mixnlp_Psoft_c')[1])}" for j in idx]
    return ", ".join(labs)

# tokens per entry-cluster for TF-IDF
tokens_by_entry = (
    mixnlp_tok_df
    .groupby("entry_cluster")["token"]
    .apply(list)
    .reindex(sorted(df["entry_cluster"].unique()), fill_value=[])
)
N_docs_ec = len(tokens_by_entry)
dfreq_ec = Counter()
for toks in tokens_by_entry:
    for t in set(toks):
        dfreq_ec[t] += 1

def _tfidf_top_tokens(tokens, k=10):
    tf = Counter(tokens); total = sum(tf.values()) or 1
    scored = [(t, (f/total) * (np.log((N_docs_ec+1)/(1+dfreq_ec[t])) + 1.0)) for t, f in tf.items()]
    scored.sort(key=lambda x: x[1], reverse=True)
    return ", ".join([t for t, _ in scored[:k]])

entry_sizes = df["entry_cluster"].value_counts().sort_index()
examples = {
    c: " | ".join(df.loc[df["entry_cluster"] == c, "row_text"].head(3).tolist())
    for c in entry_sizes.index
}

row_table = pd.DataFrame({
    "Entry Cluster": entry_sizes.index,
    "Count": entry_sizes.values,
    "Top Token-Clusters (mean soft)": entry_means.apply(_topk_token_clusters, axis=1).reindex(entry_sizes.index).fillna(""),
    "Top Tokens (TF-IDF)": tokens_by_entry.apply(_tfidf_top_tokens).reindex(entry_sizes.index).fillna(""),
    "Example Rows": pd.Series(examples).reindex(entry_sizes.index).fillna("")
}).reset_index(drop=True)

fig_row_tbl = go.Figure(data=[go.Table(
    header=dict(values=list(row_table.columns), fill_color='lightgrey', align='left'),
    cells=dict(values=[row_table[c] for c in row_table.columns], fill_color='white', align='left')
)])
fig_row_tbl.update_layout(title="Row (Entry-Cluster) Summary")
fig_row_tbl.show()

While the summaries above show interesting relationships derived from KMeans, UMAP, and our embeddings, I'm not really satisfied with the result. For now, let's log the relationships and embeddings, then tackle the issue again with a more in-depth approach.

## Save entry and token level clusters

Now we'll load the original fastText aligned vectors in preparation for the next step. We'll take each cluster centroid (after normalization), search for the n most similar tokens in the AR and EN vectors, and use them to describe the cluster, *ordered by their cosine similarity*.
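
The idea of describing a cluster by its nearest vocabulary words can be sketched without the full fastText files. The toy vocabulary, its 2-D vectors, and the helper names below are assumptions for illustration only; the notebook itself queries the aligned AR/EN `KeyedVectors` with the normalized centroid.

```python
import math

def unit(v):
    """L2-normalize a vector (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def nearest_words(centroid, vocab, topn=3):
    """Rank vocab words by cosine similarity to the centroid, highest first."""
    c = unit(centroid)
    scored = [(w, sum(a * b for a, b in zip(c, unit(vec)))) for w, vec in vocab.items()]
    scored.sort(key=lambda ws: ws[1], reverse=True)
    return scored[:topn]

# Hypothetical 2-D embeddings; real fastText vectors are 300-D
toy_vocab = {"pipe": [1.0, 0.1], "elbow": [0.9, 0.3], "cable": [0.1, 1.0], "wire": [0.2, 0.9]}

# Centroid = mean of the member token vectors of one cluster
members = [toy_vocab["pipe"], toy_vocab["elbow"]]
centroid = [sum(col) / len(members) for col in zip(*members)]

top = nearest_words(centroid, toy_vocab)
for word, sim in top:
    print(f"{word}: {sim:.3f}")
```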

In [None]:
# ft_ar_aligned = gensim.models.KeyedVectors.load_word2vec_format("../assets/wiki.ar.align.vec", binary=False)
# ft_en_aligned = gensim.models.KeyedVectors.load_word2vec_format("../assets/wiki.en.align.vec", binary=False)

In [None]:
# Find descriptive tokens for each cluster

ARABIC_RE = re.compile(r"[\u0600-\u06FF]")  # Arabic block

def is_arabic_token(tok: str) -> bool:
    return bool(ARABIC_RE.search(tok))

def get_ft_vec(tok: str):
    """
    Returns (vec, lang) from aligned FastText. lang in {"ar","en","?"}.
    Heuristic:
      - If token has any Arabic chars -> try AR then EN
      - Else -> try EN then AR
    """
    if not isinstance(tok, str) or not tok:
        return None, "?"
    try_order = ("ar","en") if is_arabic_token(tok) else ("en","ar")
    for lang in try_order:
        kv = ft_ar_aligned if lang == "ar" else ft_en_aligned
        if tok in kv:
            return kv.get_vector(tok), lang
    return None, "?"
 
# Build aligned vectors for each token occurrence (matching order of mixnlp_tok_df)
aligned_vecs = []
tok_langs = []
for tok in mixnlp_tok_df["token"].tolist():
    v, lang = get_ft_vec(tok)
    aligned_vecs.append(v)
    tok_langs.append(lang)

# Record in token df
mixnlp_tok_df["ft_lang"] = tok_langs
# NOTE: some vecs may be None if OOV in both models; use a NaN row so the matrix stays rectangular
nan_vec = np.full(kv.vector_size, np.nan)
aligned_vecs = np.array([v if v is not None else nan_vec for v in aligned_vecs], dtype=float)  # shape (n_tok, 300)

# helper: row-wise L2 normalize ignoring NaNs
def row_norm(x):
    n = np.linalg.norm(x, axis=1, keepdims=True)
    n[n == 0] = 1.0
    return x / n

# Build centroids per token cluster in the aligned space
def cluster_centroids_aligned(tok_df: pd.DataFrame, aligned_matrix: np.ndarray, k: int):
    """
    Returns:
      centroids: (k, d) np.array (L2-normalized); NaN for empty clusters
      counts:    (k,) int
    """
    d = aligned_matrix.shape[1]
    cents = np.full((k, d), np.nan, dtype=float)
    counts = np.zeros(k, dtype=int)
    for c in range(k):
        idx = tok_df.index[tok_df["token_cluster"] == c].to_numpy()
        if len(idx) == 0: 
            continue
        V = aligned_matrix[idx]
        V = V[~np.isnan(V).any(axis=1)]  # drop OOV rows
        if V.size == 0:
            continue
        m = V.mean(axis=0)
        n = np.linalg.norm(m) or 1.0
        cents[c] = m / n
        counts[c] = V.shape[0]
    return cents, counts

centroids_aligned, counts_aligned = cluster_centroids_aligned(
    mixnlp_tok_df.reset_index(drop=True), aligned_vecs, mixnlp_n_clusters
)

# Get descriptors from BOTH vocabs (since spaces are aligned)
def cluster_descriptors(centroid_vec: np.ndarray, topn=15):
    """
    Query both models with the centroid vector and merge results by max similarity.
    Returns list of (word, sim, lang) sorted by sim.
    """
    if centroid_vec is None or np.isnan(centroid_vec).any():
        return []

    # Gensim >=4 supports passing vectors to most_similar via "positive=[vec]"
    en = ft_en_aligned.most_similar(positive=[centroid_vec], topn=topn)  # [(w, sim), ...]
    ar = ft_ar_aligned.most_similar(positive=[centroid_vec], topn=topn)

    merged = {}
    for w, s in en:
        merged[w] = (s, "en")
    for w, s in ar:
        if w not in merged or s > merged[w][0]:
            merged[w] = (s, "ar")

    out = [(w, s, lang) for w, (s, lang) in merged.items()]
    out.sort(key=lambda x: x[1], reverse=True)
    return out[:topn]

# Build a small table for centroids + top descriptors
desc_rows = []
for c in range(mixnlp_n_clusters):
    v = centroids_aligned[c] if not np.isnan(centroids_aligned[c]).any() else None
    top = cluster_descriptors(v, topn=15) if v is not None else []
    # join descriptors into a text column; keep language tags
    desc = "; ".join([f"{w}({lang}:{s:.3f})" for w, s, lang in top])
    desc_rows.append({
        "token_cluster": c,
        "n_tokens_in_centroid": int(counts_aligned[c]),
        "descriptors_top15": desc
    })

cluster_desc_df = pd.DataFrame(desc_rows)
cluster_desc_df.head(10)

| | token_cluster | n_tokens_in_centroid | descriptors_top15 |
|---|---|---|---|
| 0 | 0 | 290 | سابك(ar:1.000); سابكو(ar:0.730); وسابك(ar:0.72... |
| 1 | 1 | 37 | clamp(en:0.958); clamps(en:0.708); clamping(en... |
| 2 | 2 | 270 | unknown(en:1.000); unknowned(en:0.752); unknow... |
| 3 | 3 | 405 | switchable(en:0.721); hdmi/dvi(en:0.703); conn... |
| 4 | 4 | 122 | straight(en:1.000); traight(en:0.750); straigh... |
| 5 | 5 | 577 | حديد(ar:1.000); سكة(ar:0.780); الحديد(ar:0.758... |
| 6 | 6 | 97 | ppr(en:1.000); pprj(en:0.532); pppr(en:0.528);... |
| 7 | 7 | 121 | hr(en:0.816); ac(en:0.580); kv(en:0.572); hr+(... |
| 8 | 8 | 320 | تسليح(ar:1.000); كتسليح(ar:0.848); التسليح(ar:... |
| 9 | 9 | 351 | pipe(en:0.773); downpipes(en:0.724); drainpipe... |


Next, we average each entry's token embeddings into an entry-level vector, then save the entry-level data, the token-level info, and the cluster centroids to Excel files for easy retrieval later.

In [75]:
# --- ENTRY embedding in aligned space: mean of available token aligned vectors per row ---
entry_vecs = np.full((len(df), aligned_vecs.shape[1]), np.nan, dtype=float)
for rid in range(len(df)):
    idx = mixnlp_tok_df.index[mixnlp_tok_df["row_id"] == rid].to_numpy()
    V = aligned_vecs[idx]
    V = V[~np.isnan(V).any(axis=1)]
    if V.size == 0:
        continue
    m = V.mean(axis=0)
    n = np.linalg.norm(m) or 1.0
    entry_vecs[rid] = m / n  # L2-normalized doc embedding in aligned space

# === SAVE: two Excel files ===
base = Path("../data/clusters")
entry_path = base / "mixnlp_entry_level.xlsx"
token_path = base / "mixnlp_token_level.xlsx"

# ----- ENTRY FILE -----
entry_cols_basic = ["Interpretable", "entry_cluster", "row_text"]
entry_cols_probs = [c for c in df.columns if c.startswith("mixnlp_Psoft_") or c.startswith("mixnlp_Pvote_")]
entry_df = df[entry_cols_basic + entry_cols_probs].copy()

# attach 300-d entry embedding columns
dim = entry_vecs.shape[1]
for j in range(dim):
    entry_df[f"entry_ft_vec_{j}"] = entry_vecs[:, j]

with pd.ExcelWriter(entry_path, engine="openpyxl") as w:
    entry_df.to_excel(w, sheet_name="entry_level", index=False, float_format="%.6f")

# ----- TOKEN FILE -----
# Token sheet with aligned language + optional token aligned vector columns
token_df = mixnlp_tok_df.drop(columns=["vec"]).copy()
token_df["ft_lang"] = tok_langs
# Include 300-d aligned token vectors
tok_dim = aligned_vecs.shape[1]
for j in range(tok_dim):
    token_df[f"tok_ft_vec_{j}"] = aligned_vecs[:, j]

with pd.ExcelWriter(token_path, engine="openpyxl") as w:
    token_df.to_excel(w, sheet_name="token_level", index=False, float_format="%.6f")
    # Add centroids & descriptors as a separate sheet
    # also dump raw centroid vectors (one row per cluster)
    cent_df = pd.DataFrame(centroids_aligned, columns=[f"cent_ft_vec_{j}" for j in range(tok_dim)])
    cent_df.insert(0, "token_cluster", range(mixnlp_n_clusters))
    cent_df.insert(1, "n_tokens_in_centroid", counts_aligned)
    cent_df.to_excel(w, sheet_name="cluster_centroids", index=False, float_format="%.6f")
    cluster_desc_df.to_excel(w, sheet_name="cluster_descriptors", index=False)


DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
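
The fragmentation warning comes from adding the 300 embedding columns one at a time. A possible alternative, sketched here on a small stand-in frame (the names `toy_df` and `toy_vecs` are hypothetical), builds all vector columns in one DataFrame and joins them with a single `pd.concat`.

```python
import numpy as np
import pandas as pd

# Stand-ins for entry_df and entry_vecs (4 rows, 300-D embeddings)
toy_df = pd.DataFrame({"Interpretable": ["a", "b", "c", "d"], "entry_cluster": [0, 1, 0, 2]})
toy_vecs = np.random.default_rng(42).normal(size=(len(toy_df), 300))

# Build every embedding column at once, aligned on the same index
vec_df = pd.DataFrame(
    toy_vecs,
    columns=[f"entry_ft_vec_{j}" for j in range(toy_vecs.shape[1])],
    index=toy_df.index,
)
wide_df = pd.concat([toy_df, vec_df], axis=1)
print(wide_df.shape)  # (4, 302)
```

The same pattern would apply to the token-level and centroid frames, since they are widened by the same kind of loop.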



## Grid search on clustering hyperparameters

Now the real work begins. The results above suffice for a first analysis but lack granularity, so we'll search over a predetermined hyperparameter space and evaluate each combination with metrics such as the silhouette, Calinski-Harabasz, and Davies-Bouldin scores.
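
Silhouette and Calinski-Harabasz are better when higher, Davies-Bouldin is better when lower, and the three live on very different scales. One way to rank configurations is therefore to z-score each metric and combine them with the appropriate sign. The sketch below does this with the standard library on made-up metric values (not results from this notebook); `rank_configs` is an illustrative name.

```python
from statistics import mean, pstdev

def zscores(values):
    """Population z-scores; the epsilon guards against a zero spread."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / (sd + 1e-12) for v in values]

def rank_configs(configs):
    """Sort configs by z(silhouette) + z(calinski_harabasz) - z(davies_bouldin)."""
    z_sil = zscores([c["silhouette"] for c in configs])
    z_ch = zscores([c["calinski_harabasz"] for c in configs])
    z_db = zscores([c["davies_bouldin"] for c in configs])
    scored = [
        {**c, "rank_score": s + h - d}
        for c, s, h, d in zip(configs, z_sil, z_ch, z_db)
    ]
    return sorted(scored, key=lambda c: c["rank_score"], reverse=True)

# Toy metric values for three hypothetical configurations
toy = [
    {"name": "A", "silhouette": 0.66, "calinski_harabasz": 30000.0, "davies_bouldin": 0.31},
    {"name": "B", "silhouette": 0.50, "calinski_harabasz": 12000.0, "davies_bouldin": 0.60},
    {"name": "C", "silhouette": 0.70, "calinski_harabasz": 15000.0, "davies_bouldin": 0.45},
]
ranked = rank_configs(toy)
for cfg in ranked:
    print(cfg["name"], round(cfg["rank_score"], 2))
```

An equal-weight sum is a simple choice. Because Calinski-Harabasz tends to grow with the number of clusters, it can dominate the ranking, so the top configurations are worth inspecting by eye as well.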

In [None]:
# Result of the grid search below (step 2)
best = {
        "n_neighbors": 30, "min_dist": 0.00, "n_components": 25.0, "k": 40.0,
        "silhouette": 0.662235, "calinski_harabasz": 33685.510763, "davies_bouldin": 0.307364
    }

In [None]:
# The commented grid search took 2 hrs to run on my cpu - the top 10 hyperparameter combinations are saved and printed

RANDOM_STATE = 42
SOFTMAX_TEMP = 0.05 # Low temperature -> larger, sharper peaks in the probabilities

def softmax_rows(z, temp=SOFTMAX_TEMP):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z / max(temp, 1e-6))
    return e / (e.sum(axis=1, keepdims=True) + 1e-9)

def zscore(s):
    s = (s - np.nanmean(s)) / (np.nanstd(s) + 1e-12)
    return np.where(np.isfinite(s), s, 0.0)

def simple_tokenize(s: str):
    s = str(s).lower()
    s = re.sub(r"[^0-9a-z\u0600-\u06FF\s]+", " ", s)
    return [t for t in s.split() if t]

# --- 1) Load entry-level embeddings and vocab
entry_df = pd.read_excel("../data/clusters/mixnlp_entry_level.xlsx", sheet_name="entry_level")
vocab_df = pd.read_excel("../data/vocab/finalized_vocabulary.xlsx")  # path as per your setup
vocab = set(str(x).strip().lower() for x in vocab_df.iloc[:,0].dropna().astype(str))

emb_cols = [c for c in entry_df.columns if c.startswith("entry_ft_vec_")]
X = entry_df[emb_cols].to_numpy(np.float32)
X = normalize(X, axis=1)

# # --- 2) Search space
# grid = []
# for n_neighbors in [5,10,15,30]:
#     for min_dist in [0.0,0.05,0.1,0.3]:
#         for n_components in [10,25,50]:
#             for k in [10,15,20,25,30,35,40]:
#                 grid.append((n_neighbors, min_dist, n_components, k))

# rows = []
# for (nn, md, nc, k) in grid:
#     um = umap.UMAP(n_neighbors=nn, min_dist=md, n_components=nc,
#                    metric="cosine", random_state=RANDOM_STATE)
#     Zu = um.fit_transform(X)

#     km = KMeans(n_clusters=k, n_init=30, random_state=RANDOM_STATE)
#     labels = km.fit_predict(Zu)

#     sil = silhouette_score(Zu, labels, metric="euclidean") if k>1 else np.nan
#     ch  = calinski_harabasz_score(Zu, labels) if k>1 else np.nan
#     db  = davies_bouldin_score(Zu, labels) if k>1 else np.nan

#     rows.append({
#         "n_neighbors": nn, "min_dist": md, "n_components": nc, "k": k,
#         "silhouette": sil, "calinski_harabasz": ch, "davies_bouldin": db
#     })

# res = pd.DataFrame(rows)
# res["rank_score"] = zscore(res["silhouette"]) + zscore(res["calinski_harabasz"]) - zscore(res["davies_bouldin"])
# res_top = res.sort_values("rank_score", ascending=False).head(40)
# print(res_top.head(10))

# # --- 3) Fit best and produce probabilities
# best = res_top.iloc[0]
um = umap.UMAP(n_neighbors=int(best['n_neighbors']), min_dist=float(best['min_dist']),
               n_components=int(best['n_components']), metric="cosine", random_state=RANDOM_STATE)
Zu = um.fit_transform(X)

km = KMeans(n_clusters=int(best['k']), n_init=50, random_state=RANDOM_STATE)
labels = km.fit_predict(Zu)
dists = km.transform(Zu)
probs = softmax_rows(-dists, temp=SOFTMAX_TEMP)

entry_df = entry_df.copy()
entry_df["entry_cluster_tuned"] = labels
for j in range(int(best['k'])):
    entry_df[f"cat_prob_{j}"] = probs[:, j]

# --- 4) Cluster previews for naming
previews = []
for cid in range(int(best['k'])):
    idx = np.where(labels == cid)[0]
    toks = []
    for i in idx:
        toks.extend([t for t in simple_tokenize(entry_df.loc[i, "Interpretable"]) if t in vocab])
    common = Counter(toks).most_common(12)
    previews.append({"cluster": cid, "size": len(idx), "top_terms": ", ".join([w for w,_ in common])})

preview_df = pd.DataFrame(previews).sort_values("size", ascending=False).reset_index(drop=True)

print("""
Best 10 combinations:
     n_neighbors  min_dist  n_components   k  silhouette  calinski_harabasz  
265           30      0.00            25  40    0.662235       33685.510763   
258           30      0.00            10  40    0.647479       34981.466116   
272           30      0.00            50  40    0.662784       30579.911122   
293           30      0.05            50  40    0.651980       27324.189877   
279           30      0.05            10  40    0.641933       29129.018371   
286           30      0.05            25  40    0.656583       26283.725487   
257           30      0.00            10  35    0.642186       26088.166305   
307           30      0.10            25  40    0.653649       23738.015577   
256           30      0.00            10  30    0.729176       17417.443680   
263           30      0.00            25  30    0.716514       17612.109369   
""")


Best 10 combinations:
     n_neighbors  min_dist  n_components   k  silhouette  calinski_harabasz  
265           30      0.00            25  40    0.662235       33685.510763   
258           30      0.00            10  40    0.647479       34981.466116   
272           30      0.00            50  40    0.662784       30579.911122   
293           30      0.05            50  40    0.651980       27324.189877   
279           30      0.05            10  40    0.641933       29129.018371   
286           30      0.05            25  40    0.656583       26283.725487   
257           30      0.00            10  35    0.642186       26088.166305   
307           30      0.10            25  40    0.653649       23738.015577   
256           30      0.00            10  30    0.729176       17417.443680   
263           30      0.00            25  30    0.716514       17612.109369   



### Save models and cluster previews

In [108]:
preview_df.to_excel("../data/clusters/cluster_previews.xlsx", index=False)
preview_df.head(10) 

| | cluster | size | top_terms |
|---|---|---|---|
| 0 | 39 | 425 | clamp, model, copper, pipe, wall, weic, bar, r... |
| 1 | 3 | 387 | اسود, توريد, unknown, ماسور, الواح, تحويل, سرا... |
| 2 | 24 | 385 | pvc, cable, size, configuration, color, cu, st... |
| 3 | 2 | 243 | unknown, قفي, فيز |
| 4 | 0 | 145 | سابك, حديد, تسليح, نظامي, watani, مفصل, طلب, خ... |
| 5 | 25 | 132 | ppr, aquaterra, elbow, tah, adaptor, reducer, ... |
| 6 | 4 | 107 | حديد, تسليح, سعودي, مجدول, راجح, سابك, اتفاق, ... |
| 7 | 5 | 105 | حديد, تسليح, اتفاق, يمام, ضغط, توريج |
| 8 | 29 | 95 | steel, bar, tube, reinforcing, rebar, saudi, i... |
| 9 | 21 | 82 | حديد, صاج, اسود, عجم, مبسط, نبل, نقا, ياباني, ... |


In [107]:
# Save fitted models
os.makedirs("../assets/models", exist_ok=True) 
joblib.dump(um, "../assets/models/umap_best.pkl")
joblib.dump(km, "../assets/models/kmeans_best.pkl")

['../assets/models/kmeans_best.pkl']

Now that we have the best clusters found by the grid search, we need to interpret them and rename them to easy-to-understand categories.

# Interpreting clusters into categories

After analyzing the data, I've come up with this categorization scheme:
1. Metals & Structural
    - Rebar & Reinforcement
    - Steel Sheets/Coils/Plates
    - Structural Sections (Channels/IPE/HEB)
2. Pipes & Plumbing / Mechanical
    - PPR Pipes & Fittings
    - Copper Pipes & Fittings (HVAC/Plumbing)
    - Steel/Black Pipes
3. Electrical Cables & Cable Mgmt
    - Power Cables (PVC/XLPE LV)
    - Wires & Earthing
    - Cable Trays & Ladder / HDG Accessories
4. Electrical Distribution & Control
    - Switchgear & Metering (Breakers/ACBs)
5. Building Materials & Consumables
    - Concrete & Cement (SRC)
    - Fasteners / Lubricants / Misc.
6. Unknown / Noise
    - Unknown / Uninterpretable

*A total of 6 high level categories and 13 subcategories.*
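
One way to map a cluster's top terms onto such a scheme is keyword-set matching with a fixed precedence order. The sketch below is a deliberately tiny version with made-up keyword sets (`TOY_RULES`) and a hypothetical `categorize` helper; it is not the full rule set.

```python
import re

# (category, subcategory, keywords); earlier rules take precedence
TOY_RULES = [
    ("Electrical Cables & Cable Mgmt", "Power Cables (PVC/XLPE LV)", {"cable", "xlpe"}),
    ("Pipes & Plumbing / Mechanical", "PPR Pipes & Fittings", {"ppr", "elbow"}),
    ("Metals & Structural", "Rebar & Reinforcement", {"rebar", "تسليح"}),
]

def categorize(top_terms):
    """Return the first (category, subcategory) whose keywords overlap the terms."""
    tokens = {t for t in re.split(r"[,;\s]+", str(top_terms).lower()) if t}
    for category, subcategory, keywords in TOY_RULES:
        if tokens & keywords:
            return category, subcategory
    return "Unknown / Noise", "Unknown / Uninterpretable"

print(categorize("ppr, aquaterra, elbow, tah"))
print(categorize("حديد, تسليح, سعودي"))
print(categorize("unknown"))
```

Rule order matters: a cluster whose terms hit several keyword sets is assigned to whichever rule is checked first, so the more specific categories should come earlier.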

In [104]:
# Reload after reset
previews = pd.read_excel("../data/clusters/cluster_previews.xlsx")

def tokenize_terms(s: str):
    s = str(s).lower()
    s = re.sub(r"[^0-9a-z\u0600-\u06FF,\s\-]+", " ", s)
    toks = [t.strip(" -_/") for t in re.split(r"[,;\s]+", s) if t.strip()]
    return set(toks)

PIPES_CORE = {
    "pipe","ماسور","ماسورة","ماسير","انبوب","انابيب","كوع","tee","elbow","adaptor","adapter",
    "reducer","coupling","union","socket","nipple","cap","plug","class","كلاس","sch40","schedule",
    "seamless","upvc","u-pvc","pvc-pipe","pvcpipe","منار","manar","tah","thread","thd","length"
}
PIPES_BRANDS = {"ppr","aquaterra"}
COPPER_PIPE_HINTS = {"copper","cu","hvac","ac","split","flare","swage","insulation","clamp","hanger"}
BLACK_PIPE_HINTS = {"black","اسود","seamless"}

REBAR_STRONG = {
    "rebar","تسليح","deformed","straight","ittifaq","اتفاق","watani","وطني","سابك","راجح","يمام","saudi"
}
SECTIONS = {"ipe","heb","beam","channel","main","jr","mmx","pcs","ipe100","ipe200"}
SHEETS = {"sheet","sheets","plate","plates","coil","coils","gi","hr","checkered","صاج","مشرح","مشرّح","ويل"}
MESH_ROD = {"mesh","wire-rod","wire_rod","rod","wire","املس"}
FABRICATION = {"cut","bend","bending","قص","ثني"}

CABLES = {"cable","xlpe","swa","armoured","armored","awg","meter","size","configuration","color"}
WIRES = {"سلك","earthing","ارضى","ارضي","ground","bare","نحاس"}
TRAYS = {"hdg","tray","ladder","cover","angle","osf"}

SWITCHGEAR = {"xt","ekip","ls","tmd","tmg","tmf","iec","contactor","acb","mccb","breaker","sg","hm","box","af","hz","dc","tma","inn"}
METERING = {"measuring","meter","wmp","kbi","lsi","ct","vt","touch"}

READY_MIX = {"خرسان","خرسانه","جاهز","src"}
CONSUMABLES_MISC = {"anchor","screw","hook","fix","stand","petromin","hydraulic","oil","lubricant","سيراميك","ديكور"}

UNKNOWN_TOK = {"unknown"}

def score_category(tokens: set):
    if (tokens & UNKNOWN_TOK) and (len(tokens) <= 3):
        return ("Unknown / Noise", "Unknown / Uninterpretable")

    has_pipe_core = len(tokens & PIPES_CORE) > 0
    has_pipe_brand = len(tokens & PIPES_BRANDS) > 0
    has_copper_hint = len(tokens & COPPER_PIPE_HINTS) > 0
    has_black_hint = len(tokens & BLACK_PIPE_HINTS) > 0

    has_rebar = len(tokens & REBAR_STRONG) > 0
    has_sections = len(tokens & SECTIONS) > 0
    has_sheets = len(tokens & SHEETS) > 0
    has_meshrod = len(tokens & MESH_ROD) > 0
    has_fab = len(tokens & FABRICATION) > 0

    has_cables = len(tokens & CABLES) > 0
    has_wires  = len(tokens & WIRES) > 0
    has_trays  = len(tokens & TRAYS) > 0

    has_switchgear = len(tokens & SWITCHGEAR) > 0
    has_metering   = len(tokens & METERING) > 0

    has_ready = len(tokens & READY_MIX) > 0
    has_cons  = len(tokens & CONSUMABLES_MISC) > 0

    # D) Electrical Distribution & Control
    if has_switchgear or has_metering:
        return ("Electrical Distribution & Control", "Switchgear & Metering (Breakers/ACBs)")

    # C) Electrical Cables & Cable Mgmt
    if has_trays:
        return ("Electrical Cables & Cable Mgmt", "Cable Trays & Ladder / HDG Accessories")
    if ("pvc" in tokens and "cable" in tokens) or (has_cables and not has_switchgear):
        return ("Electrical Cables & Cable Mgmt", "Power Cables (PVC/XLPE LV)")
    if has_wires and not has_switchgear:
        return ("Electrical Cables & Cable Mgmt", "Wires & Earthing")

    # B) Pipes & Plumbing / Mechanical  (precedence over A)
    if has_pipe_brand:
        return ("Pipes & Plumbing / Mechanical", "PPR Pipes & Fittings")
    if has_pipe_core:
        if has_copper_hint:
            return ("Pipes & Plumbing / Mechanical", "Copper Pipes & Fittings (HVAC/Plumbing)")
        # Every other pipe cluster (with or without a black / "length" hint) maps to Steel/Black Pipes.
        # A separate copper check after this block would be unreachable, because its pipe tokens
        # are all members of PIPES_CORE.
        return ("Pipes & Plumbing / Mechanical", "Steel/Black Pipes")

    # A) Metals & Structural
    if has_rebar or ("rebar" in tokens):
        return ("Metals & Structural", "Rebar & Reinforcement")
    if has_sheets:
        return ("Metals & Structural", "Steel Sheets/Coils/Plates")
    if has_sections:
        return ("Metals & Structural", "Structural Sections (Channels/IPE/HEB)")

    # E) Building Materials & Consumables
    if has_ready:
        return ("Building Materials & Consumables", "Ready-Mix Concrete & Cement (SRC)")
    if has_cons:
        return ("Building Materials & Consumables", "Fasteners/Lubricants/Misc")

    return ("Unknown / Noise", "Unknown / Uninterpretable")

mapped_rows = []
for _, r in previews.iterrows():
    cid = int(r["cluster"])
    toks = tokenize_terms(r["top_terms"])
    overall, sub = score_category(toks)
    mapped_rows.append({
        "cluster": cid,
        "size": int(r["size"]),
        "top_terms": r["top_terms"],
        "overall_category": overall,
        "subcategory": sub
    })

mapped = pd.DataFrame(mapped_rows).sort_values(["overall_category","subcategory","size"], ascending=[True,True,False]).reset_index(drop=True)

out_map = "../data/clusters/cluster_to_category_mapping.xlsx"
with pd.ExcelWriter(out_map) as writer:
    mapped.to_excel(writer, sheet_name="cluster_mapping", index=False)
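
The order of the `if` blocks in `score_category` matters: the first rule that matches wins, so a cluster whose top terms hit several keyword sets receives the label of the earliest rule. Below is a minimal, self-contained sketch of that first-match-wins idea. The reduced `RULES` table and the `first_match` helper are hypothetical and only mirror the principle, not the full mapping above.

```python
# Hypothetical, reduced rule table: list order encodes precedence (earlier wins).
RULES = [
    ("Electrical Distribution & Control", {"xt", "ekip", "breaker", "contactor"}),
    ("Electrical Cables & Cable Mgmt", {"cable", "xlpe", "tray"}),
    ("Pipes & Plumbing / Mechanical", {"pipe", "elbow", "ppr"}),
]


def first_match(tokens, rules=RULES, default="Unknown / Noise"):
    """Return the label of the first rule whose keyword set overlaps the tokens."""
    for label, keywords in rules:
        if tokens & keywords:
            return label
    return default


# "breaker" and "cable" both match, but the switchgear rule is listed first.
print(first_match({"cable", "breaker"}))
print(first_match({"ppr", "tee"}))
print(first_match({"foo"}))
```

Reordering `RULES` changes the label of ambiguous clusters, which is why the precedence in the mapping cell should be reviewed whenever a keyword is added to more than one set.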

In [110]:
mapped

Unnamed: 0,cluster,size,top_terms,overall_category,subcategory
0,37,7,"petromin, hydraulic, aw, oil, lubricant, nepr",Building Materials & Consumables,Fasteners/Lubricants/Misc
1,18,13,"خرسان, مقاوم, جاهز, مسلح, خرسانه, src",Building Materials & Consumables,Ready-Mix Concrete & Cement (SRC)
2,1,72,"hdg, tray, osf, plate, length, angle, cover, d...",Electrical Cables & Cable Mgmt,Cable Trays & Ladder / HDG Accessories
3,24,385,"pvc, cable, size, configuration, color, cu, st...",Electrical Cables & Cable Mgmt,Power Cables (PVC/XLPE LV)
4,35,65,"سلك, سويدي, كيبل, نحاس, ارضي, احمر, اسود, اصفر...",Electrical Cables & Cable Mgmt,Wires & Earthing
5,16,79,"xt, ekip, ls, dip, dc, iec, af, hz, contactor,...",Electrical Distribution & Control,Switchgear & Metering (Breakers/ACBs)
6,11,53,"xt, tmd, tmg, tmf, fp, ef",Electrical Distribution & Control,Switchgear & Metering (Breakers/ACBs)
7,9,31,"breaker, circuit, sg, hm, battery, box",Electrical Distribution & Control,Switchgear & Metering (Breakers/ACBs)
8,10,27,"xt, tma, inn",Electrical Distribution & Control,Switchgear & Metering (Breakers/ACBs)
9,34,23,"ekip, lsi, wmp, dip, touch, sw, measuring, kbi...",Electrical Distribution & Control,Switchgear & Metering (Breakers/ACBs)


Now we can prepare the analysis-ready dataset!

# Analysis ready dataset

To the original purchase-order items (after EDA: `cleaned_num.xlsx`), we'll add Item Name Clean plus the category and subcategory of each entry.

We'll save the processed dataset at: `../data/processed-purchase-order-items.xlsx`
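
A minimal sketch of that join, assuming each row already carries its cluster id. The toy frames and the column names `Item Name`, `Item Name Clean` and `cluster` are assumptions standing in for the real files, and the export line is left commented out.

```python
import pandas as pd

# Toy stand-ins for the cleaned items and the cluster mapping (assumed columns).
items = pd.DataFrame({
    "Item Name": ["PPR pipe 20mm", "XT1 TMD breaker", "???"],
    "Item Name Clean": ["ppr pipe", "xt tmd", "unknown"],
    "cluster": [4, 11, 0],
})
mapping = pd.DataFrame({
    "cluster": [4, 11],
    "overall_category": [
        "Pipes & Plumbing / Mechanical",
        "Electrical Distribution & Control",
    ],
    "subcategory": [
        "PPR Pipes & Fittings",
        "Switchgear & Metering (Breakers/ACBs)",
    ],
})

# Left join keeps every purchase-order row, even if its cluster has no mapping;
# validate raises if a cluster id appears more than once in the mapping.
ready = items.merge(mapping, on="cluster", how="left", validate="many_to_one")

# Rows without a mapped cluster fall back to the explicit Unknown bucket.
ready["overall_category"] = ready["overall_category"].fillna("Unknown / Noise")
ready["subcategory"] = ready["subcategory"].fillna("Unknown / Uninterpretable")

# ready.to_excel("../data/processed-purchase-order-items.xlsx", index=False)
print(ready[["Item Name Clean", "overall_category", "subcategory"]])
```

The left join is a deliberate choice here: an inner join would silently drop purchase-order rows whose cluster was never mapped.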

# Appendix

## Testing BERTopic

The analysis of each cell appears directly below it.

In [79]:
# First attempt: paraphrase-multilingual-MiniLM-L12-v2 as embedder, CountVectorizer for topic terms,
# UMAP for dimensionality reduction and HDBSCAN for (hierarchical) clustering
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from umap import UMAP
import hdbscan

en_stop = set(ENGLISH_STOP_WORDS)
with open('../assets/AR_stop_words.txt', encoding='utf-8') as f:
    AR_STOP = set(line.strip() for line in f if line.strip())
stop_words = list(en_stop | AR_STOP)

# 1) Load preprocessed texts
texts = df["Combined"].astype(str).tolist() 

# 2) Multilingual embeddings (CPU okay)
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# 3) Vectorizer tuned for short text
vectorizer_model = CountVectorizer(
    ngram_range=(1,3), min_df=2, stop_words=stop_words
)

# 4) UMAP + HDBSCAN
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=25, metric='euclidean',
                                cluster_selection_method='eom', prediction_data=True)

# 5) Build BERTopic
topic_model = BERTopic(
    embedding_model=embedder,
    vectorizer_model=vectorizer_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    calculate_probabilities=True,
    verbose=False
)

topics, probs = topic_model.fit_transform(texts)

# Inspect
topic_info = topic_model.get_topic_info()     # topic sizes + labels
top_words_0 = topic_model.get_topic(0)        # [(word, weight), ...]
topic_info.head(10)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,837,-1_توريد_bend_model_cut,"[توريد, bend, model, cut, weicco, wall, سلك, c...","[cut bend rebar, cut bend rebar, cut bend rebar]"
1,0,241,0____,"[, , , , , , , , , ]","[unknown, unknown, unknown]"
2,1,162,1_cu_kv_cable_pvc,"[cu, kv, cable, pvc, cu xlpe, xlpe, earth, len...","[riyadh cable cu xlpe pvc kv black, riyadh cab..."
3,2,114,2_كيبل رياض_رياض_ac_iec,"[كيبل رياض, رياض, ac, iec, fp, كيبل, سرايا, dc...","[كيبل رياض مسلح sta xlpe pvc, كيبل رياض مسلح s..."
4,3,84,3_سابك حديد تسليح_حديد تسليح سابك_تسليح سابك_س...,"[سابك حديد تسليح, حديد تسليح سابك, تسليح سابك,...","[حديد تسليح سابك, حديد تسليح سابك, حديد تسليح ..."
5,4,81,4_ppr_pipe_tahweeltm_ppr tahweeltm,"[ppr, pipe, tahweeltm, ppr tahweeltm, tee ppr,...","[equal tee ppr tahweeltm, ppr pipe, ppr pipe]"
6,5,80,5_صاج حديد_اسود صاج_صاج_اسود,"[صاج حديد, اسود صاج, صاج, اسود, حديد مجلفن, حد...","[صاج حديد اسود, صاج حديد اسود, صاج حديد اسود]"
7,6,80,6_اتفاق حديد_اتفاق_حديد تسليح_تسليح,"[اتفاق حديد, اتفاق, حديد تسليح, تسليح, حديد, ض...","[حديد تسليح اتفاق, حديد تسليح اتفاق, حديد تسلي..."
8,7,78,7_صاج اسود_اسود صاج_صاج_اسود,"[صاج اسود, اسود صاج, صاج, اسود, املس, ويل املس...","[صاج اسود, صاج اسود, صاج اسود]"
9,8,77,8_aquaterra_elbow_ppr_aquaterra ppr,"[aquaterra, elbow, ppr, aquaterra ppr, adaptor...","[aquaterra ppr elbow, aquaterra ppr female elb..."


In [80]:
topic_model.visualize_topics()

It is pretty cool that BERTopic can create all these topics and their hierarchies, but many entries were flagged as outliers: topic -1 alone holds 837 of them.
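
To compare runs, it helps to track the outlier share as one number. A minimal sketch, assuming `topics` is the plain list of topic ids returned by `fit_transform`; the `outlier_share` helper and the toy list are hypothetical.

```python
from collections import Counter


def outlier_share(topics):
    """Fraction of documents left unassigned by HDBSCAN (topic id -1)."""
    if not topics:
        return 0.0
    return Counter(topics)[-1] / len(topics)


# Toy assignment list standing in for the real `topics` output.
toy_topics = [-1, 0, 0, 1, -1, 2, -1, 1]
print(f"{outlier_share(toy_topics):.1%} of entries are outliers")
```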

In [81]:
# --- Corpus ---
texts = df["Interpretable"].astype(str).tolist() 

# ---- Multilingual embedding model (good for Arabic + English + short phrases) ----
# Alternatives to try: "sentence-transformers/LaBSE", "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
EMBED_MODEL_NAME = "sentence-transformers/distiluse-base-multilingual-cased-v2"
embedder = SentenceTransformer(EMBED_MODEL_NAME)

def embed(texts):
    return embedder.encode(texts, batch_size=64, show_progress_bar=True, normalize_embeddings=True)

# ---- Vectorizer tuned for Arabic/English tokens and short docs ----
# Token pattern keeps Arabic letters, English, digits, hyphen/dot (e.g., D3-PANEL, M12.5).
token_pattern = r"(?u)\b[\w\u0600-\u06FF][\w\-\.\u0600-\u06FF]{0,}\b"

vectorizer = CountVectorizer(
    token_pattern=token_pattern,
    ngram_range=(1, 2),   # bi-grams help short docs
    min_df=2,             # cut pure singletons noise
    lowercase=True
)

# ---- Embedding-based labeler (far better than c-TF-IDF for tiny docs) ----
repr_model = KeyBERTInspired()  # uses the same embedding backend internally

# (Optional) Encourage diversity in labels:
# repr_model = MaximalMarginalRelevance(diversity=0.3)



hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=15,          # tune up/down to change granularity
    min_samples=None,             # None => equal to min_cluster_size by default
    metric='euclidean',
    prediction_data=True,         # keep the data HDBSCAN needs for soft clustering / later predictions
    cluster_selection_method='eom'
)

topic_model = BERTopic(
    embedding_model=embedder,
    vectorizer_model=vectorizer,
    representation_model=repr_model,
    hdbscan_model=hdbscan_model,
    verbose=True
)

embeddings = embed(texts)
topics, probs = topic_model.fit_transform(texts, embeddings=embeddings)

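# Reassign outliers (-1) with reduce_outliers.
# Strategy "probabilities" relies on HDBSCAN's topic-document probabilities, which BERTopic documents as
# requiring calculate_probabilities=True (not set in this cell), so treat this step with caution.
# threshold=0.0 accepts any reassignment; a higher value (e.g. 0.25-0.35) keeps weak matches as outliers.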
topics = topic_model.reduce_outliers(
    documents=texts,
    topics=topics,
    probabilities=probs,
    strategy="probabilities",
    threshold=0.0,        
    embeddings=embeddings  
)

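# Refresh the c-TF-IDF topic representations.
# Note: update_topics is called without topics=topics, so the reassigned labels are not applied to the
# model here, which is consistent with topic -1 still holding 505 rows in the table below.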
topic_model.update_topics(texts, vectorizer_model=vectorizer)
topic_model.generate_topic_labels(nr_words=4, topic_prefix=False)
info = topic_model.get_topic_info()
info

Batches: 100%|██████████| 50/50 [00:19<00:00,  2.62it/s]
2025-09-11 14:39:24,855 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-11 14:39:35,749 - BERTopic - Dimensionality - Completed ✓
2025-09-11 14:39:35,752 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-11 14:39:35,918 - BERTopic - Cluster - Completed ✓
2025-09-11 14:39:35,930 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-11 14:39:38,770 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,505,-1_tee_ppr_golde_tray,"[tee, ppr, golde, tray, motor, class, mmw, elb...",[cable tray horizontal unequal tee mmw hx thk ...
1,0,244,0_unknown_قفي_اسمنت_ارضي,"[unknown, قفي, اسمنت, ارضي, , , , , , ]","[unknown, unknown, unknown]"
2,1,90,1_تسليح سابك_سابك حديد_سابك_حديد تسليح,"[تسليح سابك, سابك حديد, سابك, حديد تسليح, تسلي...","[حديد تسليح سابك, حديد تسليح سابك, حديد تسليح ..."
3,2,86,2_سيراميك_ارضي_iec_ac,"[سيراميك, ارضي, iec, ac, مقاوم, غراء, stud, جل...","[ac iec, ac iec, ac iec]"
4,3,84,3_اتفاق_حديد تسليح_تسليح_حديد,"[اتفاق, حديد تسليح, تسليح, حديد, حديد اتفاق, م...","[حديد تسليح اتفاق, حديد تسليح اتفاق, حديد تسلي..."
...,...,...,...,...,...
67,66,17,66_حديد ويل_ويل_iron_حديد,"[حديد ويل, ويل, iron, حديد, , , , , , ]","[حديد ويل, حديد ويل, حديد ويل]"
68,67,16,67_حديد مجدول_مربع_مجدول_md,"[حديد مجدول, مربع, مجدول, md, فارغ, مستطيل, حد...","[حديد مجدول, حديد مجدول, حديد مجدول]"
69,68,16,68_ثلاثي_الخرسانيه_الواح_panel,"[ثلاثي, الخرسانيه, الواح, panel, توريد, اكسسوا...",[توريد اكسسوارات الواح الخرسانيه معزوله ثلاثي ...
70,69,16,69_model_weic_hanger_adjustable,"[model, weic, hanger, adjustable, strap, pipe,...","[weic sprinkle pipe hanger model wpcs, weic sp..."


In [82]:
topic_model.visualize_topics()

A little better, but many topics are duplicates with very minor differences.
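
One rough way to flag such near-duplicates is to compare the top terms of every topic pair with Jaccard overlap. This is a minimal sketch, not something BERTopic does for us: the `jaccard` and `near_duplicates` helpers, the 0.5 threshold and the toy term lists are all assumptions for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two term collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(topic_terms, threshold=0.5):
    """Return (topic_a, topic_b, score) for topic pairs whose top terms overlap heavily."""
    pairs = []
    for (t1, w1), (t2, w2) in combinations(sorted(topic_terms.items()), 2):
        score = jaccard(w1, w2)
        if score >= threshold:
            pairs.append((t1, t2, round(score, 2)))
    return pairs


# Toy top-term lists (hypothetical, loosely modelled on sheet-steel topics).
toy = {
    5: ["steel sheet", "black sheet", "sheet", "black"],
    7: ["sheet black", "black sheet", "sheet", "black"],
    4: ["ppr", "pipe", "tee ppr"],
}
print(near_duplicates(toy))
```

Pairs flagged this way are candidates for manual merging; the threshold would need tuning against the real representations.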

In [83]:
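from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# ---- 1) Sentence embeddings (multilingual) ----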
# LaBSE = strong AR/EN alignment; swap to paraphrase-multilingual-MiniLM-L12-v2 for speed
sbert = SentenceTransformer("sentence-transformers/LaBSE")
sbert_vecs = sbert.encode(texts, batch_size=128, show_progress_bar=True, normalize_embeddings=True)

# ---- 2) Char n-gram TF-IDF + SVD (subword-ish) ----
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3,5), min_df=2)
X_char = char_tfidf.fit_transform(texts)

svd = TruncatedSVD(n_components=256, random_state=42)
X_char_svd = svd.fit_transform(X_char)
X_char_svd = normalize(X_char_svd)

# ---- 3) Hybrid embedding: concat (dims: 768 + 256 = 1024) ----
emb = np.hstack([sbert_vecs, X_char_svd])

# ---- 4) K-Means instead of HDBSCAN ----
def estimate_k(n): 
    return max(12, int(round(np.sqrt(n))))
K = estimate_k(len(texts))
kmeans = KMeans(n_clusters=K, n_init=20, random_state=42)

# ---- 5) BERTopic: c-TF-IDF/MMR labels; pass our embeddings ----
word_vect = CountVectorizer(ngram_range=(1,3), min_df=2, token_pattern=r"(?u)\b\w+\b")
mmr = MaximalMarginalRelevance(diversity=0.3)

topic_model = BERTopic(
    embedding_model=None,           # we supply embeddings directly
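    umap_model=None,                # intended to skip UMAP, but None falls back to BERTopic's default UMAP (see the log below);
                                    # bertopic.dimensionality.BaseDimensionalityReduction() is the documented way to skip it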
    hdbscan_model=kmeans,           # sklearn clusterer accepted in 0.17.x
    vectorizer_model=word_vect,     # drives c-TF-IDF vocab
    representation_model=mmr,       # diversified labels
    calculate_probabilities=False,
    verbose=True
)

topics, _ = topic_model.fit_transform(texts, embeddings=emb)

# ---- 6) (Optional) Auto-merge similar topics inside BERTopic ----
topic_model.reduce_topics(texts, nr_topics="auto")
topic_model.update_topics(texts, n_gram_range=(1,3))

info = topic_model.get_topic_info()
info


Batches: 100%|██████████| 25/25 [00:28<00:00,  1.15s/it]
2025-09-11 14:40:14,920 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-11 14:40:28,883 - BERTopic - Dimensionality - Completed ✓
2025-09-11 14:40:28,885 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-11 14:40:30,128 - BERTopic - Cluster - Completed ✓
2025-09-11 14:40:30,130 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-11 14:40:30,224 - BERTopic - Representation - Completed ✓
2025-09-11 14:40:30,374 - BERTopic - Topic reduction - Reducing number of topics
2025-09-11 14:40:30,396 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-11 14:40:30,414 - BERTopic - Representation - Completed ✓
2025-09-11 14:40:30,414 - BERTopic - Topic reduction - Reduced number of topics from 56 to 23


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,1582,0_ppr_pvc_sheet_hdg,"[ppr, pvc, sheet, hdg, cable, hr, tray, panel,...","[ppr pipe, ppr equal tee, aquaterra ppr equal ..."
1,1,428,1_watani watani_watani watani watani_watani_md,"[watani watani, watani watani watani, watani, ...","[watani حديد نظامي, watani حديد نظامي, watani ..."
2,2,244,2_unknown unknown unknown_unknown unknown_unkn...,"[unknown unknown unknown, unknown unknown, unk...","[unknown, unknown, unknown]"
3,3,182,3_cp_ot cp_change_change switch,"[cp, ot cp, change, change switch, ot cp chang...","[ot cp change switch, ot cp change switch, ot ..."
4,4,136,4_xt_ekip_xt tmd_dip,"[xt, ekip, xt tmd, dip, ekip dip, tmd, tmd xt,...","[xt tmg, xt ekip dip lsi, xt tmd]"
5,5,130,5_deformed straight_deformed straight bar_stra...,"[deformed straight, deformed straight bar, str...","[black deformed straight bar سابك, black defor..."
6,6,83,6_nya_aw nya nya_upright_pendin upright,"[nya, aw nya nya, upright, pendin upright, pen...","[سلك سويدي aw, سلك مفرد nya, سلك مفرد nya]"
7,7,63,7____,"[, , , , , , , , , ]","[ماسور حديد, ماسور حديد, ماسور حديد مقا]"
8,8,48,8_configuration_size_color_size color,"[configuration, size, color, size color, style...",[brake pads shoes lin ers mx configuration dru...
9,9,33,9_model_weic_lined split clamp_rubber lined,"[model, weic, lined split clamp, rubber lined,...","[weic rubber lined split clamp model wsc, weic..."


In [84]:
topic_model.visualize_topics()

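Many subtle categories/topics are ignored: after the automatic reduction from 56 to 23 topics, topic 0 alone absorbs 1,582 entries spanning PPR, PVC, sheets, HDG and cable trays.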


In [85]:
# ---- 1) Sentence embeddings only (LaBSE) ----
sbert = SentenceTransformer("sentence-transformers/LaBSE")
sbert_vecs = sbert.encode(texts, batch_size=128, show_progress_bar=True, normalize_embeddings=True)  # (N, 768)

# ---- 2) K-Means instead of HDBSCAN ----
def estimate_k(n):
    return max(12, int(round(np.sqrt(n))))
K = estimate_k(len(texts))
kmeans = KMeans(n_clusters=K, n_init=20, random_state=42)

# ---- 3) BERTopic with KeyBERTInspired (doc & word embeddings both from LaBSE) ----
topic_model = BERTopic(
    embedding_model=sbert,          # used for both docs & words (768-d)
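    umap_model=None,                # intended to skip UMAP, but None falls back to BERTopic's default UMAP (see the log below)
    hdbscan_model=kmeans,           # sklearn clusterer accepted in 0.17.x
    vectorizer_model=None,          # default CountVectorizer; KeyBERTInspired re-ranks candidate words by embedding similarity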
    representation_model=KeyBERTInspired(),
    calculate_probabilities=False,
    verbose=True
)

topics, _ = topic_model.fit_transform(texts, embeddings=sbert_vecs)

# ---- 4) (Optional) Auto-merge similar topics ----
# topic_model.reduce_topics(texts, nr_topics="auto")

info = topic_model.get_topic_info()
info

Batches: 100%|██████████| 25/25 [00:28<00:00,  1.16s/it]
2025-09-11 14:41:04,662 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-11 14:41:18,437 - BERTopic - Dimensionality - Completed ✓
2025-09-11 14:41:18,438 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-11 14:41:19,741 - BERTopic - Cluster - Completed ✓
2025-09-11 14:41:19,748 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-09-11 14:41:26,394 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,285,0_hollow_fixed_british_checkered,"[hollow, fixed, british, checkered, water, whi...","[british chef utility knife, british chef util..."
1,1,244,1_unknown_نجران_قفي_فيز,"[unknown, نجران, قفي, فيز, , , , , , ]","[unknown, unknown, unknown]"
2,2,164,2_unequal_new_معزوله_single,"[unequal, new, معزوله, single, qha, غامق, hung...",[توريد مواد تقني الواح الخرسانيه معزوله ثلاثي ...
3,3,164,3_معنون_متنوع_fresh_solt,"[معنون, متنوع, fresh, solt, نائم, recessed, ap...","[square floor cleanout gasket fd, wall light i..."
4,4,145,4_نمساوي_tinted_محبس_dn,"[نمساوي, tinted, محبس, dn, dag, normal, ko, st...","[bend pvc aplac, bend pvc aplac, bend pvc aplac]"
5,5,119,5_naked_seamless_low_smart,"[naked, seamless, low, smart, online, needed, ...","[bpe welded pipe plain end sch, bpe welded pip..."
6,6,117,6_wireless_moe_af_inn,"[wireless, moe, af, inn, uni, managed, inbio, ...","[af hz dc contactor, af hz dc contactor, af hz..."
7,7,114,7_enclosed_slitted_heb_ogn,"[enclosed, slitted, heb, ogn, مشترك, قاطع, nbt...","[قاطع علبه امبير فنار, قاطع علبه امبير فنار, ق..."
8,8,105,8_خاص_مصمت_مسحوب_outdoor,"[خاص, مصمت, مسحوب, outdoor, small, indoor, wet...","[كيبل رياض عادي, كيبل رياض عادي, كيبل رياض عادي]"
9,9,88,9_open_plain_hung_checkered,"[open, plain, hung, checkered, india, slotted,...","[hdg osf tray length, hdg osf tray length, hdg..."


In [86]:
topic_model.visualize_topics()

I think we should try another approach. So far, BERTopic captures entry-level meaning but does not convey the global relationships between entries well in our case. It thrives on context and many examples, and we lack both.

In [None]:
# info.to_excel("temp.xlsx")