# SEMANTIC_LAYER.ipynb 
# Semantic-Based Clustering Method for Hierarchy Engine

---
## Overview
- Docstring and Imports
- K Selection Helper Function
- Core Semantic-Based Clustering Logic

## 1 - Docstring and Imports
The Semantic Layer embeds string labels and clusters them by the meaning of the words within each label.

Embedding-based semantic clustering is especially effective when the data contains structured, category-like columns, or when existing text columns are dense and consistently formatted. In these scenarios, semantic clustering runs quickly and produces results that are easy to understand and validate. Within the context of Hierarchy, the Semantic Layer organizes large numbers of narrow categories underneath broad umbrellas.

The Semantic Layer is constructed using the sklearn library. Specifically, KMeans is used as the actual clustering agent. NumPy and Pandas are used for data management, and type hints are incorporated for clarity. Additionally, two helper functions are imported from `.text_utils`: `build_embeddings_for_labels` and `tfidf_cluster_label`.

In [None]:
# core/semantic_layer.py

"""
This file implements the Semantic Layer.

The Semantic Layer groups unique string labels using embedding-based
clustering to identify semantically similar categories, then maps those
cluster assignments back onto the original dataframe. Each semantic
cluster is also given a human-readable name derived from its member
labels using TF-IDF keyword scoring.

This layer is typically applied to categorical or hierarchical label
columns (e.g., category names) and is intended to sit above or alongside
attribute-based clustering layers.

Public API used by the hierarchy engine:

    build_semantic_layer(
        df,
        *,
        input_label_col,
        n_clusters,
        output_prefix,
        random_state=42
    ) -> pd.DataFrame
        Returns a copy of df with two new columns:
        `{output_prefix}_id` containing integer semantic cluster IDs
        and `{output_prefix}_name` containing human-readable cluster names.

Internal helpers:

    _choose_k(n_items, k_min=2, k_max=12) -> int
        Heuristic for selecting a reasonable number of clusters when
        an explicit value is not provided.
"""

# Type hints
from __future__ import annotations

# External dependencies
import numpy as np
import pandas as pd

# Sklearn dependencies
from sklearn.cluster import KMeans

# Internal dependencies
from .text_utils import build_embeddings_for_labels, tfidf_cluster_label

## 2 - K Heuristic
This helper function picks a reasonable k for KMeans when no explicit value is given. The heuristic sets k to the square root of the number of items, rounded down, then clips the result to prevent over-fragmentation and to ensure that there is more than one cluster whenever possible. The function takes three integer arguments: `n_items`, `k_min`, and `k_max`. In practice, `n_items` is the number of items to cluster (the number of unique labels when called from `build_semantic_layer`), while `k_min` and `k_max` are lower and upper bounds for k with default values 2 and 12, respectively.

Line-by-line breakdown:
- If `n_items` is 1 or less, the function returns `k=1` (there cannot be more than one cluster in this scenario).
- k is initialized as the square root of `n_items`, truncated to an integer.
- k is clipped to the range [`k_min`, `k_max`] via `max(k_min, min(k_max, k))`.
- k is clipped again to be at most `n_items`, so there are never more clusters than items.
- The clipped k is returned.

In [None]:
def _choose_k(n_items: int, k_min: int = 2, k_max: int = 12) -> int:
    if n_items <= 1:
        return 1
    k = int(np.sqrt(n_items))
    k = max(k_min, min(k_max, k))
    k = min(k, n_items)
    return k
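
To make the clipping behavior concrete, the sketch below evaluates a standalone copy of the heuristic for a few input sizes. The name `choose_k` and the use of `math.sqrt` in place of `np.sqrt` are assumptions made so the example runs without NumPy; the logic is otherwise the same as the function above.

```python
import math


def choose_k(n_items: int, k_min: int = 2, k_max: int = 12) -> int:
    """Standalone copy of the heuristic, for illustration only."""
    if n_items <= 1:
        return 1
    k = int(math.sqrt(n_items))      # floor of the square root
    k = max(k_min, min(k_max, k))    # clip into [k_min, k_max]
    return min(k, n_items)           # never more clusters than items


# Evaluate the heuristic for several item counts.
examples = {n: choose_k(n) for n in (1, 2, 3, 30, 100, 400)}
print(examples)
# {1: 1, 2: 2, 3: 2, 30: 5, 100: 10, 400: 12}
```

Small inputs are pulled up to `k_min`, and large inputs are capped at `k_max`.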

## 3 - Core Semantic-Based Clustering Logic

This function contains the semantic-layer construction. The function accepts the following arguments: a Pandas DataFrame `df`, a string `input_label_col`, an optional integer `n_clusters`, a string `output_prefix`, and an integer `random_state`. All arguments after `df` are keyword-only. In practice, `df` is the complete original dataset, `input_label_col` is the column containing text labels to embed and cluster, `n_clusters` is an explicit number of clusters (can be None, in which case k is chosen heuristically), `output_prefix` is a prefix added to output columns, and `random_state` is a seed for reproducibility. The function returns a Pandas DataFrame, which is a copy of the original dataset with two additional columns: `{output_prefix}_id` for cluster assignments and `{output_prefix}_name` for cluster names.

Due to the size of the function, it will be broken down into sections and explained line by line. This first section is simply the function signature.

In [None]:
def build_semantic_layer(
    df: pd.DataFrame,
    *,
    input_label_col: str,
    n_clusters: int | None,
    output_prefix: str,
    random_state: int = 42,
) -> pd.DataFrame:

### 3.1 - Preparation

This section creates a safe copy of the input dataset, generates a cleaned Series of labels using `input_label_col`, and creates a sorted list of unique, non-empty labels.

In [None]:
    df = df.copy()

    labels = (
        df[input_label_col]
        .fillna("")
        .astype(str)
        .str.strip()
    )
    unique_labels = sorted({v for v in labels if v})
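
As a quick illustration of what the cleaning chain does, the sketch below applies the same steps to a small made-up column. The column name `category` and the sample values are assumptions chosen to show missing values, stray whitespace, duplicates, and a non-string entry.

```python
import pandas as pd

# Hypothetical toy data; "category" is an assumed column name.
df = pd.DataFrame({"category": ["  Shoes", "Boots ", None, "", "Shoes", 42]})

labels = (
    df["category"]
    .fillna("")      # missing values become empty strings
    .astype(str)     # non-string values (e.g. 42) become strings
    .str.strip()     # remove leading/trailing whitespace
)
# Deduplicate, drop empty strings, and sort for a stable order.
unique_labels = sorted({v for v in labels if v})

print(labels.tolist())   # ['Shoes', 'Boots', '', '', 'Shoes', '42']
print(unique_labels)     # ['42', 'Boots', 'Shoes']
```

Sorting the unique labels keeps their order stable between runs, which helps reproducibility when combined with a fixed `random_state`.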

### 3.2 - Degenerate-Case Guard

This section contains a failsafe: if there are no usable labels, every row is assigned to a single cluster with ID 0 and the name "All", and the function returns early.

In [None]:
    if not unique_labels:
        df[f"{output_prefix}_id"] = 0
        df[f"{output_prefix}_name"] = "All"
        return df

### 3.3 - Decide k

This section decides k. If `n_clusters` is None or less than 1, k is chosen by the non-public function `_choose_k` based on the number of unique labels. Otherwise, k is set to `min(max(1, n_clusters), len(unique_labels))`, which caps the requested value at the number of unique labels.

In [None]:
    # k
    if n_clusters is None or n_clusters < 1:
        k = _choose_k(len(unique_labels))
    else:
        k = min(max(1, n_clusters), len(unique_labels))
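
The two branches can be exercised in isolation. In the sketch below, `resolve_k` is a hypothetical wrapper written only for this example (the real function inlines the logic), and `choose_k` is a standalone copy of the heuristic that uses `math.sqrt` so the example has no external dependencies.

```python
import math
from typing import Optional


def choose_k(n_items: int, k_min: int = 2, k_max: int = 12) -> int:
    """Standalone copy of the square-root heuristic."""
    if n_items <= 1:
        return 1
    k = max(k_min, min(k_max, int(math.sqrt(n_items))))
    return min(k, n_items)


def resolve_k(n_clusters: Optional[int], n_unique: int) -> int:
    """Hypothetical helper mirroring the branch in this section."""
    if n_clusters is None or n_clusters < 1:
        return choose_k(n_unique)             # no usable request: use heuristic
    return min(max(1, n_clusters), n_unique)  # cap request at label count


print(resolve_k(None, 50))  # 7  (floor of sqrt(50))
print(resolve_k(0, 50))     # 7  (invalid request falls back to heuristic)
print(resolve_k(5, 50))     # 5  (explicit request honored)
print(resolve_k(20, 8))     # 8  (capped at the number of unique labels)
```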

### 3.4 - Embeddings

This section uses the imported function `build_embeddings_for_labels` to construct embeddings stored as `emb`. 

In [None]:
    # embeddings
    emb = build_embeddings_for_labels(unique_labels)

### 3.5 - k-Means Clustering

This section runs the clustering algorithm itself.

Line-by-line breakdown:
- If k is less than or equal to 1, the raw clusters are initialized as a NumPy array of zeros (every label is assigned to one cluster).
- Otherwise, KMeans is initialized with `n_clusters=k` and the given `random_state`, and the `fit_predict` method is called with `emb` as input. The result is stored as `cluster_raw`.

In [None]:
    if k <= 1:
        cluster_raw = np.zeros(len(unique_labels), dtype=int)
    else:
        km = KMeans(n_clusters=k, random_state=random_state, n_init="auto")
        cluster_raw = km.fit_predict(emb)
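
The sketch below runs the same branch on a tiny, made-up embedding matrix so the shape of `cluster_raw` is easy to see. The 2-D vectors are assumptions standing in for real label embeddings, and `n_init=10` is used instead of `"auto"` only so the example also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D "embeddings": two labels near the origin, two near (5, 5).
emb = np.array([
    [0.0, 0.1],
    [0.1, 0.0],
    [5.0, 5.1],
    [5.1, 5.0],
])
k = 2

if k <= 1:
    # Single-cluster case: every label gets raw cluster 0.
    cluster_raw = np.zeros(len(emb), dtype=int)
else:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    cluster_raw = km.fit_predict(emb)  # one integer cluster index per row

print(cluster_raw.shape)  # (4,)
print(cluster_raw)
```

The first two rows share one cluster index and the last two share the other. Which group receives index 0 is arbitrary.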

### 3.6 - Normalization

This section normalizes the cluster IDs so that they run from 1 to K instead of starting at 0. This leaves 0 free for rows that have no usable label.

Line-by-line breakdown:
- The distinct cluster IDs KMeans produced (as integers) are sorted and stored as `uniq_raw`.
- The "1-based" mapping from raw ID to normalized ID is stored as `raw2id`.
- A dictionary mapping each unique label to its normalized cluster ID is stored as `label_to_cid`.
- Each row of the dataset is assigned a cluster ID using the mapping. Rows whose cleaned label is not in the mapping (empty labels) receive 0.

In [None]:
    # normalize IDs (1..K)
    uniq_raw = sorted(set(int(x) for x in cluster_raw))
    raw2id = {raw: i + 1 for i, raw in enumerate(uniq_raw)}

    label_to_cid = {
        lbl: raw2id[int(raw)]
        for lbl, raw in zip(unique_labels, cluster_raw)
    }

    df[f"{output_prefix}_id"] = labels.map(label_to_cid).fillna(0).astype(int)
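
A small worked example may help here. The raw cluster values below are hypothetical and deliberately skip 1 to show that the normalized IDs are always consecutive. The sample labels are likewise assumptions.

```python
import pandas as pd

unique_labels = ["Boots", "Hats", "Shoes"]
cluster_raw = [2, 0, 2]  # hypothetical raw output with a gap (no cluster 1)

uniq_raw = sorted(set(int(x) for x in cluster_raw))       # [0, 2]
raw2id = {raw: i + 1 for i, raw in enumerate(uniq_raw)}   # {0: 1, 2: 2}

label_to_cid = {
    lbl: raw2id[int(raw)]
    for lbl, raw in zip(unique_labels, cluster_raw)
}

# Cleaned per-row labels, including one empty label.
labels = pd.Series(["Shoes", "", "Hats", "Boots"])
ids = labels.map(label_to_cid).fillna(0).astype(int)

print(label_to_cid)   # {'Boots': 2, 'Hats': 1, 'Shoes': 2}
print(ids.tolist())   # [2, 0, 1, 2]
```

The empty label has no entry in `label_to_cid`, so `map` yields NaN for it and `fillna(0)` turns that into ID 0.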

### 3.7 - Naming

This section generates human-readable cluster names.

Line-by-line breakdown:
- A dictionary `clusters` is initialized, which stores cluster membership in the format `cid: members`, where `cid` is the numerical cluster ID and `members` is the list of unique labels assigned to that cluster.
- Iterate through each label-to-cluster assignment in `label_to_cid` (with 180 unique labels, there are 180 assignments) and append each label to its cluster's member list.
- A dictionary `name_map` is initialized.
- For each cluster, attempt to store the return of `tfidf_cluster_label(members)` as the value for the current `cid` (`name_map` key).
- If the function raises an exception, the first three labels of the cluster are joined with " / " to name the cluster.
- After iteration concludes, the name mapping is assigned as a new column in the dataframe, and the dataframe is returned. Rows with cluster ID 0 (no usable label) have no entry in `name_map`, so their name becomes the string "nan".

Note: `tfidf_cluster_label` can be found in the notebook breakdown for `text_utils.py`.

In [None]:
    # names
    clusters = {}
    for lbl, cid in label_to_cid.items():
        clusters.setdefault(cid, []).append(lbl)

    name_map = {}
    for cid, members in clusters.items():
        try:
            name_map[cid] = tfidf_cluster_label(members)
        except Exception:
            name_map[cid] = " / ".join(members[:3])

    df[f"{output_prefix}_name"] = df[f"{output_prefix}_id"].map(name_map).astype(str)
    return df
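
The grouping and fallback logic can be tried without the real naming helper. In the sketch below, `flaky_namer` is a hypothetical stand-in for `tfidf_cluster_label` that fails on single-member clusters, purely to trigger the fallback path. The labels are made up.

```python
def flaky_namer(members: list[str]) -> str:
    """Hypothetical stand-in for tfidf_cluster_label (not the real helper)."""
    if len(members) < 2:
        raise ValueError("not enough labels to score")
    return members[0].split()[0]  # naive: first word of the first label


label_to_cid = {"Running Shoes": 1, "Running Shorts": 1, "Desk Lamp": 2}

# Group unique labels by cluster ID.
clusters: dict[int, list[str]] = {}
for lbl, cid in label_to_cid.items():
    clusters.setdefault(cid, []).append(lbl)

# Name each cluster, falling back to joined member labels on failure.
name_map: dict[int, str] = {}
for cid, members in clusters.items():
    try:
        name_map[cid] = flaky_namer(members)
    except Exception:
        name_map[cid] = " / ".join(members[:3])

print(clusters)  # {1: ['Running Shoes', 'Running Shorts'], 2: ['Desk Lamp']}
print(name_map)  # {1: 'Running', 2: 'Desk Lamp'}
```

Cluster 2 has a single member, so the stand-in raises and the cluster is named after its own label.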

## NOTE: WHY TF-IDF?

TF-IDF treats each label (a product category, customer segment, or something similar) as a **document**, which is then split into **tokens** (individual words within each label). TF-IDF then measures the **term frequency**, calculates **inverse document frequency** (log(total_docs / docs_containing_token)), and multiplies both values together (TF * IDF) to identify words that are common inside the cluster AND rare outside of the cluster. This works well with wide, sparse datasets where many tokens are globally rare and locally dense. TF-IDF is also fast and cheap, which works well with the Streamlit application. Hierarchies are rebuilt within the application based on the user's desired edits, so naming needs to be performed quickly and efficiently to prevent long load times and crashing.  
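
The arithmetic described above can be sketched by hand with the standard library. This is an illustration of the formula as stated in this note, not the implementation in `text_utils.py`, which is not shown here and may tokenize, smooth, or weight terms differently. The labels and the cluster split are made up.

```python
import math
from collections import Counter

# Each label is one "document"; the first three form the cluster to name.
cluster_labels = ["running shorts", "running shoes", "trail running"]
other_labels = ["dress shoes", "desk lamp", "floor lamp"]
all_labels = cluster_labels + other_labels

# Document frequency: number of labels that contain each token.
doc_freq = Counter(tok for lbl in all_labels for tok in set(lbl.split()))

# Term frequency: token counts inside the cluster only.
term_freq = Counter(tok for lbl in cluster_labels for tok in lbl.split())

# TF * IDF with IDF = log(total_docs / docs_containing_token).
total_docs = len(all_labels)
scores = {
    tok: tf * math.log(total_docs / doc_freq[tok])
    for tok, tf in term_freq.items()
}

best = max(scores, key=scores.get)
print(best)  # running
```

Here "running" wins because it appears in every label of the cluster and in no label outside it, while "shoes" is penalized for also appearing outside the cluster.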

Future versions of this project will experiment with LLMs, which are powerful but expensive. One option is to use TF-IDF as the primary naming method and optionally use LLMs for refinement.