# SEMANTIC_LAYER.ipynb 
# Semantic-Based Clustering Method for Hierarchy Engine

---
## Overview
- Docstring and Imports
- K Heuristic
- Core Semantic-Based Clustering Logic
- `semantic_relabel` Helper
- `split_cluster_embeddings` Helper

## 1 - Docstring and Imports
The Semantic Layer groups text labels by meaning: each unique label is converted to an embedding vector, and the embeddings are clustered with KMeans.

The Semantic Layer is built on scikit-learn, with `KMeans` as the clustering algorithm. NumPy and pandas handle data management, and `from __future__ import annotations` is included so that type hints can be used throughout for clarity. Two helper functions, `build_embeddings_for_labels` and `tfidf_cluster_label`, are imported from `.text_utils`.

In [None]:
# core/semantic_layer.py

from __future__ import annotations
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

from .text_utils import build_embeddings_for_labels, tfidf_cluster_label

## 2 - K Heuristic
This helper function chooses k for KMeans with a simple heuristic: k is the square root of the number of items in the category, truncated to an integer. The value is then clipped so that there are at least `k_min` clusters (avoiding a single catch-all cluster), at most `k_max` clusters (avoiding over-fragmentation), and never more clusters than items. The function takes three integer arguments: `n_items`, the number of items in the category, and `k_min` and `k_max`, the lower and upper bounds for k, which default to 2 and 12, respectively.

Line-by-line breakdown:
- If `n_items` is 1 or less, the function returns `k=1`, since a single item cannot be split into more than one cluster.
- k is initialized as the square root of `n_items`, truncated to an integer.
- k is clipped to the range from `k_min` to `k_max` via `max(k_min, min(k_max, k))`.
- k is clipped again to be at most `n_items`, so there are never more clusters than items.
- The clipped k is returned.

In [None]:
def _choose_k(n_items: int, k_min: int = 2, k_max: int = 12) -> int:
    if n_items <= 1:
        return 1
    k = int(np.sqrt(n_items))
    k = max(k_min, min(k_max, k))
    k = min(k, n_items)
    return k
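
To make the clipping concrete, the sketch below restates the heuristic and evaluates it for a few category sizes. It is a standalone copy for illustration only: the name `choose_k` is an assumption, and `math.sqrt` stands in for `np.sqrt` so the example runs without NumPy.

```python
import math


def choose_k(n_items: int, k_min: int = 2, k_max: int = 12) -> int:
    """Illustrative copy of the square-root heuristic (not the module's function)."""
    if n_items <= 1:
        return 1
    k = int(math.sqrt(n_items))      # truncated square root
    k = max(k_min, min(k_max, k))    # clip to the range k_min..k_max
    return min(k, n_items)           # never more clusters than items


for n in (1, 3, 50, 100, 400):
    print(n, choose_k(n))
# 1 1
# 3 2
# 50 7
# 100 10
# 400 12
```

Note that for `n_items = 3` the truncated square root is 1, and it is the `k_min` floor that raises the result to 2.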

## 3 - Core Semantic-Based Clustering Logic

This function builds the semantic layer. It accepts a pandas DataFrame `df` and the keyword-only arguments `input_label_col` (string), `n_clusters` (integer or None), `output_prefix` (string), and `random_state` (integer, default 42). In practice, `df` is the complete original dataset, `input_label_col` names the column of text labels to embed and cluster, `n_clusters` is an explicit cluster count (None selects k automatically), `output_prefix` is the prefix for the output column names, and `random_state` is a seed for reproducibility. The function returns a copy of `df` with two added columns: `{output_prefix}_id`, the integer cluster ID, and `{output_prefix}_name`, the human-readable cluster name.

Because of its size, the function is broken into sections and explained piece by piece. This first section is simply the function signature.

In [None]:
def build_semantic_layer(
    df: pd.DataFrame,
    *,
    input_label_col: str,
    n_clusters: int | None,
    output_prefix: str,
    random_state: int = 42,
) -> pd.DataFrame:

### 3.1 - Preparation

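This section copies the input dataset so that the caller's DataFrame is not modified, builds a cleaned Series of labels from `input_label_col` (missing values become empty strings, values are cast to strings, and surrounding whitespace is stripped), and creates a sorted list of the unique, non-empty labels.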

In [None]:
    df = df.copy()

    labels = (
        df[input_label_col]
        .fillna("")
        .astype(str)
        .str.strip()
    )
    unique_labels = sorted({v for v in labels if v})
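
As a small illustration of the cleaning step, the sketch below applies the same chain to a toy Series (the data is made up for the example). It shows that missing values and whitespace-only differences are absorbed, while differences in letter case are not: "Apple" and "apple" remain distinct labels.

```python
import pandas as pd

# Toy label column (assumed data) with a missing value, an empty string,
# stray whitespace, and a duplicate.
raw = pd.Series(["Apple", " apple ", None, "", "Banana", "Apple"])

labels = raw.fillna("").astype(str).str.strip()
unique_labels = sorted({v for v in labels if v})

print(labels.tolist())
# ['Apple', 'apple', '', '', 'Banana', 'Apple']
print(unique_labels)
# ['Apple', 'Banana', 'apple']
```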

### 3.2 - Degenerate-Case Guard

This section is a failsafe: when there are no usable labels, every row is assigned ID 0 and the name "All", and the function returns early.

In [None]:
    if not unique_labels:
        df[f"{output_prefix}_id"] = 0
        df[f"{output_prefix}_name"] = "All"
        return df

### 3.3 - Decide k

This section determines k. If `n_clusters` is None or less than 1, k is chosen by the non-public function `_choose_k` based on the number of unique labels. Otherwise, the requested value is used, capped at the number of unique labels: `min(max(1, n_clusters), len(unique_labels))`.

In [None]:
    # k
    if n_clusters is None or n_clusters < 1:
        k = _choose_k(len(unique_labels))
    else:
        k = min(max(1, n_clusters), len(unique_labels))
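
The branch above can be exercised in isolation. The sketch below is a hypothetical helper, `resolve_k`, that mirrors the logic; `heuristic_k` is a compact stand-in for `_choose_k` with its default bounds, and neither name exists in the module.

```python
import math
from typing import Optional


def heuristic_k(n: int) -> int:
    """Stand-in for `_choose_k` with its default bounds (2 and 12)."""
    if n <= 1:
        return 1
    return min(max(2, min(12, int(math.sqrt(n)))), n)


def resolve_k(n_clusters: Optional[int], n_labels: int) -> int:
    """Hypothetical helper mirroring the k-selection branch."""
    if n_clusters is None or n_clusters < 1:
        return heuristic_k(n_labels)            # no usable request: use the heuristic
    return min(max(1, n_clusters), n_labels)    # explicit request, capped at label count


print(resolve_k(None, 50))   # 7  (heuristic)
print(resolve_k(0, 50))      # 7  (invalid request falls back to the heuristic)
print(resolve_k(5, 50))      # 5  (explicit request is honored)
print(resolve_k(100, 50))    # 50 (capped at the number of unique labels)
```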

### 3.4 - Embeddings

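This section uses the imported function `build_embeddings_for_labels` to build the embeddings for the unique labels, stored as `emb`. Because `emb` is passed directly to KMeans in the next step and the resulting cluster IDs are paired with `unique_labels`, it is expected to contain one row per unique label, in the same order.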

In [None]:
    # embeddings
    emb = build_embeddings_for_labels(unique_labels)

### 3.5 - k-Means Clustering

This section runs the clustering itself.

Line-by-line breakdown:
- If k is less than or equal to 1, `cluster_raw` is set to a NumPy array of zeros, which assigns every label to a single cluster.
- Otherwise, KMeans is initialized with `n_clusters=k`, the `random_state` seed, and `n_init="auto"`, and its `fit_predict` method is called on `emb`. The resulting array of raw cluster IDs, one per unique label, is stored as `cluster_raw`.

In [None]:
    if k <= 1:
        cluster_raw = np.zeros(len(unique_labels), dtype=int)
    else:
        km = KMeans(n_clusters=k, random_state=random_state, n_init="auto")
        cluster_raw = km.fit_predict(emb)

### 3.6 - Normalization

This section normalizes the cluster IDs so that they run from 1 to K rather than starting at 0. ID 0 is left for rows whose label is empty after cleaning.

Line-by-line breakdown:
- The distinct cluster IDs that KMeans produced are converted to integers, sorted, and stored as `uniq_raw`.
- The mapping from raw IDs to 1-based IDs is stored as `raw2id`.
- A dictionary mapping each unique label to its 1-based cluster ID is stored as `label_to_cid`.
- Each row of the dataset is assigned a cluster ID by mapping its cleaned label through `label_to_cid`; rows with no match (empty labels) receive 0.

In [None]:
    # normalize IDs (1..K)
    uniq_raw = sorted(set(int(x) for x in cluster_raw))
    raw2id = {raw: i + 1 for i, raw in enumerate(uniq_raw)}

    label_to_cid = {
        lbl: raw2id[int(raw)]
        for lbl, raw in zip(unique_labels, cluster_raw)
    }

    df[f"{output_prefix}_id"] = labels.map(label_to_cid).fillna(0).astype(int)
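
The renumbering can be traced by hand on a small example. The sketch below uses made-up labels and raw IDs with a gap in them, and replaces the pandas `map(...).fillna(0)` step with a plain dictionary lookup that behaves the same way for this purpose.

```python
unique_labels = ["apple", "banana", "bolt", "nut"]
cluster_raw = [3, 3, 0, 0]   # assumed raw IDs with a gap (1 and 2 unused)

uniq_raw = sorted(set(int(x) for x in cluster_raw))        # [0, 3]
raw2id = {raw: i + 1 for i, raw in enumerate(uniq_raw)}    # {0: 1, 3: 2}
label_to_cid = {
    lbl: raw2id[int(raw)]
    for lbl, raw in zip(unique_labels, cluster_raw)
}
print(label_to_cid)
# {'apple': 2, 'banana': 2, 'bolt': 1, 'nut': 1}

# Rows whose cleaned label is empty have no entry, so they fall back to ID 0.
row_labels = ["bolt", "", "apple"]
row_ids = [label_to_cid.get(lbl, 0) for lbl in row_labels]
print(row_ids)
# [1, 0, 2]
```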

### 3.7 - Naming

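This section generates human-readable cluster names.

Line-by-line breakdown:
- The labels are grouped by cluster ID into the dictionary `clusters`.
- For each cluster, `tfidf_cluster_label` generates a name from the member labels. If it raises an exception, the name falls back to the first three member labels joined by " / ".
- Each row's cluster ID is mapped to its name and stored in the `{output_prefix}_name` column. Rows with ID 0 have no entry in `name_map`, so after `astype(str)` their name is the string "nan".
- The augmented copy of `df` is returned.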

In [None]:
    # names
    clusters = {}
    for lbl, cid in label_to_cid.items():
        clusters.setdefault(cid, []).append(lbl)

    name_map = {}
    for cid, members in clusters.items():
        try:
            name_map[cid] = tfidf_cluster_label(members)
        except Exception:  # fall back to a name built from the first few members
            name_map[cid] = " / ".join(members[:3])

    df[f"{output_prefix}_name"] = df[f"{output_prefix}_id"].map(name_map).astype(str)
    return df
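
The fallback path is easiest to see with a labeler that always fails. In the sketch below, `failing_labeler` is a hypothetical stand-in for `tfidf_cluster_label`, and the label-to-cluster mapping is made up for the example.

```python
def failing_labeler(members):
    """Hypothetical stand-in for `tfidf_cluster_label` that always fails."""
    raise ValueError("no usable terms")


# Assumed mapping, listed in sorted label order as in the function above.
label_to_cid = {"gala apple": 2, "hex bolt": 1, "lock nut": 1, "washer": 1, "wing nut": 1}

# Group labels by cluster ID, preserving the order in which labels appear.
clusters = {}
for lbl, cid in label_to_cid.items():
    clusters.setdefault(cid, []).append(lbl)

name_map = {}
for cid, members in clusters.items():
    try:
        name_map[cid] = failing_labeler(members)
    except Exception:
        name_map[cid] = " / ".join(members[:3])   # fallback: first three members

print(name_map)
# {2: 'gala apple', 1: 'hex bolt / lock nut / washer'}
```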

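## 4 - Semantic Relabel

This function applies the same pipeline as `build_semantic_layer` to an existing ID column. Labels are taken from `name_col` when it is provided and present in `df`; otherwise the values of `id_col` are used as strings. The function returns a copy of `df` with two added columns: `{id_col}_semantic`, the cluster ID, and `{id_col}_semantic_name`, the cluster name. Unlike `build_semantic_layer`, it does not strip whitespace from the labels, and the KMeans seed is fixed at 42.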

In [None]:
def semantic_relabel(df: pd.DataFrame, id_col: str, name_col: str | None = None, n_clusters=None):

    if name_col and name_col in df.columns:
        labels = df[name_col].fillna("").astype(str)
    else:
        labels = df[id_col].astype(str)

    unique = sorted({x for x in labels if x})
    if not unique:
        df2 = df.copy()
        df2[f"{id_col}_semantic"] = 0
        df2[f"{id_col}_semantic_name"] = "All"
        return df2

    if n_clusters is None or n_clusters < 1:
        k = _choose_k(len(unique))
    else:
        k = min(max(1, n_clusters), len(unique))

    emb = build_embeddings_for_labels(unique)
    if k <= 1:
        raw = np.zeros(len(unique), dtype=int)
    else:
        km = KMeans(n_clusters=k, random_state=42, n_init="auto")
        raw = km.fit_predict(emb)

    uniq_raw = sorted(set(int(x) for x in raw))
    raw2id = {old: i + 1 for i, old in enumerate(uniq_raw)}

    label2cid = {lbl: raw2id[int(r)] for lbl, r in zip(unique, raw)}

    df2 = df.copy()
    df2[f"{id_col}_semantic"] = labels.map(label2cid).fillna(0).astype(int)

    clusters = {}
    for lbl, cid in label2cid.items():
        clusters.setdefault(cid, []).append(lbl)

    names = {}
    for cid, mem in clusters.items():
        try:
            names[cid] = tfidf_cluster_label(mem)
        except Exception:  # fall back to a name built from the first few members
            names[cid] = " / ".join(mem[:3])

    df2[f"{id_col}_semantic_name"] = df2[f"{id_col}_semantic"].map(names).astype(str)
    return df2

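## 5 - Embedding-Based Cluster Splitting

This function embeds a list of labels and partitions them into `k` groups with KMeans, returning the raw group index for each label. It does not clip `k`, so the caller is responsible for passing a value between 1 and the number of labels.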

In [None]:
def split_cluster_embeddings(labels: list[str], k: int):

    emb = build_embeddings_for_labels(labels)

    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    raw = km.fit_predict(emb)

    # The return is simply an array of group indices (0..k-1)
    return raw
