# ATTRIBUTE_LAYER.ipynb 
# Sparsity-Based/Value-Based Clustering Method for Hierarchy Engine

---
## Overview
- Docstring and Imports
- Column Selection Helper Function
- Core Sparsity-Based Clustering Logic
- Core Value-Based Clustering Logic
- Public Assignment Method
- Name Generation Purity Method

## 1 - Docstring and Imports
The Attribute Layer utilizes sparsity-based clustering OR value-based clustering to create groups of items with similar sparsity patterns/attribute values. 

Sparsity-based clustering is especially effective on sparse, wide datasets (think 8,000+ columns). Example applications include product master data, term-document matrices, and sensor networks. Within the context of Hierarchy, the Attribute Layer generates clusters in an attempt to group similar items within a larger category, allowing users to build narrower item categories.

The Attribute Layer is constructed using the sklearn library. Specifically, TruncatedSVD is used for dimensionality reduction to improve processing time, and KMeans is the actual clustering agent. NumPy and Pandas are used for data management, and annotations/type hints are incorporated for clarity.

In [None]:
# core/attribute_layer.py

"""
This file implements the Attribute Layer.

The Attribute Layer groups rows within each category based on
their attribute sparsity patterns (which columns are populated), and
produces human-readable names for each attribute-based cluster.

Public API used by the hierarchy engine:

    assign_all_clusters(df, random_state=42) -> pd.DataFrame
        Returns a copy of df with an integer "attribute_cluster"
        column indicating the attribute-layer cluster id for each row.

    make_cluster_names(df) -> (dict, pd.DataFrame)
        Returns (cluster_name_map, df_with_names) where
        "attribute_cluster_name" is added to the dataframe.
"""

# Type hints
from __future__ import annotations
from typing import List, Tuple, Dict

# External dependencies
import numpy as np
import pandas as pd

# Sklearn dependencies
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

## 2 - Column Selection Helper
This section covers the non-public attribute column selection helper. It is a simple function that returns the attribute columns to be included in clustering.

## 2.1 - Metadata Column Exclusion
Specific metadata columns will be excluded from clustering. They tend to overpower the algorithm, and clusters built with these metadata columns are not particularly informative. 

In [None]:
# Columns to exclude from attribute consideration
_METADATA_COLS = {
    "category_name",
    "category_cluster",
    "category_cluster_name",
    "attribute_cluster",  # written by assign_all_clusters; must not be treated as an attribute
    "attribute_cluster_id",
    "attribute_cluster_name",
    "level_0_id",
    "level_0_name",
    "level_1_id",
    "level_1_name",
}

## 2.2 - Helper Function
The actual function, `_select_attribute_columns()`, accepts two arguments: a Pandas DataFrame object `df`, and a list of strings `extra_excluded_cols`. In practice, `df` is the primary dataset the user intends to generate a hierarchy for, and `extra_excluded_cols` is an optional, additional set of columns that should not be considered by the clustering algorithm. The function returns the list of attribute columns to be used for clustering.

Line-by-line breakdown:
- A set containing the metadata columns is stored in the variable `excluded`.
- If `extra_excluded_cols` is provided and non-empty, its entries are converted to strings and added to `excluded` via set union.
- The return object `cols` is initialized as an empty list of strings.
- Iterate through each column in the primary dataset.
    - If the current column is in `excluded`, then continue to the next column
    - If a column is a trivial/non-informative column such as `id` or `index`, then continue to the next column
    - If the current column is completely null, then continue to the next column
    - Otherwise, the current column name is appended to the return object `cols`
- Return `cols`, the selected attribute columns (as a list of strings)


In [None]:
# Function to select attribute columns for clustering
def _select_attribute_columns(
    df: pd.DataFrame,
    extra_excluded_cols: list[str] | None = None,
) -> List[str]:
    """
    Heuristic to choose attribute columns for sparsity-based clustering.

    We treat any non-all-null column that is not obviously part of the
    hierarchy metadata as an attribute candidate, and also allow the caller
    to explicitly exclude additional columns via `extra_excluded_cols`.

    Parameters
    ----------
    df :
        Primary DataFrame object for use in hierarchy generation
    extra_excluded_cols :
        Columns to exclude from clustering

    Returns
    -------
    List[str] :
        A list of column names to be used as attribute columns
    """
    excluded = set(_METADATA_COLS)
    if extra_excluded_cols:
        excluded |= set(map(str, extra_excluded_cols))

    cols: List[str] = []
    for col in df.columns:
        if col in excluded:
            continue
        # Skip index-like or trivial columns
        if str(col).lower() in {"index", "id"}:
            continue
        if df[col].isna().all():
            continue
        cols.append(col)
    return cols
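As a quick illustration of the three skip rules, the sketch below applies them to a toy DataFrame. The function `select_columns_sketch`, the shortened `METADATA_SKETCH` set, and every column name are hypothetical stand-ins chosen for this example; they are not part of the module.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened stand-in for _METADATA_COLS
METADATA_SKETCH = {"category_name", "attribute_cluster_name"}


def select_columns_sketch(df, extra_excluded_cols=None):
    """Keep columns that are not excluded, not index-like, and not all-null."""
    excluded = set(METADATA_SKETCH)
    if extra_excluded_cols:
        excluded |= set(map(str, extra_excluded_cols))
    return [
        col
        for col in df.columns
        if col not in excluded
        and str(col).lower() not in {"index", "id"}
        and not df[col].isna().all()
    ]


toy = pd.DataFrame(
    {
        "id": [1, 2, 3],                                   # index-like: skipped
        "category_name": ["Binders", "Binders", "Pens"],   # metadata: skipped
        "width_mm": [210.0, np.nan, np.nan],               # sparse but kept
        "color": [None, "blue", "red"],                    # kept
        "unused": [np.nan, np.nan, np.nan],                # all-null: skipped
        "sku": ["A1", "B2", "C3"],                         # excluded by the caller
    }
)

selected = select_columns_sketch(toy, extra_excluded_cols=["sku"])
print(selected)
```

Only `width_mm` and `color` survive: a column needs just one populated cell to count as an attribute candidate.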

## 3 - Core Sparsity-Based Clustering Logic
This section is responsible for the actual clustering method. The non-public function `_cluster_products_within_category_sparsity` accepts three arguments: a Pandas DataFrame object `cat_df`, a list of strings `attr_cols`, and an integer `random_state`. In practice, `cat_df` is the subset of the primary dataset containing only the rows within the category of items to be clustered. (Future versions of this project will generalize this logic so that attribute layers can be built on top of other attribute layers, instead of just item categories.) `attr_cols` is the list of columns to be treated as attributes for clustering, as returned by the helper function `_select_attribute_columns`. `random_state` seeds the sklearn estimators for reproducibility. The function returns a NumPy array of integer labels, one cluster assignment per row of `cat_df`.

Due to the size of the function, it will be broken down into chunks of line-by-line explanations, rather than one giant breakdown. This first snippet is just the function signature and docstring.


In [None]:
def _cluster_products_within_category_sparsity(
    cat_df: pd.DataFrame,
    attr_cols: List[str],
    random_state: int = 42,
) -> np.ndarray:
    """
    Cluster products within a single category using attribute sparsity
    (which columns are non-null / non-empty).

    Parameters
    ----------
    cat_df :
        Subset of the dataframe containing only rows for a single category.
    attr_cols :
        Columns to treat as attributes.
    random_state :
        Seed for KMeans and SVD.

    Returns
    -------
    np.ndarray
        An array of integer labels (cluster ids) with length len(cat_df).
    """

## 3.1 - Degenerate Guard
This section is a failsafe to ensure that downstream code never crashes from empty arguments `attr_cols` and `cat_df`.

Line-by-line breakdown:
- If `attr_cols` is empty (or None) OR `cat_df` is an empty dataframe, every row is assigned to a single cluster "0". The function returns this cluster assignment immediately.

In [None]:
    if not attr_cols or cat_df.empty:
        # Degenerate: single cluster
        return np.zeros(len(cat_df), dtype=int)

## 3.2 - Boolean Mask
This section initializes a boolean mask and populates it column by column.

Line-by-line breakdown:
- `mask` is initialized as a Pandas DataFrame of `False` values, with one row per item in `cat_df` and one column per attribute column.
- Iterate through each column in `attr_cols`.
    - The contents of the current column are stored in the variable `col_series`.
    - If `col_series` has dtype `object` (strings, empty strings, or other messy user-entered data), then a boolean series is created in which an attribute is marked present only if it is not NaN, not an empty string, and not a whitespace-only string.
    - If `col_series` is not type `object`, then a boolean series is created and an attribute is marked present if it is not NaN (the string-specific cases are not necessary here since this data is numeric/non-object).
- The mask is converted into a NumPy array `X` with type `float` (True/False is now formatted as 1.0/0.0 since sklearn operates on NumPy arrays).
- The dimensions of `X` are stored as `n_samples` (number of items in `cat_df`) and `n_features` (number of attribute columns).
- If there are only 1 or 0 items in `cat_df`, a simple cluster assignment "0" is returned from this function as a guard.
- If no item has any attribute populated (complete sparsity) then the same as above is returned since clustering is meaningless in this instance.

In [None]:
    # Build a boolean mask: row x attribute_col
    mask = pd.DataFrame(False, index=cat_df.index, columns=attr_cols)

    for col in attr_cols:
        col_series = cat_df[col]
        if col_series.dtype == object:
            mask[col] = col_series.notna() & (col_series.astype(str).str.strip() != "")
        else:
            mask[col] = col_series.notna()

    # Convert to array
    X = mask.to_numpy(dtype=float)
    n_samples, n_features = X.shape

    if n_samples <= 1:
        return np.zeros(n_samples, dtype=int)

    # If all-zero features, give up
    if not np.any(X):
        return np.zeros(n_samples, dtype=int)
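To make the presence rules concrete, here is a small standalone sketch that builds the same kind of mask for a toy frame. The frame `toy` and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "width_mm": [210.0, np.nan, 297.0],
        # object column with a real value, a whitespace-only string, and a missing value
        "color": pd.Series(["red", "   ", None], dtype=object),
    }
)

mask = pd.DataFrame(False, index=toy.index, columns=toy.columns)
for col in toy.columns:
    s = toy[col]
    if s.dtype == object:
        # present = not null AND not blank after stripping whitespace
        mask[col] = s.notna() & (s.astype(str).str.strip() != "")
    else:
        # numeric columns only need the null check
        mask[col] = s.notna()

# True/False becomes 1.0/0.0
X = mask.to_numpy(dtype=float)
print(X)
```

The second row ends up all zeros: its width is NaN and its color is whitespace only, so neither attribute counts as populated.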

## 3.3 - K Heuristic
This section chooses the k value for KMeans using the following heuristic: k is set to the square root of the number of items within the category. This value is clipped to prevent over-fragmentation and to ensure that there is more than one cluster.

Line-by-line breakdown:
- The square root of `n_samples` is clipped to the range 2 to 8 and then truncated to an integer, which is stored as `k`.
- If `k` is greater than `n_samples`, then it is set equal to `n_samples`. This can only happen when the lower clip of 2 exceeds the number of items, so after the earlier `n_samples <= 1` guard it is purely defensive.
- If `k` is less than or equal to 1, then the single cluster assignment "0" is returned from this function (final guard for `k`).

In [None]:
    # Determine k using sqrt heuristic, clipped
    k = int(np.clip(np.sqrt(n_samples), 2, 8))
    if k > n_samples:
        k = n_samples
    if k <= 1:
        return np.zeros(n_samples, dtype=int)
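A few sample values show how the heuristic behaves across category sizes. The helper `choose_k` below is a hypothetical pure-Python equivalent of the `np.clip` expression, written only for this illustration.

```python
import math


def choose_k(n_samples: int) -> int:
    """Square-root heuristic clipped to [2, 8], never above n_samples."""
    k = int(min(max(math.sqrt(n_samples), 2), 8))
    return min(k, n_samples)


for n in (2, 4, 10, 50, 64, 1000):
    print(f"n_samples={n:>4} -> k={choose_k(n)}")
```

Any category with fewer than 9 items is split into 2 clusters, and any category with 64 or more items is capped at 8.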

## 3.4 - TruncatedSVD Dimensionality Reduction
This section reduces the dimensions of extremely wide data to improve KMeans performance.

Line-by-line breakdown:
- The number of dimensions `n_components` is set to be the minimum of 20, the value of `n_features` - 1, and the value of `n_samples` - 1 IF `n_features` is greater than 1 (more than 1 attribute column). Otherwise, `n_components` is set to 1.
- If `n_components` is greater than or equal to 1, then the function attempts TruncatedSVD using `n_components` and `random_state`, storing it as `svd`. The result of `svd.fit_transform(X)` is stored as `X_reduced`.
- If TruncatedSVD cannot be performed, then `X_reduced` is set equal to X (no reduction).

In [None]:
    # Dimensionality reduction for very wide data
    n_components = min(20, n_features - 1, n_samples - 1) if n_features > 1 else 1
    if n_components >= 1:
        try:
            svd = TruncatedSVD(n_components=n_components, random_state=random_state)
            X_reduced = svd.fit_transform(X)
        except Exception:
            # Fallback: no reduction
            X_reduced = X
    else:
        X_reduced = X
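The effect of this step on the matrix shape can be seen with a randomly generated presence matrix. This is a minimal sketch: the matrix size, the 5% fill rate, and the seed are arbitrary assumptions, and the `try`/`except` fallback is left out for brevity.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# 50 items x 200 attribute columns, roughly 5% of cells populated
X = (rng.random((50, 200)) < 0.05).astype(float)

n_samples, n_features = X.shape
n_components = min(20, n_features - 1, n_samples - 1) if n_features > 1 else 1

svd = TruncatedSVD(n_components=n_components, random_state=42)
X_reduced = svd.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```

KMeans then works on 20 columns per item instead of 200, while the number of rows (items) is unchanged.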

## NOTE - Why Truncated SVD?
There are other dimensionality reduction techniques that can be considered, such as PCA and regular SVD. However, due to the structure of our data, these two methods are not optimal.

**PCA**
- Requires mean-centering
- Densifies sparse data

If a sparse binary dataset is mean-centered, the boolean interpretation of the dataset is obscured by the introduction of artificial negative values. Additionally, once the zero values are no longer zero, the dataset is no longer sparse. As a result, memory requirements skyrocket and computation slows down. Therefore, PCA is not an optimal technique for this scenario.

**Regular SVD**
- Creates exact reconstruction
- Runs slow

Because the pipeline only needs the top 10-20 dimensions for clustering, full SVD is overkill. Additionally, the time complexity, memory requirements, and numerical stability concerns make SVD both slow and risky.

**Truncated SVD**
- No mean-centering
- No densifying matrices
- No perfect reconstruction

Truncated SVD's lack of centering/shifting helps to preserve the boolean interpretation of the data after reduction. Additionally, Truncated SVD does not densify the matrix in the process. The objective of dimensionality reduction here is clustering, not reconstruction, so Truncated SVD gets the job done faster and with lower memory requirements. The math is not as "pure", but this design choice favors robustness.
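The densification argument against mean-centering can be checked on a tiny binary matrix. This sketch uses only NumPy, and the matrix itself is made up for the demonstration.

```python
import numpy as np

# 4 items x 3 attributes, only 3 of the 12 cells populated
X = np.zeros((4, 3))
X[0, 0] = X[1, 1] = X[2, 2] = 1.0

# Mean-centering (the preprocessing step PCA relies on) subtracts each column mean
X_centered = X - X.mean(axis=0)

nonzero_before = int(np.count_nonzero(X))
nonzero_after = int(np.count_nonzero(X_centered))
print(nonzero_before, "non-zero cells before centering,", nonzero_after, "after")
print("smallest centered value:", X_centered.min())
```

Every former zero becomes -0.25, so the centered matrix has no zero cells left and no longer reads as a presence/absence mask.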

## 3.5 - KMeans Clustering
This section houses the actual instance of KMeans clustering.

Line-by-line breakdown:
- The script attempts to perform k-means clustering with `n_clusters=k`. With sklearn's default `init="k-means++"`, centroids are seeded using greedy k-means++, which makes several trials at each sampling step and keeps the best candidate. (Note: with this initialization, `n_init="auto"` resolves to `1`, i.e. a single run.) The cluster assignments returned by `fit_predict` are stored in `labels`.
- If k-means fails, then a single cluster is generated as a fallback using a NumPy zero array.


In [None]:
    # Cluster
    try:
        km = KMeans(n_clusters=k, random_state=random_state, n_init="auto")
        labels = km.fit_predict(X_reduced)
    except Exception:
        # Fallback: single cluster
        labels = np.zeros(n_samples, dtype=int)
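As a sanity check of the idea, the sketch below clusters a toy presence matrix with two obvious sparsity patterns. The data is invented, and `n_init` is set explicitly here (an assumption of this sketch, not the module's setting) so that it also runs on sklearn releases that predate the `"auto"` option.

```python
import numpy as np
from sklearn.cluster import KMeans

# Five items populate only the first two attributes, five only the last two
X = np.array([[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5, dtype=float)

try:
    km = KMeans(n_clusters=2, random_state=42, n_init=10)
    labels = km.fit_predict(X)
except Exception:
    # Same fallback idea as above: everything in one cluster
    labels = np.zeros(len(X), dtype=int)

print(labels)
```

Items that share a sparsity pattern receive the same label. Which pattern gets id 0 and which gets id 1 is arbitrary.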

## 3.6 - Normalizing Labels and Return
This section normalizes the resulting clustering assignments and returns the assignments.

Line-by-line breakdown:
- Iterate through each label produced by clustering, ensure that the label is an integer, create a set of unique cluster IDs and sort the labels.
- Map old labels to the new normalized labels.
- Convert the mapping into a NumPy array.
- Return the resulting clustering assignments.

Note: Normalization is done per category since each category is clustered independently. The IDs are local instead of global, which is useful for naming and UI grouping. This is also one of the reasons why using `LabelEncoder` would be ineffective.

In [None]:
    # Normalize labels to 0..C-1 per category
    unique_raw = sorted(set(int(l) for l in labels))
    raw_to_new = {raw: i for i, raw in enumerate(unique_raw)}
    labels_norm = np.array([raw_to_new[int(l)] for l in labels], dtype=int)

    return labels_norm
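The remapping is easiest to see on a label array that contains gaps. The wrapper `normalize_labels` is a hypothetical name for the three lines above. For a one-dimensional integer array, `np.unique(..., return_inverse=True)` yields the same dense ids and is shown only as a cross-check.

```python
import numpy as np


def normalize_labels(labels):
    """Map arbitrary integer labels onto 0..C-1, preserving sorted order."""
    unique_raw = sorted(set(int(x) for x in labels))
    raw_to_new = {raw: i for i, raw in enumerate(unique_raw)}
    return np.array([raw_to_new[int(x)] for x in labels], dtype=int)


raw = np.array([5, 2, 5, 9])
dense = normalize_labels(raw)

# Cross-check against NumPy's built-in inverse index
_, inverse = np.unique(raw, return_inverse=True)

print(dense)
print(inverse)
```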

## 4 - Core Value-Based Clustering Logic
This section is responsible for an alternate clustering method that uses attribute values instead of sparsity. The non-public function `_cluster_products_within_category_value` accepts three arguments: a Pandas DataFrame object `cat_df`, a list of strings `attr_cols`, and an integer `random_state`. In practice, `cat_df` is the subset of the primary dataset containing only the rows within the category of items to be clustered. (Future versions of this project will generalize this logic so that attribute layers can be built on top of other attribute layers, instead of just item categories.) `attr_cols` is the list of columns to be treated as attributes for clustering, as returned by the helper function `_select_attribute_columns`. `random_state` seeds the sklearn estimators for reproducibility. The function returns a NumPy array of integer labels, one cluster assignment per row of `cat_df`.

Due to the size of the function, it will be broken down into chunks of line-by-line explanations, rather than one giant breakdown. This first snippet is just the function signature and docstring.

In [None]:
def _cluster_products_within_category_value(
    cat_df: pd.DataFrame,
    attr_cols: List[str],
    random_state: int = 42,
) -> np.ndarray:
    """
    Cluster products within a single category using attribute VALUES
    (not just presence/absence).

    Parameters
    ----------
    cat_df :
        Subset of the dataframe containing only rows for a single category.
    attr_cols :
        Columns to treat as attributes.
    random_state :
        Seed for KMeans and SVD.

    Returns
    -------
    np.ndarray
        An array of integer labels (cluster ids) with length len(cat_df).
    """

## 4.1 - Degenerate Guard
This section is a failsafe to ensure that downstream code never crashes from empty arguments `attr_cols` and `cat_df`.

Line-by-line breakdown:
- If `attr_cols` is empty (or None) OR `cat_df` is an empty dataframe, every row is assigned to a single cluster "0". The function returns this cluster assignment immediately.

In [None]:
    if not attr_cols or cat_df.empty:
        return np.zeros(len(cat_df), dtype=int)

## 4.2 - Value-Based Matrix
This section constructs the value-based matrix, the key difference between this function and the sparsity-based clustering function.

Line-by-line breakdown:
- `features` is initialized as an empty list
- Iterate through each attribute column, storing the contents of the column as `s`, and checking if `s` is `NaN` for all values. If it is, then continue to the next column.
- Otherwise, check to see if the column is numeric. If it is, fill the missing values with `s.median()`, which is less sensitive to outliers than the mean. This is preferable to dropping rows, since dropping rows would break alignment with `cat_df` (every item needs a value in every feature column). The filled column is then converted to a float column vector and appended to the `features` list.
- If the column is not numeric, then its values are converted to strings and mapped to integer categorical codes using `pd.factorize`. `OneHotEncoder` is not preferred here because it would add one column per distinct value to an already very wide, sparse dataset.
- In the case of non-numeric columns, missing values are not dropped: they end up with their own code (either the literal string produced by `astype(str)` or the `-1` sentinel that `pd.factorize` uses for missing values), which preserves the information that missingness provides. The codes are cast to float and appended to `features`.
- If there are no usable features, then a single cluster is returned using a NumPy zero array.
- Using `np.hstack`, the feature columns are combined into a matrix `X` with dimensions "items" x "attributes", and its shape is extracted as `n_samples`, `n_features`.
- If there are only 1 or 0 items in `cat_df`, a simple cluster assignment "0" is returned from this function as a guard.
- If `X` contains no finite values at all, then the same as above is returned, since clustering is meaningless in this instance.

In [None]:
    features = []
    for col in attr_cols:
        s = cat_df[col]
        if s.isna().all():
            continue

        if pd.api.types.is_numeric_dtype(s):
            filled = s.fillna(s.median())
            features.append(filled.to_numpy(dtype=float).reshape(-1, 1))
        else:
            # Factorize string values
            codes, _ = pd.factorize(s.astype(str), sort=True)
            # Treat -1 (NA) as separate code
            codes = codes.astype(float)
            features.append(codes.reshape(-1, 1))

    if not features:
        return np.zeros(len(cat_df), dtype=int)

    X = np.hstack(features)
    n_samples, n_features = X.shape

    if n_samples <= 1:
        return np.zeros(n_samples, dtype=int)

    if not np.any(np.isfinite(X)):
        return np.zeros(n_samples, dtype=int)
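The following standalone sketch runs the same two encodings on a toy frame, so the resulting matrix can be inspected directly. The frame `toy` and its columns are invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "width_mm": [210.0, np.nan, 297.0, 210.0],
        "color": ["red", "blue", "red", "green"],
    }
)

features = []
for col in toy.columns:
    s = toy[col]
    if pd.api.types.is_numeric_dtype(s):
        # Numeric: fill gaps with the column median (210.0 here)
        filled = s.fillna(s.median())
        features.append(filled.to_numpy(dtype=float).reshape(-1, 1))
    else:
        # Non-numeric: sorted integer codes (blue=0, green=1, red=2)
        codes, uniques = pd.factorize(s.astype(str), sort=True)
        features.append(codes.astype(float).reshape(-1, 1))

X = np.hstack(features)
print(X)
```

Each attribute contributes exactly one column, so the matrix is no wider than `attr_cols`. The numeric column keeps its raw scale, while the categorical column becomes small integer codes.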

## 4.3 - K Heuristic
This section chooses the k value for KMeans using the following heuristic: k is set to the square root of the number of items within the category. This value is clipped to prevent over-fragmentation and to ensure that there is more than one cluster.

Line-by-line breakdown:
- The square root of `n_samples` is clipped to the range 2 to 8 and then truncated to an integer, which is stored as `k`.
- If `k` is greater than `n_samples`, then it is set equal to `n_samples`. This can only happen when the lower clip of 2 exceeds the number of items, so after the earlier `n_samples <= 1` guard it is purely defensive.
- If `k` is less than or equal to 1, then the single cluster assignment "0" is returned from this function (final guard for `k`).

In [None]:
    # Determine k using sqrt heuristic, clipped
    k = int(np.clip(np.sqrt(n_samples), 2, 8))
    if k > n_samples:
        k = n_samples
    if k <= 1:
        return np.zeros(n_samples, dtype=int)

## 4.4 - TruncatedSVD Dimensionality Reduction
This section reduces the dimensions of extremely wide data to improve KMeans performance.

Line-by-line breakdown:
- The number of dimensions `n_components` is set to be the minimum of 20, the value of `n_features` - 1, and the value of `n_samples` - 1 IF `n_features` is greater than 1 (more than 1 attribute column). Otherwise, `n_components` is set to 1.
- If `n_components` is greater than or equal to 1, then the function attempts TruncatedSVD using `n_components` and `random_state`, storing it as `svd`. The result of `svd.fit_transform(X)` is stored as `X_reduced`.
- If TruncatedSVD cannot be performed, then `X_reduced` is set equal to X (no reduction).

In [None]:
    # Optional dimensionality reduction for wide matrices
    n_components = min(20, n_features - 1, n_samples - 1) if n_features > 1 else 1
    if n_components >= 1:
        try:
            svd = TruncatedSVD(n_components=n_components, random_state=random_state)
            X_reduced = svd.fit_transform(X)
        except Exception:
            X_reduced = X
    else:
        X_reduced = X

## 4.5 - KMeans Clustering
This section houses the actual instance of KMeans clustering.

Line-by-line breakdown:
- The script attempts to perform k-means clustering with `n_clusters=k`. With sklearn's default `init="k-means++"`, centroids are seeded using greedy k-means++, which makes several trials at each sampling step and keeps the best candidate. (Note: with this initialization, `n_init="auto"` resolves to `1`, i.e. a single run.) The cluster assignments returned by `fit_predict` are stored in `labels`.
- If k-means fails, then a single cluster is generated as a fallback using a NumPy zero array.


In [None]:
    try:
        km = KMeans(n_clusters=k, random_state=random_state, n_init="auto")
        labels = km.fit_predict(X_reduced)
    except Exception:
        labels = np.zeros(n_samples, dtype=int)

## 4.6 - Normalizing Labels and Return
This section normalizes the resulting clustering assignments and returns the assignments.

Line-by-line breakdown:
- Iterate through each label produced by clustering, ensure that the label is an integer, create a set of unique cluster IDs and sort the labels.
- Map old labels to the new normalized labels.
- Convert the mapping into a NumPy array.
- Return the resulting clustering assignments.

Note: Normalization is done per category since each category is clustered independently. The IDs are local instead of global, which is useful for naming and UI grouping. 

In [None]:
    # Normalize labels to 0..C-1 per category
    unique_raw = sorted(set(int(l) for l in labels))
    raw_to_new = {raw: i for i, raw in enumerate(unique_raw)}
    labels_norm = np.array([raw_to_new[int(l)] for l in labels], dtype=int)

    return labels_norm

## 5 - Public Assignment Method 
This section is responsible for running the clustering within the application by making use of the private functions within the attribute layer script. The public function `assign_all_clusters` accepts four arguments: a Pandas DataFrame object `df`, an integer `random_state`, a list of strings `extra_excluded_cols`, and a string `method`. In practice, `df` is the complete primary dataset. `random_state` seeds the sklearn estimators for reproducibility. `extra_excluded_cols` is an optional list of columns to be excluded from consideration as attributes for clustering. `method` declares whether sparsity-based or value-based clustering is used. The function returns a copy of the original dataset as a Pandas DataFrame, with a cluster assignment label for each row.

Due to the size of the function, it will be broken down into chunks of line-by-line explanations, rather than one giant breakdown. This first snippet is just the function signature and docstring.

In [None]:
def assign_all_clusters(
    df: pd.DataFrame,
    random_state: int = 42,
    extra_excluded_cols: list[str] | None = None,
    method: str = "sparsity",
) -> pd.DataFrame:
    """
    Assign attribute-layer clusters within each category.

    For every distinct value of `category_name`, we cluster its products
    using either:

        - 'sparsity': attribute sparsity patterns (which columns are present)
        - 'value':    attribute values (numeric + factorized categorical)

    and assign an integer `attribute_cluster` id (0..K-1 for that category).

    Parameters
    ----------
    df :
        Input dataframe. Must contain `category_name`.
    random_state :
        Random seed for clustering.
    extra_excluded_cols :
        Optional list of column names to exclude from attribute clustering
        (in addition to the built-in metadata exclusions).
    method :
        'sparsity' or 'value'.

    Returns 
    -------
    pd.DataFrame :
        A dataframe of the original dataset with cluster assignments as integers, and a shape of the original dataset plus an additional column.
    """

## 5.1 - Validation Check

This section is a simple guard to ensure that the column `category_name` exists in the dataset. If it does not, the function fails fast with a clear error message.

In [None]:
    if "category_name" not in df.columns:
        raise ValueError("Expected column 'category_name' in dataframe.")

## 5.2 - Column Selection

This section runs the function `_select_attribute_columns` using the input dataset and the specified excluded columns, and stores the return as `attr_cols`. 

In [None]:
    attr_cols = _select_attribute_columns(df, extra_excluded_cols=extra_excluded_cols)

## 5.3 - Clustering and Return

This section executes the actual clustering functions. 

Line-by-line breakdown:
- A copy of `df` is stored as `df_out` to avoid mutating the original dataset itself.
- Cluster assignments are initialized prior to clustering using a NumPy zero array.
- Iterate through each category in the dataset, grouping them together and storing the current slice as `cat_df`.
- Check to see which clustering method was passed into the function (sparsity vs. value)
- If clustering using value, then the return of `_cluster_products_within_category_value` (using `cat_df`, `attr_cols`, and `random_state` as arguments) is stored as `labels`. If clustering using sparsity, then the return of `_cluster_products_within_category_sparsity` (using `cat_df`, `attr_cols`, and `random_state` as arguments) is stored as `labels`.
- At the end of each iteration, the category-specific clustering labels are written back into the global array as `all_labels`, making sure that each row (item) is assigned the correct cluster label. 
- After each category has been clustered, the column `attribute_cluster` for `df_out` is generated using `all_labels` and the resulting dataframe is returned.

Note: This function runs sparsity-based clustering by default, and the iteration only checks for `value`. As a result, in the event of a typo, invalid entry, or missing argument, the function will run sparsity-based clustering to prevent crashing.

In [None]:
    df_out = df.copy()
    all_labels = np.zeros(len(df_out), dtype=int)

    # .indices maps each category to integer row positions, so the write-back
    # below stays correct even when df does not have a default 0..n-1 index
    for cat, cat_pos in df_out.groupby("category_name").indices.items():
        cat_df = df_out.iloc[cat_pos]

        if method == "value":
            labels = _cluster_products_within_category_value(
                cat_df,
                attr_cols,
                random_state=random_state,
            )
        else:
            labels = _cluster_products_within_category_sparsity(
                cat_df,
                attr_cols,
                random_state=random_state,
            )

        all_labels[cat_pos] = labels

    df_out["attribute_cluster"] = all_labels
    return df_out
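Because `all_labels` is a plain NumPy array, writing into it requires row positions rather than index labels. The two coincide only when the dataframe has a default 0..n-1 index. The sketch below, using an invented frame with a non-default index, shows how pandas exposes per-group positions through `groupby(...).indices`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"category_name": ["Binders", "Pens", "Binders"]},
    index=[10, 20, 30],  # deliberately not 0..n-1
)

# Maps each category to the integer positions of its rows
positions = toy.groupby("category_name").indices
print(positions)

all_labels = np.zeros(len(toy), dtype=int)
all_labels[positions["Pens"]] = 1  # positional write, safe for any index
print(all_labels)
```

Indexing `all_labels` with the index labels 10, 20, and 30 would raise an `IndexError` here, since the array only has three slots.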

## 6 - Name Generation Purity Method

This section covers the naming algorithm used throughout the application. It is primarily used for creating human-readable names for attribute-layer clusters, but it is also used within the category layer to produce synthetic categories when the dataset does not have a category system in place. The public function `make_cluster_names` accepts three arguments: a Pandas DataFrame object `df`, a float `purity_threshold`, and a list of strings `extra_excluded_cols`. In practice, `df` is the complete primary dataset. `purity_threshold` is the minimum purity score an attribute column must reach within a cluster for its dominant value to be used as a descriptive name. `extra_excluded_cols` is an optional list of columns to be excluded from naming consideration. The function returns a `Tuple[Dict[tuple, str], pd.DataFrame]`: a dictionary mapping each (category, cluster) pair to its cluster label, together with the resulting dataframe, which includes the cluster name column.

Due to the size of the function, it will be broken down into chunks of line-by-line explanations, rather than one giant breakdown. This first snippet is just the function signature and docstring.

In [None]:
def make_cluster_names(
    df: pd.DataFrame,
    purity_threshold: float = 0.5,
    extra_excluded_cols: list[str] | None = None,
) -> Tuple[Dict[tuple, str], pd.DataFrame]:
    """
    Compute human-readable names for attribute-layer clusters.

    For each (category_name, attribute_cluster) group, we scan the
    attribute columns and look for columns whose values are relatively
    pure within the cluster (most rows share the same non-null value).

    We then generate a label such as:

        "Binding Covers – 210 (Width mm) / 16 (Height mm)"

    where the pieces are taken from high-purity attribute values.

    Parameters
    ----------
    df :
        Input dataframe. Must contain 'category_name' and 'attribute_cluster'.
    purity_threshold :
        Minimum fraction of rows within a cluster that must share the
        same value in an attribute column for that (column, value)
        descriptor to be used in the label.
    extra_excluded_cols :
        Optional list of columns to exclude from naming consideration.

    Returns
    -------
    Tuple[Dict[tuple, str], pd.DataFrame] :
        Category-cluster pair mapped to the corresponding cluster label, coupled with the resulting dataframe that includes the category-cluster name.
    """
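The purity idea from the docstring can be sketched in a few lines. This is a simplified illustration, not the module's implementation: the helper `purity_descriptor` is hypothetical, and it assumes that purity is the share of all rows in the cluster holding the most common value and that the threshold comparison is inclusive.

```python
from collections import Counter


def purity_descriptor(values, column, n_rows, purity_threshold=0.5):
    """Return "value (column)" if the most common value is pure enough, else None."""
    if not values:
        return None
    top_value, count = Counter(values).most_common(1)[0]
    purity = count / n_rows  # assumption: denominator is all rows in the cluster
    if purity >= purity_threshold:  # assumption: inclusive comparison
        return f"{top_value} ({column})"
    return None


# 3 of 4 rows share the width "210": purity 0.75, so a descriptor is produced
print(purity_descriptor(["210", "210", "210", "297"], "Width mm", n_rows=4))

# Four different colors: purity 0.25, so the column contributes nothing
print(purity_descriptor(["red", "blue", "green", "black"], "Color", n_rows=4))
```

Descriptors that pass the threshold are the pieces that appear after the category name in a label such as the `210 (Width mm)` fragment in the docstring example.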

## 6.1 - Validation Check

This section is a simple guard to ensure that the columns `category_name` and `attribute_cluster` exist in the dataset. If either one does not, the function fails fast with a clear error message.

In [None]:
    if "category_name" not in df.columns or "attribute_cluster" not in df.columns:
        raise ValueError("Expected 'category_name' and 'attribute_cluster' columns.")

## 6.2 - Column Selection

This section runs the function `_select_attribute_columns` using the input dataset and the specified excluded columns, and stores the return as `attr_cols`.

In [None]:
    attr_cols = _select_attribute_columns(df, extra_excluded_cols=extra_excluded_cols)

## 6.3 - Mapping

This section creates a safe copy of the original dataframe, initializes the cluster map as a dictionary with tuples (`category_name`, `attribute_cluster`) mapped to strings (labels), and initializes the `attribute_cluster_name` column in the dataframe copy.

In [None]:
    df_out = df.copy()
    cluster_name_map: Dict[tuple, str] = {}
    df_out["attribute_cluster_name"] = ""

# 6.4 - Degenerate Guard

This section is a failsafe for the case where no attribute columns were selected (`attr_cols` is empty).

Line-by-line breakdown:
- If `attr_cols` is empty, iterate through each cluster within each category.
- A misc. label is built from the category name and stored as `label`.
- The label is stored in `cluster_name_map` under the (category, cluster) key and written to the `attribute_cluster_name` column for that group's rows.
- After iteration has concluded, the mapping and resulting dataframe are returned.

In [None]:
    if not attr_cols:
        # Degenerate: just use category name + generic suffix
        for (cat, cid), idx in df_out.groupby(
            ["category_name", "attribute_cluster"]
        ).groups.items():
            label = f"{cat} – misc"
            cluster_name_map[(cat, cid)] = label
            df_out.loc[idx, "attribute_cluster_name"] = label
        return cluster_name_map, df_out
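
The fallback path can be illustrated without pandas. `groupby(...).groups` is a mapping from each group key to the index labels of that group's rows; the sketch below builds the same kind of mapping by hand for a few made-up rows and then applies the misc. label. The row data and the helper name `assign_misc_labels` are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Key = Tuple[str, int]  # (category_name, attribute_cluster)


def assign_misc_labels(rows: Dict[Hashable, Key]):
    """Group row labels by (category, cluster) and give every group a generic name."""
    # Similar in spirit to groupby(...).groups: group key -> row index labels
    groups: Dict[Key, List[Hashable]] = defaultdict(list)
    for idx, key in rows.items():
        groups[key].append(idx)

    cluster_name_map: Dict[Key, str] = {}
    row_names: Dict[Hashable, str] = {}
    for (cat, cid), idx_list in groups.items():
        label = f"{cat} – misc"
        cluster_name_map[(cat, cid)] = label
        for idx in idx_list:
            row_names[idx] = label
    return cluster_name_map, row_names


# Made-up rows keyed by index label
rows = {0: ("Binding Covers", 0), 1: ("Binding Covers", 1), 2: ("Pens", 0)}
name_map, names = assign_misc_labels(rows)
```

Note that on this path every cluster within a category receives the same label, so clusters remain distinguishable only by their `attribute_cluster` id.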

# 6.5 - Naming Algorithm

This section contains the actual purity-based naming algorithm, returning the resulting map and dataframe.

Line-by-line breakdown:
- A set of strings that indicate missing values is stored in `nan_like`. This set excludes uninformative placeholder values from the naming process, preventing names such as `None (Color)`.
- The outer loop begins, iterating through each category-cluster pair. The number of rows in the pair's subset dataframe `grp` is stored as `n_rows`, and the list of label fragments is initialized as `descriptors`.
- The inner loop begins, iterating through each attribute column individually. The column values for the current cluster are stored as `col_series`; if they are all null, the column is skipped. Otherwise, the values are converted to stripped strings and stored as `vals`, and any empty strings and nan-like strings are dropped. If `vals` is empty at this point, the column is skipped.
- Within the inner loop, the occurrences of each distinct value are counted and stored as `vc`. The most common value is stored as `top_val`, and its count is stored as `top_count`. The purity score is `top_count` divided by `n_rows`, which is the fraction of all rows in the cluster (including rows where the column is missing) that contain this value.
- If the purity score meets or exceeds the `purity_threshold` parameter, a descriptor of the form `value (column)` is appended to `descriptors`. The current iteration of the inner loop then concludes.
- Returning to the outer loop, if `descriptors` is non-empty, the label is built from the first three descriptors, in the order their columns were scanned. Otherwise, a misc. label is generated instead. The map and dataframe are updated for the current category-cluster pair, and the next iteration begins.
- Upon the conclusion of the outer loop, the resulting map and dataframe are returned.

In [None]:
    # Treat these string values as "missing" when naming
    nan_like = {"nan", "none", "null", "na", "n/a"}

    for (cat, cid), grp in df_out.groupby(["category_name", "attribute_cluster"]):
        n_rows = len(grp)
        descriptors: List[str] = []

        # Scan each attribute column for high-purity values
        for col in attr_cols:
            col_series = grp[col]

            # Skip if all null
            if col_series.isna().all():
                continue

            vals = col_series.astype(str).str.strip()

            # Drop empty strings
            vals = vals[vals != ""]

            # Drop nan-like string representations
            vals = vals[~vals.str.lower().isin(nan_like)]

            if vals.empty:
                continue

            vc = vals.value_counts(dropna=True)
            top_val = vc.index[0]
            top_count = vc.iloc[0]
            purity = top_count / n_rows

            if purity >= purity_threshold:
                descriptors.append(f"{top_val} ({col})")

        if descriptors:
            desc_str = " / ".join(descriptors[:3])
            label = f"{cat} – {desc_str}"
        else:
            label = f"{cat} – misc"

        cluster_name_map[(cat, cid)] = label
        df_out.loc[grp.index, "attribute_cluster_name"] = label

    return cluster_name_map, df_out
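
To make the purity calculation concrete, the following is a dependency-free sketch of the same steps for a single cluster. It is not the notebook's code: the helper names `top_value_purity` and `name_cluster`, the sample values, and the threshold of 0.75 are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

NAN_LIKE = {"nan", "none", "null", "na", "n/a"}


def top_value_purity(values: List[object], n_rows: int) -> Optional[Tuple[str, float]]:
    """Return (top_value, purity) for one column, or None if nothing usable remains."""
    cleaned = [str(v).strip() for v in values]
    cleaned = [v for v in cleaned if v != "" and v.lower() not in NAN_LIKE]
    if not cleaned:
        return None
    top_val, top_count = Counter(cleaned).most_common(1)[0]
    # The denominator is the full cluster size, so missing values lower the purity
    return top_val, top_count / n_rows


def name_cluster(cat: str, columns: Dict[str, List[object]], n_rows: int,
                 purity_threshold: float) -> str:
    """Build a label from the first three high-purity descriptors, else a misc. label."""
    descriptors: List[str] = []
    for col, values in columns.items():
        result = top_value_purity(values, n_rows)
        if result is not None and result[1] >= purity_threshold:
            descriptors.append(f"{result[0]} ({col})")
    if descriptors:
        return f"{cat} – " + " / ".join(descriptors[:3])
    return f"{cat} – misc"


# Made-up cluster of four rows
cluster = {
    "Width mm": ["210", "210", "210 ", "210"],  # purity 4/4 after stripping
    "Height mm": ["16", "16", "16", None],      # purity 3/4
    "Color": ["Red", "Blue", None, "nan"],      # purity 1/4, below threshold
}
label = name_cluster("Binding Covers", cluster, n_rows=4, purity_threshold=0.75)
print(label)
```

With these sample values, only the width and height columns clear the threshold, so the label takes the same form as the example in the docstring.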

# NOTE - Why Purity-Based?

Purity scores are effective for generating names based on the defining attributes of a cluster. Since many master datasets are sparse and wide, the presence of specific columns (and the actual entries in those columns) is significant for identifying natural groupings of items.

Another option to consider is TF-IDF naming, which was utilized in an earlier version of this project. In theory, TF-IDF would be a powerful upgrade. Having clear, significant, and grammatically correct cluster names would be valuable, and in certain text-based datasets with low sparsity and uniform text column formats, TF-IDF would perform well.

However, TF-IDF struggled in a few key areas in this project:
- Managing unstructured text data columns
- Identifying most popular values within a group
- Handling sparsity 

As a result, TF-IDF often produced nonsensical or insignificant names for many clusters. The goal of naming the clusters is to be able to interpret how the clusters are grouped. Purity scores identify the most significant descriptors in each cluster as well as the most common value, which helps a human understand exactly what types of items the cluster contains. Thus, the project now implements purity-based naming instead of TF-IDF.
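
One plausible reason purity scores cope with sparsity here is the choice of denominator: the naming code divides the top value's count by the total number of rows in the cluster, so a value that appears in only a few populated cells cannot dominate the name. The sketch below contrasts that with a hypothetical alternative (dividing by populated cells only), which is not part of the Attribute Layer and is shown purely for comparison. The function names and sample column are assumptions.

```python
from collections import Counter
from typing import List, Optional


def purity_over_all_rows(values: List[Optional[str]]) -> float:
    """Top-value count divided by the total row count (the denominator used above)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    top_count = Counter(present).most_common(1)[0][1]
    return top_count / len(values)


def purity_over_populated_rows(values: List[Optional[str]]) -> float:
    """Hypothetical alternative: divide by the number of populated cells only."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    top_count = Counter(present).most_common(1)[0][1]
    return top_count / len(present)


# Made-up sparse column: 2 populated cells out of 10 rows
sparse_column = ["Red"] * 2 + [None] * 8
all_rows = purity_over_all_rows(sparse_column)
populated = purity_over_populated_rows(sparse_column)
print(all_rows, populated)
```

Under the all-rows denominator, this column would need a very low `purity_threshold` to contribute a descriptor, whereas the alternative would treat it as perfectly pure despite describing only a fifth of the cluster.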