# CATEGORY_LAYER.ipynb 
# Categorical Foundation for Hierarchy Engine

---
## Overview
- Docstring and Imports
- Helper Debug Function
- Category Name Column Get/Generate

## 1 - Docstring and Imports
The Category Layer is the foundation of Hierarchy. Semantic layers use the existing Category Layer to cluster groups based on semantic similarity, whereas Attribute layers cluster items within each category of the Category Layer.

Real-world datasets are often messy, unstructured, and inconsistent with one another, but most professional datasets include some kind of category column for internal grouping and organization. Hierarchy relies on this assumption to build hierarchical layers of items, combining human intuition and input with automated analysis. The application auto-detects category columns and also lets users designate a different column as the category source if desired.

If a category column does not exist in a dataset, then Hierarchy makes use of the Attribute Layer module to generate categories instead, ensuring that a Category Layer exists.

The module uses pandas for data handling and type hints for clarity. It also imports two functions from `.attribute_layer`, `assign_all_clusters` and `make_cluster_names`, which are used for category generation.

In [None]:
# core/category_layer.py

"""
Category layer utilities.

The category layer is the base text layer from which
semantic layers are built:

    Semantic Layer 0
        ↑
    Semantic Layer 1
        ↑
    Category Layer (this module)
        ↑
    Raw item records

This module provides small, focused helpers to:

  - Normalize and standardize the category column
  - Optionally assign integer category IDs
  - Produce simple summaries of the category layer
"""

# Type hints
from __future__ import annotations
from typing import Optional, List

# External dependencies
import pandas as pd

# Internal dependencies
from .attribute_layer import (
    assign_all_clusters,
    make_cluster_names,
)

## 2 - Ensure Category Name Column

This helper function derives a canonical `category_name` column from a caller-specified category column. It accepts a pandas DataFrame `df`, a string `category_col`, and a keyword-only bool `strip` (default `True`). In practice, `df` is the complete original dataset, `category_col` is the name of the column containing category assignments, and `strip` toggles removal of leading and trailing whitespace for normalization. The function returns a copy of the original dataset with an additional `category_name` column.

This first cell contains only the signature and docstring.

In [None]:
def ensure_category_name_column(
    df: pd.DataFrame,
    category_col: str,
    *,
    strip: bool = True,
) -> pd.DataFrame:
    """
    Ensure the dataframe has a canonical `category_name` column, derived
    from the specified `category_col`.

    This helper is intentionally simple. The HierarchyEngine will call a
    similar normalization internally, but this function is useful if you
    want to perform category-level analysis or debugging outside of the
    engine.

    Parameters
    ----------
    df :
        Input dataframe containing a category column.
    category_col :
        Name of the column to treat as the category text source.
    strip :
        If True, strip leading/trailing whitespace from category strings.

    Returns
    -------
    pd.DataFrame
        A copy of the dataframe with a `category_name` column.
    """

### 2.1 - Actual Function

This function is brief and serves a simple debugging purpose. It is not actively used in Hierarchy.

Line-by-line breakdown:
- Check if the specified `category_col` is in the dataframe. If not, raise a ValueError.
- Otherwise, copy the dataset so the input is not mutated, select the category column, and cast it to strings (in case the original dataset uses a non-string data type for the category column).
- If `strip` is toggled on, remove leading and trailing whitespace for normalization.
- Store the category names in a new column in the copied dataset, and return.
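
The steps above can be exercised on a toy frame. The sketch below repeats them in a stand-in function so that it runs on its own; the name `demo_ensure_category_name_column` and the sample columns `item` and `dept` are assumptions made for illustration, not part of Hierarchy.

```python
import pandas as pd


def demo_ensure_category_name_column(df, category_col, *, strip=True):
    """Stand-in that mirrors the four steps listed above (hypothetical name)."""
    if category_col not in df.columns:
        raise ValueError(f"Category column '{category_col}' not found in dataframe.")
    df_out = df.copy()                      # never mutate the caller's frame
    cat = df_out[category_col].astype(str)  # tolerate non-string category codes
    if strip:
        cat = cat.str.strip()
    df_out["category_name"] = cat
    return df_out


# Toy data: padded strings and an integer code share one category column.
raw = pd.DataFrame(
    {"item": ["bolt", "nut", "drill"], "dept": ["  Hardware ", "Hardware", 7]}
)

clean = demo_ensure_category_name_column(raw, "dept")
category_names = clean["category_name"].tolist()

# Only the returned copy gains the new column.
original_untouched = "category_name" not in raw.columns

# A missing column raises instead of silently producing an empty layer.
try:
    demo_ensure_category_name_column(raw, "department")
    missing_column_error = None
except ValueError as exc:
    missing_column_error = str(exc)
```

In this toy run, the padded and unpadded spellings of `Hardware` collapse into one category, and the integer code becomes the string `"7"`.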

In [None]:
    if category_col not in df.columns:
        raise ValueError(f"Category column '{category_col}' not found in dataframe.")

    df_out = df.copy()
    cat = df_out[category_col].astype(str)

    if strip:
        cat = cat.str.strip()

    df_out["category_name"] = cat

    return df_out

## 3 - Core Category Get/Generate Logic

This function normalizes the category column of a dataset, or generates one if it does not exist. It accepts a pandas DataFrame `df`, an optional string `category_col` (which may be `None`), and an optional, keyword-only list of strings `extra_excluded_cols`. In practice, `df` is the complete original dataset, `category_col` is the name of the column containing category assignments, and `extra_excluded_cols` lists columns to exclude when categories are generated with the Attribute Layer functions. The function returns a copy of the original dataset with an additional `category_name` column.

Because of its size, the function is broken into sections and explained line by line. This first section contains only the function signature and docstring.

In [None]:
def ensure_or_generate_category_name(
    df: pd.DataFrame,
    category_col: Optional[str],
    *,
    extra_excluded_cols: Optional[List[str]] = None,
) -> pd.DataFrame:
    """
    Ensure the dataframe has a `category_name` column.

    - If `category_col` is provided and exists in df:
        behave like `ensure_category_name_column`.
    - If `category_col` is None (or missing from df):
        automatically generate a synthetic `category_name` by
        clustering rows using attribute sparsity patterns and naming
        those clusters with the attribute-layer naming logic.
    """

### 3.1 - Case 1: Existing Category Name Column

This section contains Case 1, the scenario in which a category column exists.

Line-by-line breakdown:
- Create a safe copy of the input dataset.
- If `category_col` is not `None` AND it exists in the dataset, run Case 1.
- Case 1: cast the category column to strings and strip leading/trailing whitespace (unlike `ensure_category_name_column`, stripping is not optional here), store the result as `category_name` in the copied dataset, and return the resulting dataframe.
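
As a small illustration of the guard described above, the sketch below checks which inputs would take the Case 1 branch. The helper name `takes_case_1` and the toy columns are assumptions made for this example only.

```python
import pandas as pd

# Toy frame with a padded category column (column names are illustrative).
df = pd.DataFrame({"item": ["bolt", "drill"], "dept": [" Hardware", "Tools "]})


def takes_case_1(frame, category_col):
    """Mirror of the Case 1 guard (hypothetical helper name)."""
    return category_col is not None and category_col in frame.columns


# Column lookup is exact, so a differently cased name does not match.
branch = {name: takes_case_1(df, name) for name in ("dept", "Dept", None)}

# The normalization applied when Case 1 is taken.
normalized = df["dept"].astype(str).str.strip().tolist()
```

Because the guard falls through quietly, a misspelled `category_col` leads to category generation rather than an error, which is a different behavior from the `ValueError` raised by `ensure_category_name_column`.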

In [None]:
    df_out = df.copy()

    if category_col is not None and category_col in df_out.columns:
        cat = df_out[category_col].astype(str).str.strip()
        df_out["category_name"] = cat
        return df_out

### 3.2 - Case 2: Generate Category Name Column

This section contains Case 2, the scenario in which a category column does not exist and needs to be generated.

Line-by-line breakdown:
- `df_temp` is created as a copy of `df_out`, and every item in it is assigned to a single placeholder category, `ALL`.
- The Attribute Layer function `assign_all_clusters` is run on `df_temp` (with a fixed `random_state=42`) to cluster items based on attribute sparsity, and the output is stored back in `df_temp`.
- The Attribute Layer function `make_cluster_names` is run on `df_temp` to generate cluster names; its second return value is stored as `df_named`, and the first is discarded.
- The `category_name` column of `df_out` is set to the `category_subcluster_name` column of `df_named`, so the generated subcluster names become the categories, and the resulting dataframe is returned.
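
The flow above can be traced end to end with stand-ins for the two Attribute Layer functions. The real `assign_all_clusters` and `make_cluster_names` are not shown in this notebook, so the stubs below are assumptions: they only mimic the call shapes used here and group rows by which attribute columns are populated, which is a deliberately crude reading of "attribute sparsity". The intermediate column name `category_subcluster` is also an assumption.

```python
import pandas as pd


def stub_assign_all_clusters(df, *, random_state=42, extra_excluded_cols=None):
    """Hypothetical stand-in for the clustering step.

    Rows with the same set of populated attribute columns share a subcluster.
    `random_state` is accepted only to mirror the call shape; it is unused here.
    """
    excluded = {"category_name", *(extra_excluded_cols or [])}
    attr_cols = [c for c in df.columns if c not in excluded]
    populated = df[attr_cols].notna()
    out = df.copy()
    out["category_subcluster"] = populated.apply(
        lambda row: "+".join(c for c in attr_cols if row[c]) or "none", axis=1
    )
    return out


def stub_make_cluster_names(df, *, extra_excluded_cols=None):
    """Hypothetical stand-in for the naming step: returns (names, named_frame)."""
    names = {key: key.upper() for key in df["category_subcluster"].unique()}
    named = df.copy()
    named["category_subcluster_name"] = named["category_subcluster"].map(names)
    return names, named


# Toy data with no category column; fasteners and power tools fill different fields.
raw = pd.DataFrame(
    {
        "item": ["bolt", "nut", "drill", "sander"],
        "thread_mm": [6.0, 8.0, None, None],
        "voltage": [None, None, 18.0, 12.0],
    }
)

# Same sequence as the four steps above, using the stubs.
df_out = raw.copy()
df_temp = df_out.copy()
df_temp["category_name"] = "ALL"
df_temp = stub_assign_all_clusters(
    df_temp, random_state=42, extra_excluded_cols=["item"]
)
_, df_named = stub_make_cluster_names(df_temp, extra_excluded_cols=["item"])
df_out["category_name"] = df_named["category_subcluster_name"]

generated = df_out["category_name"].tolist()
```

Note that the final assignment is aligned by pandas on the row index, so this step relies on the clustering and naming functions preserving the original index; rows whose index labels are missing from `df_named` would receive missing values.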

In [None]:
    # Temporarily treat the whole dataset as one "pseudo-category"
    df_temp = df_out.copy()
    df_temp["category_name"] = "ALL"

    # Use existing attribute-layer code to find subclusters
    df_temp = assign_all_clusters(
        df_temp,
        random_state=42,
        extra_excluded_cols=extra_excluded_cols,
    )

    # Name those subclusters using the same naming logic
    _, df_named = make_cluster_names(
        df_temp,
        extra_excluded_cols=extra_excluded_cols,
    )

    # Our synthetic categories are exactly those subcluster names
    df_out["category_name"] = df_named["category_subcluster_name"]
    return df_out