# TEXT_UTILS.ipynb 
# Utility Functions for Text Data Management

---
## Overview
- Docstring and Imports
- Globals
- Helper Functions
- Embeddings for Labels
- TF-IDF Clustering

## 1 - Docstring and Imports
The core logic of this application leverages the text-based utility functions outlined in this notebook. 

Regex (the third-party `regex` package, imported as `re`) is used for data cleaning, NumPy is used for data management, `Counter` is used for the token counts in the fallback labeler, and annotations/type hints are incorporated for clarity. Additionally, `TfidfVectorizer` and `SentenceTransformer` handle text-data processing.

In [None]:
# core/text_utils.py

"""
Text utilities for semantic and attribute clustering.

This module provides:

    - A shared SentenceTransformer model for embeddings
    - Helpers for normalizing and tokenizing text
    - TF-IDF–based cluster labeling that is robust to sparse input
"""

# Type hints
from __future__ import annotations
from typing import List, Sequence

# External dependencies
import numpy as np
import regex as re
from collections import Counter

# Sklearn dependencies
from sklearn.feature_extraction.text import TfidfVectorizer

# Sentence Transformer
from sentence_transformers import SentenceTransformer

## 2 - Globals

This section defines a shared, module-level model and the private function `_get_sentence_model`, which accepts a string `model_name` (the model to load).

The private variable `_sentence_model` is a module-level variable that starts as `None`. 

Line-by-line breakdown of the function:
- Declare `_sentence_model` as `global` so that the assignment inside the function updates the module-level variable.
- Check whether a model has already been loaded. If not, load the requested model and store it in `_sentence_model`. Either way, return the cached model; later calls never reload it. Because the cache holds a single model, `model_name` only takes effect on the first call.

The purpose of the function is to ensure that the model is loaded only once. Constructing a `SentenceTransformer` is slow and memory-heavy, so it should not be repeated every time text is embedded.

In [None]:
_sentence_model: SentenceTransformer | None = None

def _get_sentence_model(model_name: str = "all-MiniLM-L6-v2") -> SentenceTransformer:
    """
    Lazily load and cache the SentenceTransformer model used for
    semantic embedding of labels (categories, cluster names, etc.).
    """
    global _sentence_model
    if _sentence_model is None:
        _sentence_model = SentenceTransformer(model_name)
    return _sentence_model
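
The caching behavior described above can be sketched without loading a real model. The example below is only an illustration: `FakeModel` and `get_model` are hypothetical stand-ins (not part of this module) that count how often the "expensive" constructor runs.

```python
from __future__ import annotations


class FakeModel:
    """Hypothetical stand-in for an expensive model; counts constructions."""

    instances_created = 0

    def __init__(self, name: str) -> None:
        FakeModel.instances_created += 1
        self.name = name


_model: FakeModel | None = None  # module-level cache, starts empty


def get_model(name: str = "model-a") -> FakeModel:
    """Build the model on the first call; return the cached one afterwards."""
    global _model
    if _model is None:
        _model = FakeModel(name)
    return _model


first = get_model()
second = get_model()
third = get_model("model-b")  # the name is ignored: the cache is already filled

print(first is second, first is third)  # True True
print(FakeModel.instances_created)      # 1
print(third.name)                       # model-a
```

The last line shows the trade-off of a single-slot cache: a different `model_name` passed on a later call has no effect.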

## 3 - Helper Functions

This section contains two helper functions for text normalization and tokenization. Each function will be broken down separately.

### 3.1 - Normalize

This function accepts a string `text` and returns a string. In practice, the input is any text to normalize, and the output is the same text with each run of whitespace collapsed to a single space and leading and trailing whitespace removed. Whitespace-only input therefore becomes an empty string, which callers can detect and exclude before TF-IDF or embedding.

Line-by-line breakdown:
- Coerce the input to a string with `str()`.
- Replace each run of whitespace with a single space.
- Strip leading and trailing whitespace and return.

In [None]:
def normalize_text(text: str) -> str:
    """
    Normalize whitespace and coerce to string.

    This is intentionally lightweight: we primarily use it to make
    sure we don't feed empty strings or pathological whitespace into
    TF-IDF or embedding models.
    """
    s = str(text)
    s = re.sub(r"\s+", " ", s)
    return s.strip()
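
A few sample calls make the behavior concrete. The helper is repeated here so the example runs on its own; it uses the standard-library `re` module, on the assumption that it behaves the same as the `regex` package for this simple pattern.

```python
import re  # stdlib; assumed equivalent to the notebook's `regex` import for this pattern


def normalize_text(text: str) -> str:
    """Same logic as the helper above, repeated so the example is self-contained."""
    s = str(text)
    s = re.sub(r"\s+", " ", s)
    return s.strip()


print(repr(normalize_text("  red \t\n  wool   scarf ")))  # 'red wool scarf'
print(repr(normalize_text(" \n\t ")))                      # ''
print(repr(normalize_text(42)))                            # '42'
print(repr(normalize_text(None)))                          # 'None'
```

Note the last case: because of the `str()` coercion, `None` becomes the non-empty string `'None'` rather than an empty string, so missing values may need to be filtered out before calling the helper.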

### 3.2 - Tokenize

This function accepts a string `text` and returns a list of strings. In practice, the input string is any text to tokenize, and the output is the list of tokens found in it.

Line-by-line breakdown:
- Use `normalize_text` to normalize the input, then convert it to lowercase.
- Return every run of one or more word characters (letters, digits, or underscores) as a token.

In [None]:
def tokenize(text: str) -> List[str]:
    """
    Very simple tokenizer for fallback frequency calculations.

    - Lowercase
    - Extract runs of word characters (letters, digits, underscore) as tokens
    """
    s = normalize_text(text).lower()
    return re.findall(r"\b\w+\b", s)
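
The sample below shows how the pattern splits a typical product string. Both helpers are repeated so the example is self-contained, again assuming the standard-library `re` module matches the `regex` package for this pattern; the input string is made up for illustration.

```python
import re  # stdlib; assumed equivalent to the notebook's `regex` import here
from typing import List


def normalize_text(text: str) -> str:
    return re.sub(r"\s+", " ", str(text)).strip()


def tokenize(text: str) -> List[str]:
    s = normalize_text(text).lower()
    return re.findall(r"\b\w+\b", s)


tokens = tokenize("Men's Trail-Running Shoes, size_10 (2024)")
print(tokens)
# ['men', 's', 'trail', 'running', 'shoes', 'size_10', '2024']

print(tokenize("   "))  # []
```

Apostrophes and hyphens act as separators, while the underscore in `size_10` is kept because `\w` counts it as a word character.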

## 4 - Embeddings for Labels

The public function `build_embeddings_for_labels` is the crux of the Semantic Layer. The function accepts a Sequence of strings `labels` (category names or cluster labels to embed) and returns a NumPy array containing the embeddings.

Line-by-line breakdown:
- Convert `labels` into a list so that it can be tested for emptiness and iterated, even if the caller passed a generator.
- If `labels` is empty, return early with an empty NumPy array of shape `(0, 0)`.
- `_get_sentence_model()` retrieves the shared model and stores it as `model`.
- Run `normalize_text` on each label and store the normalized labels as a list `cleaned`.
- Encode the cleaned labels and store the result as `emb`.
- Return the embeddings as a NumPy array of floats.

In [None]:
def build_embeddings_for_labels(labels: Sequence[str]) -> np.ndarray:
    """
    Build sentence-level embeddings for a sequence of label strings,
    using a shared SentenceTransformer model.

    Parameters
    ----------
    labels :
        Any iterable of strings (e.g., category names, cluster labels).

    Returns
    -------
    np.ndarray
        2D array of shape (len(labels), embedding_dim).
        Returns an empty array if `labels` is empty.
    """
    labels = list(labels)
    if not labels:
        return np.zeros((0, 0), dtype=float)

    model = _get_sentence_model()
    cleaned = [normalize_text(x) for x in labels]
    emb = model.encode(cleaned, show_progress_bar=False)
    return np.asarray(emb, dtype=float)
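
One detail worth illustrating is why the function calls `list(labels)` before the `if not labels` check. The sketch below uses a hypothetical generator, `lazy_labels`, to show that an empty generator is still truthy, so the emptiness test is only reliable after the input has been materialized.

```python
def lazy_labels(names):
    """Hypothetical label source that yields names one at a time."""
    for name in names:
        yield name


gen = lazy_labels([])
gen_is_truthy = bool(gen)              # generator objects are always truthy
materialized = list(lazy_labels([]))
list_is_truthy = bool(materialized)    # an empty list is falsy

print(gen_is_truthy)   # True
print(list_is_truthy)  # False
```

Without the conversion, an empty generator would slip past the early return and reach the model with no labels.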

## 5 - TF-IDF Clustering

The public function `tfidf_cluster_label` is also integral to the Semantic Layer, specifically for naming clusters. It takes a Sequence of strings `texts`, an integer `max_words`, an integer `min_df`, and a float `max_df`:
- `texts` holds the text of the items in one cluster.
- `max_words` is the upper limit on the number of words in the label.
- `min_df` and `max_df` bound a term's document frequency: a term is kept only if it appears in at least `min_df` documents and in at most a `max_df` proportion of them.

The function returns a string, which is the human-readable label computed from TF-IDF scores.

Due to its size and complexity, it will be broken down into chunks. This first section is the signature and docstring.

In [None]:
def tfidf_cluster_label(
    texts: Sequence[str],
    max_words: int = 4,
    min_df: int = 1,
    max_df: float = 0.9,
) -> str:
    """
    Compute a short, human-readable label from a collection of texts
    using TF-IDF scores.

    This is used to name semantic clusters in a way that
    reflects the most informative terms appearing in cluster members.

    The function is designed to be robust:
      - If input is empty → returns "misc"
      - If TF-IDF pruning removes all terms → relaxes pruning and retries
      - If that still fails → falls back to simple token frequency

    Parameters
    ----------
    texts :
        Sequence of strings belonging to a single cluster.
    max_words :
        Maximum number of words to include in the label.
    min_df :
        Minimum document frequency for TF-IDF features.
    max_df :
        Maximum document frequency (as a proportion) for TF-IDF features.

    Returns
    -------
    str
        A title-cased cluster label, or "misc" if no good label
        can be determined.
    """

### 5.1 - Normalize and Filter

This section calls `normalize_text` on each text and stores the non-empty results in a list `docs`. If `docs` is empty, the function returns the label `misc` early as a failsafe.

In [None]:
    # Normalize each text once and keep only the non-empty results
    docs = [d for d in (normalize_text(t) for t in texts) if d]
    if not docs:
        return "misc"

### 5.2 - Vectorization

This section performs the actual vectorization.

Line-by-line breakdown:
- A `TfidfVectorizer` `vec` is initialized with the caller's `min_df` and `max_df`.
- The function tries to fit the vectorizer on the normalized texts and store the resulting TF-IDF matrix as `X`. If that raises a `ValueError`, the function retries with the loosest pruning settings (`min_df=1`, `max_df=1.0`).
- If the retry also fails, a failsafe is triggered, beginning with the initialization of a list of strings `tokens`.
- Each document is tokenized using the helper function `tokenize`, and its tokens are added to `tokens`.
- If there are no valid tokens, then the `misc` fallback is returned.
- Otherwise, the number of occurrences of each token is stored as `counts`.
- The `max_words` most common tokens are stored as `top_tokens`.
- These tokens are joined with spaces, title-cased, and returned.
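
To see why the first pass can fail, it helps to model document-frequency pruning by hand. The sketch below is a simplified pure-Python approximation, not scikit-learn's implementation: `surviving_terms` is a hypothetical helper that keeps a term only if its document count is at least `min_df` and at most `max_df * n_docs`.

```python
from collections import Counter
from typing import List, Set


def surviving_terms(docs: List[List[str]], min_df: int, max_df: float) -> Set[str]:
    """Keep terms whose document count lies in [min_df, max_df * n_docs]."""
    n_docs = len(docs)
    # Document frequency: number of documents containing the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    max_count = max_df * n_docs
    return {term for term, count in df.items() if min_df <= count <= max_count}


docs = [
    ["red", "wool", "scarf"],
    ["red", "silk", "scarf"],
    ["red", "wool", "hat"],
]
kept = surviving_terms(docs, min_df=1, max_df=0.9)
print(sorted(kept))
# ['hat', 'scarf', 'silk', 'wool']  ("red" is in 3 of 3 documents, above 0.9 * 3)

single = surviving_terms([["red", "wool", "scarf"]], min_df=1, max_df=0.9)
print(single)
# set()  (with one document, every term has count 1, above 0.9 * 1)
```

The second call suggests why the retry matters: with the default `max_df=0.9`, a cluster containing a single document leaves no terms after pruning, which is the situation the `ValueError` handler is written for.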

In [None]:
    # First pass: user-specified min_df / max_df
    vec = TfidfVectorizer(
        lowercase=True,
        token_pattern=r"\b\w+\b",
        min_df=min_df,
        max_df=max_df,
    )

    try:
        X = vec.fit_transform(docs)
    except ValueError:
        # Typical case: "After pruning, no terms remain".
        # Relax pruning and retry.
        vec = TfidfVectorizer(
            lowercase=True,
            token_pattern=r"\b\w+\b",
            min_df=1,
            max_df=1.0,
        )
        try:
            X = vec.fit_transform(docs)
        except Exception:
            # Final fallback: simple frequency over tokens
            tokens: List[str] = []
            for d in docs:
                tokens.extend(tokenize(d))
            if not tokens:
                return "misc"
            counts = Counter(tokens)
            top_tokens = [w for w, _ in counts.most_common(max_words)]
            return " ".join(top_tokens).title()
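
The final fallback can be tried in isolation. The sketch below wraps the same `Counter` logic in a hypothetical helper, `frequency_label`, and feeds it a made-up token list.

```python
from collections import Counter
from typing import List


def frequency_label(tokens: List[str], max_words: int = 4) -> str:
    """Label built from the most frequent tokens; "misc" if there are none."""
    if not tokens:
        return "misc"
    counts = Counter(tokens)
    top_tokens = [word for word, _ in counts.most_common(max_words)]
    return " ".join(top_tokens).title()


tokens = ["wool", "scarf", "red", "wool", "hat", "wool", "scarf"]
print(frequency_label(tokens, max_words=2))  # Wool Scarf
print(frequency_label([]))                   # misc
```

Unlike TF-IDF, raw counts do not discount terms that appear in every document, so this path is a last resort rather than an equivalent alternative.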

### 5.3 - Additional Misc Fallbacks

This section contains a couple of fallbacks and initializes `terms` and `scores`.

Line-by-line breakdown:
- If `X` has no columns (no term survived vectorization), then return `misc`.
- Build `terms`, a NumPy array of the vocabulary words.
- Compute `scores`, a NumPy array holding each term's TF-IDF score averaged over all documents.
- If `scores` is empty, then return `misc`.

In [None]:
    if X.shape[1] == 0:
        # No features survived
        return "misc"

    terms = np.array(vec.get_feature_names_out())
    scores = np.asarray(X.mean(axis=0)).ravel()
    if scores.size == 0:
        return "misc"
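
For reference, the quantity stored in `scores` can be written out explicitly. The formulas below assume scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, L2 row normalization, raw term counts), which the constructor calls above do not override. Here $N$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ the count of $t$ in document $d$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{d,t} = \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^{2}}},
\qquad
\mathrm{score}(t) = \frac{1}{N} \sum_{d=1}^{N} w_{d,t}
```

Each entry `X[d, t]` corresponds to $w_{d,t}$, and `X.mean(axis=0)` averages each column over the documents. Under these defaults every surviving term has a strictly positive score, so the later check for non-positive scores acts as a defensive guard.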

### 5.4 - Ranking

This section ranks terms by average TF-IDF score and contains a few more fallbacks.

Line-by-line breakdown:
- Sort the term indices by score in descending order.
- Initialize `label_tokens`, a list of strings.
- Iterate through the sorted indices. If the score at an index is less than or equal to 0, continue to the next index. Otherwise, append the term at that index to `label_tokens`.
- Once `label_tokens` contains `max_words` terms, stop iterating.
- If `label_tokens` is empty, then return `misc`.
- Join the tokens with spaces and return the title-cased label.

In [None]:
    order = scores.argsort()[::-1]

    label_tokens: List[str] = []
    for idx in order:
        if scores[idx] <= 0:
            continue
        label_tokens.append(terms[idx])
        if len(label_tokens) >= max_words:
            break

    if not label_tokens:
        return "misc"

    # Title-case to look nicer as a cluster name
    return " ".join(label_tokens).title()
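
The ranking step can be reproduced without NumPy to check the logic on small inputs. This is a sketch under assumptions: `top_terms` is a hypothetical helper, the scores are made up, and `sorted` stands in for `argsort()[::-1]`, so the order of tied scores may differ from the NumPy version.

```python
from typing import List, Sequence


def top_terms(terms: Sequence[str], scores: Sequence[float], max_words: int) -> List[str]:
    """Pick up to `max_words` terms with the highest positive scores."""
    # Indices sorted by score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    picked: List[str] = []
    for i in order:
        if scores[i] <= 0:
            continue  # skip terms that carry no weight
        picked.append(terms[i])
        if len(picked) >= max_words:
            break  # the label is full
    return picked


terms = ["hat", "red", "scarf", "wool"]
scores = [0.10, 0.0, 0.35, 0.42]

print(" ".join(top_terms(terms, scores, max_words=2)).title())  # Wool Scarf
print(top_terms(terms, scores, max_words=4))  # ['wool', 'scarf', 'hat']
```

The second call shows that `max_words` is only an upper bound: `red` has a score of 0 and is skipped, so the label ends up with three words instead of four.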