# Deep Feature-Based Text Clustering and Its Explanation

This notebook is a reproduction of the paper *"Deep Feature-Based Text Clustering and its Explanation"* by Guan et al. (IEEE TKDE, 2022).

The paper addresses the limitations of traditional text clustering approaches, which are usually based on the bag-of-words representation and suffer from high dimensionality, sparsity, and lack of contextual/sequence information.

The authors propose a novel framework called **Deep Feature-Based Text Clustering (DFTC)** that leverages pretrained deep text encoders (ELMo and InferSent) to generate contextualized sentence/document embeddings. These embeddings are then normalized and clustered using classical algorithms such as K-means.

Additionally, the paper introduces the **Text Clustering Results Explanation (TCRE)** module, which applies a logistic regression model on bag-of-words features with pseudo-labels derived from clustering. This allows the extraction of *indication words* that explain the semantics of each cluster, providing interpretability and qualitative evaluation of the results.

Experiments on multiple benchmark datasets (AG News, DBpedia, Yahoo! Answers, Reuters) demonstrate that the proposed framework outperforms traditional clustering methods (tf-idf+KMeans, LDA, GSDMM), deep clustering models (DEC, IDEC, STC), and even BERT in most cases. The combination of **deep semantic features + interpretability** makes DFTC an effective and transparent solution for unsupervised text clustering.


In [4]:
# --- Core Python ---
import os
import io
import re
import zipfile
import requests

# --- Scientific & ML libraries ---
import numpy as np
import torch
import tensorflow as tf
import tensorflow_hub as hub

# --- NLP ---
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer tables required by newer NLTK releases

# --- Machine Learning & Clustering ---
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.optimize import linear_sum_assignment

# --- Visualization ---
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# --- Datasets ---
from datasets import load_dataset





In [5]:
# Load pretrained ELMo model

elmo = hub.load("https://tfhub.dev/google/elmo/3")
print(elmo.signatures['default'].structured_outputs)








In [None]:
# Clone git repository
!git clone https://github.com/facebookresearch/InferSent.git

# Open the InferSent directory
%cd InferSent

# Install dependencies
!pip install torch torchvision nltk

# Download the pre-trained InferSent model
!wget https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

# Download word embeddings (GloVe)
!wget http://nlp.stanford.edu/data/glove.840B.300d.zip
!unzip -q glove.840B.300d.zip

In [None]:
from models import InferSent

# Model parameters
MODEL_PATH = 'infersent2.pkl'
params_model = {
    'bsize': 64,
    'word_emb_dim': 300,
    'enc_lstm_dim': 2048,
    'pool_type': 'max',
    'dpout_model': 0.0,
    'version': 2
}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

# GloVe embeddings path
W2V_PATH = 'glove.840B.300d.txt'
infersent.set_w2v_path(W2V_PATH)

## Step 1: Feature Construction

In this step, we transform the input documents into **deep feature representations** using two pretrained models: **ELMo** and **InferSent**.

- **ELMo (Language Model based on BiLSTM)**
  Provides contextualized word embeddings. To obtain a fixed-size vector for a document, we apply pooling operations over token-level embeddings (e.g., mean-pooling, max-pooling).

- **InferSent (Supervised NLI sentence encoder)**
  Produces high-quality sentence embeddings using a BiLSTM + max-pooling architecture. For documents with multiple sentences, we compute the average of sentence embeddings.

The result of this step is a matrix **X** of shape `(n_docs, d)`, where `d = 1024` for ELMo or `d = 4096` for InferSent.
These vectors will later be normalized and clustered (e.g., with K-means).
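
The three pooling strategies can be illustrated on a toy batch without loading ELMo. This is a minimal sketch: `pool_tokens` and the toy arrays are hypothetical names introduced here, and the hand-written embeddings stand in for real token-level encoder outputs.

```python
import numpy as np

def pool_tokens(token_embs, lengths, pooling="mean"):
    """Pool padded token embeddings (batch, max_len, dim) into (batch, dim)."""
    pooled = []
    for embs, n in zip(token_embs, lengths):
        valid = embs[:n]  # keep only the real tokens, drop padding rows
        if pooling == "mean":
            pooled.append(valid.mean(axis=0))
        elif pooling == "max":
            pooled.append(valid.max(axis=0))
        elif pooling == "last":
            pooled.append(valid[-1])
        else:
            raise ValueError(f"Unknown pooling: {pooling}")
    return np.stack(pooled)

# Toy batch: 2 documents, up to 3 tokens each, 2-dimensional embeddings.
toy = np.array([
    [[1.0, 4.0], [3.0, 2.0], [0.0, 0.0]],  # 2 real tokens + 1 padding row
    [[2.0, 2.0], [4.0, 0.0], [6.0, 1.0]],  # 3 real tokens
])
toy_lengths = [2, 3]

mean_vecs = pool_tokens(toy, toy_lengths, "mean")  # [[2, 3], [4, 1]]
max_vecs = pool_tokens(toy, toy_lengths, "max")    # [[3, 4], [6, 2]]
last_vecs = pool_tokens(toy, toy_lengths, "last")  # [[3, 2], [6, 1]]
```

Slicing to the true sequence length before pooling matters: including the padding row would pull the mean of the first document toward zero.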

---

### DFTC framework overview

Below is the overall architecture of the proposed framework from the paper:

![DFTC Framework](DFTC_framework.png)

*Figure: Deep Feature-Based Text Clustering (DFTC) framework. First, pretrained encoders generate document embeddings. Then, features are normalized and clustered. Finally, the TCRE module explains the clusters by identifying indication words.*


In [None]:
def extract_features(texts, model='elmo', pooling='mean', batch_size=64):
    """
    Extract deep feature representations for a list of texts using ELMo or InferSent.

    Args:
        texts (list of str): List of input documents.
        model (str): 'elmo' or 'infersent' to choose the embedding model.
        pooling (str): Pooling strategy for ELMo ('mean', 'max', 'last').
        batch_size (int): Batch size for processing texts with ELMo.

    Returns:
        np.ndarray: Matrix of shape (n_docs, d) with document embeddings.
    """
    if model == 'elmo':
        doc_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i : i + batch_size]
            batch_tokens = [word_tokenize(text) for text in batch_texts]

            # Pad sequences to the maximum length in the batch
            max_len = max(len(tokens) for tokens in batch_tokens)
            padded_tokens = tf.constant([
                tokens + [''] * (max_len - len(tokens)) for tokens in batch_tokens
            ], dtype=tf.string)
            sequence_lengths = tf.constant([len(tokens) for tokens in batch_tokens], dtype=tf.int32)

            # Get ELMo outputs using tokens signature
            outputs = elmo.signatures["tokens"](tokens=padded_tokens, sequence_len=sequence_lengths)
            embeddings = outputs["lstm_outputs2"]  # last LSTM layer

            # Apply pooling along token axis (axis=1)
            batch_embeddings = []
            for j in range(len(batch_texts)):
                seq_len = sequence_lengths[j].numpy()
                seq_embeddings = embeddings[j, :seq_len, :] # Select non-padded embeddings

                if pooling == 'mean':
                    doc_embedding = tf.reduce_mean(seq_embeddings, axis=0).numpy()
                elif pooling == 'max':
                    doc_embedding = tf.reduce_max(seq_embeddings, axis=0).numpy()
                elif pooling == 'last':
                    doc_embedding = seq_embeddings[-1, :].numpy()
                else:
                    raise ValueError("Invalid pooling method.")
                batch_embeddings.append(doc_embedding)
            doc_embeddings.extend(batch_embeddings)

        return np.array(doc_embeddings)

    elif model == 'infersent':
        # Use the global `infersent` encoder loaded earlier; `model` is only a string here
        infersent.build_vocab(texts, tokenize=True)
        doc_embeddings = infersent.encode(texts, tokenize=True)

    else:
        raise ValueError("Model must be 'elmo' or 'infersent'.")

    return doc_embeddings

In [None]:
def normalize_features(X, method='l2', eps=1e-10):
    """
    Normalize feature matrix X using specified method.

    Args:
        X (np.ndarray): Input feature matrix of shape (n_samples, n_features).
        method (str): Normalization method. One of:
                      'identity'  -> no normalization
                      'l2'        -> L2 normalization (unit length)
                      'layernorm' -> normalize each vector by mean and std
        eps (float): Small constant to avoid division by zero.

    Returns:
        np.ndarray: Normalized feature matrix.
    """
    if method == 'identity':
        return X
    elif method == 'l2':
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / (norms + eps)
    elif method == 'layernorm':
        means = np.mean(X, axis=1, keepdims=True)
        stds = np.std(X, axis=1, keepdims=True)
        return (X - means) / (stds + eps)
    else:
        raise ValueError("Invalid normalization method.")

## Step 2: Clustering
In this step, we apply clustering algorithms on the normalized deep feature representations obtained in Step 1.
We will use **K-means** as the primary clustering algorithm, but other methods like **Agglomerative Clustering** or **DBSCAN** can also be employed.
The output of this step is a set of cluster assignments for each document, which will be used in the next step for explanation.
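
One reason normalization interacts with K-means is that, for unit-length vectors, squared Euclidean distance is a monotone function of cosine similarity: `||a - b||^2 = 2 - 2 * cos(a, b)`. The sketch below checks this identity numerically on random vectors; the array names are illustrative and the data is synthetic, not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))                            # 5 synthetic "documents"
A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)  # L2-normalize each row

# Pairwise cosine similarity (dot product of unit vectors)
cosine = A_unit @ A_unit.T

# Pairwise squared Euclidean distance
diff = A_unit[:, None, :] - A_unit[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# For unit vectors the two quantities are tied together exactly
identity_holds = np.allclose(sq_dist, 2.0 - 2.0 * cosine)
```

After L2 normalization, two documents are therefore close in Euclidean terms exactly when their embeddings point in similar directions, regardless of their original magnitudes.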


In [None]:
def cluster_features(X, n_clusters=10, random_state=42):
    """
    Cluster feature matrix X using K-means.

    Args:
        X (np.ndarray): Input feature matrix of shape (n_samples, n_features).
        n_clusters (int): Number of clusters.
        random_state (int): Random seed for reproducibility.

    Returns:
        np.ndarray: Cluster labels for each sample.
        KMeans: Fitted KMeans model.
    """
    kmeans = KMeans(
        n_clusters=n_clusters,
        random_state=random_state,
        n_init=10,
        max_iter=300
    )
    predicted_labels = kmeans.fit_predict(X)
    return predicted_labels, kmeans

## Step 3: Evaluation Metrics

To evaluate the clustering performance, we rely on three standard metrics used in the paper:

- **Clustering Accuracy (ACC)**
  Measures the best alignment between predicted clusters and ground-truth labels.
  Since cluster IDs are arbitrary, the Hungarian algorithm is used to find the optimal mapping.

- **Normalized Mutual Information (NMI)**
  Measures the mutual dependence between predicted clusters and true labels.
  Values range from 0 (no mutual information) to 1 (perfect correlation).

- **Adjusted Rand Index (ARI)**
  Measures the similarity between two assignments, adjusted for chance.
  Values are close to 0 for random assignments and equal 1 for perfect agreement; negative values indicate agreement worse than chance.
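
To make the cluster-to-label alignment behind ACC concrete, the following sketch tries every possible mapping on a tiny example. It is a brute-force stand-in for the Hungarian algorithm, only practical for a handful of classes; `brute_force_acc` and the toy labels are hypothetical.

```python
from itertools import permutations

def brute_force_acc(y_true, y_pred, n_classes):
    """Best accuracy over all one-to-one mappings cluster id -> class id."""
    best_hits = 0
    for perm in permutations(range(n_classes)):
        # perm[p] is the class assigned to cluster p under this mapping
        hits = sum(1 for t, p in zip(y_true, y_pred) if perm[p] == t)
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)

y_true_toy = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [2, 2, 1, 0, 0, 1, 1, 1]  # same grouping, different ids, one error

acc_toy = brute_force_acc(y_true_toy, y_pred_toy, n_classes=3)
print(acc_toy)  # 0.875
```

The best mapping here is cluster 2 to class 0, cluster 0 to class 1, and cluster 1 to class 2, which leaves a single misassigned document out of eight.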

These metrics provide complementary views:
- **ACC** focuses on label alignment,
- **NMI** evaluates information overlap,
- **ARI** accounts for random chance.
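
For intuition about what NMI measures, it can be computed by hand from label counts. The sketch below normalizes mutual information by the arithmetic mean of the two entropies, which is assumed here to match the default of scikit-learn's `normalized_mutual_info_score`; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two label assignments."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

def nmi_arithmetic(a, b):
    """MI divided by the arithmetic mean of the two entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both assignments are a single cluster
    return mutual_information(a, b) / ((h_a + h_b) / 2)

# Same partition with swapped ids: NMI ignores the naming of clusters
same_partition = nmi_arithmetic([0, 0, 1, 1], [1, 1, 0, 0])
# Statistically independent partitions share no information
independent = nmi_arithmetic([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike ACC, no explicit cluster-to-label mapping is needed, because the score depends only on how the two partitions overlap.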


In [None]:
def clustering_accuracy(y_true, y_pred):
    """
    Calculate clustering accuracy using the Hungarian algorithm.

    Args:
        y_true (np.ndarray): Ground truth labels.
        y_pred (np.ndarray): Predicted cluster labels.

    Returns:
        float: Clustering accuracy.
    """
    y_true = np.asarray(y_true).astype(np.int64) # Ensure integer type
    y_pred = np.asarray(y_pred).astype(np.int64)
    assert y_pred.size == y_true.size # Ensure same size

    D = max(y_pred.max(), y_true.max()) + 1 # Number of clusters
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1 # Build contingency matrix

    # Hungarian algorithm to find optimal assignment
    row_ind, col_ind = linear_sum_assignment(w.max() - w) # Maximize accuracy, linear_sum_assignment minimizes cost
    mapping = dict(zip(row_ind, col_ind)) # Create mapping

    # Map predicted labels to true labels
    y_pred_mapped = np.array([mapping[label] for label in y_pred])
    accuracy = np.mean(y_pred_mapped == y_true) # Calculate accuracy
    return accuracy

In [None]:
def clustering_nmi(y_true, y_pred):
    """
    Calculate Normalized Mutual Information (NMI) between true and predicted labels.

    Args:
        y_true (np.ndarray): Ground truth labels.
        y_pred (np.ndarray): Predicted cluster labels.

    Returns:
        float: NMI score.
    """
    return normalized_mutual_info_score(y_true, y_pred)

In [None]:
def clustering_ari(y_true, y_pred):
    """
    Calculate Adjusted Rand Index (ARI) between true and predicted labels.

    Args:
        y_true (np.ndarray): Ground truth labels.
        y_pred (np.ndarray): Predicted cluster labels.

    Returns:
        float: ARI score.
    """
    return adjusted_rand_score(y_true, y_pred)

## Step 4: Feature Visualization

To qualitatively assess the quality of the extracted features, we can project the high-dimensional embeddings into a 2D space and visualize them.

We use **t-SNE (t-distributed Stochastic Neighbor Embedding)**, which is a nonlinear dimensionality reduction technique that preserves local similarities between points.

If the extracted deep features are meaningful, samples from the same class should form compact clusters in the 2D visualization, while different classes should be well separated.

This visualization helps to confirm whether the embeddings from ELMo or InferSent provide more discriminative representations compared to traditional methods like tf-idf.


In [None]:
def visualize_features(X, labels, title='Feature Visualization with t-SNE', n_samples=1000, random_state=42):
    """
    Visualize high-dimensional features using t-SNE.

    Args:
        X (np.ndarray): Input feature matrix of shape (n_samples, n_features).
        labels (np.ndarray): Ground truth labels for coloring.
        title (str): Title of the plot.
        n_samples (int): Number of samples to visualize (for large datasets).
        random_state (int): Random seed for reproducibility.

    """
    # Randomly sample a subset for visualization (for large datasets)
    if X.shape[0] > n_samples:
        np.random.seed(random_state)
        indices = np.random.choice(X.shape[0], n_samples, replace=False)
        X_sampled = X[indices]
        labels_sampled = np.array(labels)[indices]
    else:
        X_sampled = X
        labels_sampled = labels

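    # Hyperparameters from paper. The iteration count is left at its default of 1000,
    # because the n_iter argument was renamed to max_iter in newer scikit-learn releases.
    tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
                random_state=random_state, init="pca")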
    X_2d = tsne.fit_transform(X_sampled)

    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels_sampled, cmap='tab10', alpha=0.7)
    plt.colorbar(scatter, ticks=np.unique(labels_sampled))
    plt.title(title)
    plt.xlabel('t-SNE Dimension 1')
    plt.ylabel('t-SNE Dimension 2')
    plt.grid(True)
    plt.show()

## Step 5: Dataset Loading

In this section, we will prepare the benchmark datasets used in the paper:

- **AG News** (4 classes, news articles by topic)
- **DBpedia** (14 classes, ontology categories)
- **Yahoo! Answers** (10 classes, question-answer categories)
- **Reuters (R2 / R5 subsets)** (2 or 5 classes from the Reuters-21578 corpus)

Since these datasets are relatively large, we will use either the full datasets or balanced subsets (as done in the paper, e.g., 1000 samples per class for AG News, DBpedia, and Yahoo).

The datasets will be loaded, preprocessed (lowercasing, HTML tag removal, whitespace normalization), and split into:
- **texts**: the raw documents
- **labels**: the ground-truth category of each document

These will be the inputs to our feature extraction pipeline.
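
The idea of a balanced subset can be sketched independently of any dataset library. The helper below draws the same number of documents per class at random; `balanced_sample` is a hypothetical function, and random sampling is an assumption of this sketch (the loaders in this notebook take the first `n_per_class` matches instead).

```python
import random
from collections import defaultdict

def balanced_sample(texts, labels, n_per_class, seed=42):
    """Return n_per_class randomly chosen documents for every class."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    out_texts, out_labels = [], []
    for label in sorted(by_label):
        docs = by_label[label]
        if len(docs) < n_per_class:
            raise ValueError(f"Class {label} has only {len(docs)} documents.")
        out_texts.extend(rng.sample(docs, n_per_class))
        out_labels.extend([label] * n_per_class)
    return out_texts, out_labels

# Imbalanced toy corpus: 7 documents of class 0, 3 of class 1
toy_texts = [f"doc {i}" for i in range(10)]
toy_labels = [0] * 7 + [1] * 3

sub_texts, sub_labels = balanced_sample(toy_texts, toy_labels, n_per_class=3)
```

Balancing matters for the evaluation: with equal class sizes, a trivial clustering that puts everything in one cluster cannot score a high ACC.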


In [None]:
def clean_text(text):
    """
    Basic text cleaning: lowercasing, removing HTML tags, collapsing whitespace.

    Args:
        text (str): Input text.

    Returns:
        str: Cleaned text.
    """
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)  # remove html
    text = re.sub(r"\s+", " ", text).strip() # remove extra spaces
    return text

In [None]:
def load_subset(dataset, n_per_class=1000, classes=10, label_key='label'):
    """
    Load a balanced subset of a dataset with specified number of samples per class.

    Args:
        dataset (Dataset): Huggingface dataset object.
        n_per_class (int): Number of samples per class to load.
        classes (int): Number of class labels to include.
        label_key (str): Key in the dataset for class labels.

    Returns:
        texts (list of str): List of documents.
        labels (list of int): Corresponding class labels.
    """
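    texts, labels = [], []
    for label in range(classes):
        samples = dataset.filter(lambda x: x[label_key] == label).select(range(n_per_class))
        if label_key == 'topic':  # Yahoo! Answers uses 'topic' as label key and has no 'text' column
            texts.extend([clean_text(item['question_title'] + " " + item['best_answer']) for item in samples])
        elif 'content' in samples.column_names:  # assumed DBpedia layout: body text under 'content'
            texts.extend([clean_text(item['content']) for item in samples])
        else:  # datasets such as AG News expose the document under 'text'
            texts.extend([clean_text(item['text']) for item in samples])
        labels.extend(samples[label_key])
    return texts, labels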

In [None]:
def load_dbpedia(n_per_class=1000):
    """
    Load and preprocess DBpedia dataset.

    Args:
        n_per_class (int): Number of samples per class to load.

    Returns:
        texts (list of str): List of documents.
        labels (list of int): Corresponding class labels.
    """
    dataset = load_dataset('dbpedia_14', split='train')
    return load_subset(dataset, n_per_class, classes=14, label_key='label')

In [None]:
def load_ag_news(n_per_class=1000):
    """
    Load and preprocess AG News dataset.

    Args:
        n_per_class (int): Number of samples per class to load.

    Returns:
        texts (list of str): List of documents.
        labels (list of int): Corresponding class labels.
    """
    dataset = load_dataset('ag_news', split='train')
    return load_subset(dataset, n_per_class, classes=4, label_key='label')

In [None]:
def load_yahoo(n_per_class=1000):
    """
    Load and preprocess Yahoo! Answers dataset.

    Args:
        n_per_class (int): Number of samples per class to load.

    Returns:
        texts (list of str): List of documents.
        labels (list of int): Corresponding class labels.
    """
    dataset = load_dataset('yahoo_answers_topics', split='train')
    return load_subset(dataset, n_per_class, classes=10, label_key='topic')

In [None]:
def load_reuters(subset='R2'):
    """
    Load and preprocess Reuters dataset (R2 or R5 subset).

    Args:
        subset (str): 'R2' for 2 classes, 'R5' for 5 classes.

    Returns:
        texts (list of str): List of documents.
        labels (list of int): Corresponding class labels.
    """
    dataset = load_dataset('reuters21578', 'ModApte', split='train')
    if subset == 'R2':
        # Binary classification: 'earn' vs 'acq'
        texts, labels = [], []
        for item in dataset:
            if 'earn' in item['topics']:
                texts.append(clean_text(item['text']))
                labels.append(0)
            elif 'acq' in item['topics']:
                texts.append(clean_text(item['text']))
                labels.append(1)
        return texts, labels
    elif subset == 'R5':
        # 5 classes: 'earn', 'acq', 'crude', 'trade', 'money-fx'
        class_map = {'earn': 0, 'acq': 1, 'crude': 2, 'trade': 3, 'money-fx': 4}
        texts, labels = [], []
        for item in dataset:
            for topic in item['topics']:
                if topic in class_map:
                    texts.append(clean_text(item['text']))
                    labels.append(class_map[topic])
                    break
        return texts, labels
    else:
        raise ValueError("Subset must be 'R2' or 'R5'.")

## Step 6: Experiments

In this section, we replicate the experimental setup from the paper *"Deep Feature-Based Text Clustering and its Explanation"*.

The goal is to evaluate how different design choices affect clustering performance, namely:
- **Embedding model**: ELMo (LM) or InferSent.
- **Pooling strategy**: Mean, Max, or Last pooling over token embeddings.
- **Normalization**: Identity (I), L2 norm (N), or LayerNorm (LN).
- **Clustering algorithm**: K-means (KM).
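
As a quick reference for the three normalization options, the sketch below applies each of them to a single toy vector. It assumes LayerNorm without learned gain or bias, mirroring `normalize_features` above; the variable names are illustrative.

```python
import numpy as np

x = np.array([[3.0, 4.0, 0.0, 5.0]])  # one toy "document embedding"

# Identity (I): features are used as-is
identity = x

# L2 norm (N): rescale each row to unit length
l2 = x / np.linalg.norm(x, axis=1, keepdims=True)

# LayerNorm (LN): zero mean and unit standard deviation within each row
layernorm = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

print(np.linalg.norm(l2))                  # length of the L2-normalized row
print(layernorm.mean(), layernorm.std())   # per-row statistics after LayerNorm
```

L2 normalization keeps the direction of the vector and discards its magnitude, while LayerNorm additionally re-centers the features of each document.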

We report results on three benchmark datasets used in the paper:
- **AG News** (4 classes, news articles)
- **DBpedia** (14 classes, ontology categories)
- **Yahoo Answers** (10 classes, QA topics)

The evaluation is done using the metrics described earlier: **Clustering Accuracy (ACC)**, **Normalized Mutual Information (NMI)**, and **Adjusted Rand Index (ARI)**.

## 6.1 Experiments with ELMo (LM)

We first focus on the ELMo model (LM-based embeddings).
For each dataset, we apply **mean pooling** over the last layer token representations, followed by different normalization strategies, and then cluster the resulting features using K-means (KM).

### Experiment 1: LM + Mean + LN + KM on AG News
Here we use **LayerNorm** as the normalization method on top of mean-pooled ELMo embeddings for AG News.

### Experiment 2: LM + Mean + I + KM on DBpedia
Here we use **Identity normalization** (no normalization) with mean-pooled embeddings for DBpedia.

### Experiment 3: LM + Mean + N + KM on AG News, DBpedia, and Yahoo Answers
Here we apply **L2 normalization** after mean pooling, and compare performance across three datasets: AG News, DBpedia, and Yahoo Answers.


In [None]:
# 1. Load dataset
texts, labels = load_ag_news(n_per_class=1000)
print("Loaded AG News subset:", len(texts))

# 2. Extract features with ELMo + max pooling
X = extract_features(texts, model='elmo', pooling='max')
print("Feature matrix:", X.shape)

# 3. Apply Identity normalization
X_norm = normalize_features(X, method="identity")

# 4. Cluster with K-means (k=4)
y_pred, _ = cluster_features(X_norm, n_clusters=4)

# 5. Evaluate with ACC, NMI, ARI
acc = clustering_accuracy(labels, y_pred)
nmi = clustering_nmi(labels, y_pred)
ari = clustering_ari(labels, y_pred)

print(f"Results on AG News (ELMo + Max + Identity + KMeans):")
# metrics in %
print(f"ACC: {acc*100:.2f}%")
print(f"NMI: {nmi*100:.2f}%")
print(f"ARI: {ari*100:.2f}%")

In [None]:
# Experiment 1: LM + Mean + LN + KM on AG News
texts, labels = load_ag_news(n_per_class=1000)

X = extract_features(texts, model='elmo', pooling='mean')
X_norm = normalize_features(X, method="layernorm")
y_pred, _ = cluster_features(X_norm, n_clusters=4)

acc = clustering_accuracy(labels, y_pred)
nmi = clustering_nmi(labels, y_pred)
ari = clustering_ari(labels, y_pred)

print(f"Results on AG News (ELMo + Mean + LayerNorm + KMeans):")
print(f"ACC: {acc*100:.2f}%")
print(f"NMI: {nmi*100:.2f}%")
print(f"ARI: {ari*100:.2f}%")

# Plot feature visualization
visualize_features(X_norm, labels, title='AG News (ELMo + Mean + LayerNorm)')

In [None]:
# Experiment 2: LM + Mean + I + KM on DBpedia
texts, labels = load_dbpedia(n_per_class=1000)

X = extract_features(texts, model='elmo', pooling='mean')
X_norm = normalize_features(X, method="identity")
y_pred , _ = cluster_features(X_norm, n_clusters=14)

acc = clustering_accuracy(labels, y_pred)
nmi = clustering_nmi(labels, y_pred)
ari = clustering_ari(labels, y_pred)

print(f"Results on DBpedia (ELMo + Mean + Identity + KMeans):")
print(f"ACC: {acc*100:.2f}%")
print(f"NMI: {nmi*100:.2f}%")
print(f"ARI: {ari*100:.2f}%")

# Plot feature visualization
visualize_features(X_norm, labels, title='DBpedia (ELMo + Mean + Identity)')

In [None]:
# Experiment 3: LM + Mean + N + KM on AG News, DBpedia, and Yahoo Answers
for dataset_name, load_func, n_clusters in [
    ('AG News', load_ag_news, 4),
    ('DBpedia', load_dbpedia, 14),
    ('Yahoo Answers', load_yahoo, 10)
]:
    texts, labels = load_func(n_per_class=1000)
    X = extract_features(texts, model='elmo', pooling='mean')
    X_norm = normalize_features(X, method="l2")
    y_pred, _ = cluster_features(X_norm, n_clusters=n_clusters)

    acc = clustering_accuracy(labels, y_pred)
    nmi = clustering_nmi(labels, y_pred)
    ari = clustering_ari(labels, y_pred)

    print(f"Results on {dataset_name} (ELMo + Mean + L2 + KMeans):")
    print(f"ACC: {acc*100:.2f}%")
    print(f"NMI: {nmi*100:.2f}%")
    print(f"ARI: {ari*100:.2f}%")

    # Plot feature visualization
    visualize_features(X_norm, labels, title=f'{dataset_name} (ELMo + Mean + L2)')

## 6.2 Experiments with InferSent

Next, we evaluate the performance of **InferSent**, a sentence embedding model trained on Natural Language Inference (NLI) data.
Unlike ELMo, InferSent directly provides fixed-size sentence embeddings (4096 dimensions), which makes feature extraction straightforward.
We experiment with two normalization strategies before applying K-means clustering.

### Experiment 4: InferSent + LN + KM on DBpedia
Here we apply **LayerNorm** to InferSent embeddings of DBpedia samples before clustering with K-means.

### Experiment 5: InferSent + N + KM on AG News
Here we apply **L2 normalization** to InferSent embeddings of AG News samples before clustering with K-means.


In [None]:
# Experiment 4: InferSent + LN + KM on DBpedia
texts, labels = load_dbpedia(n_per_class=1000)

X = extract_features(texts, model='infersent')
X_norm = normalize_features(X, method="layernorm")
y_pred, _ = cluster_features(X_norm, n_clusters=14)

acc = clustering_accuracy(labels, y_pred)
nmi = clustering_nmi(labels, y_pred)
ari = clustering_ari(labels, y_pred)

print(f"Results on DBpedia (InferSent + LayerNorm + KMeans):")
print(f"ACC: {acc*100:.2f}%")
print(f"NMI: {nmi*100:.2f}%")
print(f"ARI: {ari*100:.2f}%")

# Plot feature visualization
visualize_features(X_norm, labels, title='DBpedia (InferSent + LayerNorm)')

In [None]:
# Experiment 5: InferSent + N + KM on AG News
texts, labels = load_ag_news(n_per_class=1000)

X = extract_features(texts, model='infersent')
X_norm = normalize_features(X, method="l2")
y_pred, _ = cluster_features(X_norm, n_clusters=4)  # AG News has 4 classes

acc = clustering_accuracy(labels, y_pred)
nmi = clustering_nmi(labels, y_pred)
ari = clustering_ari(labels, y_pred)

print(f"Results on AG News (InferSent + L2 + KMeans):")
print(f"ACC: {acc*100:.2f}%")
print(f"NMI: {nmi*100:.2f}%")
print(f"ARI: {ari*100:.2f}%")

# Plot feature visualization
visualize_features(X_norm, labels, title='AG News (InferSent + L2)')

## Step 7: Explainability – TCRE Model

To interpret the discovered clusters, the paper proposes the **Text Clustering Result Explanation (TCRE)** model.
The idea is to identify **indication words** that best characterize each cluster.

The procedure (Algorithm 1 in the paper) works as follows:

1. Convert each document into a **binary bag-of-words feature vector** (1 if the word appears, 0 otherwise).
2. Filter out stop words and low-frequency words to reduce noise.
3. Train a **logistic regression classifier** using the cluster assignments as pseudo-labels.
4. Inspect the absolute values of the learned weights to find the **most important words** for each cluster.

The output is, for each cluster, a ranked list of **indication words** that can be used to explain the cluster’s semantics.
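
The four steps can be traced end to end on a toy corpus. The sketch below is not the paper's implementation: it substitutes plain gradient descent for a library logistic regression, uses a two-word stop list and a document-frequency threshold of 2, and all names (`docs`, `fit_logreg`, `indication`) are hypothetical.

```python
import numpy as np

docs = [
    "football team won the match",
    "football team fans cheered loudly",
    "the football season ended",
    "market prices fell sharply",
    "market prices rallied today",
    "investors watched the market",
    "planet orbits a distant star",
    "astronomers found a planet near a star",
    "the planet has two moons",
]
clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # pseudo-labels, e.g. from K-means
stop_words = {"the", "a"}                          # toy stop list (assumption)

# Steps 1-2: binary bag-of-words, dropping stop words and rare words
tokenized = [set(d.split()) - stop_words for d in docs]
doc_freq = {}
for tokens in tokenized:
    for word in tokens:
        doc_freq[word] = doc_freq.get(word, 0) + 1
vocab = sorted(w for w, df in doc_freq.items() if df >= 2)
X = np.array([[1.0 if w in tokens else 0.0 for w in vocab] for tokens in tokenized])

def fit_logreg(X, y, lr=0.5, steps=500, l2=0.01):
    """Binary logistic regression trained with batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w

# Steps 3-4: one-vs-rest classifier per cluster, rank words by |weight|
top_n = 2
indication = {}
for c in np.unique(clusters):
    w = fit_logreg(X, (clusters == c).astype(float))
    top = np.argsort(np.abs(w))[::-1][:top_n]
    indication[int(c)] = [vocab[i] for i in top]

for c, words in indication.items():
    print(f"Cluster {c}: {', '.join(words)}")
```

Because the ranking uses absolute weights, a word can appear in a cluster's list either because it strongly signals membership or because it strongly signals non-membership. In this toy setup the word shared by every document of a cluster should rank first for that cluster.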


In [None]:
def tcre_explanation(corpus, cluster_labels, top_n=10, min_df=5, stop_words='english'):
    """
    Text Clustering Result Explanation (TCRE):
    Identify indication words for each cluster.

    Args:
        corpus (list of str): List of documents (raw text).
        cluster_labels (array-like): Cluster assignments for each document.
        top_n (int): Number of top indication words per cluster.
        min_df (int): Minimum document frequency for a word to be kept.
        stop_words (str or list): Stop words to remove.

    Returns:
        dict: {cluster_id: [indication words]}
    """
    # 1. Convert documents to binary bag-of-words features
    vectorizer = CountVectorizer(binary=True, stop_words=stop_words, min_df=min_df)
    X = vectorizer.fit_transform(corpus)
    vocab = np.array(vectorizer.get_feature_names_out())

    # 2. Train logistic regression with pseudo-labels
    clf = LogisticRegression(max_iter=1000, multi_class='ovr')
    clf.fit(X, cluster_labels)

    # 3. For each cluster, get top absolute-weight words
    ind_words = {}
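    weights = clf.coef_  # shape (n_clusters, n_features) for 3+ clusters; (1, n_features) for 2

    # Pair each weight row with its actual cluster id rather than its row index
    for cluster_id, w in zip(clf.classes_, weights):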
        abs_w = np.abs(w)
        top_idx = np.argsort(abs_w)[::-1][:top_n]
        ind_words[cluster_id] = vocab[top_idx].tolist()

    return ind_words

In [None]:
indication_words = tcre_explanation(texts, y_pred, top_n=10)

for cluster, words in indication_words.items():
    print(f"Cluster {cluster}: {', '.join(words)}")