# Deep Feature-Based Text Clustering and Its Explanation

This notebook is a reproduction of the paper *"Deep Feature-Based Text Clustering and its Explanation"* by Guan et al. (IEEE TKDE, 2022).

The paper addresses the limitations of traditional text clustering approaches, which are usually based on the bag-of-words representation and suffer from high dimensionality, sparsity, and lack of contextual/sequence information.

The authors propose a novel framework called **Deep Feature-Based Text Clustering (DFTC)** that leverages pretrained deep text encoders (ELMo and InferSent) to generate contextualized sentence/document embeddings. These embeddings are then normalized and clustered using classical algorithms such as K-means.

Additionally, the paper introduces the **Text Clustering Results Explanation (TCRE)** module, which applies a logistic regression model on bag-of-words features with pseudo-labels derived from clustering. This allows the extraction of *indication words* that explain the semantics of each cluster, providing interpretability and qualitative evaluation of the results.

Experiments on multiple benchmark datasets (AG News, DBpedia, Yahoo! Answers, Reuters) demonstrate that the proposed framework outperforms traditional clustering methods (tf-idf+KMeans, LDA, GSDMM), deep clustering models (DEC, IDEC, STC), and even BERT in most cases. The combination of **deep semantic features + interpretability** makes DFTC an effective and transparent solution for unsupervised text clustering.


In [4]:
import tensorflow as tf
import tensorflow_hub as hub
import torch
import os
import requests
import zipfile
import io
import numpy as np

import nltk
nltk.download('punkt')
nltk.download('punkt_tab') # tokenizer tables required by newer NLTK releases

from nltk.tokenize import word_tokenize

from sklearn.cluster import KMeans

from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
from scipy.optimize import linear_sum_assignment

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE




In [5]:
# Load the pre-trained ELMo model from TensorFlow Hub

elmo = hub.load("https://tfhub.dev/google/elmo/3")
print(elmo.signatures['default'].structured_outputs)








In [None]:
# Clone git repository
!git clone https://github.com/facebookresearch/InferSent.git

# Open the InferSent directory
%cd InferSent

# Install dependencies
!pip install torch torchvision nltk

# Download the pre-trained InferSent model
# (version 1 is the checkpoint trained with GloVe; infersent2.pkl expects fastText vectors)
!wget https://dl.fbaipublicfiles.com/infersent/infersent1.pkl

# Download word embeddings (GloVe)
!wget http://nlp.stanford.edu/data/glove.840B.300d.zip
!unzip -q glove.840B.300d.zip

In [None]:
from models import InferSent

# Model parameters (the version must match the downloaded checkpoint)
MODEL_PATH = 'infersent1.pkl'
params_model = {
    'bsize': 64,
    'word_emb_dim': 300,
    'enc_lstm_dim': 2048,
    'pool_type': 'max',
    'dpout_model': 0.0,
    'version': 1
}
infersent = InferSent(params_model)
infersent.load_state_dict(torch.load(MODEL_PATH))

# GloVe embeddings path
W2V_PATH = 'glove.840B.300d.txt'
infersent.set_w2v_path(W2V_PATH)

## Step 1: Feature Construction

In this step, we transform the input documents into **deep feature representations** using two pretrained models: **ELMo** and **InferSent**.

- **ELMo (Language Model based on BiLSTM)**
  Provides contextualized word embeddings. To obtain a fixed-size vector for a document, we apply pooling operations over token-level embeddings (e.g., mean-pooling, max-pooling).

- **InferSent (Supervised NLI sentence encoder)**
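  Produces high-quality sentence embeddings using a BiLSTM + max-pooling architecture. For documents with multiple sentences, the sentence embeddings can be averaged; the `extract_features` function in this notebook takes the simpler route of encoding each document as a single sequence.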

The result of this step is a matrix **X** of shape `(n_docs, d)`, where `d = 1024` for ELMo or `d = 4096` for InferSent.
These vectors will later be normalized and clustered (e.g., with K-means).
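
As a rough illustration of the pooling step, the sketch below collapses a token-level matrix into one fixed-size vector. The array `token_embeddings` and the helper `pool_tokens` are made-up stand-ins for real ELMo output, and the 4-dimensional vectors are chosen only for readability (real ELMo vectors have 1024 dimensions).

```python
import numpy as np

def pool_tokens(token_embeddings, pooling="mean"):
    """Collapse a (seq_len, d) matrix into a single (d,) vector."""
    if pooling == "mean":
        return token_embeddings.mean(axis=0)
    if pooling == "max":
        return token_embeddings.max(axis=0)
    raise ValueError(f"Unknown pooling: {pooling!r}")

# Toy stand-in for the ELMo output of a 3-token document (d = 4 here)
token_embeddings = np.array([
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 1.0, 0.0, -1.0],
    [2.0, 2.0, 1.0, -4.0],
])

doc_mean = pool_tokens(token_embeddings, "mean")  # [ 2.  1.  1. -2.]
doc_max = pool_tokens(token_embeddings, "max")    # [ 3.  2.  2. -1.]
print(doc_mean.shape, doc_max.shape)              # (4,) (4,)
```

Whatever the document length, the pooled vector has the same dimensionality, which is what allows documents to be stacked into the matrix **X**.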

---

### DFTC framework overview

Below is the overall architecture of the proposed framework from the paper:

![DFTC Framework](DFTC_framework.png)

*Figure: Deep Feature-Based Text Clustering (DFTC) framework. First, pretrained encoders generate document embeddings. Then, features are normalized and clustered. Finally, the TCRE module explains the clusters by identifying indication words.*


In [None]:
def extract_features(texts, model='elmo', pooling='mean'):
    """
    Extract deep feature representations for a list of texts using ELMo or InferSent.

    Args:
        texts (list of str): List of input documents.
        model (str): 'elmo' or 'infersent' to choose the embedding model.
        pooling (str): Pooling strategy for ELMo ('mean', 'max', 'last').

    Returns:
        np.ndarray: Matrix of shape (n_docs, d) with document embeddings.
    """
    if model == 'elmo':
        doc_embeddings = []
        for text in texts:
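            # Tokenize with NLTK, then re-join with spaces: the 'default'
            # signature expects whole sentences and splits them on whitespace
            tokenized_text = word_tokenize(text)

            # Wrap the document as a batch of one string, shape (1,)
            text_tensor = tf.constant([" ".join(tokenized_text)], dtype=tf.string)

            # 'elmo' has shape (batch, seq_len, 1024); drop the batch axis
            embeddings = elmo.signatures['default'](text_tensor)['elmo'][0]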

            # Apply pooling
            if pooling == 'mean':
                doc_embedding = tf.reduce_mean(embeddings, axis=0).numpy()
            elif pooling == 'max':
                doc_embedding = tf.reduce_max(embeddings, axis=0).numpy()
            elif pooling == 'last':
                doc_embedding = embeddings[-1, :].numpy()
            else:
                raise ValueError("Invalid pooling method.")

            doc_embeddings.append(doc_embedding)

        return np.array(doc_embeddings)

    elif model == 'infersent':
        # Build vocabulary and encode texts
        infersent.build_vocab(texts, tokenize=True)
        doc_embeddings = infersent.encode(texts, tokenize=True)

    else:
        raise ValueError("Model must be 'elmo' or 'infersent'.")

    return doc_embeddings

In [None]:
def normalize_features(X, method='l2', eps=1e-10):
    """
    Normalize feature matrix X using specified method.

    Args:
        X (np.ndarray): Input feature matrix of shape (n_samples, n_features).
        method (str): Normalization method. One of:
                      'identity'  -> no normalization
                      'l2'        -> L2 normalization (unit length)
                      'layernorm' -> normalize each vector by mean and std
        eps (float): Small constant to avoid division by zero.

    Returns:
        np.ndarray: Normalized feature matrix.
    """
    if method == 'identity':
        return X
    elif method == 'l2':
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / (norms + eps)
    elif method == 'layernorm':
        means = np.mean(X, axis=1, keepdims=True)
        stds = np.std(X, axis=1, keepdims=True)
        return (X - means) / (stds + eps)
    else:
        raise ValueError("Invalid normalization method.")

## Step 2: Clustering
In this step, we apply clustering algorithms on the normalized deep feature representations obtained in Step 1.
We will use **K-means** as the primary clustering algorithm, but other methods like **Agglomerative Clustering** or **DBSCAN** can also be employed.
The output of this step is a set of cluster assignments for each document, which will be used in the next step for explanation.
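
The following sketch hints at why normalization is applied before K-means. It uses eight synthetic 2-D vectors rather than real embeddings: they point in two directions but differ widely in length. After L2 normalization only the direction remains, and on unit vectors the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so K-means effectively groups by angle.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic vectors: two directions, each at a small and a large magnitude
X = np.array([
    [1.0, 0.1], [10.0, 1.0], [1.0, -0.1], [10.0, -1.0],   # roughly along x
    [0.1, 1.0], [1.0, 10.0], [-0.1, 1.0], [-1.0, 10.0],   # roughly along y
])

# L2-normalize each row so that only the direction is left
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

# Cluster the normalized vectors; cluster IDs themselves are arbitrary
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_l2)
print(labels)
```

The first four vectors end up in one cluster and the last four in the other, regardless of their original magnitudes.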


In [None]:
def cluster_features(X, n_clusters=10, random_state=42):
    """
    Cluster feature matrix X using K-means.

    Args:
        X (np.ndarray): Input feature matrix of shape (n_samples, n_features).
        n_clusters (int): Number of clusters.
        random_state (int): Random seed for reproducibility.

    Returns:
        np.ndarray: Cluster labels for each sample.
        KMeans: Fitted KMeans model.
    """
    kmeans = KMeans(
        n_clusters=n_clusters,
        random_state=random_state,
        n_init=10,
        max_iter=300
    )
    predicted_labels = kmeans.fit_predict(X)
    return predicted_labels, kmeans

## Step 3: Evaluation Metrics

To evaluate the clustering performance, we rely on three standard metrics used in the paper:

- **Clustering Accuracy (ACC)**
  Measures the best alignment between predicted clusters and ground-truth labels.
  Since cluster IDs are arbitrary, the Hungarian algorithm is used to find the optimal mapping.

- **Normalized Mutual Information (NMI)**
  Measures the mutual dependence between predicted clusters and true labels.
  Values range from 0 (no mutual information) to 1 (perfect correlation).

- **Adjusted Rand Index (ARI)**
  Measures the similarity between two assignments, adjusted for chance.
  Values are at most 1 (perfect agreement), are close to 0 for random assignments, and can be negative when agreement is worse than chance.

These metrics provide complementary views:
- **ACC** focuses on label alignment,
- **NMI** evaluates information overlap,
- **ARI** accounts for random chance.
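
A small worked example may make the three metrics concrete. The label arrays below are invented for illustration: `y_pred` reproduces the true grouping except for one document, but with the cluster IDs permuted, so a naive element-wise comparison would be misleading.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Same grouping as y_true except for one document, with cluster IDs permuted
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

# Contingency matrix: rows are predicted clusters, columns are true classes
k = max(y_true.max(), y_pred.max()) + 1
counts = np.zeros((k, k), dtype=np.int64)
np.add.at(counts, (y_pred, y_true), 1)

# Hungarian algorithm; maximize=True picks the mapping with the most matches
rows, cols = linear_sum_assignment(counts, maximize=True)
acc = counts[rows, cols].sum() / y_true.size

nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)

print(f"ACC = {acc:.3f}")  # 8 of 9 documents match after remapping -> 0.889
print(f"NMI = {nmi:.3f}")
print(f"ARI = {ari:.3f}")
```

Without the remapping, `np.mean(y_pred == y_true)` would report a far lower score here even though the grouping is almost perfect.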


In [None]:
def clustering_accuracy(y_true, y_pred):
    """
    Calculate clustering accuracy using the Hungarian algorithm.

    Args:
        y_true (np.ndarray): Ground truth labels.
        y_pred (np.ndarray): Predicted cluster labels.

    Returns:
        float: Clustering accuracy.
    """
    y_true = np.asarray(y_true).astype(np.int64) # Ensure integer type
    y_pred = np.asarray(y_pred).astype(np.int64)
    assert y_pred.size == y_true.size # Ensure same size

    D = max(y_pred.max(), y_true.max()) + 1 # Number of clusters
    w = np.zeros((D, D), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1 # Build contingency matrix

    # Hungarian algorithm to find optimal assignment
    row_ind, col_ind = linear_sum_assignment(w.max() - w) # Maximize accuracy, linear_sum_assignment minimizes cost
    mapping = dict(zip(row_ind, col_ind)) # Create mapping

    # Map predicted labels to true labels
    y_pred_mapped = np.array([mapping[label] for label in y_pred])
    accuracy = np.mean(y_pred_mapped == y_true) # Calculate accuracy
    return accuracy

In [None]:
def clustering_nmi(y_true, y_pred):
    """
    Calculate Normalized Mutual Information (NMI) between true and predicted labels.

    Args:
        y_true (np.ndarray): Ground truth labels.
        y_pred (np.ndarray): Predicted cluster labels.

    Returns:
        float: NMI score.
    """
    return normalized_mutual_info_score(y_true, y_pred)

In [None]:
def clustering_ari(y_true, y_pred):
    """
    Calculate Adjusted Rand Index (ARI) between true and predicted labels.

    Args:
        y_true (np.ndarray): Ground truth labels.
        y_pred (np.ndarray): Predicted cluster labels.

    Returns:
        float: ARI score.
    """
    return adjusted_rand_score(y_true, y_pred)

## Step 4: Feature Visualization

To qualitatively assess the quality of the extracted features, we can project the high-dimensional embeddings into a 2D space and visualize them.

We use **t-SNE (t-distributed Stochastic Neighbor Embedding)**, which is a nonlinear dimensionality reduction technique that preserves local similarities between points.

If the extracted deep features are meaningful, samples from the same class should form compact clusters in the 2D visualization, while different classes should be well separated.

This visualization helps to confirm whether the embeddings from ELMo or InferSent provide more discriminative representations compared to traditional methods like tf-idf.
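
As a minimal sketch of the projection itself, the snippet below runs t-SNE on synthetic Gaussian blobs instead of real embeddings. The data, the 50-dimensional space, and the three "classes" are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Three synthetic "classes": Gaussian blobs in 50 dimensions (not real embeddings)
centers = rng.normal(scale=5.0, size=(3, 50))
X = np.vstack([c + rng.normal(size=(40, 50)) for c in centers])
y = np.repeat(np.arange(3), 40)

# perplexity must be smaller than the number of samples (120 here)
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X)
print(X_2d.shape)  # (120, 2)
```

Each row of `X_2d` is a 2-D coordinate that can be passed to a scatter plot and colored by `y`.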


In [None]:
def visualize_features(X, labels, title='Feature Visualization with t-SNE', n_samples=1000, random_state=42):
    """
    Visualize high-dimensional features using t-SNE.

    Args:
        X (np.ndarray): Input feature matrix of shape (n_samples, n_features).
        labels (np.ndarray): Ground truth labels for coloring.
        title (str): Title of the plot.
        n_samples (int): Number of samples to visualize (for large datasets).
        random_state (int): Random seed for reproducibility.

    """
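    # Randomly sample a subset for visualization (for large datasets)
    labels = np.asarray(labels)
    if X.shape[0] > n_samples:
        # Local generator: reproducible without reseeding NumPy's global state
        rng = np.random.default_rng(random_state)
        indices = rng.choice(X.shape[0], n_samples, replace=False)
        X_sampled = X[indices]
        labels_sampled = labels[indices]
    else:
        X_sampled = X
        labels_sampled = labels

    # The iteration count is left at its default of 1000: the n_iter keyword was
    # renamed to max_iter in scikit-learn 1.5, so passing it is version-dependent
    tsne = TSNE(n_components=2, perplexity=30, learning_rate=200,
                random_state=random_state, init="pca") # Hyperparameters from paper
    X_2d = tsne.fit_transform(X_sampled)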

    plt.figure(figsize=(10, 8))
    scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels_sampled, cmap='tab10', alpha=0.7)
    plt.colorbar(scatter, ticks=np.unique(labels_sampled))
    plt.title(title)
    plt.xlabel('t-SNE Dimension 1')
    plt.ylabel('t-SNE Dimension 2')
    plt.grid(True)
    plt.show()