# Deep Feature-Based Text Clustering and Its Explanation

This notebook is a reproduction of the paper *"Deep Feature-Based Text Clustering and its Explanation"* by Guan et al. (IEEE TKDE, 2022).

The paper addresses the limitations of traditional text clustering approaches, which usually rely on a bag-of-words representation and therefore suffer from high dimensionality, sparsity, and a lack of contextual and word-order information.

The authors propose a novel framework called **Deep Feature-Based Text Clustering (DFTC)** that leverages pretrained deep text encoders (ELMo and InferSent) to generate contextualized sentence/document embeddings. These embeddings are then normalized and clustered using classical algorithms such as K-means.

Additionally, the paper introduces the **Text Clustering Results Explanation (TCRE)** module, which fits a logistic regression model on bag-of-words features, using the cluster assignments as pseudo-labels. The fitted model is then used to extract *indication words* that explain the semantics of each cluster, providing interpretability and a qualitative evaluation of the results.

Experiments on multiple benchmark datasets (AG News, DBpedia, Yahoo! Answers, Reuters) demonstrate that the proposed framework outperforms traditional clustering methods (tf-idf+KMeans, LDA, GSDMM), deep clustering models (DEC, IDEC, STC), and even BERT in most cases. The combination of **deep semantic features + interpretability** makes DFTC an effective and transparent solution for unsupervised text clustering.


In [4]:
import io
import os
import zipfile

import numpy as np
import requests
import tensorflow_hub as hub
import torch




In [5]:
# Load the pretrained ELMo model from TensorFlow Hub

elmo = hub.load("https://tfhub.dev/google/elmo/3")
# Inspect the output tensors exposed by the default signature
print(elmo.signatures['default'].structured_outputs)








In [None]:
# Clone git repository
!git clone https://github.com/facebookresearch/InferSent.git

# Change into the InferSent directory
%cd InferSent

# Install dependencies
!pip install torch torchvision nltk

# Download the pre-trained InferSent model
!wget https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

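# Download word embeddings (fastText).
# infersent2.pkl (version 2) was trained with fastText vectors;
# GloVe (glove.840B.300d) pairs with infersent1.pkl (version 1) instead.
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
!unzip -q crawl-300d-2M.vec.zip

In [None]:
import torch
from models import InferSent

# Model parameters
MODEL_PATH = 'infersent2.pkl'
params_model = {
    'bsize': 64,
    'word_emb_dim': 300,
    'enc_lstm_dim': 2048,
    'pool_type': 'max',
    'dpout_model': 0.0,
    'version': 2  # version 2 = encoder trained with fastText vectors
}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))

# fastText embeddings path (matches the version 2 encoder)
W2V_PATH = 'crawl-300d-2M.vec'
model.set_w2v_path(W2V_PATH)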

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data required by recent NLTK releases

# Build the vocabulary and encode sentences
sentences = [
    "Hello, I am using InferSent on Colab.",
    "This model turns sentences into vectors."
]
model.build_vocab(sentences, tokenize=True)
embeddings = model.encode(sentences, tokenize=True)

print(embeddings.shape)    # (2, 4096)
print(embeddings[0][:10])  # Show the first 10 dimensions of the first sentence embedding


## Step 1: Feature Construction

In this step, we transform the input documents into **deep feature representations** using two pretrained models: **ELMo** and **InferSent**.

- **ELMo (Language Model based on BiLSTM)**
  Provides contextualized word embeddings. To obtain a fixed-size vector for a document, we apply pooling operations over token-level embeddings (e.g., mean-pooling, max-pooling).

- **InferSent (Supervised NLI sentence encoder)**
  Produces high-quality sentence embeddings using a BiLSTM + max-pooling architecture. For documents with multiple sentences, we compute the average of sentence embeddings.

The result of this step is a matrix **X** of shape `(n_docs, d)`, where `d = 1024` for ELMo or `d = 4096` for InferSent.
These vectors will later be normalized and clustered (e.g., with K-means).
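The pooling and averaging operations described above can be sketched with NumPy alone. The arrays below are random stand-ins for real encoder outputs, and the helper names `pool_tokens` and `average_sentences` are hypothetical, introduced only for illustration.

```python
import numpy as np


def pool_tokens(token_embeddings, mode="mean"):
    """Collapse a (n_tokens, d) array into a single (d,) document vector."""
    token_embeddings = np.asarray(token_embeddings, dtype=np.float32)
    if mode == "mean":
        return token_embeddings.mean(axis=0)
    if mode == "max":
        return token_embeddings.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode!r}")


def average_sentences(sentence_embeddings):
    """Average a (n_sentences, d) array into a single (d,) document vector."""
    return np.asarray(sentence_embeddings, dtype=np.float32).mean(axis=0)


# Random stand-ins: three documents with different numbers of tokens / sentences.
rng = np.random.default_rng(0)
docs_tokens = [rng.normal(size=(n, 1024)) for n in (5, 12, 8)]   # ELMo-like, d = 1024
docs_sents = [rng.normal(size=(n, 4096)) for n in (1, 3, 2)]     # InferSent-like, d = 4096

# Stack per-document vectors into the feature matrix X of shape (n_docs, d).
X_elmo = np.stack([pool_tokens(t, mode="mean") for t in docs_tokens])
X_infersent = np.stack([average_sentences(s) for s in docs_sents])

print(X_elmo.shape, X_infersent.shape)
```

Whatever the length of a document, each row of `X` has the same dimensionality, which is what lets a standard clustering algorithm operate on the result.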

---

### DFTC framework overview

Below is the overall architecture of the proposed framework from the paper:

![DFTC Framework](DFTC_framework.png)

*Figure: Deep Feature-Based Text Clustering (DFTC) framework. First, pretrained encoders generate document embeddings. Then, features are normalized and clustered. Finally, the TCRE module explains the clusters by identifying indication words.*
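The normalize-then-cluster stage in the figure can be sketched as follows. This is a minimal illustration assuming scikit-learn and synthetic features in place of real embeddings; L2 row normalization is an assumption made here for concreteness, and the normalization used in the paper may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Synthetic stand-in for encoder output: 3 groups of 20 "documents" in 16 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16)) * 5.0
X = np.vstack([c + rng.normal(size=(20, 16)) for c in centers])

# Assumption: scale every row to unit L2 norm before clustering.
X_norm = normalize(X, norm="l2")

# Cluster the normalized features; the cluster ids can later serve as pseudo-labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_norm)

print(cluster_ids.shape)
```

With real data, `X` would be the `(n_docs, d)` matrix from the feature construction step and `n_clusters` would be set to the number of categories in the dataset.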
