
# Clustering text documents using k-means and dimensionality reduction


This example shows how dimensionality reduction can mitigate the "curse of dimensionality" by denoising data and improving the performance of Euclidean-based clustering approaches. We cluster a set of documents, represented as bag-of-words, using two approaches:
1. A standard k-means algorithm
2. k-means applied after reducing the dimensionality of the space via random projections

We use standard measures of clustering quality to compare results provided by the two approaches.

## Datasets
To test our ideas, we begin with some standard datasets, for which ```sklearn``` provides a class for automatic downloading and preprocessing. 
As stated in the description, "The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date." Please refer to http://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset for more information.

To begin with, we import the libraries we will be using in this notebook:

In [8]:
import nltk
from sklearn.datasets import fetch_20newsgroups
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.metrics import homogeneity_score, homogeneity_completeness_v_measure, completeness_score
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.random_projection import SparseRandomProjection as srp
from sklearn.cluster import KMeans

import sys
from time import time

import numpy as np

We first select some categories from the 20 newsgroups dataset. These are specified by a list of string descriptors:

In [2]:
categories = [
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

print("Loading 20 newsgroups dataset for categories:")
print(categories)

Loading 20 newsgroups dataset for categories:
['talk.religion.misc', 'comp.graphics', 'sci.space']


We next download the corresponding dataset, including both the training and test sets:

In [3]:
dataset = fetch_20newsgroups(subset='all', categories=categories, 
                             shuffle=False, remove=('headers', 'footers', 'quotes'))

Documents are not randomly reordered (```shuffle=False```) and we remove all metadata, leaving only body text. ```dataset``` is an object describing the dataset. Its attributes ```filenames``` and ```target``` are two arrays, containing respectively the paths to the different documents and the corresponding labels, represented as integers from ```0``` to ```len(categories) - 1```.

In [4]:
labels = dataset.target
true_k = len(np.unique(labels)) ## This should be 3 in this example

We first perform lemmatization, which seems to behave better than stemming. The reason might be that the latter is too "aggressive" for this collection, which consists of short documents that may contain misspellings, abbreviations, etc. You would probably experience similar problems with a corpus of Twitter posts. You can try stemming after commenting out the next block of code.

In [9]:
# First, we download the necessary NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')

# We next perform lemmatization
lemmatizer = WordNetLemmatizer()
for i in range(len(dataset.data)):
    word_list = word_tokenize(dataset.data[i])
    lemmatized_doc = ""
    for word in word_list:
        lemmatized_doc = lemmatized_doc + " " + lemmatizer.lemmatize(word)
    dataset.data[i] = lemmatized_doc  

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


We next convert our corpus into tf-idf vectors in the usual way:

In [10]:
vectorizer = TfidfVectorizer(stop_words='english') ## Corpus is in English
X = vectorizer.fit_transform(dataset.data)

In [11]:
print(X.shape)

(2588, 28070)
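
To make the "usual way" concrete, here is a minimal from-scratch sketch of tf-idf weighting on a toy corpus. It assumes the smoothed idf variant, $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, followed by L2 normalisation of each row, which is what ```TfidfVectorizer``` does with its default settings. The whitespace tokenization, the helper name ```tfidf``` and the toy documents are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalised tf-idf weights.

    Assumption: smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]   # naive whitespace tokenizer
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

toy_docs = ["space shuttle launch", "image file format", "space image"]
vocab, rows = tfidf(toy_docs)
```

In the first toy document, "shuttle" (which occurs in one document) receives a larger weight than "space" (which occurs in two): rarer terms are considered more informative.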


## Clustering take 1: using standard k-means
We first cluster documents using the standard k-means algorithm (actually, a refined variant called k-means++), without any further data preprocessing. The key parameter of choice when performing k-means is $k$. Alas, there really is no principled way to choose an initial value for $k$. Essentially we have two options:

1. We choose a value that reflects our knowledge about the data, as in this case
2. We try several values, possibly in increasing order. We proceed this way as long as the quality of the resulting clustering (as measured by one or more quality indices) increases, and we stop when it starts decreasing. As you may suspect, this case arises pretty often in practice.
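
The second option can be sketched as a simple scan over candidate values. The helper ```choose_k``` and the scores below are hypothetical and purely illustrative: they are not measured on this dataset.

```python
def choose_k(candidates, quality):
    """Scan candidate values of k in increasing order; stop at the first value
    for which the quality index no longer increases."""
    best_k, best_q = None, float("-inf")
    for k in candidates:
        q = quality(k)
        if q <= best_q:      # quality stopped increasing: keep the previous k
            break
        best_k, best_q = k, q
    return best_k, best_q

# Hypothetical quality scores, for illustration only
toy_scores = {2: 0.21, 3: 0.37, 4: 0.33, 5: 0.35}
best_k, best_q = choose_k(sorted(toy_scores), toy_scores.get)
```

In a real experiment, ```quality``` would fit k-means for the given $k$ and return a quality index such as the silhouette coefficient. Note that this greedy scan stops at the first drop, so it may miss a better value further on.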

In this specific case, we know the number of categories, so we set $k = 3$.

In [19]:
km = KMeans(n_clusters=true_k, init='k-means++', n_init=20, max_iter=100)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))

done in 2.666s
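
For reference, the seeding rule that gives k-means++ its name can be sketched in a few lines of NumPy. This is a minimal illustration of $D^2$ sampling on synthetic data. The function name ```kmeans_pp_init``` and the toy blobs are assumptions, and the implementation in ```sklearn``` differs in its details.

```python
import numpy as np

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers with D^2 sampling (the k-means++ seeding rule)."""
    n = points.shape[0]
    # First center: chosen uniformly at random
    centers = [points[rng.integers(n)]]
    # Squared distance of every point to its closest center chosen so far
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        # Next center: sampled with probability proportional to d2
        idx = rng.choice(n, p=d2 / d2.sum())
        centers.append(points[idx])
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)

rng = np.random.default_rng(0)
# Three well-separated toy blobs in the plane (synthetic data)
blobs = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2))
                   for c in ([0, 0], [10, 0], [0, 10])])
init_centers = kmeans_pp_init(blobs, 3, rng)
```

Points far from the centers chosen so far are much more likely to be picked, so the initial centers tend to spread across different groups of points.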


Quality indices for clustering:

In [13]:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

Homogeneity: 0.324
Completeness: 0.420
V-measure: 0.366
Adjusted Rand-Index: 0.253
Silhouette Coefficient: 0.009
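
To see what the first three indices measure, here is a from-scratch sketch based on entropies. Homogeneity is $1 - H(C \mid K)/H(C)$ and completeness is $1 - H(K \mid C)/H(K)$, where $C$ denotes the true classes and $K$ the clusters. The V-measure is their harmonic mean. The helper names and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b) for two label sequences of equal length."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())

def homogeneity_completeness_v(true_labels, pred_labels):
    h_c, h_k = entropy(true_labels), entropy(pred_labels)
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_labels, pred_labels) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(pred_labels, true_labels) / h_k
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v

# Toy example: every cluster is pure, but class 1 is split across two clusters
hom, com, v = homogeneity_completeness_v([0, 0, 1, 1], [0, 0, 1, 2])
```

In the toy example each cluster contains documents of a single class, so homogeneity is perfect, while completeness is penalised because one class is split across two clusters.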


We next identify the $10$ most important terms associated with each centroid:

In [20]:
centroids = km.cluster_centers_.argsort()[:, ::-1] ## Indices of largest centroids' entries in descending order
terms = vectorizer.get_feature_names_out() ## get_feature_names() was removed in recent scikit-learn versions
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Cluster 0: file image thanks format program know graphic color ftp bit
Cluster 1: wa space like just think ha year time nasa know
Cluster 2: god jesus wa christian people bible did say kent koresh




## Clustering take 2: random projections first
We next use k-means again, but we first apply (sparse) random projections to map the data onto a low-dimensional space:

In [25]:
transformer = srp(n_components=500, dense_output=True) ## Dense output, see the note below
X_proj = transformer.fit_transform(X)

Note that we set ```dense_output=True```. Otherwise, the projected data would be represented as a sparse matrix, which considerably slows down $k$-means. Note also that the number of dimensions was chosen by trial and error and is lower than what the Johnson-Lindenstrauss lemma suggests. Actually, the number of dimensions required for the $k$-means cost to be preserved is lower than what we saw (recent results from 2019). We next perform k-means clustering on the projected data:

In [26]:
km = KMeans(n_clusters=true_k, init='k-means++', n_init=20, max_iter=100)
t0 = time()
km.fit(X_proj)
print("done in %0.3fs" % (time() - t0))
print("Data shape:", X_proj.shape)

done in 2.459s
Data shape: (2588, 500)
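
To put the choice of 500 components in perspective, we can evaluate a standard form of the Johnson-Lindenstrauss bound, $k \ge 4 \ln n / (\varepsilon^2/2 - \varepsilon^3/3)$, for our number of documents. To our knowledge, this is the same expression that ```sklearn.random_projection.johnson_lindenstrauss_min_dim``` evaluates. The helper ```jl_min_dim``` and the chosen values of $\varepsilon$ are assumptions made for illustration.

```python
import math

def jl_min_dim(n_samples, eps):
    """Dimension given by the bound 4 ln(n) / (eps^2 / 2 - eps^3 / 3)."""
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return math.ceil(4 * math.log(n_samples) / denominator)

n_docs = 2588  # number of documents (rows of X)
bounds = {eps: jl_min_dim(n_docs, eps) for eps in (0.1, 0.3, 0.5)}
for eps, k_min in bounds.items():
    print("eps = %.1f -> at least %d dimensions" % (eps, k_min))
```

For a distortion of $\varepsilon = 0.1$ the bound asks for several thousand dimensions, far more than the 500 used here. A target of 500 dimensions corresponds to a distortion between $0.3$ and $0.5$ according to this formula.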


In principle, a reduction in computational time is what we should expect from what we said in class: due to the linear dependence of Lloyd's algorithm on the number of dimensions, decreasing their number by a factor $f$ affords a speed-up of approximately the same factor. In the run above, however, the observed reduction is far smaller (2.459s versus 2.666s). We next take the usual measures of the quality of the clustering:

In [None]:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

Homogeneity: 0.300
Completeness: 0.408
V-measure: 0.345
Adjusted Rand-Index: 0.236
Silhouette Coefficient: 0.009
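
One reason the scores stay close is that random projections approximately preserve pairwise Euclidean distances, which is what k-means relies on. The sketch below builds a sparse random matrix from scratch and checks this on synthetic Gaussian data (not on the tf-idf matrix). It assumes the usual construction with entries $\pm\sqrt{s/k}$, each with probability $1/(2s)$, and $0$ otherwise, where $s = \sqrt{d}$ for $d$ original features and $k$ target dimensions. The function names are hypothetical.

```python
import numpy as np

def sparse_projection_matrix(n_features, n_components, density, rng):
    """Random matrix with entries in {-v, 0, +v}, v = sqrt(1 / (density * n_components))."""
    v = np.sqrt(1.0 / (density * n_components))
    signs = rng.choice([-1.0, 1.0], size=(n_features, n_components))
    mask = rng.random((n_features, n_components)) < density   # non-zero with prob. density
    return v * signs * mask

def pairwise_sq_dists(A):
    """Matrix of squared Euclidean distances between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

rng = np.random.default_rng(0)
n_points, n_features, n_components = 20, 2000, 500
data = rng.normal(size=(n_points, n_features))   # synthetic data
R = sparse_projection_matrix(n_features, n_components, 1 / np.sqrt(n_features), rng)
projected = data @ R

# Ratio between projected and original squared distances, for all distinct pairs
iu = np.triu_indices(n_points, k=1)
ratios = pairwise_sq_dists(projected)[iu] / pairwise_sq_dists(data)[iu]
print("min ratio: %.2f, max ratio: %.2f" % (ratios.min(), ratios.max()))
```

The ratios concentrate around $1$: squared distances are preserved in expectation, and the spread shrinks as the number of components grows.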


Next, we want to find the most representative words for the clusters we obtained. The problem is that the centroids returned by the ```cluster_centers_``` attribute now live in the projected space, and their dimensions have no particular meaning. We need to compute the corresponding centroids in the original space ourselves. This is easily done, since the ```labels_``` attribute of the ```KMeans``` object provides, for every $i$, a label identifying the cluster the $i$-th point was assigned to. We therefore proceed as follows:

In [27]:
centroids = np.zeros((true_k, X.shape[1])) # Initializing true_k centroid arrays
cluster_sizes = [0]*true_k # For each cluster, the number of points it contains, needed for taking average
for i in range(X_proj.shape[0]):
    index = int(km.labels_[i]) # index is the index of the cluster the i-th point belongs to
    centroids[index] += X[i] # Adding component-wise
    cluster_sizes[index] += 1

for i in range(true_k):
    centroids[i] = centroids[i]/cluster_sizes[i] # Computing centroids: take sum and divide by cluster size
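
The same computation can be written without explicit Python loops. The following is a minimal sketch on a small dense array. The helper ```centroids_from_labels``` is hypothetical, and it assumes dense input, so the sparse tf-idf matrix would first need to be converted or handled with a sparse product.

```python
import numpy as np

def centroids_from_labels(points, labels, k):
    """Average the rows of `points` that share a label; returns a (k, n_features) array."""
    sums = np.zeros((k, points.shape[1]))
    np.add.at(sums, labels, points)               # scatter-add each row into its cluster's sum
    sizes = np.bincount(labels, minlength=k)      # number of points per cluster
    return sums / np.maximum(sizes, 1)[:, None]   # guard against empty clusters

toy_points = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_centroids = centroids_from_labels(toy_points, toy_labels, 2)
```

Here the first two points are averaged into the centroid of cluster 0 and the last two into that of cluster 1, exactly as in the loop above.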

Once this is done, we proceed as in the previous case with the identification of representative words for each topic: 

In [28]:
centroids = centroids.argsort()[:, ::-1] ## Indices of largest centroids' entries in descending order
terms = vectorizer.get_feature_names_out() ## get_feature_names() was removed in recent scikit-learn versions
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Cluster 0: wa god know did think people just like say jesus
Cluster 1: space wa like launch just ha nasa program shuttle software
Cluster 2: file image format gif program thanks know color convert graphic




With fewer than $1/40$ of the original dimensions (500 out of 28070), the quality of the results is almost as good. The speed-up is less pronounced on Colab, probably because, for this dataset instance, pre-processing accounts for most of the computational cost. We instead observe a roughly 10-fold speed-up when the notebook is run with Jupyter on a desktop running Ubuntu 20.04.