## Clustering text documents using k-means

This notebook follows the scikit-learn documentation example on document clustering. The original example is [here](https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py).

In [1]:
# Feature extraction, clustering, and evaluation metrics from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn import metrics

# Timing helper for the clustering cells
from time import time

import numpy as np
import pandas as pd


In [2]:
corpus = pd.read_csv("./arxiv_articles_simplified.csv", sep="|")

In [3]:
corpus.iloc[:5, :]

| Unnamed: 0 | id | title | authors | arxiv_primary_category | summary | published | updated | general_category |
|---|---|---|---|---|---|---|---|---|
| 0 | http://arxiv.org/abs/2001.05867v1 | $σ$-Lacunary actions of Polish groups | Jan Grebik | math.LO | We show that every essentially countable orbit... | 2020-01-16T15:09:02Z | 2020-01-16T15:09:02Z | math |
| 1 | http://arxiv.org/abs/1303.6933v1 | Hans Grauert (1930-2011) | Alan Huckleberry | math.HO | Hans Grauert died in September of 2011. This a... | 2013-03-27T19:23:57Z | 2013-03-27T19:23:57Z | math |
| 2 | http://arxiv.org/abs/1407.3775v1 | A New Proof of Stirling's Formula | Thorsten Neuschel | math.HO | A new simple proof of Stirling's formula via t... | 2014-07-10T11:26:39Z | 2014-07-10T11:26:39Z | math |
| 3 | http://arxiv.org/abs/math/0307381v3 | On Dequantization of Fedosov's Deformation Qua... | Alexander V. Karabegov | math.QA | To each natural deformation quantization on a ... | 2003-07-30T06:20:33Z | 2003-09-20T01:29:18Z | math |
| 4 | http://arxiv.org/abs/1604.06794v1 | Cyclic extensions are radical | Mariano Suárez-Álvarez | math.HO | We show that finite Galois extensions with cyc... | 2016-04-21T22:24:54Z | 2016-04-21T22:24:54Z | math |


In [4]:
true_k = len(corpus["general_category"].unique())
print("True number of categories {}".format(true_k))
n_features = 10000

True number of categories 8


In [5]:
categories = corpus["general_category"].unique()
labels = [np.where(categories==c)[0][0] for c in corpus["general_category"]]
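
The label encoding above searches the category array once per document. As a minimal sketch on a made-up toy column (the values below are hypothetical, not taken from the arXiv file), `pandas.factorize` should give the same integer codes in a single pass:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for corpus["general_category"].
toy = pd.Series(["math", "physics", "math", "stat", "physics"])

# Approach used in the notebook: look up each value in the unique array.
categories = toy.unique()
labels_loop = [int(np.where(categories == c)[0][0]) for c in toy]

# One-pass encoding; codes follow the order of first appearance,
# which is also the order returned by Series.unique().
codes, uniques = pd.factorize(toy)

print(labels_loop)
print(codes.tolist())
print(list(uniques))
```

Both approaches number the categories in order of first appearance, so the resulting labels can be used interchangeably in the metric calls.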

In [6]:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=True)
X = vectorizer.fit_transform(corpus["summary"])
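
To see what `max_df=0.5` and `min_df=2` do, here is a small sketch on a made-up four-document corpus (the documents are hypothetical). With four documents, a term must occur in at least 2 documents and in at most 50% of them, so only terms found in exactly two documents are kept:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus, made up for illustration.
docs = [
    "stellar mass of nearby stars",
    "gene expression in cells",
    "bayesian regression model",
    "stars and gene expression",
]

# Same document-frequency settings as the notebook cell above.
toy_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words="english")
X_toy = toy_vectorizer.fit_transform(docs)

# Terms that survive both document-frequency filters.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# Rows are L2-normalised by default; a document with no surviving
# term becomes an all-zero row.
row_norms = np.sqrt(np.asarray(X_toy.multiply(X_toy).sum(axis=1)).ravel())
print(row_norms)
```

The third document keeps none of its terms, which is worth remembering when reading clustering results: documents made only of rare or very common words carry no signal after vectorization.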

In [7]:
mb_km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000, verbose=False)
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=False)
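
Both estimators are created without a `random_state`, so with `n_init=1` the clusters and scores can change from run to run. The sketch below, on made-up toy data, suggests how passing a fixed seed makes the assignment repeatable; the seed value and the helper `fit_labels` are illustrative assumptions, not part of the original notebook:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: three well-separated 2-D blobs.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

def fit_labels(seed):
    # Same settings as the notebook's KMeans, plus an explicit seed.
    model = KMeans(n_clusters=3, init="k-means++", max_iter=100,
                   n_init=1, random_state=seed)
    return model.fit_predict(X_toy)

labels_a = fit_labels(42)
labels_b = fit_labels(42)

# Identical seeds should give identical cluster assignments.
print(np.array_equal(labels_a, labels_b))
```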

In [8]:
print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

print()

Clustering sparse data with KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
       n_clusters=8, n_init=1, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=False)
done in 120.433s

Homogeneity: 0.353
Completeness: 0.366
V-measure: 0.359
Adjusted Rand-Index: 0.211
Silhouette Coefficient: 0.003
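
The first four scores compare the clustering against the known categories and do not depend on how cluster ids are numbered. A minimal sketch with made-up labels (all values below are hypothetical) shows two boundary cases: a perfect partition with swapped ids, and a single cluster containing everything:

```python
from sklearn import metrics

# Hypothetical ground-truth labels and two made-up clusterings.
truth = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same partition, cluster ids swapped
one_cluster = [0, 0, 0, 0, 0, 0]  # everything lumped together

scores = {}
for name, pred in [("relabelled", relabelled), ("one_cluster", one_cluster)]:
    scores[name] = {
        "homogeneity": metrics.homogeneity_score(truth, pred),
        "completeness": metrics.completeness_score(truth, pred),
        "v_measure": metrics.v_measure_score(truth, pred),
        "ari": metrics.adjusted_rand_score(truth, pred),
    }
    print(name, scores[name])
```

The swapped clustering scores 1.0 everywhere, while the single cluster is complete but not homogeneous. Read against these extremes, the values around 0.35 above indicate partial agreement with the arXiv categories.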



In [9]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
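# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead.
terms = vectorizer.get_feature_names_out()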
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Top terms per cluster:
Cluster 0: stars ray mass star stellar observations planets emission planet disk
Cluster 1: paper model time theory results based new problem using study
Cluster 2: data network networks model learning models method based analysis methods
Cluster 3: cell model cells dynamics energy field growth time magnetic equations
Cluster 4: estimator regression estimation estimators method data test sample asymptotic model
Cluster 5: bayesian algorithm monte carlo model models posterior markov inference sampling
Cluster 6: gene protein dna genes expression proteins genome sequence sequences binding
Cluster 7: market risk price model volatility optimal financial time portfolio problem


In [10]:
print("Clustering sparse data with %s" % mb_km)
t0 = time()
mb_km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, mb_km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, mb_km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, mb_km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, mb_km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, mb_km.labels_, sample_size=1000))

print()

Clustering sparse data with MiniBatchKMeans(batch_size=1000, compute_labels=True, init='k-means++',
                init_size=1000, max_iter=100, max_no_improvement=10,
                n_clusters=8, n_init=1, random_state=None,
                reassignment_ratio=0.01, tol=0.0, verbose=False)
done in 0.196s

Homogeneity: 0.354
Completeness: 0.382
V-measure: 0.368
Adjusted Rand-Index: 0.261
Silhouette Coefficient: 0.003



In [11]:
print("Top terms per cluster:")
order_centroids = mb_km.cluster_centers_.argsort()[:, ::-1]
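# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead.
terms = vectorizer.get_feature_names_out()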
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Top terms per cluster:
Cluster 0: paper data based systems new theory analysis present time using
Cluster 1: stars ray mass star stellar emission observations disk planets 10
Cluster 2: monte carlo markov sampling bayesian algorithm posterior chain mcmc inference
Cluster 3: risk portfolio measures model default credit financial measure value market
Cluster 4: problem optimal model time volatility stochastic process pricing function options
Cluster 5: model cell network cells protein networks dynamics dna gene time
Cluster 6: market price stock financial markets trading model time volatility prices
Cluster 7: data model method models estimation regression methods based estimator proposed
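
The top terms hint at which arXiv category each cluster captures. A cross-tabulation of categories against cluster ids makes that mapping explicit. The sketch below uses made-up category names and cluster ids; in the notebook the inputs would presumably be `corpus["general_category"]` and `mb_km.labels_`:

```python
import pandas as pd

# Hypothetical category names and cluster ids, made up for illustration.
toy_categories = ["math", "math", "stat", "stat", "q-fin", "q-fin"]
toy_clusters = [0, 0, 7, 7, 3, 7]

# Rows are true categories, columns are cluster ids, cells are counts.
table = pd.crosstab(pd.Series(toy_categories, name="category"),
                    pd.Series(toy_clusters, name="cluster"))
print(table)
```

A category whose row is spread over many columns is being split across clusters, and a column with counts in several rows mixes categories.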


In [13]:
# View the cluster assigned to each document by KMeans
km.labels_

array([1, 1, 1, ..., 1, 2, 1], dtype=int32)

In [14]:
# View the cluster assigned to each document by MiniBatchKMeans
mb_km.labels_

array([0, 0, 0, ..., 0, 7, 0], dtype=int32)
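
Cluster ids are arbitrary, so the two arrays above cannot be compared element by element: the same group of documents may be cluster 1 in one model and cluster 0 in the other. One possible way to line them up, sketched here on made-up label vectors, is to build a contingency matrix and solve the assignment problem with SciPy. This step is an addition for illustration and is not part of the scikit-learn example:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical label vectors standing in for km.labels_ and mb_km.labels_.
labels_a = np.array([1, 1, 1, 2, 2, 0, 0])
labels_b = np.array([0, 0, 0, 2, 2, 1, 1])

# counts[i, j] = number of samples in cluster i of A and cluster j of B.
# Row and column indices equal the cluster ids here because both
# labelings use the contiguous ids 0..2.
counts = contingency_matrix(labels_a, labels_b)

# One-to-one pairing of cluster ids that maximises the total overlap.
rows, cols = linear_sum_assignment(counts, maximize=True)
mapping = dict(zip(rows.tolist(), cols.tolist()))

# Fraction of samples on which the two clusterings agree after matching.
agreement = float(counts[rows, cols].sum()) / len(labels_a)
print(mapping, agreement)
```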