# Clustering documents using topic modeling

The goal in clustering is to group together similar documents creating a number of different groups. To judge the similarity of the documents it is common for each document to be represented by a vector of weights which are assigned to each word in the document. These weights in most cases are the <a src=https://en.wikipedia.org/wiki/Tf%E2%80%93idf>tf-idf</a> frequencies of the words. Thus, an NxM dimensional matrix is created where N is the number of documents and M the dimensions of the vector space (number of words in the dictionary of the document collection). The end result of clustering is a number of clusters with each document assigned to a cluster.

Another method of organizing a document collection is topic modeling. In topic modeling each word of the dictionary is associated with a probability of occurence in a topic and each topic is associated with a probability of occurence in a document. Each document is represented by a vector of the probabilities of each topic in it. This means that topic modeling results to a NxM matrix representation of the collection where N is the number of documents and M is the number of topics. Since we have a vector space representation of the documents we can use it for clustering. 

# Topic modeling

We'll use <a src=http://radimrehurek.com/gensim/>Gensim</a> for topic modeling as it offers time and space efficient implementations of the topic modeling algorithms and it's free. It also supports parallelization although I don't make us of this feature here. With Gensim you can use either Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA) to produce the topic vector space described above. We won't get into how these work now but you can find more information <a src=https://en.wikipedia.org/wiki/Latent_semantic_analysis>here</a> and <a src=https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation>here</a>. In this implementation i will use LSA as it is faster thatn LDA. It is also less accurate but the goal of this notebook is a proof of concept. 

The data we are going to use are greek WikiPedia articles xml formatted and bz2 compressed. You can download them from <a src=https://dumps.wikimedia.org/elwiki/20170601/elwiki-20170601-pages-meta-history.xml.bz2>here</a>. Gensim can work directly on the compressed collection.

First we need to create a Dictionary object, which is the dictionary of the document collection and associates each word to an id, and a MmCorpus object, which stores the corpus in a <a src=http://math.nist.gov/MatrixMarket/formats.html>Matrix Market</a> format. In a MmCorpus object each document can be represented by a vector of either integer count frequencies of it's word or tf-idf frequencies, but either way the resulting matrix is stored in the Matrix Market format.

I have saved the compressed WikiPedia XML file in a directory called data.

In [2]:
%run -m gensim.scripts.make_wikicorpus data/elwiki-20170720-pages-articles.xml.bz2 data/grwiki

2017-08-01 14:55:56,120 : INFO : running /home/theovasi/anaconda3/lib/python3.6/site-packages/gensim/scripts/make_wikicorpus.py data/elwiki-20170720-pages-articles.xml.bz2 data/grwiki
2017-08-01 14:55:56,323 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2017-08-01 14:56:25,444 : INFO : adding document #10000 to Dictionary(361468 unique tokens: ['αθλητισμός', 'είναι', 'συστηματική', 'σωματική', 'καλλιέργεια']...)
2017-08-01 14:56:46,820 : INFO : adding document #20000 to Dictionary(500714 unique tokens: ['αθλητισμός', 'είναι', 'συστηματική', 'σωματική', 'καλλιέργεια']...)
2017-08-01 14:57:05,578 : INFO : adding document #30000 to Dictionary(598622 unique tokens: ['αθλητισμός', 'είναι', 'συστηματική', 'σωματική', 'καλλιέργεια']...)
2017-08-01 14:57:24,443 : INFO : adding document #40000 to Dictionary(686151 unique tokens: ['αθλητισμός', 'είναι', 'συστηματική', 'σωματική', 'καλλιέργεια']...)
2017-08-01 14:57:42,056 : INFO : adding document #50000 to Dictionary(757946 uniq

The above script takes the compressed XML file as its first argument. The second argument is the prefix of the output files. When run this script creates the dictionary and saves it in /data/grwiki_wordids.txt.bz2 and the tf-idf representation of the corpus saved in Matrix Market format in /data/grwiki_tfidf.mm. Now lets load these files and create the needed objects.

First unzip the dictionary file:

In [5]:
!bzip2 -dk data/grwiki_wordids.txt.bz2

Create the Dictionary and MmCorpus objects by loading the files:

In [6]:
import logging, gensim, bz2
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO) # Allow gensim to print additional info.

dictionary = gensim.corpora.Dictionary.load_from_text('data/grwiki_wordids.txt')
corpus = gensim.corpora.MmCorpus('data/grwiki_tfidf.mm')

2017-08-01 19:14:34,952 : INFO : loaded corpus index from data/grwiki_tfidf.mm.index
2017-08-01 19:14:34,965 : INFO : initializing corpus reader from data/grwiki_tfidf.mm
2017-08-01 19:14:34,966 : INFO : accepted corpus with 122813 documents, 99121 features, 20926208 non-zero entries


Compute the LSA of the Greek WikiPedia:

In [7]:
lsi = gensim.models.lsimodel.LsiModel(corpus=corpus, id2word=dictionary, num_topics=100)

2017-08-01 19:14:41,235 : INFO : using serial LSI version on this node
2017-08-01 19:14:41,236 : INFO : updating model with new documents
2017-08-01 19:15:03,742 : INFO : preparing a new chunk of documents
2017-08-01 19:15:04,657 : INFO : using 100 extra samples and 2 power iterations
2017-08-01 19:15:04,658 : INFO : 1st phase: constructing (99121, 200) action matrix
2017-08-01 19:15:06,133 : INFO : orthonormalizing (99121, 200) action matrix
2017-08-01 19:15:22,315 : INFO : 2nd phase: running dense svd on (200, 20000) matrix
2017-08-01 19:15:23,464 : INFO : computing the final decomposition
2017-08-01 19:15:23,465 : INFO : keeping 100 factors (discarding 22.292% of energy spectrum)
2017-08-01 19:15:23,712 : INFO : processed documents up to #20000
2017-08-01 19:15:23,735 : INFO : topic #0(14.900): 0.274*"έλληνας" + 0.217*"αμερικανός" + 0.199*"πολιτικός" + 0.188*"ηθοποιός" + 0.145*"συγγραφέας" + 0.134*"βασιλιάς" + 0.131*"γεννήσεις" + 0.129*"θάνατοι" + 0.128*"γρηγοριανό" + 0.127*"hμερολό

2017-08-01 19:17:42,558 : INFO : 2nd phase: running dense svd on (200, 20000) matrix
2017-08-01 19:17:43,717 : INFO : computing the final decomposition
2017-08-01 19:17:43,718 : INFO : keeping 100 factors (discarding 22.943% of energy spectrum)
2017-08-01 19:17:43,936 : INFO : merging projections: (99121, 100) + (99121, 100)
2017-08-01 19:17:46,323 : INFO : keeping 100 factors (discarding 12.298% of energy spectrum)
2017-08-01 19:17:46,627 : INFO : processed documents up to #100000
2017-08-01 19:17:46,629 : INFO : topic #0(31.378): 0.232*"φυσικά" + 0.217*"αστεροειδών" + 0.212*"jpl" + 0.209*"κύριας" + 0.209*"java" + 0.207*"αστεροειδής" + 0.207*"τροχιά" + 0.203*"ηλιακό" + 0.202*"απόλυτο" + 0.201*"ζώνης"
2017-08-01 19:17:46,632 : INFO : topic #1(26.170): 0.360*"ποδοσφαιριστές" + 0.138*"εθνική" + 0.135*"πρωτάθλημα" + 0.119*"px" + 0.106*"κύπελλο" + 0.103*"εθνικής" + 0.096*"ποδοσφαίρου" + 0.081*"αγώνες" + 0.080*"έλληνας" + 0.079*"ομάδες"
2017-08-01 19:17:46,634 : INFO : topic #2(22.063): -0.

Save LSI model for future use:

In [8]:
lsi.save('data/model.lsi')

2017-08-01 19:31:41,899 : INFO : saving Projection object under data/model.lsi.projection, separately None
2017-08-01 19:31:42,497 : INFO : saved data/model.lsi.projection
2017-08-01 19:31:42,498 : INFO : saving LsiModel object under data/model.lsi, separately None
2017-08-01 19:31:42,499 : INFO : not storing attribute projection
2017-08-01 19:31:42,499 : INFO : not storing attribute dispatcher
2017-08-01 19:31:42,556 : INFO : saved data/model.lsi


We have created a model that can transform a vector from the tf-idf vector space to the topic vector space. This model extracted 400 topics from the documents. The first 10 topics are printed below. Each topic is represented by its 10 most contributing words (negative or positive).

In [9]:
lsi.print_topics(10)

2017-08-01 19:31:46,558 : INFO : topic #0(31.676): 0.230*"φυσικά" + 0.214*"αστεροειδών" + 0.209*"jpl" + 0.207*"κύριας" + 0.206*"java" + 0.205*"αστεροειδής" + 0.205*"τροχιά" + 0.201*"ηλιακό" + 0.200*"απόλυτο" + 0.199*"ζώνης"
2017-08-01 19:31:46,567 : INFO : topic #1(28.462): 0.305*"ποδοσφαιριστές" + 0.133*"πρωτάθλημα" + 0.129*"εθνική" + 0.124*"px" + 0.101*"κύπελλο" + 0.093*"εθνικής" + 0.086*"ποδοσφαίρου" + 0.083*"αγώνες" + 0.080*"έλληνας" + 0.079*"κόμμα"
2017-08-01 19:31:46,574 : INFO : topic #2(23.764): -0.594*"ποδοσφαιριστές" + -0.127*"πρωτάθλημα" + -0.122*"εθνική" + -0.114*"κύπελλο" + -0.100*"αγωνίστηκε" + -0.100*"ποδοσφαίρου" + -0.095*"γκολ" + -0.095*"εθνικής" + 0.087*"χωριό" + -0.086*"λιγκ"
2017-08-01 19:31:46,576 : INFO : topic #3(19.713): -0.227*"κόμμα" + 0.225*"χωριό" + 0.223*"δήμος" + -0.189*"εκλογές" + 0.186*"νομού" + 0.157*"δήμο" + 0.149*"κατοίκους" + 0.148*"δήμου" + 0.141*"απογραφή" + -0.133*"βουλευτές"
2017-08-01 19:31:46,581 : INFO : topic #4(19.473): 0.584*"px" + -0.458*"

[(0,
  '0.230*"φυσικά" + 0.214*"αστεροειδών" + 0.209*"jpl" + 0.207*"κύριας" + 0.206*"java" + 0.205*"αστεροειδής" + 0.205*"τροχιά" + 0.201*"ηλιακό" + 0.200*"απόλυτο" + 0.199*"ζώνης"'),
 (1,
  '0.305*"ποδοσφαιριστές" + 0.133*"πρωτάθλημα" + 0.129*"εθνική" + 0.124*"px" + 0.101*"κύπελλο" + 0.093*"εθνικής" + 0.086*"ποδοσφαίρου" + 0.083*"αγώνες" + 0.080*"έλληνας" + 0.079*"κόμμα"'),
 (2,
  '-0.594*"ποδοσφαιριστές" + -0.127*"πρωτάθλημα" + -0.122*"εθνική" + -0.114*"κύπελλο" + -0.100*"αγωνίστηκε" + -0.100*"ποδοσφαίρου" + -0.095*"γκολ" + -0.095*"εθνικής" + 0.087*"χωριό" + -0.086*"λιγκ"'),
 (3,
  '-0.227*"κόμμα" + 0.225*"χωριό" + 0.223*"δήμος" + -0.189*"εκλογές" + 0.186*"νομού" + 0.157*"δήμο" + 0.149*"κατοίκους" + 0.148*"δήμου" + 0.141*"απογραφή" + -0.133*"βουλευτές"'),
 (4,
  '0.584*"px" + -0.458*"ποδοσφαιριστές" + 0.170*"πρωτάθλημα" + 0.111*"κύπελλο" + 0.107*"ομάδες" + 0.103*"αγώνες" + 0.103*"ολυμπιακός" + 0.094*"παναθηναϊκός" + 0.090*"ανδρών" + -0.082*"χωριό"'),
 (5,
  '-0.370*"ταινίες" + 0.293*

Apply the LSI tranformation to the whole collection that is represented in
the corpus object in tf-df form.

In [10]:
corpus_lsi = lsi[corpus]

We have managed to create a topic space representation of the Greek 
WikiPedia and reduced teh dimensions of the vector space from NxM (N is
the number of documents and M is sthe number of words in the dictionary) 
to Nx100. The next step is the clustering.

# Clustering

We are going to use scikit-learn for the clustering. Scikit's algorithms require for the vector space matrix to be in a sparse matrix format. Gensim 
provides a function to do just that.

In [11]:
lsi_sparse = gensim.matutils.corpus2csc(corpus_lsi)

I am using <a src=https://en.wikipedia.org/wiki/K-means_clustering>k-means</a> to create 8 clusters without any refinements apart from a maximum iteration number.

In [12]:
from sklearn.cluster import KMeans
kmodel = KMeans(n_clusters=8, max_iter=100, verbose=True)

In [None]:
Finally fit the k-means model on the documents x topic weights matrix we
created above. 

In [13]:
kmodel.fit(lsi_sparse) 

Initialization complete
Iteration  0, inertia 14385.484
Iteration  1, inertia 10073.993
Converged at iteration 1: center shift 0.000000e+00 within tolerance 1.012397e-07
Initialization complete
Iteration  0, inertia 14445.820
Iteration  1, inertia 9936.316
Converged at iteration 1: center shift 0.000000e+00 within tolerance 1.012397e-07
Initialization complete
Iteration  0, inertia 13913.163
Iteration  1, inertia 9410.579
Converged at iteration 1: center shift 0.000000e+00 within tolerance 1.012397e-07
Initialization complete
Iteration  0, inertia 14661.618
Iteration  1, inertia 10349.191
Converged at iteration 1: center shift 0.000000e+00 within tolerance 1.012397e-07
Initialization complete
Iteration  0, inertia 14889.464
Iteration  1, inertia 10273.376
Converged at iteration 1: center shift 0.000000e+00 within tolerance 1.012397e-07
Initialization complete
Iteration  0, inertia 14777.574
Iteration  1, inertia 10305.512
Converged at iteration 1: center shift 0.000000e+00 within toler

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=8, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=True)

Now that we have a trained k-means model, the next step is to find a way to
visualize the result of the clustering.

# Visualization

In [None]:
# TODO: Visualization.