Topic modelling is an important and common task in unsupervised machine learning. Assigning text (from a web page, library book, media article, etc.) to topics has many applications, such as spam filtering, email routing and sentiment analysis.<br>
Here we use the 20 Newsgroups dataset, a collection of approximately 20,000 newsgroup documents partitioned nearly evenly across 20 different newsgroups. It has become a popular dataset for text classification and text clustering.

In [1]:
# import all the necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups 
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings("ignore")

Since the 20 newsgroups can be grouped into 6 broad topics, we take one newsgroup from each topic for clustering.

In [2]:
categories = ['comp.graphics','misc.forsale','rec.sport.baseball','talk.politics.guns','sci.space','alt.atheism']

# read the documents with the specified categories
dataset = fetch_20newsgroups(subset='train',categories=categories)
data_samples = dataset.data
data_samples

['From: cwilliam@tigger.cs.colorado.edu (Christopher Williamson)\nSubject: ** Oscilloscope for sale $99 + probes $25 ea. ** \nNntp-Posting-Host: tigger.cs.colorado.edu\nOrganization: University of Colorado at Boulder\nDistribution: na\nLines: 13\n\nI have a Tektronix T921 15Mhz scope for sale.  It is a nice, simple\nunit to learn on.  I used it while I was in school.  If you want one\nto play with at home, this is easy and inexpensive.  It has a nice\nhandle and is quite lightweight and easy to move around.\n\nI will consider selling the probes seperately for $25 ea.  They are HP\n10017A probes suitable for this type of scope.  The probes are NOT\nincluded in the price of $99 for the scope.\n\nIf you need more technical info, you will have to come look at it, as\nI am not a scope expert and what I have said is all I know.\n\nChris\n',
 "From: nick@sfb256.iam.uni-bonn.de (   Nikan B Firoozye )\nSubject: Re: Sunrise/ sunset times\nOrganization: Applied Math, University of Bonn, Germany\n

Text files are ordered sequences of words. To run machine learning algorithms on them, we need to convert the text into numerical feature vectors. We use the bag-of-words model, i.e. we build a document-term matrix of word frequencies. To reduce the weight of words that occur in many documents, we use the tf-idf vectorizer.

In [3]:
tf_vectorizer = TfidfVectorizer(stop_words='english', min_df=4)
tf = tf_vectorizer.fit_transform(data_samples)
tf.shape

(3385, 12863)
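
To make the weighting concrete, here is a minimal by-hand sketch of tf-idf on a toy corpus. The three toy documents and the whitespace tokenization are assumptions for illustration only; the formula follows scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row).

```python
import math
from collections import Counter

# Hypothetical toy corpus; "edu" appears in every document, like a header word.
toy_docs = ["edu space moon orbit", "edu gun law", "edu space launch"]
tokenized = [doc.split() for doc in toy_docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(toy_docs)

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for doc in tokenized for word in set(doc))

def tfidf_row(tokens):
    """Return one L2-normalized tf-idf row over the sorted vocabulary."""
    counts = Counter(tokens)
    raw = [counts[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1) for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]  # document-term matrix
first_doc = dict(zip(vocab, matrix[0]))

print(len(matrix), len(vocab))  # rows = documents, columns = vocabulary terms
for word in ("edu", "space", "moon"):
    print(word, round(first_doc[word], 3))
```

In the first toy document each word occurs once, yet "edu" (present in every document) ends up with the smallest weight and "moon" (present in one document) with the largest, which is the down-weighting effect described above.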

In [4]:
# reduce the tf-idf matrix to 50 topic dimensions with LDA
# (note: X is not used below; KMeans is fit on tf directly)
lda = LatentDirichletAllocation(n_components=50, random_state=0)
X = lda.fit_transform(tf)
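
As a rough illustration of what `fit_transform` returns, the sketch below runs LDA on a hypothetical four-document corpus. The toy documents, the choice of 2 components and the use of raw counts from `CountVectorizer` are assumptions made here, not part of the notebook's pipeline (LDA is usually formulated over word counts, whereas the cell above passes tf-idf weights).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two rough themes.
toy_docs = [
    "nasa launch orbit moon",
    "moon orbit lunar launch",
    "gun law firearms control",
    "firearms gun control law",
]

counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

# One row per document, one column per topic.
print(doc_topics.shape)
# Each row is a topic distribution, so it should sum to 1.
print(np.allclose(doc_topics.sum(axis=1), 1.0))
```

Each row of the result is a low-dimensional description of one document, which is why LDA can double as a dimensionality reduction step before clustering.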

Now, using KMeans clustering, we group the documents into 6 clusters and print the 20 highest-weighted words of each cluster centroid.

In [5]:
# cluster using KMeans
km = KMeans(n_clusters=6, init='k-means++', max_iter=100, n_init=1)
km.fit(tf)

# for each cluster center, sort the term indices by descending weight
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = tf_vectorizer.get_feature_names_out()

# print the top 20 words from each cluster
for i in range(6):
    print("cluster %d:" % i)
    for ind in order_centroids[i,:20]:
        print('%s' % terms[ind], end = ' ')
    print('\n')


cluster 0:
space nasa gov henry access edu alaska digex moon toronto pat com article writes jpl launch orbit zoo lunar just 

cluster 1:
edu com graphics lines subject organization university posting host nntp ca thanks know mail article cs writes files computer image 

cluster 2:
gun com guns edu people stratus fbi weapons don batf firearms writes right article law control cdt think atf like 

cluster 3:
edu baseball year team game players com runs games article writes jewish win season hit cs pitching braves good think 

cluster 4:
god keith edu caltech com atheists people livesey writes sgi morality atheism islam don religion say think islamic moral solntze 

cluster 5:
sale edu 00 offer shipping new condition drive price university lines distribution subject organization asking interested 10 com sell forsale 
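
The `argsort()[:, ::-1]` step used above can be seen more easily on a tiny example. The centroid values and the five-term vocabulary below are made up for illustration; only the indexing idiom is the same as in the cell above.

```python
import numpy as np

# Hypothetical vocabulary and two hypothetical cluster centers (one row each).
toy_terms = np.array(["edu", "gun", "moon", "nasa", "sale"])
toy_centroids = np.array([
    [0.10, 0.00, 0.40, 0.50, 0.00],
    [0.20, 0.05, 0.00, 0.00, 0.60],
])

# argsort sorts ascending; reversing the columns gives descending weight.
order = toy_centroids.argsort()[:, ::-1]

# Keep the two heaviest terms per cluster.
top_terms = [toy_terms[row[:2]].tolist() for row in order]
print(top_terms)
```

Because each centroid is the mean tf-idf vector of its documents, the heaviest coordinates are the terms that are most characteristic of the cluster.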



From the above output, we can observe that: <br>
Cluster 0 relates to space, with words such as space, nasa, moon, orbit, lunar and launch <br>
Cluster 1 relates to computer graphics, with words such as graphics, files, computer and image, mixed with generic header words (subject, organization, lines) that occur in every topic <br>
Cluster 2 relates to politics/guns, with words such as gun, weapons, firearms, fbi, law <br>
Cluster 3 relates to sports (baseball), with words such as baseball, team, game, players, pitching <br>
Cluster 4 relates to atheism, with words such as god, morality, islam, atheists, islamic <br>
Cluster 5 relates to items for sale, with words such as sale, offer, shipping, condition, price <br>
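
Reading the top words is subjective, so one way to check such an interpretation is to count which true newsgroup dominates each cluster. The sketch below does this on made-up labels; the toy label lists are assumptions for illustration.

```python
from collections import Counter

# Hypothetical true newsgroup names and cluster ids for six documents.
true_names = ["sci.space", "sci.space", "misc.forsale",
              "misc.forsale", "misc.forsale", "sci.space"]
cluster_ids = [0, 0, 5, 5, 0, 0]

# Contingency counts: (cluster id, true newsgroup) -> number of documents.
table = Counter(zip(cluster_ids, true_names))

dominant = {}
for cluster in sorted(set(cluster_ids)):
    row = {name: n for (c, name), n in table.items() if c == cluster}
    dominant[cluster] = max(row, key=row.get)
    print(cluster, row)

print(dominant)
```

In this notebook the same counts could be built from `km.labels_` as the cluster ids and `[dataset.target_names[t] for t in dataset.target]` as the true names.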

Finally, we assess the quality of the clusters using homogeneity, completeness, V-measure and the silhouette coefficient.

In [6]:
from sklearn import metrics
labels = dataset.target
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(tf, km.labels_, sample_size=1000))

Homogeneity: 0.667
Completeness: 0.704
V-measure: 0.685
Silhouette Coefficient: 0.008
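
For intuition about the first three scores, here is a minimal by-hand sketch based on their entropy definitions: homogeneity is `1 - H(class | cluster) / H(class)`, completeness is `1 - H(cluster | class) / H(cluster)`, and V-measure is their harmonic mean. The helper names and the toy labels are hypothetical; the notebook itself relies on `sklearn.metrics`.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """H(target | given) for two label sequences of equal length."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())

def homogeneity_completeness_v(true, pred):
    h_true, h_pred = entropy(true), entropy(pred)
    hom = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true, pred) / h_true
    com = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, true) / h_pred
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v

# Toy case: every cluster is pure, but class 0 is split over two clusters.
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 2, 2, 2]
h, c, v = homogeneity_completeness_v(true_labels, pred_labels)
print(round(h, 3), round(c, 3), round(v, 3))

# Harmonic mean of the homogeneity and completeness reported above.
reported_v = 2 * 0.667 * 0.704 / (0.667 + 0.704)
print(round(reported_v, 3))
```

In the toy case homogeneity is perfect because no cluster mixes classes, while completeness is penalized because one class is split across clusters. The last line shows that the harmonic mean of the reported homogeneity and completeness reproduces the reported V-measure of 0.685.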
