# Clustering and Topic Modeling Workshop

Today's workshop will address clustering and topic modeling, primarily through the use of scikit-learn and gensim. A fundamental understanding of Python is necessary. We will cover:

1. Preparing your corpus
2. Clustering
3. Topic Modeling

Python packages you will need:

* NLTK ( \$ pip install nltk)
* scikit-learn ( \$ pip install scikit-learn)
* pandas ( \$ pip install pandas)
* matplotlib ( \$ pip install matplotlib)
* gensim ( \$ pip install gensim)

Data you will need:

* The CMU book summaries dataset from here: http://www.cs.cmu.edu/~dbamman/booksummaries.html

The clustering section is modified from http://brandonrose.org/clustering . The scikit-learn documentation with examples is here: http://scikit-learn.org/stable/ . For a further, lower-level explanation of topic modeling, see *Data Science from Scratch*: http://shop.oreilly.com/product/0636920033400.do .

# 1) Preparing the data

We're going to do some basic clustering and topic modelling of a large dataset of book summaries created by CMU. These summaries were scraped from Wikipedia.

First we'll read the .tsv file into a pandas dataframe:

In [1]:
import pandas as pd

df = pd.read_csv("booksummaries.txt", sep="\t")

We can see the number of books and metadata by printing the shape:

In [2]:
print (df.shape)

(16558, 7)


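And the first few rows by calling head(). Note that the file has no header row, so pandas has used the first record (Animal Farm) as the column names: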

In [3]:
df.head()

Unnamed: 0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"": ""Satire"", ""/m/0dwly"": ""Children's literature"", ""/m/014dfn"": ""Speculative fiction"", ""/m/02xlf"": ""Fiction""}","Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. When Major dies, two young pigs, Snowball and Napoleon, assume command and turn his dream into a philosophy. The animals revolt and drive the drunken and irresponsible Mr Jones from the farm, renaming it ""Animal Farm"". They adopt Seven Commandments of Animal-ism, the most important of which is, ""All animals are equal"". Snowball attempts to teach the animals reading and writing; food is plentiful, and the farm runs smoothly. The pigs elevate themselves to positions of leadership and set aside special food items, ostensibly for their personal health. Napoleon takes the pups from the farm dogs and trains them privately. Napoleon and Snowball struggle for leadership. When Snowball announces his plans to build a windmill, Napoleon has his dogs chase Snowball away and declares himself leader. Napoleon enacts changes to the governance structure of the farm, replacing meetings with a committee of pigs, who will run the farm. Using a young pig named Squealer as a ""mouthpiece"", Napoleon claims credit for the windmill idea. The animals work harder with the promise of easier lives with the windmill. After a violent storm, the animals find the windmill annihilated. Napoleon and Squealer convince the animals that Snowball destroyed it, although the scorn of the neighbouring farmers suggests that its walls were too thin. Once Snowball becomes a scapegoat, Napoleon begins purging the farm with his dogs, killing animals he accuses of consorting with his old rival. 
He and the pigs abuse their power, imposing more control while reserving privileges for themselves and rewriting history, villainising Snowball and glorifying Napoleon. Squealer justifies every statement Napoleon makes, even the pigs' alteration of the Seven Commandments of Animalism to benefit themselves. 'Beasts of England' is replaced by an anthem glorifying Napoleon, who appears to be adopting the lifestyle of a man. The animals remain convinced that they are better off than they were when under Mr Jones. Squealer abuses the animals' poor memories and invents numbers to show their improvement. Mr Frederick, one of the neighbouring farmers, attacks the farm, using blasting powder to blow up the restored windmill. Though the animals win the battle, they do so at great cost, as many, including Boxer the workhorse, are wounded. Despite his injuries, Boxer continues working harder and harder, until he collapses while working on the windmill. Napoleon sends for a van to take Boxer to the veterinary surgeon's, explaining that better care can be given there. Benjamin, the cynical donkey, who ""could read as well as any pig"", notices that the van belongs to a knacker, and attempts to mount a rescue; but the animals' attempts are futile. Squealer reports that the van was purchased by the hospital and the writing from the previous owner had not been repainted. He recounts a tale of Boxer's death in the hands of the best medical care. Years pass, and the pigs learn to walk upright, carry whips and wear clothes. The Seven Commandments are reduced to a single phrase: ""All animals are equal, but some animals are more equal than others"". Napoleon holds a dinner party for the pigs and the humans of the area, who congratulate Napoleon on having the hardest-working but least fed animals in the country. Napoleon announces an alliance with the humans, against the labouring classes of both ""worlds"". 
He abolishes practices and traditions related to the Revolution, and changes the name of the farm to ""The Manor Farm"". The animals, overhearing the conversation, notice that the faces of the pigs have begun changing. During a poker match, an argument breaks out between Napoleon and Mr Pilkington, and the animals realise that the faces of the pigs look like the faces of humans, and no one can tell the difference between them. The pigs Snowball, Napoleon, and Squealer adapt Old Major's ideas into an actual philosophy, which they formally name Animalism. Soon after, Napoleon and Squealer indulge in the vices of humans (drinking alcohol, sleeping in beds, trading). Squealer is employed to alter the Seven Commandments to account for this humanisation, an allusion to the Soviet government's revising of history in order to exercise control of the people's beliefs about themselves and their society. The original commandments are: # Whatever goes upon two legs is an enemy. # Whatever goes upon four legs, or has wings, is a friend. # No animal shall wear clothes. # No animal shall sleep in a bed. # No animal shall drink alcohol. # No animal shall kill any other animal. # All animals are equal. Later, Napoleon and his pigs secretly revise some commandments to clear them of accusations of law-breaking (such as ""No animal shall drink alcohol"" having ""to excess"" appended to it and ""No animal shall sleep in a bed"" with ""with sheets"" added to it). The changed commandments are as follows, with the changes bolded: * 4 No animal shall sleep in a bed with sheets. * 5 No animal shall drink alcohol to excess. * 6 No animal shall kill any other animal without cause. Eventually these are replaced with the maxims, ""All animals are equal, but some animals are more equal than others"", and ""Four legs good, two legs better!"" as the pigs become more human. 
This is an ironic twist to the original purpose of the Seven Commandments, which were supposed to keep order within Animal Farm by uniting the animals together against the humans, and prevent animals from following the humans' evil habits. Through the revision of the commandments, Orwell demonstrates how simply political dogma can be turned into malleable propaganda."
0,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
1,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
2,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
3,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...
4,2152,/m/0x5g,All Quiet on the Western Front,Erich Maria Remarque,1929-01-29,"{""/m/098tmk"": ""War novel"", ""/m/016lj8"": ""Roman...","The book tells the story of Paul Bäumer, a Ge..."


In [4]:
lists = df.values.T.tolist()

In [5]:
print(lists[6][1])

 The text of The Plague is divided into five parts. In the town of Oran, thousands of rats, initially unnoticed by the populace, begin to die in the streets. A hysteria develops soon afterward, causing the local newspapers to report the incident. Authorities responding to public pressure order the collection and cremation of the rats, unaware that the collection itself was the catalyst for the spread of the bubonic plague. The main character, Dr. Bernard Rieux, lives comfortably in an apartment building when strangely the building's concierge, M. Michel, a confidante, dies from a fever. Dr. Rieux consults his colleague, Castel, about the illness until they come to the conclusion that a plague is sweeping the town. They both approach fellow doctors and town authorities about their theory, but are eventually dismissed on the basis of one death. However, as more and more deaths quickly ensue, it becomes apparent that there is an epidemic. Authorities, including the Prefect, M. Othon, are 

Now we can zip together the relevant data. We'll just be using the titles and summaries:

In [6]:
zipped_data = list(zip(lists[2],lists[6]))
small_data = zipped_data[:200] # just first 200

sums = [x[1] for x in small_data]
sums = [x.replace('"', "") for x in sums] # quotes are changed when tokenized and get caught
sums = [x.replace('...','') for x in sums] # ellipses also get caught

titles = [x[0] for x in small_data]

# 2) Clustering

Clustering is an unsupervised machine learning method: we do not tell the computer what the groupings are. Clustering is not the same as topic modeling, although clustering can yield topics. It is a more restricted approach that groups and visualizes data based on similarity. If you only want to determine topics, a conventional LDA model is usually a better fit, as it allows a document to be assigned to more than one topic. If you are looking for spatial relations and a one-to-one assignment of documents to groupings, clustering will show this better.

First we'll define functions in order to collect tokenized words and stemmed words:

In [7]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from string import punctuation

#as these are summaries, we'll take out further stop words that don't appear in enough summaries
more_stops = ['novel', 'book', 'books', 'story', 'narrator', 'narrative','character', 
              'chapters', 'chapter', 'tells', "'s"]

stemmer = SnowballStemmer("english")

def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = [x for x in tokens if x not in punctuation 
                       and x not in more_stops] #word tokenizer cuts the possessives
    return filtered_tokens

def tokenize_and_stem(text):
    stems = [stemmer.stem(x) for x in tokenize_only(text)]
    return stems

Now we collect these from our summaries. This is only needed later, to map stemmed terms back to readable words:

In [8]:
totalvocab_stemmed = []
totalvocab_tokenized = []

for s in sums:
    totalvocab_stemmed.extend(tokenize_and_stem(s))
    totalvocab_tokenized.extend(tokenize_only(s))

Our data frame will map tokenized words to stemmed words, recalling our work with pandas in Day 3 of the introductory series:

In [9]:
import pandas as pd

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print (vocab_frame.shape[0])
print (vocab_frame)

167278
                        words
alex                     alex
a                           a
teenag               teenager
live                   living
in                         in
near-futur        near-future
england               england
lead                    leads
his                       his
gang                     gang
on                         on
night                 nightly
orgi                   orgies
of                         of
opportunist     opportunistic
random                 random
ultra-viol     ultra-violence
alex                     alex
friend                friends
droog                  droogs
in                         in
the                       the
anglo-russian   anglo-russian
slang                   slang
nadsat                 nadsat
are                       are
dim                       dim
a                           a
slow-wit          slow-witted
bruiser               bruiser
...                       ...
of                         of
ear

In [10]:
print(sums[0])

 Alex, a teenager living in near-future England, leads his gang on nightly orgies of opportunistic, random ultra-violence. Alex's friends (droogs in the novel's Anglo-Russian slang, Nadsat) are: Dim, a slow-witted bruiser who is the gang's muscle; Georgie, an ambitious second-in-command; and Pete, who mostly plays along as the droogs indulge their taste for ultra-violence. Characterized as a sociopath and a hardened juvenile delinquent, Alex is also intelligent and quick-witted, with sophisticated taste in music, being particularly fond of Beethoven, or Lovely Ludwig Van. The novel begins with the droogs sitting in their favorite hangout (the Korova Milkbar), drinking milk-drug cocktails, called milk-plus, to hype themselves for the night's mayhem. They assault a scholar walking home from the public library, rob a store leaving the owner and his wife bloodied and unconscious, stomp a panhandling derelict, then scuffle with a rival gang. Joyriding through the countryside in a stolen car

We'll make a tf-idf (*term frequency-inverse document frequency*) matrix. A tf-idf weight takes into account how often a term occurs in a document and offsets it by how many documents in the corpus contain that term, so that terms common to most documents count for less (more here: https://en.wikipedia.org/wiki/Tf–idf):

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#define vectorizer parameters: max_df is the maximum fraction of docs a term may occur in, min_df is the minimum
#use .8 max to eliminate more common words, lower .2 looking for unique but not proper nouns
#use inverse document frequency, give more weight to rare words
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=20000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,5))

tfidf_matrix = tfidf_vectorizer.fit_transform(sums) #fit the vectorizer to the summaries

print(tfidf_matrix.shape)
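
To make the weighting concrete, here is a minimal from-scratch sketch on a toy corpus. It assumes the plain textbook formula, term count times log(N / document frequency); scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row, so its numbers will differ. The toy documents and the function name `toy_tfidf` are made up for illustration.

```python
import math
from collections import Counter

def toy_tfidf(docs):
    """Return one {term: weight} dict per tokenized document (textbook tf * log(N/df))."""
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# hypothetical toy corpus, already tokenized
toy_docs = [
    ["pig", "farm", "farm", "revolt"],
    ["rat", "plague", "town"],
    ["farm", "town", "war"],
]
toy_weights = toy_tfidf(toy_docs)
print(toy_weights[0])
```

In the first toy document "farm" occurs twice but also appears in another document, so it ends up with a lower weight than "pig", which occurs once and nowhere else.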

The tfidf_matrix maps every document to every term meeting the vectorizer parameters, and assigns a weight based on the term's frequency in that document and amongst all the documents.

In [None]:
print (tfidf_matrix)

Then we need the terms from the vectorizer. These are essentially the most influential words, taking both document frequency and corpus frequency into account. We will eventually assign them to clusters.

In [None]:
terms = tfidf_vectorizer.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2
print (terms)

In order to plot our clusters in a 2D plane, we'll want to calculate the distance between any two given summaries via cosine similarity:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix) #creates high dimensional object
print (dist.shape)
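
As a rough illustration of what this distance measures, the sketch below computes 1 minus the cosine similarity by hand for a few made-up vectors. The rows in `toy_rows` are hypothetical stand-ins for tf-idf rows, and `cosine_distance` is a helper written for this example, not part of scikit-learn.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# hypothetical tf-idf rows for three documents over a four-term vocabulary
toy_rows = [
    [0.5, 0.5, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
]
# pairwise distance matrix, one row and one column per document
toy_dist = [[cosine_distance(u, v) for v in toy_rows] for u in toy_rows]
for row in toy_dist:
    print([round(d, 3) for d in row])
```

Documents that share terms in similar proportions end up close to 0, while documents with no terms in common end up at distance 1.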

Now we'll start the actual clustering with k-means. The algorithm assigns each observation to the cluster whose mean yields the least within-cluster sum of squares, essentially the nearest mean, and then recomputes each mean from its assigned observations. This iterates until the means no longer change.
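
The assign-and-update loop described above is commonly known as k-means (Lloyd's algorithm). The following is a minimal pure-Python sketch of it, not the scikit-learn implementation: it assumes a naive initialization from the first k points (scikit-learn's KMeans uses k-means++ by default), and `toy_kmeans` and `toy_points` are names invented for this example.

```python
def toy_kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on a list of equal-length tuples; returns (labels, means)."""
    means = [list(p) for p in points[:k]]  # naive init: first k points (an assumption)
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to the nearest mean (squared Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so the means no longer change
            break
        labels = new_labels
        # update step: move each mean to the average of its assigned points
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, means

# two visibly separate groups of made-up 2D points
toy_points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
toy_labels, toy_means = toy_kmeans(toy_points, k=2)
print(toy_labels)
```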

But how do we know the number of clusters? This is a highly contentious matter. There are various methods (https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set), but we'll employ the silhouette method, which compares how similar each document is to the other documents in its own cluster with how similar it is to documents in the nearest other cluster. We want the highest silhouette coefficient we can get.

In [None]:
from sklearn import metrics
from sklearn.cluster import KMeans

cnum = 6
km = KMeans(init='k-means++', n_clusters=cnum, n_init=10, random_state=10)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist() #assigns each paragraph to the respective cluster
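
The cell above fixes the number of clusters at cnum = 6. One way to compare candidate values is to compute the silhouette coefficient for each and keep the highest. The sketch below shows how that coefficient is computed, from scratch and with Euclidean distance on made-up 2D points; the helper name `silhouette` and the toy data are assumptions for illustration. In scikit-learn, `metrics.silhouette_score(tfidf_matrix, km.labels_)` computes the same kind of score for a fitted model.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient for points (tuples) with integer cluster labels."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    clusters = sorted(set(labels))
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # mean distance from p to the other points of each cluster
        mean_dist = {}
        for c in clusters:
            others = [q for j, (q, other_lab) in enumerate(zip(points, labels))
                      if other_lab == c and j != i]
            mean_dist[c] = sum(dist(p, q) for q in others) / len(others) if others else None
        a = mean_dist[lab]  # cohesion: mean distance within p's own cluster
        if a is None:  # a singleton cluster gets a score of 0 by convention
            scores.append(0.0)
            continue
        # separation: mean distance to the nearest other cluster
        b = min(d for c, d in mean_dist.items() if c != lab and d is not None)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

toy_points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
good = silhouette(toy_points, [0, 0, 0, 1, 1, 1])  # labels match the two visible groups
bad = silhouette(toy_points, [0, 1, 0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

A labeling that matches the natural groups scores close to 1, while a labeling that cuts across them scores much lower.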

We can now do a form of topic modeling by printing the words that characterize the clusters we made. These are the terms with the largest weights in each cluster's centroid, looked up in the vocab data frame to recover a readable word for each stem:

In [None]:
#sort the terms of each cluster center by descending weight, and grab the indices to iterate through below
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

cents_words = [] #to collect words for chart legend

for i in range(cnum): #number of clusters
    cent = []
    print("Cluster %i words:" % i, end='')
    
    for ind in order_centroids[i, :7]: #ind is index, replace 7 with n words per cluster, how many to choose from centroid
        a = vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0] #look up the term's stems in the dataframe (.ix was removed from pandas)
        cent.append(a)
        
        print(' %a' % a, end=',')
        
    cents_words.append(cent)
    print ()


To see which book summary was assigned to which cluster, we'll zip our titles and clusters together:

In [None]:
print (list(zip(titles,clusters)))

## Plot clusters

To plot the clusters, the distance matrix must be scaled down to two dimensions with multidimensional scaling (MDS):

In [None]:
from sklearn.manifold import MDS

# use two components as we're plotting points in a two-dimensional plane
# "precomputed" because we provide a distance matrix
mds = MDS(n_components=2, dissimilarity="precomputed", random_state = 10)

pos = mds.fit_transform(dist)  # shape (n_samples, n_components), based on distances

xs, ys = pos[:, 0], pos[:, 1] #grabs x and y coordinates from pos (numpy array)

Define colors and labels for plot:

In [None]:
import random

cluster_colors = {}
cluster_names ={}

cols = ["b", "g", "r", "c", "m","y"]

for i in range(cnum): # for each cluster
    #cluster_colors[i] = "#%06x" % random.randint(0, 0xFFFFFF) #random hexadecimal color
    cluster_colors[i] = cols[i]
    cluster_names[i] = ' '.join(cents_words[i][:4])

Plot:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

#create data frame that has the result of the MDS plus the cluster numbers and titles
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters, title=titles)) 

#group by cluster
groups = df.groupby('label')

# set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size, subplots yields a tuple of figure and axes, hence the two assignments
ax.margins(0.15) # Optional, just adds 15% padding to the autoscaling

#iterate through groups to layer the plot
#note that I use the cluster_name and cluster_color dicts with the 'name' lookup to return the appropriate color/label
for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=12, #marker size
            label=cluster_names[name], color=cluster_colors[name], 
            mec='none') #'marker edge color'
    
ax.legend(numpoints=1)  #show legend with only 1 point

#add label in x,y position with the label as the book title
for i in range(len(df)):
    ax.text(df.loc[i, 'x'], df.loc[i, 'y'], df.loc[i, 'title'], size=7)  

    
    
#uncomment the below to save the plot if need be (call savefig before plt.show(), otherwise the saved image may be blank)
#plt.savefig('clusters_small_noaxes.png', dpi=200)

plt.show() #show the plot

We can also cluster hierarchically via Ward's method (https://en.wikipedia.org/wiki/Ward%27s_method):

In [None]:
from scipy.cluster.hierarchy import ward, dendrogram

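linkage_matrix = ward(dist) #define the linkage_matrix using ward clustering; note that scipy treats a square 2D array as observation vectors, so each row of dist is used as a document's feature vector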

fig, ax = plt.subplots(figsize=(15, 40)) # set size
ax = dendrogram(linkage_matrix, orientation="right", labels=titles);

plt.tick_params(labelbottom=False) #hide the bottom tick labels (newer matplotlib expects a boolean here)

plt.tight_layout() #show plot with tight layout

#uncomment below to save figure
# plt.savefig('ward_clusters.png', dpi=200) #save figure as ward_clusters

# 3) Topic Modeling with gensim

We can compare the above results with topic modeling results. There are two popular choices for models here: Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA). LDA is a more complex process, and thus takes more resources and longer to run, but has higher accuracy. LSI is a much simpler process and can be run quite quickly.
- LSI looks at the words in a document and their relationships to other words, with the important assumption that every word can only mean one thing. (cf. https://en.wikipedia.org/wiki/Latent_semantic_indexing)
- LDA seeks to remedy this fault by allowing words to belong to multiple topics: words are first grouped into topics, and each document is then compared against every topic to determine the best fit. (cf. https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

In [None]:
from gensim import corpora, models, similarities 
from nltk.corpus import stopwords
from string import punctuation

stops = set(stopwords.words("english")) #build the stop word set once instead of once per token

sums_tm = [tokenize_only(x) for x in sums]
sums_tm = [[stemmer.stem(x) for x in i if x not in more_stops] for i in sums_tm]
sums_tm = [[x for x in i if x not in stops and x not in punctuation] for i in sums_tm]

We first create a dictionary that maps each word to an integer ID and tracks how many documents each word appears in.

In [None]:
#create a Gensim dictionary from the texts
dictionary = corpora.Dictionary(sums_tm)

#remove extremes (similar to the min/max df step used when creating the tf-idf matrix)
#no_below is absolute # of docs, no_above is fraction of corpus
dictionary.filter_extremes(no_below=40, no_above=.70)
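
To illustrate what the two thresholds do, here is a small pure-Python sketch of document-frequency filtering. It mirrors the idea behind no_below and no_above but is not gensim's implementation; `filter_by_doc_freq` and the toy documents are invented for this example.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Return the tokens found in at least no_below docs and at most no_above (a fraction) of docs."""
    # count, for each token, the number of documents that contain it
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, n in doc_freq.items() if no_below <= n <= max_docs}

# hypothetical tokenized documents
toy_docs = [
    ["war", "love", "king"],
    ["war", "love", "ship"],
    ["war", "king", "farm"],
    ["war", "love", "town"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "war" is dropped because it appears in every document, and "ship", "farm" and "town" are dropped because each appears in only one.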

The corpus we now create with doc2bow represents each document as a sparse vector: a list of (word ID, frequency) pairs for the words from the dict that occur in it.

In [None]:
#convert each document to a bag of words using the dictionary
corpus = [dictionary.doc2bow(i) for i in sums_tm]
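
The bag-of-words representation can be sketched in a few lines of plain Python. This is only an illustration of the idea, with a made-up vocabulary and a hypothetical helper `toy_doc2bow`; gensim's own doc2bow works from the Dictionary object built above.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens missing from token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

toy_token2id = {"farm": 0, "pig": 1, "war": 2}  # hypothetical vocabulary
toy_bow = toy_doc2bow(["pig", "farm", "pig", "windmill"], toy_token2id)
print(toy_bow)
```

Word order is discarded and only counts survive; "windmill" is skipped because it is not in the vocabulary, just as words removed by filter_extremes are skipped in the real corpus.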

## LSI

For LSI we first create a tf-idf model, similar to the one used above for clustering:

In [None]:
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

We then build the model:

In [None]:
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=6)
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(6)

Not so great. Let's try LDA.

## LDA

The LDA model can be built immediately:

In [None]:
#we run chunks of 15 books, and update after every 2 chunks, and make 10 passes
lda = models.LdaModel(corpus, num_topics=6, 
                            update_every=2,
                            id2word=dictionary, 
                            chunksize=15, 
                            passes=10)

lda.show_topics()

In [None]:
corpus_lda = lda[corpus] #the LDA model was trained on the bag-of-words corpus, so apply it to the same representation
for i,doc in enumerate(corpus_lda): # the bow->lda transformation is actually executed here, on the fly
    print(titles[i],doc)
    print ()

For more with gensim, see the tutorials here: https://radimrehurek.com/gensim/tutorial.html