# Complete Guide to Topic Modeling

https://nlpforhackers.io/topic-modeling/

Definitions:

C: collection of documents containing N texts.
V: vocabulary (the set of unique words in the collection)

Dimensionality Reduction
Topic modeling is a form of dimensionality reduction. Rather than representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in V}, we can represent the text in its topic space as {Topic_i: weight(Topic_i, T) for Topic_i in Topics}. Notice that we’re using Topics to represent the set of all topics.
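
To make the two representations concrete, here is a minimal sketch using only the standard library. The toy sentence and the two topic weights are made-up values for illustration; a real model would learn the weights.

```python
from collections import Counter

# Toy text (illustrative only)
text = "the economy grew while the market fell"
tokens = text.split()
vocabulary = sorted(set(tokens))

# Feature space: one dimension per vocabulary word
word_counts = Counter(tokens)
feature_repr = {word: word_counts[word] for word in vocabulary}

# Topic space: one dimension per topic (hypothetical weights, not model output)
topic_repr = {"topic_0": 0.9, "topic_1": 0.1}

print(feature_repr)
print(len(feature_repr), "dimensions ->", len(topic_repr), "dimensions")
```

On a real collection the vocabulary has thousands of entries while the number of topics stays small, which is where the reduction comes from.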

Unsupervised Learning
Topic modeling can be easily compared to clustering. As in the case of clustering, the number of topics, like the number of clusters, is a hyperparameter. By doing topic modeling we build clusters of words rather than clusters of texts. A text is thus a mixture of all the topics, each having a certain weight.

A Form of Tagging
If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.
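
One possible heuristic for turning weighted topics into tags is a simple weight threshold. The sketch below assumes a hypothetical `topics_to_tags` helper, hand-written labels and an arbitrary threshold of 0.2; none of these come from a library.

```python
def topics_to_tags(topic_weights, labels, threshold=0.2):
    """Return the labels of topics whose weight is at least `threshold`,
    strongest topic first."""
    ranked = sorted(topic_weights, key=lambda pair: pair[1], reverse=True)
    return [labels[topic_id] for topic_id, weight in ranked if weight >= threshold]


# Human-assigned labels for three topics (illustrative)
labels = {0: "sports", 1: "economy", 2: "politics"}
# (topic_id, weight) pairs, in the format gensim returns for a document
weights = [(0, 0.05), (1, 0.60), (2, 0.35)]

print(topics_to_tags(weights, labels))  # ['economy', 'politics']
```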

# Topic Modeling Algorithms

There are several algorithms for doing topic modeling. The most popular ones include

LDA – Latent Dirichlet Allocation – The one we’ll be focusing on in this tutorial. Its foundations are Probabilistic Graphical Models
LSA or LSI – Latent Semantic Analysis or Latent Semantic Indexing – Uses Singular Value Decomposition (SVD) on the Document-Term Matrix. Based on Linear Algebra
NMF – Non-Negative Matrix Factorization – Based on Linear Algebra

Here are some things all these algorithms have in common:

The number of topics (n_topics) as a parameter. None of the algorithms can infer the number of topics in the document collection.

All of the algorithms have as input the Document-Word Matrix (or Document-Term Matrix). DWM[i][j] = The number of occurrences of word_j in document_i

All of them output 2 matrices: WTM (Word Topic Matrix) and TDM (Topic Document Matrix). The matrices are significantly smaller and the result of their multiplication should be as close as possible to the original DWM matrix.
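
The shapes involved can be sketched in plain Python. The toy documents and the two factor matrices below are hand-made for illustration (a real algorithm would learn the factors), and the 500 x 10,000 figures are assumed round numbers, not measurements.

```python
# Three toy documents, already tokenized
docs = [["cat", "dog", "cat"], ["stock", "market"], ["dog", "market", "stock", "stock"]]
vocab = sorted({word for doc in docs for word in doc})

# Document-Word Matrix: dwm[i][j] = occurrences of vocab[j] in docs[i]
dwm = [[doc.count(word) for word in vocab] for doc in docs]
print(vocab)
print(dwm)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hand-made factors for 2 topics: (documents x topics) and (topics x words)
doc_topic = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.7]]
topic_word = [[2.0, 1.0, 0.0, 0.0], [0.0, 0.5, 1.0, 1.5]]
approx = matmul(doc_topic, topic_word)  # same shape as dwm

# Why the factors are smaller: assumed sizes of 500 documents, 10,000 words, 10 topics
n_docs, n_words, n_topics = 500, 10_000, 10
print(n_docs * n_words, "entries vs", n_topics * (n_docs + n_words))
```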

## Using Gensim for Topic Modeling

In [1]:
from nltk.corpus import brown
 
data = []
 
for fileid in brown.fileids():
    document = ' '.join(brown.words(fileid))
    data.append(document)
 
NO_DOCUMENTS = len(data)
print(NO_DOCUMENTS)
print(data[:5])

500


Older Gensim releases don’t ship an NMF implementation (newer ones add `gensim.models.nmf.Nmf`), so in this section we’re only going to play with the LDA and LSI (Latent Semantic Indexing, AKA Latent Semantic Analysis) models.

In [1]:
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

In [4]:
NUM_TOPICS = 10
STOPWORDS = stopwords.words('english')
 
def clean_text(text):
    tokenized_text = word_tokenize(text.lower())
    # Raw string avoids the invalid "\-" escape warning; keep tokens of 3+ letters/hyphens
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match(r'[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    return cleaned_text
 
# For gensim we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in data:
    tokenized_data.append(clean_text(text))

In [5]:
# Build a Dictionary - association word to numeric id
dictionary = corpora.Dictionary(tokenized_data)
 
# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
 
# Have a look at what the document at index 20 looks like: [(word_id, count), ...]
print(corpus[20])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...
 
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
 
# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

[(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2), ...

Let’s now display the topics the two models have inferred:

In [6]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for this topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)
 
print("LSI Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for this topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
 
print("=" * 20)

LDA Model:
Topic #0: 0.007*"would" + 0.005*"one" + 0.004*"could" + 0.004*"said" + 0.003*"time" + 0.003*"may" + 0.003*"like" + 0.002*"back" + 0.002*"first" + 0.002*"two"
Topic #1: 0.006*"one" + 0.004*"would" + 0.004*"two" + 0.003*"could" + 0.003*"time" + 0.003*"said" + 0.003*"new" + 0.002*"man" + 0.002*"also" + 0.002*"even"
Topic #2: 0.006*"one" + 0.004*"would" + 0.003*"could" + 0.003*"said" + 0.003*"new" + 0.003*"man" + 0.003*"time" + 0.002*"may" + 0.002*"two" + 0.002*"even"
Topic #3: 0.005*"one" + 0.005*"would" + 0.004*"said" + 0.003*"first" + 0.003*"two" + 0.003*"may" + 0.003*"new" + 0.003*"time" + 0.002*"could" + 0.002*"well"
Topic #4: 0.006*"one" + 0.004*"would" + 0.003*"could" + 0.003*"said" + 0.003*"new" + 0.003*"may" + 0.002*"even" + 0.002*"like" + 0.002*"man" + 0.002*"must"
Topic #5: 0.006*"one" + 0.006*"would" + 0.004*"said" + 0.003*"could" + 0.003*"time" + 0.003*"like" + 0.003*"new" + 0.002*"two" + 0.002*"even" + 0.002*"first"
Topic #6: 0.005*"would" + 0.005*"one" + 0.004*"sa

Let’s now put the models to work and transform unseen documents to their topic distribution:

In [11]:
text = "The economy is working better than ever"
bow = dictionary.doc2bow(clean_text(text))
 
print(lsi_model[bow])
print("=" * 110) 
print(lda_model[bow])


[(0, 0.09161404414187316), (1, 0.00877510043017405), (2, 0.016276655235545535), (3, 0.04055827557842265), (4, -0.014087372248488522), (5, -0.010998552017034986), (6, -0.029797620426955575), (7, 0.014716167964050755), (8, 0.05813478939785562), (9, -0.024468321686409136)]
[(0, 0.020011283), (1, 0.02001212), (2, 0.020011777), (3, 0.020011794), (4, 0.02001181), (5, 0.020011764), (6, 0.81989455), (7, 0.02001151), (8, 0.020011712), (9, 0.020011656)]


The LDA result can be interpreted as a distribution over topics. Let’s take an example:
[(0, 0.020229582), (1, 0.48642197), (2, 0.020894188), (3, 0.020058075), (4, 0.022410348), (5, 0.025939714), (6, 0.20046122), (7, 0.13457063), (8, 0.048185956), (9, 0.02082831)]. This result suggests that topic 1 has the strongest representation in this text.
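
Reading such a result programmatically is a one-liner. The sketch below reuses the example distribution quoted above and picks the strongest topics; the variable names are illustrative.

```python
# The example LDA output quoted above: (topic_id, weight) pairs
lda_result = [(0, 0.020229582), (1, 0.48642197), (2, 0.020894188), (3, 0.020058075),
              (4, 0.022410348), (5, 0.025939714), (6, 0.20046122), (7, 0.13457063),
              (8, 0.048185956), (9, 0.02082831)]

# Topic with the largest weight
dominant_topic, dominant_weight = max(lda_result, key=lambda pair: pair[1])
print(dominant_topic, dominant_weight)  # 1 0.48642197

# Three strongest topics, strongest first
top_three = [topic for topic, _ in sorted(lda_result, key=lambda pair: -pair[1])[:3]]
print(top_three)  # [1, 6, 7]

# The weights form a probability distribution, so they sum to (almost exactly) 1
total = sum(weight for _, weight in lda_result)
```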

Gensim offers a simple way of performing similarity queries using topic models.

In [12]:
from gensim import similarities
 
# Index the whole corpus in topic space (MatrixSimilarity uses cosine similarity)
lda_index = similarities.MatrixSimilarity(lda_model[corpus], num_features=NUM_TOPICS)
 
# Let's perform a query with the unseen document from above.
# The result is named `sims` so it doesn't shadow the `similarities` module.
sims = lda_index[lda_model[bow]]
# Sort by decreasing similarity
sims = sorted(enumerate(sims), key=lambda item: -item[1])
 
# Top 10 most similar documents:
print(sims[:10])
# [(104, 0.87591344), (178, 0.86124849), (31, 0.8604598), (77, 0.84932965), (85, 0.84843522), (135, 0.84421808), (215, 0.84184396), (353, 0.84038532), (254, 0.83498049), (13, 0.82832891)]
 
# Let's see what the most similar document is
document_id, similarity = sims[0]
print(data[document_id][:1000])
 
 

[(69, 0.99815696), (333, 0.99780756), (96, 0.9976221), (141, 0.9976192), (90, 0.9976164), (387, 0.997598), (368, 0.9975539), (414, 0.9975468), (137, 0.99754524), (289, 0.9975412)]
Tenure as criterion I would like to add one more practical reform to those mentioned by Russell Kirk ( Dec. 16 ) . It has to do with teachers' salaries and tenure . Next September , after receiving a degree from Yale's Master of Arts in Teaching Program , I will be teaching somewhere -- that much is guaranteed by the present shortage of mathematics teachers . I will also be underpaid . The amazing thing is that this too is caused by the dearth of teachers . Teaching is at present a sellers' market ; ; as a result buyers , the public , must be satisfied with second-rate teachers . But this is not the real problem ; ; the rub arises from the fact that teachers are usually paid on the basis of time served rather than quality . Hence all teachers , good and bad , who have been teaching for a given number of years
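
The scores returned by `MatrixSimilarity` are cosine similarities between topic vectors. As a rough illustration of what that measure does, here is a plain-Python sketch; the `cosine_similarity` helper and the three-topic vectors are made up for this example and are not Gensim's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical topic distributions over 3 topics
query = [0.8, 0.1, 0.1]
documents = {"doc_a": [0.7, 0.2, 0.1], "doc_b": [0.1, 0.1, 0.8]}

# Rank documents by decreasing similarity to the query
ranked = sorted(documents, key=lambda name: cosine_similarity(query, documents[name]),
                reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

Documents dominated by the same topic as the query end up with scores close to 1, whatever their length.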

## Using Scikit-Learn for Topic Modeling

Let’s now go through the same process with sklearn. This library offers an NMF implementation as well. The algorithms are more bare-bones than what we’ve seen with gensim but on the plus side, they implement the fit/transform interface we’re used to:

In [13]:
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)

In [14]:
# Build a Latent Dirichlet Allocation Model
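# Note: the old `n_topics` argument was replaced by `n_components` in newer scikit-learn releases
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')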
lda_Z = lda_model.fit_transform(data_vectorized)
print(lda_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=NUM_TOPICS)
nmf_Z = nmf_model.fit_transform(data_vectorized)
print(nmf_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)
 
# Build a Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=NUM_TOPICS)
lsi_Z = lsi_model.fit_transform(data_vectorized)
print(lsi_Z.shape)  # (NO_DOCUMENTS, NO_TOPICS)



(500, 10)
(500, 10)
(500, 10)


In [15]:
# Let's see what the first document in the corpus looks like in the different topic spaces
print(lda_Z[0])
print(nmf_Z[0])
print(lsi_Z[0])

[4.98712693e-02 1.05604586e-04 9.49283817e-01 1.05618402e-04
 1.05600389e-04 1.05627028e-04 1.05626691e-04 1.05606887e-04
 1.05606247e-04 1.05623855e-04]
[0.         0.         2.11872507 0.0769166  0.         0.54506651
 1.06367111 0.         0.         0.24670294]
[ 23.30684263   1.59492702  21.80016313  -0.02799718   0.81778015
  11.49415067   4.2087838   -2.10492522   1.52576105 -13.87185573]


In order to inspect the inferred topics we need to implement a print function ourselves:

In [16]:
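def print_topics(model, vectorizer, top_n=10):
    # get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the top_n largest weights, in decreasing order
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])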
 
print("LDA Model:")
print_topics(lda_model, vectorizer)
print("=" * 20)
 
print("NMF Model:")
print_topics(nmf_model, vectorizer)
print("=" * 20)
 
print("LSI Model:")
print_topics(lsi_model, vectorizer)
print("=" * 20)
 

LDA Model:
Topic 0:
[('new', 211.4103191509502), ('use', 187.11068067058116), ('cost', 155.55392362884064), ('small', 152.92793051498248), ('state', 145.84969104356153), ('water', 142.58626965029853), ('time', 139.50332837865864), ('development', 128.04228218027922), ('year', 127.5000534385448), ('used', 124.97211700875577)]
Topic 1:
[('seeds', 35.53687475461216), ('used', 21.88988380717902), ('seed', 21.0521918800328), ('oil', 19.795515593503634), ('exercise', 16.217306405046546), ('meat', 15.98503533434088), ('mustard', 14.584594919016043), ('nuts', 14.476866340597718), ('make', 13.457807953640371), ('mason', 13.313098302569415)]
Topic 2:
[('new', 945.1013573306385), ('world', 594.8166058560859), ('said', 594.0673789127343), ('time', 548.0521460942985), ('people', 546.265380267223), ('man', 528.9406465895836), ('years', 496.30122617254625), ('state', 459.03473746624957), ('american', 447.747100096487), ('life', 415.0829574975654)]
Topic 3:
[('game', 57.85309355628941), ('ball', 56.48

[('united', 0.28155868346978163), ('states', 0.23806544311041702), ('mrs', 0.20069277134381544), ('shall', 0.19390926400240588), ('government', 0.17959151820884245), ('school', 0.14230449725378613), ('section', 0.1248979045264043), ('agreement', 0.11674711020580848), ('act', 0.11435520594820868), ('india', 0.1017802644256832)]
Topic 8:
[('form', 0.32396908240788747), ('dictionary', 0.30611260177489263), ('information', 0.30092847548143026), ('text', 0.23123927971701877), ('cell', 0.1943972232348597), ('forms', 0.19140887329034706), ('year', 0.1758021680849887), ('tax', 0.1470288085152132), ('list', 0.1368947442261754), ('said', 0.13462323899700593)]
Topic 9:
[('fiscal', 0.26434907920711115), ('year', 0.25325495398663667), ('tax', 0.193211900821829), ('school', 0.15163342665263502), ('states', 0.13069770984575943), ('like', 0.11041909908884896), ('time', 0.10618415952255968), ('years', 0.0839696159626546), ('children', 0.08192516375640395), ('child', 0.08180583969926397)]


Transforming an unseen document goes like this:

In [17]:
text = "The economy is working better than ever"
x = nmf_model.transform(vectorizer.transform([text]))[0]
print(x) 

[0.00289836 0.         0.         0.         0.         0.00440548
 0.         0.         0.         0.0046761 ]


Here’s how to implement the similarity functionality we’ve seen in the gensim section:

In [18]:
from sklearn.metrics.pairwise import euclidean_distances
 
def most_similar(x, Z, top_n=5):
    # Distance from x to every row of Z; a smaller distance means a more similar document
    dists = euclidean_distances(x.reshape(1, -1), Z)
    pairs = enumerate(dists[0])
    return sorted(pairs, key=lambda item: item[1])[:top_n]
 
closest = most_similar(x, nmf_Z)
document_id, distance = closest[0]
print(data[document_id][:1000])
 

Livery stable -- J. Vernon , prop. '' . Coaching had declined considerably by 1905 , but the sign was still there , near the old Wells Fargo building in San Francisco , creaking in the fog as it had for thirty years . John Vernon had had all the patronage he cared for -- he had prospered , but he could not retire from horsedom . Coaching was in his blood . He had two interests in life : the pleasures of the table and driving . Twice a week he drove his tallyho over the Santa Cruz road , upland and through the redwood forest , with orchards below him at one hand , and glimpses of the Pacific at the other . The journey back he made along the coast road , traveling hell-for-leather , every lantern of the tallyho ablaze . The southward route was the classic run in California , and the most fashionable . His patronage on this stretch was made up largely of San Franciscans -- regulars , most of them , and trenchermen like himself . They did not complain at the inhuman hour of starting ( seve

### Plotting words and documents in 2D with SVD

We can use SVD with 2 components (topics) to display words and documents in 2D. The process is really similar. Let’s start with displaying documents since it’s a bit more straightforward.

In case you are running this in a Jupyter Notebook, run the following lines to init bokeh:

In [19]:
import pandas as pd
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()

In [20]:
svd = TruncatedSVD(n_components=2)
documents_2d = svd.fit_transform(data_vectorized)
 
df = pd.DataFrame(columns=['x', 'y', 'document'])
df['x'], df['y'], df['document'] = documents_2d[:,0], documents_2d[:,1], range(len(data))
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="document", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
# Bokeh 3.x takes width/height; older releases used plot_width/plot_height
plot = figure(width=600, height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

You can try going through the documents to see if indeed closer documents on the plot are more similar. To display words in 2D we just need to transpose the vectorized data: words_2d = svd.fit_transform(data_vectorized.T).

In [21]:
svd = TruncatedSVD(n_components=2)
words_2d = svd.fit_transform(data_vectorized.T)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
df['x'], df['y'], df['word'] = words_2d[:,0], words_2d[:,1], vectorizer.get_feature_names_out()
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="8pt", text_color="#555555",
                  source=source, text_align='center')
 
# Bokeh 3.x takes width/height; older releases used plot_width/plot_height
plot = figure(width=600, height=600)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

## More about Latent Dirichlet Allocation

LDA is the most popular method for doing topic modeling in real-world applications. That is because it provides accurate results, can be trained online (so we don’t have to retrain from scratch every time we get new data) and can be run on multiple cores. Let’s repeat the process we did in the previous sections with sklearn and LatentDirichletAllocation:

In [22]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
 
NUM_TOPICS = 10
 
vectorizer = CountVectorizer(min_df=5, max_df=0.9, 
                             stop_words='english', lowercase=True, 
                             token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(data)
 
# Build a Latent Dirichlet Allocation Model
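# Note: the old `n_topics` argument was replaced by `n_components` in newer scikit-learn releases
lda_model = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online')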
lda_Z = lda_model.fit_transform(data_vectorized)
 
text = "The economy is working better than ever"
x = lda_model.transform(vectorizer.transform([text]))[0]
print(x, x.sum())



[0.02500852 0.02500005 0.02500639 0.77495565 0.02500857 0.0250028
 0.02500677 0.02500652 0.02500004 0.02500469] 1.0


Notice how the factors corresponding to each component (topic) add up to 1. That’s not a coincidence. Indeed, LDA considers documents as being generated by a mixture of the topics. The purpose of LDA is to compute how much of the document was generated by which topic. In this example, about 77% of the document is attributed to the fourth topic (index 3).

LDA is an iterative algorithm. Here are the two main steps:

1. In the initialization stage, each word is assigned to a random topic.
2. Iteratively, the algorithm goes through each word and reassigns the word to a topic taking into consideration:
   - the probability of the word belonging to a topic
   - the probability of the document being generated by a topic
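
The two steps above can be sketched as a toy collapsed Gibbs sampler, which is one common way of fitting LDA. This is an illustration under stated assumptions, not what the libraries run: scikit-learn's `LatentDirichletAllocation` and Gensim's `LdaModel` use variational Bayes instead. The function name `toy_lda_gibbs`, the smoothing values `alpha` and `beta`, and the toy documents are all made up for this sketch.

```python
import random


def toy_lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                    # words per topic, per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                                  # total words per topic
    assignments = []

    # Step 1: assign every word occurrence to a random topic
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    # Step 2: repeatedly resample the topic of each word occurrence
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                z = assignments[d][i]
                # Take the current assignment out of the counts
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # weight = P(word | topic) * P(topic | document), both smoothed
                weights = [
                    (topic_word[k][word] + beta) / (topic_total[k] + vocab_size * beta)
                    * (doc_topic[d][k] + alpha)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    # Per-document topic mixture; each row sums to 1
    mixtures = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return mixtures, assignments


docs = [["ball", "game", "team", "game"],
        ["tax", "fiscal", "year", "tax"],
        ["game", "team", "tax", "year"]]
mixtures, assignments = toy_lda_gibbs(docs, num_topics=2)
for mixture in mixtures:
    print([round(weight, 3) for weight in mixture])
```

The final normalization is why the per-document weights add up to 1.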

Because LDA gives us a topic distribution for each document and a word distribution for each topic, its results are easy to visualize. We’re going to use a specialized tool called pyLDAvis:

In [23]:
import pyLDAvis.sklearn  # newer pyLDAvis releases (3.4+) rename this module to pyLDAvis.lda_model
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, data_vectorized, vectorizer, mds='tsne')
panel
 



Let’s interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider:

Larger topics are more frequent in the corpus.

Topics closer together are more similar, topics further apart are less similar.

When you select a topic, you can see the most representative words for the selected topic. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.

Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.
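
The slider trades off word frequency against discriminative power. A minimal sketch of that trade-off, assuming the relevance formula from the LDAvis paper, λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), is shown below. The `relevance` helper and the probabilities are illustrative values, not output from the model above.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """Relevance of a word for a topic: lam weights raw frequency within the topic,
    (1 - lam) weights how much more likely the word is in the topic than overall."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))


# Hypothetical probabilities: "said" is frequent everywhere, "seeds" is topic-specific
words = {
    "said": {"p_word_given_topic": 0.02, "p_word": 0.02},
    "seeds": {"p_word_given_topic": 0.01, "p_word": 0.0005},
}

for lam in (1.0, 0.0):
    ranked = sorted(words, key=lambda w: relevance(lam=lam, **words[w]), reverse=True)
    print("lambda =", lam, "->", ranked)
```

With λ = 1 the frequent word ranks first; with λ = 0 the topic-specific word does.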