# Gensim Topic Modeling

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

LDA treats each document as a mixture of topics in certain proportions, and each topic as a mixture of keywords, again in certain proportions.
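
To make the two levels of mixture concrete, here is a minimal plain-Python sketch. The topic names, word probabilities and topic proportions are made-up illustrative numbers (a trained LDA model would estimate them); the only point is that a document's word distribution is the topic-weighted sum of the topics' word distributions.

```python
# Hypothetical numbers for illustration only; LDA would estimate these from data.
topic_word = {
    "hci":    {"user": 0.5, "system": 0.3, "graph": 0.1, "trees": 0.1},
    "graphs": {"user": 0.1, "system": 0.1, "graph": 0.4, "trees": 0.4},
}
doc_topic = {"hci": 0.75, "graphs": 0.25}  # topic proportions of one document


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())


vocabulary = ["user", "system", "graph", "trees"]
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}
print(doc_word)
```

Because both levels are proper probability distributions, the resulting word probabilities for the document also sum to 1.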

In [40]:
import gensim
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.models.hdpmodel import HdpModel  # Hierarchical Dirichlet Process
from gensim.models.wrappers import LdaVowpalWabbit, LdaMallet  # wrappers were removed in gensim 4.0; needs gensim < 4
from gensim.corpora.dictionary import Dictionary
from numpy import array

In [41]:
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

In [43]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(texts, min_count=5, threshold=100)  # a higher threshold yields fewer phrases
trigram = gensim.models.Phrases(bigram[texts], threshold=100)


In [46]:
# The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus.
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words: (token_id, count) pairs per document

In [47]:
corpus

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]
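
To read the output above: each document becomes a sorted list of `(token_id, count)` pairs. The following plain-Python sketch mimics that conversion for this toy corpus. The helper names `build_vocab` and `to_bow` are hypothetical, and the id assignment order (new tokens of each document, alphabetically) is an assumption chosen because it reproduces the ids printed above; it is not a copy of gensim's implementation.

```python
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]


def build_vocab(texts):
    """Assign increasing ids to unseen tokens, document by document."""
    token2id = {}
    for doc in texts:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Count each token of one document and return sorted (id, count) pairs."""
    counts = {}
    for token in doc:
        token_id = token2id[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


token2id = build_vocab(texts)
bow_corpus = [to_bow(doc, token2id) for doc in texts]
print(bow_corpus[3])  # 'system' occurs twice in the fourth document
```

The fourth document is the only one with a count above 1, because `'system'` appears twice in it.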

In [48]:
# with 50 iterations
goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=50, num_topics=2)

# with one iteration only
badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=1, num_topics=2)

### Topic Coherence
Topic coherence is a measure used to evaluate topic models, i.e. methods that automatically generate topics from a collection of documents using latent variable models.
Each generated topic consists of words, and topic coherence is computed over the top N words of the topic. It is defined as the average or median of the pairwise word-similarity scores (e.g. PMI) of those words.

A good model generates coherent topics, i.e. topics with high coherence scores. Good topics can be described by a short label, so this is what the coherence measure should capture.

I'm not an expert, but I think the scores are mainly comparative: if topic B has a higher coherence score than topic A, it is 'better' (more coherent).
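
As a rough illustration of "average of pairwise word scores", here is a simplified UMass-style coherence in plain Python. It is a sketch, not gensim's `CoherenceModel`: the function names are hypothetical, the smoothing constant `eps=1.0` is an assumption, and every topic word is assumed to occur in at least one document.

```python
import math
from itertools import combinations

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]


def doc_frequency(words, texts):
    """Number of documents that contain every word in `words`."""
    return sum(1 for doc in texts if all(w in doc for w in words))


def umass_style_coherence(topic_words, texts, eps=1.0):
    """Mean over word pairs of log((co-document count + eps) / document count).

    `topic_words` is ordered from most to least important; for each pair the
    denominator is the document count of the higher-ranked word.
    """
    scores = []
    for earlier, later in combinations(topic_words, 2):
        co_docs = doc_frequency([earlier, later], texts)
        scores.append(math.log((co_docs + eps) / doc_frequency([earlier], texts)))
    return sum(scores) / len(scores)


graph_topic = ['graph', 'trees', 'minors']   # words that co-occur often
mixed_topic = ['graph', 'human', 'time']     # words that never co-occur
graph_score = umass_style_coherence(graph_topic, texts)
mixed_score = umass_style_coherence(mixed_topic, texts)
print(graph_score, mixed_score)
```

The topic whose words share documents gets the higher (less negative) score, which matches the comparative reading of coherence above.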

In [49]:
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
print(goodcm.get_coherence(),badcm.get_coherence())

0.3838413553737203 0.3838413553737203


In [50]:
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='u_mass')
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='u_mass')
print(goodcm.get_coherence(),badcm.get_coherence())

-14.664627001047737 -14.726061108336854


In [53]:
# Compute Perplexity
goodLdaModel.log_perplexity(corpus)  # returns the per-word likelihood bound; perplexity = 2**(-bound), and lower perplexity is better

-3.0212114083512485
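
The number above is a per-word log-likelihood bound rather than the perplexity itself. A minimal sketch of the conversion, assuming the base-2 relation perplexity = 2^(-bound) that gensim uses when it logs its perplexity estimate (the function name is hypothetical):

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate."""
    return 2 ** (-per_word_bound)


bound = -3.0212114083512485  # value returned by log_perplexity above
perplexity = bound_to_perplexity(bound)
print(perplexity)
```

A bound closer to zero therefore corresponds to a lower perplexity, i.e. a better fit on the evaluated corpus.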

## Visualize topic models

Each bubble in the left-hand plot represents a topic. The larger the bubble, the more prevalent the topic.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart rather than clustered in one quadrant.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

In [35]:
import pyLDAvis.gensim  # newer pyLDAvis releases ship this module as pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
pyLDAvis.gensim.prepare(goodLdaModel, corpus, dictionary)

In [60]:
# save html file
vis = pyLDAvis.gensim.prepare(goodLdaModel, corpus, dictionary)
html = pyLDAvis.prepared_data_to_html(vis)

with open("LDA_output.html", "w") as f:
    f.write(html)

In [37]:
topics = []
for topic_id, topic in goodLdaModel.show_topics(num_topics=10,formatted=False):
    topic = [word for word, _ in topic]
    topics.append(topic)

In [38]:
topics

[['graph',
  'trees',
  'minors',
  'survey',
  'computer',
  'interface',
  'human',
  'system',
  'time',
  'response'],
 ['system',
  'user',
  'eps',
  'response',
  'time',
  'human',
  'interface',
  'computer',
  'survey',
  'trees']]

# Mallet’s version of the LDA algorithm

In [57]:
# Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = r'E:\projects\mallet-2.0.8\bin\mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=dictionary)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Sabeeha\\AppData\\Local\\Temp\\9d5505_state.mallet.gz'

In [None]:
# Show Topics
from pprint import pprint
pprint(ldamallet.show_topics(formatted=False))

In [None]:
# Compute Coherence Score
# use the texts and dictionary built earlier in this notebook
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)

### How to find the optimal number of topics for LDA?
My approach to finding the optimal number of topics is to build many LDA models with different numbers of topics (k) and pick the one that gives the highest coherence value.
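
A minimal sketch of that search, under a few assumptions: the helper names `pick_best_k` and `lda_coherence` are hypothetical, `c_v` is used as the coherence measure, and `random_state=0` is fixed only so that runs with different k are comparable. The gensim import is done inside `lda_coherence` so that the selection logic can be tried with any scoring function.

```python
def pick_best_k(candidate_ks, score_for_k):
    """Score every candidate k and return (best_k, {k: score})."""
    scores = {k: score_for_k(k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def lda_coherence(k, corpus, dictionary, texts):
    """Train one LDA model with k topics and return its c_v coherence."""
    # Imported lazily so pick_best_k stays usable without gensim installed.
    from gensim.models import CoherenceModel, LdaModel

    model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                     random_state=0)  # fixed seed: an assumption for comparability
    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                        coherence='c_v')
    return cm.get_coherence()


# Stand-in scorer that peaks at k=3, used here only to exercise the selection step.
best_k, scores = pick_best_k([2, 3, 4], lambda k: -(k - 3) ** 2)
print(best_k, scores)
```

With the objects built earlier in this notebook, the real search would look like `pick_best_k(range(2, 7), lambda k: lda_coherence(k, corpus, dictionary, texts))`. On a corpus as small as this one the scores are noisy, so the chosen k should be treated as a hint rather than a firm answer.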