<a href="https://colab.research.google.com/github/nitinpunjabi/nlp-demystified/blob/main/notebooks/nlpdemyst_topic_modelling_lda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Demystified | Latent Dirichlet Allocation
https://nlpdemystified.org<br>
https://github.com/nitinpunjabi/nlp-demystified

# spaCy upgrade and package installation.

At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy.
<br><br>
**IMPORTANT**<br>
If you're running this for free in the cloud rather than using a paid tier or a local Jupyter server on your machine, then the notebook will *time out* after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical package(s).
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm

In [None]:
!python -m spacy info

For topic modelling, we'll use **Gensim**, a popular topic modelling library originally authored by Radim Řehůřek. It has implementations of LDA and other models we'll use later in the course.<br>
https://radimrehurek.com/gensim/index.html

In [None]:
import spacy
spacy.prefer_gpu()

In [None]:
# Upgrade gensim just in case.
!pip install -U gensim==4.*

# First pass at building an LDA topic model for our corpus

We'll use a corpus of over 2,000 Associated Press news articles compiled by David M. Blei, one of the authors of the original LDA paper.<br>
<br>
The original paper:<br>
https://dl.acm.org/doi/10.5555/944919.944937
<br>


In [None]:
# https://docs.python-requests.org/en/master/
import requests

In [None]:
# Retrieve the articles and put them in a list.
url = 'https://raw.githubusercontent.com/nitinpunjabi/nlp-demystified/main/datasets/ap_articles.txt'
response = requests.get(url)
articles = response.text.splitlines()

In [None]:
articles[0]

In [None]:
# Like before, if we want to use spaCy's tokenizer, we need 
# to create a callback. In this case, we'll start off with a
# blank tokenizer (i.e. no parsing, tagging, etc).
nlp = spacy.blank('en')

# For this exercise, we'll remove punctuation and spaces (which
# includes newlines), filter for tokens consisting of alphabetic
# characters, and return the token text.
def spacy_tokenizer(doc):
  return [t.text for t in nlp(doc) if \
          not t.is_punct and \
          not t.is_space and \
          t.is_alpha]

In [None]:
%%time
# Tokenize all the articles. The %%time cell magic must be the first line of the cell.
tokenized_articles = []
for a in articles:
  tokenized_articles.append(spacy_tokenizer(a))

In [None]:
print(tokenized_articles[0])

To start off, we'll go with 20 topics. With most topic models, including LDA, there isn't a clear recipe for picking the optimal number of topics. The nature and composition of the data (e.g. average length of each document) have a major impact on how many topics are *interpretable* by a human. Often, it's best to start with something reasonable and then try different topic numbers. With ~2,500 documents, 20-50 topics can usually give someone a good idea of the content.

In [None]:
from gensim import models, corpora
NUM_TOPICS = 20

After tokenizing our text, we use Gensim to construct a **Dictionary** mapping words to their integer IDs.<br>
https://radimrehurek.com/gensim/corpora/dictionary.html

In [None]:
# Build a Dictionary of word<-->id mappings.
dictionary = corpora.Dictionary(tokenized_articles)

The next step is to create a frequency bag-of-words from each article.<br>
https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow

In [None]:
corpus_bow = [dictionary.doc2bow(article) for article in tokenized_articles]
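
To make the two steps above concrete, here is a minimal pure-Python sketch of a word-to-ID mapping and a frequency bag-of-words. The helpers `build_vocab` and `to_bow` are hypothetical names used only for illustration. They mimic the shape of Gensim's output (a sorted list of `(token_id, count)` tuples), but Gensim's actual ID assignment may differ.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

toy_docs = [['stock', 'market', 'stock'], ['police', 'market']]
toy_vocab = build_vocab(toy_docs)
toy_bows = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_bows)
```

Note that word order is discarded: each document is reduced to which words it contains and how often.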

Finally, we'll generate our base LDA model. Gensim's LDA model has a large number of optional parameters but for now, we'll keep it simple.<br>
https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#module-gensim.models.ldamodel

In [None]:
%%time
lda_model = models.LdaModel(corpus=corpus_bow, num_topics=NUM_TOPICS, id2word=dictionary, random_state=1)

Once our model is generated, we can view the topics inferred. By default, the model's *print_topics* method shows up to 20 topics along with each topic's ten most significant words.<br>
https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#gensim.models.ldamodel.LdaModel.print_topics

In [None]:
lda_model.print_topics()

The first pass is pretty awful. The topics are so dominated by stop words that they all look essentially the same. Let's see if we can do better.

# Improving preprocessing for better results.

For our next attempt, we'll
- remove stop words using the default list. Given this is a corpus of news articles, there may be other stop words to consider such as salutations ("Mr", "Mrs"), and words related to quotes and thoughts ("say", "think"). But for this, we'll stick to defaults unless we see reason to do otherwise.
- consider only the words the spaCy tagger flags as *nouns, verbs,* and *adjectives*. Including words with only certain POS tags is a common approach to improving topic models.
- take the lemma.

In [None]:
nlp = spacy.load('en_core_web_sm')

def spacy_tokenizer_w_pos(doc):
  return [t.lemma_ for t in nlp(doc) if \
          t.is_alpha and \
          not t.is_punct and \
          not t.is_space and \
          not t.is_stop and \
          t.pos_ in ['NOUN', 'VERB', 'ADJ']]

In [None]:
%%time
# We'll need to retokenize everything and rebuild the BOWs. Because we're now
# using the POS tagger, this will take longer. The "w_pos" in the variable 
# names below just means "with part-of-speech".
tokenized_articles_w_pos = [spacy_tokenizer_w_pos(a) for a in articles]
dictionary_w_pos = corpora.Dictionary(tokenized_articles_w_pos)
corpus_bow_w_pos = [dictionary_w_pos.doc2bow(article) for article in tokenized_articles_w_pos]

In [None]:
lda_model = models.LdaModel(corpus=corpus_bow_w_pos, num_topics=NUM_TOPICS, id2word=dictionary_w_pos, random_state=1)

In [None]:
lda_model.print_topics()

This is better, but a few low-signal words still dominate the topics, such as "say" (the lemma of "said"), which is unsurprising for a news corpus. Perhaps trimming the vocabulary and tuning the model parameters themselves can lead to something more interpretable.

# Trimming low- and high-frequency words.

One thing we can try is filtering out rare and common tokens.<br>
https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes

In [None]:
# The size of the dictionary before filtering.
len(dictionary_w_pos)

The filtering is a bit idiosyncratic. The lower bound is an *absolute* count, and the upper bound is a *fraction* of the corpus. Here, we're filtering out words which occur in fewer than five documents or in more than 50% of the documents.

In [None]:
dictionary_w_pos.filter_extremes(no_below=5, no_above=0.5)
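
As a rough illustration of the filtering rule described above, the sketch below applies the same two bounds to a toy corpus in pure Python. `filter_vocab` is a hypothetical helper, not Gensim's implementation: it ignores Gensim's `keep_n` cap, and its exact handling of boundary values is an assumption.

```python
from collections import Counter

def filter_vocab(tokenized_docs, no_below=5, no_above=0.5):
    """Keep tokens whose document frequency is at least no_below (an absolute
    count) and at most no_above * number_of_documents (a fraction)."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(tokenized_docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ['say', 'market', 'stock'],
    ['say', 'police', 'arrest'],
    ['say', 'market', 'trade'],
    ['say', 'police', 'court'],
]
kept = filter_vocab(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In this toy corpus, "say" is dropped for appearing in every document, and words like "stock" are dropped for appearing in only one.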

In [None]:
# The size of the dictionary after filtering.
len(dictionary_w_pos)

In [None]:
# Rebuild bag of words.
corpus_bow_w_pos_filtered = [dictionary_w_pos.doc2bow(article) for article in tokenized_articles_w_pos]

This time, we're passing additional arguments when building the model. *alpha* is the prior on each document's distribution over topics, *eta* is the prior on each topic's distribution over words, and *passes* is the number of complete passes through the corpus during training. Setting *alpha* and *eta* to `'auto'` tells Gensim to learn asymmetric priors from the corpus.

In [None]:
%%time
lda_model = models.ldamodel.LdaModel(corpus=corpus_bow_w_pos_filtered,
                                     id2word=dictionary_w_pos,
                                     num_topics=NUM_TOPICS,
                                     passes=10,
                                     alpha='auto',
                                     eta='auto',
                                     random_state=1)
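
To build some intuition for what a Dirichlet prior such as *alpha* does, the sketch below draws topic mixtures from symmetric Dirichlet distributions with a small and a large concentration value. This is only an illustration using the standard library. `sample_dirichlet` is a hypothetical helper, the concentration values are arbitrary, and it does not reproduce the asymmetric priors Gensim learns with `'auto'`.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(1)
num_topics = 5
# Small concentration: mass tends to pile onto a few topics.
sparse_mix = sample_dirichlet([0.1] * num_topics, rng)
# Large concentration: mass tends to spread evenly across topics.
smooth_mix = sample_dirichlet([10.0] * num_topics, rng)
print([round(p, 3) for p in sparse_mix])
print([round(p, 3) for p in smooth_mix])
```

A small prior therefore encodes the belief that each document is about only a handful of topics, which is often a reasonable assumption for news articles.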

In [None]:
lda_model.print_topics()

In [None]:
articles[0]

We can look at the topic distribution comprising a given article.<br>
https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics

In [None]:
sorted(lda_model.get_document_topics(corpus_bow_w_pos_filtered[0]), key=lambda tup: tup[1])[::-1]

In [None]:
lda_model.show_topic(16)

The results of this model look the best so far and we can see a human-interpretable link between the distribution of topics in a document, the distribution of words in each topic, and the content of the document itself.

# Evaluation and Visualization

## Measuring topic models with coherence.

If a topic is a mixture of particular words, then one way to measure a topic's semantic coherence is to calculate co-occurrence among its words. That is, how often the top words in a topic occur together in the documents versus how often they occur independently.
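
The co-occurrence idea can be sketched with a simple document-level score. The example below averages normalized pointwise mutual information (NPMI) over all pairs of a topic's top words. It is a simplified stand-in for illustration only, not the *c_v* measure, whose pipeline is more involved. `npmi_coherence` is a hypothetical helper.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, tokenized_docs):
    """Average NPMI over all pairs of top words, using document-level
    co-occurrence. Pairs that never co-occur score -1."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)
        elif p12 == 1:
            # Both words appear in every document: NPMI is undefined,
            # so this sketch treats the pair as uninformative.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

toy_docs = [
    ['stock', 'market', 'trade'],
    ['stock', 'market', 'price'],
    ['police', 'court', 'arrest'],
    ['police', 'court', 'trial'],
]
print(npmi_coherence(['stock', 'market'], toy_docs))
print(npmi_coherence(['stock', 'market', 'police'], toy_docs))
```

Words that always appear together score close to 1, while a topic mixing unrelated words is pulled down by the pairs that never co-occur.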

Gensim's **Coherence Model** offers coherence implemented as a pipeline:<br>
https://radimrehurek.com/gensim/models/coherencemodel.html
<br>
<br>
See this paper for a detailed description of the pipeline as well as the different co-occurrence measures proposed (here, we are using the default *c_v* measure):<br>
http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
<br>
<br>
Topic model evaluation is a difficult subject with no clear quantitative approach. A higher c_v score doesn't necessarily mean a *qualitatively* better model, that is, one a human would rate as more interpretable when looking at the topic words. It's entirely possible to favour a *lower*-scoring model because it serves a particular purpose better. See this video for the problems with quantitative topic model evaluation:<br>
[Matti Lyra - Evaluating Topic Models](https://www.youtube.com/watch?v=UkmIljRIG_M)

In [None]:
from gensim.models.coherencemodel import CoherenceModel

In [None]:
# Let's check out the coherence of our current model with 20 topics.
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_articles_w_pos, dictionary=dictionary_w_pos, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

In [None]:
%%time
# Train another model with 30 topics to see whether the coherence score improves.
lda_model_30 = models.ldamodel.LdaModel(corpus=corpus_bow_w_pos_filtered,
                                        id2word=dictionary_w_pos,
                                        num_topics=30,
                                        passes=10,
                                        alpha='auto',
                                        eta='auto',
                                        random_state=1)

In [None]:
coherence_lda_30 = CoherenceModel(model=lda_model_30, texts=tokenized_articles_w_pos, dictionary=dictionary_w_pos, coherence='c_v')
print('\nCoherence Score: ', coherence_lda_30.get_coherence())

We improved the coherence score by a couple of percentage points by increasing the number of topics. One common technique is to try a bunch of different *num_topics* values, plot the coherence score for each, then choose the num_topics with the highest score.
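
The sweep described above can be sketched as a small helper. `sweep_num_topics` and `fake_score` are hypothetical names, and the scorer here is a stand-in so the example runs on its own. In the notebook, the scoring function would train an `LdaModel` with `num_topics=k` and return the result of `CoherenceModel(...).get_coherence()` for that model.

```python
def sweep_num_topics(candidates, score_fn):
    """Score each candidate topic count and return (best_k, scores_by_k)."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Stand-in scorer (an assumption for illustration): peaks at 30 topics.
def fake_score(k):
    return 0.5 - abs(k - 30) / 100

best_k, scores = sweep_num_topics([10, 20, 30, 40, 50], fake_score)
print(best_k)
print(scores)
```

The returned dictionary can be passed straight to a plotting library to see how the score changes with the number of topics. Keep in mind that each candidate means training a full model, so a wide sweep can take a while.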

## Human evaluation
Because the quantitative metrics aren't entirely correlated with quality, human judgment still plays a large role in topic model evaluation.


We can look at the topic words to see how interpretable they are...

In [None]:
lda_model.show_topic(19)

...or visualize them with word clouds.

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
def render_word_cloud(model, rows, cols, max_words):
  word_cloud = WordCloud(background_color='white', max_words=max_words, prefer_horizontal=1.0)
  fig, axes = plt.subplots(rows, cols, figsize=(15,15))

  for i, ax in enumerate(axes.flatten()):
      fig.add_subplot(ax)
      topic_words = dict(model.show_topic(i))
      word_cloud.generate_from_frequencies(topic_words)
      plt.gca().imshow(word_cloud, interpolation='bilinear')
      plt.gca().set_title('Topic {id}'.format(id=i))
      plt.gca().axis('off')

  plt.axis('off')
  plt.show()

In [None]:
# Here we'll visualize the first nine topics.
render_word_cloud(lda_model, 3, 3, 10)

There are also subjective tests like *word intrusion* and *topic intrusion*. Word intrusion is taking words which belong to a topic, injecting a word from another topic into the collection, and seeing whether a human can easily identify the intruder word. The more easily the intruder word is spotted, the more well-formed the topic. For example, which word doesn't belong in this topic?<br>
*{apple, lemon, tomato, horse, grape}*
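
A word intrusion question can be assembled mechanically from a model's topics. The sketch below is a simplified version under stated assumptions: `make_intrusion_task` is a hypothetical helper, the topics are given as plain word lists, and the intruder is picked at random from another topic rather than by any probability-based rule.

```python
import random

def make_intrusion_task(topics, topic_id, rng, num_words=4):
    """Build one word-intrusion question: top words from one topic plus
    a single intruder drawn from a different topic."""
    own_words = topics[topic_id][:num_words]
    candidates = [w for t, words in topics.items() if t != topic_id
                  for w in words if w not in own_words]
    intruder = rng.choice(candidates)
    shown = own_words + [intruder]
    rng.shuffle(shown)  # hide the intruder's position
    return shown, intruder

toy_topics = {
    0: ['apple', 'lemon', 'tomato', 'grape'],
    1: ['horse', 'cow', 'sheep', 'goat'],
}
shown, intruder = make_intrusion_task(toy_topics, 0, random.Random(1))
print(shown, '-> intruder:', intruder)
```

With a trained model, the word lists could be built from the words returned by `show_topic` for each topic.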

# Finding similar documents.

Gensim has a **similarities** module which can build an index for a given set of documents. Here, we're using **MatrixSimilarity**, which computes cosine similarity across a corpus and stores the vectors in an in-memory index.<br>
https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.MatrixSimilarity

In [None]:
from gensim import similarities
# The indexed vectors are topic distributions, so the feature count is the number of topics.
lda_index = similarities.MatrixSimilarity(lda_model[corpus_bow_w_pos_filtered], num_features=lda_model.num_topics)
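
For reference, cosine similarity between two documents' topic distributions can be computed by hand. The sketch below works on sparse `(topic_id, weight)` pairs, the same shape the LDA model returns for a document. `cosine_similarity` is a hypothetical helper and the vectors are made up for illustration.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = [(3, 0.7), (16, 0.3)]   # mostly topic 3
doc_b = [(3, 0.6), (16, 0.4)]   # similar mixture
doc_c = [(8, 1.0)]              # a different topic entirely
print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Documents with similar topic mixtures score near 1, and documents that share no topics score 0, regardless of which individual words they use.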

Here's a utility function to help retrieve the *first_m_words* of the *top_n* most similar documents. If you're curious about the \_\_getitem\__ method on the LDA Model class, you can find the code here:<br>
https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/ldamodel.py

In [None]:
def get_similar_documents(index, bow, model, article_id, top_n=5, first_m_words=300):
  # bow[article_id] gets the specific bag-of-words for the given article.
  # model[bow[article_id]] retrieves the topic distribution for the BOW.
  # index[model[bow[article_id]]] compares the topic distribution for the BOW against the similarity index previously computed.
  similar_docs = index[model[bow[article_id]]]
  top_n_docs = sorted(enumerate(similar_docs), key=lambda item: -item[1])[1:top_n+1]
  
  # Return a list of tuples with each tuple: (article id, similarity score, first_m_words of article)
  return list(map(lambda entry: (entry[0], entry[1], articles[entry[0]][:first_m_words]), top_n_docs))

In [None]:
article_id = 0
articles[article_id]

In [None]:
get_similar_documents(lda_index, corpus_bow_w_pos_filtered, lda_model, article_id)

We can also query for documents similar to unseen documents. Below are short, actual blurbs from 2021 involving stock options and crime. Because of the subject matter, it's relatively easy to find similar articles to these even in a corpus from the 1980s like this Associated Press collection. But keep in mind that if you query with short articles about subjects like cryptocurrencies and social media, you probably won't find good matches. This is another aspect to keep in mind when thinking about your data and use cases.

In [None]:
#d = "Capricorn Business Acquisitions Inc. (TSXV: CAK.H) (the “Company“) is pleased to announce that its board has approved the issuance of 70,000 stock options (“Stock Options“) to directors on April 19, 2020."
d = "DEA agent sentenced to 12 years in prison for conspiring with Colombian drug cartel."

In [None]:
doc_vec = dictionary_w_pos.doc2bow(spacy_tokenizer_w_pos(d))
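
One quick sanity check for an unseen query is how much of it the training vocabulary actually covers, since tokens outside the vocabulary contribute nothing to the bag-of-words. The sketch below uses a made-up vocabulary and hypothetical token lists. `vocab_coverage` is a helper name introduced here for illustration.

```python
def vocab_coverage(tokens, vocab):
    """Fraction of query tokens that exist in the training vocabulary."""
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)

toy_vocab = {'agent', 'sentence', 'prison', 'drug', 'cartel', 'stock', 'option'}
crime_query = ['agent', 'sentence', 'year', 'prison', 'drug', 'cartel']
crypto_query = ['bitcoin', 'blockchain', 'wallet', 'stock']
print(vocab_coverage(crime_query, toy_vocab))
print(vocab_coverage(crypto_query, toy_vocab))
```

In this notebook, the vocabulary would be the dictionary's `token2id` mapping. Low coverage is a hint that the query's subject matter is poorly represented in the corpus and that any matches should be treated with caution.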

In [None]:
similar_docs = lda_index[lda_model[doc_vec]]
# The query isn't part of the index, so there is no self-match to skip: keep the top result.
top_n_docs = sorted(enumerate(similar_docs), key=lambda item: -item[1])[:5]
top_n_docs

In [None]:
articles[top_n_docs[0][0]]

In [None]:
# Look at the topics comprising the query document.
topics = sorted(lda_model[doc_vec], key=lambda tup: -tup[1])[:10]
topics

In [None]:
lda_model.show_topic(topics[0][0])

# Closing Thoughts and things to explore.
- Gensim infers topic and word distributions through [Variational Bayes (VB)](https://en.wikipedia.org/wiki/Variational_Bayesian_methods), not Gibbs Sampling. From the topics I've seen, Gibbs Sampling tends to lead to more interpretable topics, but VB is faster and Gensim offers the additional benefits of streaming documents, online learning, and training across a cluster of machines.
- Another topic modelling library, [Mallet](http://mallet.cs.umass.edu/), infers through Gibbs Sampling but is Java-based. Unfortunately, Gensim 4.0+ no longer offers a wrapper around Mallet. But if you're comfortable with Java, it may be worth exploring.
- Scikit-learn offers an [LDA model](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html). Maybe as an exercise, try using this LDA model on the [20 Newsgroups](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) dataset.
- [pyLDAvis](https://github.com/bmabey/pyLDAvis) is another means of visualizing topic models. You can see it in action in this [notebook](https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb). See if you can get it working on your own topic model.
- LDA tends to work better on longer documents, and whether a topic model is "good" depends on your use case rather than strictly on a quantitative metric.
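
For readers curious about the Gibbs Sampling alternative mentioned above, here is a minimal sketch of collapsed Gibbs sampling for LDA on a toy corpus of word IDs. It is an illustration under simplifying assumptions (fixed symmetric priors, a fixed iteration count, no convergence check), not how Gensim or Mallet implement inference. `gibbs_lda` and its parameter defaults are hypothetical.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, iters=50, seed=1):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word IDs.
    Returns the topic-word and document-topic count tables."""
    rng = random.Random(seed)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            topic_word[z][w] += 1
            doc_topic[d][z] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                topic_word[z][w] -= 1
                doc_topic[d][z] -= 1
                topic_total[z] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + eta)
                    / (topic_total[k] + vocab_size * eta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][w] += 1
                doc_topic[d][z] += 1
                topic_total[z] += 1
    return topic_word, doc_topic

# Word IDs 0-2 form one theme and 3-5 another.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
topic_word, doc_topic = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
print(doc_topic)
```

Each token's topic is resampled in turn based on how often that topic already appears in the document and how often the word already appears in that topic, which is why the sampler is simple to write but slow on large corpora.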