<a href="https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_topic_modelling_lda.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing Demystified | Topic Modelling With Latent Dirichlet Allocation
https://nlpdemystified.org<br>
https://github.com/futuremojo/nlp-demystified<br><br>
Course module for this demo: https://www.nlpdemystified.org/course/topic-modelling

# spaCy upgrade and package installation.

At the time this notebook was created, spaCy had newer releases but Colab was still using version 2.x by default. So the first step is to upgrade spaCy.
<br><br>
**IMPORTANT**<br>
If you're running this in the cloud rather than a local Jupyter server on your machine, then the notebook will *time out* after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the requisite statistical package(s).
<br><br>
Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:<br>
https://research.google.com/colaboratory/local-runtimes.html

---
> **In the course video, I ran this demo on a local Jupyter server to take advantage of multiprocessing capabilities. It's not necessary but I recommend it.**

In [None]:
!pip install -U spacy==3.*
!python -m spacy download en_core_web_sm
!python -m spacy info

For topic modelling, we'll use **Gensim**, a popular topic modelling library originally authored by Radim Řehůřek. It has implementations for LDA and other models.<br>
https://radimrehurek.com/gensim/index.html

In [None]:
# Upgrade gensim just in case.
# !pip install --upgrade numpy
!pip install -U gensim==4.*

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import random
import spacy

from gensim import models, corpora
from gensim import similarities
from gensim.models.coherencemodel import CoherenceModel
from wordcloud import WordCloud

# First pass at building an LDA topic model for our corpus

We'll use a corpus of over 90,000 CNN news articles originally compiled for training question answering models. I lightly processed them to remove some metadata and put them on Google Drive.
([original source](https://cs.nyu.edu/~kcho/DMQA/))
<br><br>
To retrieve the corpus from Google Drive, we'll use the **gdown** library, which we install below:<br>
https://github.com/wkentaro/gdown

In [None]:
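import locale

# Workaround: force the preferred encoding to UTF-8. On some hosted runtimes
# (e.g. Colab), shell commands such as "!pip" or "!gdown" can otherwise fail
# with an error saying a UTF-8 locale is required.
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"

locale.getpreferredencoding = getpreferredencoding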

In [None]:
!pip install --upgrade --no-cache-dir gdown

In [None]:
# Download the CNN corpus.
!gdown 'https://drive.google.com/uc?id=122fC9XpNwFKx0ryRVKJz5MWUTzA3Vpsf'

The corpus is one large text file with each article in the corpus separated by an *@delimiter* string. We'll split the articles and place them in a list.
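
As a small illustration of the split, here's a sketch on a made-up in-memory string rather than the real file. Note that `str.split` keeps surrounding whitespace and yields an empty string if the text happens to end with the delimiter, so it can be worth stripping and dropping empties. The sample text and variable names (`raw_text`, `pieces`, `clean_pieces`) are hypothetical, and I'm not claiming the real corpus file has these quirks.

```python
# Hypothetical miniature corpus; the real file is far larger.
raw_text = "First article text.\n@delimiter\nSecond article text.\n@delimiter"

# A plain split keeps whitespace and, here, a trailing empty chunk.
pieces = raw_text.split('@delimiter')

# Strip whitespace and drop empty chunks.
clean_pieces = [p.strip() for p in pieces if p.strip()]

print(len(pieces), len(clean_pieces))
print(clean_pieces)
```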

In [None]:
with open('cnn_articles.txt', 'r', encoding='utf8') as f:
  articles = f.read().split('@delimiter')

In [None]:
print(len(articles))
print(articles[0])

For this demo, we'll use a subset of the articles to speed things up but feel free to change the dataset size.

In [None]:
DATASET_SIZE = 20000
dataset = articles[:DATASET_SIZE]

Just like in the [Text Classification with Naive Bayes](https://github.com/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_classification_naive_bayes.ipynb) demo, we'll start off with a *blank* tokenizer with no further pipeline components to see if that's good enough.
<br><br>
We'll filter out punctuation, newlines, and any tokens containing non-alphabetic characters.

In [None]:
nlp = spacy.blank('en')

def basic_filter(tokenized_doc):
  return [t.text for t in tokenized_doc if
          not t.is_punct and \
          not t.is_space and \
          t.is_alpha]

In this demo, we'll leverage spaCy's **nlp.pipe** function which can process a corpus as a batch (or a series of batches) and use multiple processes. Here, we'll process our dataset as a batch across multiple processes, then run the tokenized **doc** objects through the *basic_filter* function. You can adjust **NUM_PROCESS** as you wish.<br><br>
Take a look at these links for ways to further optimize spaCy's pipeline:<br>
https://spacy.io/usage/processing-pipelines#processing<br>
https://spacy.io/api/language#pipe<br><br>
YouTube video from spaCy on using **nlp.pipe**: [Speed up spaCy pipelines via `nlp.pipe` - spaCy shorts](https://www.youtube.com/watch?v=OoZ-H_8vRnc)<br>
Tuning **nlp.pipe**: https://stackoverflow.com/questions/65850018/processing-text-with-spacy-nlp-pipe

In [None]:
NUM_PROCESS = 4

In [None]:
%%time
tokenized_articles = list(map(basic_filter, nlp.pipe(dataset, n_process=NUM_PROCESS)))

In [None]:
print(tokenized_articles[0])

To start off, we'll go with 20 topics. With most topic models, including LDA, there isn't a clear recipe for picking the optimal number of topics. The nature and composition of the data (e.g. average length of each document) have a major impact on how many topics are *interpretable* by a human. Often, it's best to start with something reasonable and then try different numbers of topics.<br><br>For this corpus, I'm going with 20 topics, which is a small number relative to the corpus size, but my reasoning is that since this is a general mainstream news corpus, the topics themselves are going to be fairly broad.

In [None]:
NUM_TOPICS = 20

After tokenizing our text, the first step with Gensim is to construct a **Dictionary** mapping words to integer IDs.<br>
https://radimrehurek.com/gensim/corpora/dictionary.html<br><br>
This is similar to the *fit* step we took with scikit-learn's vectorizers.

In [None]:
%%time
# Build a Dictionary of word<-->id mappings.
# Note: the %%time cell magic must be the first line of the cell.
dictionary = corpora.Dictionary(tokenized_articles)

sample_token = 'news'
print(f'Id for \'{sample_token}\' token: {dictionary.token2id[sample_token]}')

The next step is to create a frequency bag-of-words from each article using the **dictionary**'s *doc2bow* method. This is similar to the *transform* step from scikit-learn's vectorizers.<br>
https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow
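
To make the "fit" and "transform" analogy concrete, here's a pure-Python sketch of the idea on two made-up documents. This is not Gensim's implementation (its id assignment and internals may differ), and the names `toy_docs`, `toy_token2id`, and `toy_doc2bow` are hypothetical. It only shows the shape of the data: a token-to-id mapping, and a bag-of-words as a list of *(token id, count)* pairs.

```python
from collections import Counter

# Toy tokenized documents (made up for illustration).
toy_docs = [
    ['news', 'report', 'news'],
    ['report', 'weather'],
]

# "Fit": give each distinct token an integer id in order of first appearance.
# (Gensim's actual id assignment may differ.)
toy_token2id = {}
for doc in toy_docs:
    for token in doc:
        toy_token2id.setdefault(token, len(toy_token2id))

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(toy_token2id)
print(toy_doc2bow(['news', 'news', 'weather', 'unseen'], toy_token2id))
```

Tokens missing from the mapping (like *unseen* above) are simply dropped, which is why the bag-of-words has to be rebuilt whenever the vocabulary changes.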

In [None]:
%%time
corpus_bow = [dictionary.doc2bow(article) for article in tokenized_articles]

Finally, we'll generate our base LDA model. Gensim's LDA model has a large number of optional parameters but for now, we'll keep it simple.<br>
https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#module-gensim.models.ldamodel

In [None]:
%%time
lda_model = models.LdaModel(corpus=corpus_bow, num_topics=NUM_TOPICS, id2word=dictionary, random_state=1)

Once our model is generated, we can view the topics inferred. By default, the model's *print_topics* method shows up to 20 topics, each with its ten most significant words.<br>
https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#gensim.models.ldamodel.LdaModel.print_topics

In [None]:
lda_model.print_topics()

The first pass is pretty awful. The topics are dominated by stop words such that they essentially all look the same. Let's see if we can do better.

# Improving preprocessing for better results.

For our next attempt, we'll
- remove stop words using the default spaCy stopword list. Given this is a corpus of news articles, there may be other stop words to consider such as salutations ("Mr", "Mrs"), and words related to quotes and thoughts ("say", "think"). But for this, we'll stick to defaults unless we see reason to do otherwise.
- consider only the words the spaCy tagger flags as *nouns, verbs,* and *adjectives*. Including words with only certain POS tags is a common approach to improving topic models.
- take the lemma.

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def improved_filter(tokenized_doc):
  return [t.lemma_ for t in tokenized_doc if
          t.is_alpha and \
          not t.is_punct and \
          not t.is_space and \
          not t.is_stop and \
          t.pos_ in ['NOUN', 'VERB', 'ADJ']]

In [None]:
%%time
# We'll need to retokenize everything and rebuild the BOWs. Because we're now
# using the POS tagger, this will take longer. The "w_pos" in the variable
# names below just means "with part-of-speech".
tokenized_articles_w_pos = list(map(improved_filter, nlp.pipe(dataset, n_process=NUM_PROCESS)))
dictionary_w_pos = corpora.Dictionary(tokenized_articles_w_pos)
corpus_bow_w_pos = [dictionary_w_pos.doc2bow(article) for article in tokenized_articles_w_pos]

In [None]:
%%time
lda_model = models.LdaModel(corpus=corpus_bow_w_pos, num_topics=NUM_TOPICS, id2word=dictionary_w_pos, random_state=1)

In [None]:
lda_model.print_topics()

This is better, but a few low-signal words still dominate the topics, such as "say" (the lemma of "said"), which makes sense for a news corpus. Perhaps trimming the vocabulary and tuning the model parameters themselves can lead to something more interpretable.

# Trimming low- and high-frequency words.

One thing we can try is filtering out rare and common tokens.<br>
https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes

In [None]:
# The size of the dictionary before filtering.
len(dictionary_w_pos)

The filtering is a bit idiosyncratic. The lower bound (*no_below*) is an *absolute* number of documents, and the upper bound (*no_above*) is a *fraction* of the corpus. Here, we're filtering out words which occur in fewer than 5 documents or in more than 50% of the documents.
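
Here's a pure-Python sketch of that document-frequency rule on a made-up corpus. It's an illustration of the idea rather than Gensim's code (Gensim's *filter_extremes* also caps the vocabulary size via its *keep_n* argument, which this sketch ignores), and `toy_docs` and `toy_filter_extremes` are hypothetical names.

```python
from collections import Counter

# Toy tokenized documents (made up for illustration).
toy_docs = [
    ['court', 'judge', 'trial'],
    ['court', 'flight', 'airport'],
    ['court', 'judge', 'film'],
    ['court', 'flight', 'film'],
]

def toy_filter_extremes(docs, no_below, no_above):
    """Keep tokens whose document frequency is at least no_below (an absolute
    count) and at most no_above (a fraction of all documents)."""
    # Count each token once per document it appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {t for t, df in doc_freq.items() if no_below <= df <= max_docs}

kept = toy_filter_extremes(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))
```

In this toy corpus, *court* appears in every document so it's dropped as too common, while *trial* and *airport* appear only once so they're dropped as too rare.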

In [None]:
dictionary_w_pos.filter_extremes(no_below=5, no_above=0.5)

In [None]:
# The size of the dictionary after filtering.
len(dictionary_w_pos)

In [None]:
# Rebuild bag of words.
corpus_bow_w_pos_filtered = [dictionary_w_pos.doc2bow(article) for article in tokenized_articles_w_pos]

This time, we're passing additional arguments when building the model: *alpha* is the prior on the document-topic distribution, *eta* is the prior on the topic-word distribution (this was *beta* in the slides), and *passes* is the number of complete passes through the corpus during training.<br>
https://radimrehurek.com/gensim/models/ldamodel.html?highlight=lda#module-gensim.models.ldamodel

In [None]:
%%time
lda_model = models.ldamodel.LdaModel(corpus=corpus_bow_w_pos_filtered,
                                     id2word=dictionary_w_pos,
                                     num_topics=NUM_TOPICS,
                                     passes=10,
                                     alpha='auto',
                                     eta='auto',
                                     random_state=1)

In [None]:
lda_model.print_topics()

With improved filtering and low- and high-frequency words trimmed, we can see the topic-word distributions containing certain themes such as crime, travel, entertainment, etc.<br><br>
**NOTE:** Remember that the topic model doesn't label topics for us. It just converges on collections of terms that likely form topics.

We set the training algorithm to learn priors for *alpha* and *eta*.

In [None]:
print(lda_model.alpha)
print(lda_model.eta)

The *alpha* and *eta* values the training algorithm arrived at are well below 1. This translates to most articles being dominated by one or just a few topics, and most topics being dominated by a handful of words.
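
To get a feel for why small prior values lead to sparse mixtures, here's a standard-library sketch that samples from a symmetric Dirichlet (by normalizing Gamma draws) and compares how large the biggest topic share tends to be. The alpha values (0.1 and 5.0), the sample count, and the function names are assumptions chosen for illustration; they are not the values our model learned.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics
    by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=20, n_samples=500, seed=1):
    """Average size of the largest topic share across many samples."""
    rng = random.Random(seed)
    shares = [max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)]
    return sum(shares) / n_samples

sparse_share = mean_largest_share(alpha=0.1)
dense_share = mean_largest_share(alpha=5.0)
print(f'alpha=0.1: largest topic share averages {sparse_share:.2f}')
print(f'alpha=5.0: largest topic share averages {dense_share:.2f}')
```

With the smaller alpha, a single topic should take up a much larger share of each sampled mixture, which mirrors the "dominated by one or just a few topics" behaviour described above.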

We can look at the topic distribution comprising a given article using the model's *get_document_topics* method.<br>
https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.get_document_topics

In [None]:
article_idx = 0
print(dataset[article_idx][:300])

In [None]:
# Return topic distribution for an article sorted by probability.
topics = sorted(lda_model.get_document_topics(corpus_bow_w_pos_filtered[article_idx]), key=lambda tup: tup[1])[::-1]
topics

We can get the top words (10 by default) representing a topic using the model's *show_topic* method.<br>
https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.show_topic

In [None]:
# View the words of the top topic from the previous article.
lda_model.show_topic(topics[0][0])

In [None]:
# View the words of the second-most prevalent topic from the previous article.
lda_model.show_topic(topics[1][0])

The function below takes a document index and returns a **DataFrame** containing:
1. the topics comprising the document up to a minimum probability.
2. the top words of each topic.
<br>

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

In [None]:
def get_top_topics(article_idx, min_topic_prob):

  # Sort from highest to lowest topic probability.
  topic_prob_pairs = sorted(lda_model.get_document_topics(corpus_bow_w_pos_filtered[article_idx],
                                                          minimum_probability=min_topic_prob),
                            key=lambda tup: tup[1])[::-1]

  word_prob_pairs = [lda_model.show_topic(pair[0]) for pair in topic_prob_pairs]
  topic_words = [[pair[0] for pair in collection] for collection in word_prob_pairs]

  data = {
      'Major Topics': topic_prob_pairs,
      'Topic Words': topic_words
  }

  return pd.DataFrame(data)


In [None]:
pd.set_option('max_colwidth', 600)
snippet_length = 300
min_topic_prob = 0.25

article_idx = 1
print(dataset[article_idx][:snippet_length])
get_top_topics(article_idx, min_topic_prob)

In [None]:
article_idx = 10
print(dataset[article_idx][:snippet_length])
get_top_topics(article_idx, min_topic_prob)

In [None]:
article_idx = 100
print(dataset[article_idx][:snippet_length])
get_top_topics(article_idx, min_topic_prob)

In [None]:
article_idx = 1000
print(dataset[article_idx][:snippet_length])
get_top_topics(article_idx, min_topic_prob)

In [None]:
article_idx = 10000
print(dataset[article_idx][:snippet_length])
get_top_topics(article_idx, min_topic_prob)

The results of this model look the best so far and we can see a human-interpretable link between the distribution of topics in a document, the distribution of words in each topic, and the content of the document itself.

# Evaluation and Visualization

## Measuring topic models with coherence.

If a topic is a mixture of particular words, then one way to measure a topic's semantic coherence is to calculate co-occurrence among its words. That is, how often the top words in a topic co-occur in the same documents versus how often they occur independently.
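
As a rough sketch of the co-occurrence idea, here's a simplified UMass-style score in pure Python: for each ordered pair of a topic's top words, it takes the log of the smoothed co-document count divided by the document count of the earlier word, then averages. This is an assumption-laden simplification on a made-up corpus, not Gensim's pipeline (which differs in details such as segmentation, smoothing, and aggregation), and `doc_count` and `toy_umass` are hypothetical names.

```python
import math
from itertools import combinations

# Toy documents as sets of tokens (made up for illustration).
toy_docs = [
    {'court', 'judge', 'trial'},
    {'court', 'judge', 'lawyer'},
    {'flight', 'airport', 'court'},
    {'flight', 'airport', 'pilot'},
]

def doc_count(words, docs):
    """Number of documents containing every word in `words`."""
    return sum(1 for doc in docs if all(w in doc for w in words))

def toy_umass(top_words, docs):
    """Average of log((D(later, earlier) + 1) / D(earlier)) over ordered pairs
    of top words (most probable first). Every word is assumed to occur in at
    least one document, otherwise the division would fail."""
    scores = []
    for earlier, later in combinations(top_words, 2):
        co_docs = doc_count([earlier, later], docs)
        scores.append(math.log((co_docs + 1) / doc_count([earlier], docs)))
    return sum(scores) / len(scores)

coherent = toy_umass(['court', 'judge', 'trial'], toy_docs)
mixed = toy_umass(['court', 'pilot', 'trial'], toy_docs)
print(coherent, mixed)
```

The word list whose members actually share documents scores higher (closer to zero) than the list with an unrelated word mixed in.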

Gensim's **Coherence Model** offers coherence implemented as a pipeline:<br>
https://radimrehurek.com/gensim/models/coherencemodel.html
<br>
<br>
See this paper for a detailed description of the pipeline as well as different co-occurrence measures proposed:<br>
http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
<br>
<br>
Topic model evaluation is a difficult subject with no clear quantitative approach and is still debated. A higher score (or lower, depending on the measure) doesn't necessarily translate to a *qualitatively* better model, that is, to the score a human would give after looking at the topic words and judging how interpretable they are.<br><br>
It's possible to favour a poorer scoring model because it serves a particular purpose better. Perhaps it's better to score the effectiveness of topic models based on performance in downstream tasks? See these videos for the problems with quantitative topic model evaluation:<br>
[Matti Lyra - Evaluating Topic Models](https://www.youtube.com/watch?v=UkmIljRIG_M)<br>
[Is Topic Model Evaluation Broken? The Incoherence of Coherence](https://www.youtube.com/watch?v=4KO2TO_cm2I)

In [None]:
%%time
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_articles_w_pos, dictionary=dictionary_w_pos, coherence='u_mass')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

## Human evaluation
Because the quantitative metrics aren't entirely correlated with quality, human judgment still plays a large role in topic model evaluation.


We can get someone to look at the topic words to see how interpretable they are. 

There are also subjective tests like **word intrusion** and **topic intrusion**.
<br><br>
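**Word intrusion** is taking words which belong to a topic, injecting a word from another topic into the collection, and seeing whether a human can easily identify the intruder word. The more easily the intruder word is spotted, the more well-formed the topic. For example, which word doesn't belong in this topic?<br>
*{apple, lemon, tomato, horse, grape}*
<br><br>
**Topic intrusion** works similarly at the document level: a human is shown a document alongside a few of its most probable topics plus one low-probability "intruder" topic, and is asked to pick out the topic that doesn't belong.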
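
A word-intrusion question is easy to assemble programmatically. Below is a minimal sketch using hard-coded word lists; in practice, the words would presumably come from each topic's top words (e.g. via *show_topic*). The helper name `make_intrusion_task` and the example topics are hypothetical.

```python
import random

def make_intrusion_task(topic_words, other_topic_words, rng):
    """Return (shuffled word list, intruder) for one word-intrusion question."""
    # Only pick an intruder that isn't already among the topic's own words.
    candidates = [w for w in other_topic_words if w not in topic_words]
    intruder = rng.choice(candidates)
    words = list(topic_words) + [intruder]
    rng.shuffle(words)
    return words, intruder

rng = random.Random(1)
fruit_topic = ['apple', 'lemon', 'tomato', 'grape']
animal_topic = ['horse', 'cow', 'sheep', 'goat']

words, intruder = make_intrusion_task(fruit_topic, animal_topic, rng)
print(words, '-> intruder:', intruder)
```

Showing `words` to a person and checking whether they pick `intruder` gives one data point; repeating this over many topics and raters gives a rough interpretability score.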

We can also visualize them with word clouds.

In [None]:
def render_word_cloud(model, rows, cols, max_words):
  word_cloud = WordCloud(background_color='white', max_words=max_words, prefer_horizontal=1.0)
  fig, axes = plt.subplots(rows, cols, figsize=(15,15))

  # Draw one word cloud per subplot, sized by each word's weight in the topic.
  for i, ax in enumerate(axes.flatten()):
      topic_words = dict(model.show_topic(i, topn=max_words))
      word_cloud.generate_from_frequencies(topic_words)
      ax.imshow(word_cloud, interpolation='bilinear')
      ax.set_title('Topic {id}'.format(id=i))
      ax.axis('off')

  plt.show()

In [None]:
# Here we'll visualize the first nine topics.
render_word_cloud(lda_model, 3, 3, 10)

# Finding similar documents.

Gensim has a **similarities** module which can build an index for a given set of documents. Here, we're using **MatrixSimilarity** which computes cosine similarity across a corpus and stores them in an index.<br>
https://radimrehurek.com/gensim/similarities/docsim.html#gensim.similarities.docsim.MatrixSimilarity
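
To show what's being compared, here's a pure-Python sketch of cosine similarity between topic distributions given as sparse *(topic id, probability)* pairs, which is the form the LDA model returns. The distributions, labels, and helper names below are made up for illustration, and this is a conceptual stand-in rather than how **MatrixSimilarity** is implemented.

```python
import math

def to_dense(sparse_vec, num_topics):
    """Expand [(topic_id, prob), ...] into a dense list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = prob
    return dense

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical topic distributions over 4 topics.
query = to_dense([(0, 0.7), (2, 0.3)], 4)
toy_index = {
    'crime story': to_dense([(0, 0.8), (2, 0.2)], 4),
    'travel story': to_dense([(1, 0.9), (3, 0.1)], 4),
}

# Rank the indexed documents from most to least similar to the query.
ranked = sorted(toy_index, key=lambda name: cosine(query, toy_index[name]), reverse=True)
print(ranked)
```

Documents that share no topics with the query get a similarity of zero, so two articles only count as similar if their topic mixtures overlap.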

In [None]:
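# Each indexed vector is a topic distribution, so it has NUM_TOPICS features
# (not one feature per dictionary word).
lda_index = similarities.MatrixSimilarity(lda_model[corpus_bow_w_pos_filtered], num_features=NUM_TOPICS)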

Here's a utility function to help retrieve the *first_m_words* of the *top_n* most similar documents. If you're curious about the *\_\_getitem\_\_* method on the LDA Model class, you can find the code here:<br>
https://github.com/RaRe-Technologies/gensim/blob/master/gensim/models/ldamodel.py

In [None]:
def get_similar_articles(index, model, article_bow, top_n=5, first_m_words=300):
  # model[article_bow] retrieves the topic distribution for the BOW.
  # index[model[article_bow]] compares the topic distribution for the BOW against the similarity index previously computed.
  similar_docs = index[model[article_bow]]
  # Skip the top hit on the assumption that it's the query article itself.
  # Note: for unseen text, this also drops the best match.
  top_n_docs = sorted(enumerate(similar_docs), key=lambda item: -item[1])[1:top_n+1]

  # Return a list of tuples with each tuple: (article id, similarity score, first_m_words of article).
  # Despite the name, first_m_words slices characters, not words.
  return list(map(lambda entry: (entry[0], entry[1], articles[entry[0]][:first_m_words]), top_n_docs))

In [None]:
article_idx = 0
print(dataset[article_idx][:snippet_length], '\n')
get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])

In [None]:
article_idx = 10
print(dataset[article_idx][:snippet_length], '\n')
get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])

In [None]:
article_idx = 100
print(dataset[article_idx][:snippet_length], '\n')
get_similar_articles(lda_index, lda_model, corpus_bow_w_pos_filtered[article_idx])

We can also query for documents similar to new, unseen documents. Below are short, actual blurbs from 2021 involving stock options and crime. Keep in mind that if this were a really old news corpus, then excerpts about cryptocurrencies and social media probably wouldn't lead to good matches. This is another aspect to keep in mind when thinking about your data and use cases.

In [None]:
test_article = "Capricorn Business Acquisitions Inc. (TSXV: CAK.H) (the “Company“) is pleased to announce that its board has approved the issuance of 70,000 stock options (“Stock Options“) to directors on April 19, 2020."

article_tokens = list(map(improved_filter, [nlp(test_article)]))[0]
article_bow = dictionary_w_pos.doc2bow(article_tokens)
get_similar_articles(lda_index, lda_model, article_bow)

In [None]:
test_article = "DEA agent sentenced to 12 years in prison for conspiring with Colombian drug cartel."

article_tokens = list(map(improved_filter, [nlp(test_article)]))[0]
article_bow = dictionary_w_pos.doc2bow(article_tokens)
get_similar_articles(lda_index, lda_model, article_bow)

# Closing Thoughts and things to explore.
- Gensim infers topic and word distributions through [Variational Bayes (VB)](https://en.wikipedia.org/wiki/Variational_Bayesian_methods), not Gibbs Sampling. From the topics I've seen, Gibbs Sampling tends to lead to more interpretable topics, but VB is faster and Gensim offers the additional benefits of streaming documents, online learning, and training across a cluster of machines.
- Another topic modelling library, [Mallet](http://mallet.cs.umass.edu/), infers through Gibbs Sampling but is Java-based. Unfortunately, Gensim 4.0+ no longer offers a wrapper around Mallet. But if you're comfortable with Java, it may be worth exploring.
- Scikit-learn offers an [LDA model](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html). Maybe as an exercise, try using that LDA model on the [20 Newsgroups](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) dataset (or ideally, a dataset with longer documents).
- [pyLDAvis](https://github.com/bmabey/pyLDAvis) is another means of visualizing topic models. You can see it in action in this [notebook](https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb). See if you can get it working on your own topic model.
- LDA tends to work better on longer documents, and whether a topic model is "good" depends on your use case rather than strictly on a quantitative metric.