# Topic modeling with Gensim

To perform a topic modeling analysis on a text and visualize the output, you can follow these steps:

1. Preprocessing: Tokenize the text, remove stopwords and punctuation, and perform other necessary preprocessing steps.
2. Dictionary and Corpus Creation: Create a dictionary and corpus required for LDA modeling.
3. LDA Model Training: Train an LDA model on the corpus.
4. Visualization: Use `pyLDAvis` to visualize the topics.

The `num_topics` parameter in the `lda_analysis` function determines the number of topics to be extracted from the text. You can adjust this parameter based on your text content and the granularity of topics you wish to achieve.

The `pyLDAvis` visualization will provide an interactive interface to explore the topics, their distribution, and the most relevant terms for each topic. You can hover over the topics and terms to get more detailed information.

**Choosing the optimal number of topics for LDA** is more of an art than an exact science, as it depends on the specific characteristics of your dataset and your objectives. However, there are several approaches you can use to guide your decision:

1. **Topic Coherence**: This is a popular method for evaluating the quality of the topics generated by LDA. Topic coherence measures the degree of semantic similarity between high-scoring words in the topic. You can use Gensim's `CoherenceModel` to calculate coherence scores for different numbers of topics and choose the number that maximizes coherence. You can then plot the coherence values against the number of topics to find the optimal number.
2. **Perplexity**: This is another metric that can be used to evaluate the quality of the LDA model. Lower perplexity indicates a better model. However, perplexity might not always align with human judgment, so it's often used in conjunction with other methods.
3. **Manual Inspection**: Sometimes, the best way to determine the optimal number of topics is to manually inspect the topics generated by the model for different numbers of topics. Look for a number that provides a meaningful and interpretable set of topics without too much overlap or redundancy.
4. **Domain Knowledge**: Consider the context of your data and what you know about the subject area. If you have an idea of how many distinct topics you expect to find, this can guide your choice.

It's often a good idea to experiment with several methods and consider their results in conjunction with your own judgment and the needs of your project.
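To make the coherence idea above more concrete, here is a minimal, standard-library sketch of an unnormalized UMass-style coherence score. It is not Gensim's `CoherenceModel`; the function name `umass_coherence` and the toy corpus are assumptions made purely for illustration. The score rewards topics whose top words tend to appear in the same documents.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Unnormalized UMass-style coherence of one topic.

    top_words: the topic's keywords, ordered from most to least probable.
    documents: tokenized documents (lists of strings).
    Assumes every keyword occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):
        higher, lower = top_words[j], top_words[i]  # j < i, so `higher` ranks above `lower`
        # +1 smoothing avoids log(0) when the two words never co-occur
        score += math.log((doc_freq(higher, lower) + 1) / doc_freq(higher))
    return score

# Made-up toy corpus, for illustration only
toy_docs = [
    ["car", "engine", "wheel"],
    ["car", "engine", "road"],
    ["space", "orbit", "launch"],
    ["space", "orbit", "engine"],
]

coherent = umass_coherence(["car", "engine"], toy_docs)
mixed = umass_coherence(["car", "orbit"], toy_docs)
print(coherent, mixed)
```

In practice you would compute such a score for models trained with several values of `num_topics` and compare the results, rather than rely on a single number.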

For advanced visualizations of LDA topic modeling results, you can explore various options beyond the standard pyLDAvis output. Here are some ideas:

- **Topic Trend Analysis**: Track the prevalence of topics over time or across different segments of your data. This can be particularly useful if your corpus is time-stamped or can be divided into meaningful categories. Plotting the proportion of each topic in different time periods or segments can reveal trends and patterns in the data.
- **Word Clouds for Topics**: Generate word clouds for each topic, where the size of each word is proportional to its importance in the topic. This provides a quick and visually appealing way to understand the content of each topic.
- **Topic Network Visualization**: Create a network graph where nodes represent topics and edges represent the similarity between topics (based on their word distributions). This can help you visualize the relationships between topics and identify clusters of related topics.
- **Topic Distribution in Documents**: For a selected set of documents, visualize the distribution of topics within each document using stacked bar charts or pie charts. This can help you understand how different topics are represented in individual documents or groups of documents.
- **Interactive Dashboards**: Build interactive dashboards using tools like Dash or Streamlit that allow users to explore the LDA results dynamically. You could include options to filter documents by topic, search for specific words or topics, and visualize the results in various ways.
- **Heatmaps of Topic-Word Distributions**: Create heatmaps to visualize the distribution of words across topics or the distribution of topics across documents. This can provide a detailed view of the relationships between words and topics or between documents and topics.
- **Hierarchical Clustering of Topics**: Perform hierarchical clustering on the topics based on their word distributions and visualize the resulting dendrogram. This can help you identify groups of similar topics and understand the hierarchical structure of the topics in your corpus.
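
The network and clustering ideas in the list above both start from a pairwise distance between topics. As a rough sketch, the snippet below computes the Hellinger distance between topic-word distributions and keeps only the pairs that are close enough to draw as edges. The topic names, word probabilities and the 0.8 cut-off are made-up assumptions, not output of the model in this notebook.

```python
import math
from itertools import combinations

def hellinger(p, q):
    """Hellinger distance between two word distributions given as {word: probability}."""
    vocab = set(p) | set(q)
    squared = sum((math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2 for w in vocab)
    return math.sqrt(squared) / math.sqrt(2)

# Hypothetical topic-word distributions (illustration only)
topics = {
    "autos": {"car": 0.5, "engine": 0.3, "road": 0.2},
    "bikes": {"bike": 0.5, "engine": 0.3, "road": 0.2},
    "space": {"orbit": 0.6, "launch": 0.4},
}

# One candidate edge per pair of topics: (topic_a, topic_b, distance)
edges = [(a, b, hellinger(topics[a], topics[b])) for a, b in combinations(sorted(topics), 2)]

# Keep only similar topics; 0.8 is an arbitrary threshold chosen for this toy data
similar_edges = [(a, b, d) for a, b, d in edges if d < 0.8]
print(similar_edges)
```

The same distance matrix could be passed to a graph library for the network view or to a hierarchical clustering routine for the dendrogram.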

## Anomaly Detection in textual data
Leveraging Latent Dirichlet Allocation (LDA) for anomaly detection involves using the topic distributions generated by LDA to identify documents or text segments that are unusual or deviate significantly from the norm. Here are some approaches to using LDA for anomaly detection:

- **Outlier Detection in Topic Distributions**: After training an LDA model on your corpus, you can examine the topic distribution for each document. Documents that have an unusual distribution of topics (e.g., a very high proportion of a single topic or a distribution that significantly differs from the average distribution) could be considered anomalies.

- **Topic Coherence Analysis**: Calculate the coherence of topics generated by LDA. Topics with very low coherence (i.e., topics that contain a mix of unrelated words) might indicate anomalies in the data, such as documents that are very different from the rest of the corpus.

- **Cluster Analysis of Topic Distributions**: Perform clustering (e.g., K-means or hierarchical clustering) on the topic distributions of documents. Documents that fall into very small clusters or that are far from the centroids of their clusters could be considered anomalies.

- **Temporal Analysis of Topic Trends**: If your corpus is time-stamped, you can track the prevalence of topics over time. Sudden changes in topic prevalence or the emergence of new, transient topics could indicate anomalous events or shifts in the data.

- **Comparison with a Reference Corpus**: If you have a reference corpus that represents "normal" data, you can train an LDA model on this corpus and then use it to analyze a target corpus. Documents in the target corpus with topic distributions that are significantly different from those in the reference corpus could be considered anomalies.

- **Thresholding on Topic Probabilities**: Set thresholds for the probabilities of topics in documents. Documents with topic probabilities that exceed these thresholds (either too high or too low) can be flagged as anomalies.

It's important to note that anomaly detection using LDA requires careful interpretation and validation, as the definition of an anomaly can be context-dependent. Additionally, combining LDA with other text analysis and machine learning techniques can improve the effectiveness of anomaly detection.
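
As one possible reading of the outlier-detection approach above, the sketch below scores each document by the Jensen-Shannon divergence between its topic distribution and the corpus average, then flags documents above a threshold. The document-topic values and the 0.2 threshold are assumptions for illustration; in a real project the distributions would come from the trained LDA model and the threshold would need validation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing to the sum
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def flag_outliers(doc_topics, threshold):
    """Return indices of documents whose topic mix is far from the corpus mean, plus all scores."""
    n = len(doc_topics)
    mean = [sum(column) / n for column in zip(*doc_topics)]
    scores = [js_divergence(dist, mean) for dist in doc_topics]
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return flagged, scores

# Hypothetical document-topic distributions (rows sum to 1)
doc_topics = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.05, 0.05, 0.90],  # very different from the others
]

flagged, scores = flag_outliers(doc_topics, threshold=0.2)
print(flagged)
```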

## Advanced examples

- Using [BERTopic](https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html) and OpenAI ChatGPT for coherent topic labeling and formation, with an implementation [example](https://medium.com/python-in-plain-english/topic-modeling-for-beginners-using-bertopic-and-python-aaf1b421afeb).

# Example 1: Topic Modeling with Gensim

Topic Modeling is a technique to extract the hidden topics from large volumes of text.

**Latent Dirichlet Allocation (LDA)** is a popular topic modeling algorithm with an excellent implementation in Python’s Gensim package.

The challenge, however, is extracting **good-quality topics** that are **clear**, **segregated** and **meaningful**. This depends heavily on the quality of text preprocessing and on the strategy for finding the optimal number of topics.

## Introduction

One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such text include social media feeds, customer reviews of hotels and movies, user feedback, news stories, and e-mails with customer complaints.

Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, and political campaigns. It is also really hard to read through such large volumes manually and compile the topics.

What is needed is an automated algorithm that can read through the text documents and output the topics discussed.

- This example: We will take a real example, the **20 Newsgroups dataset**, and use LDA to extract the naturally discussed topics. I will be using **Latent Dirichlet Allocation (LDA)** from the **Gensim** package along with **Mallet’s implementation** (via Gensim). Mallet has an efficient implementation of LDA; it is known to run faster and to give better topic segregation. We will also extract the volume and percentage contribution of each topic to get an idea of how important a topic is.

### What does LDA do?

LDA treats each document as a collection of topics in certain proportions, and each topic as a collection of keywords, again in certain proportions.

Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.

**A topic** is nothing but a collection of dominant keywords that are typical representatives of it. Just by looking at the keywords, you can identify what the topic is about.
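
The "documents are mixtures of topics, topics are mixtures of keywords" view can be illustrated by running it forwards: pick a topic for each word position according to the document's topic proportions, then pick a word according to that topic's keyword proportions. The sketch below does exactly this with made-up topics and probabilities. It only illustrates the assumed generative story, not how Gensim fits the model.

```python
import random

def sample_document(doc_topic_mix, topic_word_dists, length, rng):
    """Generate `length` words from a document's topic mixture."""
    topics = list(doc_topic_mix)
    topic_weights = [doc_topic_mix[t] for t in topics]
    words = []
    for _ in range(length):
        # 1) choose a topic according to the document's topic proportions
        topic = rng.choices(topics, weights=topic_weights)[0]
        # 2) choose a word according to that topic's keyword proportions
        vocab = list(topic_word_dists[topic])
        word_weights = [topic_word_dists[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words

# Hypothetical topics and proportions (illustration only)
topic_word_dists = {
    "autos": {"car": 0.5, "engine": 0.3, "door": 0.2},
    "space": {"orbit": 0.4, "launch": 0.4, "mission": 0.2},
}
doc_topic_mix = {"autos": 0.7, "space": 0.3}

rng = random.Random(0)  # fixed seed so the run is repeatable
print(sample_document(doc_topic_mix, topic_word_dists, 12, rng))
```

Training LDA amounts to inferring the two sets of proportions that make the observed documents most plausible under this story.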

The following are key factors in obtaining well-segregated topics:

- The quality of text processing.
- The variety of topics the text talks about.
- The choice of topic modeling algorithm.
- The number of topics fed to the algorithm.
- The algorithm’s tuning parameters.

## Prerequisites – Download and install necessary packages and files and import packages

We will need the `stopwords` from **NLTK** and `spacy`’s `en_core_web_sm` model (formerly available under the shortcut `en`, denoting a model pretrained on English) for text pre-processing. Later, we will use the `spacy` model for lemmatization.

The core packages are `re`, `gensim`, `spacy` and `pyLDAvis`. Besides these, we will also be using `matplotlib`, `numpy` and `pandas` for data handling and visualization.

In [49]:
#!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
[!] As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use
the full pipeline package name 'en_core_web_sm' instead.
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [1]:
import re
import numpy as np
import pandas as pd
from pprint import pprint  # data pretty printer - provides a capability to “pretty-print” arbitrary Python data structures

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import nltk
nltk.download('stopwords')

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this; newer pyLDAvis releases name this module pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
#warnings.filterwarnings("ignore",category=DeprecationWarning)

import utsnlp

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vitali\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Get data

We will be using the 20-Newsgroups dataset for this exercise. This version of the dataset contains about 11k newsgroups posts from 20 different topics. This is available as [newsgroups.json](https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json).

This is imported using `pandas.read_json` and the resulting dataset has 3 columns as shown.

In [2]:
# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head(10)

['rec.autos' 'comp.sys.mac.hardware' 'comp.graphics' 'sci.space'
 'talk.politics.guns' 'sci.med' 'comp.sys.ibm.pc.hardware'
 'comp.os.ms-windows.misc' 'rec.motorcycles' 'talk.religion.misc'
 'misc.forsale' 'alt.atheism' 'sci.electronics' 'comp.windows.x'
 'rec.sport.hockey' 'rec.sport.baseball' 'soc.religion.christian'
 'talk.politics.mideast' 'talk.politics.misc' 'sci.crypt']


| | content | target | target_names |
|---|---|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... | 7 | rec.autos |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... | 4 | comp.sys.mac.hardware |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... | 4 | comp.sys.mac.hardware |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... | 1 | comp.graphics |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... | 14 | sci.space |
| 5 | From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\... | 16 | talk.politics.guns |
| 6 | From: bmdelane@quads.uchicago.edu (brian manni... | 13 | sci.med |
| 7 | From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ... | 3 | comp.sys.ibm.pc.hardware |
| 8 | From: holmes7000@iscsvax.uni.edu\nSubject: WIn... | 2 | comp.os.ms-windows.misc |
| 9 | From: kerr@ux1.cso.uiuc.edu (Stan Kerr)\nSubje... | 4 | comp.sys.mac.hardware |


## Pre-process text

- Prepare Stopwords
    - We have already downloaded the stopwords. Let’s import them and make them available in `stop_words`.
- Remove emails and newline characters
    - There are many emails, newlines and extra spaces that are quite distracting. Let’s get rid of them using regular expressions.
- Tokenize
    - Break down each sentence into a list of words through tokenization, while clearing up all the messy text in the process.

In [3]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [29]:
# Convert to list
data = df.content.values.tolist()
pprint(data[:1])

["From: lerxst@wam.umd.edu (where's my thing)\n"
 'Subject: WHAT car is this!?\n'
 'Nntp-Posting-Host: rac3.wam.umd.edu\n'
 'Organization: University of Maryland, College Park\n'
 'Lines: 15\n'
 '\n'
 ' I was wondering if anyone out there could enlighten me on this car I saw\n'
 'the other day. It was a 2-door sports car, looked to be from the late 60s/\n'
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition,\n'
 'the front bumper was separate from the rest of the body. This is \n'
 'all I know. If anyone can tellme a model name, engine specs, years\n'
 'of production, where this car is made, history, or whatever info you\n'
 'have on this funky looking car, please e-mail.\n'
 '\n'
 'Thanks,\n'
 '- IL\n'
 '   ---- brought to you by your neighborhood Lerxst ----\n'
 '\n'
 '\n'
 '\n'
 '\n']


In [30]:
# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"\'", "", sent) for sent in data]

pprint(data[:1])

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: '
 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
 '15 I was wondering if anyone out there could enlighten me on this car I saw '
 'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition, the front bumper was separate from the rest of the body. This is '
 'all I know. If anyone can tellme a model name, engine specs, years of '
 'production, where this car is made, history, or whatever info you have on '
 'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by '
 'your neighborhood Lerxst ---- ']


Tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s `simple_preprocess()` is great for this: it lowercases the text and keeps only alphabetic tokens, so punctuation is dropped. Additionally, I have set `deacc=True` to strip accent marks from the tokens.

In [31]:
# Tokenize words and Clean-up text
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; punctuation is dropped by the tokenizer

data_words = list(sent_to_words(data))
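
To make explicit what the tokenization step does, here is a standard-library approximation of `simple_preprocess`. The names `strip_accents` and `rough_preprocess` are hypothetical, and the length limits of 2 and 15 characters reflect my understanding of Gensim's defaults, so treat this as a sketch rather than the library's implementation.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove accent marks, e.g. 'café' becomes 'cafe' (roughly what deacc=True does)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

def rough_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Lowercase, optionally de-accent, and keep alphabetic tokens of reasonable length."""
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    # Runs of letters (and underscores); digits and punctuation act as separators
    tokens = re.findall(r"[^\W\d]+", text)
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

print(rough_preprocess("Caf\u00e9, 2-door e-mail!", deacc=True))
```

This also explains why `2-door` turns into `door` and `e-mail` into `mail` in the tokenized output: digits and single-letter fragments are discarded.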

In [35]:
utsnlp.print_colored_text(data)
utsnlp.print_colored_text(data_words)

Data Type: a list of strings
From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: 15 I was wondering if anyone out there could enlighten me on this car I saw the other day. It was a 2-door sports car, looked to be from the late 60s/ early 70s. It was called a Bricklin. The doors were really small. In addition, the front bumper was separate from the rest of the body. This is all I know. If anyone can tellme a model name, engine specs, years of production, where this car is made, history, or whatever info you have on this funky looking car, please e-mail. Thanks, - IL ---- brought to you by your neighborhood Lerxst ---- 
From: (Guy Kuo) Subject: SI Clock Poll - Final Call Summary: Final call for SI clock reports Keywords: SI,acceleration,clock,upgrade Article-I.D.: shelley.1qvfo9INNc3s Organization: University of Washington Lines: 11 NNTP-Posting-Host: carson.u.washington.edu A fair

## Creating Bigram (or Trigram) Models

Bigrams are two words that frequently occur together in a document. Trigrams are three words that frequently occur together.

Some examples from our dataset are: ‘front_bumper’, ‘oil_leak’, ‘maryland_college_park’ etc.

Gensim’s `Phrases` model can build and implement bigrams, trigrams, quadgrams and more. The two important arguments to `Phrases` are `min_count` and `threshold`. The higher the values of these parameters, the harder it is for words to be combined into bigrams.

In [42]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neighborhood', 'lerxst']
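
To see why higher `min_count` and `threshold` values produce fewer phrases, it helps to look at the score behind the decision. As far as I understand Gensim's default scorer, a pair of words is merged when `(count_ab - min_count) * vocab_size / (count_a * count_b)` exceeds `threshold`. The sketch below reproduces that arithmetic with made-up counts; the function name `phrase_score` is hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the candidate phrase 'a b' under the assumed default scoring rule."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

threshold = 100
vocab_size = 10_000  # made-up vocabulary size

# Two rare words that almost always appear together
rare_pair = phrase_score(count_a=50, count_b=40, count_ab=30, vocab_size=vocab_size, min_count=5)

# Two very common words that co-occur equally often in absolute terms
common_pair = phrase_score(count_a=5000, count_b=4000, count_ab=30, vocab_size=vocab_size, min_count=5)

print(rare_pair > threshold, common_pair > threshold)
```

Raising `min_count` shrinks the numerator and raising `threshold` raises the bar, so both changes make merging less likely.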


In [43]:
# Apply the models to the data
bigram_data = [bigram_mod[doc] for doc in data_words]
trigram_data = [trigram_mod[bigram_mod[doc]] for doc in data_words]

In [46]:
nDocs = 3
red_color = '\033[91m'  # ANSI escape code for red color
reset_color = '\033[0m'  # ANSI escape code to reset color

# Function to color words with underscores in red
def color_words_with_underscores(words):
    colored_words = []
    for word in words:
        if '_' in word:
            colored_words.append(f"{red_color}{word}{reset_color}")
        else:
            colored_words.append(word)
    return " ".join(colored_words)

# Print the original, bigram, and trigram data for comparison
print("Original:")
for doc in data_words[:nDocs]:
    print(color_words_with_underscores(doc))

print("\nBigram:")
for doc in bigram_data[:nDocs]:
    print(color_words_with_underscores(doc))

print("\nTrigram:")
for doc in trigram_data[:nDocs]:
    print(color_words_with_underscores(doc))


Original:
from wheres my thing subject what car is this nntp posting host rac wam umd edu organization university of maryland college park lines was wondering if anyone out there could enlighten me on this car saw the other day it was door sports car looked to be from the late early it was called bricklin the doors were really small in addition the front bumper was separate from the rest of the body this is all know if anyone can tellme model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please mail thanks il brought to you by your neighborhood lerxst
from guy kuo subject si clock poll final call summary final call for si clock reports keywords si acceleration clock upgrade article shelley qvfo innc organization university of washington lines nntp posting host carson washington edu fair number of brave souls who upgraded their si clock oscillator have shared their experiences for this poll please send brief mess

The bigram model is ready. Let’s define functions to remove the stopwords, make bigrams and lemmatize, and then call them sequentially.

In [47]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    # texts is already tokenized, so filter the tokens directly; a set makes lookups fast
    stop_set = set(stop_words)
    return [[word for word in doc if word not in stop_set] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """Lemmatize each document, keeping only tokens with an allowed POS tag (see https://spacy.io/api/annotation)."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [48]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize the spaCy 'en_core_web_sm' model, disabling the parser and NER components (for efficiency)
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['s', 'thing', 'car', 'nntp_poste', 'host', 'rac_wam', 'university', 'park', 'line', 'wonder', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'door', 'really', 'small', 'addition', 'separate', 'rest', 'body', 'know', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]


## Create the Dictionary and Corpus needed for Topic Modeling

The two main inputs to the LDA topic model are the dictionary(`id2word`) and the `corpus`. 

Let’s create them.

In [49]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 5), (5, 1), (6, 2), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1)]]


Gensim creates a unique `id` for each word in the document. The produced corpus shown above is a mapping of `(word_id, word_frequency)`.

For example, `(0, 1)` above implies, word `id 0` occurs once in the first document. Likewise, word `id 4` occurs 5 times and so on.

In [50]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('addition', 1),
  ('body', 1),
  ('bring', 1),
  ('call', 1),
  ('car', 5),
  ('day', 1),
  ('door', 2),
  ('early', 1),
  ('engine', 1),
  ('enlighten', 1),
  ('funky', 1),
  ('history', 1),
  ('host', 1),
  ('info', 1),
  ('know', 1),
  ('late', 1),
  ('lerxst', 1),
  ('line', 1),
  ('look', 2),
  ('mail', 1),
  ('make', 1),
  ('model', 1),
  ('name', 1),
  ('neighborhood', 1),
  ('nntp_poste', 1),
  ('park', 1),
  ('production', 1),
  ('rac_wam', 1),
  ('really', 1),
  ('rest', 1),
  ('s', 1),
  ('see', 1),
  ('separate', 1),
  ('small', 1),
  ('spec', 1),
  ('sport', 1),
  ('thank', 1),
  ('thing', 1),
  ('university', 1),
  ('wonder', 1),
  ('year', 1)]]
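
The bag-of-words mapping described above can be mimicked in a few lines of plain Python. This is only a sketch of the idea: the helper names `build_vocab` and `to_bow` are hypothetical, and ids are assigned in order of first appearance, which need not match the ids Gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a tokenized document into sorted (word_id, word_frequency) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)  # unknown tokens are skipped
    return sorted(counts.items())

toy_docs = [["car", "door", "car"], ["door", "engine"]]
token2id = build_vocab(toy_docs)

print(to_bow(toy_docs[0], token2id))
```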

## Building the Topic Model
We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.

Apart from that, `alpha` and `eta` are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a *1.0/num_topics* prior. The model below sets `alpha='auto'`, so that prior is learned from the corpus instead.

`chunksize` is the number of documents used in each training chunk, `update_every` determines how often the model parameters are updated, and `passes` is the total number of training passes.

In [51]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)
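
To build some intuition for how a Dirichlet prior such as `alpha` affects sparsity, the sketch below draws topic mixtures from a symmetric Dirichlet with a small and a large concentration value and compares how dominant the largest topic is on average. It uses only the standard library; the helper names, the seed and the value 5.0 are assumptions for illustration and are unrelated to the values the model above learns with `alpha='auto'`.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
        total = sum(draws)
        if total > 0:  # guard against numerical underflow for very small alpha
            return [d / total for d in draws]

def mean_largest_share(alpha, size, rng, trials=500):
    """Average probability mass of the most dominant topic across many sampled mixtures."""
    return sum(max(sample_dirichlet(alpha, size, rng)) for _ in range(trials)) / trials

rng = random.Random(100)
sparse = mean_largest_share(0.05, 20, rng)  # 1.0/num_topics for 20 topics
dense = mean_largest_share(5.0, 20, rng)    # a much larger, smoother prior

print(sparse, dense)
```

A small prior concentrates each document on a few topics, while a large one spreads the mass more evenly across all of them.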

## View the topics in LDA model
The above LDA model is built with 20 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using `lda_model.print_topics()` as shown next.

In [55]:
# Print the top 10 keywords of each of the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.021*"research" + 0.019*"information" + 0.019*"high" + 0.019*"report" + '
  '0.018*"player" + 0.016*"service" + 0.015*"rate" + 0.014*"design" + '
  '0.013*"season" + 0.012*"low"'),
 (1,
  '0.077*"team" + 0.072*"game" + 0.053*"play" + 0.050*"faith" + 0.049*"win" + '
  '0.031*"belief" + 0.025*"atheist" + 0.025*"year" + 0.018*"wing" + '
  '0.018*"score"'),
 (2,
  '0.106*"space" + 0.029*"notice" + 0.029*"launch" + 0.026*"earth" + '
  '0.024*"mission" + 0.024*"orbit" + 0.023*"external" + 0.020*"vehicle" + '
  '0.019*"satellite" + 0.019*"door"'),
 (3,
  '0.022*"say" + 0.019*"people" + 0.017*"reason" + 0.017*"believe" + '
  '0.015*"evidence" + 0.014*"mean" + 0.012*"point" + 0.012*"question" + '
  '0.011*"many" + 0.010*"claim"'),
 (4,
  '0.078*"book" + 0.044*"science" + 0.042*"reference" + 0.036*"pin" + '
  '0.032*"section" + 0.025*"faq" + 0.024*"author" + 0.023*"copy" + '
  '0.023*"reality" + 0.022*"internal"'),
 (5,
  '0.065*"cost" + 0.059*"model" + 0.039*"character" + 0.036*"pictur

### How to interpret this?

Topic 0, for example, is represented below. It means that the top keywords that contribute to this topic are as below and the weights reflect how important a keyword is to that topic.

Looking at these keywords, can you guess what this topic could be? 

In [61]:
topicID=0
pprint(lda_model.print_topics(num_words=10)[topicID])

(0,
 '0.021*"research" + 0.019*"information" + 0.019*"high" + 0.019*"report" + '
 '0.018*"player" + 0.016*"service" + 0.015*"rate" + 0.014*"design" + '
 '0.013*"season" + 0.012*"low"')
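
If you want to work with these weights programmatically, the formatted string can be split back into `(keyword, weight)` pairs. The sketch below parses the Topic 0 string shown above with a regular expression; `parse_topic` is a hypothetical helper, and calling `lda_model.show_topic(0)` on the trained model should give you such pairs directly.

```python
import re

# One term looks like: 0.021*"research"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a print_topics()-style string into a list of (keyword, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

topic0 = (
    '0.021*"research" + 0.019*"information" + 0.019*"high" + 0.019*"report" + '
    '0.018*"player" + 0.016*"service" + 0.015*"rate" + 0.014*"design" + '
    '0.013*"season" + 0.012*"low"'
)

pairs = parse_topic(topic0)
print(pairs[:3])
```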


In [66]:
# Get the number of topics from the LDA model
num_topics = lda_model.num_topics
num_topics

20

In [70]:
def get_topic_labels(lda_model, num_topics, topn=2):
    topic_labels = []
    for i in range(num_topics):
        # Get the top `topn` words for each topic
        top_words = lda_model.show_topic(i, topn=topn)
        # Combine the top words to form a label
        label = ' & '.join([word for word, _ in top_words])
        topic_labels.append(label)
    return topic_labels

# Get labels for all topics
labels = get_topic_labels(lda_model, num_topics=num_topics, topn=3)
labels


['research & information & high',
 'team & game & play',
 'space & notice & launch',
 'say & people & reason',
 'book & science & reference',
 'cost & model & character',
 'system & use & program',
 'moral & property & serial',
 'window & monitor & normal',
 'child & church & woman',
 'ax & physical & graphic',
 'line & organization & write',
 'plane & hi & subscription',
 'people & state & gun',
 'drug & film & movie',
 'box & club & modem',
 'drive & car & bike',
 'patient & disease & scientific',
 'get & go & good',
 'key & test & public']

In [75]:
# Print all topics with their labels
red_color = '\033[91m'  # ANSI escape code for red color
reset_color = '\033[0m'  # ANSI escape code to reset color

for i in range(num_topics):
    topic = lda_model.print_topic(i, topn=10)  # formatted keyword string for topic i
    print(f"Topic {i} ({red_color}{labels[i]}{reset_color}):", topic)

Topic 0 (research & information & high): 0.021*"research" + 0.019*"information" + 0.019*"high" + 0.019*"report" + 0.018*"player" + 0.016*"service" + 0.015*"rate" + 0.014*"design" + 0.013*"season" + 0.012*"low"
Topic 1 (team & game & play): 0.077*"team" + 0.072*"game" + 0.053*"play" + 0.050*"faith" + 0.049*"win" + 0.031*"belief" + 0.025*"atheist" + 0.025*"year" + 0.018*"wing" + 0.018*"score"
Topic 2 (space & notice & launch): 0.106*"space" + 0.029*"notice" + 0.029*"launch" + 0.026*"earth" + 0.024*"mission" + 0.024*"orbit" + 0.023*"external" + 0.020*"vehicle" + 0.019*"satellite" + 0.019*"door"
Topic 3 (say & people & reason): 0.022*"say" + 0.019*"people" + 0.017*"reason" + 0.017*"believe" + 0.015*"evidence" + 0.014*"mean" + 0.012*"point" + 0.012*"question" + 0.011*"many" + 0.010*"claim"
Topic 4 (book & science & reference): 0.078*"book" + 0.044*"science" + 0.042*"reference" + 0.036*"pin" + 0.032*"section" + 0.025*"faq" + 0.024*"author" + 0.023

## Compute Model Perplexity and Coherence Score

Model perplexity and [topic coherence](https://rare-technologies.com/what-is-topic-coherence/) provide a convenient measure to judge how good a given topic model is. In my experience, topic coherence score, in particular, has been more helpful.

In [76]:
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # per-word log-likelihood bound; perplexity is 2**(-bound), so a bound closer to 0 is better

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -13.32460930278773

Coherence Score:  0.483541481988623
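
The value printed as "Perplexity" above is negative because, as far as I understand the Gensim API, `log_perplexity` returns a per-word log-likelihood bound rather than the perplexity itself; the perplexity is then `2 ** (-bound)`. A minimal sketch of the conversion, using the number reported above:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into a perplexity value."""
    return 2 ** (-per_word_bound)

bound = -13.32460930278773  # value printed by the cell above
perplexity = perplexity_from_bound(bound)
print(round(perplexity))
```

A bound closer to zero therefore corresponds to a lower perplexity, which is the direction you want when comparing models on the same corpus.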


## Visualize the topics-keywords

Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. The pyLDAvis package’s interactive chart is an excellent tool for this and is designed to work well with Jupyter notebooks.

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
pyLDAvis.show(vis, local=False) # or you can simply run 'vis' for in-notebook view

In [None]:
# Save to an HTML file
pyLDAvis.save_html(vis, 'lda_topic_visualization.html')

### Interpret pyLDAvis’s output

Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, **non-overlapping bubbles** scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics will typically have many overlapping, small bubbles clustered in one region of the chart.

If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.
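
Topic prevalence, which the bubble sizes reflect, can be approximated outside the chart as well. The sketch below weights each document's topic distribution by the document's length and normalizes the totals, which is roughly how I understand the proportions to be derived. The numbers are made up and `topic_prevalence` is a hypothetical helper; with a trained model, the distributions would come from the document-topic output and the lengths from the corpus.

```python
def topic_prevalence(doc_topics, doc_lengths):
    """Share of all tokens attributed to each topic.

    doc_topics: one topic distribution per document (each row sums to 1).
    doc_lengths: number of tokens in each document.
    """
    num_topics = len(doc_topics[0])
    totals = [0.0] * num_topics
    for dist, length in zip(doc_topics, doc_lengths):
        for k, prob in enumerate(dist):
            totals[k] += prob * length  # expected number of tokens from topic k
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

# Made-up document-topic distributions and document lengths
doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_lengths = [100, 50, 50]

shares = topic_prevalence(doc_topics, doc_lengths)
print(shares)
```

Multiplying the shares by 100 gives the percentage contribution of each topic to the corpus.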