<img src="data/images/div/lecture-notebook-header.png" />

# Topic Modeling with Latent Dirichlet Allocation (LDA)

Topic modeling is a technique used in natural language processing (NLP) and machine learning to identify topics or themes within a collection of documents. It's an unsupervised learning method that aims to uncover hidden patterns or structures within a set of texts.

The primary goal of topic modeling is to automatically analyze and extract meaningful topics from a large corpus of documents without needing prior annotations or labels. One of the most popular algorithms for topic modeling is Latent Dirichlet Allocation (LDA). LDA assumes that each document is a mixture of various topics, and each word in the document is attributable to one of those topics.
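
To make the "mixture of topics" assumption more concrete, the following is a minimal sketch of the generative story that LDA assumes, written with NumPy only. The vocabulary, the two topic-word distributions, and the function name `generate_document` are hypothetical and chosen purely for illustration; they are not part of the model fitted later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 6-word vocabulary
vocab = np.array(["cat", "dog", "pet", "python", "code", "text"])

# Each row is one topic's distribution over the vocabulary (rows sum to 1)
topic_word = np.array([
    [0.40, 0.40, 0.20, 0.00, 0.00, 0.00],   # "pets" topic
    [0.00, 0.00, 0.00, 0.40, 0.30, 0.30],   # "programming" topic
])

def generate_document(num_words, alpha=(0.5, 0.5)):
    # 1) Draw the document's topic mixture from a Dirichlet prior
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(num_words):
        # 2) Pick a topic for this word position according to the mixture
        z = rng.choice(len(theta), p=theta)
        # 3) Pick a word from the chosen topic's word distribution
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(np.round(theta, 2), words)
```

Fitting LDA can be read as running this story in reverse: given only the observed words, the algorithm estimates plausible topic mixtures per document and word distributions per topic.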

Here is a simplified view of how it works:

* **Preprocessing:** Text data is cleaned, tokenized, and prepared for analysis by removing stop words, punctuation, and other irrelevant information.

* **Vectorization:** Documents are represented as numerical vectors, often using techniques like the bag-of-words model or TF-IDF (Term Frequency-Inverse Document Frequency). LDA itself operates on raw word counts.

* **Topic Modeling:** Algorithms like LDA are applied to these numerical representations to identify underlying topics based on the co-occurrence of words across documents. These topics are represented as a distribution of words.

* **Interpretation:** Once topics are identified, analysts or researchers interpret and label these topics based on the most frequent or representative words within each topic.

Topic modeling finds applications in various fields like information retrieval, content recommendation, sentiment analysis, and understanding trends in large text datasets, helping to organize, summarize, and make sense of extensive textual information.

## Setting up the Notebook

### Import all important packages

In [None]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

from tqdm import tqdm
from src.plotutil import show_wordcloud

import spacy
# Load English language model (if missing, check out: https://spacy.io/models/en)
nlp = spacy.load('en_core_web_md')  

--- 

## Working with Toy Data

### Definition of Toy Dataset

For this simple example, we define our corpus as a list of documents. Each document is only a single sentence to keep the example easy to follow. Naturally, a document may contain a large number of sentences. You will notice that this toy dataset includes two main topics: "pets, cats, dogs" and "programming, python". We will see how this observation will be reflected in the result later on.

In [None]:
documents = ["Cats and dogs are both domesticated animals.",
             "The domestication of dogs started 10,000 years ago.",
             "Dogs were easier to domesticate than cats.",
             "Some people have a dog and a cat (or several dogs and cats) as pets.",
             "The domestication of animals was an important part of human progress.",
             "Python is a programming language that is easy to learn",
             "Python makes text processing rather easy.",
             "A lot of programming languages support text analysis.",
             "Programming in Python makes the analysis of text easy",
             "NLTK is a great NLP package for Python."]

### Preprocessing

LDA assumes bags of words as input, not sequences of words. It also strongly benefits from normalization since, for example, the capitalization of words, the tense of verbs, or the plurality of nouns arguably do not affect the topic of a document. Let's therefore use spaCy to perform tokenization, case-folding, and lemmatization. We also remove all stopwords and punctuation marks.


In [None]:
processed_documents = []

for doc in documents:
    doc = nlp(doc)
    processed_documents.append(' '.join([t.lemma_.lower() for t in doc if not t.is_stop and not t.is_punct]))

# Print the processed documents
for doc in processed_documents:
    print(doc)

### Generate Term-Document Matrix (TDM)

In practice, we often limit the number of considered words (i.e., our vocabulary) to the most frequent words. Of course, for the very small toy dataset, this is not really needed.

In [None]:
num_words = 1000 # Top 1000 words

The `CountVectorizer` is, among other vectorizers, a handy and flexible way to generate a document-term matrix. More specifically, each value in the matrix represents how often a term $t_i$ occurs in document $d_j$. In contrast to TF-IDF values, LDA requires only the word counts without any additional weights.

The [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class of `scikit-learn` accepts a wide range of input parameters to configure the generation of the document-term matrix. In this example, we use the following:

- `max_df`: Upper bound on the number of documents a word may occur in, given either in relative terms (a float, as a proportion of all documents) or in absolute terms (an integer). This allows us to ignore words that are very COMMON across all documents and therefore not very discriminative.
- `min_df`: Lower bound on the number of documents a word has to occur in, again given either in relative or in absolute terms. This allows us to ignore words that are very RARE across all documents and therefore say little about topics.
- `max_features`: If not `None`, the vocabulary is limited to the words with the highest counts (term frequencies) across the whole corpus.
- `stop_words`: If not `None`, specifies the list of stop words to be removed from each document (not really necessary if stop words are removed during preprocessing).
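
The effect of `min_df` and `max_df` is easiest to see on a tiny example. The sketch below uses a hypothetical four-document mini-corpus (not the toy dataset from above) in which the word "data" occurs in every document and the word "rare" in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus: "data" occurs in every document, "rare" in only one
docs = [
    "data cat dog",
    "data cat python",
    "data dog python",
    "data python rare",
]

# No filtering: every token is kept
full = CountVectorizer().fit(docs)
print(sorted(full.vocabulary_))      # ['cat', 'data', 'dog', 'python', 'rare']

# min_df=2 drops terms found in fewer than 2 documents ("rare");
# max_df=0.95 drops terms found in more than 95% of the documents ("data")
filtered = CountVectorizer(min_df=2, max_df=0.95).fit(docs)
print(sorted(filtered.vocabulary_))  # ['cat', 'dog', 'python']
```

Both thresholds refer to the number of documents a term occurs in, not to how often it occurs overall.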


In [None]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=num_words)

tf_tdm = tf_vectorizer.fit_transform(processed_documents)

When we look at the vocabulary, we notice that we are missing several non-stopwords. This is because we have removed all words that appear in only one sentence (e.g., *"year"*, *"processing"*); notice the parameter `min_df=2` in the code cell above. Again, it is common practice to ignore very rare words, even if they would be in the list of the `num_words` most frequent words.

In [None]:
vocabulary = tf_vectorizer.get_feature_names_out()

print(vocabulary)

### Visualize Term-Document Matrix

Just for illustrative purposes, let's print the term-document matrix. This is only meaningful for the toy dataset, but it highlights the effects of the different preprocessing options even before performing LDA.


In [None]:
# Rows are terms and columns are documents (hence the transpose); one column label per document
pd.DataFrame(tf_tdm.toarray().T, index=list(vocabulary), columns=['d{}'.format(c) for c in range(1, tf_tdm.shape[0]+1)])

### Perform LDA

First, we need to set the number of topics. In practice, this number is not known a priori. For our toy example, we expect 2 main topics. You can still change the value and then interpret and compare the different results.


In [None]:
num_topics = 2

lda = LatentDirichletAllocation(n_components=num_topics, max_iter=100, learning_method='online', learning_offset=50.,random_state=0).fit(tf_tdm)

The values in `lda.components_` are not probabilities, i.e., the values of a topic do not sum up to 1. In most cases, this is not a problem, since not the absolute values but the relative differences between them are what matters. However, for illustrative purposes, we can normalize the values of each topic to proper probabilities.

In [None]:
lda.components_ /= lda.components_.sum(axis=1)[:, np.newaxis]
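
As a variation, the normalization can also be done on a copy so that the fitted model keeps its original values. The sketch below assumes a hypothetical 4x4 count matrix and the variable names `X`, `model`, and `topic_word`, none of which appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix (rows: documents, columns: terms)
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
])

model = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Divide each topic (row) by its row sum; keepdims=True keeps the sums as a
# column vector so that broadcasting works, and model.components_ is untouched
topic_word = model.components_ / model.components_.sum(axis=1, keepdims=True)

print(topic_word.sum(axis=1))   # every topic now sums to 1
```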

### Evaluating the results

#### Show distribution of words for topics

`display_topics()` is just a utility function to display the results. For each topic, it ranks all words with respect to their probabilities and lists the top *N* words. Again, for our small toy dataset with its small vocabulary, we can easily print all the words.

In [None]:
def display_topics(model, feature_names, num_top_features):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic {}".format(topic_idx))
        # Indices of the top words, sorted from highest to lowest weight
        for feature_idx in topic.argsort()[:-num_top_features-1:-1]:
            print("\t{0:20} {1}".format(feature_names[feature_idx], topic[feature_idx]))
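
The slice `[:-num_top_features-1:-1]` used in `display_topics()` is compact but easy to misread. The following small sketch, with a made-up weight vector, shows what it selects.

```python
import numpy as np

weights = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
n = 3

# argsort() returns the indices in ascending order of value
ascending = weights.argsort()
print(ascending)             # [2 0 4 3 1]

# The slice [:-n-1:-1] walks backwards from the last index and stops after n items
top_n = ascending[:-n-1:-1]
print(top_n)                 # [1 3 4], the indices of the 3 largest weights

# Equivalent and arguably more readable: reverse first, then take the first n
same = weights.argsort()[::-1][:n]
print((top_n == same).all()) # True
```

If `n` exceeds the length of the array, the slice simply returns all indices in descending order of weight.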

Let's apply this method to our LDA result:

In [None]:
display_topics(lda, vocabulary, num_words)

#### Show which document belongs to which topic

The method `transform()` takes as input a document-term matrix X and returns the topic distribution of each document in X.

In [None]:
doc_topic = lda.transform(tf_tdm)

The function `display_documents()` shows the topic for each document. To this end, it picks the topic with the highest probability. Recall that each document is a distribution over all topics.

In [None]:
def display_documents(document_topic_matrix, max_documents=10):
    num_documents = document_topic_matrix.shape[0]    # Get the number of documents
    for n in range(min(num_documents,max_documents)): # Never show more than #max_documents documents
        topic_distribution = document_topic_matrix[n] # List of probabilities, e.g., [0.032, 0.233, 0.001, ...]
        topic_most_pr = topic_distribution.argmax()   # Pick the list index with the highest probability
        print("doc: {}   topic: {}".format(n,topic_most_pr))

Now let's see the results for the toy example:

In [None]:
display_documents(doc_topic)

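The topic assignment should be in line with our expectations: ideally, documents 0 to 4 (pets) share one topic and documents 5 to 9 (programming) share the other. Which of the two themes ends up as topic 0 or topic 1 is arbitrary.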
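
Picking only the most probable topic hides how confident the model is about a document. The sketch below, again on a hypothetical 4x4 count matrix with illustrative variable names, prints the probability of the dominant topic alongside its index.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix (rows: documents, columns: terms)
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
])

model = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topic_demo = model.transform(X)        # shape: (n_documents, n_topics)

# Each row is a probability distribution over the topics
print(doc_topic_demo.sum(axis=1))

dominant = doc_topic_demo.argmax(axis=1)   # most probable topic per document
confidence = doc_topic_demo.max(axis=1)    # probability of that topic

for n, (t, p) in enumerate(zip(dominant, confidence)):
    print("doc: {}   topic: {}   p: {:.2f}".format(n, t, p))
```

A document whose highest probability is close to 1/`num_topics` does not clearly belong to any single topic.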

#### Visualize distribution of words for topics using word clouds

Particularly for larger datasets and larger vocabularies, topics are best visualized using word clouds. Here, the size of a word reflects its probability within a topic. The function `show_wordcloud()` in the file `src/plotutil.py` handles this for us. The code cell below goes through all identified topics and generates their respective word clouds.

In [None]:
for topic in range(num_topics):
    feature_distribution = lda.components_[topic]
    # Create dictionary of word frequencies as input for wordcloud package
    word_frequencies = { vocabulary[idx]:prob for idx, prob in enumerate(feature_distribution) }
    show_wordcloud(word_frequencies)

---

## Application use case: news article headlines

In this example, we apply LDA to a list of 12,394 news article headlines from TechCrunch (https://techcrunch.com/). This dataset is publicly available on Kaggle (https://www.kaggle.com/); see the full details [here](https://www.kaggle.com/PromptCloudHQ/titles-by-techcrunch-and-venturebeat-in-2017). For convenience, we have already downloaded the dataset as a CSV file.

### Load news article headlines from CSV file

As usual, we use `pandas` to read structured files such as CSV files.

In [None]:
df = pd.read_csv('data/datasets/techcrunch/news-article-headlines-techcrunch.csv', encoding = "ISO-8859-1")

# Remove rows where the title is NaN to avoid any errors later on
df = df[pd.notnull(df['title'])]

# Extract the list of headlines from the data frame
news_headlines = df['title'].tolist()

# Print the first 5 headlines
for idx in range(5):
    print(news_headlines[idx])

### Preprocess all Headlines

As usual, we first preprocess all news article headlines by tokenizing, lemmatizing, and case-folding all words, as well as removing all stopwords and punctuation marks.

In [None]:
processed_news_headlines = []

for doc in tqdm(news_headlines):
    doc = nlp(doc)
    processed_news_headlines.append(' '.join([t.lemma_.lower() for t in doc if not t.is_stop and not t.is_punct]))

# Print the first 5 processed documents
for doc in processed_news_headlines[:5]:
    print(doc)

### Generate Term-Document Matrix

The dataset is now ready to compute the term-document matrix with all the term/word counts needed to run LDA. Since we are performing the same steps as for the toy dataset, we skip a more detailed discussion here.


In [None]:
num_words = 1000 # Top 1000 words

tf_vectorizer_news_headlines = CountVectorizer(max_df=0.95, min_df=5, max_features=num_words, stop_words='english')

tf_news_headlines = tf_vectorizer_news_headlines.fit_transform(processed_news_headlines)

vocabulary_news_headlines = tf_vectorizer_news_headlines.get_feature_names_out()

print("Size of vocabulary:", len(vocabulary_news_headlines))

### Perform LDA

Since these are 12k+ documents, setting the number of topics to 2 is usually not very meaningful. There are no straightforward rules on how to set this number. A common approach is to start with a value such as 20, inspect the results, and potentially repeat this step with different values.
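
One rough way to support such a comparison is to compute the perplexity on held-out documents for several candidate values. The following is only a sketch on synthetic counts (hypothetical data with three built-in word groups, not the TechCrunch headlines), and the candidate values are arbitrary. Perplexity does not always agree with human judgment, so inspecting the topics remains important.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Hypothetical synthetic counts: 60 documents over 12 terms in 3 groups of 4 terms
groups = np.repeat(np.arange(3), 20)              # group id of each document
X = rng.poisson(0.2, size=(60, 12))               # low background noise
for d, g in enumerate(groups):
    X[d, 4*g:4*g+4] += rng.poisson(3.0, size=4)   # terms typical for the group

# Hold out a quarter of the documents for evaluation
idx = rng.permutation(60)
train, test = X[idx[:45]], X[idx[45:]]

perplexities = {}
for k in (2, 3, 5, 8):
    model = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    perplexities[k] = model.perplexity(test)      # lower is better

for k, p in perplexities.items():
    print("topics: {:2d}   held-out perplexity: {:.1f}".format(k, p))
```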

**Note:** This will now take several seconds or even minutes, but is still manageable. If you have (really) large data, it is recommended to first apply LDA to a sample to check that everything works (no errors) and that the results "look" meaningful.


In [None]:
%%time

num_topics = 20

lda_news_headlines = LatentDirichletAllocation(n_components=num_topics, max_iter=100, learning_method='online', learning_offset=50.,random_state=0).fit(tf_news_headlines)

### Visualize Results

Given the dataset size, we inspect the result by directly looking at the word clouds. The code cell below performs exactly the same steps as for the toy dataset. The only difference is that, with the larger number of topics and the much larger vocabulary, we now plot more word clouds, and each word cloud contains more words.

In [None]:
word_frequencies = {}

for topic in range(num_topics):
    feature_distribution = lda_news_headlines.components_[topic]
    # Create dictionary of word frequencies as input for wordcloud package
    word_frequencies = { vocabulary_news_headlines[idx]:prob for idx, prob in enumerate(feature_distribution) }
    show_wordcloud(word_frequencies, max_words=50)

---

## Summary

Topic modeling, a key technique in natural language processing, is employed to uncover latent themes or topics within a collection of documents without requiring prior labeling. Among the prominent algorithms, Latent Dirichlet Allocation (LDA) stands out for its ability to identify these hidden topics. LDA operates under the assumption that each document comprises a blend of various topics, and each word within a document is linked to one of these topics.

The process of topic modeling involves several stages. Initially, text data undergoes preprocessing, including tasks such as cleaning, tokenization, and removing irrelevant elements like stop words. Subsequently, documents are transformed into numerical vectors using methods like the bag-of-words model or TF-IDF; LDA, as used in this notebook, takes raw word counts as input. LDA is then applied to these vectors to detect underlying topics by examining word co-occurrences across documents. These identified topics are represented as distributions of words.

The utility of topic modeling extends across diverse domains such as information retrieval, content recommendation systems, sentiment analysis, and the extraction of trends from extensive text datasets. By organizing and summarizing textual information, topic modeling aids in understanding the underlying themes present within large volumes of text, facilitating easier navigation and interpretation of complex textual data.