<img src='images/header.png' style='height: 50px; float: left'>

## Introduction to Computational Social Science methods with Python

# Session D4. Topic modeling


**Topic modeling** is a technique in natural language processing (NLP) used to uncover underlying themes in a large corpus of text. The goal is to identify the main topics in a collection of documents and to extract the words and phrases most strongly associated with each topic. There are several popular topic modeling algorithms, including **Latent Dirichlet Allocation (LDA)** and Non-Negative Matrix Factorization (NMF). These algorithms are **unsupervised**, meaning that they do not require labeled data. Instead, they rely on the assumption that words that frequently occur together in documents are likely to be associated with the same topic.

Topic modeling typically involves the following steps:

- Preprocessing: the text data is cleaned to remove unnecessary information such as stop words, punctuation, and special characters.
- Vectorization: the preprocessed text data is converted into a numerical representation, such as a bag-of-words or TF-IDF (Term Frequency-Inverse Document Frequency) matrix.
- Model training: a topic modeling algorithm is applied to the vectorized text data to identify the main topics. This includes identifying the words or phrases most strongly associated with each topic.
- Topic interpretation: the resulting topics are interpreted by examining the most representative words or phrases of each topic, either manually or with automated techniques that summarize the topics.
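The first two steps above can be sketched in plain Python. This is a minimal illustration, not the preprocessing used in Session C3: the toy documents, the stop-word list, and the helper `preprocess` are made up for this example, and the TF-IDF variant shown (`tf * log(N / df)`) is only one of several common weighting schemes.

```python
import math
import re
from collections import Counter

# Toy corpus and stop-word list (hypothetical, for illustration only)
docs = [
    "The president gave a speech about the election.",
    "Parents and kids love the new toy.",
    "The election speech was about kids and parents.",
]
stop_words = {"the", "a", "and", "about", "was"}

def preprocess(text):
    """Lowercase the text, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

tokenized = [preprocess(d) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Bag-of-words: one row per document, one column per vocabulary term
counts = [Counter(doc) for doc in tokenized]
bow = [[c[term] for term in vocab] for c in counts]

# TF-IDF with idf = log(N / df); a term present in every document gets weight 0
n_docs = len(docs)
df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
tfidf = [[row[j] * math.log(n_docs / df[j]) for j in range(len(vocab))]
         for row in bow]

print(vocab)
print(bow[0])
```

Terms that occur in a single document (such as "gave") receive a higher weight than terms shared by several documents (such as "election").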

Before turning to LDA, we look at Latent Semantic Analysis (LSA), which is based on Truncated SVD (introduced in the previous session). LSA performs a matrix factorization of a term-document matrix to extract latent semantic features. However, unlike LDA, LSA does not explicitly model the generative process of how documents are created or the distribution of topics within documents.



## D4.1 Latent Semantic Analysis (LSA)

A **latent topic** is a hidden thematic concept or underlying theme within a collection of documents. It represents a high-level abstract idea or subject that is not explicitly stated but can be inferred from the patterns of word usage within the documents. Latent topics provide a way to group related documents based on shared content.

Dimensionality reduction techniques, such as Latent Semantic Analysis (LSA), can help discover topics by reducing the dimensionality of the document-term matrix. LSA utilizes matrix factorization methods, such as Truncated SVD (Singular Value Decomposition), to transform the original high-dimensional space of the document-term matrix into a lower-dimensional space. This reduction allows for the extraction of latent semantic features and the identification of hidden patterns that correspond to different topics. By capturing the most informative components, LSA can reveal the underlying topics in a more compact representation.

In LSA, documents are represented in a document-term matrix, where each row corresponds to a document and each column represents a unique term in the entire corpus. The matrix contains term frequencies or other term weighting schemes, such as TF-IDF, as the values in the cells. This matrix captures the frequency or importance of each term in each document, providing a numerical representation of the document's content.


<img src='images/Topic_modeling_.svg' >


LSA involves a series of four steps, with the second and third being particularly critical and challenging to grasp. The steps are outlined below:

1. Acquire raw text data.
2. Create a document-term matrix.
3. Apply Singular Value Decomposition (SVD).
4. Explore the data encoded with topics.

We covered the first two steps in Session C3, and we continue here with the data produced in that session.

Next, we carry out the singular value decomposition, using either the `TruncatedSVD` model from scikit-learn or `LsiModel` from gensim.

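**Singular value decomposition (SVD)** is a matrix factorization technique. In the notation used below, it is applied to a term-document matrix *C* of size *m x n*, in which each of the *m* rows corresponds to a unique term and each of the *n* columns holds the term weights of one document (this is simply the transpose of the document-term matrix described above). Truncating the decomposition reduces the number of dimensions used to describe terms and documents while preserving the general structure of the data.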

In mathematics, SVD is a method that factorizes any matrix *C* into the product of three matrices:

$$
C = U S V^T
$$

where *U* and *V* have orthonormal columns and *S* is a diagonal matrix containing the non-negative singular values.

<img src='images/SVD.svg' >

SVD plays a crucial role in reducing dimensionality by selecting the k largest singular values and retaining only the first k columns of the U and V matrices. Here, k represents a hyperparameter that can be adjusted to determine the desired number of topics to extract.

<img src='images/SVD_trunkated.svg' >


Intuitively, this can be understood as keeping only the k most significant dimensions in the transformed space.
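The truncation can be illustrated with NumPy on a small matrix. This is a sketch under stated assumptions: the matrix `C` below is made up, and we use `numpy.linalg.svd` directly rather than the gensim or scikit-learn models used in the rest of the session.

```python
import numpy as np

# Toy term-document matrix C (m = 5 terms, n = 4 documents); values are made up
C = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
    [1.0, 0.0, 0.0, 1.0],
])

# Thin SVD: C = U @ diag(s) @ Vt, with singular values in decreasing order
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2  # number of latent dimensions ("topics") to keep
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Rank-k approximation of C built from the truncated factors
C_k = U_k @ np.diag(s_k) @ V_k.T

print(U_k.shape, V_k.shape)  # (5, 2) (4, 2)
print(np.linalg.norm(C - C_k))  # approximation error (Frobenius norm)
```

The approximation error depends only on the singular values that were discarded, which is why keeping the k largest ones loses the least information.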

In this context, the matrix *U (m x k)* serves as the term-topic matrix, while the matrix *V (n x k)* represents the document-topic matrix. In both matrices, each column corresponds to one of the k topics.

$U$: Rows in this matrix represent term vectors expressed in terms of topics.

$V$: Rows in this matrix represent document vectors expressed in terms of topics.

Each row in the $V_k$ matrix (document-topic matrix) represents the vector representation of a specific document. The length of these vectors is equal to k, which corresponds to the desired number of topics. Similarly, the vector representation for terms in the data can be found in the $U_k$ matrix (term-topic matrix).

By utilizing SVD, we obtain vector representations for every document and term in our data, each of length k. One important application of these vectors is finding similar words and documents using cosine similarity. To compare two documents, we compute the cosine of the angle between their vectors (two rows of $V_k$), which equals the dot product of the two vectors after normalizing them to unit length. A value close to 1 indicates high similarity, while a value close to 0 indicates that the documents are largely unrelated. Because LSA components can be negative, the cosine can also be negative, which indicates vectors pointing in opposite directions.
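As a minimal sketch, cosine similarity can be computed with NumPy as follows. The document vectors are hypothetical values standing in for rows of $V_k$ with k = 2, and `cosine_similarity` is a helper defined here for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (dot product of the unit vectors)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical document vectors in a k = 2 latent space; values are made up
doc_vectors = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
])

sim_01 = cosine_similarity(doc_vectors[0], doc_vectors[1])
sim_02 = cosine_similarity(doc_vectors[0], doc_vectors[2])
print(sim_01, sim_02)
```

The first two documents point in almost the same direction and obtain a similarity close to 1, whereas the first and third documents are much less similar.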

Similar to principal component analysis (PCA), the singular value decomposition allows for encoding the original dataset with these latent features through latent semantic analysis, resulting in a reduction in dimensionality. These latent features correspond to the underlying topics present in the original text data.


Here we present an example of LSA applied to a set of documents. We will use the news articles that we preprocessed in Session C3. 
Let's import the TF-IDF matrix and other data that we previously extracted from this corpus.



In [1]:
import pandas as pd
import pickle as pkl

# import raw dataset 
news = pd.read_csv("../data/news/news_subset.csv")

# import also the dictionary, the preprocessed corpus,  and TF-IDF matrices
with open("../data/news/tf_idf_gensim.pkl", "rb") as file:
    corpus_tfidf = pkl.load(file)

with open("../data/news/dict_gensim.pkl", "rb") as file:
    dct = pkl.load(file)

with open("../data/news/corpus.pkl", "rb") as file:
    corpus = pkl.load(file)
    
with open("../data/news/document_term_matrix.pkl", "rb") as file:
    corpus_bow = pkl.load(file)

Now we import the LSI model from the gensim library. *Note: Latent Semantic Indexing (LSI) is a synonym of Latent Semantic Analysis (LSA).* 

We use the TF-IDF weighted corpus as input and specify the number of topics (*num_topics*) to extract.

In [25]:
from gensim.models.lsimodel import LsiModel

lsi_model = LsiModel(corpus=corpus_tfidf, id2word=dct, num_topics=10, chunksize=100)

Let's inspect which words contribute to which topic:

In [26]:
# num_words (int): the number of words to include per topic (ordered by significance)
lsi_model.print_topics()

[(0,
  '0.157*"trump" + 0.157*"time" + 0.147*"day" + 0.138*"new" + 0.137*"like" + 0.135*"peopl" + 0.133*"want" + 0.127*"way" + 0.124*"year" + 0.124*"thing"'),
 (1,
  '0.560*"trump" + 0.289*"donald_trump" + 0.276*"presid" + 0.159*"republican" + 0.143*"democrat" + 0.120*"said" + -0.115*"kid" + -0.111*"love" + -0.110*"parent" + -0.108*"photo"'),
 (2,
  '0.477*"photo" + 0.302*"wed" + -0.266*"parent" + -0.229*"kid" + 0.184*"look" + -0.175*"children" + 0.126*"dress" + -0.122*"need" + -0.114*"peopl" + -0.112*"life"'),
 (3,
  '0.344*"kid" + -0.332*"women" + 0.328*"parent" + 0.266*"trump" + -0.206*"year" + 0.170*"day" + 0.160*"mom" + 0.159*"children" + 0.154*"photo" + -0.150*"peopl"'),
 (4,
  '-0.300*"parent" + -0.267*"women" + 0.260*"know" + 0.253*"want" + -0.231*"new" + -0.223*"children" + 0.211*"dont" + -0.197*"year" + -0.180*"divorc" + 0.175*"thing"'),
 (5,
  '0.642*"day" + 0.204*"year" + -0.201*"women" + -0.190*"want" + -0.170*"love" + -0.158*"like" + -0.147*"photo" + 0.140*"mother" + -0.1

According to LSA, "trump", "time", "day" and "new" are related words that contribute to the first topic (*topic 0*). The second topic (*topic 1*) is also dominated by "trump", together with "donald_trump", "presid", "republican" and "democrat". 

To see which documents are related to which topic, we apply the trained model to a corpus in bag-of-words format (here, the TF-IDF weighted corpus), which projects every document into the topic space:

In [27]:
doc_lsi = lsi_model[corpus_tfidf]

Then we can print the weight of each topic for a document, together with the list of stems extracted from its text. Note that LSA weights are not probabilities: they can be negative and do not sum to 1. In the next example we print the topic weights for documents 27 and 28 of the corpus:

In [45]:
for doc, as_text in zip(doc_lsi[27:29], corpus[27:29]):
    print(doc, as_text)

[(0, 0.08913494033177832), (1, 0.18106540018757797), (2, 0.008162918400419968), (3, 0.04806489019320761), (4, -0.03402112025261824), (5, 0.048260170739779895), (6, 0.03567352261405183), (7, 0.10136627686159357), (8, 0.01568601952270656), (9, 0.0023899220099628236)] ['robert_muel', 'found', 'witch', 'trump', 'claim', 'hunt', 'special_counsel', 'spell', 'link', 'trump', 'campaign', 'russian', 'militari', 'intellig', 'new', 'court', 'file']
[(0, 0.07811084319524991), (1, -0.04712993749360177), (2, -0.07222390342619596), (3, 0.09505903055865207), (4, -0.039594703426103016), (5, -0.02955136109266649), (6, -0.029525610238191134), (7, -0.014217144149449416), (8, -0.0018883849185777606), (9, 0.012510356011972734)] ['littl', 'hack', 'parent', 'easier', 'bath', 'child', 'laundri', 'basket', 'toy', 'dont', 'float', 'away', 'read', 'buzzfe']


Here, the first list in each row gives the weight of the document on each of the ten topics. 
We can see that the first document is most strongly related to the second topic (*topic 1*), while the second document is most strongly related to *topic 3*. 


Queries can also be incorporated into LSA representations by treating them as additional documents. The query is first represented as a vector in the same high-dimensional term space as the documents and then projected into the latent topic space. By calculating the similarity between the query vector and the document vectors using cosine similarity or other similarity measures, LSA can rank the documents by their relevance to the query. This allows for information retrieval and document ranking based on the discovered latent topics.
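A minimal sketch of this idea with NumPy is shown below. It assumes the standard "folding-in" projection $q_k = S_k^{-1} U_k^T q$ for a term-document matrix $C = U S V^T$; the toy matrix, the term list, and the helper `fold_in` are made up for this illustration and are not part of the gensim workflow above.

```python
import numpy as np

# Toy term-document matrix (rows: terms, columns: documents); values are made up
terms = ["trump", "presid", "kid", "parent"]
C = np.array([
    [3.0, 2.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 1.0, 3.0],
])

k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

def fold_in(term_vector):
    """Project a vector from term space into the k-dimensional latent space."""
    return (U_k.T @ term_vector) / s_k

# Query containing the terms "kid" and "parent"
query = np.array([0.0, 0.0, 1.0, 1.0])
q_k = fold_in(query)

# Cosine similarity between the query and every document vector (rows of V_k)
sims = (V_k @ q_k) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_k))
ranking = np.argsort(-sims)  # document indices, most similar first
print(ranking)
```

Folding in a column of `C` reproduces the corresponding row of `V_k`, so queries and documents live in the same latent space and can be compared directly.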


LSA can be highly beneficial, but it does have its limitations. It is crucial to understand both the advantages and disadvantages of LSA in order to determine when to utilize it and when to explore alternative approaches.

Advantages of LSA:

1. It is efficient and straightforward to implement.
2. It often yields better results than the plain vector space model, because documents can match even when they use related rather than identical terms.
3. LSA is faster than many other available topic modeling algorithms since it involves decomposing the document term matrix only.

Disadvantages of LSA:

1. Being a linear model, LSA may not perform well on datasets with non-linear dependencies.
2. The least-squares objective behind LSA implicitly assumes Gaussian-distributed noise in the term weights, which may not hold for count data such as term frequencies.
3. The computational intensity and difficulty of updating SVD make it challenging to incorporate new data.
4. LSA lacks interpretable embeddings, as the nature of the topics and the components' positivity or negativity can be arbitrary.
5. Accurate results in LSA require a large set of documents and a wide vocabulary.
6. Compared with probabilistic models such as LDA, the LSA representation tends to capture fewer of the intricacies of the underlying data.

## D4.2 Latent Dirichlet Allocation (LDA)

In contrast to LSA, LDA is a Bayesian generative model that assumes each document is a mixture of topics, and each topic is a distribution over words. It provides a way to infer the latent topic structure by analyzing the observed words in the documents. LDA assumes the following generative process:
    
- For each document, a distribution of topics is selected from a Dirichlet distribution.
- For each word in the document:
    - Select a topic from the distribution of topics of the document.
    - Select a word from the distribution of words for the selected topic.
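The generative process above can be simulated with NumPy. This is only a sketch: the vocabulary, the number of topics, and the prior values are made up, and drawing each topic's word distribution from a Dirichlet prior (called `eta` here) is an assumption corresponding to the smoothed variant of LDA.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab = ["trump", "presid", "vote", "kid", "parent", "toy"]
n_topics = 2
alpha = np.full(n_topics, 0.5)   # Dirichlet prior over topics in a document
eta = np.full(len(vocab), 0.5)   # Dirichlet prior over words in a topic (assumed)

# Each topic is a probability distribution over the vocabulary
topic_word = rng.dirichlet(eta, size=n_topics)

def generate_document(n_words):
    """Sample one document following the LDA generative process."""
    theta = rng.dirichlet(alpha)                     # topic mixture of the document
    words = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)            # select a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # select a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(n_words=8)
print(theta, words)
```

Fitting LDA amounts to inverting this process: given only the observed words, the model estimates the topic mixtures and the topic-word distributions that most plausibly generated them.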

LDA holds a special role in the machine learning canon because it uncovers hidden topics in a collection of documents and provides a probabilistic framework for describing the distribution of topics within each document.


The goal of LDA is to infer the underlying topic distribution for each document and the word distribution for each topic based on the observed words in the corpus. It uses Bayesian inference techniques to estimate these distributions.

Plate notation is often used to represent the graphical model of LDA. In plate notation, circles represent random variables, and rectangles (plates) represent groups of variables that are repeated. The graphical representation of LDA includes three plates: one for documents, one for topics, and one for words. Arrows denote dependencies between variables.

|<img src="https://upload.wikimedia.org/wikipedia/commons/d/d3/Latent_Dirichlet_allocation.svg" alt="LDA model from https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation" />|
|:--|
|<em style='float: center'>**Figure 1**: LDA model from https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation </em>|



In the LDA model, Alpha ($\alpha$) is the parameter of the Dirichlet prior that controls the distribution of topics in documents, Theta ($\theta$) is the topic distribution of each document, and Beta ($\beta$) is the word distribution of each topic.
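Writing $\alpha$, $\theta$ and $\beta$ for Alpha, Theta and Beta, the generative process can be summarized for a single document as a joint distribution. The additional symbols are notation assumed here, following the figure: $N$ is the number of words in the document, $z_n$ is the topic assigned to the $n$-th word, and $w_n$ is the word itself.

$$
p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
$$

The first factor corresponds to drawing the topic distribution of the document, and each factor in the product corresponds to selecting a topic and then a word from that topic.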

To run LDA, you need to specify the number of topics (K) beforehand. The model then estimates the topic distribution (Theta) and word distribution (Beta) using techniques like variational inference or Gibbs sampling. These distributions provide insights into the topics present in the documents and the words associated with each topic.

By comparing the results of LDA with those of LSA, you can observe the differences in the discovered topics. LDA typically produces more interpretable and coherent topics as it explicitly models the generative process of documents and leverages Bayesian inference techniques to estimate topic-word distributions.

Here we present an example of LDA applied to a set of documents. We use the news articles that we preprocessed in Session C3 and run LDA on the TF-IDF matrix with the gensim implementation:

In [26]:
from gensim.models import LdaModel

# train an LDA model on the TF-IDF corpus
num_topics = 10
lda_model = LdaModel(corpus_tfidf,
                     id2word=dct,
                     num_topics=num_topics,
                     passes=10)

We can inspect the extracted topics by looking at the 10 highest-weighted words of each topic:

In [27]:
# print the topics and associated keywords
for topic in lda_model.print_topics():
    print(topic)

(0, '0.006*"check_huffpost" + 0.006*"want_sur" + 0.005*"style_twitt" + 0.005*"facebook_tumblr" + 0.004*"awkward" + 0.004*"destin" + 0.004*"pinterest_instagram" + 0.004*"huffpoststyl" + 0.004*"west" + 0.004*"adopt"')
(1, '0.005*"email" + 0.005*"serv" + 0.004*"speech" + 0.004*"cocktail" + 0.004*"obamacar" + 0.004*"freedom" + 0.004*"motiv" + 0.004*"adventur" + 0.003*"convent" + 0.003*"independ"')
(2, '0.006*"valentines_day" + 0.005*"pregnanc" + 0.005*"william" + 0.005*"resolut" + 0.005*"pack" + 0.004*"breakfast" + 0.004*"ted_cruz" + 0.004*"birth" + 0.004*"toy" + 0.004*"wine"')
(3, '0.005*"technolog" + 0.004*"sen" + 0.004*"bed" + 0.004*"arrest" + 0.004*"pressur" + 0.003*"data" + 0.003*"immedi" + 0.003*"relax" + 0.003*"demonstr" + 0.003*"trial"')
(4, '0.005*"time" + 0.005*"photo" + 0.005*"new" + 0.005*"like" + 0.005*"day" + 0.004*"peopl" + 0.004*"way" + 0.004*"year" + 0.004*"love" + 0.004*"thing"')
(5, '0.004*"dog" + 0.003*"fit" + 0.003*"human" + 0.003*"walk" + 0.003*"spring" + 0.003*"simpl

## D4.3 Select Number of Topics

To assess the quality of the topics extracted by the model, we can use metrics such as the coherence score and perplexity.

**Coherence score** is a measure of how coherent the topics generated by a topic model are, based on the co-occurrence of words in the corpus. It is often used as an evaluation metric for topic models, in addition to perplexity.
The coherence score is typically based on the top N words in each topic, and measures the similarity between pairs of words in the same topic. There are different ways to define the coherence score, but one common approach is to use the Pointwise Mutual Information (PMI) between pairs of words.

PMI measures the degree of association between two words, based on the probability of their co-occurrence in the corpus relative to their individual probabilities. A high PMI score indicates that the two words are strongly associated and likely to appear together in the same context.
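In formula form, for two words $w_i$ and $w_j$ (with $P(w_i, w_j)$ denoting the probability that both words occur in the same context, for example the same document or sliding window):

$$
\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\, P(w_j)}
$$

The score is 0 when the two words occur independently of each other, positive when they co-occur more often than expected by chance, and negative when they co-occur less often.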

The coherence score is calculated as the average PMI score over all pairs of words in each topic, and then averaged over all topics in the model. A higher coherence score indicates that the topics are more coherent and contain more related and meaningful words.
Coherence score is often used in combination with perplexity to evaluate the quality of a topic model. Perplexity measures how well the model fits the data, while coherence measures how well the topics generated by the model make sense in terms of the co-occurrence of words in the corpus.
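A PMI-based coherence score can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: probabilities are estimated from document-level co-occurrence in a made-up toy corpus, a small constant `eps` avoids taking the logarithm of zero, and the helper names are hypothetical. It is not the exact measure computed by gensim.

```python
import math
from itertools import combinations

# Toy tokenized corpus (hypothetical); each document is a set of stems
documents = [
    {"trump", "presid", "vote"},
    {"trump", "presid", "republican"},
    {"kid", "parent", "toy"},
    {"kid", "parent", "love"},
    {"trump", "kid"},
]

def doc_prob(*words):
    """Fraction of documents that contain all of the given words."""
    return sum(all(w in doc for w in words) for doc in documents) / len(documents)

def pmi(w1, w2, eps=1e-12):
    """Pointwise mutual information; eps avoids log(0) for pairs that never co-occur."""
    return math.log((doc_prob(w1, w2) + eps) / (doc_prob(w1) * doc_prob(w2)))

def topic_coherence(top_words):
    """Average PMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)

coherent = topic_coherence(["trump", "presid"])
mixed = topic_coherence(["presid", "toy"])
print(coherent, mixed)
```

A topic whose top words tend to appear in the same documents receives a higher score than a topic that mixes words from unrelated documents.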

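We can also use the coherence score to choose the number of topics. Note that gensim's `CoherenceModel` uses the `c_v` measure by default, which is related to but not identical with the plain PMI average described above: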



In [19]:
from gensim.models import CoherenceModel, LdaModel

scores = []
for num_topics in range(5, 20):

    # fit an LDA model with the current number of topics
    lda_model_tmp = LdaModel(corpus_tfidf, id2word=dct, num_topics=num_topics)

    # compute the coherence score (gensim's default measure is 'c_v')
    coherence_model_lda = CoherenceModel(model=lda_model_tmp, texts=corpus, dictionary=dct)
    coherence_lda = coherence_model_lda.get_coherence()
    print('\nCoherence Score with {0} topics: {1}'.format(num_topics, coherence_lda))

    scores.append(coherence_lda)


Coherence Score with 5 topics: 0.31200946953042596

Coherence Score with 6 topics: 0.3946906947456356

Coherence Score with 7 topics: 0.34919259077628045

Coherence Score with 8 topics: 0.32511259244445945

Coherence Score with 9 topics: 0.3902277965334005

Coherence Score with 10 topics: 0.3419287668714649

Coherence Score with 11 topics: 0.3552615853001322

Coherence Score with 12 topics: 0.31912493034094486

Coherence Score with 13 topics: 0.3706833991981929

Coherence Score with 14 topics: 0.3603810504156701

Coherence Score with 15 topics: 0.3711640537264907

Coherence Score with 16 topics: 0.36335976924667424

Coherence Score with 17 topics: 0.3343568795271591

Coherence Score with 18 topics: 0.3580533109347793

Coherence Score with 19 topics: 0.3641280639195868
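One simple way to read these results is to pick the number of topics with the highest coherence score. The sketch below uses the scores printed above, rounded to four decimals; the tolerance of 0.01 for "practically tied" candidates is an arbitrary choice made for this illustration. Since LDA training is stochastic, the scores are likely to change from run to run unless a random seed is fixed.

```python
# Coherence scores copied (rounded) from the output above
coherence_by_k = {
    5: 0.3120, 6: 0.3947, 7: 0.3492, 8: 0.3251, 9: 0.3902,
    10: 0.3419, 11: 0.3553, 12: 0.3191, 13: 0.3707, 14: 0.3604,
    15: 0.3712, 16: 0.3634, 17: 0.3344, 18: 0.3581, 19: 0.3641,
}

# Number of topics with the highest coherence score
best_k = max(coherence_by_k, key=coherence_by_k.get)
print(best_k, coherence_by_k[best_k])

# Candidates whose score is within 0.01 of the best one
close = [k for k, c in coherence_by_k.items()
         if coherence_by_k[best_k] - c <= 0.01]
print(close)
```

In this run, 6 topics obtain the highest score, with 9 topics close behind, so both would be reasonable candidates to inspect manually.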


To visualize the results, you can create graphs such as topic-word distributions, word clouds for each topic, or bar charts representing the topic proportions in different documents.

We can visually explore the extracted topics with the Python package `pyLDAvis`, a tool for visualizing and interpreting topic models fitted with Latent Dirichlet Allocation. With `pyLDAvis`, we can create interactive visualizations that display the topics, the most relevant terms associated with each topic, and the relationships between topics. These visualizations help us to understand the underlying structure of a corpus and to identify patterns that are difficult to discern through manual analysis alone. The visualization parameters can also be customized, which makes it easy to compare different topic modeling configurations.

In more detail, we can visualize topics using the intertopic distance map, which provides a two-dimensional representation of the distances between the topics of a model. The map uses a multidimensional scaling (MDS) algorithm to project the high-dimensional topic space onto a two-dimensional plane, where each point represents a topic and the distance between points reflects how dissimilar the topics are. The map can help us to identify clusters of related topics and to spot gaps or overlaps in the topic space, which is useful for refining the topic model or identifying areas for further investigation. For each topic, we can also visualize the most salient terms.
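To give an intuition for the distances behind such a map, the sketch below computes pairwise Jensen-Shannon divergences between topic-word distributions with NumPy. The three topic distributions are made up, and the use of Jensen-Shannon divergence is an assumption: to our knowledge it is the default distance in `pyLDAvis`, but this is an illustration rather than a description of the package's internals.

```python
import numpy as np

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a = 0 contribute nothing to the sum
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical topic-word distributions over a 4-word vocabulary
topics = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.40, 0.50],
])

n = len(topics)
distances = np.array([[jensen_shannon(topics[i], topics[j]) for j in range(n)]
                      for i in range(n)])
print(distances.round(3))
```

Topics 0 and 1 put most of their probability on the same words and end up close to each other, while topic 2 is far from both. An MDS algorithm would then place the topics in two dimensions so that these distances are preserved as well as possible.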

Let's see it in practice:

In [20]:
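# requires the pyLDAvis package (e.g. `pip install pyldavis`)
import pyLDAvis
import pyLDAvis.gensim_models  # called pyLDAvis.gensim in older releases (before 3.3)

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus_bow, dct)
vis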


In [18]:
import numpy as np

# Build an array with the raw text of each article (headline + short description)
news = pd.read_csv("../data/news/news_subset.csv")
text_df = []
for index, row in news.iterrows():
    text_df.append(row.headline + ". " + row.short_description)
text_df = np.array(text_df)


## Commented references


<a id='McLevey_D_2022'></a>
McLevey, J. (2022). *Doing Computational Social Science: A Practical Introduction*. SAGE. https://us.sagepub.com/en-us/nam/doing-computational-social-science/book266031. *A rather complete introduction to the field with well-structured and insightful chapters also on unsupervised Machine Learning. The [website](https://github.com/UWNETLAB/dcss_supplementary) offers the code used in the book.*





<div class='alert alert-block alert-success'>
<b>Document information</b>

Contact and main author: Olga Zagovora, Nicolo Gozzi

Version date: 03. June 2023

License: ...
</div>