<div style="text-align: center;"> <!-- This div will center all its contents -->
  <img src="https://scontent.fopo6-1.fna.fbcdn.net/v/t39.30808-6/327345211_708012977623591_5371889953719216000_n.png?_nc_cat=104&ccb=1-7&_nc_sid=5f2048&_nc_eui2=AeGA4Epi5DPgQWGmwJnzDzYwlTHqnE4dPp2VMeqcTh0-ndnVzTPGmZ1C7LYJvEsh0wc&_nc_ohc=oHf3AV_aUB0AX_auBWi&_nc_ht=scontent.fopo6-1.fna&oh=00_AfCTA0yaHCQugeMu_44t-6cLSKGa53d67a0DpQQ-fVTGYg&oe=654F295F" width="570" height="250" style="display: block; margin: auto;"/> <!-- This will center the image -->
  <div><strong style="color: #4F5B63;">Master in Data Science for Social Sciences</strong></div>
  <div><strong style="color: #4F5B63;">University of Aveiro</strong></div>
</div>


<div style="display: flex; justify-content: space-around; align-items: flex-start;">
  <div style="width: 100%; padding: 10px; box-shadow: 0 2px 4px rgba(0,0,0,0.1); margin: 10px;">
    <h1 style="text-align: center; font-size: 4em; color: #46627F; margin-top: 0; margin-bottom: 0; line-height: 1;">Latent Dirichlet Allocation</h1>
    <h1 style="text-align: center; color: #B1C0CF; margin-top: 0; margin-bottom: 0; line-height: 1;"> -Deduce the hidden topic from the document- </h1>
      </div>
</div>


# **Latent Dirichlet Allocation (LDA)**

Latent Dirichlet Allocation (LDA) is a technique for exploring large text corpora and uncovering the latent themes that run through a collection of documents. This generative statistical model infers the distribution of topics within each document, and the distribution of words within each topic, from the patterns of word occurrence.

![image.png](attachment:image.png)
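
To make the "generative" part concrete, here is a minimal sketch of the story LDA assumes: each document draws a topic mixture, and each word is produced by first picking a topic and then picking a word from that topic. The vocabulary, the two hand-made topics, and the helper names (`sample_dirichlet`, `generate_document`) are illustrative assumptions and are not part of Gensim.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, doc_alpha, length, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw this document's topic mixture (theta).
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)
# Hypothetical vocabulary and topics, chosen only for illustration.
vocab = ["migration", "urban", "rural", "fertility", "age", "children"]
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "migration" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "demography" topic
]
theta, words = generate_document(topic_word, vocab, doc_alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.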

## **1. Import required libraries and modules**

In [1]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim
import gensim.corpora as corpora
import nltk
import contractions
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

from pprint import pprint

from tqdm import tqdm

from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
from gensim.models.ldamodel import LdaModel
from gensim.models.ldamulticore import LdaMulticore
from gensim.corpora import Dictionary

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import pos_tag

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\beatr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## **2. Load and preprocess the dataset**
To apply LDA, we need a collection of documents. Here we use a CSV export of Scopus records (`scopus.csv`) and treat each entry in its `Abstract` column as one document.

In [2]:
# Load the CSV file into a DataFrame
df = pd.read_csv('scopus.csv')

In [3]:
document=df['Abstract']
document

0      The spatial distribution of violent crime is i...
1      Background: Population aging refers to the inc...
2      City area traffic demand analysis is an import...
3      Wuchereria bancrofti (Wb) is the most widely d...
4      Background:Migration has long been understood ...
                             ...                        
349    One of the central problems in migration measu...
350    This paper describes the problems arising from...
351    After George N. Tziafetas (EDIB, Vol. XII, 198...
352    An analysis of the initial results of the 1979...
353                              [No abstract available]
Name: Abstract, Length: 354, dtype: object

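The preprocessing below tokenizes each abstract (lowercasing it and stripping punctuation with `simple_preprocess`) and removes English stopwords. Lemmatization is a common additional step, but it is not applied here.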

In [4]:
def preprocess_data(documents):
    # Define the list of stopwords in English
    stop_words = stopwords.words('english')

    # Tokenize the documents and remove stopwords
    texts = [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in documents]

    # Return the list of tokenized and preprocessed texts
    return texts
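
The last row of the `Abstract` column shown earlier contains the placeholder `[No abstract available]`, which the function above would tokenize like a real abstract. As a hedged sketch, placeholders of that kind could be dropped before preprocessing. The helper name `drop_missing_abstracts` and the sample strings are assumptions made for illustration.

```python
PLACEHOLDER = "[No abstract available]"


def drop_missing_abstracts(documents, placeholder=PLACEHOLDER):
    """Return only the documents that contain real abstract text."""
    kept = []
    for doc in documents:
        text = str(doc).strip()
        # Skip empty strings, missing values rendered as 'nan', and the placeholder.
        if not text or text.lower() == "nan" or text == placeholder:
            continue
        kept.append(text)
    return kept


# Hypothetical sample, not taken from the dataset.
sample = [
    "Migration flows between rural and urban areas.",
    "[No abstract available]",
    float("nan"),
]
print(drop_missing_abstracts(sample))
```

Since the helper only iterates over its input, it should also accept the pandas Series used here, for example `preprocess_data(drop_missing_abstracts(document))`.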

In [5]:
# Process the documents using the previously defined preprocessing function
processed_texts = preprocess_data(document)
processed_texts

[['spatial',
  'distribution',
  'violent',
  'crime',
  'influenced',
  'small',
  'area',
  'characteristics',
  'social',
  'disorganization',
  'theory',
  'proposes',
  'neighbourhood',
  'scale',
  'characteristics',
  'including',
  'ethnic',
  'composition',
  'immigrant',
  'residents',
  'indirectly',
  'influence',
  'crime',
  'social',
  'control',
  'recent',
  'spatial',
  'demographic',
  'changes',
  'urban',
  'areas',
  'including',
  'increased',
  'immigration',
  'ethnic',
  'heterogeneity',
  'city',
  'peripheries',
  'motivated',
  'social',
  'disorganization',
  'using',
  'exploratory',
  'spatial',
  'data',
  'analysis',
  'spatial',
  'regression',
  'methods',
  'research',
  'identifies',
  'violent',
  'crime',
  'hotspots',
  'analyzes',
  'influence',
  'ethnic',
  'composition',
  'immigrant',
  'resident',
  'concentration',
  'violent',
  'crime',
  'toronto',
  'ontario',
  'census',
  'tract',
  'scale',
  'results',
  'suggest',
  'violent',
  

## **3. Create a dictionary and a corpus**

The dictionary serves as a link between words and their respective integer identifiers, and the corpus consists of a collection of documents, each depicted in the form of a bag-of-words (BoW).

The `id2word` variable is assigned a `Dictionary` object from the `corpora` module. This object will map unique identifiers to words present in the `processed_texts`, which contains the preprocessed documents. Essentially, this dictionary facilitates the conversion of text documents into a bag-of-words format, which is required for subsequent modeling with LDA.
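
To show what this mapping and the bag-of-words format look like, here is a small standard-library sketch. It mimics the idea behind `Dictionary` and `doc2bow` but is not Gensim's implementation; in particular, Gensim may assign different integer ids. The helper names and the toy documents are assumptions.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(tokens, token2id):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy documents for illustration only.
toy_texts = [["violent", "crime", "spatial", "crime"], ["spatial", "data"]]
token2id = build_vocabulary(toy_texts)
print(token2id)
print(to_bow(toy_texts[0], token2id))
```

Each document becomes a list of `(token_id, count)` pairs. Word order is discarded, which is why the representation is called a bag of words.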


In [6]:
# Create Dictionary
id2word = corpora.Dictionary(processed_texts)

In [7]:
# Tokenized texts that will be converted into the corpus
texts = processed_texts
texts

[['spatial',
  'distribution',
  'violent',
  'crime',
  'influenced',
  'small',
  'area',
  'characteristics',
  'social',
  'disorganization',
  'theory',
  'proposes',
  'neighbourhood',
  'scale',
  'characteristics',
  'including',
  'ethnic',
  'composition',
  'immigrant',
  'residents',
  'indirectly',
  'influence',
  'crime',
  'social',
  'control',
  'recent',
  'spatial',
  'demographic',
  'changes',
  'urban',
  'areas',
  'including',
  'increased',
  'immigration',
  'ethnic',
  'heterogeneity',
  'city',
  'peripheries',
  'motivated',
  'social',
  'disorganization',
  'using',
  'exploratory',
  'spatial',
  'data',
  'analysis',
  'spatial',
  'regression',
  'methods',
  'research',
  'identifies',
  'violent',
  'crime',
  'hotspots',
  'analyzes',
  'influence',
  'ethnic',
  'composition',
  'immigrant',
  'resident',
  'concentration',
  'violent',
  'crime',
  'toronto',
  'ontario',
  'census',
  'tract',
  'scale',
  'results',
  'suggest',
  'violent',
  

In [8]:
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

## **4. Train the LDA model**

Choose the number of topics you want to generate and train the LDA model.

In [9]:
# Set number of topics
num_topics = 8

In [10]:
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, random_state=42, passes=10, alpha='auto', per_word_topics=True)

## **5. Display topics and their keywords**

In [11]:
# Print the keywords for each topic
pprint(lda_model.print_topics())

[(0,
  '0.033*"population" + 0.013*"data" + 0.009*"migration" + 0.006*"model" + '
  '0.005*"populations" + 0.005*"analysis" + 0.005*"development" + '
  '0.005*"growth" + 0.004*"china" + 0.004*"based"'),
 (1,
  '0.010*"population" + 0.008*"data" + 0.007*"migration" + 0.007*"age" + '
  '0.006*"used" + 0.005*"distribution" + 0.005*"areas" + 0.005*"water" + '
  '0.005*"also" + 0.005*"number"'),
 (2,
  '0.023*"data" + 0.018*"migration" + 0.005*"time" + 0.004*"population" + '
  '0.004*"information" + 0.004*"research" + 0.004*"model" + 0.004*"year" + '
  '0.004*"method" + 0.003*"study"'),
 (3,
  '0.023*"urban" + 0.017*"migration" + 0.013*"population" + 0.012*"rural" + '
  '0.010*"data" + 0.009*"migrants" + 0.008*"cities" + 0.006*"level" + '
  '0.005*"economic" + 0.005*"income"'),
 (4,
  '0.018*"population" + 0.011*"age" + 0.011*"migration" + 0.011*"data" + '
  '0.009*"urban" + 0.008*"fertility" + 0.008*"years" + 0.008*"migrants" + '
  '0.006*"children" + 0.006*"higher"'),
 (5,
  '0.011*"data"

## **6. Evaluate the model using the coherence score**

Coherence measures the degree of semantic similarity between the high-scoring words in a topic. There are various ways to measure coherence, and one of the most popular is the `c_v` measure. Higher coherence scores generally indicate more interpretable topics.
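
To give an intuition for what a coherence measure computes, here is a sketch of a simpler, UMass-style score based on how often a topic's top words appear in the same documents. This is not the `c_v` measure, which is more involved, and the function name and toy documents are assumptions made for illustration.

```python
import math
from itertools import combinations


def umass_coherence(top_words, texts):
    """UMass-style coherence: average log co-document frequency over word pairs."""
    doc_sets = [set(tokens) for tokens in texts]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    # combinations() preserves order, so `earlier` is ranked above `later`.
    for earlier, later in combinations(top_words, 2):
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            continue  # the word never occurs, so the pair is undefined
        scores.append(math.log((doc_freq(earlier, later) + 1) / d_earlier))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy tokenized documents for illustration only.
toy_docs = [
    ["urban", "rural", "migration"],
    ["urban", "rural"],
    ["fertility", "age"],
    ["fertility", "age", "children"],
]
coherent = umass_coherence(["urban", "rural"], toy_docs)
mixed = umass_coherence(["urban", "fertility"], toy_docs)
print(coherent, mixed)
```

Words that tend to share documents score higher than words that never co-occur, which is the intuition that measures such as `c_v` refine.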

In [12]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_texts, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score:', coherence_lda)

Coherence Score: 0.329063658224148


A coherence score of about 0.33, as obtained above, is on the low side for the `c_v` measure. These scores typically range from 0 to 1, where higher scores correspond to more coherent topics that are semantically meaningful, with words in the same topic being more related to each other.

In practical terms, a score of 0.33 suggests that the topics capture some meaningful patterns but overlap considerably, so there is clear room for improvement. The keyword lists above show this: "population", "data", and "migration" appear among the top words of every topic displayed. Some topics might be well defined and others less so, or some words within topics might not fit as well as others.

In topic modeling, coherence scores can be somewhat subjective and should be interpreted in the context of the application. It's often useful to compare coherence scores across different models or different numbers of topics to find the best model for your specific dataset and use case.

## **7. Visualize the topics**

To visualize the topics, you can use the pyLDAvis library (install it with `pip install pyLDAvis`). This library provides an interactive visualization of the topics and their keywords.

In [14]:
# Reuse the `id2word` dictionary the model was trained on, so token ids match the corpus
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word)
vis

## **8. Conclusion**

To sum up, this tutorial has walked through building a Latent Dirichlet Allocation (LDA) model with Python's Gensim library. The steps outlined here let you uncover the underlying thematic structure in a set of documents, as the model groups words into topics. We covered importing the necessary libraries, preprocessing the data, building a dictionary and corpus, training and interpreting the LDA model, assessing its coherence, and visualizing the results with pyLDAvis. With these skills, you can apply LDA to a variety of text analysis tasks and derive insights from large collections of text.

References:
https://bennett-holiday.medium.com/a-step-by-step-guide-to-writing-an-lda-program-in-python-690aa99119ea