# OpenAlex Topic Modeling

Author: Alex Davis

Date: 07/11/2024

The purpose of this notebook is to build and evaluate an LDA topic model using the preprocessed corpus from the 'data_load' notebook.

In [6]:
#import packages
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel

#note: pyLDAvis 3.3 and later renamed the pyLDAvis.gensim module to pyLDAvis.gensim_models
import pyLDAvis
import pyLDAvis.gensim

import pickle
import re
import pandas as pd
import matplotlib.pyplot as plt

## Import Data

Here, we read the pickle file we wrote in the data_load notebook.

In [7]:
#open the file where we stored the pickled data and load its contents;
#the with-block closes the file automatically
with open('Data/preprocessed_data.pkl', 'rb') as file:
    data = pickle.load(file)

## Prepare Corpus

Here, we extract the cleaned text and prepare it for modeling. We tokenize each document, build the word ID mappings, and use them to create the bag-of-words corpus. The mappings and the corpus both feed into the model.

In [8]:
#convert the preprocessed text to a list
documents = list(data["clean_text"])

#separate by ' ' to tokenize each article
texts = [x.split(' ') for x in documents]
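
One caveat on the tokenization step, sketched here under the assumption that `clean_text` could contain repeated spaces (the sample string is made up, not taken from the corpus): `split(' ')` keeps empty strings as tokens, while `split()` with no argument collapses any run of whitespace.

```python
# Illustrative string only; not taken from the actual corpus.
sample = "topic  model evaluation"   # note the double space

tokens_single_space = sample.split(' ')   # splits on every single space
tokens_whitespace = sample.split()        # splits on runs of whitespace

print(tokens_single_space)  # ['topic', '', 'model', 'evaluation']
print(tokens_whitespace)    # ['topic', 'model', 'evaluation']
```

If the preprocessing already guarantees single spaces between words, both calls return the same tokens.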

In [9]:
#construct word ID mappings
id2word = Dictionary(texts)

#use word ID mappings to build corpus
corpus = [id2word.doc2bow(text) for text in texts]
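
To make the bag-of-words representation concrete, here is a plain-Python sketch of the kind of output `doc2bow` returns: one list of (token id, count) pairs per document. It is an illustration on a made-up toy corpus with hypothetical helper names (`build_vocab`, `to_bow`), not gensim's implementation, and the ids gensim assigns may differ.

```python
from collections import Counter

# Toy tokenized documents, invented for illustration.
toy_texts = [
    ["graph", "neural", "network", "graph"],
    ["neural", "topic", "model"],
]

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in text:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(text, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[token] for token in text if token in vocab)
    return sorted(counts.items())

toy_vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(text, toy_vocab) for text in toy_texts]

print(toy_vocab)   # {'graph': 0, 'neural': 1, 'network': 2, 'topic': 3, 'model': 4}
print(toy_corpus)  # [[(0, 2), (1, 1), (2, 1)], [(1, 1), (3, 1), (4, 1)]]
```

Word order is discarded in this representation; only the per-document counts reach the model.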

## Sample Model

Here, we build a sample model with arbitrary parameters, compute its coherence score, and visualize the resulting topics using pyLDAvis.

In [10]:
#build LDA model
lda_model = LdaModel(corpus = corpus, id2word = id2word, num_topics = 10, decay = 0.5,
                     random_state = 0, chunksize = 100, alpha = 'auto', per_word_topics = True)

In [12]:
#compute coherence score
coherence_model_lda = CoherenceModel(model = lda_model, texts = texts, dictionary = id2word, coherence = 'c_v')
coherence_score = coherence_model_lda.get_coherence()
print(coherence_score)

0.482765271348718


Below, we use pyLDAvis to visualize the topics from the model above. The left panel plots each topic as a circle on a two-dimensional intertopic distance map. Ideally, topics are well separated and do not overlap. The right panel lists the most salient terms. Click on a topic to see the terms most relevant to it, with red bars showing each term's estimated frequency within that topic. Slide the relevance metric to the left to favor terms that are nearly unique to the topic, and to the right to favor terms that are frequent in the topic but less unique to it.
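
The relevance slider corresponds to the parameter λ in the relevance measure proposed by the LDAvis authors: relevance(w | t) = λ · log p(w | t) + (1 − λ) · log(p(w | t) / p(w)). The sketch below ranks two terms under that formula. The terms and probabilities are made up, and the helper names (`relevance`, `rank_terms`) are hypothetical; this is not how pyLDAvis is called.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """lam weights the in-topic probability; (1 - lam) weights the lift p(w|t) / p(w)."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Made-up probabilities: (p(term | topic), p(term) over the whole corpus).
toy_terms = {
    "model": (0.10, 0.08),        # frequent in the topic, but common everywhere
    "perovskite": (0.04, 0.005),  # rarer, but almost exclusive to the topic
}

def rank_terms(terms, lam):
    """Sort terms from most to least relevant for a given lambda."""
    return sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)

print(rank_terms(toy_terms, lam=1.0))  # ['model', 'perovskite']
print(rank_terms(toy_terms, lam=0.0))  # ['perovskite', 'model']
```

With λ = 1 the ranking follows in-topic probability alone; with λ = 0 it follows lift, which favors terms that are exclusive to the topic.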

In [11]:
#create Topic Distance Visualization 
pyLDAvis.enable_notebook()
lda_viz = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
lda_viz

In [12]:
#save as html file
pyLDAvis.save_html(lda_viz, 'Outputs/lda_draft.html')

## Model Creation and Evaluation

Here, we create LDA models using gensim and vary the parameters to find the combination with the highest coherence score. Topic coherence evaluates a topic by measuring the degree of semantic similarity between its highest-scoring words; the score reported for a model aggregates this over all of its topics.
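
As a rough intuition for what a coherence score rewards, the sketch below averages normalized pointwise mutual information (NPMI) over pairs of a topic's top words, using document-level co-occurrence on a made-up corpus. It is a simplified stand-in: the `c_v` measure used in this notebook is more involved, so these numbers are not comparable to the scores reported here. The names `npmi` and `toy_coherence` are hypothetical.

```python
import math
from itertools import combinations

# Made-up documents, each reduced to its set of distinct tokens.
toy_docs = [
    {"cell", "protein", "gene"},
    {"cell", "protein"},
    {"galaxy", "star"},
    {"galaxy", "star", "cell"},
]

def npmi(word_a, word_b, docs):
    """NPMI from document co-occurrence: -1 (never together) up to 1 (always together)."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0
    if p_ab == 1:
        return 1.0  # both words appear in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

def toy_coherence(top_words, docs):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)

print(toy_coherence(["galaxy", "star"], toy_docs))   # words that always co-occur
print(toy_coherence(["protein", "star"], toy_docs))  # words that never co-occur
```

A topic whose top words tend to appear in the same documents scores high, while a topic that mixes unrelated vocabulary scores low.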

We loop through different values of num_topics and decay, compute the coherence score for each combination of parameters, and save the results in a dataframe.
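
The parameter grid described above can be enumerated up front, which makes the number of models to train explicit. This is a small sketch using `itertools.product`; the variable names are hypothetical, and the ranges mirror the values looped over in this notebook.

```python
from itertools import product

topic_counts = range(5, 12)      # 5 through 11 topics
decay_rates = [0.5, 0.6, 0.7]

# Every (num_topics, decay) combination, in the same order as a nested loop.
param_grid = list(product(topic_counts, decay_rates))

print(len(param_grid))   # 21
print(param_grid[:3])    # [(5, 0.5), (5, 0.6), (5, 0.7)]
```

Each combination means training one LDA model and scoring it, so the run time grows with the product of the two lists.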

In [18]:
def lda_model_evaluation():
    
    """
    This function loops through a number of parameters for an LDA model, creates the model,
    computes the coherence score, and saves the results in a pandas dataframe. The output dataframe
    contains the values of the parameters tested and the resulting coherence score.
    """
    
    #define empty lists to save results
    topic_number, decay_rate_list, score  = [], [], []
    
    #loop through a number of parameters
    for topics in range(5,12):
        for decay_rate in [0.5, 0.6, 0.7]:
                
            #build LDA model
            lda_model = LdaModel(corpus = corpus, id2word = id2word, num_topics = topics, decay = decay_rate,
                                 random_state = 0, chunksize = 100, alpha = 'auto', per_word_topics = True)

            #compute coherence score
            coherence_model_lda = CoherenceModel(model = lda_model, texts = texts, dictionary = id2word, coherence = 'c_v')
            coherence_score = coherence_model_lda.get_coherence()

            #append parameters to lists
            topic_number.append(topics)
            decay_rate_list.append(decay_rate)
            score.append(coherence_score)

            #report progress (the results are stored in the lists; the model itself is not written to disk)
            print("Model Saved")
    
    #gather result into a dataframe
    results = {"Number of Topics": topic_number,
                "Decay Rate": decay_rate_list,
                "Score": score}
    
    results = pd.DataFrame(results)
    
    return results

In [19]:
#call the evaluation model and save the results
results = lda_model_evaluation()

Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved
Model Saved


In [22]:
results.sort_values(by = "Score", ascending = False)

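| | Number of Topics | Decay Rate | Score |
|---|---|---|---|
| 17 | 10 | 0.7 | 0.551599 |
| 5 | 6 | 0.7 | 0.543566 |
| 8 | 7 | 0.7 | 0.532869 |
| 20 | 11 | 0.7 | 0.502876 |
| 9 | 8 | 0.5 | 0.497834 |
| 14 | 9 | 0.7 | 0.497399 |
| 16 | 10 | 0.6 | 0.48574 |
| 10 | 8 | 0.6 | 0.485579 |
| 15 | 10 | 0.5 | 0.482765 |
| 11 | 8 | 0.7 | 0.482605 |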


## Visualize Final Topic Model Results

According to our evaluation function, the best-scoring model has 10 topics and a decay rate of 0.7, with a coherence score of 0.551599.
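
Rather than reading the best parameters off the table by hand, they can be pulled from the dataframe. This is a minimal sketch, assuming a dataframe shaped like `results`; `results_sample` and the `best_*` names are hypothetical, and only the top three rows of the table are reproduced.

```python
import pandas as pd

# Top three rows copied from the sorted results table in this notebook.
results_sample = pd.DataFrame(
    {
        "Number of Topics": [10, 6, 7],
        "Decay Rate": [0.7, 0.7, 0.7],
        "Score": [0.551599, 0.543566, 0.532869],
    },
    index=[17, 5, 8],
)

# idxmax returns the index label of the highest score; loc fetches that row.
best = results_sample.loc[results_sample["Score"].idxmax()]
best_num_topics = int(best["Number of Topics"])
best_decay = float(best["Decay Rate"])

print(best_num_topics, best_decay)  # 10 0.7
```

The two values could then be passed as `num_topics` and `decay` when building the final model, which avoids a transcription error if the grid is rerun.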

In [None]:
#build LDA model
final_lda_model = LdaModel(corpus = corpus, id2word = id2word, num_topics = 10, decay = 0.7,
                     random_state = 0, chunksize = 100, alpha = 'auto', per_word_topics = True)

In [None]:
#compute coherence score
coherence_model_lda = CoherenceModel(model = final_lda_model, texts = texts, dictionary = id2word, coherence = 'c_v')
coherence_score = coherence_model_lda.get_coherence()
print(coherence_score)

In [None]:
#create Topic Distance Visualization 
pyLDAvis.enable_notebook()
lda_viz = pyLDAvis.gensim.prepare(final_lda_model, corpus, id2word)
lda_viz

In [None]:
#save as html file
pyLDAvis.save_html(lda_viz, 'Outputs/lda_final.html')