# OpenAlex Topic Modeling - BERT

Author: Alex Davis

Date: 06/29/2025

The purpose of this script is to generate a high-quality topic model using the preprocessed corpus from the 'data_load' script. Instead of the traditional method in modeling.py, this notebook uses BERTopic, which builds topics from embeddings produced by an encoder-only transformer model.

In [0]:
%pip install bertopic
%pip install sentence-transformers
%pip install umap-learn
%pip install hdbscan
%pip install openai
%pip install plotly==5.19.0

In [0]:
%restart_python

## Import Packages & Global Variables

Import packages for data retrieval, data manipulation, visualization, and the BERTopic pipeline. Also, choose whether to use saved embeddings or rerun the embedding model.

In [0]:
#import packages for data management
import pickle

#import packages for topic modeling
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap.umap_ import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

#import packages for data manipulation and visualization
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy as sch

In [0]:
#set to True to use saved embeddings from a past run
#set to False to rerun embeddings
use_saved_embeddings = True

## Import Data and Preprocess

Load the preprocessed data produced by data_load.py and take a random sample to speed up processing.

In [0]:
#open the file where the pickled data is stored and load it
#the file is closed automatically when the with block ends
with open('Data/preprocessed_data.pkl', 'rb') as file:
    data = pickle.load(file)

In [0]:
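#take a 20% sample without replacement to speed up processing
#the fixed random_state keeps the sample aligned with any saved embeddings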
data = data.sample(frac = 0.2, replace = False, random_state = 1)

## Embedding Model

Use a small embedding model to reduce vector size and speed up processing. Either run the embedding model or load the embeddings saved by a previous run.

In [0]:
#initialize embedding model
embedding_model = SentenceTransformer('thenlper/gte-small')

#use saved embeddings or rerun embedding model
if use_saved_embeddings:
    #load the pickled embeddings saved by a previous run
    with open('Data/sampled_embeddings.pkl', 'rb') as file:
        embeddings = pickle.load(file)

else:
    #encode the sampled documents
    embeddings = embedding_model.encode(data['all_text'].tolist(), show_progress_bar=True)

In [0]:
#investigate shape and size of vectors
embeddings.shape
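
Because the saved embeddings are loaded from disk while the documents are re-sampled in each run, it may be worth confirming that the two still line up before fitting. The check below is a minimal sketch: `check_alignment` is a hypothetical helper, not part of BERTopic, and the document count and vector width are illustrative values only.

```python
def check_alignment(n_docs, embedding_shape):
    """Raise if the embedding matrix does not have one row per document.

    Hypothetical helper for illustration; not part of BERTopic.
    """
    n_rows = embedding_shape[0]
    if n_rows != n_docs:
        raise ValueError(
            f"{n_rows} embedding rows for {n_docs} documents; "
            "rerun the embedding model or reload the matching sample"
        )
    return True


# Illustrative values only: 1,000 documents, one 384-dimensional row each
check_alignment(1000, (1000, 384))
```

In this notebook the equivalent call would be `check_alignment(len(data), embeddings.shape)`.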

In [0]:
#save embeddings if they were recomputed
if not use_saved_embeddings:
    #write the embeddings to a .pkl file (created if it does not exist)
    with open('Data/sampled_embeddings.pkl', 'wb') as file:
        pickle.dump(embeddings, file)
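
The load-or-recompute logic above is split across two cells and a global flag. One possible way to keep it in a single place is sketched below. `load_or_compute` and `fake_encode` are hypothetical names introduced for illustration, and `fake_encode` merely stands in for a call such as `embedding_model.encode(...)`.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute_fn, use_saved=True):
    """Load a pickled object from `path`, or compute it and save it there.

    Hypothetical helper for illustration; `compute_fn` stands in for
    the embedding call.
    """
    if use_saved and os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


# Toy demonstration with a stand-in "encoder" that records each call
calls = []


def fake_encode():
    calls.append(1)
    return [[0.1, 0.2], [0.3, 0.4]]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
first = load_or_compute(cache_path, fake_encode)   # computes and saves
second = load_or_compute(cache_path, fake_encode)  # loads from disk
```

A side effect of this design is that a missing cache file triggers a recompute instead of an error, which may or may not be the desired behavior.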

## Dimensionality Reduction Model

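To reduce the high-dimensional embeddings, use the UMAP model for dimensionality reduction. UMAP's default of two output components is kept here, which is why the reduced embeddings can be plotted directly as x and y coordinates.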

In [0]:
#initialize dimensionality reduction model and reduce embeddings
umap_model = UMAP(n_neighbors=5, min_dist=0.0, metric='cosine', random_state=42)
reduced_embeddings = umap_model.fit_transform(embeddings)

In [0]:
#investigate shape and size of new embeddings
reduced_embeddings.shape

## Clustering Model

Use a density-based clustering model (HDBSCAN) to create groups and isolate outliers.

In [0]:
#initialize clustering model and cluster
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom').fit(reduced_embeddings)
clusters = hdbscan_model.labels_

In [0]:
#investigate number of cluster labels (the count includes the -1 outlier label, if present)
len(set(clusters))
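
HDBSCAN assigns the label -1 to points it treats as outliers, so the count above includes that label whenever outliers exist. A small sketch of separating the two, using a made-up label list in place of `hdbscan_model.labels_`:

```python
from collections import Counter

# Toy labels standing in for hdbscan_model.labels_; -1 marks outliers
labels = [0, 0, 1, -1, 1, 2, -1, 0]

counts = Counter(labels)
n_clusters = len(set(labels) - {-1})       # real clusters only
outlier_share = counts[-1] / len(labels)   # fraction left unassigned
```

Applied to this notebook, the same idea would be `len(set(clusters) - {-1})` for the number of clusters excluding outliers.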

In [0]:
#create dataframe of reduced embeddings and clusters
df = pd.DataFrame(reduced_embeddings, columns = ['x', 'y'])
df['Cluster'] = [str(c) for c in clusters]

#split between clusters and outliers
to_plot = df.loc[df.Cluster != '-1', :]
outliers = df.loc[df.Cluster == '-1', :]

#plot clusters
plt.scatter(outliers.x, outliers.y, alpha = 0.05, s = 2, c = 'grey')
plt.scatter(to_plot.x, to_plot.y, alpha = 0.6, s = 2, c = to_plot.Cluster.astype(int), cmap = 'tab20b')
plt.axis('off')

## BERTopic Pipeline

Combine the embedding, dimensionality reduction, and clustering models in a BERTopic pipeline. Other models are added later to fine-tune the topic representations.

In [0]:
#pass the models above to the BERTopic pipeline and fit it with the precomputed embeddings
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  verbose = True).fit(data['all_text'].tolist(), embeddings)

In [0]:
#investigate topics
topic_model.get_topic_info().head()

In [0]:
#topics most similar to 'augmented reality'
topic_model.find_topics("augmented reality")
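
`find_topics` returns topic ids together with similarity scores for the search term. The ranking idea can be illustrated with cosine similarity on made-up three-dimensional vectors. The vectors, the topic ids, and the `rank_topics` helper below are assumptions for illustration and are not BERTopic internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_topics(query_vec, topic_vecs, top_n=2):
    """Return (topic_id, similarity) pairs, most similar first."""
    scored = [(tid, cosine_similarity(query_vec, vec))
              for tid, vec in topic_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up embeddings: topic 18 points in nearly the same direction as the query
topic_vecs = {3: [1.0, 0.0, 0.0], 18: [0.1, 0.9, 0.2], 7: [0.0, 0.2, 1.0]}
query_vec = [0.0, 1.0, 0.1]

ranking = rank_topics(query_vec, topic_vecs)
```

With these toy vectors, topic 18 ranks first because its direction is closest to the query, regardless of vector length.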

In [0]:
#investigate topic 18
topic_model.get_topic(18)

In [0]:
#initialize tokenizer model
vectorizer_model = CountVectorizer(stop_words="english")

#initialize ctfidf model to weight terms
ctfidf_model = ClassTfidfTransformer()

#add tokenizer and ctfidf to pipeline
topic_model.update_topics(data['all_text'].tolist(), vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model)

In [0]:
#investigate how topic representations have changed
topic_model.get_topic(18)
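
The update above swaps in a `CountVectorizer` with English stop words and a `ClassTfidfTransformer`, which together control how terms are counted and weighted per topic. As a rough illustration of the weighting idea, the sketch below computes a class-based TF-IDF score of the form tf × log(1 + A / f), where A is the average word count per class and f is a term's frequency across all classes. The toy corpus, the whitespace tokenization, and the `ctfidf` helper are assumptions for illustration; the library's implementation differs in detail.

```python
import math
from collections import Counter

# Toy corpus: each topic's documents are joined into one "class document"
class_docs = {
    0: "virtual reality headset virtual data",
    1: "gene expression data gene sequencing",
}

# Term frequencies per class (whitespace tokenization is an assumption)
tf = {c: Counter(text.split()) for c, text in class_docs.items()}

# f: frequency of each term across all classes
total = Counter()
for class_counts in tf.values():
    total.update(class_counts)

# A: average number of words per class
avg_words = sum(total.values()) / len(class_docs)


def ctfidf(term, c):
    """Normalized term frequency in class c times log(1 + A / f)."""
    tf_norm = tf[c][term] / sum(tf[c].values())
    return tf_norm * math.log(1 + avg_words / total[term])


# Highest-weighted term for each class
top_terms = {c: max(tf[c], key=lambda t: ctfidf(t, c)) for c in class_docs}
```

In this toy corpus, "data" appears in both classes and therefore scores lower than a term such as "reality" that appears equally often but in only one class.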

In [0]:
#initialize representation model and add to pipeline
representation_model = KeyBERTInspired()
topic_model.update_topics(data['all_text'].tolist(), vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model, representation_model=representation_model)

In [0]:
#investigate how topic representations have changed
topic_model.get_topic(18)

In [0]:
#plot documents in the reduced 2D space, colored by topic
topic_model.visualize_documents(data['all_text'].tolist(), reduced_embeddings=reduced_embeddings)

In [0]:
#show the top terms per topic as bar charts
topic_model.visualize_barchart()

In [0]:
#show topic-to-topic similarity as a heatmap
topic_model.visualize_heatmap()

In [0]:
import openai
from bertopic.representation import OpenAI

In [0]:
#prompt for GPT to create topic labels
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following key words: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <short topic label>
"""

In [0]:
#create the OpenAI client (an API key must be supplied)
client = openai.OpenAI(api_key='')

#add GPT as representation model
representation_model = OpenAI(client, model = 'gpt-3.5-turbo', exponential_backoff=True, chat=True, prompt=prompt)
topic_model.update_topics(data['all_text'].tolist(), representation_model=representation_model)
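
The prompt contains two placeholders, `[DOCUMENTS]` and `[KEYWORDS]`, which are filled in per topic before the request is sent. The sketch below shows one way such a substitution and the parsing of a `topic: ...` reply could look. `fill_prompt` and `parse_label` are hypothetical helpers, and the documents, keywords, and model reply are made up for illustration; none of this is BERTopic's own code.

```python
PROMPT = """
I have a topic that contains the following documents:
[DOCUMENTS]

The topic is described by the following key words: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <short topic label>
"""


def fill_prompt(template, docs, keywords):
    """Substitute representative documents and keywords into the template."""
    doc_block = "\n".join(f"- {d}" for d in docs)
    filled = template.replace("[DOCUMENTS]", doc_block)
    return filled.replace("[KEYWORDS]", ", ".join(keywords))


def parse_label(response_text):
    """Strip the 'topic:' prefix requested by the prompt, if present."""
    text = response_text.strip()
    if text.lower().startswith("topic:"):
        text = text[len("topic:"):]
    return text.strip()


# Made-up inputs and a made-up model reply, for illustration only
filled = fill_prompt(
    PROMPT,
    docs=["AR overlays for surgical training",
          "Head-mounted displays in classrooms"],
    keywords=["augmented", "reality", "headset"],
)
label = parse_label("topic: Augmented reality applications")
```

Printing `filled` for one topic before spending API calls is a cheap way to check that the prompt reads as intended.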

In [0]:
#investigate how topic representations have changed
topic_model.get_topic(18)

In [0]:
#replot documents with the GPT-generated topic labels
topic_model.visualize_documents(data['all_text'].tolist(), reduced_embeddings=reduced_embeddings)

In [0]:
#create linkages between topics
linkage_function = lambda x: sch.linkage(x, 'single', optimal_ordering=True)
hierarchical_topics = topic_model.hierarchical_topics(data['all_text'].tolist(), linkage_function=linkage_function)

In [0]:
#visualize topic model hierarchy
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
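
Single linkage merges two groups at the smallest distance between any pair of their members, which tends to chain nearby topics together. The sketch below applies the same `sch.linkage` call to four toy points on a line to show how the merge distances come out. The points are made up, and applying the function directly to coordinates is a simplification for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy as sch

# Four toy "topics" placed on a line; the last one is far from the rest
points = np.array([[0.0], [1.0], [2.2], [10.0]])

# Same linkage settings as in the notebook, applied to the toy points
Z = sch.linkage(points, "single", optimal_ordering=True)

# Column 2 of the linkage matrix holds the distance at which each merge happens
merge_distances = Z[:, 2]
```

The three merges happen at distances of about 1.0, 1.2, and 7.8: the far point joins last, at its distance to the nearest member of the existing group and not to the group's center.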