<a href="https://colab.research.google.com/github/cristianmejia00/clustering/blob/main/Topic_Models_using_BERTopic_EMBEDS_20241031.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic -- EMBEDDINGS

`Topic models` are methods that automatically organize a corpus of text into topics.

Topic model process:
1. Data preparation
2. Transform text to numeric vectors `<-- THIS CODE!`
3. Dimensionality reduction
4. Clustering
5. Topic analysis
6. Cluster assignment


This notebook uses the `BERTopic` library, a one-stop solution for topic modeling that includes handy functions for plotting and analysis. However, BERTopic does not have a function to extract the X and Y coordinates from UMAP. If you need the coordinates, use the `Topic_Models_using_Transformers` notebooks instead. Otherwise, when a quick analysis is needed, this notebook may be the better choice.

This notebook is also the one needed for the heatmap codes included in this folder.

`BERTopic` is a Python library that handles steps 2 to 6.
BERTopic models use the transformer architecture to generate the embeddings (i.e. the vector or numeric representation of the text), which are among the current state-of-the-art methods for text vectorization.

This notebook shows how to use it.

---
Reading:
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Advanced Topic Modeling with BERTopic](https://www.pinecone.io/learn/bertopic/)


# Requirements

## Packages installation and initialization

In [3]:
# This is the first code cell. Execute to give access to Google Drive.
import sys

# Test whether this is running on Google Colab
if 'google.colab' in sys.modules:
    print("Running on Colab")
    # Install libraries
    !pip install bertopic[visualization]

    ## FOLDER
    # mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')

    # Correct path to the Google Drive mounted in Colab
    GDRIVE_PATH = "drive/MyDrive"

    ## ENVIRONMENT
    # pip installs go here.
elif sys.platform == 'win32':
    print("Running on Windows")
    # We are in Local.

    ## FOLDER
    # Correct path to the Google Drive folder in Local
    GDRIVE_PATH = "G:/My Drive"

elif sys.platform == 'linux':
    print("Running on Linux")
    # We are in WSL - VSCode
    GDRIVE_PATH = '/mnt/g/My Drive'

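    # Load environment
    # virtualenv env-titech
    # Note: `!source` runs in a temporary subshell, so it does not activate
    # the environment for this kernel
    !source ./env-titech/bin/activate
    #!pip install -r requirements_env_titech.txt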

elif sys.platform == 'darwin':
    print("Running on Mac OS X")
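    # We are on macOS
    GDRIVE_PATH = "/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive"
    # Note: `!source` runs in a temporary subshell, so it does not activate the environment for this kernel
    !source ./env-tm/bin/activate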

else:
    print("Couldn't mount drive. Check your system and path")

Running on Mac OS X
zsh:source:1: no such file or directory: ./env-tm/bin/activate
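
The platform checks above can also be written as a small pure function, which makes the fallback case explicit. This is a minimal sketch, not the notebook's own code: the name `resolve_gdrive_path` and the `on_colab` flag are hypothetical, and the paths are the ones used in the cell above. Unlike the original `else` branch, it raises an error instead of leaving `GDRIVE_PATH` undefined.

```python
# Paths copied from the cell above; adjust them to your own machine.
GDRIVE_PATHS = {
    "win32": "G:/My Drive",
    "linux": "/mnt/g/My Drive",
    "darwin": "/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive",
}

def resolve_gdrive_path(platform, on_colab):
    """Return the Google Drive root for a platform (hypothetical helper)."""
    if on_colab:
        # On Colab the drive is mounted relative to /content
        return "drive/MyDrive"
    try:
        return GDRIVE_PATHS[platform]
    except KeyError:
        raise RuntimeError(f"Unsupported platform: {platform}") from None

# In the notebook this would be called as:
#   resolve_gdrive_path(sys.platform, 'google.colab' in sys.modules)
example_path = resolve_gdrive_path("linux", on_colab=False)
```

Failing early here means a wrong path surfaces in this cell rather than later, when the settings file is opened.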


In [6]:
#!pip install --upgrade pip
#!pip install --upgrade numpy==1.26


In [1]:
import pandas as pd
import os
import json
import pickle
from sentence_transformers import SentenceTransformer



# 🔴 Input files and options

Go to your Google Drive and create a folder in the root directory. All related data will be saved in that folder.
Upload the dataset into that folder.
- The dataset should be a `.csv` file.
- Every row in the dataset is a document.
- It can have any kind of columns, but some columns must contain the text we want to analyze. For example, a dataset of academic articles may contain a "Title" and/or "Abstract" column.
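
Before going further, it can help to confirm that the uploaded CSV really contains the text columns you plan to use. A minimal sketch, assuming pandas is available; the helper name `missing_columns` and the tiny in-memory CSV are made up for illustration.

```python
import io

import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from the dataframe."""
    return [col for col in required if col not in df.columns]

# A tiny stand-in for the uploaded dataset: one row per document.
csv_text = "UT,TI,AB\nid-1,First title,First abstract\nid-2,Second title,\n"
sample = pd.read_csv(io.StringIO(csv_text))

# "DE" is not in the sample, so it is reported as missing.
absent = missing_columns(sample, ["TI", "AB", "DE"])
```

An empty list means every requested column is present.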

In [4]:
# The bibliometrics folder
ROOT_FOLDER_PATH = f"{GDRIVE_PATH}/Bibliometrics_Drive"

# Change to the name of the folder where the dataset is uploaded inside the above folder
project_folder_name = "Q342_science_communication"

# DATASET settings
settings_directive = "settings_dataset_directive_2026-01-14-17-26.json"

In [5]:
# Read settings
with open(f'{ROOT_FOLDER_PATH}/{project_folder_name}/{settings_directive}', 'r') as file:
    settings = json.load(file)

## Aux Functions

In [6]:
def find_e_keys(dictionary):
    # List comprehension to find keys starting with 'e'
    e_keys = [key for key in dictionary if str(key).lower().startswith('e')]
    return e_keys

In [7]:
# Function to save files
def save_as_csv(df, save_name_without_extension, with_index):
    "usage: `save_as_csv(dataframe, 'filename', with_index=False)`; the file is saved under ROOT_FOLDER_PATH"
    df.to_csv(f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv", index=with_index)
    print("===\nSaved: ", f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv")

In [8]:
# Function to save an object to a pickle file
def save_object_as_pickle(obj, filename):
  """
  Saves an object as a pickle file.

  Args:
      obj: The object to be saved.
      filename: The filename of the pickle file.
  """
  with open(filename, "wb") as f:
    pickle.dump(obj, f)


In [9]:
# Function to load a pickle object given a path
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)
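
The two pickle helpers above are meant to be used as a pair. The following sketch round-trips a small dictionary through a temporary file to show the expected behavior. The helpers are repeated so the example runs on its own, and the file name `example.pck` is arbitrary. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

def save_object_as_pickle(obj, filename):
    """Write obj to filename in pickle format."""
    with open(filename, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read a pickled object back from path."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy payload with the same two keys used later for the embeddings file
payload = {"embeddings_ids": ["a", "b"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pck")
    save_object_as_pickle(payload, path)
    restored = load_pickle(path)
```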


In [10]:
# Input dataset
dataset_file_name = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['network']['from_filtered_dataset']}/dataset_raw_cleaned.csv"

# The text columns used for the topic model are read from the settings file (key `text_columns`).
# For example, in the case of academic articles, we use the title "TI" and the abstract "AB".
# Those columns will be merged and used as input to the topic model.
e_label = find_e_keys(settings['embeds'])[0]

es = settings['embeds'][e_label]

In [11]:
es

{'text_columns': ['TI', 'AB'],
 'to_lowercase': False,
 'remove_stopwords': False,
 'remove_numbers': False,
 'remove_symbols': False,
 'stemming': False,
 'lemmatization': False,
 'id_column': 'UT',
 'transformer_model': 'all-MiniLM-L6-v2',
 'notes': ''}
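
Judging from the lookups in the cells above, the settings file appears to have roughly the following shape. This is a reconstruction for illustration, not the actual file: only keys that this notebook reads are shown, the values are taken from the outputs in this notebook, and the `e01` block is abbreviated. The sketch also shows how `find_e_keys` picks the embeddings label.

```python
def find_e_keys(dictionary):
    """Return the keys that start with 'e' (case-insensitive)."""
    return [key for key in dictionary if str(key).lower().startswith('e')]

# Illustrative reconstruction of the settings structure (abbreviated)
example_settings = {
    "metadata": {"project_folder": "Q342_science_communication"},
    "network": {"from_filtered_dataset": "f01"},
    "embeds": {
        "from_filtered_dataset": "f01",
        "e01": {
            "text_columns": ["TI", "AB"],
            "id_column": "UT",
            "transformer_model": "all-MiniLM-L6-v2",
        },
    },
}

# "from_filtered_dataset" starts with "f", so only "e01" is returned.
e_label = find_e_keys(example_settings["embeds"])[0]
es = example_settings["embeds"][e_label]
```

Because the match is a simple prefix test, any other key in `embeds` that starts with "e" would also be returned, so taking the first element is only reliable when labels such as `e01` are the only such keys.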

In [12]:
dataset_file_name

'/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q342_science_communication/f01/dataset_raw_cleaned.csv'

In [13]:
# Open the data file
df_full = pd.read_csv(dataset_file_name, encoding='latin-1')
#df = pd.read_csv('/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q318_MKT_poverty/f01/dataset_raw_cleaned.txt', delimiter = '\t')
print(df_full.shape)
df_full.head()

(5495, 42)


| | X_N | uuid | PT | AU | AF | TI | SO | LA | DT | DE | ... | AR | DI | PG | WC | SC | OA | UT | Countries | IsoCountries | Institutions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 71b086ac-71ca-457e-bb7f-418abdb2bcd0 | J | Miller, S | Miller, S | Science communication's burnt bridges | PUBLIC UNDERSTANDING OF SCIENCE | English | Review | | ... | | 10.1177/0963662503012001412 | 4 | Communication; History & Philosophy Of Science | Communication; History & Philosophy of Science | | WOS:000182611700008 | united kingdom | GBR | ucl |
| 1 | 2 | 9a977195-0f8c-4be9-ba89-30e9d8a2f734 | J | King, K; Kessler, T; Nimox, K; Alemdar, M | King, Katherine; Kessler, Talia; Nimox, Kari; ... | Science communication in action: lessons from ... | FRONTIERS IN COMMUNICATION | English | Article | science communication; science festival; evalu... | ... | 1622230.0 | 10.3389/fcomm.2025.1622230 | 8 | Communication | Communication | gold | WOS:001578173300001 | usa | USA | georgia inst technol |
| 2 | 3 | 1f839175-112b-4d1d-8992-6196552e1442 | J | Roberts, J | Roberts, Jonathan | New texts in science communication | PUBLIC UNDERSTANDING OF SCIENCE | English | Review | | ... | | 10.1177/0963662515619545 | 3 | Communication; History & Philosophy Of Science | Communication; History & Philosophy of Science | | WOS:000389054700009 | united kingdom | GBR | kings coll london; wellcome trust sanger inst |
| 3 | 4 | f78e8c29-49e3-4161-b8ce-0d6caebd2478 | J | Cooper, JA | Cooper, Jack A. | Science Communication at Historical Biology | HISTORICAL BIOLOGY | English | Editorial Material | | ... | | 10.1080/08912963.2025.2471234 | 2 | Biology; Paleontology | Life Sciences & Biomedicine - Other Topics; Pa... | Bronze | WOS:001433513100001 | united kingdom | GBR | swansea univ |
| 4 | 5 | 7a395ec0-851e-423f-a33c-8a510f836397 | J | Rycroft-Smith, L; Hartkopf, AM; Henning, E | Rycroft-Smith, Lucy; Hartkopf, Anna Maria; Hen... | Handbook of mathematical science communication | RESEARCH IN MATHEMATICS EDUCATION | English | Book Review | | ... | | 10.1080/14794802.2023.2296064 | 8 | Education & Educational Research | Education & Educational Research | | WOS:001155505000001 | united kingdom | GBR | univ cambridge |


# Data Preparation

This step may include multiple sub-steps.
The following is a list of the cleaning steps. Those marked with ✅ are implemented in this notebook.

- Ensure we use text data ✅
- Remove documents with no data ✅
- Convert text to lowercase ✅
- Remove documents that are too short or too long
- Unify or apply transformations to the vocabulary using a dictionary (e.g. convert "AI" to "Artificial Intelligence")
- Remove stopwords
  - English stopwords ✅
  - Custom stopwords (words we do not want to see in the results)
  - Field-specific stopwords (frequent, obvious words for a given dataset)
- Remove numbers ✅
- Remove symbols and punctuation ✅
- Stemming or lemmatization
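
Two of the steps not implemented in this notebook, dictionary-based vocabulary unification and custom stopwords, can be sketched with the standard library alone. This is a minimal illustration, not part of the notebook's pipeline: the helper names, the replacement dictionary, and the stopword set are all hypothetical.

```python
import re

# Hypothetical examples; replace them with your own dictionary and stopwords.
REPLACEMENTS = {"AI": "Artificial Intelligence"}
CUSTOM_STOPWORDS = {"study", "paper"}

def unify_vocabulary(text, replacements):
    """Replace whole-word occurrences of each key with its expansion."""
    for short, full in replacements.items():
        # \b keeps "AI" from matching inside longer words such as "AIM"
        text = re.sub(rf"\b{re.escape(short)}\b", lambda _m, full=full: full, text)
    return text

def drop_stopwords(text, stopwords):
    """Remove whitespace-separated tokens whose lowercase form is a stopword."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

unified = unify_vocabulary("This paper studies AI in schools", REPLACEMENTS)
example = drop_stopwords(unified, CUSTOM_STOPWORDS)
```

The stopword match is exact, so "paper" is dropped while "studies" is kept. Stemming or lemmatization would be needed to catch inflected forms.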

In [14]:
# Ensure all data in these columns is text.
# Note: astype(str) turns missing values (NaN) into the literal string "nan".
text_columns = es['text_columns']
for col in text_columns:
  df_full[col] = df_full[col].astype(str)

# Create a new column named `text` which is the concatenation of all the columns listed in `text_columns`
df_full['text'] = df_full[text_columns].apply(" ".join, axis=1)

In [15]:
import re

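# Remove numbers and symbols from the text column (keep only letters and whitespace)
df_full['text'] = df_full['text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x)))
# Remove the literal "nan " left by missing values cast to str (a trailing "nan" is not removed)
df_full['text'] = df_full['text'].str.replace('nan ', '').str.strip()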
print(df_full['text'][1])

Science communication in action lessons from a mixedmethods case study of a large science festival Introduction Science festivals are a mechanism for connecting public audiences with science topics Scholars have identified best practices for science communication Peterman and Young  facilitating research on how and to what extent effective science communication occurs in the context of science festivalsMethods This mixedmethods evaluation case study centers the experiences of exhibitors ie science communicators at a large science festival event We use a convergent parallel mixed methodological approach with an intent to triangulate observation survey and group interview dataResults Observation data documented the use of effective communication practices by exhibitors such as clear messaging and engaging activities Best practices for science communication were documented more frequently by exhibitors from educational institutions and nonprofit or other organizations compared to exhibito
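
In the output above, deleting symbols outright fuses some words ("mixedmethods", "festivalsMethods"), and missing values survive as the string "nan". One possible alternative, shown here as a hedged sketch rather than the notebook's method, is to fill missing values before joining and to replace non-letters with a space. The helper name `build_text` and the sample data are hypothetical, and pandas is assumed to be available.

```python
import pandas as pd

def build_text(df, text_columns):
    """Join text columns row by row, treating missing values as empty strings."""
    parts = df[text_columns].fillna("").astype(str)
    joined = parts.apply(" ".join, axis=1)
    # Replace anything that is not a letter or whitespace with a space,
    # then collapse repeated whitespace and trim the ends.
    cleaned = joined.str.replace(r"[^a-zA-Z\s]", " ", regex=True)
    return cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

sample = pd.DataFrame({
    "TI": ["A mixed-methods study", "Science festivals in 2024"],
    "AB": [None, "Results: 3 events."],
})
texts = build_text(sample, ["TI", "AB"])
```

With this variant the first row yields "A mixed methods study" with no trailing "nan". Note that it would change the corpus text, and therefore the embeddings, relative to the cells above.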

In [16]:
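# Convert to lowercase and remove English stopwords from the `text` column
if es['to_lowercase']:
  df_full['text'] = df_full['text'].str.lower()
if es['remove_stopwords']:
  # Assumption: `remove_stopwords` is gensim's helper; it is not imported elsewhere in this notebook
  from gensim.parsing.preprocessing import remove_stopwords
  df_full['text'] = df_full['text'].apply(remove_stopwords)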

In [17]:
# Copy a backup of the object
df_corpus = df_full.copy()

In [18]:
# This needs to be fixed in the analysis settings
es['id_column'] = 'uuid'

In [19]:
# Create folder
embeds_folder_path = os.path.abspath(f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['embeds']['from_filtered_dataset']}/{e_label}")
print(embeds_folder_path)

os.makedirs(embeds_folder_path, exist_ok=True)

/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q342_science_communication/f01/e01


In [20]:
# Retain only the data needed for the topic model
df_corpus = df_corpus[["text", "UT", es['id_column']]]
df_corpus = df_corpus.dropna()
df_corpus = df_corpus.reset_index(drop=True).copy()
df_corpus.to_csv(f'{embeds_folder_path}/corpus.csv', index=False)
df_corpus.head()

| | text | UT | uuid |
|---|---|---|---|
| 0 | Science communications burnt bridges nan | WOS:000182611700008 | 71b086ac-71ca-457e-bb7f-418abdb2bcd0 |
| 1 | Science communication in action lessons from a... | WOS:001578173300001 | 9a977195-0f8c-4be9-ba89-30e9d8a2f734 |
| 2 | New texts in science communication nan | WOS:000389054700009 | 1f839175-112b-4d1d-8992-6196552e1442 |
| 3 | Science Communication at Historical Biology nan | WOS:001433513100001 | f78e8c29-49e3-4161-b8ce-0d6caebd2478 |
| 4 | Handbook of mathematical science communication... | WOS:001155505000001 | 7a395ec0-851e-423f-a33c-8a510f836397 |




---



## PART 1: Embeddings

`BERTopic()` is the main class of the library. This part only computes the document embeddings, using `SentenceTransformer`.
- [Official documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)
- [Quick overview](https://maartengr.github.io/BERTopic/index.html)
- [Explanation of parameters](https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.__init__)

In [21]:
sentence_model = SentenceTransformer(es['transformer_model'])

In [22]:
# When the tqdm progress bar does not show and throws an error
#!jupyter nbextension enable --py widgetsnbextension

In [23]:
# Compute and save embeddings.
docs = df_corpus.text.to_list()
embeddings = sentence_model.encode(docs, show_progress_bar=True)

Batches: 100%|██████████| 172/172 [00:32<00:00,  5.37it/s]
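
The batch count in the progress bar can be cross-checked against the corpus size. `SentenceTransformer.encode` uses a default `batch_size` of 32, so the 5495 rows reported for `df_full` would need 172 batches, which is consistent with the output above. The model `all-MiniLM-L6-v2` produces 384-dimensional vectors, so the embeddings array would be expected to have one row per document and 384 columns. A small worked check:

```python
import math

n_docs = 5495        # rows reported by df_full.shape above
batch_size = 32      # default batch_size of SentenceTransformer.encode
embedding_dim = 384  # output dimension of all-MiniLM-L6-v2

# Number of batches needed to encode every document
n_batches = math.ceil(n_docs / batch_size)

# Shape the embeddings array would have if no rows were dropped
expected_shape = (n_docs, embedding_dim)
```

Any corpus size from 5473 to 5504 also gives 172 batches, so this check does not prove that `dropna()` removed nothing.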


In [24]:
# Save the embeddings together with their document ids
save_object_as_pickle({'embeddings': embeddings,
                       'embeddings_ids': df_corpus[es['id_column']].to_list()},
                      f'{embeds_folder_path}/embeddings.pck')
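
Because the pickle stores the vectors and their ids side by side, a later notebook can realign the embeddings with any subset or reordering of the corpus. A minimal sketch, assuming NumPy is available: the helper `align_embeddings` and the toy payload are hypothetical, and only the two dictionary keys come from the cell above.

```python
import numpy as np

def align_embeddings(payload, doc_ids):
    """Return the embedding rows in the order given by doc_ids."""
    position = {doc_id: i for i, doc_id in enumerate(payload["embeddings_ids"])}
    missing = [doc_id for doc_id in doc_ids if doc_id not in position]
    if missing:
        raise KeyError(f"No embedding stored for ids: {missing}")
    rows = [position[doc_id] for doc_id in doc_ids]
    return np.asarray(payload["embeddings"])[rows]

# Toy payload with the same keys as the saved pickle
toy_payload = {
    "embeddings": np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]),
    "embeddings_ids": ["a", "b", "c"],
}
aligned = align_embeddings(toy_payload, ["c", "a"])
```

The aligned array can then be passed to BERTopic through the `embeddings` argument of `fit_transform`, so the documents do not have to be encoded again.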


In [25]:
with open(f'{embeds_folder_path}/embeds_settings.json', 'w') as file:
    json.dump(settings['embeds'], file, indent=4)

