<a href="https://colab.research.google.com/github/cristianmejia00/clustering/blob/main/Topic_Models_using_BERTopic_EMBEDS_20241031.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic -- EMBEDDINGS

`Topic Models` are methods to automatically organize a corpus of text into topics.

Topic Model process:
1. Data preparation
2. Transform text to numeric vectors `<-- THIS CODE!`
3. Dimensionality reduction
4. Clustering
5. Topic analysis
6. Cluster assignment


This notebook uses the `BERTopic` library, a one-stop solution for topic modeling that includes handy functions for plotting and analysis. However, BERTopic does not have a function to extract the X and Y coordinates from UMAP. If you need those coordinates, use the `Topic_Models_using_Transformers` notebooks instead. In any other situation, especially when a quick analysis is needed, this notebook may be the better choice.

This notebook is also the one needed for the heatmap codes included in this folder.

`BERTopic` is a Python library that handles steps 2 to 6.
BERT-based topic models use the transformer architecture to generate the embeddings (i.e. the numeric vector representation of the text) and are currently considered a state-of-the-art method for vectorization.

This notebook shows how to use it.
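To make the idea of an embedding concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and the example texts are hypothetical and chosen by hand; a real transformer model produces vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", hand-picked for illustration only.
toy_embeddings = {
    "cognitive reserve and dementia": [0.9, 0.1, 0.2],
    "brain health in aging": [0.8, 0.2, 0.3],
    "stock market volatility": [0.1, 0.9, 0.1],
}

query = toy_embeddings["cognitive reserve and dementia"]
for text, vector in toy_embeddings.items():
    # Texts on similar subjects should score closer to 1.0.
    print(f"{cosine_similarity(query, vector):.3f}  {text}")
```

Documents whose vectors point in a similar direction score close to 1, which is the property the later clustering step relies on.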

---
Reading:
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Advanced Topic Modeling with BERTopic](https://www.pinecone.io/learn/bertopic/)


# Requirements

## Packages installation and initialization

In [302]:
# This is the first code cell. Execute to give access to Google Drive.
import sys

# Test whether this is running on Google Colab
if 'google.colab' in sys.modules:
    print("Running on Colab")
    # Install libraries
    !pip install bertopic[visualization]

    ## FOLDER
    # mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')

    # Correct path to the Google Drive mounted in Colab
    GDRIVE_PATH = "drive/MyDrive"

    ## ENVIRONMENT
    # pip installs go here.
elif sys.platform == 'win32':
    print("Running on Windows")
    # We are in Local.

    ## FOLDER
    # Correct path to the Google Drive folder in Local
    GDRIVE_PATH = "G:/My Drive"

elif sys.platform == 'linux':
    print("Running on Linux")
    # We are in WSL - VSCode
    GDRIVE_PATH = '/mnt/g/My Drive'

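    # Load environment
    # virtualenv env-titech
    # NOTE: `!` commands run in a separate subshell, so this `source` does not activate the environment for the running kernel
    !source ./env-titech/bin/activate
    #!pip install -r requirements_env_titech.txt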

elif sys.platform == 'darwin':
    print("Running on Mac OS X")
    # We are in Mac Os
    GDRIVE_PATH = "/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive"
    !source ./env-tm/bin/activate

else:
    print("Couldn't mount drive. Check your system and path")

Running on Mac OS X


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
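The cell above leaves `GDRIVE_PATH` undefined when no branch matches, so the failure only surfaces later as a `NameError`. One possible alternative is sketched below: a small function that picks the path or fails immediately. The function name `resolve_gdrive_path` and the `PATHS` dictionary are hypothetical; the path values are copied from the cell above, and the macOS entry is left for you to add.

```python
def resolve_gdrive_path(modules, platform, paths):
    """Pick the Google Drive root for the current environment or fail loudly."""
    # Colab is detected through its module, whatever the underlying platform is.
    if "google.colab" in modules:
        return paths["colab"]
    if platform in paths:
        return paths[platform]
    raise RuntimeError(f"No Google Drive path configured for platform {platform!r}")

# Hypothetical lookup table; values taken from the setup cell of this notebook.
PATHS = {
    "colab": "drive/MyDrive",
    "win32": "G:/My Drive",
    "linux": "/mnt/g/My Drive",
}

print(resolve_gdrive_path({}, "linux", PATHS))
```

In the notebook it would be called as `resolve_gdrive_path(sys.modules, sys.platform, PATHS)`.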


In [303]:
#!pip install --upgrade pip
#!pip install --upgrade numpy==1.26

In [304]:
import pandas as pd
import os
import json
import pickle
from sentence_transformers import SentenceTransformer

# 🔴 Input files and options

Go to your Google Drive and create a folder in the root directory. All related data will be saved in that folder.
Upload the dataset to that folder.
- The dataset should be a `.csv` file.
- Every row in the dataset is a document.
- It can have any kind of columns, but at least one column must contain the text we want to analyze. For example, a dataset of academic articles may contain a "Title" and/or "Abstract" column.
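As a quick sanity check before running the rest of the notebook, the header of the CSV can be tested for the required text columns. This is only a sketch: the helper `missing_columns` and the two sample rows (including their ids) are hypothetical, and "TI" and "AB" stand for the title and abstract columns used later in this notebook.

```python
import csv
import io

# Hypothetical two-document dataset; the second document has an empty abstract.
sample_csv = io.StringIO(
    "UT,TI,AB\n"
    "WOS:0001,Cognitive reserve in aging,We study cognitive reserve.\n"
    "WOS:0002,Brain health and exercise,\n"
)

def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file))
    return [name for name in required if name not in header]

# An empty list means every required column is present.
print(missing_columns(sample_csv, ["TI", "AB"]))
```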

In [305]:
# The bibliometrics folder
ROOT_FOLDER_PATH = f"{GDRIVE_PATH}/Bibliometrics_Drive"

# Change to the name of the folder where the dataset is uploaded inside the above folder
project_folder_name = "Q10_brain_health_ts_20250501"

# DATASET settings
settings_directive = "settings_dataset_directive_2025-05-07-11-52.json"

In [306]:
# Read settings
with open(f'{ROOT_FOLDER_PATH}/{project_folder_name}/{settings_directive}', 'r') as file:
    settings = json.load(file)

## Aux Functions

In [307]:
def find_e_keys(dictionary):
    # List comprehension to find keys starting with 'e'
    e_keys = [key for key in dictionary if str(key).lower().startswith('e')]
    return e_keys
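The sketch below shows how `find_e_keys` behaves on a trimmed settings file. The JSON is hypothetical: it contains only the keys this notebook reads, with values inferred from the outputs shown in the notebook (`f01`, `e01`, and the embedding options).

```python
import json

# Hypothetical, trimmed settings; a real directive file likely has more keys.
settings_text = """
{
  "metadata": {"project_folder": "Q10_brain_health_ts_20250501"},
  "network": {"from_filtered_dataset": "f01"},
  "embeds": {
    "from_filtered_dataset": "f01",
    "e01": {
      "text_columns": ["TI", "AB"],
      "id_column": "UT",
      "transformer_model": "all-MiniLM-L6-v2"
    }
  }
}
"""
example_settings = json.loads(settings_text)

def find_e_keys(dictionary):
    """Return the keys that start with 'e' (case-insensitive)."""
    return [key for key in dictionary if str(key).lower().startswith("e")]

e_keys = find_e_keys(example_settings["embeds"])
print(e_keys)
print(example_settings["embeds"][e_keys[0]]["transformer_model"])
```

Note that the match is on the first letter only, so any other key starting with "e" would also be returned; the notebook takes the first match.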

In [308]:
# Function to save a DataFrame as a CSV file under ROOT_FOLDER_PATH
def save_as_csv(df, save_name_without_extension, with_index):
    "usage: `save_as_csv(dataframe, 'filename', with_index=False)`"
    df.to_csv(f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv", index=with_index)
    print("===\nSaved: ", f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv")

In [309]:
# Function to save an object to a pickle file
def save_object_as_pickle(obj, filename):
  """
  Saves an object as a pickle file.

  Args:
      obj: The object to be saved.
      filename: The filename of the pickle file.
  """
  with open(filename, "wb") as f:
    pickle.dump(obj, f)


In [310]:
# Function to load a pickle object given a path
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


In [311]:
# Input dataset
dataset_file_name = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['network']['from_filtered_dataset']}/dataset_raw_cleaned.csv"

# The text columns used for the topic model are read from the settings file (`text_columns`). For example, in the case of academic articles, we are using the title "TI" and the abstract "AB".
# Those columns will be merged and used as input to the topic model
e_label = find_e_keys(settings['embeds'])[0]

es = settings['embeds'][e_label]

In [312]:
es

{'text_columns': ['TI', 'AB'],
 'to_lowercase': False,
 'remove_stopwords': False,
 'remove_numbers': False,
 'remove_symbols': False,
 'stemming': False,
 'lemmatization': False,
 'id_column': 'UT',
 'transformer_model': 'all-MiniLM-L6-v2',
 'notes': ''}

In [313]:
dataset_file_name

'/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q10_brain_health_ts_20250501/f01/dataset_raw_cleaned.csv'

In [314]:
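# Open the data file
# NOTE: mojibake such as "PÃ©rez" in the preview below suggests the file is UTF-8; if so, `encoding='utf-8'` would decode it correctly
df_full = pd.read_csv(dataset_file_name, encoding='latin-1')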
#df = pd.read_csv('/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q318_MKT_poverty/f01/dataset_raw_cleaned.txt', delimiter = '\t')
print(df_full.shape)
df_full.head()

(8565, 42)


Unnamed: 0,X_N,uuid,PT,AU,AF,TI,SO,LA,DT,DE,...,AR,DI,PG,WC,SC,OA,UT,Countries,IsoCountries,Institutions
0,1,7cc8aa36-8303-452d-bca6-08ea94f8ad7f,J,"Marra, DE; Umfleet, LG; Sabsevitz, DS; Quasney...","Marra, D. E.; Umfleet, Glass L.; Sabsevitz, D....",The impact of cognitive reserve on cognitive o...,CLINICAL NEUROPSYCHOLOGIST,English,Meeting Abstract,,...,,,1,"Psychology, Clinical; Clinical Neurology; Psyc...",Psychology; Neurosciences & Neurology,,WOS:000375471800099,,,
1,2,7f34d364-b244-4ce4-b260-e780455ac588,J,"Greicius, MD; Mormino, EC; Jagust, WJ","Greicius, Michael D.; Mormino, Elizabeth C.; J...",Imaging Cognitive Reserve: Fronto-Parietal Con...,NEUROLOGY,English,Meeting Abstract,,...,,,1,Clinical Neurology,Neurosciences & Neurology,,WOS:000275274000051,,,
2,3,b7da215d-3cfe-4df9-8346-53ebe6aa60da,J,"Bleecker, ML; Ford, P","Bleecker, Margit L.; Ford, Patrick",Impact of cognitive reserve on the relationshi...,NEUROLOGY,English,Letter,,...,,,2,Clinical Neurology,Neurosciences & Neurology,,WOS:000254576600014,,,
3,4,628d94fe-5540-4d47-9b24-f585ed07de2b,J,"Pedrero-PÃ©rez, EJ; Rojo-Mota, G; de LeÃ³n, JM...","Pedrero-Perez, Eduardo J.; Rojo-Mota, Gloria; ...",Cognitive reserve in substance addicts in trea...,REVISTA DE NEUROLOGIA,Spanish,Article,Activities of daily living; Cognitive impairme...,...,,10.33588/rn.5911.2014435,9,Clinical Neurology,Neurosciences & Neurology,,WOS:000347214000001,spain,ESP,madrid salud; univ complutense madrid
4,5,747ee008-0ac3-4d20-a73c-e4514f3bc0b6,J,"Nelson, ME; Andel, R; Hort, J","Nelson, Monica E.; Andel, Ross; Hort, Jakub","Cognitive reserve, neuropathology, and progres...",AGING-US,English,Editorial Material,dementia; neuropathology; Alzheimer's disease;...,...,,,3,Cell Biology; Geriatrics & Gerontology,Cell Biology; Geriatrics & Gerontology,"gold, Green Published",WOS:001035308800004,usa,USA,univ s florida


# Data Preparation

This step may include multiple sub-steps.
The following is a list of possible cleaning steps. Those marked with ✅ are implemented in this notebook; lowercasing and stopword removal run only when enabled in the settings.

- Ensure we use text data ✅
- Remove documents with no data ✅
- Convert text to lowercase ✅
- Remove documents that are too short or too long
- Unify or apply transformations to the vocabulary using a dictionary (e.g. convert "AI" to "Artificial Intelligence")
- Remove stopwords
  - English stopwords ✅
  - Custom stopwords (words we do not want to see in the results)
  - Field-specific stopwords (frequent, obvious words for a given dataset)
- Remove numbers ✅
- Remove symbols and punctuation ✅
- Stemming or lemmatization
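A few of the steps listed above can be sketched in plain Python as follows. This is an illustration, not the notebook's implementation: the function `clean_text` and the tiny `STOPWORDS` set are hypothetical, while the regular expression is the one used later in this notebook.

```python
import re

# Hypothetical, tiny stopword list for illustration; real lists are much longer.
STOPWORDS = {"the", "of", "on", "in", "and", "a", "an", "with"}

def clean_text(text, to_lowercase=True, remove_stopwords=True):
    """Keep letters and whitespace only, then optionally lowercase and drop stopwords."""
    # Numbers, symbols and punctuation are deleted, not replaced by spaces.
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    if to_lowercase:
        text = text.lower()
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("The impact of cognitive reserve on 2 outcomes!"))
```

Because symbols are deleted rather than replaced by a space, hyphenated terms are merged: "Fronto-Parietal" becomes "FrontoParietal".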

In [315]:
# Ensure all data in these columns is text.
# Note: `astype(str)` turns missing values into the literal string "nan".
text_columns = es['text_columns']
for column in text_columns:
  df_full[column] = df_full[column].astype(str)

# Create a new column named `text` which is the concatenation of all the columns listed in `text_columns`
df_full['text'] = df_full[text_columns].apply(" ".join, axis=1)

In [316]:
import re

# Remove numbers and symbols from the text column
df_full['text'] = df_full['text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x)))
# Drop the "nan" placeholders left by missing values (whole words only, so words that merely contain "nan" are kept)
df_full['text'] = df_full['text'].str.replace(r'\bnan\b', '', regex=True).str.strip()
print(df_full['text'][1])

Imaging Cognitive Reserve FrontoParietal Connectivity Increases with PIB Retention in Healthy Elderly Subjects


In [317]:
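# Convert to lowercase and remove English stopwords from the `text` column
if es['to_lowercase']:
  df_full['text'] = df_full['text'].str.lower()
if es['remove_stopwords']:
  # Assumption: `remove_stopwords` is gensim's helper; it is not imported elsewhere in this notebook
  from gensim.parsing.preprocessing import remove_stopwords
  df_full['text'] = df_full['text'].apply(remove_stopwords)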

In [318]:
# Copy a backup of the object
df_corpus = df_full.copy()

In [319]:
# Temporary override: use `uuid` as the id column. This should be fixed in the analysis settings instead.
es['id_column'] = 'uuid'

In [320]:
# Create folder
embeds_folder_path = os.path.abspath(f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['embeds']['from_filtered_dataset']}/{e_label}")
print(embeds_folder_path)

os.makedirs(embeds_folder_path, exist_ok=True)

/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q10_brain_health_ts_20250501/f01/e01


In [321]:
# Retain only the data needed for the topic model
df_corpus = df_corpus[["text", "UT", es['id_column']]]
df_corpus = df_corpus.dropna()
df_corpus = df_corpus.reset_index(drop=True).copy()
df_corpus.to_csv(f'{embeds_folder_path}/corpus.csv', index=False)
df_corpus.head()

Unnamed: 0,text,UT,uuid
0,The impact of cognitive reserve on cognitive o...,WOS:000375471800099,7cc8aa36-8303-452d-bca6-08ea94f8ad7f
1,Imaging Cognitive Reserve FrontoParietal Conne...,WOS:000275274000051,7f34d364-b244-4ce4-b260-e780455ac588
2,Impact of cognitive reserve on the relationshi...,WOS:000254576600014,b7da215d-3cfe-4df9-8346-53ebe6aa60da
3,Cognitive reserve in substance addicts in trea...,WOS:000347214000001,628d94fe-5540-4d47-9b24-f585ed07de2b
4,Cognitive reserve neuropathology and progressi...,WOS:001035308800004,747ee008-0ac3-4d20-a73c-e4514f3bc0b6




---



## PART 1: Embeddings

`BERTopic()` is the main class of the library. This part only computes the document embeddings with `SentenceTransformer`.
- [Official documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)
- [Quick overview](https://maartengr.github.io/BERTopic/index.html)
- [Explanation of parameters](https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.__init__)

In [202]:
sentence_model = SentenceTransformer(es['transformer_model'])


In [None]:
# When the tqdm progress bar does not show and throws an error
#!jupyter nbextension enable --py widgetsnbextension

In [None]:
# Compute and save embeddings.
docs = df_corpus.text.to_list()
embeddings = sentence_model.encode(docs, show_progress_bar=True)

Batches: 100%|██████████| 268/268 [01:03<00:00,  4.22it/s]
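The batch count in the progress bar can be checked with a little arithmetic. The sketch below assumes that no rows were dropped from the 8565 documents, that `encode` ran with its default `batch_size` of 32, and that `all-MiniLM-L6-v2` returns 384-dimensional float32 vectors; none of these values is printed by the notebook itself.

```python
import math

n_docs = 8565        # rows in the corpus, from the shape printed earlier
batch_size = 32      # assumed: default `batch_size` of `SentenceTransformer.encode`
embedding_dim = 384  # assumed: output dimension of all-MiniLM-L6-v2
bytes_per_value = 4  # assumed: float32 output

# Number of batches the encoder has to process.
n_batches = math.ceil(n_docs / batch_size)

# Approximate in-memory size of the resulting embedding matrix.
matrix_bytes = n_docs * embedding_dim * bytes_per_value

print(n_batches)
print(round(matrix_bytes / 1024**2, 1), "MiB")
```

Under these assumptions the result is 268 batches, which matches the progress bar above, and an embedding matrix of roughly 12.5 MiB.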


In [None]:
# Save the embeddings together with their document ids
save_object_as_pickle({'embeddings': embeddings,
                       'embeddings_ids': df_corpus[es['id_column']].to_list()},
                      f'{embeds_folder_path}/embeddings.pck')


In [None]:
with open(f'{embeds_folder_path}/embeds_settings.json', 'w') as file:
    json.dump(settings['embeds'], file, indent=4)
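A later notebook can reload `embeddings.pck` and pair each vector with its document id. The round trip below is a minimal sketch with hypothetical stand-in data (plain lists in place of the NumPy array, made-up ids, and a temporary folder), but it uses the same dictionary keys as the save cell above.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins: two 3-dimensional vectors and their document ids.
payload = {
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "embeddings_ids": ["doc-a", "doc-b"],
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "embeddings.pck")
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Row i of the embeddings belongs to id i, so the two lengths have to match.
assert len(restored["embeddings"]) == len(restored["embeddings_ids"])
id_to_vector = dict(zip(restored["embeddings_ids"], restored["embeddings"]))
print(id_to_vector["doc-b"])
```

Pickle files can execute arbitrary code when loaded, so only load files you created yourself.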

