<a href="https://colab.research.google.com/github/cristianmejia00/clustering/blob/main/Topic_Models_using_BERTopic_EMBEDS_20241031.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic -- EMBEDDINGS

`Topic Models` are methods to automatically organize a corpus of text into topics.

Topic Model process:
1. Data preparation
2. Transform text to numeric vectors `<-- THIS CODE!`
3. Multidimensionality reduction
4. Clustering
5. Topic analysis
6. Cluster assignment


This notebook uses the `BERTopic` library, a one-stop solution for topic modeling that includes handy functions for plotting and analysis. However, BERTopic does not have a function to extract the X and Y coordinates from UMAP. If you need those coordinates, use the `Topic_Models_using_Transformers` notebook instead. Otherwise, when a quick analysis is needed, this notebook may be the better choice.

This notebook is also the one needed for the heatmap codes included in this folder.

`BERTopic` is a Python library that handles steps 2 to 6.
BERTopic models use the transformer architecture to generate the embeddings (i.e. the numeric vector representation of the text), which is currently a state-of-the-art approach to vectorization.

This notebook shows how to use it.

---
Reading:
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Advanced Topic Modeling with BERTopic](https://www.pinecone.io/learn/bertopic/)


# Requirements

## Packages installation and initialization

In [1]:
# This is the first code cell. Execute to give access to Google Drive.
import sys

# Test whether this is running on Google Colab
if 'google.colab' in sys.modules:
    print("Running on Colab")
    # Install libraries
    !pip install bertopic[visualization]

    ## FOLDER
    # mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')

    # Correct path to the Google Drive mounted in Colab
    GDRIVE_PATH = "drive/MyDrive"

    ## ENVIRONMENT
    # pip installs go here.
elif sys.platform == 'win32':
    print("Running on Windows")
    # We are in Local.

    ## FOLDER
    # Correct path to the Google Drive folder in Local
    GDRIVE_PATH = "G:/My Drive"

elif sys.platform == 'linux':
    print("Running on Linux")
    # We are in WSL - VSCode
    GDRIVE_PATH = '/mnt/g/My Drive'

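    # Load environment
    # virtualenv env-titech
    # Note: `!source` runs in a temporary subshell, so it does not change the
    # environment of the running kernel; activate the environment before launching.
    !source ./env-titech/bin/activate
    #!pip install -r requirements_env_titech.txt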

elif sys.platform == 'darwin':
    print("Running on Mac OS X")
    # We are on macOS
    GDRIVE_PATH = "/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive"
    # Note: `!source` runs in a subshell and does not affect the running kernel.
    !source ./env-tm/bin/activate

else:
    print("Couldn't mount drive. Check your system and path")

Running on Mac OS X


In [None]:
#!pip install --upgrade pip
#!pip install --upgrade numpy==1.26

In [2]:
import pandas as pd
import os
import json
import pickle
from sentence_transformers import SentenceTransformer



# 🔴 Input files and options

Go to your Google Drive and create a folder in the root directory. All related data will be saved in that directory.
Upload the dataset into that folder.
- The dataset should be a `.csv` file.
- Every row in the dataset is a document.
- It can have any kind of columns, but some columns must contain the text we want to analyze. For example, a dataset of academic articles may contain a "Title" and/or "Abstract" column.
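
A quick way to confirm that an uploaded file meets these requirements is to check its header and count rows with no text before running the rest of the notebook. The sketch below uses only the standard library and an in-memory sample. The helper name `check_dataset` and the sample rows are hypothetical, and the column names `TI`, `AB`, and `UT` are borrowed from the Web of Science export used later in this notebook.

```python
import csv
import io

def check_dataset(csv_file, text_columns, id_column):
    """Return (n_rows, n_rows_without_text); raise if a required column is missing.

    Hypothetical helper, not part of this notebook's pipeline.
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [c for c in [*text_columns, id_column] if c not in header]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    n_rows = 0
    n_empty = 0
    for row in reader:
        n_rows += 1
        # A document counts as empty when every text column is blank.
        if not any((row[c] or "").strip() for c in text_columns):
            n_empty += 1
    return n_rows, n_empty

# In-memory stand-in for a real .csv file (made-up rows).
sample = io.StringIO(
    "UT,TI,AB\n"
    "WOS:1,Robot grasping,We study grasp planning.\n"
    "WOS:2,,\n"
)
n_rows, n_empty = check_dataset(sample, text_columns=["TI", "AB"], id_column="UT")
print(n_rows, n_empty)
```

For a file on disk, the same function could be called with `open(path, newline="", encoding="utf-8")` in place of the `StringIO` object.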

In [3]:
# The bibliometrics folder
ROOT_FOLDER_PATH = f"{GDRIVE_PATH}/Bibliometrics_Drive"

# Change to the name of the folder where the dataset is uploaded inside the above folder
project_folder_name = "Q322_TS_robot_2022_2024"

# DATASET settings
settings_directive = "settings_dataset_directive_2025-01-29-12-14.json"

In [4]:
# Read settings
with open(f'{ROOT_FOLDER_PATH}/{project_folder_name}/{settings_directive}', 'r') as file:
    settings = json.load(file)
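
The settings file itself is not shown in this notebook. Judging only from the keys that the later cells read, it is expected to look roughly like the structure below. This is a sketch under assumptions: the label `e01` and the model name `all-MiniLM-L6-v2` are hypothetical placeholders, while `f01`, `TI`, `AB`, and `UT` are taken from paths, comments, and outputs that appear elsewhere in the notebook.

```python
import json

# Assumed shape of the settings file, inferred from the keys this notebook reads.
# "e01" and the transformer model name are hypothetical placeholders.
example_settings = {
    "metadata": {"project_folder": "Q322_TS_robot_2022_2024"},
    "network": {"from_filtered_dataset": "f01"},
    "embeds": {
        "from_filtered_dataset": "f01",
        "e01": {
            "text_columns": ["TI", "AB"],
            "id_column": "UT",
            "to_lowercase": False,
            "remove_stopwords": False,
            "transformer_model": "all-MiniLM-L6-v2",
        },
    },
}

# Round-trip through JSON, as the file on disk would be read.
loaded = json.loads(json.dumps(example_settings, indent=4))

# Same selection rule as the notebook's `find_e_keys`: keys that start with "e".
embed_labels = [k for k in loaded["embeds"] if str(k).lower().startswith("e")]
embed_settings = loaded["embeds"][embed_labels[0]]
print(embed_labels, embed_settings["text_columns"])
```

Note that the selection rule picks up any key in `embeds` that starts with "e", so other keys in that section should avoid that initial letter.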

## Aux Functions

In [5]:
def find_e_keys(dictionary):
    # List comprehension to find keys starting with 'e'
    e_keys = [key for key in dictionary if str(key).lower().startswith('e')]
    return e_keys

In [6]:
# Function to save files
def save_as_csv(df, save_name_without_extension, with_index):
    "usage: `save_as_csv(dataframe, 'filename', with_index=False)`"
    df.to_csv(f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv", index=with_index)
    print("===\nSaved: ", f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv")

In [7]:
# Save an object to a pickle file
def save_object_as_pickle(obj, filename):
  """
  Saves an object as a pickle file.

  Args:
      obj: The object to be saved.
      filename: The filename of the pickle file.
  """
  with open(filename, "wb") as f:
    pickle.dump(obj, f)


In [8]:
# Load a pickled object from a given path
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)


In [9]:
# Input dataset
dataset_file_name = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['network']['from_filtered_dataset']}/dataset_raw_cleaned.csv"

# The text columns used for the Topic Model are listed in the settings file. For example, in the case of academic articles, we are using the title "TI" and the abstract "AB".
# Those columns will be merged and used as input to the topic model.
# Select the first embeddings configuration (a settings key starting with "e").
e_label = find_e_keys(settings['embeds'])[0]

es = settings['embeds'][e_label]

In [10]:
dataset_file_name

'/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q322_TS_robot_2022_2024/f01/dataset_raw_cleaned.csv'

In [11]:
# Open the data file
df = pd.read_csv(f"{dataset_file_name}")
#df = pd.read_csv('/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q318_MKT_poverty/f01/dataset_raw_cleaned.txt', delimiter = '\t')
print(df.shape)
df.head()

(103649, 42)


Unnamed: 0,X_N,uuid,PT,AU,AF,TI,SO,LA,DT,DE,...,AR,DI,PG,WC,SC,OA,UT,Countries,IsoCountries,Institutions
0,1,fd2aff73-485d-4769-8497-996647a56213,J,"Yu, LL; Huo, SX; Wang, ZJ; Li, KY","Yu, Lingli; Huo, Shuxin; Wang, Zhengjiu; Li, Keyi",Hybrid attention-oriented experience replay fo...,NEUROCOMPUTING,English,Article,Deep reinforcement learning; Multi -robot; MAD...,...,,10.1016/j.neucom.2022.12.020,14,"Computer Science, Artificial Intelligence",Computer Science,,WOS:000904782300005,peoples r china,CHN,cent south univ; hunan xiangjiang artificial i...
1,2,b81562f9-e75d-4d41-9872-86a08744fcd5,J,"Zhang, JY; Lou, ZF; Fan, KC","Zhang, Jiyun; Lou, Zhifeng; Fan, Kuang-Chao",Accuracy improvement of a 3D passive laser tra...,ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING,English,Article,3D passive laser tracker; Error modeling; Erro...,...,102487.0,10.1016/j.rcim.2022.102487,13,"Computer Science, Interdisciplinary Applicatio...",Computer Science; Engineering; Robotics,,WOS:000911218400001,peoples r china,CHN,dalian univ technol
2,3,90718c9e-8150-4eb2-b2a6-443007642a46,J,"Zhang, T; Li, Y; Ning, CX; Zeng, B","Zhang, Ting; Li, Yang; Ning, Chuanxin; Zeng, Bo",Development and Adaptive Assistance Control of...,IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND EN...,English,Article,Robotic hip exoskeleton; assistive control; ga...,...,,10.1109/TASE.2022.3229396,11,Automation & Control Systems,Automation & Control Systems,,WOS:000903508600001,peoples r china,CHN,soochow univ
3,4,c26908c1-e32c-4d10-8700-dc058690972d,J,"Altyar, AE; El-Sayed, A; Abdeen, A; Piscopo, M...","Altyar, Ahmed E.; El-Sayed, Amr; Abdeen, Ahmed...",Future regenerative medicine developments and ...,BIOMEDICINE & PHARMACOTHERAPY,English,Review,Artificial Intelligence; Gene Therapy; Organ-O...,...,114131.0,10.1016/j.biopha.2022.114131,18,"Medicine, Research & Experimental; Pharmacolog...",Research & Experimental Medicine; Pharmacology...,gold,WOS:000904370900003,saudi arabia; egypt; italy; usa; poland,SAU; EGY; ITA; USA; POL,king abdulaziz univ; cairo univ; benha univ; u...
4,5,f34e67c6-1ac6-4030-a673-956e3be6386e,J,"Anantharanga, AT; Hashemi, MS; Sheidaei, A","Anantharanga, Abhijith Thoopul; Hashemi, Moham...",Linking properties to microstructure in liquid...,COMPUTATIONAL MATERIALS SCIENCE,English,Article,Liquid metal embedded elastomers; Multifunctio...,...,111983.0,10.1016/j.commatsci.2022.111983,12,"Materials Science, Multidisciplinary",Materials Science,Green Submitted,WOS:000911684200001,usa,USA,iowa state univ


# Data Preparation

This step may include multiple sub-steps.
The following list shows common cleaning steps. Those marked with ✅ are implemented in this notebook.

- Ensure we use text data ✅
- Remove documents with no data ✅
- Convert text to lowercase ✅
- Remove documents that are too short or too long
- Unify or apply transformations to the vocabulary using a dictionary (e.g. convert "AI" to "Artificial Intelligence")
- Remove stopwords
  - English stopwords ✅
  - Custom stopwords (words we do not want to see in the results)
  - Field-specific stopwords (frequent, obvious words for a given dataset)
- Remove numbers
- Remove symbols and punctuation
- Stemming or lemmatization

In [12]:
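# Ensure all data in these columns is text
# Note: `astype(str)` turns missing values into the string "nan", so the later
# `dropna()` only removes rows with a missing ID.
text_columns = es['text_columns']
for column in text_columns:
  df[column] = df[column].astype(str)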

# Create a new column named `text` which is the concatenation of all the columns listed in `text_columns`
df['text'] = df[text_columns].apply(" ".join, axis=1)

In [13]:
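# Convert to lowercase and remove English stopwords from the `text` column
# Assumption: `remove_stopwords` is gensim's helper; it is not imported elsewhere in this notebook.
from gensim.parsing.preprocessing import remove_stopwords

if es['to_lowercase']:
  df['text'] = df['text'].str.lower()
if es['remove_stopwords']:
  df['text'] = df['text'].apply(remove_stopwords)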

In [14]:
# Copy a backup of the object
df_full = df.copy()

In [15]:
# Retain only the data needed for the topic model
df = df[["text", es['id_column']]]
df = df.dropna()
df.head()

Unnamed: 0,text,UT
0,Hybrid attention-oriented experience replay fo...,WOS:000904782300005
1,Accuracy improvement of a 3D passive laser tra...,WOS:000911218400001
2,Development and Adaptive Assistance Control of...,WOS:000903508600001
3,Future regenerative medicine developments and ...,WOS:000904370900003
4,Linking properties to microstructure in liquid...,WOS:000911684200001




---



## PART 1: Embeddings

`BERTopic()` is the main class of the library. This part only computes the document embeddings that it consumes.
- [Official documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)
- [Quick overview](https://maartengr.github.io/BERTopic/index.html)
- [Explanation of parameters](https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.__init__)
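
An embedding is a fixed-length numeric vector per document, and the later steps (dimensionality reduction and clustering) depend on how close those vectors are to each other. The toy example below illustrates the idea with hand-made 3-dimensional vectors and cosine similarity. The vectors and document titles are invented for illustration; real sentence-transformer embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors: the first two documents are about similar subjects.
toy_embeddings = {
    "robot arm control": [0.9, 0.1, 0.0],
    "manipulator trajectory planning": [0.8, 0.2, 0.1],
    "gene therapy review": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(
    toy_embeddings["robot arm control"],
    toy_embeddings["manipulator trajectory planning"],
)
sim_far = cosine_similarity(
    toy_embeddings["robot arm control"],
    toy_embeddings["gene therapy review"],
)
print(round(sim_close, 3), round(sim_far, 3))
```

Documents on related subjects end up with a similarity near 1, while unrelated ones score near 0, which is what allows clustering to group them into topics.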

In [16]:
sentence_model = SentenceTransformer(es['transformer_model'])

In [17]:
# Run this when the tqdm progress bar does not show and throws an error
#!jupyter nbextension enable --py widgetsnbextension

In [18]:
# Compute and save embeddings.
df_new = df.reset_index(drop=True).copy()
docs = df_new.text.to_list()
embeddings = sentence_model.encode(docs, show_progress_bar=True)

Batches: 100%|██████████| 3240/3240 [13:20<00:00,  4.05it/s]
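
The progress bar is consistent with the dataset size: `SentenceTransformer.encode` uses a batch size of 32 by default, so the 103,649 rows reported by `df.shape` need 3,240 batches. The short calculation below reproduces that count and converts the reported rate into documents per second. It assumes that no rows were dropped before encoding and that the default batch size was in effect.

```python
import math

n_docs = 103_649          # rows reported by df.shape
batch_size = 32           # default batch size of SentenceTransformer.encode
elapsed_s = 13 * 60 + 20  # 13:20 taken from the progress bar

n_batches = math.ceil(n_docs / batch_size)
batches_per_s = n_batches / elapsed_s
docs_per_s = n_docs / elapsed_s
print(n_batches, round(batches_per_s, 2), round(docs_per_s, 1))
```

The same arithmetic gives a rough estimate of the encoding time for a dataset of a different size on the same machine.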


In [19]:
# Create folder
embeds_folder_path = os.path.abspath(f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['embeds']['from_filtered_dataset']}/{e_label}")

os.makedirs(embeds_folder_path, exist_ok=True)

In [20]:
# Save files
df_new.to_csv(f'{embeds_folder_path}/corpus.csv', index=False)
# Use the ID column from the settings instead of a hard-coded column name
save_object_as_pickle({'embeddings': embeddings,
                       'embeddings_ids': df_new[es['id_column']]}, f'{embeds_folder_path}/embeddings.pck')


In [21]:
with open(f'{embeds_folder_path}/embeds_settings.json', 'w') as file:
    json.dump(settings['embeds'], file, indent=4)
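
A later notebook can reload the saved pickle and pair each vector with its document ID. The round trip below is a minimal sketch of that step using toy data and a temporary folder. The dictionary keys match the ones saved above, but the vectors are invented, and plain lists stand in for the real embeddings array and ID column.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the real objects: two 3-dimensional vectors and their IDs.
payload = {
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "embeddings_ids": ["WOS:000904782300005", "WOS:000911218400001"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.pck")
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Each vector must still line up with its document ID.
if len(restored["embeddings"]) != len(restored["embeddings_ids"]):
    raise ValueError("Embeddings and IDs are out of sync")

id_to_vector = dict(zip(restored["embeddings_ids"], restored["embeddings"]))
print(id_to_vector["WOS:000911218400001"])
```

Pickle files can execute code when loaded, so only open files that come from a trusted source.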