<a href="https://colab.research.google.com/github/cristianmejia00/clustering/blob/main/Topic_Models_using_BERTopic_EMBEDS_20241031.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic -- EMBEDDINGS

`Topic Models` are methods to automatically organize a corpus of text into topics.

Topic Model process:
1. Data preparation
2. Transform text to numeric vectors `<-- THIS CODE!`
3. Dimensionality reduction
4. Clustering
5. Topic analysis
6. Cluster assignment


This notebook uses the `BERTopic` library, a one-stop solution for topic modeling that includes handy functions for plotting and analysis. However, BERTopic does not have a function to extract the X and Y coordinates from UMAP. If you need the coordinates, use the `Topic_Models_using_Transformers` notebooks instead. Otherwise, when a quick analysis is needed, this notebook may be the better choice.

This notebook is also the one required by the heatmap code included in this folder.

`BERTopic` is a Python library that handles steps 2 to 6.
BERTopic models use the transformer architecture to generate the embeddings (i.e., the numeric vector representation of the text), an approach widely regarded as state of the art for text vectorization.

This notebook shows how to use it.

---
Reading:
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Advanced Topic Modeling with BERTopic](https://www.pinecone.io/learn/bertopic/)


# Requirements

## Packages installation and initialization

In [1]:
!pip install bertopic[visualization]

Collecting bertopic[visualization]
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic[visualization])
  Downloading hdbscan-0.8.39-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting umap-learn>=0.5.0 (from bertopic[visualization])
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic[visualization])
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading hdbscan-0.8.39-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)
Downloading bertopic-0.16.4-py3-none-any.whl (143 kB)

In [2]:
import pandas as pd
import numpy as np
import time
import math
import uuid
import re
import os
import json
import pickle
from datetime import date
from itertools import compress
from bertopic import BERTopic
from umap import UMAP
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

## Connect your Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [4]:
def find_e_keys(dictionary):
    # List comprehension to find keys starting with 'e'
    e_keys = [key for key in dictionary if str(key).lower().startswith('e')]
    return e_keys
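
A quick check of what `find_e_keys` returns may help here. The dictionary below is a made-up stand-in for `settings['embeds']`; the key names `e01` and `E02` and their contents are assumptions, not values from the actual settings file.

```python
def find_e_keys(dictionary):
    """Return the keys whose string form starts with 'e' (case-insensitive)."""
    return [key for key in dictionary if str(key).lower().startswith('e')]

# Hypothetical stand-in for settings['embeds']
example_embeds = {
    'from_filtered_dataset': 'f01',
    'e01': {'text_columns': ['TI', 'AB']},
    'E02': {'text_columns': ['TI']},
}

e_keys = find_e_keys(example_embeds)
print(e_keys)  # ['e01', 'E02']
```

Any key that begins with "e" matches, so an unrelated key such as `extra_notes` would be returned as well. The notebook only uses the first match.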

# 🔴 Input files and options

Go to your Google Drive and create a folder in the root directory. All related data will be saved in that directory.
Upload the dataset into that folder.
- The dataset should be a `.csv` file.
- Every row in the dataset is a document.
- It can have any kind of columns, but some columns must contain the text we want to analyze. For example, a dataset of academic articles may contain a "Title" and/or "Abstract" column.
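
Before running the rest of the notebook, it can be useful to confirm that the uploaded file really contains the columns we plan to use. The following is a minimal sketch using only the standard library; `missing_columns` is a hypothetical helper and the in-memory CSV is made-up data.

```python
import csv
import io

def missing_columns(csv_file, required_columns):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [col for col in required_columns if col not in header]

# Tiny in-memory CSV standing in for the uploaded dataset (made-up rows)
sample_csv = io.StringIO(
    "UT,TI,AB\n"
    "WOS:0001,A first title,A first abstract\n"
    "WOS:0002,A second title,\n"
)

missing = missing_columns(sample_csv, ["UT", "TI", "AB", "PY"])
print(missing)  # ['PY']
```

For a file on disk, the same helper can be called inside `with open(path, newline='', encoding='utf-8') as f:`.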

In [8]:
# The bibliometrics folder
ROOT_FOLDER_PATH = "drive/MyDrive/Bibliometrics_Drive"

# Change to the name of the folder where the dataset is uploaded inside the above folder
project_folder_name = "Q312_utokyo"
settings_directive = "settings_dataset_directive_2024-11-02-14-59.json"

In [9]:
# Read settings
with open(f'{ROOT_FOLDER_PATH}/{project_folder_name}/{settings_directive}', 'r') as file:
    settings = json.load(file)

In [10]:
# Input dataset
dataset_file_name = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['network']['from_filtered_dataset']}/dataset_raw_cleaned.csv"

# Select the embeddings configuration: the first key in settings['embeds'] that starts with "e".
# It lists the text columns to use for the Topic Model. For example, in the case of academic articles, the Title "TI" and the abstract "AB".
# Those columns will be merged and used as input to the topic model
e_label = find_e_keys(settings['embeds'])[0]

es = settings['embeds'][e_label]
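
The settings file itself is not shown in this notebook. Judging only from the keys the code reads, it might look roughly like the sketch below. All values except the project folder name are assumptions (including the `e01` label, the `f01` folder name, and the model name), and `missing_embed_keys` is a hypothetical helper.

```python
import json

# Hypothetical settings; only the key names are taken from this notebook
example_settings = {
    "metadata": {"project_folder": "Q312_utokyo"},
    "network": {"from_filtered_dataset": "f01"},
    "embeds": {
        "from_filtered_dataset": "f01",
        "e01": {
            "text_columns": ["TI", "AB"],
            "id_column": "UT",
            "to_lowercase": True,
            "remove_stopwords": True,
            "transformer_model": "all-MiniLM-L6-v2",
        },
    },
}

REQUIRED_EMBED_KEYS = ["text_columns", "id_column", "to_lowercase",
                       "remove_stopwords", "transformer_model"]

def missing_embed_keys(embed_settings):
    """Return the required keys that the embeddings configuration lacks."""
    return [key for key in REQUIRED_EMBED_KEYS if key not in embed_settings]

missing_keys = missing_embed_keys(example_settings["embeds"]["e01"])
print(json.dumps(example_settings["embeds"], indent=4))
print("Missing keys:", missing_keys)
```

Running a check like this on `es` right after loading the settings would surface a misspelled key early, before the slower embedding step.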

In [11]:
# Function to save files
def save_as_csv(df, save_name_without_extension, with_index):
    """Usage: `save_as_csv(dataframe, 'filename', with_index=False)`. The file is saved under ROOT_FOLDER_PATH."""
    df.to_csv(f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv", index=with_index)
    print("===\nSaved: ", f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv")

In [12]:
# Function to save an object to a pickle file
def save_object_as_pickle(obj, filename):
  """
  Saves an object as a pickle file.

  Args:
      obj: The object to be saved.
      filename: The filename of the pickle file.
  """
  with open(filename, "wb") as f:
    pickle.dump(obj, f)


In [13]:
# Function to load a pickled object from a given path
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)
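
A small round trip shows how the two pickle helpers fit together. This sketch redefines them so it runs on its own, writes to a temporary folder rather than Google Drive, and uses a made-up payload.

```python
import os
import pickle
import tempfile

def save_object_as_pickle(obj, filename):
    """Save an object to a pickle file."""
    with open(filename, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load an object from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Made-up payload with the same keys this notebook stores later
payload = {"embeddings": [[0.1, 0.2], [0.3, 0.4]],
           "embeddings_ids": ["doc-1", "doc-2"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example.pck")
    save_object_as_pickle(payload, pickle_path)
    restored = load_pickle(pickle_path)

print(restored == payload)  # True
```

Note that `pickle.load` can execute arbitrary code, so only load pickle files that you created yourself or that come from a trusted source.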


In [14]:
# Open the data file
df = pd.read_csv(dataset_file_name)
print(df.shape)
df.head()

(11120, 39)


| | X_N | uuid | PT | AU | TI | SO | LA | DT | DE | ID | ... | VL | IS | BP | EP | AR | DI | PG | WC | SC | UT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41e14258-945f-4ec9-9e0a-e58c87752c10 | J | Kishi, T; Kobayashi, K; Sasagawa, K; Sakimura,... | Automated analysis of a novel object recogniti... | BEHAVIOURAL BRAIN RESEARCH | English | Article | Animal behavior; Novel object recognition test... | ANXIETY; MODELS | ... | 476.0 | | | | 115278.0 | 10.1016/j.bbr.2024.115278 | 5 | Behavioral Sciences; Neurosciences | Behavioral Sciences; Neurosciences & Neurology | WOS:001333148000001 |
| 1 | 2 | f0daa586-2560-43ef-8132-f3b87f9167a3 | J | Berenger, F; Tsuda, K | An ANI-2 enabled open-source protocol to estim... | JOURNAL OF COMPUTATIONAL CHEMISTRY | English | Article; Early Access | ANI-2x; docking; ligand; LIT-PCBA; MC; QM; SBV... | OPEN-SOURCE PACKAGE; FORCE-FIELD | ... | | | | | | 10.1002/jcc.27478 | 11 | Chemistry, Multidisciplinary | Chemistry | WOS:001328968200001 |
| 2 | 3 | 23327f73-de46-43e9-8796-decc6950c06d | J | Kato, S; Ono, S | Accelerated Approval for Cancer Drugs in the U... | JOURNAL OF PHARMACEUTICAL INNOVATION | English | Article | Accelerated approval; Oncology; Evidence and a... | END-POINTS; ONCOLOGY | ... | 19.0 | 5.0 | | | 54.0 | 10.1007/s12247-024-09851-9 | 12 | Pharmacology & Pharmacy | Pharmacology & Pharmacy | WOS:001304011000001 |
| 3 | 4 | dff88211-4a40-4566-a7c0-1856b531092f | J | Kojima, K; Chambers, JK; Yoshizawa, M; Fujioka... | Pathological features of intrathoracic histioc... | JOURNAL OF VETERINARY MEDICAL SCIENCE | English | Article | histiocytic sarcoma; spiny rat; Tokudaia osime... | CHROMOSOME | ... | 86.0 | 10.0 | 1076.0 | 1080.0 | | 10.1292/jvms.24-0185 | 5 | Veterinary Sciences | Veterinary Sciences | WOS:001338099800007 |
| 4 | 5 | 5e23ca4b-52e5-47ad-9812-1e7a65f2b2a8 | J | Delgado-Munoz, J; Matsunaka, R; Hiraki, K | Classification of Known and Unknown Study Item... | BRAIN SCIENCES | English | Article | long term memory; familiarity; electroencephal... | ERP; RECOGNITION; FAMILIARITY; RECOLLECTION | ... | 14.0 | 9.0 | | | 860.0 | 10.3390/brainsci14090860 | 21 | Neurosciences | Neurosciences & Neurology | WOS:001323906200001 |


# Data Preparation

This step may include multiple sub-steps.
The following is a list of the cleaning process. Those with ✅ are implemented in this notebook.

- Ensure we use text data ✅
- Remove documents with no data ✅
- Convert text to lowercase ✅
- Remove documents that are too short or too long
- Unify or apply transformations to the vocabulary using a dictionary (e.g., convert "AI" to "Artificial Intelligence")
- Remove stopwords
  - English stopwords ✅
  - Custom stopwords (words we do not want to see in the results)
  - Field-specific stopwords (frequent, obvious words for a given dataset)
- Remove numbers
- Remove symbols and punctuation
- Stemming or lemmatization
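
Some of the steps without a ✅ can be approximated with the standard library. The sketch below is one possible way to do it, not what this notebook runs: `clean_text` and `has_valid_length` are hypothetical helpers, the word limits are arbitrary, and the input is assumed to be lowercase already.

```python
import re

def clean_text(text, replacements=None, custom_stopwords=None):
    """Apply a few optional cleaning steps to one lowercase document."""
    replacements = replacements or {}
    custom_stopwords = set(custom_stopwords or [])
    # Unify vocabulary: replace whole words or phrases using the dictionary
    for source, target in replacements.items():
        text = re.sub(rf"\b{re.escape(source)}\b", lambda _m, t=target: t, text)
    # Remove numbers, then symbols and punctuation (letters, spaces, hyphens stay)
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"[^a-z\s-]", " ", text)
    # Drop custom stopwords and collapse repeated whitespace
    tokens = [tok for tok in text.split() if tok not in custom_stopwords]
    return " ".join(tokens)

def has_valid_length(text, min_words=3, max_words=1000):
    """Return True when the document is neither too short nor too long."""
    return min_words <= len(text.split()) <= max_words

doc = "ai methods 3 robots: study (2024)"
cleaned = clean_text(doc, {"ai": "artificial intelligence"}, ["study"])
print(cleaned)                    # artificial intelligence methods robots
print(has_valid_length(cleaned))  # True
```

Whether these steps help depends on the corpus. Sentence-transformer models are trained on natural text, so aggressive cleaning may not improve the embeddings.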

In [15]:
# Ensure all data in these columns is text.
# Missing values become empty strings first; otherwise `astype(str)` would turn them into the literal text "nan".
text_columns = es['text_columns']
for column in text_columns:
  df[column] = df[column].fillna("").astype(str)

# Create a new column named `text` which is the concatenation of all the columns listed in `text_columns`
df['text'] = df[text_columns].apply(" ".join, axis=1)
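
The merge can be tried on a toy table first. The records below are made up, and the sketch adds one precaution as an assumption about what is desirable: missing values are replaced with empty strings before joining, because `astype(str)` can otherwise turn a missing abstract into literal text such as `nan`.

```python
import pandas as pd

# Made-up records; the second one has no abstract
toy = pd.DataFrame({
    "TI": ["Deep learning for images", "Graph theory basics"],
    "AB": ["A short survey.", None],
})
text_columns = ["TI", "AB"]

# Replace missing values with empty strings, then join the columns row by row
toy["text"] = (
    toy[text_columns]
    .fillna("")
    .astype(str)
    .agg(" ".join, axis=1)
    .str.strip()
)
print(toy["text"].tolist())
```

The final `str.strip()` removes the trailing space left behind when the last column is empty.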

In [16]:
# Convert to lowercase and remove English stopwords from the `text` column
if es['to_lowercase']:
  df['text'] = df['text'].str.lower()
if es['remove_stopwords']:
  df['text'] = df['text'].apply(remove_stopwords)

In [17]:
# Copy a backup of the object
df_full = df.copy()

In [18]:
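# Retain only the data needed for the topic model
df = df[["text", es['id_column']]]
# Drop rows with a missing ID, and rows with no text left after cleaning
df = df.dropna()
df = df[df["text"].str.strip() != ""]
df.head()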

| | text | UT |
|---|---|---|
| 0 | automated analysis novel object recognition te... | WOS:001333148000001 |
| 1 | ani-2 enabled open-source protocol estimate li... | WOS:001328968200001 |
| 2 | accelerated approval cancer drugs united state... | WOS:001304011000001 |
| 3 | pathological features intrathoracic histiocyti... | WOS:001338099800007 |
| 4 | classification known unknown study items memor... | WOS:001323906200001 |




---



## PART 1: Embeddings

`BERTopic()` is the main class of the library.
- [Official documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)
- [Quick overview](https://maartengr.github.io/BERTopic/index.html)
- [Explanation of parameters](https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.__init__)
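
For orientation, the sketch below shows how embeddings computed in this notebook could later be handed to `BERTopic` so that the embedding step is not repeated. It is an illustration under assumptions rather than this project's pipeline: `fit_topics_from_embeddings` is a hypothetical helper, and the UMAP parameters and the choice of `KMeans` with 10 clusters are placeholders.

```python
def fit_topics_from_embeddings(docs, embeddings, n_topics=10, seed=42):
    """Fit BERTopic on precomputed embeddings (a sketch with placeholder parameters)."""
    # Heavy dependencies are imported here so the function can be defined
    # even before the packages are installed
    from bertopic import BERTopic
    from sklearn.cluster import KMeans
    from umap import UMAP

    # Dimensionality reduction (step 3); parameter values are assumptions
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=seed)
    # Clustering (step 4); BERTopic accepts a scikit-learn clusterer
    # through its `hdbscan_model` argument
    cluster_model = KMeans(n_clusters=n_topics, random_state=seed)

    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=cluster_model)
    # Passing `embeddings` makes BERTopic reuse them instead of encoding `docs`
    topics, _ = topic_model.fit_transform(docs, embeddings)
    return topic_model, topics
```

With the `docs` list and `embeddings` array from this notebook, a call such as `topic_model, topics = fit_topics_from_embeddings(docs, embeddings)` followed by `topic_model.get_topic_info()` would give a first overview of the topics.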

In [19]:
# Format data as required by the topic model package
#documents = df.text.to_list()
#dates = df['PY'].apply(lambda x: pd.Timestamp(year=int(x),month=1,day=1)).to_list()

In [20]:
sentence_model = SentenceTransformer(es['transformer_model'])

In [21]:
# Compute embeddings (they are saved in the cells below).
df_new = df.reset_index(drop=True).copy()
docs = df_new.text.to_list()
embeddings = sentence_model.encode(docs, show_progress_bar=True)

Batches:   0%|          | 0/348 [00:00<?, ?it/s]
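
The 348 batches in the progress bar are consistent with the 11,120 rows reported by `df.shape`, assuming `encode` ran with a batch size of 32 (believed to be its default; treat that value as an assumption). A quick arithmetic check:

```python
import math

n_docs = 11120    # rows reported by df.shape earlier in this notebook
batch_size = 32   # assumed default batch size of SentenceTransformer.encode
n_batches = math.ceil(n_docs / batch_size)
print(n_batches)  # 348
```

If memory allows, passing a larger `batch_size` to `encode` reduces the number of batches, which may speed things up on a GPU runtime.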

In [22]:
# Create folder
embeds_folder_path = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['embeds']['from_filtered_dataset']}/{e_label}"

# `os.makedirs` also creates missing parent folders and works with paths that contain spaces
os.makedirs(embeds_folder_path, exist_ok=True)
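
To make the resulting folder layout easier to picture, the sketch below assembles the paths used in this notebook from example values. The folder name `f01` and the label `e01` are hypothetical, and the sketch assumes that the `network` and `embeds` settings point to the same filtered dataset folder.

```python
from pathlib import PurePosixPath

root = PurePosixPath("drive/MyDrive/Bibliometrics_Drive")
project_folder = "Q312_utokyo"   # from this notebook
filtered_dataset = "f01"         # hypothetical value of 'from_filtered_dataset'
e_label = "e01"                  # hypothetical embeddings label

# Input dataset and the folder that receives the embedding outputs
dataset_path = root / project_folder / filtered_dataset / "dataset_raw_cleaned.csv"
embeds_folder = root / project_folder / filtered_dataset / e_label
outputs = [embeds_folder / name
           for name in ("corpus.csv", "embeddings.pck", "embeds_settings.json")]

print(dataset_path)
for path in outputs:
    print(path)
```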

In [23]:
# Save files
df_new.to_csv(f'{embeds_folder_path}/corpus.csv', index=False)
# Use the ID column from the settings instead of a hard-coded column name
save_object_as_pickle({'embeddings': embeddings,
                       'embeddings_ids': df_new[es['id_column']]}, f'{embeds_folder_path}/embeddings.pck')
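
When these files are read back in a later notebook, it may be worth confirming that row *i* of the embedding matrix still corresponds to row *i* of `corpus.csv`. The sketch below uses made-up data, and `check_alignment` is a hypothetical helper.

```python
import numpy as np

def check_alignment(embeddings, embedding_ids, corpus_ids):
    """Raise ValueError if embeddings and corpus rows do not line up one-to-one."""
    if len(embeddings) != len(embedding_ids):
        raise ValueError("Number of embeddings and IDs differ")
    if list(embedding_ids) != list(corpus_ids):
        raise ValueError("Embedding IDs do not match the corpus order")
    return True

# Made-up data standing in for the contents of embeddings.pck and corpus.csv
stored = {
    "embeddings": np.zeros((3, 4), dtype=np.float32),
    "embeddings_ids": ["WOS:0001", "WOS:0002", "WOS:0003"],
}
corpus_ids = ["WOS:0001", "WOS:0002", "WOS:0003"]

aligned = check_alignment(stored["embeddings"], stored["embeddings_ids"], corpus_ids)
print(aligned, stored["embeddings"].shape)  # True (3, 4)
```

Because the IDs are pickled as a pandas Series in this notebook, pandas must be installed in whichever environment loads `embeddings.pck`.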


In [24]:
with open(f'{embeds_folder_path}/embeds_settings.json', 'w') as file:
    json.dump(settings['embeds'], file, indent=4)