<a href="https://colab.research.google.com/github/cristianmejia00/clustering/blob/main/Topic_Models_using_BERTopic_TOPIC_MODEL_20241101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic

🔴 copied from the [Kubota Colab](https://colab.research.google.com/drive/1YsDp5_qGXGJKsEXsS8DO8CA_lqZc6EpA).  

`Topic Models` are methods to automatically organize a corpus of text into topics.

Topic Model process:
1. Data preparation
2. Transform text to numeric vectors
3. Dimensionality reduction
4. Clustering
5. Topic analysis
6. Cluster assignment
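
To make the six steps concrete, here is a minimal, dependency-free sketch on a toy corpus. It is not how BERTopic works internally: word counts stand in for transformer embeddings, the dimensionality reduction step is skipped, and a nearest-seed rule stands in for a real clustering algorithm. All names and data (`toy_docs`, `vectorize`, `assign`, `top_words`) are made up for illustration.

```python
from collections import Counter
import math

# 1. Data preparation: a tiny, already-clean toy corpus (hypothetical data).
toy_docs = [
    "robot surgery kidney patients",
    "kidney surgery outcomes patients",
    "service robots hotel customers",
    "customers trust service robots",
]

# 2. Transform text to numeric vectors (word counts stand in for embeddings).
vocab = sorted({word for doc in toy_docs for word in doc.split()})

def vectorize(doc):
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [vectorize(doc) for doc in toy_docs]

# 3. Dimensionality reduction is skipped here; the vectors are already tiny.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# 4. Clustering: assign each document to the most similar seed document.
seeds = [vectors[0], vectors[2]]

def assign(vec):
    sims = [cosine(vec, seed) for seed in seeds]
    return sims.index(max(sims))

labels = [assign(vec) for vec in vectors]

# 5. Topic analysis: the most frequent words in each cluster.
def top_words(cluster, n=2):
    counts = Counter(
        word
        for doc, label in zip(toy_docs, labels)
        if label == cluster
        for word in doc.split()
    )
    return [word for word, _ in counts.most_common(n)]

# 6. Cluster assignment: attach the label to each document.
for doc, label in zip(toy_docs, labels):
    print(label, doc, top_words(label))
```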


This notebook uses the `BERTopic` library, a one-stop solution for topic modeling that includes handy functions for plotting and analysis. However, BERTopic does not have a function to extract the X and Y coordinates from UMAP. If you need the coordinates, use the `Topic_Models_using_Transformers` notebook instead. Otherwise, when a quick analysis is needed, this notebook may be the better choice.

This notebook is also required by the heatmap code included in this folder.

`BERTopic` is a Python library that handles steps 2 to 6.
BERTopic models use the transformer architecture to generate embeddings (i.e., numeric vector representations of text), which are widely regarded as a state-of-the-art approach to text vectorization.

This notebook shows how to use it.

---
Reading:
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Advanced Topic Modeling with BERTopic](https://www.pinecone.io/learn/bertopic/)


# Requirements

## Packages installation and initialization

In [2]:
# Quote the requirement so the shell (e.g., zsh) does not treat the square brackets as a glob pattern
!pip install "bertopic[visualization]"


In [1]:
import pandas as pd
import numpy as np
import time
import math
import uuid
import re
import os
import json
import pickle
from datetime import date
from itertools import compress
from bertopic import BERTopic
from umap import UMAP
from gensim.parsing.preprocessing import remove_stopwords
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer



## Connect your Google Drive

In [2]:
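# Colab only: skip this cell when running locally, where the `google.colab` module is not available
from google.colab import drive
drive.mount('/content/drive')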

ModuleNotFoundError: No module named 'google'

In [2]:
def find_e_keys(dictionary):
    # List comprehension to find keys starting with 'e'
    e_keys = [key for key in dictionary if str(key).lower().startswith('e')]
    return e_keys

# 🔴 Input files and options

Go to your Google Drive and create a folder in the root directory. All related data will be saved in that folder.
Upload the dataset into that folder.
- The dataset should be a `.csv` file.
- Every row in the dataset is a document.
- It can have any columns, but at least one column must contain the text to analyze. For example, a dataset of academic articles may contain a "Title" and/or "Abstract" column.
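
As an illustration of the expected input, the sketch below reads a small in-memory CSV and joins the text columns of each row into one document. The column names `Title` and `Abstract` follow the example above; the sample rows and the helper `build_documents` are hypothetical.

```python
import csv
import io

# Hypothetical two-row dataset; a real file would be read from disk instead.
sample_csv = io.StringIO(
    "Title,Abstract,Year\n"
    "Service robots in hotels,We survey customer attitudes.,2023\n"
    "Robotic partial nephrectomy,,2022\n"
)

def build_documents(rows, text_columns=("Title", "Abstract")):
    """Join the non-empty text columns of each row into a single string."""
    result = []
    for row in rows:
        parts = [(row.get(col) or "").strip() for col in text_columns]
        result.append(" ".join(part for part in parts if part))
    return result

sample_documents = build_documents(csv.DictReader(sample_csv))
print(sample_documents)
```

Rows with an empty abstract still yield a usable document because empty parts are dropped before joining.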

In [6]:
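# The bibliometrics folder. Keep only the assignment that matches your environment;
# as written, the Mac path below overrides the Colab path.
# Colab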
ROOT_FOLDER_PATH = "drive/MyDrive/Bibliometrics_Drive"

# Mac
ROOT_FOLDER_PATH = "/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive"

# Change to the name of the folder where the dataset is uploaded inside the above folder
project_folder = 'Q322_TS_robot_2022_2024'

analysis_id = 'a01_tm__f01_e01__km01'

# Settings file of the analysis (JSON)
settings_directive = "settings_analysis_directive_2025-01-29-14-07.json"

In [7]:
# Read settings
with open(f'{ROOT_FOLDER_PATH}/{project_folder}/{analysis_id}/{settings_directive}', 'r') as file:
    settings = json.load(file)
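
The settings file needs to contain the keys that this notebook reads: `project_folder`, `filtered_folder`, and `analysis_id` under `metadata`, and `embeds_folder`, `year_column`, `n_topics`, and `min_topic_size` under `tmo`. A small check like the following can fail early when one is missing. The helper `missing_settings_keys` and the example values are hypothetical; only the key names are taken from this notebook.

```python
# Keys read elsewhere in this notebook, grouped by section.
REQUIRED_KEYS = {
    "metadata": ["project_folder", "filtered_folder", "analysis_id"],
    "tmo": ["embeds_folder", "year_column", "n_topics", "min_topic_size"],
}

def missing_settings_keys(settings):
    """Return dotted names of required keys that are absent from `settings`."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = settings.get(section, {})
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing

# Example values are placeholders, not the contents of the real settings file.
example_settings = {
    "metadata": {
        "project_folder": "my_project",
        "filtered_folder": "f01",
        "analysis_id": "a01",
    },
    "tmo": {"embeds_folder": "e01", "year_column": "PY", "n_topics": 0},
}
print(missing_settings_keys(example_settings))
```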

In [8]:
# Input dataset
dataset_file_path = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['metadata']['filtered_folder']}/dataset_raw_cleaned.csv"

In [9]:
# Function to save files
def save_as_csv(df, save_name_without_extension, with_index=False):
    "usage: `save_as_csv(dataframe, 'filename', with_index=False)`"
    path = f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv"
    df.to_csv(path, index=with_index)
    print("===\nSaved: ", path)

In [10]:
# Save an object to a pickle file
def save_object_as_pickle(obj, filename):
  """
  Saves an object as a pickle file.

  Args:
      obj: The object to be saved.
      filename: The filename of the pickle file.
  """
  with open(filename, "wb") as f:
    pickle.dump(obj, f)


In [11]:
# Load an object from a pickle file given its path
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)
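
A quick round trip shows how the two helpers fit together. The sketch redefines them so it runs on its own, and it writes to a temporary directory. The dictionary layout with an `embeddings` key mirrors how the embeddings object is indexed in this notebook; the values are made up.

```python
import os
import pickle
import tempfile

def save_object_as_pickle(obj, filename):
    """Serialize `obj` to `filename` with pickle."""
    with open(filename, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load and return the object stored in the pickle file at `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Made-up stand-in for the real embeddings object.
fake_embeddings = {"embeddings": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.pck")
    save_object_as_pickle(fake_embeddings, path)
    restored = load_pickle(path)

print(restored == fake_embeddings)
```

Pickle files can execute arbitrary code when loaded, so only load files that come from a trusted source.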


In [12]:
# Open the data file
df = pd.read_csv(dataset_file_path)
print(df.shape)
df.head()

(103649, 42)


Unnamed: 0,X_N,uuid,PT,AU,AF,TI,SO,LA,DT,DE,...,AR,DI,PG,WC,SC,OA,UT,Countries,IsoCountries,Institutions
0,1,fd2aff73-485d-4769-8497-996647a56213,J,"Yu, LL; Huo, SX; Wang, ZJ; Li, KY","Yu, Lingli; Huo, Shuxin; Wang, Zhengjiu; Li, Keyi",Hybrid attention-oriented experience replay fo...,NEUROCOMPUTING,English,Article,Deep reinforcement learning; Multi -robot; MAD...,...,,10.1016/j.neucom.2022.12.020,14,"Computer Science, Artificial Intelligence",Computer Science,,WOS:000904782300005,peoples r china,CHN,cent south univ; hunan xiangjiang artificial i...
1,2,b81562f9-e75d-4d41-9872-86a08744fcd5,J,"Zhang, JY; Lou, ZF; Fan, KC","Zhang, Jiyun; Lou, Zhifeng; Fan, Kuang-Chao",Accuracy improvement of a 3D passive laser tra...,ROBOTICS AND COMPUTER-INTEGRATED MANUFACTURING,English,Article,3D passive laser tracker; Error modeling; Erro...,...,102487.0,10.1016/j.rcim.2022.102487,13,"Computer Science, Interdisciplinary Applicatio...",Computer Science; Engineering; Robotics,,WOS:000911218400001,peoples r china,CHN,dalian univ technol
2,3,90718c9e-8150-4eb2-b2a6-443007642a46,J,"Zhang, T; Li, Y; Ning, CX; Zeng, B","Zhang, Ting; Li, Yang; Ning, Chuanxin; Zeng, Bo",Development and Adaptive Assistance Control of...,IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND EN...,English,Article,Robotic hip exoskeleton; assistive control; ga...,...,,10.1109/TASE.2022.3229396,11,Automation & Control Systems,Automation & Control Systems,,WOS:000903508600001,peoples r china,CHN,soochow univ
3,4,c26908c1-e32c-4d10-8700-dc058690972d,J,"Altyar, AE; El-Sayed, A; Abdeen, A; Piscopo, M...","Altyar, Ahmed E.; El-Sayed, Amr; Abdeen, Ahmed...",Future regenerative medicine developments and ...,BIOMEDICINE & PHARMACOTHERAPY,English,Review,Artificial Intelligence; Gene Therapy; Organ-O...,...,114131.0,10.1016/j.biopha.2022.114131,18,"Medicine, Research & Experimental; Pharmacolog...",Research & Experimental Medicine; Pharmacology...,gold,WOS:000904370900003,saudi arabia; egypt; italy; usa; poland,SAU; EGY; ITA; USA; POL,king abdulaziz univ; cairo univ; benha univ; u...
4,5,f34e67c6-1ac6-4030-a673-956e3be6386e,J,"Anantharanga, AT; Hashemi, MS; Sheidaei, A","Anantharanga, Abhijith Thoopul; Hashemi, Moham...",Linking properties to microstructure in liquid...,COMPUTATIONAL MATERIALS SCIENCE,English,Article,Liquid metal embedded elastomers; Multifunctio...,...,111983.0,10.1016/j.commatsci.2022.111983,12,"Materials Science, Multidisciplinary",Materials Science,Green Submitted,WOS:000911684200001,usa,USA,iowa state univ




---



## PART 2: Topic Model

In [13]:
# bibliometrics_folder
# project_folder
# project_name_suffix
# ROOT_FOLDER_PATH = f"drive/MyDrive/{bibliometrics_folder}"

#############################################################
# Embeddings folder
embeddings_folder_name = settings['tmo']['embeds_folder']

# Which column has the year of the documents?
my_year = settings['tmo']['year_column']

# Number of topics. Select the number of topics to extract.
# Choose 0, for automatic detection.
n_topics = settings['tmo']['n_topics']

# Minimum number of documents per topic
min_topic_size = settings['tmo']['min_topic_size']

In [14]:
# Get the embeddings back.
embeddings = load_pickle(f"{ROOT_FOLDER_PATH}/{project_folder}/{settings['metadata']['filtered_folder']}/{embeddings_folder_name}/embeddings.pck")
corpus =     pd.read_csv(f"{ROOT_FOLDER_PATH}/{project_folder}/{settings['metadata']['filtered_folder']}/{embeddings_folder_name}/corpus.csv").reset_index(drop=True)

In [15]:
# Extract the document texts as a list
documents = corpus.text.to_list()

In [47]:
# corpus['uuid'] = [uuid.uuid4() for _ in range(len(corpus.index))]
# corpus['X_N'] = [i for i in range(1, len(corpus.index)+1)]

In [16]:
len(documents)

103649

In [17]:
#len(embeddings) == len(documents)
len(embeddings['embeddings']) == len(documents)

True
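
The comparison above only displays `True` or `False`. If the run should stop on a mismatch, the check can be turned into an explicit error. This is a sketch with a hypothetical helper `check_alignment` and toy inputs.

```python
def check_alignment(documents, vectors):
    """Raise ValueError unless there is exactly one vector per document."""
    if len(documents) != len(vectors):
        raise ValueError(
            f"{len(documents)} documents but {len(vectors)} embedding vectors"
        )
    return True

toy_documents = ["first text", "second text"]
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_alignment(toy_documents, toy_vectors))
```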

In [18]:
from hdbscan import HDBSCAN
# Execute the topic model.
# I suggest changing the values marked with #<---
# The others are the default values and they'll work fine in most cases.
# This will take several minutes to finish.

# Initiate UMAP
umap_model = UMAP(n_neighbors=15,
                  n_components=5,
                  min_dist=0.0,
                  metric='cosine',
                  random_state=100)

if n_topics == 0:
  # Initiate topic model with HDBScan (Automatic topic selection)
  topic_model_params = HDBSCAN(min_cluster_size=min_topic_size,
                               metric='euclidean',
                               cluster_selection_method='eom',
                               prediction_data=True)
else:
  # Initiate topic model with K-means (Manual topic selection)
  topic_model_params = KMeans(n_clusters = n_topics)

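# Initiate BERTopic
# Note: despite its name, `hdbscan_model` receives whichever clustering model was chosen above (HDBSCAN or KMeans).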
topic_model = BERTopic(umap_model = umap_model,
                       hdbscan_model = topic_model_params,
                       min_topic_size=min_topic_size,
                       #nr_topics=15,          #<--- Footnote 1
                       n_gram_range=(1,3),
                       language='english',
                       calculate_probabilities=True,
                       verbose=True)



# Footnote 1: This controls the number of topics we want AFTER clustering.
# Add a hashtag at the beginning of the line (i.e., comment it out) to use the number of topics returned by the topic model.
# When using HDBSCAN, nr_topics is applied after orphan removal, and there is no guarantee that `nr_topics < HDBSCAN topics`.
# Thus, with HDBSCAN, `nr_topics` means N topics OR FEWER.
# When using KMeans, nr_topics can be used to further reduce the number of topics.
# We use the topics as returned by the topic model, so we do not need to activate it here.

In [19]:
# Compute topic model
#topics, probabilities = topic_model.fit_transform(documents, embeddings)
topics, probabilities = topic_model.fit_transform(documents, embeddings['embeddings'])

2025-01-29 14:10:08,884 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2025-01-29 14:11:16,607 - BERTopic - Dimensionality - Completed ✓
2025-01-29 14:11:16,609 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-01-29 14:11:18,599 - BERTopic - Cluster - Completed ✓
2025-01-29 14:11:18,611 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-01-29 14:12:40,829 - BERTopic - Representation - Completed ✓


In [20]:
# Get the list of topics
# Topic = the topic number, ordered from the largest topic.
#         "-1" is the generic (outlier) topic. Generic keywords are aggregated here.
# Count = Number of documents assigned to this topic
# Name = Topic number followed by the top 4 words of the topic
# Representation = The list of words representing this topic
# Representative_Docs = A sample of documents representative of this topic
tm_summary = topic_model.get_topic_info()
tm_summary

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,1086,0_renal_nephrectomy_partial nephrectomy_partial,"[renal, nephrectomy, partial nephrectomy, part...",[Meta-analysis of clinical outcomes of robot-a...
1,1,896,1_service_service robots_customers_customer,"[service, service robots, customers, customer,...",[Impact of the introduction of service robots ...
2,2,837,2_surgery_colorectal_laparoscopic_patients,"[surgery, colorectal, laparoscopic, patients, ...",[Robotic and laparoscopic colectomy: propensit...
3,3,777,3_students_educational_education_robotics,"[students, educational, education, robotics, s...",[Implementation of an extension project in edu...
4,4,723,4_older_care_older adults_adults,"[older, care, older adults, adults, social, de...",[Socially Assistive Robots in Aged Care: Expec...
...,...,...,...,...,...
352,352,36,352_haptic_virtual_interaction_device,"[haptic, virtual, interaction, device, force, ...",[Human Stabilization of Delay-Induced Instabil...
353,353,28,353_chewing_food_oral_chewing robot,"[chewing, food, oral, chewing robot, bolus, vi...",[A Chewing Study of Abuse-Deterrent Tablets Co...
354,354,17,354_pouring_liquid_robotic pouring_viscosity,"[pouring, liquid, robotic pouring, viscosity, ...",[PourIt!: Weakly-supervised Liquid Perception ...
355,355,16,355_rectus_rectus abdominis_breast_abdominis,"[rectus, rectus abdominis, breast, abdominis, ...",[Robotic Rectus Abdominis Harvest for Pelvic R...
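
Judging from the table above, the `Name` column appears to be the topic number followed by the first four entries of `Representation`, joined with underscores. The sketch below reproduces that pattern; `build_topic_name` is a hypothetical helper, not a BERTopic function.

```python
def build_topic_name(topic_id, representation, n_words=4):
    """Join the topic number and its first `n_words` terms with underscores."""
    return "_".join([str(topic_id)] + list(representation[:n_words]))

# First terms of topic 0 as they appear in this notebook's output.
terms = ["renal", "nephrectomy", "partial nephrectomy", "partial", "rapn"]
print(build_topic_name(0, terms))
```

Because multi-word terms keep their spaces, splitting a name on underscores recovers the individual terms.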


In [21]:
# Save the topic model assets
tm_folder_path = f'{ROOT_FOLDER_PATH}/{project_folder}/{settings["metadata"]["analysis_id"]}'

# os.makedirs also works when the path contains spaces, unlike an unquoted `!mkdir`
os.makedirs(tm_folder_path, exist_ok=True)

tm_summary.to_csv(f'{tm_folder_path}/topic_model_info.csv', index=False)

In [22]:
# Number of topics found
found_topics = max(tm_summary.Topic) + 1
found_topics

357

In [23]:
# Confirm all documents are assigned
sum(tm_summary.Count) == len(corpus)

True

In [24]:
# Get top 10 terms for a topic
topic_model.get_topic(0)

[('renal', 0.017836893116013112),
 ('nephrectomy', 0.016083365951251778),
 ('partial nephrectomy', 0.013922683036955706),
 ('partial', 0.012736514094043723),
 ('rapn', 0.00919308708723101),
 ('patients', 0.006473974763389658),
 ('tumor', 0.00645706264897906),
 ('robotassisted partial', 0.005963801311871922),
 ('robotassisted partial nephrectomy', 0.005847485249608978),
 ('kidney', 0.0055328875915084084)]

In [25]:
# Get the representative documents for a topic
topic_model.get_representative_docs(0)

['Meta-analysis of clinical outcomes of robot-assisted partial nephrectomy and classical open partial nephrectomy Background:Robotic-assisted partial nephrectomy (RAPN) has emerged as a promising alternative to classical partial nephrectomy (CPN).Aim:This study aimed to compare the outcomes of RAPN and CPN for treating localized renal tumors through a meta-analysis of available literature.Methods:Chinese databases, such as CNKI, Chinese Science and Technology Periodicals Database (VIP), and Wanfang Full-text Database, were searched using Chinese search terms, and all published articles on PubMed and Web of Science were searched using English search terms. Articles on Localized Renal Tumors were included. RevMan5.3 software was used for meta-analysis. The funnel plots were drawn using Stata software to assess publication bias.Outcomes:This study aimed to identify the differences between robotic-assisted partial nephrectomy and classic partial nephrectomy in patients with localized renal

In [26]:
# Others

# # Get the number of documents per topic (same as in the table above)
# topic_model.get_topic_freq(0)

# # Get the main keywords per topic
# topic_model.get_topics()

In [27]:
# Print the parameters used. (For reporting)
topic_model.get_params()

{'calculate_probabilities': True,
 'ctfidf_model': ClassTfidfTransformer(),
 'embedding_model': None,
 'hdbscan_model': KMeans(n_clusters=357),
 'language': 'english',
 'low_memory': False,
 'min_topic_size': 10,
 'n_gram_range': (1, 3),
 'nr_topics': None,
 'representation_model': None,
 'seed_topic_list': None,
 'top_n_words': 10,
 'umap_model': UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.0, n_components=5, n_jobs=1, random_state=100, tqdm_kwds={'bar_format': '{desc}: {percentage:3.0f}%| {bar} {n_fmt}/{total_fmt} [{elapsed}]', 'desc': 'Epochs completed', 'disable': True}),
 'vectorizer_model': CountVectorizer(ngram_range=(1, 3)),
 'verbose': True,
 'zeroshot_min_similarity': 0.7,
 'zeroshot_topic_list': None}

In [28]:
# Convert every parameter to a string so the dictionary can be written as JSON
tm_params = dict(topic_model.get_params())
for key, value in tm_params.items():
    tm_params[key] = str(value)
with open(f'{tm_folder_path}/topic_model_params.json', 'w') as f:
    json.dump(tm_params, f, ensure_ascii=False, indent=4)
    print('Done')

Done
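
The conversion to strings is needed because the parameters include model objects that `json` cannot serialize. A possible alternative, sketched here with a stand-in class instead of real BERTopic objects, is `json.dumps(..., default=str)`, which converts only the values that are not JSON-native.

```python
import json

class FakeModel:
    """Stand-in for an object such as a UMAP or KMeans instance."""
    def __repr__(self):
        return "FakeModel(n_clusters=3)"

params = {
    "verbose": True,
    "top_n_words": 10,
    "nr_topics": None,
    "hdbscan_model": FakeModel(),
}

# Approach used in this notebook: every value becomes a string.
all_strings = {key: str(value) for key, value in params.items()}

# Alternative: only non-serializable values go through str().
mixed = json.loads(json.dumps(params, default=str))

print(repr(all_strings["verbose"]), repr(mixed["verbose"]))
```

With the alternative, booleans, numbers, and `None` keep their types when the file is read back, which may be convenient if the parameters are reused programmatically.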


In [29]:
# Get the topic score for each paper and its assigned topic
topic_distr, _ = topic_model.approximate_distribution(documents, batch_size=1000)
distributions = [distr[topic] if topic != -1 else 0 for topic, distr in zip(topics, topic_distr)]

100%|██████████| 104/104 [06:42<00:00,  3.87s/it]
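
The list comprehension above picks, for each document, the score of the topic it was assigned to, and uses 0 for outliers (topic `-1`). A toy example with made-up numbers, using plain lists in place of the array returned by `approximate_distribution`:

```python
# Made-up scores: one row per document, one column per topic.
toy_topic_distr = [
    [0.7, 0.1, 0.2],
    [0.0, 0.9, 0.1],
    [0.3, 0.3, 0.4],
]
# Assigned topic per document; -1 marks an outlier.
toy_topics = [0, 1, -1]

toy_scores = [
    distr[topic] if topic != -1 else 0
    for topic, distr in zip(toy_topics, toy_topic_distr)
]
print(toy_scores)
```

Without the `-1` guard, `distr[-1]` would silently return the score of the last topic, which is why outliers are handled explicitly.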


In [30]:
topic_distr

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [31]:
# Document information, including the topic assignment
dataset_clustering_results = topic_model.get_document_info(documents, df = corpus, metadata={"Score": distributions})

# Standard format for report analysis
dataset_clustering_results = dataset_clustering_results.drop(columns=['text'])
dataset_clustering_results['X_E'] = dataset_clustering_results['Score']
dataset_clustering_results['X_C'] = dataset_clustering_results['Topic'] + 1
dataset_clustering_results['level0'] = dataset_clustering_results['Topic'] + 1
dataset_clustering_results['cl99'] = False
dataset_clustering_results['cl-99'] = False
dataset_clustering_results.head()

Unnamed: 0,UT,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document,Score,X_E,X_C,level0,cl99,cl-99
0,WOS:000904782300005,Hybrid attention-oriented experience replay fo...,135,135_multiagent_agents_reinforcement learning_r...,"[multiagent, agents, reinforcement learning, r...",[Transformer-Based Imitative Reinforcement Lea...,multiagent - agents - reinforcement learning -...,False,0.077777,0.077777,136,136,False,False
1,WOS:000911218400001,Accuracy improvement of a 3D passive laser tra...,121,121_calibration_error_accuracy_positioning,"[calibration, error, accuracy, positioning, co...",[An Off-Line Error Compensation Method for Abs...,calibration - error - accuracy - positioning -...,False,0.318425,0.318425,122,122,False,False
2,WOS:000903508600001,Development and Adaptive Assistance Control of...,52,52_walking_gait_ankle_exoskeleton,"[walking, gait, ankle, exoskeleton, hip, assis...",[Identification of Hip and Knee Joint Impedanc...,walking - gait - ankle - exoskeleton - hip - a...,True,0.428906,0.428906,53,53,False,False
3,WOS:000904370900003,Future regenerative medicine developments and ...,167,167_printing_bioprinting_3d_3d printing,"[printing, bioprinting, 3d, 3d printing, mater...",[Revolutionizing manufacturing: A review of 4D...,printing - bioprinting - 3d - 3d printing - ma...,False,0.0,0.0,168,168,False,False
4,WOS:000911684200001,Linking properties to microstructure in liquid...,68,68_origami_metamaterials_structures_mechanical,"[origami, metamaterials, structures, mechanica...",[Rigid-foldable cylindrical origami with tunab...,origami - metamaterials - structures - mechani...,False,0.0,0.0,69,69,False,False
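
`X_C` and `level0` shift BERTopic's 0-based topic numbers to 1-based cluster numbers. One consequence worth noting: if the model produces an outlier topic `-1` (possible with HDBSCAN), it maps to cluster 0. A toy illustration with hypothetical rows:

```python
toy_rows = [{"Topic": 135}, {"Topic": 0}, {"Topic": -1}]

for row in toy_rows:
    # Same shift as in the cell above: topic n becomes cluster n + 1.
    row["X_C"] = row["Topic"] + 1
    row["level0"] = row["Topic"] + 1

print([row["X_C"] for row in toy_rows])
```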


In [32]:
# Save the dataframe
dataset_clustering_results.to_csv(f'{tm_folder_path}/dataset_minimal.csv', index=False)

In [33]:
# Save the topic model
topic_model.save(f'{tm_folder_path}/topic_model_object.pck')





---

