<a href="https://colab.research.google.com/github/cristianmejia00/clustering/blob/main/Topic_Models_using_BERTopic_EMBEDS_20241031.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with BERTopic -- EMBEDDINGS

`Topic Models` are methods to automatically organize a corpus of text into topics.

Topic Model process:
1. Data preparation
2. Transform text to numeric vectors `<-- THIS CODE!`
3. Dimensionality reduction
4. Clustering
5. Topic analysis
6. Cluster assignment


This notebook uses the `BERTopic` library, a one-stop solution for topic modeling that includes handy functions for plotting and analysis. However, BERTopic does not have a function to extract the X and Y coordinates from UMAP. If you need the coordinates, use the notebook `Topic_Models_using_Transformers` instead. Otherwise, when a quick analysis is needed, this notebook may be the better choice.

This notebook is also required by the heatmap code included in this folder.

`BERTopic` is a Python library that handles steps 2 to 6.
BERTopic models use the transformer architecture to generate the embeddings (i.e. the vector, or numeric, representation of the text), currently a state-of-the-art method for vectorization.

This notebook shows how to use it.

---
Reading:
- [Topic Modeling with Deep Learning Using Python BERTopic](https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504)
- [Advanced Topic Modeling with BERTopic](https://www.pinecone.io/learn/bertopic/)


# Requirements

## Packages installation and initialization

In [1]:
# This is the first code cell. Execute to give access to Google Drive.
import sys

# Test if we are running on Google Colab
if 'google.colab' in sys.modules:
    print("Running on Colab")
    # Install libraries
    !pip install bertopic[visualization]

    ## FOLDER
    # mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')

    # Correct path to the Google Drive mounted in Colab
    GDRIVE_PATH = "drive/MyDrive"

    ## ENVIRONMENT
    # pip installs go here.
elif sys.platform == 'win32':
    print("Running on Windows")
    # We are in Local.

    ## FOLDER
    # Correct path to the Google Drive folder in Local
    GDRIVE_PATH = "G:/My Drive"

elif sys.platform == 'linux':
    print("Running on Linux")
    # We are in WSL - VSCode
    GDRIVE_PATH = '/mnt/g/My Drive'

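    # Load environment
    # virtualenv env-titech
    # NOTE: `!source` runs in a subshell, so it does not change the environment
    # of the running kernel; select the env-titech kernel in VSCode instead.
    !source ./env-titech/bin/activate
    #!pip install -r requirements_env_titech.txt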

elif sys.platform == 'darwin':
    print("Running on Mac OS X")
    # We are in Mac Os
    GDRIVE_PATH = "/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive"
    # NOTE: `!source` runs in a subshell and does not affect the running kernel.
    !source ./env-tm/bin/activate

else:
    print("Couldn't mount drive. Check your system and path")

Running on Mac OS X
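
The platform checks in the cell above can also be written as a small lookup that fails fast on an unrecognized system instead of leaving `GDRIVE_PATH` undefined. This is only a sketch: `resolve_gdrive_path` and `GDRIVE_PATHS` are hypothetical names, not part of the notebook, and the paths are copied from the cell above (the macOS entry is left out for brevity).

```python
import sys

# Paths copied from the setup cell; adjust them to your own machine.
GDRIVE_PATHS = {
    "colab": "drive/MyDrive",
    "win32": "G:/My Drive",
    "linux": "/mnt/g/My Drive",
}


def resolve_gdrive_path(platform, on_colab=False):
    """Hypothetical helper: return the Google Drive root for a platform name."""
    key = "colab" if on_colab else platform
    if key not in GDRIVE_PATHS:
        raise RuntimeError(f"No Google Drive path configured for {key!r}")
    return GDRIVE_PATHS[key]


# Colab is detected the same way as in the setup cell.
on_colab = "google.colab" in sys.modules
try:
    print(resolve_gdrive_path(sys.platform, on_colab))
except RuntimeError as err:
    print(err)
```

Raising an error early may be easier to debug than a `NameError` several cells later.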


In [None]:
#!pip install --upgrade pip
#!pip install --upgrade numpy==1.26

In [1]:
import pandas as pd
import os
import json
import pickle
from sentence_transformers import SentenceTransformer



# 🔴 Input files and options

Go to your Google Drive and create a folder in the root directory. All related data will be saved in that directory.
Upload the dataset into that folder.
- The dataset should be a `.csv` file.
- Every row in the dataset is a document.
- It can have any kind of columns, but some columns must contain the text we want to analyze. For example, a dataset of academic articles may contain a "Title" and/or "Abstract" column.
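
Before running the rest of the notebook, it can help to confirm that the uploaded file has the columns you intend to use. The sketch below is one possible check, using only the standard library. `missing_columns` is a hypothetical helper, and the column names `TI`, `AB`, and `UT` are assumed here as the title, abstract, and document id.

```python
import csv
import io

# Assumed required columns: title, abstract, and a document id.
REQUIRED_COLUMNS = ["TI", "AB", "UT"]


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# A tiny in-memory CSV stands in for the uploaded dataset.
sample = io.StringIO("UT,TI,PY\nWOS:1,A title,2024\n")
print(missing_columns(sample, REQUIRED_COLUMNS))  # ['AB']
```

With a real file, pass an open file handle instead of the `io.StringIO` object.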

In [74]:
# The bibliometrics folder
ROOT_FOLDER_PATH = f"{GDRIVE_PATH}/Bibliometrics_Drive"

# Change to the name of the folder where the dataset is uploaded inside the above folder
project_folder_name = "Q325_ai_libsci"

# DATASET settings
settings_directive = "settings_dataset_directive_2025-01-30-23-40.json"

In [75]:
# Read settings
with open(f'{ROOT_FOLDER_PATH}/{project_folder_name}/{settings_directive}', 'r') as file:
    settings = json.load(file)
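
The rest of the notebook reads a handful of keys from `settings`. The sketch below shows the shape those reads imply. All values are placeholders chosen for illustration (the real file may hold more keys), `"e01"` is a hypothetical label, and the model name is an assumption.

```python
import json

# Illustrative settings document; only the keys read by this notebook are shown.
example_settings = json.loads("""
{
  "metadata": {"project_folder": "Q325_ai_libsci"},
  "network": {"from_filtered_dataset": "f01"},
  "embeds": {
    "from_filtered_dataset": "f01",
    "e01": {
      "text_columns": ["TI", "AB"],
      "id_column": "UT",
      "to_lowercase": false,
      "remove_stopwords": false,
      "transformer_model": "all-MiniLM-L6-v2"
    }
  }
}
""")

embed_options = example_settings["embeds"]["e01"]
print(embed_options["text_columns"])       # columns merged into the input text
print(embed_options["transformer_model"])  # model name passed to SentenceTransformer
```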

## Aux Functions

In [76]:
def find_e_keys(dictionary):
    # List comprehension to find keys starting with 'e'
    e_keys = [key for key in dictionary if str(key).lower().startswith('e')]
    return e_keys
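
A short usage example may clarify what `find_e_keys` returns. The dictionary below is hypothetical and the function is repeated so the snippet runs on its own.

```python
def find_e_keys(dictionary):
    """Same logic as the cell above: keys whose name starts with 'e' (case-insensitive)."""
    return [key for key in dictionary if str(key).lower().startswith("e")]


# Hypothetical embeds block: one shared option plus two embedding configurations.
embeds = {"from_filtered_dataset": "f01", "e01": {}, "E02": {}}
print(find_e_keys(embeds))  # ['e01', 'E02']
```

Note that any key beginning with "e" matches, so an unrelated key such as `encoding` would also be returned.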

In [77]:
# Function to save files
def save_as_csv(df, save_name_without_extension, with_index):
    "usage: `save_as_csv(dataframe, 'filename', with_index=False)`"
    df.to_csv(f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv", index=with_index)
    print("===\nSaved: ", f"{ROOT_FOLDER_PATH}/{save_name_without_extension}.csv")

In [78]:
# Save an object to a pickle file
def save_object_as_pickle(obj, filename):
  """
  Saves an object as a pickle file.

  Args:
      obj: The object to be saved.
      filename: The filename of the pickle file.
  """
  with open(filename, "wb") as f:
    pickle.dump(obj, f)


In [79]:
# Load a pickled object from a given path
def load_pickle(path):
    with open(path, 'rb') as f:
        return pickle.load(f)
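
The two pickle helpers are meant to be used as a pair. The sketch below runs a round trip in a temporary folder. The helpers are repeated so the snippet is self-contained, and the payload is made-up sample data.

```python
import os
import pickle
import tempfile


def save_object_as_pickle(obj, filename):
    """Write `obj` to `filename` in pickle format."""
    with open(filename, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read back an object written by `save_object_as_pickle`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up payload standing in for real embeddings.
payload = {"embeddings": [[0.1, 0.2], [0.3, 0.4]], "embeddings_ids": ["doc-1", "doc-2"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pck")
    save_object_as_pickle(payload, path)
    restored = load_pickle(path)

print(restored == payload)  # True
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code.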


In [80]:
# Input dataset
dataset_file_name = f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['network']['from_filtered_dataset']}/dataset_raw_cleaned.csv"

# The text columns used for the topic model are read from the settings file
# (`text_columns`). For example, in the case of academic articles, we are using the title "TI" and the abstract "AB".
# Those columns will be merged and used as input to the topic model
e_label = find_e_keys(settings['embeds'])[0]

es = settings['embeds'][e_label]

In [81]:
dataset_file_name

'/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q325_ai_libsci/f01/dataset_raw_cleaned.csv'

In [82]:
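# Open the data file
# NOTE: names such as "SpÃ­nola" in the preview below suggest the file is UTF-8
# encoded; if so, encoding='utf-8' would read accented characters correctly.
df = pd.read_csv(dataset_file_name, encoding='latin-1')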
#df = pd.read_csv('/Users/cristian/Library/CloudStorage/GoogleDrive-cristianmejia00@gmail.com/My Drive/Bibliometrics_Drive/Q318_MKT_poverty/f01/dataset_raw_cleaned.txt', delimiter = '\t')
print(df.shape)
df.head()

(2795, 42)


Unnamed: 0,X_N,uuid,PT,AU,AF,TI,SO,LA,DT,DE,...,AR,DI,PG,WC,SC,OA,UT,Countries,IsoCountries,Institutions
0,1,06ffaeca-2840-45f5-97d7-3adb96371bf8,J,"de Leon, LCR; Flores, LV; Alomo, ARL","de Leon, Lady Catherine R.; Flores, Lejempf V....",Artificial Intelligence and Filipino Academic ...,JOURNAL OF THE AUSTRALIAN LIBRARY AND INFORMAT...,English,Article,Filipino librarians; academic librarians; arti...,...,,10.1080/24750158.2024.2305993,18,Information Science & Library Science,Information Science & Library Science,,WOS:001153914800001,philippines,PHL,univ santo tomas
1,2,151cecd6-3d80-4ffa-835f-b894ff2a97b1,J,"Xu, M; Liu, DA; Zhang, Y","Xu, Min; Liu, DongAo; Zhang, Yan",Design of Interactive Teaching System of Physi...,JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT,English,Article,Artificial intelligence; physical training; in...,...,2240021.0,10.1142/S0219649222400214,16,Information Science & Library Science,Information Science & Library Science,,WOS:000821701500002,peoples r china,CHN,shanghai univ finance & econ shanghai
2,3,70ea7e13-c197-466b-a5ad-80d350d2d431,J,"Filson, CK; Atuase, D","Filson, Christopher K.; Atuase, Diana",Artificial intelligence and academic integrity...,INFORMATION DEVELOPMENT,English,Article; Early Access,artificial intelligence; academic integrity; i...,...,,10.1177/02666669241284230,13,Information Science & Library Science,Information Science & Library Science,,WOS:001327261700001,ghana,GHA,univ cape coast
3,4,bca01a9b-ef10-49ad-bf30-c1d0ae79c348,J,"Walter, L; Denter, NM; Kebel, J","Walter, Lothar; Denter, Nils M.; Kebel, Jan",A review on digitalization trends in patent in...,WORLD PATENT INFORMATION,English,Review,Digitalization; Patent search and analysis; Cl...,...,102107.0,10.1016/j.wpi.2022.102107,11,Information Science & Library Science,Information Science & Library Science,,WOS:000788120900001,germany,DEU,univ bremen
4,5,8fc18d16-8209-4dc4-a16a-05a790ac203a,J,"Borges, AFS; Laurindo, FJB; SpÃ­nola, MM; GonÃ...","Borges, Aline F. S.; Laurindo, Fernando J. B.;...",The strategic use of artificial intelligence i...,INTERNATIONAL JOURNAL OF INFORMATION MANAGEMENT,English,Review,Artificial intelligence; Deep learning; Machin...,...,102225.0,10.1016/j.ijinfomgt.2020.102225,16,Information Science & Library Science,Information Science & Library Science,,WOS:000618806300006,brazil,BRA,univ sao paulo; univ paulista; fundacao educ i...


# Data Preparation

This step may include multiple sub-steps.
The following is a list of cleaning steps. Those marked with ✅ are implemented in this notebook.

- Ensure we use text data ✅
- Remove documents with no data ✅
- Convert text to lowercase ✅ (optional, controlled by the settings file)
- Remove documents that are too short or too long
- Unify or apply transformations to the vocabulary using a dictionary (e.g. convert "AI" to "Artificial Intelligence")
- Remove stopwords
  - English stopwords ✅ (optional, controlled by the settings file)
  - Custom stopwords (words we do not want to see in the results)
  - Field-specific stopwords (frequent, obvious words for a given dataset)
- Remove numbers ✅
- Remove symbols and punctuation ✅
- Stemming or lemmatization
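
As an illustration of one step that is not implemented here, the dictionary-based vocabulary unification could look like the sketch below. `VOCABULARY` and `unify_vocabulary` are hypothetical names, and the mapping is only an example.

```python
import re

# Hypothetical mapping; extend it with the terms relevant to your dataset.
VOCABULARY = {"AI": "Artificial Intelligence", "ML": "Machine Learning"}

# One pattern that matches any key as a whole word, longest keys first.
_keys = sorted(VOCABULARY, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")


def unify_vocabulary(text):
    """Replace every whole-word occurrence of a key with its expanded form."""
    return _pattern.sub(lambda match: VOCABULARY[match.group(1)], text)


print(unify_vocabulary("AI and ML in libraries; AIDS research"))
# Artificial Intelligence and Machine Learning in libraries; AIDS research
```

Because the match is case-sensitive, a step like this would need to run before any lowercasing.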

In [83]:
# Ensure all data in these columns is text (missing values become empty strings)
text_columns = es['text_columns']
df[text_columns] = df[text_columns].fillna("").astype(str)

# Create a new column named `text` which is the concatenation of all the columns listed in `text_columns`
df['text'] = df[text_columns].apply(" ".join, axis=1)
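
One detail is worth knowing about casting columns to text: a missing value that has not been filled beforehand becomes the literal string `nan`, which then ends up inside the merged text. The plain-Python sketch below mimics that behavior on a single hypothetical row (no pandas involved) and shows an alternative that skips missing fields. `is_missing` is a hypothetical helper.

```python
import math

# Hypothetical document with a missing title.
row = {"TI": float("nan"), "AB": "Topic models organize a corpus into topics."}
text_columns = ["TI", "AB"]

# Casting every field to str keeps the placeholder "nan" in the text.
naive = " ".join(str(row[col]) for col in text_columns)
print(naive)  # nan Topic models organize a corpus into topics.


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Skipping missing fields avoids the placeholder altogether.
clean = " ".join(str(row[col]) for col in text_columns if not is_missing(row[col]))
print(clean)  # Topic models organize a corpus into topics.
```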

In [84]:
import re

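# Remove numbers and symbols from the text column (keep only ASCII letters and whitespace)
df['text'] = df['text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', str(x)))
# Drop "nan" placeholders left by missing values; match whole words only so that
# words such as "Hunan" are not clipped
df['text'] = df['text'].str.replace(r'\bnan\b\s*', '', regex=True).str.strip()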
print(df['text'][1])
print(df['text'][1][0:2])
len(df['text'][1])




Design of Interactive Teaching System of Physical Training Based on Artificial Intelligence Nowadays with the continuous change and innovation of teaching methods in Colleges and universities the curriculum system of students is also constantly enriched and developed Therefore peoples requirements for teaching management and teaching system are also improving Physical education curriculum is usually based on outdoor teaching and some schools have not established a complete teaching system Therefore the interactive teaching system of physical training based on artificial intelligence is designed First of all through the construction of the interactive teaching system of the total control circuit determine the corresponding circuit address decoding improve the audio control circuit associated video connection interactive drive three parts the intelligent sports training interactive system hardware design Then through the creation of intelligent training function module the design of trai

1423
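
The effect of the character filter is easier to see on a short string. The sketch below applies the same pattern to two made-up inputs. `keep_letters` is a hypothetical wrapper name.

```python
import re


def keep_letters(text):
    """Same pattern as the cell above: drop everything except ASCII letters and whitespace."""
    return re.sub(r"[^a-zA-Z\s]", "", text)


# Hyphens, digits, and apostrophes are deleted, not replaced by spaces.
print(keep_letters("State-of-the-art AI in 2024: people's view"))

# Accented letters are outside a-z/A-Z, so they are deleted as well.
print(keep_letters("Spínola"))
```

The first call prints `Stateoftheart AI in  peoples view` (with a double space where the year was) and the second prints `Spnola`. If hyphenated terms or non-English names matter for your analysis, replacing the removed characters with a space or widening the pattern may be preferable.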

In [85]:
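# Convert to lowercase and remove English stopwords from the `text` column
if es['to_lowercase']:
  df['text'] = df['text'].str.lower()
if es['remove_stopwords']:
  # Assumption: `remove_stopwords` is never imported in this notebook; gensim's
  # helper of the same name is used here (requires `pip install gensim`).
  from gensim.parsing.preprocessing import remove_stopwords
  df['text'] = df['text'].apply(remove_stopwords)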
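
What stopword removal does to a document can be illustrated without any library. In the sketch below, `STOPWORDS` and `drop_stopwords` are hypothetical stand-ins, not the `remove_stopwords` function called in the cell above, and the word list is deliberately tiny.

```python
# A deliberately small, hypothetical stopword list; real lists contain hundreds of words.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def drop_stopwords(text):
    """Remove every whitespace-separated token that is a stopword (case-insensitive)."""
    return " ".join(word for word in text.split() if word.lower() not in STOPWORDS)


print(drop_stopwords("Design of interactive teaching system based on artificial intelligence"))
# Design interactive teaching system based artificial intelligence
```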

In [86]:
# Copy a backup of the object
df_full = df.copy()

In [87]:
# Retain only the data needed for the topic model
df = df[["text", es['id_column']]]
df = df.dropna()
df.head()

Unnamed: 0,text,UT
0,Artificial Intelligence and Filipino Academic ...,WOS:001153914800001
1,Design of Interactive Teaching System of Physi...,WOS:000821701500002
2,Artificial intelligence and academic integrity...,WOS:001327261700001
3,A review on digitalization trends in patent in...,WOS:000788120900001
4,The strategic use of artificial intelligence i...,WOS:000618806300006




---



## PART 1: Embeddings

`BERTopic()` is the main function of the library. This notebook computes only the embeddings, using `SentenceTransformer`.
- [Official documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)
- [Quick overview](https://maartengr.github.io/BERTopic/index.html)
- [Explanation of parameters](https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.__init__)

In [88]:
sentence_model = SentenceTransformer(es['transformer_model'])

In [89]:
# When the tqdm progress bar does not show and throws an error
#!jupyter nbextension enable --py widgetsnbextension

In [90]:
# Compute and save embeddings.
df_new = df.reset_index(drop=True).copy()
docs = df_new.text.to_list()
embeddings = sentence_model.encode(docs, show_progress_bar=True)

Batches: 100%|██████████| 88/88 [00:21<00:00,  4.14it/s]
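
The number of batches in the progress bar follows from the corpus size. As a quick check, assuming no rows were dropped after loading and that `encode` ran with its default `batch_size` of 32:

```python
import math

n_documents = 2795  # rows reported by df.shape earlier in the notebook
batch_size = 32     # default `batch_size` of SentenceTransformer.encode

print(math.ceil(n_documents / batch_size))  # 88
```

This matches the 88 batches shown above. Passing a larger `batch_size` to `encode` reduces the number of batches and may speed things up if memory allows.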


In [91]:
# Create folder
embeds_folder_path = os.path.abspath(f"{ROOT_FOLDER_PATH}/{settings['metadata']['project_folder']}/{settings['embeds']['from_filtered_dataset']}/{e_label}")

os.makedirs(embeds_folder_path, exist_ok=True)

In [92]:
# Save files
df_new.to_csv(f'{embeds_folder_path}/corpus.csv', index=False)
# Use the id column named in the settings rather than a hard-coded `UT`
save_object_as_pickle({'embeddings': embeddings,
                       'embeddings_ids': df_new[es['id_column']]}, f'{embeds_folder_path}/embeddings.pck')


In [93]:
with open(f'{embeds_folder_path}/embeds_settings.json', 'w') as file:
    json.dump(settings['embeds'], file, indent=4)
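
The saved pickle keeps the embeddings and the document ids in the same row order. A later step may want to look up a vector by document id. The sketch below shows one way to do that, with plain lists standing in for the real array and ids. `index_embeddings` is a hypothetical helper and the ids are made up.

```python
def index_embeddings(payload):
    """Hypothetical helper: map each document id to its embedding vector."""
    vectors = payload["embeddings"]
    ids = list(payload["embeddings_ids"])
    if len(vectors) != len(ids):
        raise ValueError(f"{len(vectors)} embeddings but {len(ids)} ids")
    return dict(zip(ids, vectors))


# Made-up payload with the same two keys as the saved pickle.
payload = {
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "embeddings_ids": ["WOS:A", "WOS:B"],
}
by_id = index_embeddings(payload)
print(by_id["WOS:B"])  # [0.3, 0.4]
```

The length check guards against pairing vectors with the wrong documents if the two were ever saved from different dataframes.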

