# Topic Modeling with BERTopic 
Part of BA Thesis by Enis Settouf

BERTopic:
Topic modeling technique that leverages transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 
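The class-based TF-IDF mentioned above can be illustrated with a minimal sketch. It assumes the formulation from the BERTopic paper: all documents of a cluster are joined into one class document, and each term t in class c is weighted as tf(t, c) * log(1 + A / f(t)), where A is the average number of words per class and f(t) is the frequency of t across all classes. The toy classes and the function name `c_tf_idf` are illustrative and not part of this notebook; the library itself may normalize the frequencies differently.

```python
import math
from collections import Counter


def c_tf_idf(classes):
    """Return {class_name: {term: score}} for pre-tokenized class documents."""
    term_counts = {name: Counter(tokens) for name, tokens in classes.items()}

    # f(t): frequency of each term across all classes
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)

    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(classes)

    return {
        name: {term: tf * math.log(1 + avg_words / total_counts[term])
               for term, tf in counts.items()}
        for name, counts in term_counts.items()
    }


# Hypothetical toy classes, each already tokenized
toy_classes = {
    "traffic": "auto unfall auto fahrer gericht".split(),
    "family": "kind eltern kind unterhalt gericht".split(),
}
scores = c_tf_idf(toy_classes)
top_term = max(scores["traffic"], key=scores["traffic"].get)
print(top_term)
```

Terms that occur in every class, such as "gericht" here, receive a lower weight than terms that are specific to one class.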

<br>

[Source:](https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing) Basic instructions used for this notebook, by BERTopic creator Maarten Grootendorst

Please activate a GPU and set the runtime shape to high RAM (Runtime > Change runtime type).

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
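Whether a GPU is actually attached can also be checked from Python before the long installation starts. The following is a minimal sketch that assumes the `nvidia-smi` command-line tool is on the path whenever a GPU runtime is active; the helper name `gpu_available` is hypothetical.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "nvidia-smi -L" prints one line per detected GPU
        result = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                                text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU available:", gpu_available())
```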

# 1. Preparing Runtime

## 1.1 Install libraries

Installing RapidsAI Libraries. cuML enables GPU usage for ML-Algorithms


RapidsAI: https://rapids.ai/

Github repository for installing: https://github.com/rapidsai/rapidsai-csp-utils.git

BERTopic Github issue on cuML: https://github.com/MaartenGr/BERTopic/issues/495

In [None]:
!pip install bertopic
!nvidia-smi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Downloading bertopic-0.11.0-py2.py3-none-any.whl (76 kB)
Collecting umap-learn>=0.5.0
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
Collecting pyyaml<6.0
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting hdbscan>=0.8.28
  Downloading hdbscan-0.8.28.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting tra

In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 300, done.
remote: Counting objects: 100% (129/129), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 300 (delta 74), reused 99 (delta 55), pack-reused 171
Receiving objects: 100% (300/300), 87.58 KiB | 14.60 MiB/s, done.
Resolving deltas: 100% (136/136), done.
Traceback (most recent call last):
  File "rapidsai-csp-utils/colab/env-check.py", line 1, in <module>
    import pynvml
ModuleNotFoundError: No module named 'pynvml'


In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Hit:1 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:4 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:5 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Hit:6 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Get:7 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:8 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Get:9 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic InRelease [20.8 kB]
Get:10 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:11 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [3,390 kB]
Ign:12 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/

In [None]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:24
🔁 Restarting kernel...


In [None]:
!pip install -q condacolab
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

AssertionError: ignored

Run a second time:

In [None]:
!pip install -q condacolab
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

Found existing installation: cffi 1.15.1
Uninstalling cffi-1.15.1:
  Successfully uninstalled cffi-1.15.1
Found existing installation: cryptography 37.0.4
Uninstalling cryptography-37.0.4:
  Successfully uninstalled cryptography-37.0.4
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cffi==1.15.0
  Downloading cffi-1.15.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (427 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 427.1/427.1 kB 19.0 MB/s eta 0:00:00
Installing collected packages: cffi
Successfully installed cffi-1.15.0
Installing RAPIDS Stable 21.12
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Collecting package metadata (current_repodata.json): ...working... done
failed with initial frozen solve. Retrying with flexible solve.
failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
done



In [None]:
%%capture
!pip uninstall -y cffi
!pip install cffi
!pip install nltk
!pip install pandas
import os
os._exit(00)

The kernel restarts at this point. Continue manually with the following cells.

Import BERTopic and cuML:

In [None]:
from bertopic import BERTopic
import cuml

## 1.2 Mount Google Drive

Mounting Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 2. Topic Modeling


## 2.1 Import Dataset

In [None]:
import pandas as pd
model_ending = "1p"

df_legal_sents = pd.read_csv('/content/drive/MyDrive/ba-thesis/data-pre-processing/data_final/data_sentences_train_' + model_ending + '_topicevaluation.csv', sep='_adelimiter528_', encoding='utf-8', engine = 'python')
df_legal_sents = df_legal_sents.rename(columns={df_legal_sents.columns[0]: "id", df_legal_sents.columns[1]: "text", df_legal_sents.columns[2]: "unknown"})
df_legal_sents = df_legal_sents.drop(columns=["unknown"])

# Only select sentences longer than 50 characters; shorter ones turned out to be irrelevant and disproportionately affected the topic assignment
df_legal_sents = df_legal_sents.loc[df_legal_sents['text'].str.len() > 50]

data = df_legal_sents["text"].tolist()
print(len(data))

160899
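What the import cell does can be shown on toy data without pandas. The multi-character delimiter is the reason `engine='python'` is needed in `read_csv`, since pandas treats separators longer than one character as regular expressions. The two raw lines below are hypothetical and only mimic the file layout assumed here (id, text, unused third column).

```python
DELIMITER = "_adelimiter528_"
MIN_CHARS = 50

# Hypothetical raw lines in the shape "id<delimiter>text<delimiter>extra"
raw_lines = [
    "0_adelimiter528_Tenor_adelimiter528_x",
    "0_adelimiter528_Die Klägerin trägt die Kosten des Berufungsverfahrens in voller Höhe._adelimiter528_x",
]

rows = []
for line in raw_lines:
    doc_id, text, _unused = line.split(DELIMITER)
    # Keep only sentences longer than MIN_CHARS characters
    if len(text) > MIN_CHARS:
        rows.append((int(doc_id), text))

print(len(rows))
```

Only the second line survives the filter, because "Tenor" is far below the 50-character threshold.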


## 2.2 Training

We start by instantiating BERTopic. We set `language` to `german` since our documents are in German. If you would like to use a multilingual model, please use `language="multilingual"` instead. 

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly on large datasets (>100_000 documents). It is advised to turn this off if you want to speed up the model. 


### Either load model:

In [None]:
#topic_model = BERTopic.load('/content/drive/MyDrive/ba-thesis/topic-modeling/models/bertopic-xlm-distilroberta-default-test-rapids20220828-0006')

### Or create a new model:

Download stopwords:

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
import time

start_time = time.time()

Instantiate CountVectorizer and set stopwords:

In [None]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=german_stop_words)
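The effect of `ngram_range=(1, 2)` combined with a stop word list can be sketched in plain Python. This sketch assumes that stop words are dropped before the n-grams are built, which is how scikit-learn's `CountVectorizer` is understood to behave, and that tokens consist of at least two word characters. The function name `word_ngrams` and the three-word stop list are illustrative only.

```python
import re


def word_ngrams(text, stop_words, ngram_range=(1, 2)):
    """Lowercase, tokenize, drop stop words, then build word n-grams."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    tokens = [tok for tok in tokens if tok not in stop_words]
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


# Hypothetical subset of a German stop word list
toy_stop_words = {"die", "des", "der"}
print(word_ngrams("Die Kosten des Berufungsverfahrens", toy_stop_words))
```

Because "des" is removed first, the two remaining words become neighbours and form a bigram, which is why topic words such as "kosten berufungsverfahrens" can appear in the results.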

Create custom UMAP and HDBSCAN models with the RapidsAI cuML library, which runs both algorithms on the GPU:

In [None]:
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

# Create instances of GPU-accelerated UMAP and HDBSCAN
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True)
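If the notebook is run in an environment where the RAPIDS installation failed, the import of `cuml` raises an error. A minimal sketch for detecting this up front is given here; the helper name `pick_backend` is hypothetical, and the CPU fallback assumes that the `umap-learn` and `hdbscan` packages are installed.

```python
from importlib.util import find_spec


def pick_backend():
    """Return 'cuml' when the RAPIDS cuML package is importable, else 'cpu'."""
    return "cuml" if find_spec("cuml") is not None else "cpu"


backend = pick_backend()
if backend == "cuml":
    print("Using GPU-accelerated UMAP and HDBSCAN from cuML")
else:
    # Assumption: the CPU packages umap-learn and hdbscan are available instead
    print("cuML not found; fall back to umap.UMAP and hdbscan.HDBSCAN")
```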

### Using a custom embedding model:
<br> 
To use it, uncomment the following two cells as well as the `embedding_model` parameter in the first line of the training cell.

Define embedding model name:

In [None]:
# model_name = 'esettouf/cross-en-de-roberta-sentence-transformer-openlegal'
# The tokenizer was not retrained and saved with the model, so the tokenizer of the base model is used
# tokenizer_name = "T-Systems-onsite/cross-en-de-roberta-sentence-transformer"

Create huggingface pipeline:

In [None]:
# from transformers.pipelines import pipeline
# hf_model = pipeline("feature-extraction", model=model_name,
#                     tokenizer=tokenizer_name)

Some weights of the model checkpoint at esettouf/cross-en-de-roberta-sentence-transformer-openlegal-1p-20220902-2004 were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaModel were not initialized from the model checkpoint at esettouf/cross-en-de-roberta-sentence-transformer-openlegal-1p-20220902-2004 and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You

In [None]:
topic_model = BERTopic(#embedding_model=hf_model, # custom embedding model
                       vectorizer_model=vectorizer_model,
                       top_n_words=5,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       language="german",
                       calculate_probabilities=True, # creates the document-topic probability matrix; set to False to reduce runtime
                       verbose=True,
                       low_memory=True)

topics, probs = topic_model.fit_transform(data) # , embeddings) # -> if embeddings are created manually beforehand

print("--- Process ended: %s minutes ---" % round((time.time() - start_time) / 60, 2))

2022-09-03 10:12:11,777 - BERTopic - Transformed documents to Embeddings
2022-09-03 10:12:16,107 - BERTopic - Reduced dimensionality
2022-09-03 10:12:23,130 - BERTopic - Clustered reduced embeddings


--- Process ended: 2.8 minutes ---


## 2.3 Map Documents and Topics and save Topic Results

Get Topic Representations:

In [None]:
topic_representations = topic_model.get_topics()
nr_topics = str(len(topic_representations))
print("Amount of individual Topics: " + nr_topics)
print("Amount of Data Elements: " + str(len(data)))

Amount of individual Topics: 682
Amount of Data Elements: 160899


Map Topic Representations with Sentences and Documents:

In [None]:
topic_representations = topic_model.get_topics()
dict_list = []

for sent_id in range(len(topics)):
    topic_id = topics[sent_id]
    topic_words = topic_representations[topic_id]
    doc_id = df_legal_sents.iloc[sent_id]["id"]
    sentence = data[sent_id]

    temp_dictionary = {"doc_id": doc_id,
                       "sent_id": sent_id,
                       "sentence": sentence,
                       "topic_id": topic_id,
                       "topic_words": topic_words}
    dict_list.append(temp_dictionary)

df_topic_results = pd.DataFrame.from_dict(dict_list)

print(len(df_topic_results))
print(df_topic_results.head())

160899
   doc_id  sent_id                                           sentence  \
0       0        0  2\. Die Klägerin trägt die Kosten des Berufung...   
1       0        1  3\. Das vorgenannte Urteil des Landgerichts so...   
2       0        2  Gründe A. Von der Darstellung der tatsächliche...   
3       0        3  B. Die statthafte sowie form- und fristgerecht...   
4       0        4  I. Das Landgericht hat zu Recht einen weiterge...   

   topic_id                                        topic_words  
0        90  [(kosten berufungsverfahrens, 0.39795709073053...  
1        27  [(sicherheitsleistung, 0.09610722191519347), (...  
2        -1  [(sei, 0.001264101045992254), (fur, 0.00125342...  
3        -1  [(sei, 0.001264101045992254), (fur, 0.00125342...  
4        -1  [(sei, 0.001264101045992254), (fur, 0.00125342...  


Save Results in Drive:

In [None]:
from datetime import datetime
date_time = datetime.now().strftime("%Y%m%d-%H%M")

# df.to_csv is not usable because it only accepts single-character separators and the custom separator is necessary
delimiter = '_adelimiter528_'
results_path = '/content/drive/MyDrive/ba-thesis/topic-modeling/results/topic_modeling_results_default' + date_time + '.csv'

# The with-statement closes the file even if writing fails
with open(results_path, 'w', encoding="utf-8") as results_file:
    # Write column names
    results_file.write(delimiter.join(str(col) for col in df_topic_results.columns) + '\n')

    # Write rows
    for index, row in df_topic_results.iterrows():
        results_file.write(delimiter.join(str(row[col]) for col in df_topic_results.columns) + '\n')
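Since the `topic_words` column is written via `str()`, it ends up in the file as the text form of a list of tuples. A minimal sketch of reading one such row back is shown here. The example line is hypothetical, and the sketch assumes that the scores were written as plain floats and that no sentence contains a line break or the delimiter itself.

```python
import ast

DELIMITER = "_adelimiter528_"

# Hypothetical line in the layout written by the cell above
line = ("0_adelimiter528_1_adelimiter528_Ein Beispielsatz._adelimiter528_27"
        "_adelimiter528_[('sicherheitsleistung', 0.0961), ('urteil', 0.05)]\n")

doc_id, sent_id, sentence, topic_id, topic_words = line.rstrip("\n").split(DELIMITER)

# topic_words is a string; literal_eval turns it back into a list of tuples
parsed_words = ast.literal_eval(topic_words)
print(int(topic_id), parsed_words[0][0])
```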

Save trained Model in Drive:

In [None]:
# # Save model
# # The saved model was about 600 MB
# from datetime import datetime
# date_time = datetime.now().strftime("%Y%m%d-%H%M")
# print("date and time:",date_time)	
# topic_model.save('/content/drive/MyDrive/ba-thesis/topic-modeling/models/bertopic_model_default_' + model_ending + '_' + date_time)

date and time: 20220903-1013


Sending an email Notification:

In [None]:
# import smtplib

# server = smtplib.SMTP('smtp.gmail.com', 587)
# server.starttls()
# server.login("email_adress", "password")

# msg = "The model is pushed to the hub."
# server.sendmail("email_adress", "email_adress", msg)  # sendmail expects sender, recipient and message
# server.quit()

(221,
 b'2.0.0 closing connection f22-20020a056638113600b00349bb70ab9fsm1442574jar.152 - gsmtp')

**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

## 2.4 Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [None]:
freq = topic_model.get_topic_info(); freq.head(5)

   Topic   Count                                  Name
0     -1  113834              -1_sei_fur_beklagten_abs
1      0    4054  0_straße_wohnung_bebauungsplan_baugb
2      1    2207      1_kind_ehefrau_kinder_kindergeld
3      2    2093        2_dr_patienten_arzt_behandlung
4      3    1885                     3_00_000_euro_eur
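Topic -1 in the overview above collects the sentences that were not assigned to any cluster. How large this group is relative to the whole dataset can be checked with a short calculation. The counts are copied from the output above and from the dataset size printed earlier; the variable names are illustrative.

```python
# Counts copied from the topic overview above
topic_counts = {-1: 113834, 0: 4054, 1: 2207, 2: 2093, 3: 1885}
total_sentences = 160899  # number of sentences passed to fit_transform

outlier_share = topic_counts[-1] / total_sentences
print(f"Outlier share: {outlier_share:.1%}")
```

A share of this size suggests that the clustering parameters, for example `min_samples`, may be worth revisiting.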


-1 refers to all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [None]:
topic_model.get_topic(3)  # Select one of the most frequent topics

[('00', 0.02021397597207886),
 ('000', 0.010543100123813114),
 ('euro', 0.009319071148955149),
 ('eur', 0.009019722667456475),
 ('dm', 0.008958808709999242)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.

# 3. Visualization
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This might help in selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between those topic embeddings. The result will be a matrix indicating how similar certain topics are to each other.

In [None]:
topic_model.visualize_heatmap(n_clusters=15, width=1000, height=1000)  # width and height are given in pixels
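How such a similarity matrix comes about can be sketched with a few toy vectors. The three-dimensional topic embeddings below are hypothetical and only serve to show the computation; the function name `cosine_similarity` is chosen for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional topic embeddings keyed by topic id
topic_embeddings = {0: [1.0, 0.0, 0.0], 1: [0.8, 0.6, 0.0], 2: [0.0, 0.0, 1.0]}
topic_ids = sorted(topic_embeddings)

similarity_matrix = [
    [cosine_similarity(topic_embeddings[i], topic_embeddings[j]) for j in topic_ids]
    for i in topic_ids
]
for row in similarity_matrix:
    print([round(value, 2) for value in row])
```

Topics 0 and 1 point in a similar direction and get a high score, while topic 2 is orthogonal to topic 0 and gets a score of zero.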

## Visualize Term Score Decline
Topics are represented by a number of words starting with the best representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative the word is of the topic. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point adding words to the topic representation only marginally increases the total c-TF-IDF score and would not be beneficial for its representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the words with the highest c-TF-IDF score will have a rank of 1, will be put on the x-axis, whereas the y-axis will be populated by the c-TF-IDF scores. The result is a visualization that shows you the decline of the c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method, to select the best number of words in a topic.


In [None]:
topic_model.visualize_term_rank()
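The elbow can also be located numerically. The sketch below uses one simple heuristic among several: it picks the rank whose score lies furthest below the straight line drawn from the first to the last score. The function name `elbow_rank` is hypothetical, and the scores are the rounded c-TF-IDF values of topic 3 printed earlier in this notebook.

```python
def elbow_rank(scores):
    """Return the 1-based rank with the largest gap below the first-to-last chord."""
    n = len(scores)
    if n < 3:
        return n  # too few points for an elbow
    slope = (scores[-1] - scores[0]) / (n - 1)
    # Vertical distance between the chord and the actual score at each rank
    gaps = [scores[0] + slope * i - score for i, score in enumerate(scores)]
    return gaps.index(max(gaps)) + 1


# Rounded c-TF-IDF scores of topic 3, sorted by term rank
topic_scores = [0.0202, 0.0105, 0.0093, 0.0090, 0.0090]
print(elbow_rank(topic_scores))
```

For this topic the curve flattens after the second word, so additional words contribute little to the representation.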

## Search topics
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search term. Here, we are going to be searching for topics that closely relate to the 
search term "betrug" (German for "fraud"). Then, we extract the most similar topic and check the results: 

In [None]:
similar_topics, similarity = topic_model.find_topics("betrug", top_n=5); similar_topics

[147, 144, 58, 186, 150]

In [None]:
topic_model.get_topic(11)

[('juris', 0.02204025938754442),
 ('ovg', 0.01803476735750185),
 ('88 115', 0.017057094854746428),
 ('beschluss vom', 0.014933293523024608),
 ('ovg nrw', 0.014352627182026166)]