## BERTopic
BERTopic is a topic modeling technique that leverages 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- select GPU from the Hardware Accelerator drop-down

# **Installing BERTopic**

We start by installing BERTopic from PyPi:

In [8]:
%%capture
!pip install bertopic

## Restart the Notebook
After installing BERTopic, some packages that were already loaded were updated and in order to correctly use them, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data
For this example, we use the popular 20 Newsgroups dataset which contains roughly 18000 newsgroups posts

In [None]:
import pandas as pd

In [None]:
#connect to G-Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
df = pd.read_csv("/content/gdrive/MyDrive/Colab Notebooks/Topic Modelling/process-steps.csv", sep=',',
    usecols=[0,1,2,3,4,5,6],
    encoding="utf8")
df.head()
docs = df["step_title"].values.tolist()

In [None]:
docs= docs[0:1500]

# **Topic Modeling**



## Training

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly at large amounts of data (>20_000 documents). It is advised to turn this off if you want to speed up the model. 


### Stop Words for Count Vectorizer

In [None]:
# get Greek stop_words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
greek_stopwords = stopwords.words('greek')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
new_words = ['της', 'τη', 'του', 'από']

for word in new_words:
  greek_stopwords.append(word)

### Initialize models & Training

In [10]:
from bertopic import BERTopic 
from sentence_transformers import SentenceTransformer # Embeddings
from umap import UMAP #Dimensionality reduction
from hdbscan import HDBSCAN #clustering
from sklearn.feature_extraction.text import CountVectorizer # CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

In [37]:
sentence_model = SentenceTransformer("all-mpnet-base-v2")
umap_model = UMAP(n_neighbors=10, n_components=10, min_dist=0.01, metric='cosine')
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=greek_stopwords) #ngram_range combination of words
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)

In [38]:
topic_model = BERTopic(embedding_model=sentence_model, umap_model=umap_model, hdbscan_model=hdbscan_model, vectorizer_model=vectorizer_model, ctfidf_model=ctfidf_model)
topics = topic_model.fit_transform(docs)

**NOTE**: Use `language="multilingual"` to select a model that support 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [39]:
freq = topic_model.get_topic_info(); freq.head(10)

Unnamed: 0,Topic,Count,Name
0,-1,369,-1_πολίτη_έλεγχος_αίτησης πολίτη_πλοίου
1,0,328,0_έκδοση_ενημέρωση_απόφασης_υπαλλήλου
2,1,110,1_μητρώο_μητρώου_πολιτών_μητρώο πολιτών
3,2,58,2_πρωτοκόλληση_ανάθεση_πρωτοκόλληση ανάθεση_υπ...
4,3,51,3_φακέλου_χρέωση_έλεγχος φακέλου_αρμόδιο
5,4,47,4_έλεγχος_έλεγχος συστήματος_έλεγχος αίτησης_σ...
6,5,44,5_έκδοση_εκτύπωση_απόφασης_εκτύπωση πιστοποιητ...
7,6,39,6_δικαιολογητικών_πιστοποιητικού πολίτη δήμο_α...
8,7,33,7_ηλεκτρονική_υποβολή_αίτησης δικαιολογητικών_...
9,8,32,8_δικαιολογητικών_έλεγχος_πληρότητας_προϋποθέσεων


-1 refers to all outliers and should typically be ignored. Next, let's take a look at a frequent topic that were generated:

In [40]:
topic_model.get_topic(0)  # Select the most frequent topic

[('έκδοση', 0.012281235776005681),
 ('ενημέρωση', 0.010217347008749836),
 ('απόφασης', 0.009615903994686025),
 ('υπαλλήλου', 0.008556111110851135),
 ('διεύθυνση', 0.008176911085669963),
 ('εισήγηση', 0.00788852656945095),
 ('προϋποθέσεων', 0.007574781643136161),
 ('προς', 0.0075184289188277055),
 ('διαπίστωση', 0.007434648830209443),
 ('έκδοση απόφασης', 0.007425026796474913)]

**NOTE**: BERTopic is stocastich which mmeans that the topics might differ across runs. This is mostly due to the stocastisch nature of UMAP.

In [41]:
topic_model.get_document_info(docs)

Unnamed: 0,Document,Topic,Name,Top_n_words,Probability,Representative_document
0,Φυσική ταυτοποίηση πολίτη στο ΚΕΠ,18,18_ταυτοποίηση_ταυτοποίηση πολίτη_φυσική ταυτο...,ταυτοποίηση - ταυτοποίηση πολίτη - φυσική ταυτ...,1.000000,False
1,Υποβολή αίτησης στο Πληροφοριακό σύστημα,2,2_πρωτοκόλληση_ανάθεση_πρωτοκόλληση ανάθεση_υπ...,πρωτοκόλληση - ανάθεση - πρωτοκόλληση ανάθεση ...,0.616246,False
2,Εκτύπωση αποδεικτικού φορολογικής ενημερότητας,13,13_πώλησης κτηνιατρικών_φαρμακευτικών_φαρμακευ...,πώλησης κτηνιατρικών - φαρμακευτικών - φαρμακε...,0.919787,False
3,Εκκίνηση διαδικασίας,19,19_διαδικασίας_εκκίνηση διαδικασίας_εκκίνηση_δ...,διαδικασίας - εκκίνηση διαδικασίας - εκκίνηση ...,1.000000,True
4,Παραλαβή αιτήματος για διακίνηση χοιροειδών,0,0_έκδοση_ενημέρωση_απόφασης_υπαλλήλου,έκδοση - ενημέρωση - απόφασης - υπαλλήλου - δι...,0.847124,False
...,...,...,...,...,...,...
1495,Ολοκλήρωση διαδικασίας,19,19_διαδικασίας_εκκίνηση διαδικασίας_εκκίνηση_δ...,διαδικασίας - εκκίνηση διαδικασίας - εκκίνηση ...,1.000000,False
1496,Διαβίβαση εγγράφων υπηρεσιακά εντός του Υπουργ...,0,0_έκδοση_ενημέρωση_απόφασης_υπαλλήλου,έκδοση - ενημέρωση - απόφασης - υπαλλήλου - δι...,1.000000,False
1497,Αξιολόγηση για την περίπτωση υποβολής ή μη προ...,-1,-1_πολίτη_έλεγχος_αίτησης πολίτη_πλοίου,πολίτη - έλεγχος - αίτησης πολίτη - πλοίου - π...,0.000000,False
1498,Υποβολή προσφυγής κατά της κράτησης του ελληνι...,3,3_φακέλου_χρέωση_έλεγχος φακέλου_αρμόδιο,φακέλου - χρέωση - έλεγχος φακέλου - αρμόδιο -...,0.562435,False


In [None]:
### Attributes

## Attributes

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 10 documents, we simply run the following:

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topic to get a good 
understanding of the topics that were extract. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [16]:
topic_model.visualize_topics()

## Visualize Topic Probabilities

The variable `probabilities` that is returned from `transform()` or `fit_transform()` can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distributions, we simply call:

In [None]:
topic_model.visualize_distribution(probs[200], min_probability=0.001)

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical structure of the topics, we can use scipy.cluster.hierarchy to create clusters and visualize how they relate to one another. This might help selecting an appropriate nr_topics when reducing the number of topics that you have created.

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=5)

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by simply applying cosine similarities through those topic embeddings. The result will be a matrix indicating how similar certain topics are to each other.

In [None]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

## Visualize Term Score Decline
Topics are represented by a number of words starting with the best representative word. Each word is represented by a c-TF-IDF score. The higher the score, the more representative a word to the topic is. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline with each word that is added. At some point adding words to the topic representation only marginally increases the total c-TF-IDF score and would not be beneficial for its representation.

To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word. In other words, the position of the words (term rank), where the words with the highest c-TF-IDF score will have a rank of 1, will be put on the x-axis. Whereas the y-axis will be populated by the c-TF-IDF scores. The result is a visualization that shows you the decline of c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method, the select the best number of words in a topic.


In [None]:
topic_model.visualize_term_rank()

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created. 

This allows for fine-tuning the model to your specifications and wishes. 

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`: 


In [None]:
topic_model.update_topics(docs, n_gram_range=(1, 2))

In [None]:
topic_model.get_topic(0)   # We select topic that we viewed before

[('του', 0.025664358552657897),
 ('υπάλληλο', 0.021681286210734525),
 ('αρμόδιο', 0.02159625460109819),
 ('τμήματος', 0.019105794065593856),
 ('τμήμα', 0.018704252370337602),
 ('προϊστάμενο', 0.018704252370337602),
 ('χρέωση', 0.018354433558546504),
 ('την', 0.016629128086533856),
 ('στην', 0.016082461581266524),
 ('προϊστάμενο του', 0.016073079262104578)]

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so, 
is that you can decide the number of topics after knowing how many are actually created. It is difficult to 
predict before training your model how many topics that are in your documents and how many will be extracted. 
Instead, we can decide afterwards how many topics seems realistic:





In [None]:
topic_model.reduce_topics(docs, nr_topics=20)

2023-05-23 09:34:59,667 - BERTopic - Reduced number of topics from 49 to 20


<bertopic._bertopic.BERTopic at 0x7fe4c2bc5b40>

In [None]:
# Access the newly updated topics with:
print(topic_model.topics_)

[2, 1, -1, 8, 6, 6, 0, 6, 6, 8, 0, 1, 0, 6, -1, 6, -1, 0, 5, 6, -1, 6, 8, 0, 0, 1, 0, 0, 0, -1, 0, 5, 5, 5, 1, 0, 17, 17, 17, 17, -1, -1, 17, 1, 11, 1, 0, 16, 16, 0, 3, -1, 0, 0, 0, 5, 0, 0, 8, 6, 6, 6, 6, 6, 2, 0, 18, 0, 0, 8, 1, 0, 0, 0, 1, 8, 0, 11, 4, 0, 0, -1, 0, 0, 4, -1, -1, 18, 1, 0, -1, 0, 0, -1, -1, 8, 0, 1, -1, 0, 5, 4, -1, 0, 5, -1, 0, 0, 5, 1, 0, 17, 17, 17, 17, 17, 0, -1, 17, 15, 1, 0, -1, 0, -1, 12, 0, 0, 0, 0, 2, 0, 2, 0, 1, 0, 0, 5, 3, 6, 6, -1, 3, 5, 3, 6, 6, 8, 0, 1, 0, 0, 12, 13, -1, 0, 0, 5, 8, 0, 1, -1, 1, 0, 4, -1, -1, -1, 0, 0, 0, 2, 0, 0, 0, 2, 4, 2, 0, 0, -1, 0, 0, 0, 0, 15, 8, 0, 1, 15, 15, 15, 1, -1, 0, -1, 6, 0, 8, -1, -1, 2, 0, -1, -1, -1, 4, -1, 0, -1, -1, 4, 4, 0, 0, 0, 5, 8, 0, 1, 0, 15, 0, 0, -1, 1, 13, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 5, 0, 0, 0, -1, -1, 0, 13, 1, 0, 0, 1, 5, 0, 0, 0, 11, 13, 8, 0, 1, 0, 6, -1, 6, -1, 0, 5, 6, -1, 6, 8, 0, 1, 0, 3, 3, 3, 6, -1, -1, -1, 4, 0, 5, -1, 4, 0, 0, 0, 0, 1, -1, -1, 0, -1, 0, 0, 1, -1, 8, 0, 1, 0,

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search_term. Here, we are going to be searching for topics that closely relate the 
search term "vehicle". Then, we extract the most similar topic and check the results: 

In [None]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

[14, 10, 11, 4, 0]

In [None]:
topic_model.get_topic(0)

[('αίτησης', 0.034193058385162005),
 ('της', 0.028939271578152494),
 ('και', 0.027615165135388156),
 ('έλεγχος', 0.02500530017958265),
 ('του', 0.02423669796376737),
 ('βεβαίωσης', 0.021996993103581507),
 ('της αίτησης', 0.02164918737136251),
 ('από', 0.021125869087540137),
 ('παραλαβή', 0.0202417005623797),
 ('δικαιολογητικών', 0.019894030749705614)]