# **Tutorial** - Topic Modeling with BERTopic

In this tutorial we explore how to use BERTopic to create topics from a dataset (corpus). In short, a topic is *a matter dealt with in a text, discourse, or conversation*. We discuss the most frequent use cases and methods, together with the important parameters to look out for.


## BERTopic
BERTopic is a topic modeling technique that leverages Hugging Face 🤗 transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. 

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

## What is Topic Modeling and Why is it Useful

In machine learning (ML) and natural language processing (NLP), topic modeling is an unsupervised or semi-supervised statistical method for discovering abstract "topics" that exist within a collection of documents.

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [64]:
%%capture
!pip install bertopic datasets

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use them correctly, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data
For this example, we use [_A Real-Life Scenario Dialogue Summarization Dataset_](https://github.com/cylnlp/dialogsum).


In [1]:
from datasets import load_dataset

docs = load_dataset('knkarthick/dialogsum')['train']['summary']

f"Number of documents in the corpus {len(docs)}"




'Number of documents in the corpus 12460'

In [2]:
for doc in docs[95:100]:
  print(doc, '\n')

#Person1# and #Person2# are planning the class dinner for the end of the year. They discuss the place and the cost, and decide to fix the amount first and ask a restaurant to provide a meal for that price. 

Martin tells #Person1# about his experience in Europe. Martin is back in Beijing to visit his family and will return to France to finish his degree. #Person1# and Martin decide to meet tonight. 

#Person1# surveys #Person2# about #Person2#'s reading habits. #Person2# loves adventure stories and about 2/3 of #Person2#'s books are bought from online bookstores. 

#Person1# and #Person2# want a place near the university and it's better to be quiet. They decide to go to the estate agent to see the houses. 

#Person2# buys a bouquet of violet for #Person2#'s wife's birthday according to #Person1#'s suggestion. 



# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model. 




## Training

We start by instantiating BERTopic. We set language to `english` since our documents are in the English language. If you would like to use a multi-lingual model, please use `language="multilingual"` instead. 

We will also calculate the topic probabilities. However, this can slow down BERTopic significantly on large datasets (>100_000 documents). It is advised to turn this off if you want to speed up the model.


In [3]:
from bertopic import BERTopic
# from sentence_transformers import SentenceTransformer

# sentence_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
# topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)


2022-09-11 11:45:24,716 - BERTopic - Transformed documents to Embeddings
2022-09-11 11:45:55,934 - BERTopic - Reduced dimensionality
2022-09-11 11:46:10,263 - BERTopic - Clustered reduced embeddings


## What Happened in the Previous Step
- Transformed documents to Embeddings
- Reduced dimensionality with [UMAP](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html), which is relatively fast even on larger datasets with high (3+) dimensionality and tends to keep similar points close together in the reduced space ([Video](https://youtu.be/eN0wFzBA4Sc))<br>**TL;DR**: UMAP keeps a significant portion of the high-dimensional local structure in lower dimensionality
- Clustered reduced embeddings using [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html). More on HDBSCAN in this [video](https://youtu.be/RDZUdRSDOok)

**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

If you wish to use a specific transformer model, you can load it with `SentenceTransformer` and pass it to BERTopic as `embedding_model=`, as seen in the commented-out lines in the cell above.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents. 

In [4]:
topic_model.get_topic_info().head(25)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4398 | -1_and_to_the_in |
| 1 | 0 | 509 | 0_job_company_interview_work |
| 2 | 1 | 382 | 1_mary_tom_her_mike |
| 3 | 2 | 322 | 2_orders_food_restaurant_cook |
| 4 | 3 | 267 | 3_class_school_university_course |
| 5 | 4 | 194 | 4_bus_train_station_subway |
| 6 | 5 | 185 | 5_movie_movies_film_watch |
| 7 | 6 | 167 | 6_medicine_doctor_prescription_fever |
| 8 | 7 | 160 | 7_game_play_football_sports |
| 9 | 8 | 154 | 8_shanghai_beijing_chinese_china |
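
The `-1` row is by far the largest entry in the table. To put a number on it, the short sketch below recomputes its share of the corpus. The counts are typed in by hand from the output above rather than read from the model, and the variable names are ours.

```python
# Counts copied by hand from the table above (topic id -> number of documents).
topic_counts = {-1: 4398, 0: 509, 1: 382, 2: 322, 3: 267}
total_docs = 12460  # corpus size reported when the data was loaded

# Fraction of all documents that ended up in topic -1.
outlier_share = topic_counts[-1] / total_docs
print(f"Topic -1 holds {outlier_share:.1%} of the corpus")  # Topic -1 holds 35.3% of the corpus
```

Roughly a third of the summaries land in topic `-1` in this run.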


`-1` refers to all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that was generated:

In [9]:
topic_model.get_topic(4)  # Select one of the most frequent topics

[('bus', 0.06694440432671274),
 ('train', 0.032595466660595755),
 ('station', 0.025286189036915745),
 ('subway', 0.023391893879401524),
 ('get', 0.013649252440763922),
 ('way', 0.012415083774419263),
 ('pass', 0.01100053323575575),
 ('taking', 0.010279405857525407),
 ('trains', 0.01005017856334884),
 ('how', 0.009814608307687418)]
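
`get_topic` returns a list of `(word, score)` pairs. The names in the overview, such as `4_bus_train_station_subway`, look like the topic id joined with its four highest-scoring words. The sketch below rebuilds that label from the output above. The helper name `make_label` is hypothetical, and the naming rule is inferred from the table, not taken from BERTopic's source.

```python
# First rows of the get_topic(4) output shown above.
topic_words = [
    ("bus", 0.06694440432671274),
    ("train", 0.032595466660595755),
    ("station", 0.025286189036915745),
    ("subway", 0.023391893879401524),
    ("get", 0.013649252440763922),
]

def make_label(topic_id, words, n_words=4):
    """Join the topic id and the n highest-scoring words with underscores."""
    ranked = sorted(words, key=lambda pair: pair[1], reverse=True)
    return "_".join([str(topic_id)] + [word for word, _ in ranked[:n_words]])

print(make_label(4, topic_words))  # 4_bus_train_station_subway
```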

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
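
If repeatable topics matter, one common approach is to pass BERTopic a UMAP instance with a fixed `random_state`. The sketch below assumes the `umap-learn` package that is installed alongside BERTopic. The UMAP settings are illustrative choices, not values taken from this tutorial, and the helper name `build_seeded_model` is hypothetical.

```python
def build_seeded_model(seed=42):
    """Return a BERTopic model whose UMAP step uses a fixed random seed."""
    # Imported lazily so that defining the helper does not load the heavy packages.
    from bertopic import BERTopic
    from umap import UMAP

    # Illustrative UMAP settings (assumption); only random_state matters here.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,
    )
    return BERTopic(
        umap_model=umap_model,
        language="english",
        calculate_probabilities=True,
        verbose=True,
    )
```

Fitting `build_seeded_model()` twice on the same `docs` in the same environment should then give the same topics, possibly at the cost of a slower UMAP step.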

# **Visualization**
There are several visualization options available in BERTopic, namely the visualization of topics, probabilities and topics over time. Topic modeling is, to a certain extent, quite subjective. Visualizations help understand the topics that were created. 

## Visualize Topics
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [10]:
topic_model.visualize_topics()

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar 
to an input search term. Here, we search for topics that closely relate to the 
search term "finance". Then, we extract the most similar topic and check the results: 

In [27]:
similar_topics, similarity = topic_model.find_topics("finance", top_n=5)
list(zip(similar_topics, similarity))

[(160, 0.7074276414689274),
 (78, 0.6940140119929326),
 (18, 0.6462404293945662),
 (23, 0.642921070879396),
 (94, 0.5848665479037725)]

In [28]:
topic_model.get_topic(160)

[('stocks', 0.09853218139937635),
 ('stock', 0.07233517765358904),
 ('stockholders', 0.03941287255975054),
 ('investment', 0.03440395193389928),
 ('market', 0.03303622323433247),
 ('shares', 0.0301365225050779),
 ('worth', 0.02968670576162577),
 ('research', 0.02917608943739513),
 ('lose', 0.027251085261458825),
 ('money', 0.02705545378357918)]
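
Conceptually, this kind of search embeds the query and ranks topics by the cosine similarity between the query vector and each topic vector. The sketch below shows that ranking on made-up three-dimensional vectors. Two of the topic ids are reused from the outputs above, but the numbers and the helper names are hypothetical and do not come from the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return (topic_id, similarity) pairs, most similar first."""
    scored = [(tid, cosine(query_vec, vec)) for tid, vec in topic_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]

# Toy vectors (assumption): real embeddings have hundreds of dimensions.
topic_vecs = {160: [0.9, 0.1, 0.0], 78: [0.7, 0.3, 0.1], 4: [0.0, 0.2, 0.9]}
query_vec = [1.0, 0.0, 0.0]

for topic_id, score in rank_topics(query_vec, topic_vecs, top_n=2):
    print(topic_id, round(score, 3))
```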

## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so 
is that you can decide on the number of topics after seeing how many were actually created. It is difficult to 
predict before training how many topics your documents contain and how many will be extracted. 
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
new_topics, new_probs = topic_model.reduce_topics(docs, topics, probs, nr_topics=60)

2022-09-11 10:27:57,418 - BERTopic - Reduced number of topics from 190 to 61


## Visualize Topic Probabilities

The probabilities (`probs` in this tutorial) returned by `transform()` or `fit_transform()` can 
be used to understand how confident BERTopic is that certain topics can be found in a document. 

To visualize the distributions, we simply call:

In [33]:
topic_model.visualize_distribution(probs[50], min_probability=0.015)
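
The `min_probability` argument hides topics whose probability falls below the threshold. Here is a minimal sketch of that filtering step on a hand-written probability vector. The values are invented for illustration, and `probs[50]` from the model would be a much longer array.

```python
# Hypothetical probabilities for one document, indexed by topic id.
doc_probs = [0.002, 0.140, 0.010, 0.031, 0.0005, 0.016]
min_probability = 0.015

# Keep (topic_id, probability) pairs at or above the threshold, largest first.
kept = sorted(
    ((topic_id, p) for topic_id, p in enumerate(doc_probs) if p >= min_probability),
    key=lambda pair: pair[1],
    reverse=True,
)
print(kept)  # [(1, 0.14), (3, 0.031), (5, 0.016)]
```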

In [43]:
docs[50]

"#Person1# stabbed the victim because he beat #Person1# first and tried to grab #Person1#'s bag. #Person1# says he didn't kill him on purpose."

## Visualize Topic Hierarchy

The topics that were created can be hierarchically reduced. To understand the potential hierarchical structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how they relate to one another. This might help in selecting an appropriate `nr_topics` when reducing the number of topics that you have created.

In [45]:
topic_model.visualize_hierarchy(top_n_topics=20)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [46]:
topic_model.visualize_barchart(top_n_topics=10)
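
To make the scores behind these bar charts less abstract, here is a simplified reading of the c-TF-IDF idea: a word's frequency within one topic, scaled by how rare the word is across all topics. It uses plain whitespace tokenization and toy documents. It is not BERTopic's implementation, and the exact weighting formula used here is an assumption.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Score words per class: tf(word, class) * log(1 + avg_words_per_class / freq(word))."""
    # Treat all documents of a class as one long document and count its words.
    counts = {c: Counter(" ".join(docs).lower().split()) for c, docs in class_docs.items()}
    total_freq = Counter()
    for counter in counts.values():
        total_freq.update(counter)
    avg_words = sum(total_freq.values()) / len(counts)

    scores = {}
    for c, counter in counts.items():
        n_words = sum(counter.values())
        scores[c] = {
            word: (freq / n_words) * math.log(1 + avg_words / total_freq[word])
            for word, freq in counter.items()
        }
    return scores

# Toy classes (assumption): each class stands for the documents of one topic.
class_docs = {
    "transport": ["bus to the station", "train leaves the station"],
    "food": ["order food", "the restaurant serves food"],
}
scores = c_tf_idf(class_docs)
for name, word_scores in scores.items():
    best = max(word_scores, key=word_scores.get)
    print(name, best, round(word_scores[best], 3))
```

In this toy example, "the" is frequent in both classes but is down-weighted because it appears everywhere, so the class-specific words come out on top.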

## Visualize Topic Similarity
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity matrix by computing the cosine similarity between those topic embeddings. The result is a matrix indicating how similar certain topics are to each other.

In [47]:
topic_model.visualize_heatmap(n_clusters=20, width=1000, height=1000)

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created. 

This allows for fine-tuning the model to your specifications and wishes. 

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`: 


In [48]:
topic_model.update_topics(docs, topics, n_gram_range=(2, 2))

In [50]:
topic_model.get_topic(1)   # Select a topic listed in the overview table earlier

[('each other', 0.01024579031357164),
 ('with her', 0.006879685063666506),
 ('mary to', 0.00603765966885388),
 ('she has', 0.00600379563780936),
 ('to go', 0.005911275408086334),
 ('introduce themselves', 0.00528670707161684),
 ('him to', 0.005225850414737693),
 ('other for', 0.005177138091727295),
 ('mary and', 0.005076836855425217),
 ('her to', 0.004814756461133527)]
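
Setting `n_gram_range=(2, 2)` restricts the vocabulary to two-word phrases, which is why every entry above is a bigram. Below is a rough sketch of bigram extraction with plain whitespace tokenization. The helper `ngrams` and the example sentence are ours, and the tokenization done inside BERTopic may differ.

```python
def ngrams(text, n):
    """Return the list of n-word sequences in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Mary and Tom introduce themselves to each other"  # made-up example
print(ngrams(sentence, 2))
```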

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
topic_model.save("my_model")	

In [None]:
# Load model
my_model = BERTopic.load("my_model")	

# **Assign a new document to a topic**

Now that we have saved our topic model, suppose we want to use it to handle documents it has never seen before.

In [51]:
%%capture
new_docs = [
    "This workshop was about how computers can process natural language.",
    "The number of violent crimes and mugging is alarming in London."
]

assigned_topics = topic_model.transform(new_docs)[0]
assigned_topics

2022-09-11 12:19:47,282 - BERTopic - Reduced dimensionality
2022-09-11 12:19:47,309 - BERTopic - Calculated probabilities with HDBSCAN
2022-09-11 12:19:47,311 - BERTopic - Predicted clusters


In [62]:
topic_model.get_topic(assigned_topics[1])

[('police', 0.0313548724982208),
 ('robbed', 0.0283092804016299),
 ('robber', 0.025735709456027177),
 ('crime', 0.024033997412387806),
 ('arrested', 0.023578077316587162),
 ('robbery', 0.02181185569640419),
 ('criminals', 0.017654662521816068),
 ('robbers', 0.017654662521816068),
 ('kim', 0.015441425673616309),
 ('drugs', 0.014526127497503797)]
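
To inspect several new documents at once, the two calls above can be wrapped in a small helper. This is a sketch under the assumption that the model exposes `transform` and `get_topic` as used in this tutorial. The helper name `describe_assignments` is hypothetical.

```python
def describe_assignments(topic_model, new_docs, n_words=5):
    """Pair each new document with its predicted topic id and that topic's top words."""
    topics, _ = topic_model.transform(new_docs)  # transform returns (topics, probabilities)
    rows = []
    for doc, topic_id in zip(new_docs, topics):
        # Fall back to an empty list if get_topic returns a falsy value for this id.
        top_pairs = (topic_model.get_topic(topic_id) or [])[:n_words]
        rows.append((doc, topic_id, [word for word, _ in top_pairs]))
    return rows
```

With the model trained above, `describe_assignments(topic_model, new_docs)` would return one `(document, topic id, top words)` tuple per document.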