<a href="https://colab.research.google.com/github/binliu0630/transformers/blob/master/BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Tutorial** - Advanced Customization in BERTopic
(last updated 26-04-2021)

In this tutorial, we will go through some advanced customization options in BERTopic. This includes hyperparameters, optimization, custom sub-models, and more! 

<br>

<img src="https://raw.githubusercontent.com/MaartenGr/BERTopic/master/images/logo.png" width="40%">

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# Installing BERTopic

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. To use the updated versions correctly, restart the notebook now.

From the Menu:

Runtime → Restart Runtime

# **Data**
For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts. The code below loads only its training subset.

In [None]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='train',  remove=('headers', 'footers', 'quotes'))['data']

In [None]:
print(docs[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


# **Hyperparameters**
In this section, we will go through the most important hyperparameters in BERTopic:
* language
* top_n_words
* n_gram_range
* min_topic_size
* nr_topics
* low_memory
* calculate_probabilities

## language
The `language` parameter is used to simplify the selection of models for those who are not familiar with sentence-transformers models. 

In essence, there are two options to choose from:
* `language = "english"` or
* `language = "multilingual"`

The English model is "distilbert-base-nli-stsb-mean-tokens" and can be found [here](https://www.sbert.net/docs/pretrained_models.html). It is the default model that is used in BERTopic and works great for English documents. 

The multilingual model is "xlm-r-bert-base-nli-stsb-mean-tokens" and supports 50+ languages, which are listed [here](https://www.sbert.net/docs/pretrained_models.html). The model is very similar to the base model but is trained on many languages and has a slightly different architecture.
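
To make the choice concrete, here is a minimal sketch of how a language option could map to a model name. The lookup table and the `select_embedding_model` helper are hypothetical and exist only for illustration; they are not part of BERTopic, and only the two model names come from the text above.

```python
# Hypothetical lookup table mirroring the two options described above.
# The model names are the ones quoted in this tutorial; the helper itself
# is illustrative and not part of BERTopic.
LANGUAGE_TO_MODEL = {
    "english": "distilbert-base-nli-stsb-mean-tokens",
    "multilingual": "xlm-r-bert-base-nli-stsb-mean-tokens",
}


def select_embedding_model(language):
    """Return the sentence-transformers model name for a language option."""
    try:
        return LANGUAGE_TO_MODEL[language.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language option: {language!r}") from None


print(select_embedding_model("english"))
print(select_embedding_model("multilingual"))
```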

In [None]:
topic_model = BERTopic(language="english").fit(docs)
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5566 | -1_can_do_me_any |
| 1 | 19 | 456 | 19_pitching_baseball_hit_cubs |
| 2 | 35 | 374 | 35_space_nasa_lunar_orbit |
| 3 | 0 | 302 | 0_etherfind_etherfindcompress_etheric_etherlan |
| 4 | 12 | 278 | 12_key_encryption_keys_nsa |


## top_n_words
`top_n_words` refers to the number of words per topic that you want extracted. In practice, I would advise keeping this value below 30, preferably between 10 and 20. The more words you put in a topic, the less coherent it can become. The top words are the most representative of the topic and deserve the focus.

In [None]:
topic_model = BERTopic(top_n_words=5).fit(docs)

In [None]:
topic_model.get_topic(topic_model.get_topic_freq().iloc[1].Topic)

[('nasa', 0.010241322270525832),
 ('orbit', 0.007411943710752282),
 ('spacecraft', 0.005906838931156356),
 ('mars', 0.0050366750348830635),
 ('planetary', 0.00430543754669191)]
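
Since `get_topic` returns a list of `(word, score)` tuples, pulling out just the words takes a few lines. The sketch below copies the values from the output above; the `top_words` helper is a hypothetical name introduced for illustration, not a BERTopic function.

```python
# (word, score) pairs copied from the output above.
topic = [
    ("nasa", 0.010241322270525832),
    ("orbit", 0.007411943710752282),
    ("spacecraft", 0.005906838931156356),
    ("mars", 0.0050366750348830635),
    ("planetary", 0.00430543754669191),
]


def top_words(topic_words, n):
    """Return the n highest-scoring words, best first."""
    ranked = sorted(topic_words, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


print(top_words(topic, 3))  # ['nasa', 'orbit', 'spacecraft']
```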

## n_gram_range
The `n_gram_range` parameter refers to the CountVectorizer used when creating the topic representation. It sets how many consecutive words a single term in your topic representation may contain. For example, "New" and "York" are two separate words but are often used together as "New York", which is an n-gram of 2. Thus, `n_gram_range` should be set to (1, 2) if you want "New York" in your topic representation. Setting it to (2, 2), as in the BERTopic call below, keeps only two-word terms.
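
To see what different ranges produce, here is a small pure-Python sketch. The `word_ngrams` helper is hypothetical and uses naive whitespace tokenization, so it only approximates what CountVectorizer does (CountVectorizer applies its own tokenizer and, by default, drops single-character tokens).

```python
def word_ngrams(text, n_gram_range):
    """Return all word n-grams of `text` for sizes in the inclusive range."""
    tokens = text.lower().split()
    low, high = n_gram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


sentence = "New York is big"
print(word_ngrams(sentence, (1, 1)))  # ['new', 'york', 'is', 'big']
print(word_ngrams(sentence, (1, 2)))  # single words plus two-word terms
print(word_ngrams(sentence, (2, 2)))  # ['new york', 'york is', 'is big']
```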

In [None]:
topic_model = BERTopic(n_gram_range=(2, 2)).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4831 | -1_if you_to be_for the_is the |
| 1 | 3 | 452 | 3_the moon_the space_space station_of space |
| 2 | 22 | 424 | 22_the braves_cubs suck_suck cubs_the cubs |
| 3 | 5 | 418 | 5_gordon banks_gebcadredslpittedu it_and gebca... |
| 4 | 24 | 380 | 24_the flyers_the nhl_the puck_the leafs |


## min_topic_size
`min_topic_size` is an important parameter! It specifies the minimum size a topic can have. The lower this value, the more topics are created. If you set it too high, it is possible that no topics will be created at all!

It is advised to play around with this value depending on the size of your dataset. If the dataset nears a million documents, it is advised to set the value much higher than the default of 10, for example 100.
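
The relationship between this value and the number of topics can be pictured with a toy example. The sketch below simply filters made-up group sizes by a threshold. It is a deliberate simplification and not how HDBSCAN forms clusters; both the sizes and the `count_topics` helper are invented for illustration.

```python
# Hypothetical group sizes, invented for illustration only.
candidate_cluster_sizes = [500, 120, 45, 18, 12, 7]


def count_topics(cluster_sizes, min_topic_size):
    """Count the groups that are large enough to survive as topics."""
    return sum(1 for size in cluster_sizes if size >= min_topic_size)


for min_size in (10, 20, 100, 1000):
    print(min_size, count_topics(candidate_cluster_sizes, min_size))
```

With these toy sizes, raising the threshold from 10 to 100 leaves fewer groups, and a threshold of 1000 leaves none.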

In [None]:
topic_model = BERTopic(min_topic_size=20).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 3447 | -1_not_be_are_have |
| 1 | 20 | 2253 | 20_windows_if_can_any |
| 2 | 30 | 459 | 30_god_jesus_we_church |
| 3 | 7 | 437 | 7_game_team_baseball_last |
| 4 | 23 | 427 | 23_car_bike_cars_engine |


## nr_topics
`nr_topics` can be a tricky parameter. It specifies the number of topics to reduce to after the topic model has been trained. For example, if your topic model results in 100 topics but you have set `nr_topics` to 20, then the topic model will try to reduce the number of topics from 100 to 20.

This reduction can take a while, as each reduction step triggers a c-TF-IDF calculation. If `nr_topics` is set to None, no reduction is applied. Use "auto" to reduce topics automatically using HDBSCAN.
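
One plausible way to picture step-by-step reduction is to repeatedly merge the smallest topic into the topic it resembles most. The sketch below is a simplified illustration under that assumption, not BERTopic's actual implementation: the toy word-weight vectors are made up, `reduce_topics` is a hypothetical helper, and a size-weighted average stands in for recomputing c-TF-IDF after each merge.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reduce_topics(topics, nr_topics):
    """Merge the smallest topic into its most similar topic until nr_topics remain.

    `topics` maps a topic id to (document_count, word_weight_vector).
    """
    topics = dict(topics)
    while len(topics) > nr_topics:
        smallest = min(topics, key=lambda t: topics[t][0])
        count, vector = topics.pop(smallest)
        target = max(topics, key=lambda t: cosine_similarity(vector, topics[t][1]))
        t_count, t_vector = topics[target]
        total = count + t_count
        # Assumption: a size-weighted average stands in for re-running
        # c-TF-IDF on the merged documents.
        merged = [(count * x + t_count * y) / total for x, y in zip(vector, t_vector)]
        topics[target] = (total, merged)
    return topics


# Toy topics: (document count, made-up word-weight vector).
toy_topics = {
    0: (300, [1.0, 0.0, 0.0]),
    1: (200, [0.0, 1.0, 0.0]),
    2: (50, [0.9, 0.1, 0.0]),  # close to topic 0
    3: (20, [0.1, 0.9, 0.0]),  # close to topic 1
}
reduced = reduce_topics(toy_topics, nr_topics=2)
print({topic_id: count for topic_id, (count, _) in reduced.items()})
```

Every merge requires a fresh representation of the merged topic, which is why reducing by many topics takes time.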

In [None]:
topic_model = BERTopic(nr_topics=10).fit(docs)

In [None]:
topic_model.get_topic_info()

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 8220 | -1_is_have_be_are |
| 1 | 64 | 614 | 64_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_god_jesus_be |
| 2 | 22 | 441 | 22_team_play_game_period |
| 3 | 6 | 424 | 6_key_chip_encryption_will |
| 4 | 18 | 407 | 18_year_his_have_game |
| 5 | 38 | 326 | 38_space_on_launch_nasa |
| 6 | 0 | 302 | 0_etherroutetcp_ethercoax8w_etherfind_etherfin... |
| 7 | 15 | 157 | 15_israel_israeli_jews_not |
| 8 | 82 | 153 | 82_sale_offer_forged_sell |
| 9 | 2 | 135 | 2_armenian_armenians_turkish_people |


Note that I have set the number of topics quite low for educational purposes. In practice, do not set this value too low, as that forces topics to merge that should not be merged.

In [None]:
topic_model = BERTopic(nr_topics="auto").fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5158 | -1_can_any_if_do |
| 1 | 20 | 415 | 20_pitching_cubs_baseball_runs |
| 2 | 7 | 413 | 7_space_nasa_lunar_satellite |
| 3 | 22 | 408 | 22_game_hockey_nhl_leafs |
| 4 | 57 | 313 | 57_jesus_god_church_bible |


## low_memory + calculate_probabilities
`low_memory` sets UMAP's `low_memory` to True so that less memory is used during computation. This slows computation down but allows UMAP to run on low-memory machines.

`calculate_probabilities` lets you calculate the probability of each topic for each document. This is computationally quite expensive and is turned off by default.

Thus, to run BERTopic on machines that are a bit less powerful, use the code below:

In [None]:
topic_model = BERTopic(low_memory=True, calculate_probabilities=False).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5282 | -1_my_can_but_was |
| 1 | 4 | 500 | 4_hockey_nhl_10_gm |
| 2 | 25 | 489 | 25_bike_cars_engine_bikes |
| 3 | 17 | 457 | 17_baseball_pitching_hit_last |
| 4 | 18 | 367 | 18_food_doctor_patients_disease |


# **Custom sub-models**
There are three models underpinning BERTopic that are most important in creating the topics, namely UMAP, HDBSCAN, and CountVectorizer. The parameters of these models have been carefully selected to give the best results. However, there is no one-size-fits-all solution using these default parameters.

Therefore, BERTopic allows you to pass in any custom UMAP, HDBSCAN, and/or CountVectorizer with the parameters that best suit your use-case. For example, you might want to change the minimum document frequency in CountVectorizer or use a different distance metric in HDBSCAN or UMAP.
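
To illustrate why the choice of distance metric matters, the sketch below compares Euclidean and cosine distance on made-up two-dimensional vectors. The vectors and helper functions are hypothetical; real document embeddings have hundreds of dimensions, but the same idea applies.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on vector direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors for illustration only.
short_doc = [1.0, 2.0]
long_doc = [2.0, 4.0]   # same direction as short_doc, twice the magnitude
other_doc = [2.0, 1.0]  # different direction, same magnitude as short_doc

print("euclidean:", euclidean_distance(short_doc, long_doc), euclidean_distance(short_doc, other_doc))
print("cosine:   ", cosine_distance(short_doc, long_doc), cosine_distance(short_doc, other_doc))
```

Under Euclidean distance `other_doc` is closer to `short_doc`, while under cosine distance `long_doc` is (up to floating-point error) at distance zero. Which behavior suits you depends on whether magnitude carries meaning in your embeddings.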

## UMAP
UMAP is an amazing technique for dimensionality reduction. In BERTopic, it is used to reduce the dimensionality of the document embeddings into something that is easier to use with HDBSCAN in order to create good clusters.

However, it has a significant number of parameters you could take into account. As exposing all of them in BERTopic would be difficult to manage, we can instantiate our own UMAP model and pass it to BERTopic:

In [None]:
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine')
topic_model = BERTopic(umap_model=umap_model).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4990 | -1_was_if_have_are |
| 1 | 64 | 752 | 64_windows_window_thanks_program |
| 2 | 63 | 593 | 63_mac_apple_bus_memory |
| 3 | 23 | 417 | 23_hockey_game_nhl_players |
| 4 | 22 | 407 | 22_game_pitching_baseball_runs |


## HDBSCAN
After reducing the embeddings with UMAP, we use HDBSCAN to cluster our documents into groups of similar documents. Similar to UMAP, HDBSCAN has many parameters that could be tweaked to improve the quality of the clusters.

In [None]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 4547 | -1_can_any_if_out |
| 1 | 40 | 821 | 40_god_jesus_bible_church |
| 2 | 25 | 437 | 25_pitching_baseball_hit_last |
| 3 | 8 | 422 | 8_hockey_game_nhl_flyers |
| 4 | 3 | 389 | 3_space_nasa_lunar_satellite |


## CountVectorizer
Lastly, in order to create our topic representation, we use the CountVectorizer to extract all possible words. Perhaps you want to remove specific stop words or use a different tokenizer. Simply instantiate your CountVectorizer and pass it to BERTopic:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(2, 2), stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model).fit(docs)

In [None]:
topic_model.get_topic_info().head(5)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5316 | -1_thanks advance_hard disk_anonymous ftp_id like |
| 1 | 3 | 427 | 3_space station_space shuttle_space center_spa... |
| 2 | 18 | 425 | 18_cubs suck_suck cubs_red sox_home runs |
| 3 | 21 | 391 | 21_stanley cup_eric lindros_ice time_bobby clarke |
| 4 | 33 | 311 | 33_key escrow_private key_encryption technolog... |
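
Removing specific stop words works the same way: `stop_words` also accepts your own list. The sketch below tries this on a tiny made-up corpus, so the documents and the stop-word list are assumptions chosen for illustration. It only inspects the learned vocabulary and does not run BERTopic.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus for illustration only.
toy_docs = [
    "thanks in advance for any help",
    "the space shuttle launch was delayed",
    "thanks for the info on the space station",
]

# A custom stop-word list instead of the built-in "english" option.
custom_stop_words = ["thanks", "advance", "the", "for", "in", "on", "any", "was"]
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=custom_stop_words)
vectorizer_model.fit(toy_docs)

# vocabulary_ maps each extracted term to its column index.
print(sorted(vectorizer_model.vocabulary_))
```

Stop words are removed before n-grams are built, so terms such as "thanks advance" can no longer appear. A vectorizer configured like this can be passed to BERTopic through `vectorizer_model`, as in the cell above.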
