# Topic Modelling using BERTopic & cuBERTopic

This sample notebook demonstrates cuBERTopic, a topic modelling technique built on top of the NVIDIA RAPIDS ecosystem. It uses libraries such as `cudf` and `cuml` to GPU-accelerate the end-to-end workflow for extracting topics from a set of documents.

### Installing relevant packages
Here we import the relevant dependencies for both `BERTopic` and `cuBERTopic`, since we compare the performance of the two.

In [1]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from cuBERTopic import gpu_bertopic

docs = fetch_20newsgroups(subset='all')['data']

### Running `BERTopic`
`BERTopic` lets us supply custom embeddings, so we create sentence embeddings using a `SentenceTransformer` model and pass them to the `fit_transform` method of the `BERTopic` class, which fits the model on a collection of documents, generates topics, and returns the topic assigned to each document together with its probability.

In [2]:
%%time
model_sbert = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model_sbert.encode(
    docs,
    show_progress_bar=True,
    batch_size=64,
    convert_to_numpy=True,
)
topic_model = BERTopic()
topics_cpu, probs_cpu = topic_model.fit_transform(docs, embeddings)

Batches:   0%|          | 0/295 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
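The `huggingface/tokenizers` warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is run in a cell before any embeddings are computed; the value `"false"` is one of the two options listed in the warning and is chosen here only to silence the message.

```python
import os

# Set this before the tokenizer is first used (i.e. before calling `encode`),
# so the library does not have to disable parallelism itself after a fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting it to `"true"` instead keeps tokenizer parallelism on; which is preferable depends on the workload.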

`get_topic_info` returns information about each topic, including its id, frequency, and name.

In [3]:
%%time
topic_model.get_topic_info()

CPU times: user 3.56 ms, sys: 644 µs, total: 4.2 ms
Wall time: 3.42 ms


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5641 | -1_file_program_available_system |
| 1 | 0 | 852 | 0_baseball_game_team_year |
| 2 | 1 | 460 | 1_drive_scsi_ide_drives |
| 3 | 2 | 391 | 2_gun_guns_militia_firearms |
| 4 | 3 | 246 | 3_address_lyme_internet_mailing |
| ... | ... | ... | ... |
| 333 | 340 | 11 | 340_timing_ultralong_timer_snow |
| 331 | 336 | 11 | 336_finnish_finland_players_team |
| 343 | 342 | 10 | 342_fbi_atf_compound_press |
| 344 | 343 | 10 | 343_morality_objective_maddi_conner |
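To make the columns concrete, here is a small sketch of how a table like this can be derived from per-document topic assignments. It is illustrative only and not `BERTopic`'s implementation: the helper `topic_info` and the toy data are hypothetical, and the naming rule (topic id followed by the top four words) is inferred from the output above. Topic `-1` is the outlier topic, holding documents that were not assigned to any cluster.

```python
from collections import Counter

def topic_info(assignments, top_words):
    """Build Topic/Count/Name rows, most frequent topic first (hypothetical helper)."""
    rows = []
    for topic, count in Counter(assignments).most_common():
        # Name = topic id joined with the topic's first four top words.
        name = "_".join([str(topic)] + top_words[topic][:4])
        rows.append({"Topic": topic, "Count": count, "Name": name})
    return rows

# Toy data: one topic id per document, plus ranked words per topic.
assignments = [-1, 0, 0, -1, 1, -1, 0, -1]
top_words = {
    -1: ["file", "program", "available", "system"],
    0: ["baseball", "game", "team", "year", "players"],
    1: ["drive", "scsi", "ide", "drives"],
}

rows = topic_info(assignments, top_words)
for row in rows:
    print(row)
```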


`get_topic` returns the top n words for a given topic together with their c-TF-IDF scores.

In [4]:
%%time
topic_model.get_topic(0)

CPU times: user 10 µs, sys: 0 ns, total: 10 µs
Wall time: 18.8 µs


[('baseball', 0.006927860169233681),
 ('game', 0.005870744662191573),
 ('team', 0.005762317236031435),
 ('year', 0.0055254672406377615),
 ('players', 0.005482641720390248),
 ('braves', 0.005254078970830193),
 ('hit', 0.0050523638242629),
 ('games', 0.004971296262296893),
 ('runs', 0.00477009673760912),
 ('pitching', 0.004506804418804296)]
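The scores above are class-based TF-IDF values. As a rough sketch of the idea, assuming the formulation `tf(t, c) * log(1 + A / f(t))`, where `tf(t, c)` is the normalized frequency of term `t` in topic `c`, `f(t)` is the term's frequency across all topics, and `A` is the average number of words per topic: the function `c_tf_idf` and the toy tokens below are hypothetical, and the library's own implementation may differ in details such as tokenization and normalization.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs maps a topic id to the tokens of all its documents combined."""
    counts = {c: Counter(tokens) for c, tokens in class_docs.items()}

    # f(t): frequency of each term summed over all topics.
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)

    # A: average number of tokens per topic.
    avg_words = sum(len(t) for t in class_docs.values()) / len(class_docs)

    scores = {}
    for c, cnt in counts.items():
        n = sum(cnt.values())  # tokens in this topic, used to normalize tf
        scores[c] = {
            t: (f / n) * math.log(1 + avg_words / total[t])
            for t, f in cnt.items()
        }
    return scores

toy = {
    0: "baseball game team baseball pitching".split(),
    1: "gun guns militia gun team".split(),
}
scores = c_tf_idf(toy)
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

In the toy data, `team` appears in both topics, so it scores lower in topic 0 than `game`, which occurs equally often there but in no other topic.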

### Running `cuBERTopic`
`cuBERTopic` provides a similar API: we pass in `docs` as a collection of strings to model. A `SentenceTransformer` model is used by default to create the embeddings in this case.

In [5]:
%%time
gpu_topic = gpu_bertopic()
topics_gpu, probs_gpu = gpu_topic.fit_transform(docs)

Label prop iterations: 25
Label prop iterations: 8
Label prop iterations: 5
Label prop iterations: 3
Label prop iterations: 3
Label prop iterations: 2
Iterations: 6
5198,170,368,18,316,1536
Label prop iterations: 3
Iterations: 1
3409,45,101,4,59,130
Label prop iterations: 2
Iterations: 1
3588,57,101,5,78,97
CPU times: user 3min 37s, sys: 15.5 s, total: 3min 52s
Wall time: 35.1 s


In [6]:
%%time
gpu_topic.get_topic_info()

CPU times: user 17.4 ms, sys: 36 µs, total: 17.5 ms
Wall time: 16 ms


| | Topic | Count | Name |
|---|---|---|---|
| 189 | -1 | 6628 | -1_file_available_email_program |
| 58 | 0 | 842 | 0_baseball_game_team_year |
| 168 | 1 | 351 | 1_gun_guns_militia_weapons |
| 2 | 2 | 227 | 2_armenian_turkish_armenians_armenia |
| 177 | 3 | 225 | 3_clayton_cramer_gay_homosexual |
| ... | ... | ... | ... |
| 254 | 349 | 10 | 349_rle_tga_povray_convert |
| 261 | 350 | 10 | 350_gif_gifs_fenway_bump |
| 276 | 351 | 10 | 351_varma_seema_pcboard_seemamadvlsicolumbiaedu |
| 325 | 352 | 10 | 352_parsli_thomaspifiuiono_quisling_thomas |


In [7]:
%%time
gpu_topic.get_topic(0)

CPU times: user 12 µs, sys: 1e+03 ns, total: 13 µs
Wall time: 21.2 µs


[('baseball', array(0.00690068)),
 ('game', array(0.00595235)),
 ('team', array(0.00571799)),
 ('year', array(0.00546365)),
 ('players', array(0.00538848)),
 ('braves', array(0.00523236)),
 ('hit', array(0.00501995)),
 ('games', array(0.00499027)),
 ('runs', array(0.00474058)),
 ('pitching', array(0.00449068))]
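One way to check that the CPU and GPU runs agree on topic 0 is to compare their top words and scores. The sketch below hard-codes the first five entries of each `get_topic(0)` output shown in this notebook; the helper `compare_topics` is hypothetical. Since the GPU scores come back as 0-d arrays, `float()` is applied to each score so both lists hold plain numbers (here the GPU values are already written as floats to keep the example free of dependencies).

```python
def compare_topics(words_a, words_b):
    """Return (Jaccard overlap of the words, largest score gap on shared words)."""
    a = {w: float(s) for w, s in words_a}  # float() also unwraps 0-d arrays
    b = {w: float(s) for w, s in words_b}
    shared = a.keys() & b.keys()
    jaccard = len(shared) / len(a.keys() | b.keys())
    max_gap = max((abs(a[w] - b[w]) for w in shared), default=0.0)
    return jaccard, max_gap

cpu_top = [
    ("baseball", 0.006927860169233681),
    ("game", 0.005870744662191573),
    ("team", 0.005762317236031435),
    ("year", 0.0055254672406377615),
    ("players", 0.005482641720390248),
]
gpu_top = [
    ("baseball", 0.00690068),
    ("game", 0.00595235),
    ("team", 0.00571799),
    ("year", 0.00546365),
    ("players", 0.00538848),
]

jaccard, max_gap = compare_topics(cpu_top, gpu_top)
print(jaccard, max_gap)
```

With real models, the same helper can be called as `compare_topics(topic_model.get_topic(0), gpu_topic.get_topic(0))`, bearing in mind that topic ids are not guaranteed to line up between two separate runs.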