# How to use BERTopic

## Pre-Requisites

In [2]:
from bertopic import BERTopic



## Using an Existing Model - BERTopic_Wikipedia

In [3]:
topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")

Once loaded, the model can predict topics for new documents:

In [5]:
topic, prob = topic_model.transform("This is an incredible movie!")

Batches: 100%|██████████| 1/1 [00:00<00:00, 15.94it/s]
2024-08-01 17:23:06,140 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


We can then look up the label of the predicted topic:

In [7]:
topic_index = topic[0]
label = topic_model.topic_labels_[topic_index]
print(label)

64_rating_rated_cinematography_film
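
The log message above says that this loaded model assigns topics through cosine similarity between topic and document embeddings. The sketch below illustrates that idea with toy three-dimensional vectors. It is not BERTopic's implementation: `cosine_similarity`, `assign_topic`, and all embedding values are hypothetical, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_topic(doc_embedding, topic_embeddings):
    """Return the topic id whose embedding is most similar to the document."""
    scores = {
        topic: cosine_similarity(doc_embedding, embedding)
        for topic, embedding in topic_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy values for illustration only (assumed, not taken from the model)
toy_topic_embeddings = {
    0: [1.0, 0.0, 0.0],
    1: [0.0, 1.0, 0.0],
    64: [0.0, 0.6, 0.8],
}
toy_doc_embedding = [0.1, 0.5, 0.9]

best_topic, best_score = assign_topic(toy_doc_embedding, toy_topic_embeddings)
print(best_topic, round(best_score, 3))
```

The document is assigned to whichever topic vector points in the most similar direction, regardless of vector length.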


## Quick Start - SKLearn fetch_20newsgroups Dataset

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

In [8]:
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

In [9]:
topic_model = BERTopic()

In [10]:
topics, probs = topic_model.fit_transform(docs)

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
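
The tokenizers warning above suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch of doing so from Python, assuming it runs at the top of the notebook before any model is loaded; the choice of `"false"` is an assumption that trades some tokenization speed for a quieter log.

```python
import os

# Set this before importing bertopic so the tokenizers library sees it.
# "false" disables tokenizer parallelism and silences the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

The warning is informational, so the topics produced by the model should not depend on this setting.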

After generating topics and their probabilities, we can inspect the generated topics, sorted by how many documents each contains:

In [11]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6554 | -1_to_the_and_of | [to, the, and, of, is, for, it, you, in, that] | [\n\nI assume you are posting to encourage com... |
| 1 | 0 | 1828 | 0_game_team_games_he | [game, team, games, he, players, season, hocke... | [Scoring stats for the Swedish NHL players, Ap... |
| 2 | 1 | 603 | 1_key_clipper_chip_encryption | [key, clipper, chip, encryption, keys, escrow,... | [The following document summarizes the Clipper... |
| 3 | 2 | 524 | 2_ites_cheek_yep_huh | [ites, cheek, yep, huh, ken, why, each, of, , ] | [\nHuh?, \nYep.\n, \n \n ... |
| 4 | 3 | 469 | 3_fbi_batf_koresh_fire | [fbi, batf, koresh, fire, compound, they, gas,... | [From article <C5sEGz.Mwr@dscomsa.desy.de>, by... |
| ... | ... | ... | ... | ... | ... |
| 206 | 205 | 10 | 205_memory_shared_server_pixmaps | [memory, shared, server, pixmaps, xputimage, e... | [\n There's documentation on how to use the s... |
| 207 | 206 | 10 | 206_c610_610_centris_iivx | [c610, 610, centris, iivx, problems, dealer, a... | [\n\n\n\n\nSounds to me like your dealer reall... |
| 208 | 207 | 10 | 207_xterm_f1_xterms_bold | [xterm, f1, xterms, bold, 0x0, 0xffbe, directi... | [\n\n\n\nI have tried that with one font, if y... |
| 209 | 208 | 10 | 208_error_chimes_simm_memory | [error, chimes, simm, memory, ami, parity, sim... | [In the last two weeks I have the following pr... |
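
The Count column can also be reasoned about from the `topics` list returned by `fit_transform`, assuming it holds one integer topic id per document. The sketch below counts topic sizes while setting aside topic -1. `toy_topics` is a hypothetical stand-in for the real list, so the numbers do not match the table above.

```python
from collections import Counter

# Hypothetical stand-in for the `topics` list returned by fit_transform
toy_topics = [0, 4, -1, 33, 73, 0, -1, 0, 4, -1]

counts = Counter(toy_topics)

# Remove topic -1 before ranking the remaining topics
outlier_count = counts.pop(-1, 0)
outlier_share = outlier_count / len(toy_topics)

# The two largest remaining topics as (topic id, document count) pairs
most_common = counts.most_common(2)

print(most_common, outlier_count, outlier_share)
```

Tracking the outlier share separately helps show how much of the corpus was left unassigned.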


Topic -1 collects all outlier documents and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

In [12]:
topic_model.get_topic(0)

[('game', 0.010557916154626177),
 ('team', 0.009186449229913312),
 ('games', 0.007314238534349736),
 ('he', 0.007248888118315507),
 ('players', 0.006405204912641511),
 ('season', 0.006322366930246725),
 ('hockey', 0.006207963248642722),
 ('play', 0.005879832801609422),
 ('25', 0.005752403395946006),
 ('year', 0.005714676939247958)]
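
Judging from the outputs shown so far, a default topic name appears to be the topic id followed by its four highest-scoring words, joined by underscores. The sketch below reproduces that pattern for topic 0. `build_topic_name` is a hypothetical helper written for illustration, not part of BERTopic, and the rule itself is inferred from these outputs.

```python
def build_topic_name(topic_id, words_with_scores, n_words=4):
    """Join the topic id and its first n_words words with underscores."""
    top_words = [word for word, _ in words_with_scores[:n_words]]
    return "_".join([str(topic_id)] + top_words)


# First five (word, score) pairs copied from the get_topic(0) output above
topic_0 = [
    ("game", 0.010557916154626177),
    ("team", 0.009186449229913312),
    ("games", 0.007314238534349736),
    ("he", 0.007248888118315507),
    ("players", 0.006405204912641511),
]

name = build_topic_name(0, topic_0)
print(name)  # 0_game_team_games_he
```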

Using .get_document_info, we can also extract document-level information, such as each document's topic, its probability, and whether it is a representative document for that topic:

In [13]:
topic_model.get_document_info(docs)

| | Document | Topic | Name | Representation | Representative_Docs | Top_n_words | Probability | Representative_document |
|---|---|---|---|---|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | 0 | 0_game_team_games_he | [game, team, games, he, players, season, hocke... | [Scoring stats for the Swedish NHL players, Ap... | game - team - games - he - players - season - ... | 1.000000 | False |
| 1 | My brother is in the market for a high-perform... | 4 | 4_card_monitor_video_drivers | [card, monitor, video, drivers, vga, monitors,... | [Hello all.\n\tI am thinking about buying an e... | card - monitor - video - drivers - vga - monit... | 0.929315 | False |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | -1 | -1_to_the_and_of | [to, the, and, of, is, for, it, you, in, that] | [\n\nI assume you are posting to encourage com... | to - the - and - of - is - for - it - you - in... | 0.000000 | False |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | 33 | 33_scsi_scsi2_scsi1_ide | [scsi, scsi2, scsi1, ide, controller, bus, tra... | [You are making the same mistake I did: you ar... | scsi - scsi2 - scsi1 - ide - controller - bus ... | 0.569794 | False |
| 4 | 1)    I have an old Jasmine drive which I cann... | 73 | 73_tape_backup_tapes_drive | [tape, backup, tapes, drive, device, wangdat, ... | [hello all- i have a problem with my micro sol... | tape - backup - tapes - drive - device - wangd... | 0.596123 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18841 | DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD... | 38 | 38_cancer_medical_doctor_medicine | [cancer, medical, doctor, medicine, patient, a... | [This will be the first of monthly postings of... | cancer - medical - doctor - medicine - patient... | 0.757451 | False |
| 18842 | \nNot in isolated ground recepticles (usually ... | 172 | 172_ground_grounding_conductor_neutral | [ground, grounding, conductor, neutral, wire, ... | [\nNot according to the NEC nor the CEC, as ex... | ground - grounding - conductor - neutral - wir... | 0.536640 | False |
| 18843 | I just installed a DX2-66 CPU in a clone mothe... | 81 | 81_fan_cpu_heat_sink | [fan, cpu, heat, sink, fans, cooling, chip, ho... | [N(P>Just got a 66MHz 486DX2 system, and am co... | fan - cpu - heat - sink - fans - cooling - chi... | 1.000000 | False |
| 18844 | \nWouldn't this require a hyper-sphere.  In 3-... | 21 | 21_den_polygon_points_algorithm | [den, polygon, points, algorithm, xxxx, sphere... | [\nSorry!! :-)\n\nCall the four points A, B, C... | den - polygon - points - algorithm - xxxx - sp... | 1.000000 | False |
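
One common use of document-level output is keeping only confident, non-outlier assignments. The sketch below applies that filter to the first five rows of the table above, written as plain tuples. `confident_assignments` and the 0.9 threshold are hypothetical choices for illustration, not BERTopic defaults.

```python
# (document index, topic, probability) for the first five rows of the table above
rows = [
    (0, 0, 1.000000),
    (1, 4, 0.929315),
    (2, -1, 0.000000),
    (3, 33, 0.569794),
    (4, 73, 0.596123),
]


def confident_assignments(rows, min_probability=0.9):
    """Keep (document, topic) pairs that are not outliers and meet the threshold."""
    return [
        (doc, topic)
        for doc, topic, probability in rows
        if topic != -1 and probability >= min_probability
    ]


confident = confident_assignments(rows)
print(confident)  # [(0, 0), (1, 4)]
```

Lowering `min_probability` keeps more documents at the cost of including less certain assignments.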
