#Semi-Supervised Topic Modeling

In this notebook, we will be looking at a new feature of BERTopic, namely (semi)-supervised topic modeling! This allows us to steer the dimensionality reduction of the embeddings into a space that closely follows any labels you might already have. 

## Semi-supervised modeling
(semi)-supervised topic modeling is a class of methods that allows the user to perform topic modeling with previously defined labels. This might help nudge the model towards specific topics or classes for which you have labels.

# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# Installing BERTopic

We start by installing BERTopic from PyPi:

In [1]:
%%capture
!pip install bertopic

## Restart the Notebook
After installing BERTopic, some packages that were already loaded were updated and in order to correctly use them, we should now restart the notebook.

From the Menu:

Runtime → Restart Runtime

# **Data**
For this example, we use the popular 20 Newsgroups dataset which contains roughly 18000 newsgroups posts that each is assigned to one of 20 topics:

In [1]:
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
docs = data["data"]
targets = data["target"]
target_names = data["target_names"]
classes = [data["target_names"][i] for i in data["target"]]

In [4]:
len(targets), targets[:10]

(18846, array([10,  3, 17,  3,  4, 12,  4, 10, 10, 19]))

In [8]:
classes[17]

'sci.electronics'

Each document can be put into one of the following categories:

In [None]:
target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

# **(semi)-Supervised modeling**


## Basic Model
Before we start with semi-supervised modeling, let us first take a look at the output of the basic model.

In [9]:
topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(docs)

Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/612 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/350 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/349 [00:00<?, ?B/s]

Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2022-12-30 16:27:39,837 - BERTopic - Transformed documents to Embeddings
2022-12-30 16:28:18,493 - BERTopic - Reduced dimensionality
2022-12-30 16:28:21,523 - BERTopic - Clustered reduced embeddings


In [10]:
topic_model.get_topic_info().head(10)

Unnamed: 0,Topic,Count,Name
0,-1,6266,-1_to_the_you_of
1,0,1837,0_game_team_games_he
2,1,637,1_key_clipper_chip_encryption
3,2,527,2_ites_cheek_yep_huh
4,3,460,3_israel_israeli_jews_arab
5,4,425,4_drive_scsi_drives_ide
6,5,424,5_monitor_card_video_drivers
7,6,258,6_god_atheists_atheism_atheist
8,7,248,7_you_context_your_jim
9,8,220,8_ram_sale_drive_price


In [11]:
topic_model.get_topic_info().tail(10)

Unnamed: 0,Topic,Count,Name
200,199,11,199_zoroastrians_worshipped_temple_religion
201,200,10,200_religion_supreme_arf_relation
202,201,10,201_sound_soundbase_stereo_mono
203,202,10,202_cult_cults_religion_distinguishes
204,203,10,203_homosexual_cramer_lesbians_men
205,204,10,204_68070_68040_motorola_instruction
206,205,10,205_jacket_leather_aerostich_piece
207,206,10,206_bits_bit_color_screen
208,207,10,207_probe_mars_spacecraft_mission
209,208,10,208_jpeg_image_gif_gamma


The topics that were created mostly make sense. There are some clearly defined topics such as "nasa, orbit, spacecraft, moon" but also some topics that seem mostly derived from other topics. We can visualize this by extracting the topic representations per class and see if our unsupervised model closely resembles this. 

**NOTE**: You can **hover** over the bars to see the representation per class!!

In [12]:
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
fig_unsupervised = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10)
fig_unsupervised

20it [00:07,  2.82it/s]


In [14]:
topics_per_class.head(10)

Unnamed: 0,Topic,Words,Frequency,Class,Name
0,-1,"my, this, motherboard, it, to",287,comp.sys.ibm.pc.hardware,-1_to_the_you_of
1,0,"quinea, papua, translation, buying, seriously",1,comp.sys.ibm.pc.hardware,0_game_team_games_he
2,2,"ites, cheek, yep, huh, ken",18,comp.sys.ibm.pc.hardware,2_ites_cheek_yep_huh
3,4,"drive, scsi, ide, drives, controller",234,comp.sys.ibm.pc.hardware,4_drive_scsi_drives_ide
4,5,"card, monitor, drivers, video, diamond",147,comp.sys.ibm.pc.hardware,5_monitor_card_video_drivers
5,8,"monitor, mb, card, pc, ram",47,comp.sys.ibm.pc.hardware,8_ram_sale_drive_price
6,9,"gcc, libc, bad, ram, kernel",1,comp.sys.ibm.pc.hardware,9_xterm_x11r5_server_error
7,11,"blasters, sound, pros, oz, digitized",1,comp.sys.ibm.pc.hardware,11_audio_amp_stereo_condition
8,16,"printer, postscript, printers, print, paper",12,comp.sys.ibm.pc.hardware,16_printer_print_hp_deskjet
9,18,"v121, reconsidered, cnet, uci, thank",3,comp.sys.ibm.pc.hardware,18_health_tobacco_disease_cesarean


In [15]:
topics_per_class.tail(10)

Unnamed: 0,Topic,Words,Frequency,Class,Name
925,135,"revolver, kratz, gun, safety, glock",18,talk.politics.guns,135_revolver_kratz_gun_safety
926,136,"parole, women, death, hope, life",1,talk.politics.guns,136_speech_motto_freedom_protesters
927,145,"kent, collegestudent, liberalbashing, eduer, g...",2,talk.politics.guns,145_professors_university_schools_partyi...
928,150,"kkk, huh, violent, disengenuous, plessy",1,talk.politics.guns,150_blacks_african_white_culture
929,159,"weapon, nuclear, plutonium, weapons, reactors",7,talk.politics.guns,159_plutonium_nuclear_weapon_clancy
930,167,"bill, s414, bills, brady, senate",14,talk.politics.guns,167_bill_s414_bills_brady
931,178,"media, toque, empowerment, housebreaker, bittle",2,talk.politics.guns,178_media_nw_washington_dc
932,179,"hb, sb, committee, firearms, defeated",11,talk.politics.guns,179_hb_sb_committee_firearms
933,200,"wince, giveth, taketh, law, establish",1,talk.politics.guns,200_religion_supreme_arf_relation
934,202,"ha, cults, burned, escpecially, cult",2,talk.politics.guns,202_cult_cults_religion_distinguishes


The results do seem promising. Topics like "nasa, space, etc" seem to be clearly related to sci.space, but some topics were created that span many categories. For example, we expect the topic "bike, bikes, etc"  to only appear in rec.motorcycles.  

## Semi-supervised
In the example above you might notice that some topics were somewhat smushed together. What we would like to see is a clear separation between those topics. Fortunately, we have some labels and can use them to improve the model. 

Since we are not interested in any other topics, this method is called semi-supervised topic modeling. In practice, this means that we have the labels of some documents but not all. 

For this example let's say we only have the labels of all computer-related categories:

In [17]:
 labels_to_add = ['comp.graphics', 'comp.os.ms-windows.misc',
                  'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                  'comp.windows.x',]
 indices = [target_names.index(label) for label in labels_to_add]
 new_labels = [label if label in indices else -1 for label in targets]

In [18]:
indices

[1, 2, 3, 4, 5]

In [20]:
new_labels[:10]

[-1, 3, -1, 3, 4, -1, 4, -1, -1, -1]

When generating our new labels it is important to mark unknown classes as **-1**. Next, we use those newly constructed labels to again run BERTopic:

In [21]:
topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(docs, y=new_labels)

Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2022-12-30 16:41:10,977 - BERTopic - Transformed documents to Embeddings
2022-12-30 16:41:36,647 - BERTopic - Reduced dimensionality
2022-12-30 16:41:39,343 - BERTopic - Clustered reduced embeddings


In [22]:
topic_model.get_topic_info().head(10)

Unnamed: 0,Topic,Count,Name
0,-1,6403,-1_of_to_the_and
1,0,1827,0_game_team_games_he
2,1,876,1_window_server_motif_widget
3,2,571,2_key_clipper_chip_encryption
4,3,528,3_whatta_ites_cheek_hi
5,4,453,4_israel_israeli_jews_arab
6,5,212,5_post_jim_you_context
7,6,212,6_bike_riding_ride_my
8,7,192,7_car_cars_ford_mustang
9,8,165,8_space_launch_nasa_shuttle


In [23]:
topic_model.get_topic_info().tail(10)

Unnamed: 0,Topic,Count,Name
199,198,11,198_bullet_wounds_brass_ammunition
200,199,11,199_jacket_pants_piece_pocket
201,200,11,200_cview_temp_directory_files
202,201,11,201_homosexual_cramer_men_lesbians
203,202,11,202_abortion_abortions_women_choice
204,203,11,203_rights_right_foa_association
205,204,10,204_church_cell_churches_choosing
206,205,10,205_s414_bill_brady_senate
207,206,10,206_manhattan_bobbeviceicotekcom_beauchaine_sank
208,207,10,207_ic_identifying_connector_e2prom


Finally, we can again extract the topics per class to see if our semi-supervised approach had some effect:

In [30]:
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
fig_semi_supervised = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10, width=900)
fig_semi_supervised

20it [00:05,  3.82it/s]


We can clearly see that many more topics about computers were created and that the seperation between those topics are solid. This indicates that even if you do not have all the labels, you can definitely improve the model!

However, there are still some clusters that could be improved with the labels that we have. 

In [26]:
topics_per_class.head(10)

Unnamed: 0,Topic,Words,Frequency,Class,Name
0,-1,"scsi, drive, card, bus, have",376,comp.sys.ibm.pc.hardware,-1_of_to_the_and
1,0,"quinea, papua, performance, ranges, absurd",2,comp.sys.ibm.pc.hardware,0_game_team_games_he
2,1,"gtak110zip, ralf, aspi, unx, box",1,comp.sys.ibm.pc.hardware,1_window_server_motif_widget
3,3,"whatta, ites, cheek, hi, yep",18,comp.sys.ibm.pc.hardware,3_whatta_ites_cheek_hi
4,6,"annoying, dir, sitting, notice, talking",2,comp.sys.ibm.pc.hardware,6_bike_riding_ride_my
5,9,"blasters, sound, pros, oz, digitized",1,comp.sys.ibm.pc.hardware,9_amp_condition_speakers_audio
6,12,"v121, frankcsyorkuca, pikelner, reconsidered, ...",4,comp.sys.ibm.pc.hardware,12_health_tobacco_disease_cesarean
7,15,"printer, printers, cpi, hp, printing",6,comp.sys.ibm.pc.hardware,15_printer_hp_deskjet_ink
8,18,"modem, modems, uart, dce, courier",52,comp.sys.ibm.pc.hardware,18_modem_modems_fax_uart
9,20,"drive, drives, disk, bios, controller",102,comp.sys.ibm.pc.hardware,20_drive_drives_disk_bios


In [28]:
topics_per_class.tail(20)

Unnamed: 0,Topic,Words,Frequency,Class,Name
849,89,"feudal, enserf, enserfing, mistreat, serfs",1,talk.politics.guns,89_government_libertarians_libertarian_r...
850,94,"water, science, smoking, scientists, barrier",1,talk.politics.guns,94_water_dept_phd_environmental
851,98,"letter, president, bentsen, mr, myers",5,talk.politics.guns,98_myers_stephanopoulos_president_ms
852,99,"tpg, committee, somebody, died, faq",1,talk.politics.guns,99_graphics_ray_3d_widget
853,113,"bureaucracy, wallet, insurance, filing, paperwork",1,talk.politics.guns,113_insurance_health_private_care
854,118,"document, room, libernet, 3456, facists",6,talk.politics.guns,118_junk_advertising_mail_house
855,121,"maxaxaxaxaxaxaxaxaxaxaxaxaxaxax, meletbo0bcbkd...",1,talk.politics.guns,121_maxaxaxaxaxaxaxaxaxaxaxaxaxaxax_mg9v...
856,136,"revolver, gun, kratz, safety, glock",20,talk.politics.guns,136_revolver_gun_kratz_safety
857,137,"hes, goddam, possessions, hero, dumb",1,talk.politics.guns,137_christian_christianity_oo_philosophe...
858,138,"tpg, restricting, schabel, derisively, rights",1,talk.politics.guns,138_freedom_motto_speech_morally


## Supervised

Finally, we are going to be using all labels. These labels help BERTopic understand where most clusters can be found. However, this does not mean that it will only find the 20 clusters that we have defined. If there are sub-clusters to be found, then there is a good chance BERTopic will find them! 

In [31]:
topic_model = BERTopic(verbose=True)
topics, _ = topic_model.fit_transform(docs, y=targets)

Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2022-12-30 16:55:26,027 - BERTopic - Transformed documents to Embeddings
2022-12-30 16:55:48,801 - BERTopic - Reduced dimensionality
2022-12-30 16:55:51,304 - BERTopic - Clustered reduced embeddings


In [32]:
topic_model.get_topic_info().head(10)

Unnamed: 0,Topic,Count,Name
0,-1,4782,-1_the_to_is_of
1,0,921,0_space_launch_nasa_orbit
2,1,919,1_game_he_year_baseball
3,2,898,2_car_cars_engine_ford
4,3,857,3_gun_guns_firearms_fbi
5,4,811,4_image_jpeg_images_graphics
6,5,637,5_key_clipper_chip_encryption
7,6,527,6_sheeesh_okay_ites_cheek
8,7,378,7_israel_israeli_arab_arabs
9,8,298,8_drive_scsi_ide_drives


In [33]:
topic_model.get_topic_info().tail(10)

Unnamed: 0,Topic,Count,Name
222,221,11,221_keycode_key_xmodmap_emacs
223,222,11,222_boards_solder_mask_green
224,223,11,223_ear_wax_hearing_ears
225,224,11,224_needles_acupuncture_needle_syringe
226,225,11,225_tv_tape_vcr_flyback
227,226,11,226_cview_temp_directory_files
228,227,11,227_slip_packet_0x60_goto
229,228,10,228_cullen_goals_biggest_sanderson
230,229,10,229_noise_pink_octave_hz
231,230,10,230_ku_kong_powerbook_hong


Not only do we see a nice seperation of the topics, there are significantly less outliers which shows that BERTopic has improved in connecting the documents to topics. 

Let's see the results by again visualizing the topic representation per class:

In [34]:
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
fig_supervised = topic_model.visualize_topics_per_class(topics_per_class, top_n_topics=10, width=900)
fig_supervised

20it [00:05,  3.96it/s]


Now that we have used all labels, BERTopic seems to closely match our pre-defined labels. Moreover, it still allows to discover topics that were not previously defined. Thus, you can use this method to find unknown topics in pre-defined topics!