<a href="https://colab.research.google.com/github/czymaraclass/TopicModelling/blob/main/topic_models_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic models in Python

## Install BERTopic library

To install BERTopic in Python, use the following command:

In [None]:
pip install bertopic

Collecting bertopic
  Downloading bertopic-0.15.0-py2.py3-none-any.whl (143 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap-learn-0.5.4.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting sentence-transformers>=0.4.1 (from bertopic)
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)

## Data: AG news articles collection

To compare results, we will work with the same data as in the R exercise. Load and prepare the AG data set using the following code (doing the same as we did before in R):

In [None]:
import pandas as pd
import numpy as np

newspaper_data = pd.read_csv("https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv", header = None)

# draw 10 percent random sample
np.random.seed(1337)
newspaper_data = newspaper_data.sample(frac=0.1)

# prepare
newspaper_data['title'] = newspaper_data[1]
newspaper_data['description'] = newspaper_data[2]

newspaper_data['class'] = newspaper_data[0].replace({1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'})

newspaper_data['class'].value_counts()

Sci/Tech    3063
Business    3014
Sports      2966
World       2957
Name: class, dtype: int64

Show first 3 rows of the data:

In [None]:
print(newspaper_data.iloc[0:3, 3:6]) # Note that Python indexing starts at 0 and slices exclude the end position

                                                   title  \
86110     Oracle to drop PeopleSoft suit if tender fails   
74390  NTT DoCoMo, IBM, Intel team to secure mobile d...   
77491    Election Is Crunch Time for U.S. Secret Service   

                                             description     class  
86110  Oracle Corp. notified Delaware's Court of Chan...  Sci/Tech  
74390  With an eye towards making mobile devices and ...  Sci/Tech  
77491  With just days to go before the U.S. president...  Sci/Tech  
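
The slicing convention mentioned in the comment above applies to plain Python sequences as well as to `iloc`. A minimal sketch, assuming the column labels that result from the preparation step (three unnamed CSV columns followed by the three added ones):

```python
# Column labels after the preparation step: three unnamed CSV columns
# (0, 1, 2) followed by the three columns added above.
columns = [0, 1, 2, "title", "description", "class"]

# A slice start:stop includes `start` and excludes `stop`.
selected = columns[3:6]
print(selected)  # ['title', 'description', 'class']

# Rows 0:3 are the positions 0, 1 and 2, i.e. three rows.
row_positions = list(range(0, 3))
print(row_positions)  # [0, 1, 2]
```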


## Preprocessing the data

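Preprocessing is skipped here. BERTopic builds on sentence embeddings from a pretrained transformer model, which work on raw text, so steps such as stemming or stop-word removal are usually not applied before fitting.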

## Run BERTopic model

Import the `BERTopic` class from the *bertopic* library, create a model instance, and fit it to the article descriptions. Fixing the `random_state` of UMAP makes the dimensionality reduction, and with it the resulting topics, reproducible.

In [None]:
from bertopic import BERTopic
from umap import UMAP

umap_model = UMAP(random_state=1337)

topic_model = BERTopic(umap_model=umap_model) # Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
topics, probs = topic_model.fit_transform(newspaper_data['description'])


After generating topics and their probabilities, we can inspect the most frequent topics. Topic -1 collects the outliers, i.e. documents that could not be assigned to any topic:

In [None]:
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,3831,-1_and_of_to_the,"[and, of, to, the, in, for, on, that, said, with]",[Google Inc. has made its name on advanced-sea...
1,0,600,0_touchdown_yards_bowl_game,"[touchdown, yards, bowl, game, quarterback, se...",[AP - Joey Harrington threw two touchdown pass...
2,1,368,1_manchester_champions_arsenal_league,"[manchester, champions, arsenal, league, engla...",[It #39;s the biggest game of the English seas...
3,2,365,2_profit_percent_thirdquarter_quarterly,"[profit, percent, thirdquarter, quarterly, ear...",[Reuters - RadioShack Corp. on Tuesday\posted...
4,3,275,3_athens_olympic_gold_medal,"[athens, olympic, gold, medal, olympics, greec...",[ ATHENS (Reuters) - The United States broke s...
...,...,...,...,...,...
153,152,11,152_consumer_october_energy_barometer,"[consumer, october, energy, barometer, septemb...",[WASHINGTON (CBS.MW) -- Prices of US wholesale...
154,153,11,153_companies_options_rules_accounting,"[companies, options, rules, accounting, treat,...",[The US Financial Accounting Standards Board a...
155,154,11,154_saddam_hussein_interim_allawi,"[saddam, hussein, interim, allawi, baghdad, li...",[The trials of top figures in Saddam Hussein #...
156,155,10,155_bird_flu_cats_h5n1,"[bird, flu, cats, h5n1, confirmed, bull, hunt,...",[Malaysian officials on Saturday were testing ...
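
Topic -1 can absorb a large part of the corpus. As a rough check, the share of outlier documents can be computed from the list of topic assignments returned by `fit_transform`. The sketch below uses a short made-up `example_topics` list instead of the real `topics` variable, and `outlier_share` is a hypothetical helper, not part of BERTopic.

```python
def outlier_share(topic_assignments):
    """Return the fraction of documents assigned to the outlier topic -1."""
    assignments = list(topic_assignments)
    if not assignments:
        return 0.0
    return sum(1 for t in assignments if t == -1) / len(assignments)

# Made-up assignments for eight documents (illustration only).
example_topics = [-1, 0, 2, -1, 1, 0, -1, 3]
print(outlier_share(example_topics))  # 0.375
```

In the run shown above, 3,831 of the 12,000 sampled articles fall into topic -1, which is roughly 32 percent.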


Get topic information for each article:

In [None]:
topic_model.get_document_info(newspaper_data['description'])

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,Oracle Corp. notified Delaware's Court of Chan...,13,13_oracle_peoplesoft_hostile_takeover,"[oracle, peoplesoft, hostile, takeover, bid, r...",[AP - Oracle Corp.'s hostile #36;7.7 billion ...,oracle - peoplesoft - hostile - takeover - bid...,0.975817,False
1,With an eye towards making mobile devices and ...,152,152_ntt_docomo_mobile_mmo2,"[ntt, docomo, mobile, mmo2, operator, imode, p...",[Japanese mobile network operator NTT DoCoMo a...,ntt - docomo - mobile - mmo2 - operator - imod...,0.489425,False
2,With just days to go before the U.S. president...,-1,-1_and_to_of_the,"[and, to, of, the, in, that, said, its, it, for]","[Reuters - Samsung Electronics Co. Ltd., the w...",and - to - of - the - in - that - said - its -...,0.000000,False
3,Henrik Larsson was left on the bench by Barcel...,1,1_manchester_champions_arsenal_league,"[manchester, champions, arsenal, league, madri...","[LONDON, England -- Arsenal coach Arsene Wenge...",manchester - champions - arsenal - league - ma...,1.000000,False
4,AP - The government will press on with a child...,-1,-1_and_to_of_the,"[and, to, of, the, in, that, said, its, it, for]","[Reuters - Samsung Electronics Co. Ltd., the w...",and - to - of - the - in - that - said - its -...,0.000000,False
...,...,...,...,...,...,...,...,...
11995,AP - Though Congress approved a #36;1.2 billi...,88,88_press_associated_ap_bridis,"[press, associated, ap, bridis, writing, ted, ...","[The Associated Press By Ted Bridis, Bernie Ec...",press - associated - ap - bridis - writing - t...,0.508129,False
11996,Profits at Premiership champions Arsenal soar ...,1,1_manchester_champions_arsenal_league,"[manchester, champions, arsenal, league, madri...","[LONDON, England -- Arsenal coach Arsene Wenge...",manchester - champions - arsenal - league - ma...,1.000000,False
11997,"NAJAF, Iraq : Iraq #39;s top Shiite Muslim cle...",51,51_najaf_cleric_shiite_shrine,"[najaf, cleric, shiite, shrine, radical, holy,...","[NAJAF, Iraq - Thousands of pilgrims streamed ...",najaf - cleric - shiite - shrine - radical - h...,1.000000,False
11998,Michael Eisner will leave the Walt Disney comp...,99,99_disney_walt_eisner_michael,"[disney, walt, eisner, michael, ovitz, co, exe...",[Walt Disney Co. chief executive Michael D. Ei...,disney - walt - eisner - michael - ovitz - co ...,1.000000,True


Visualize the results:

In [None]:
# topic distance: topics are closer together if they exhibit similar word distributions
# or co-occurring patterns, even if they do not share identical words
topic_model.visualize_topics()

In [None]:
# word scores
topic_model.visualize_barchart()

In [None]:
# Plot for topics 40 to 51
topic_model.visualize_barchart(topics=list(range(40, 52)))

In [None]:
# Topic similarity
topic_model.visualize_heatmap()
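
According to the BERTopic documentation, the heatmap is based on the cosine similarity between topic embeddings. A minimal sketch of that measure, using made-up three-dimensional vectors (real embeddings have several hundred dimensions) and a hypothetical helper `cosine_similarity`:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up topic vectors (illustration only).
topic_vectors = {
    "football": [0.9, 0.1, 0.0],
    "soccer": [0.8, 0.3, 0.1],
    "earnings": [0.0, 0.2, 0.9],
}

# Pairwise similarity matrix, stored as a dict keyed by topic pairs.
names = list(topic_vectors)
similarity = {
    (a, b): cosine_similarity(topic_vectors[a], topic_vectors[b])
    for a in names for b in names
}
print(round(similarity[("football", "soccer")], 3))
print(round(similarity[("football", "earnings")], 3))
```

Values close to 1 mark topics that point in nearly the same direction; values close to 0 mark unrelated topics.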

In [None]:
# Topic hierarchy
topic_model.visualize_hierarchy()

Next, let's take a look at the terror attack topic (49):

In [None]:
topic_model.get_topic(49)

[('killed', 0.03909939300059716),
 ('wounded', 0.03512190743165762),
 ('attack', 0.032953969320191775),
 ('bomb', 0.02923352141031145),
 ('northeastern', 0.028233054024597023),
 ('people', 0.0277208922573563),
 ('explosion', 0.027654527134909044),
 ('nagaland', 0.02633234460838303),
 ('india', 0.025912268008631445),
 ('fire', 0.0250065654676829)]
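
The scores returned by `get_topic` are class-based TF-IDF (c-TF-IDF) weights: all documents of a topic are treated as one long document, and a word scores high if it is frequent in that topic but rare across topics. The following is a simplified sketch, assuming the weighting tf * log(1 + A / f) described in the BERTopic documentation, where A is the average number of words per topic and f is the frequency of the word across all topics. The toy documents and the helper `c_tf_idf` are made up for illustration and will not reproduce the library's exact numbers.

```python
import math
from collections import Counter

def c_tf_idf(docs_per_topic):
    """Simplified c-TF-IDF: one score per (topic, word)."""
    # Word counts per topic, treating each topic as one concatenated document.
    counts = {t: Counter(" ".join(docs).lower().split())
              for t, docs in docs_per_topic.items()}
    # Frequency of each word across all topics.
    total = Counter()
    for c in counts.values():
        total.update(c)
    # Average number of words per topic.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for t, c in counts.items():
        n_words = sum(c.values())
        # Normalised term frequency times the class-based IDF term.
        scores[t] = {
            w: (freq / n_words) * math.log(1 + avg_words / total[w])
            for w, freq in c.items()
        }
    return scores

# Made-up documents grouped by topic (illustration only).
toy = {
    0: ["bomb attack killed five", "attack wounded many people"],
    1: ["profit rose in the quarter", "quarterly profit beat forecasts"],
}
scores = c_tf_idf(toy)
top_words = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
print(top_words)
```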

Show representative articles for topics 49 and 82:

In [None]:
topic_model.get_representative_docs(topic=49)

['BAGHDAD, Iraq - A suicide attacker detonated a car bomb by police on a Baghdad bridge, and U.S. troops foiled a second suicide vehicle bombing in attacks Friday that killed at least five people and wounding at least 21...',
 ' BAGHDAD (Reuters) - At least 13 Iraqis were killed in a  suicide car bomb attack on a major police checkpoint in central  Baghdad on Friday, an Interior Ministry spokesman said.',
 'BAGHDAD: A suicide car bomber killed at least 13 people in an attack on a police checkpoint in Baghdad last night, after US air strikes around rebel-held Falluja that killed scores.']

In [None]:
topic_model.get_representative_docs(topic=82)

['Reuters - A series of bomb blasts killed\\nine people and wounded 35 in northeastern India on Saturday in\\the deadliest attack since a cease-fire with the main\\separatist group in Nagaland began seven years ago.',
 ' GUWAHATI, India (Reuters) - A series of bomb blasts killed  nine people and wounded 35 in northeastern India on Saturday in  the deadliest attack since a cease-fire with the main  separatist group in Nagaland began seven years ago.',
 ' GUWAHATI, India (Reuters) - A series of bomb blasts killed  19 people and wounded more than 50 in northeastern India  Saturday in the deadliest attack since a cease-fire with the  main separatist group in Nagaland began seven years ago.']

The default label of each topic is the topic number followed by its top four words, joined by underscores. Assign custom labels to the topics:

In [None]:
topic_model.set_topic_labels({49: "Iraq attacks", 82: "India attacks"})
topic_model.get_topic_info(49)

Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,49,48,49_baghdad_killed_car_suicide,Iraq attacks,"[baghdad, killed, car, suicide, least, iraq, b...","[BAGHDAD, Iraq - A suicide attacker detonated ..."


In [None]:
topic_model.get_topic_info(82)

Unnamed: 0,Topic,Count,Name,CustomName,Representation,Representative_Docs
0,82,28,82_killed_wounded_attack_bomb,India attacks,"[killed, wounded, attack, bomb, northeastern, ...",[Reuters - A series of bomb blasts killed\nine...
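
The naming pattern visible in the `Name` column can be sketched in a few lines. `default_name` is a hypothetical helper that mimics that pattern (topic number plus the four highest-scoring words); it is not a BERTopic function. The (word, score) pairs are rounded values from the `get_topic` output shown earlier.

```python
def default_name(topic_id, word_scores, n_words=4):
    """Build a label such as '82_killed_wounded_attack_bomb'."""
    # Sort by score so the highest-weighted words come first.
    ranked = sorted(word_scores, key=lambda pair: pair[1], reverse=True)
    words = [word for word, _ in ranked[:n_words]]
    return "_".join([str(topic_id)] + words)

word_scores = [("killed", 0.0391), ("wounded", 0.0351), ("attack", 0.0330),
               ("bomb", 0.0292), ("northeastern", 0.0282)]

# Same mapping style as passed to set_topic_labels above.
custom_labels = {82: "India attacks"}

name = default_name(82, word_scores)
print(name)                         # 82_killed_wounded_attack_bomb
print(custom_labels.get(82, name))  # India attacks
```

Topics without a custom label keep their default name, which is what the `.get(..., name)` fallback imitates.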


## Topics per class

Similar to the stm approach of regressing topic probabilities on document-level variables, BERTopic can show how topics are distributed across genres ("class"):

In [None]:
genres = newspaper_data['class']

topics_per_class = topic_model.topics_per_class(newspaper_data['description'], classes = genres)

topic_model.visualize_topics_per_class(topics_per_class, top_n_topics = 31)
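
Part of what `topics_per_class` reports is how often each topic occurs within each class. That counting step can be sketched with the standard library. The assignments below are made up, and the real method additionally recomputes the topic words per class.

```python
from collections import Counter

# Made-up topic assignments and genres for six documents (illustration only).
example_topics = [0, 0, 2, 2, 2, 3]
example_classes = ["Sports", "Sports", "Business", "Business", "Sci/Tech", "Sports"]

# Count how many documents of each class fall into each topic.
frequency = Counter(zip(example_classes, example_topics))

for (genre, topic), count in sorted(frequency.items()):
    print(f"{genre:10s} topic {topic}: {count}")
```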

For more information on visualization, check the [documentation](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html).

## Change clustering algorithm

In [None]:
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters = 100) # use k-means clustering, finding 100 topics
topic_model_kmeans = BERTopic(hdbscan_model = cluster_model)
topicsKmeans, probsKmeans = topic_model_kmeans.fit_transform(newspaper_data['description'])

topic_model_kmeans.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,264,0_mobile_phone_wireless_communications,"[mobile, phone, wireless, communications, serv...","[Taiwan Cellular Corporation (TCC), the second..."
1,1,250,1_bank_billion_buy_million,"[bank, billion, buy, million, inc, group, corp...",[ NEW YORK (Reuters) - Nexen Inc. has agreed t...
2,2,243,2_manchester_england_united_arsenal,"[manchester, england, united, arsenal, manager...",[It #39;s the biggest game of the English seas...
3,3,235,3_olympic_athens_gold_medal,"[olympic, athens, gold, medal, olympics, champ...",[ ATHENS (Reuters) - The United States broke s...
4,4,231,4_bowl_no_victory_state,"[bowl, no, victory, state, points, college, se...","[PITTSBURGH (5-3) vs. NOTRE DAME (6-3) When, w..."
...,...,...,...,...,...
95,95,41,95_heavyweight_klitschko_vitali_ring,"[heavyweight, klitschko, vitali, ring, danny, ...","[Following his upset of Mike Tyson in July, Br..."
96,96,38,96_car_oprah_cars_show,"[car, oprah, cars, show, escalade, winfrey, ex...",[ PARIS (Reuters) - Carmakers presented new-ag...
97,97,36,97_stewart_martha_her_she,"[stewart, martha, her, she, judge, prison, dis...","[Martha Stewart, the lifestyle guru convicted ..."
98,98,34,98_cp_canadian_quebec_press,"[cp, canadian, quebec, press, singapore, canad...",[Canadian Press - OTTAWA (CP) - Canada's speci...
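
Unlike HDBSCAN, k-means has no notion of outliers: every document is assigned to its nearest cluster centre, which is consistent with the table above starting at topic 0 rather than -1. A minimal sketch of that assignment step, with made-up two-dimensional points and fixed centroids (real k-means also updates the centroids iteratively, and `assign_to_nearest` is a hypothetical helper):

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Made-up reduced embeddings and cluster centres (illustration only).
points = [(0.1, 0.2), (0.9, 1.1), (5.0, 5.0), (0.0, 0.0)]
centroids = [(0.0, 0.0), (1.0, 1.0)]

labels = assign_to_nearest(points, centroids)
print(labels)  # [0, 1, 1, 0]
```

Even the distant point (5.0, 5.0) receives a cluster label, whereas a density-based method could mark it as an outlier.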


In [None]:
topic_model_kmeans.get_document_info(newspaper_data['description'])

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Representative_document
0,Oracle Corp. notified Delaware's Court of Chan...,53,53_oracle_peoplesoft_hostile_takeover,"[oracle, peoplesoft, hostile, takeover, bid, r...",[AP - Oracle Corp.'s hostile #36;7.7 billion ...,oracle - peoplesoft - hostile - takeover - bid...,False
1,With an eye towards making mobile devices and ...,0,0_mobile_phone_wireless_communications,"[mobile, phone, wireless, communications, serv...","[Taiwan Cellular Corporation (TCC), the second...",mobile - phone - wireless - communications - s...,False
2,With just days to go before the U.S. president...,58,58_najaf_iraq_cleric_shiite,"[najaf, iraq, cleric, shiite, iraqi, shrine, h...","[NAJAF, Iraq - Militiamen loyal to rebel Shiit...",najaf - iraq - cleric - shiite - iraqi - shrin...,False
3,Henrik Larsson was left on the bench by Barcel...,21,21_champions_league_chelsea_uefa,"[champions, league, chelsea, uefa, arsenal, wi...",[CHELSEA moved five points clear at the top of...,champions - league - chelsea - uefa - arsenal ...,False
4,AP - The government will press on with a child...,51,51_internet_music_industry_copyright,"[internet, music, industry, copyright, court, ...","[Sharman Networks, the company behind the Kaza...",internet - music - industry - copyright - cour...,False
...,...,...,...,...,...,...,...
11995,AP - Though Congress approved a #36;1.2 billi...,92,92_ap_intelligence_house_faa,"[ap, intelligence, house, faa, congress, press...",[AP - Senators on Monday gave in to several Ho...,ap - intelligence - house - faa - congress - p...,False
11996,Profits at Premiership champions Arsenal soar ...,21,21_champions_league_chelsea_uefa,"[champions, league, chelsea, uefa, arsenal, wi...",[CHELSEA moved five points clear at the top of...,champions - league - chelsea - uefa - arsenal ...,False
11997,"NAJAF, Iraq : Iraq #39;s top Shiite Muslim cle...",58,58_najaf_iraq_cleric_shiite,"[najaf, iraq, cleric, shiite, iraqi, shrine, h...","[NAJAF, Iraq - Militiamen loyal to rebel Shiit...",najaf - iraq - cleric - shiite - iraqi - shrin...,False
11998,Michael Eisner will leave the Walt Disney comp...,42,42_disney_enron_former_executive,"[disney, enron, former, executive, walt, chief...",[ LOS ANGELES (Reuters) - Walt Disney Chief Ex...,disney - enron - former - executive - walt - c...,True


## Guided Topic Models

Similar to the Keyword Assisted Topic Models we learned in R, Guided Topic Models in Python allow you to supply prior information on certain topics before fitting the model.

In [None]:
# Define expected (seeded) topics
seed_topic_list = [["sale", "bank", "profit"],
                   ["moon", "nasa", "space"],
                   ["olympic", "gold", "medal"],
                   ["kill", "bomb", "attack"]]

topic_model = BERTopic(seed_topic_list=seed_topic_list)
topics, probs = topic_model.fit_transform(newspaper_data['description'])

# Access the frequent topics
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,3642,-1_and_to_of_the,"[and, to, of, the, in, for, with, has, on, 39s]",[NEW YORK - FedEx Corp. Monday raised its earn...
1,0,560,0_sox_yankees_red_series,"[sox, yankees, red, series, baseball, league, ...",[A test of his sore right ankle finds him read...
2,1,408,1_space_nasa_moon_mars,"[space, nasa, moon, mars, spacecraft, scientis...",[Shuttle astronauts would do a better job of u...
3,2,351,2_olympic_athens_gold_medal,"[olympic, athens, gold, medal, olympics, greec...",[ATHENS - American gymnast Paul Hamm apparentl...
4,3,248,3_wireless_mobile_phone_communications,"[wireless, mobile, phone, communications, phon...","[Taiwan Cellular Corporation (TCC), the second..."
...,...,...,...,...,...
159,158,11,158_citigroup_menlo_ups_unit,"[citigroup, menlo, ups, unit, cnf, exchange, w...","[Citigroup Inc., the world #39;s largest bank,..."
160,159,10,159_breeders_cup_lukas_azeri,"[breeders, cup, lukas, azeri, trainer, distaff...",[In rationalizing the decision to enter Azeri ...
161,160,10,160_korean_north_school_deserted,"[korean, north, school, deserted, beijing, ent...","[BEIJING -- Seven men and women, apparently No..."
162,161,10,161_critical_security_microsoft_patch,"[critical, security, microsoft, patch, windows...",[Microsoft Releases 10 Security Patches for Wi...
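
To check whether the seeded topics surfaced, the seed words can be compared with the fitted topic representations. The sketch below hard-codes two representations copied from the `Name` column of the output above. `best_matching_topic` is a hypothetical helper, and in practice the word lists would come from `topic_model.get_topic(...)`.

```python
seed_topic_list = [["sale", "bank", "profit"],
                   ["moon", "nasa", "space"],
                   ["olympic", "gold", "medal"],
                   ["kill", "bomb", "attack"]]

# Top words of two fitted topics, copied from the Name column above.
topic_words = {
    1: ["space", "nasa", "moon", "mars"],
    2: ["olympic", "athens", "gold", "medal"],
}

def best_matching_topic(seed_words, topic_words):
    """Return (topic_id, overlap) for the topic sharing most words with the seed."""
    overlaps = {t: len(set(seed_words) & set(words))
                for t, words in topic_words.items()}
    best = max(overlaps, key=overlaps.get)
    return best, overlaps[best]

for seed in seed_topic_list:
    print(seed, "->", best_matching_topic(seed, topic_words))
```

An overlap of 0 means that none of the listed topics matches the seed. With only two topics hard-coded here, that says nothing about the full model.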


In [None]:
topic_model.visualize_barchart(n_words=8, top_n_topics=12)

In [None]:
topic_model.get_document_info(newspaper_data['description'])

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,Oracle Corp. notified Delaware's Court of Chan...,18,18_oracle_peoplesoft_hostile_takeover,"[oracle, peoplesoft, hostile, takeover, bid, r...",[AP - Oracle Corp.'s hostile #36;7.7 billion ...,oracle - peoplesoft - hostile - takeover - bid...,0.816944,False
1,With an eye towards making mobile devices and ...,3,3_wireless_mobile_phone_communications,"[wireless, mobile, phone, communications, phon...","[Taiwan Cellular Corporation (TCC), the second...",wireless - mobile - phone - communications - p...,1.000000,False
2,With just days to go before the U.S. president...,-1,-1_and_to_of_the,"[and, to, of, the, in, for, with, has, on, 39s]",[NEW YORK - FedEx Corp. Monday raised its earn...,and - to - of - the - in - for - with - has - ...,0.000000,False
3,Henrik Larsson was left on the bench by Barcel...,6,6_arsenal_champions_manchester_chelsea,"[arsenal, champions, manchester, chelsea, leag...",[Arsenal manager Arsene Wenger believes Manche...,arsenal - champions - manchester - chelsea - l...,0.654319,False
4,AP - The government will press on with a child...,-1,-1_and_to_of_the,"[and, to, of, the, in, for, with, has, on, 39s]",[NEW YORK - FedEx Corp. Monday raised its earn...,and - to - of - the - in - for - with - has - ...,0.000000,False
...,...,...,...,...,...,...,...,...
11995,AP - Though Congress approved a #36;1.2 billi...,73,73_tax_breaks_senate_bill,"[tax, breaks, senate, bill, corporate, billion...",[The House last night passed a sweeping rewrit...,tax - breaks - senate - bill - corporate - bil...,0.638994,False
11996,Profits at Premiership champions Arsenal soar ...,6,6_arsenal_champions_manchester_chelsea,"[arsenal, champions, manchester, chelsea, leag...",[Arsenal manager Arsene Wenger believes Manche...,arsenal - champions - manchester - chelsea - l...,0.818625,False
11997,"NAJAF, Iraq : Iraq #39;s top Shiite Muslim cle...",52,52_najaf_cleric_shiite_shrine,"[najaf, cleric, shiite, shrine, radical, alsad...","[NAJAF, Iraq - Thousands of pilgrims streamed ...",najaf - cleric - shiite - shrine - radical - a...,0.944814,False
11998,Michael Eisner will leave the Walt Disney comp...,115,115_disney_walt_eisner_michael,"[disney, walt, eisner, michael, ovitz, co, exe...",[Walt Disney Co. chief executive Michael D. Ei...,disney - walt - eisner - michael - ovitz - co ...,1.000000,True
