# Topic Modelling using BERTopic

# 1. Install the Required Libraries

In [2]:
%pip install bertopic

Collecting bertopic
  Downloading bertopic-0.17.0-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.40-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting pynndescent>=0.5 (from umap-learn>=0.5.0->bertopic)
  Downloading pynndescent-0.5.13-py3-none-any.whl.metadata (6.8 kB)
Downloading bertopic-0.17.0-py3-none-any.whl (150 kB)
Downloading hdbscan-0.8.40-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.2 MB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)

# 2. Import Libraries

In [3]:
import pandas as pd
from datasets import load_dataset
from bertopic import BERTopic

# 3. Import Data

In [4]:
ds = load_dataset("abisee/cnn_dailymail", "3.0.0")


In [5]:
ds

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [6]:
# Load just the 'article' column from the test split as a pandas Series
articles = ds["test"].to_pandas()["article"]

In [7]:
articles[25]

'(CNN)Just as mimeograph machines and photocopiers were in their day, online activity -- blogs, YouTube channels, even social media platforms like Facebook and Twitter -- have fully emerged as the alternative to traditional mainstream media. It is not just the low cost of posting online that attracts dissidence, though that in itself is liberating. It is the lack of access to traditional print and broadcast media in authoritarian countries that is really the driving force leading disaffected voices to post online. It is not unique to Asia, but it might seem more pronounced if you live there. Going online has become the path of least resistance if you want to make yourself heard. But it still brings resistance, some of it legal, some of it deadly. Let\'s look at the legal angle first. Amos Yee, the teenage video blogger who was arrested and held pending bail Sunday in Singapore, drew international attention for his anti-Lee Kuan Yew harangue. But jailing critics is not usually the gover

# 4. Initialize a Topic Modelling Object

In [22]:
topic_model = BERTopic(nr_topics=100, embedding_model="all-MiniLM-L6-v2", language="english", verbose=True)
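
Several topic labels later in this notebook are dominated by stop words (for example `-1_the_to_and_of`). One common remedy is to pass a scikit-learn `CountVectorizer` with stop-word removal as `vectorizer_model`, which changes only the topic representation step and leaves the embeddings untouched. A minimal sketch, assuming scikit-learn's built-in English stop list is acceptable for this corpus; the helper name `build_topic_model` is hypothetical and this configuration was not part of the original run:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Drop English stop words when building topic representations.
# Assumption: scikit-learn's built-in stop list is good enough for news text.
vectorizer_model = CountVectorizer(stop_words="english")


def build_topic_model():
    """Hypothetical helper: the settings used above plus the custom vectorizer."""
    from bertopic import BERTopic  # imported lazily; needs `pip install bertopic`

    return BERTopic(
        nr_topics=100,
        embedding_model="all-MiniLM-L6-v2",
        language="english",
        vectorizer_model=vectorizer_model,
        verbose=True,
    )


# The analyzer shows which tokens survive lowercasing and stop-word filtering.
analyzer = vectorizer_model.build_analyzer()
tokens = analyzer("He said the fight in Las Vegas was the biggest of the year")
print(tokens)
```

Calling `build_topic_model()` in place of the constructor above should yield topic labels built from content words rather than function words.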

# 5. Training the Model

In [23]:
topics, probs = topic_model.fit_transform(articles)

2025-04-12 14:51:40,923 - BERTopic - Embedding - Transforming documents to embeddings.


2025-04-12 15:03:44,626 - BERTopic - Embedding - Completed ✓
2025-04-12 15:03:44,627 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-12 15:03:49,903 - BERTopic - Dimensionality - Completed ✓
2025-04-12 15:03:49,905 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-12 15:03:50,425 - BERTopic - Cluster - Completed ✓
2025-04-12 15:03:50,426 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-04-12 15:03:57,545 - BERTopic - Representation - Completed ✓
2025-04-12 15:03:57,557 - BERTopic - Topic reduction - Reducing number of topics
2025-04-12 15:03:57,598 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-04-12 15:04:04,437 - BERTopic - Representation - Completed ✓
2025-04-12 15:04:04,455 - BERTopic - Topic reduction - Reduced number of topics from 193 to 100
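
The timestamps in the log above show where the training time goes. A small sketch that parses three start/end pairs copied from that log and prints the duration of each stage; the `stages` dictionary and the `stage_seconds` helper are illustrative names, not part of BERTopic:

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# (start, end) timestamps copied from the training log above.
stages = {
    "Embedding": ("2025-04-12 14:51:40,923", "2025-04-12 15:03:44,626"),
    "Dimensionality": ("2025-04-12 15:03:44,627", "2025-04-12 15:03:49,903"),
    "Cluster": ("2025-04-12 15:03:49,905", "2025-04-12 15:03:50,425"),
}


def stage_seconds(start, end):
    """Return the elapsed time in seconds between two log timestamps."""
    started = datetime.strptime(start, LOG_TIME_FORMAT)
    finished = datetime.strptime(end, LOG_TIME_FORMAT)
    return (finished - started).total_seconds()


durations = {name: stage_seconds(*span) for name, span in stages.items()}
for name, seconds in durations.items():
    print(f"{name}: {seconds:.1f} s")
```

In this run the embedding step takes about 724 of the roughly 744 seconds, so the embedding model and hardware matter far more for runtime than the clustering settings.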


## Save the trained model for future use

In [13]:
topic_model.save("bertopic_cnn_dailymail_model2", serialization="pytorch")

## Import a saved model

In [14]:
topic_model = BERTopic.load("bertopic_cnn_dailymail_model2")

# 6. Output the results

In [15]:
# Map each topic ID to its generated name
topic_info = topic_model.get_topic_info()
topic_id_to_name = dict(zip(topic_info["Topic"], topic_info["Name"]))

In [16]:
# Create the final DataFrame
df_topics = pd.DataFrame({
    "article": articles,
    "topic_id": topics
})
df_topics["topic_name"] = df_topics["topic_id"].map(topic_id_to_name)

In [17]:
df_topics

| | article | topic_id | topic_name |
|---|---|---|---|
| 0 | (CNN)The Palestinian Authority officially beca... | 41 | 41_iran_nuclear_deal_agreement |
| 1 | (CNN)Never mind cats having nine lives. A stra... | 5 | 5_dog_dogs_animal_animals |
| 2 | (CNN)If you've been following the news lately,... | 41 | 41_iran_nuclear_deal_agreement |
| 3 | (CNN)Five Americans who were monitored for thr... | 45 | 45_vaccine_vaccination_children_health |
| 4 | (CNN)A Duke student has admitted to hanging a ... | -1 | -1_the_and_to_of |
| ... | ... | ... | ... |
| 11485 | Telecom watchdogs are to stop a rip-off that a... | 25 | 25_apple_watch_google_battery |
| 11486 | The chilling reenactment of how executions are... | 42 | 42_sukumaran_chan_indonesia_indonesian |
| 11487 | It is a week which has seen him in deep water ... | 55 | 55_nfl_manziel_hardy_talib |
| 11488 | Despite the hype surrounding its first watch, ... | 25 | 25_apple_watch_google_battery |


# 7. Understand what happened

In [24]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 4211 | -1_the_to_and_of | [the, to, and, of, in, was, her, she, for, that] | [We're either guilty of it or we have a friend... |
| 1 | 0 | 1501 | 0_league_season_his_the | [league, season, his, the, at, to, he, against... | [Suddenly an opportunity. From looking as if a... |
| 2 | 1 | 387 | 1_her_she_was_and | [her, she, was, and, family, said, had, to, po... | [Family and friends who travelled from around ... |
| 3 | 2 | 349 | 2_isis_al_of_syria | [isis, al, of, syria, in, the, to, islamic, an... | [Depraved militants fighting for the Islamic S... |
| 4 | 3 | 318 | 3_england_test_masters_his | [england, test, masters, his, cricket, he, aug... | [England left for the Caribbean on Thursday kn... |
| ... | ... | ... | ... | ... | ... |
| 95 | 94 | 11 | 94_depay_psv_holland_dutch | [depay, psv, holland, dutch, van, memphis, uni... | [Bayern Munich have joined Manchester United a... |
| 96 | 95 | 11 | 95_oz_columbia_dr_ingam | [oz, columbia, dr, ingam, chokal, medical, fac... | [TV celebrity doctor Mehmet Oz has defended hi... |
| 97 | 96 | 11 | 96_luke_search_bushland_missing | [luke, search, bushland, missing, eildon, sham... | [A beanie believed to have belonged to Luke Sh... |
| 98 | 97 | 11 | 97_mars_planet_water_earth | [mars, planet, water, earth, nasa, surface, pl... | [The red planet might still have liquid water,... |
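
Topic `-1` collects the documents that the clustering step (HDBSCAN by default) could not assign to any cluster. In the table above it holds 4211 of the 11,490 test articles. A minimal sketch for computing that share from any list of topic assignments; `outlier_share` is a hypothetical helper, and the toy list stands in for the `topics` returned by `fit_transform`:

```python
from collections import Counter


def outlier_share(topic_ids):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topic_ids)
    total = sum(counts.values())
    return counts[-1] / total if total else 0.0


# Toy assignments standing in for `topics`.
toy_topics = [41, 5, 41, 45, -1, 25, 42, 55, 25, -1]
print(f"toy outlier share: {outlier_share(toy_topics):.1%}")

# The counts reported by get_topic_info() above.
reported_share = 4211 / 11490
print(f"reported outlier share: {reported_share:.1%}")
```

If that share is too high for the task at hand, BERTopic's `reduce_outliers` method can reassign outlier documents to their nearest topic after training.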


In [19]:
topic_model.get_topic(topic=9)

[['fight', 0.039224126227479776],
 ['mayweather', 0.03771825484774972],
 ['pacquiao', 0.03416298657055846],
 ['manny', 0.01612687026181083],
 ['floyd', 0.015851690001487066],
 ['boxing', 0.015354607833008231],
 ['his', 0.014109650878735799],
 ['vegas', 0.012613895644490974],
 ['las', 0.011589665926498207],
 ['he', 0.0108853445334723]]
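
The scores above are class-based TF-IDF (c-TF-IDF) weights: all documents of a topic are treated as one class, and a term scores highly when it is frequent in that class but rare across the others. A simplified pure-Python sketch of the weighting `W(t, c) = tf(t, c) * log(1 + A / f(t))`, where `tf(t, c)` is the term's frequency in class `c` normalised by class length, `A` is the average number of words per class, and `f(t)` is the term's frequency across all classes. It leaves out the vectorizer and other details of BERTopic's implementation, and the toy texts are invented for illustration:

```python
import math
from collections import Counter

# Toy classes: each string stands for all documents of one topic joined together.
classes = {
    "boxing": "the fight in las vegas the fight of the year",
    "dogs": "the dog and the dogs in the animal shelter",
    "space": "the planet mars and the water on the surface",
}


def c_tf_idf(classes):
    """Compute simplified c-TF-IDF weights for every term in every class."""
    term_counts = {name: Counter(text.split()) for name, text in classes.items()}

    # f(t): frequency of each term across all classes.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)

    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(classes)

    weights = {}
    for name, counts in term_counts.items():
        class_size = sum(counts.values())
        weights[name] = {
            term: (count / class_size) * math.log(1 + avg_words / total_counts[term])
            for term, count in counts.items()
        }
    return weights


weights = c_tf_idf(classes)
for name, scores in weights.items():
    top_terms = sorted(scores, key=scores.get, reverse=True)[:3]
    print(name, top_terms)
```

Even in this toy example `the` keeps a non-trivial weight because it is so frequent within each class, which is why words such as `his` and `he` still appear among the top terms of topic 9.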

In [25]:
topic_model.get_representative_docs(9)

 "A grandmother nicknamed the 'Fairy Dogmother' spends more than £28,000 a year looking after the stray or abandoned dogs she has welcomed into her home. For more than 30 years,\xa0Pat Senior, 66,  has shared her five-bedroom home in Bolton, Greater Manchester, with the animals, taking in dogs from as far afield as Romania and Hungary. She estimates that she spends £240 a week on food and treats for the dogs, with veterinary bills adding another £17,000 to the yearly cost of caring for the pets. Pat Senior, 66, \xa0(pictured) who is nicknamed the 'Fairy Dogmother', spends more than £28,000 a year looking after as many as 26 stray or abandoned dogs she has welcomed into her home . Grandmother-of-four Mrs Senior and her businessman husband, Charles, currently have 19 dogs in their care, including lurchers, German Shepherds and Chinese Cresteds who all sleep and live at her home. She has made makeshift beds for the animals in the couple's garage, with some dogs sleeping in the living room

In [26]:
topic_model.visualize_topics()

In [27]:
topic_model.visualize_barchart()

# 8. Hierarchical Topic Modelling

In [28]:
hierarchical_topics = topic_model.hierarchical_topics(articles, topics)


In [29]:
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)