# BERTopic improved representation

In [None]:
# Install the dependencies first so that the import below succeeds
!pip install -r requirements.txt
from bertopic import BERTopic

### Choosing dataset
First we load the chosen dataset. The source is BBC News, but it is a different dataset from the one used in the paper.

This dataset has only 5 labels: __business, entertainment, politics, sport, tech__

In [3]:
from datasets import load_dataset

dataset = load_dataset("SetFit/bbc-news")
docs_train = dataset["train"]["text"]
categories_train = dataset["train"]["label_text"]

Column(['wales want rugby league training wales could follow england s lead by training with a rugby league club.  england have already had a three-day session with leeds rhinos  and wales are thought to be interested in a similar clinic with rivals st helens. saints coach ian millward has given his approval  but if it does happen it is unlikely to be this season. saints have a week s training in portugal next week  while wales will play england in the opening six nations match on 5 february.  we have had an approach from wales   confirmed a saints spokesman.  it s in the very early stages but it is something we are giving serious consideration to.  st helens  who are proud of their welsh connections  are obvious partners for the welsh rugby union  despite a spat in 2001 over the collapse of kieron cunningham s proposed £500 000 move to union side swansea. a similar cross-code deal that took iestyn harris from leeds to cardiff in 2001 did go through  before the talented stand-off retur
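
A quick count over the label column is one way to confirm the label set. The sketch below uses a short hypothetical list in place of `categories_train` so that it runs on its own; on the real data one would pass `categories_train` instead.

```python
from collections import Counter

# Hypothetical stand-in for categories_train (the real list comes from the dataset)
sample_labels = ["sport", "tech", "sport", "business", "politics", "entertainment", "tech"]

label_counts = Counter(sample_labels)

# Print labels from most to least frequent
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```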

### Reproduction on standard BERTopic
Now we train the standard BERTopic model. We choose English as the language and pass no representation model: `representation_model` is left at its default, so the topic keywords come directly from c-TF-IDF, which is computed by `ClassTfidfTransformer`.

In [7]:
from bertopic import BERTopic

topic_model_baseline = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics_base, probs_base = topic_model_baseline.fit_transform(docs_train)

2025-11-20 12:45:36,430 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 39/39 [00:14<00:00,  2.73it/s]
2025-11-20 12:45:55,104 - BERTopic - Embedding - Completed ✓
2025-11-20 12:45:55,108 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-20 12:45:56,392 - BERTopic - Dimensionality - Completed ✓
2025-11-20 12:45:56,394 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-20 12:45:56,498 - BERTopic - Cluster - Completed ✓
2025-11-20 12:45:56,514 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-20 12:45:56,802 - BERTopic - Representation - Completed ✓


In [10]:
import pandas as pd

# Topic overview; drop the outlier topic (-1)
info_df = topic_model_baseline.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    """Join the first six keywords of a topic into one string."""
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

# Show full keyword strings instead of truncating them
pd.set_option("display.max_colwidth", None)
display(display_table.head(5))

fig_map = topic_model_baseline.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline.visualize_barchart(top_n_topics=8)
fig_bar.show()

| | Topic | Count | Top Keywords |
|---|---|---|---|
| 1 | 0 | 236 | the, to, of, and, mr, he |
| 2 | 1 | 173 | the, to, and, in, he, we |
| 3 | 2 | 105 | film, the, and, of, in, for |
| 4 | 3 | 70 | the, music, and, in, song, band |
| 5 | 4 | 47 | the, in, to, open, was, his |
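
The stop words in the baseline output are expected with plain c-TF-IDF: a word that is very frequent inside a cluster keeps a high score even after the inverse-frequency weighting. The following is a minimal pure-Python sketch of the weighting as described in the BERTopic documentation, `W(t, c) = tf(t, c) * log(1 + A / f(t))`, where `tf(t, c)` is the normalized frequency of word `t` in cluster `c`, `f(t)` is its frequency across all clusters, and `A` is the average number of words per cluster. It is not the library's implementation; the toy clusters and the `c_tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy clusters: all documents of a cluster joined into one token list
clusters = {
    "politics": "the minister said the party will back the election plan".split(),
    "sport": "the team won the match and the coach praised the squad".split(),
}

def c_tf_idf(clusters):
    """Score every word per cluster with tf(t, c) * log(1 + A / f(t))."""
    term_counts = {c: Counter(tokens) for c, tokens in clusters.items()}
    # f(t): frequency of each word across all clusters
    total_freq = Counter()
    for counts in term_counts.values():
        total_freq.update(counts)
    # A: average number of words per cluster
    avg_words = sum(len(tokens) for tokens in clusters.values()) / len(clusters)
    scores = {}
    for c, counts in term_counts.items():
        n_tokens = len(clusters[c])
        scores[c] = {
            word: (freq / n_tokens) * math.log(1 + avg_words / total_freq[word])
            for word, freq in counts.items()
        }
    return scores

scores = c_tf_idf(clusters)
for c, word_scores in scores.items():
    # Three highest-scoring words per cluster
    top = sorted(word_scores, key=word_scores.get, reverse=True)[:3]
    print(c, top)
```

In both toy clusters `the` receives the highest score, which mirrors what the baseline keywords show.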


### Reproduction with a different representation model
Now let's do the same, but with the KeyBERTInspired representation model. This should change the topic words substantially and make their meaning cleaner and more direct: each topic should be represented clearly enough that looking at the keywords alone tells us what the cluster is about. The key difference is that KeyBERTInspired ranks words by semantic similarity to the topic rather than by frequency. The model is updated in place with `update_topics`, which recomputes only the topic representations, so the clusters and their counts stay the same.

In [11]:
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
topic_model_baseline.update_topics(docs_train, representation_model=representation_model)
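
Conceptually, KeyBERTInspired re-ranks candidate keywords by how close their embeddings are to an embedding of the topic's representative documents. The sketch below illustrates only that ranking step, with hand-made 3-dimensional vectors standing in for real sentence embeddings. The vectors, the `cosine` helper, and the candidate list are all hypothetical, and this is not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embedding of a tennis topic
topic_embedding = [0.9, 0.1, 0.2]

# Hypothetical embeddings of candidate words
candidates = {
    "the": [0.3, 0.3, 0.3],
    "tennis": [0.8, 0.1, 0.3],
    "open": [0.5, 0.4, 0.2],
    "wimbledon": [0.9, 0.2, 0.1],
}

# Rank candidates from most to least similar to the topic
ranked = sorted(
    candidates,
    key=lambda word: cosine(candidates[word], topic_embedding),
    reverse=True,
)
print(ranked)
```

With these toy vectors the generic word `the` ends up last, regardless of how often it might occur in the cluster.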

In [12]:
info_df_enhanced = topic_model_baseline.get_topic_info()
clean_df_enhanced = info_df_enhanced[info_df_enhanced["Topic"] != -1].copy()

clean_df_enhanced["Top Keywords"] = clean_df_enhanced["Representation"].apply(clean_keywords)

display_table_enhanced = clean_df_enhanced[["Topic", "Count", "Top Keywords"]]

display(display_table_enhanced.head(5))

fig_map_enhanced = topic_model_baseline.visualize_topics()
fig_map_enhanced.show()

fig_bar_enhanced = topic_model_baseline.visualize_barchart(top_n_topics=8)
fig_bar_enhanced.show()

| | Topic | Count | Top Keywords |
|---|---|---|---|
| 1 | 0 | 236 | tory, blair, ukip, minister, tories, labour |
| 2 | 1 | 173 | rugby, liverpool, chelsea, united, england, arsenal |
| 3 | 2 | 105 | oscar, oscars, nominations, nominated, actress, awards |
| 4 | 3 | 70 | singer, songs, concert, u2, awards, rock |
| 5 | 4 | 47 | federer, tennis, wimbledon, mirza, tournament, roddick |
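
One rough way to put a number on the difference is the share of stop words among the displayed keywords. The sketch below copies the keywords from the two outputs above and checks them against a small hand-picked stop-word set. `STOP_WORDS` and `stop_word_share` are hypothetical helpers, and a real analysis would use a fuller stop-word list.

```python
# Small hand-picked stop-word set (an assumption, not a standard list)
STOP_WORDS = {"the", "to", "of", "and", "in", "he", "we", "for", "was", "his"}

# Keywords copied from the baseline (c-TF-IDF) output
baseline = [
    ["the", "to", "of", "and", "mr", "he"],
    ["the", "to", "and", "in", "he", "we"],
    ["film", "the", "and", "of", "in", "for"],
    ["the", "music", "and", "in", "song", "band"],
    ["the", "in", "to", "open", "was", "his"],
]

# Keywords copied from the KeyBERTInspired output
enhanced = [
    ["tory", "blair", "ukip", "minister", "tories", "labour"],
    ["rugby", "liverpool", "chelsea", "united", "england", "arsenal"],
    ["oscar", "oscars", "nominations", "nominated", "actress", "awards"],
    ["singer", "songs", "concert", "u2", "awards", "rock"],
    ["federer", "tennis", "wimbledon", "mirza", "tournament", "roddick"],
]

def stop_word_share(topics):
    """Fraction of all listed keywords that appear in STOP_WORDS."""
    words = [word for topic in topics for word in topic]
    return sum(word in STOP_WORDS for word in words) / len(words)

baseline_share = stop_word_share(baseline)
enhanced_share = stop_word_share(enhanced)
print(f"baseline: {baseline_share:.2f}, KeyBERTInspired: {enhanced_share:.2f}")
```

With this stop-word set, 24 of the 30 baseline keywords are stop words, while none of the KeyBERTInspired keywords are.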
