# BERTopic improved representation

In [None]:
!pip install -r requirements.txt

### Choosing dataset
First we load the chosen dataset. The source is BBC News, but it is a different dataset from the one used in the paper.

This dataset has only 5 labels: __business, entertainment, politics, sport, tech__

In [23]:
from datasets import load_dataset

dataset = load_dataset("SetFit/bbc-news")
docs_train = dataset["train"]["text"]
categories_train = dataset["train"]["label_text"]
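
Before training, it can help to check how balanced the five labels are. The sketch below is a minimal, hypothetical helper (`label_distribution` is not part of `datasets` or BERTopic), and `sample_labels` is made-up illustrative data rather than the real `categories_train`.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, share)} for a list of label strings."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Made-up sample; in the notebook you would pass categories_train instead.
sample_labels = ["sport", "tech", "sport", "business", "politics", "entertainment"]
dist = label_distribution(sample_labels)

for label, (n, share) in dist.items():
    print(f"{label:15s} {n:4d} {share:.1%}")
```

Calling `label_distribution(categories_train)` in the notebook would report the real counts per category.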

### Reproduction on standard BERTopic
Next we train a standard BERTopic model. We select English as the language and pass no representation model: when `representation_model` is left at its default (`None`), the topic words come straight from the c-TF-IDF scores computed by `ClassTfidfTransformer`.
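
To make the default representation more concrete, here is a simplified pure-Python sketch of the c-TF-IDF idea: the frequency of a term within a topic, multiplied by `log(1 + A / f_t)`, where `A` is the average number of words per topic and `f_t` is the frequency of the term across all topics. The function name `c_tf_idf`, the toy counts, and the normalization are assumptions made for illustration; this is not the library's actual implementation.

```python
import math

def c_tf_idf(counts):
    """Score each term per topic from rows of raw per-topic term counts."""
    n_topics = len(counts)
    n_terms = len(counts[0])
    total_words = sum(sum(row) for row in counts)
    avg_words = total_words / n_topics  # A: average number of words per topic
    term_totals = [sum(row[j] for row in counts) for j in range(n_terms)]  # f_t
    scores = []
    for row in counts:
        row_sum = sum(row)
        scores.append([
            (row[j] / row_sum) * math.log(1 + avg_words / term_totals[j])
            for j in range(n_terms)
        ])
    return scores

vocab = ["the", "film", "tennis"]
counts = [
    [50, 10, 0],  # hypothetical film topic
    [50, 0, 10],  # hypothetical tennis topic
]
scores = c_tf_idf(counts)

# Highest-scoring term per topic
top_terms = [vocab[max(range(len(vocab)), key=row.__getitem__)] for row in scores]
print(top_terms)
```

In this toy setup the very frequent word `the` still outranks the topic-specific words in both topics, because its high in-topic frequency outweighs the penalty for appearing everywhere.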

In [24]:
from bertopic import BERTopic

topic_model_baseline = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics_base, probs_base = topic_model_baseline.fit_transform(docs_train)

2025-11-20 13:16:42,528 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 39/39 [00:14<00:00,  2.65it/s]
2025-11-20 13:17:02,372 - BERTopic - Embedding - Completed ✓
2025-11-20 13:17:02,373 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-20 13:17:03,486 - BERTopic - Dimensionality - Completed ✓
2025-11-20 13:17:03,489 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-20 13:17:03,551 - BERTopic - Cluster - Completed ✓
2025-11-20 13:17:03,561 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-20 13:17:03,909 - BERTopic - Representation - Completed ✓


In [25]:
import pandas as pd

# Keep the baseline topic words; they are compared against the new ones later on
old_topics = topic_model_baseline.get_topics()

# Drop the outlier topic (-1) from the overview
info_df = topic_model_baseline.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    """Join the first six topic words into one comma-separated string."""
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

pd.set_option('display.max_colwidth', None)
display(display_table.head(5))

fig_map = topic_model_baseline.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline.visualize_barchart(top_n_topics=8)
fig_bar.show()

Unnamed: 0,Topic,Count,Top Keywords
1,0,225,"the, to, of, mr, and, he"
2,1,175,"the, to, and, in, he, we"
3,2,105,"film, the, and, of, in, for"
4,3,76,"the, music, and, in, of, song"
5,4,47,"the, in, to, open, was, his"


### Reproduction with a different representation model
Now we repeat the step with the `KeyBERTInspired` representation model. This should change the topic words substantially and make them cleaner and easier to interpret: ideally, looking at the keywords alone is enough to understand what a cluster is about. The key difference is that `KeyBERTInspired` does not rank words by frequency alone; it re-ranks candidate words by how semantically similar their embeddings are to the documents of the topic.
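
The re-ranking idea can be illustrated with a small sketch: score each candidate word by the cosine similarity between its embedding and a topic embedding, then sort. The two-dimensional vectors below are invented for illustration, and `cosine` and `rerank` are hypothetical helpers; the real model uses sentence-embedding vectors and more steps than shown here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(candidates, topic_vec):
    """Sort candidate words by similarity to the topic vector, best first."""
    scored = [(word, cosine(vec, topic_vec)) for word, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up embeddings: the second axis stands for "tennis-ness"
candidates = {
    "the": [1.0, 1.0],
    "tennis": [0.1, 1.0],
    "federer": [0.0, 0.9],
}
topic_vec = [0.0, 1.0]

ranked = rerank(candidates, topic_vec)
for word, score in ranked:
    print(f"{word:10s} {score:.3f}")
```

Here the generic word ends up last even though it could be the most frequent one, which mirrors the shift we expect in the topic keywords.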

In [26]:
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
topic_model_baseline.update_topics(docs_train, representation_model=representation_model)

In [27]:
info_df_enhanced = topic_model_baseline.get_topic_info()
clean_df_enhanced = info_df_enhanced[info_df_enhanced["Topic"] != -1].copy()

clean_df_enhanced["Top Keywords"] = clean_df_enhanced["Representation"].apply(clean_keywords)

display_table_enhanced = clean_df_enhanced[["Topic", "Count", "Top Keywords"]]

display(display_table_enhanced.head(5))

fig_map_enhanced = topic_model_baseline.visualize_topics()
fig_map_enhanced.show()

fig_bar_enhanced = topic_model_baseline.visualize_barchart(top_n_topics=8)
fig_bar_enhanced.show()

Unnamed: 0,Topic,Count,Top Keywords
1,0,225,"tory, blair, ukip, minister, tories, labour"
2,1,175,"rugby, liverpool, chelsea, united, england, arsenal"
3,2,105,"oscar, oscars, nominations, nominated, actress, awards"
4,3,76,"singer, songs, rap, u2, awards, rock"
5,4,47,"federer, tennis, wimbledon, mirza, tournament, roddick"


### Evaluation of the second part
We can see a clear improvement in the words representing each topic. Stop words such as "the", "to", "of", "and" and the title "mr" are gone, and each topic is now described by words with a clear meaning. For example, the topic with the keywords "federer", "tennis", "wimbledon", "mirza", "tournament" most likely corresponds to a sports cluster, specifically tennis.

In [28]:
import pandas as pd

new_topics = topic_model_baseline.get_topics()
new_keywords = {}
for topic_id, topic_list in new_topics.items():
    words = [pair[0] for pair in topic_list[:5]]
    new_keywords[topic_id] = ", ".join(words)

old_keywords = {}
for topic_id, topic_list in old_topics.items():
    words = [pair[0] for pair in topic_list[:5]]
    old_keywords[topic_id] = ", ".join(words)

df_compare = pd.DataFrame({
    "Topic": list(new_keywords.keys()),
    "Baseline Keywords": [old_keywords[t] for t in new_keywords.keys()],
    "Improved Keywords": [new_keywords[t] for t in new_keywords.keys()]
})

df_compare = df_compare[df_compare["Topic"] != -1].sort_values("Topic")

pd.set_option('display.max_colwidth', None)
print("### SIDE-BY-SIDE COMPARISON ###")
display(df_compare.head(10))

### SIDE-BY-SIDE COMPARISON ###


Unnamed: 0,Topic,Baseline Keywords,Improved Keywords
1,0,"the, to, of, mr, and","tory, blair, ukip, minister, tories"
2,1,"the, to, and, in, he","rugby, liverpool, chelsea, united, england"
3,2,"film, the, and, of, in","oscar, oscars, nominations, nominated, actress"
4,3,"the, music, and, in, of","singer, songs, rap, u2, awards"
5,4,"the, in, to, open, was","federer, tennis, wimbledon, mirza, tournament"
6,5,"yukos, the, oil, of, to","yukos, yushchenko, yuganskneftegas, bankruptcy, yugansk"
7,6,"in, the, indoor, race, she","olympic, athletics, marathon, holmes, qualifying"
8,7,"security, to, the, of, that","security, phishing, criminals, viruses, antivirus"
9,8,"growth, economy, in, the, economic","economy, growth, economic, unemployment, fed"
10,9,"deutsche, lse, boerse, the, its","euronext, lse, euros, boerse, exchange"


### Results and Evaluation
The table above puts the results side by side. We expected the **Baseline** keywords to be more generic and the **Improved** keywords to be more specific and descriptive.

The table confirms this expectation.

* **Baseline (c-TF-IDF):** The keywords are dominated by stop words and other generic tokens such as `the`, `mr`, `to`, `and`. These words appear frequently in the articles, so the frequency-based scoring picks them up, but they don't tell us what the news is actually about.
* **Improved (KeyBERTInspired):** The keywords change to specific names and subjects. For example, in the Sports topic the words change from the generic `the, to, and, in, he` to the specific `rugby, liverpool, chelsea`. In the Politics topic, we now see `tory, blair, minister` instead of `the, to, of, mr, and`.
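
One rough way to put a number on this difference is the share of stop words among the top keywords of each topic. The sketch below uses the first five rows of the comparison table; the `STOP_WORDS` set is a small hand-picked list and `stop_word_share` is a hypothetical helper, so treat the result as an illustration rather than a formal metric.

```python
# Small hand-picked stop word list (an assumption, not a standard list)
STOP_WORDS = {"the", "to", "of", "and", "in", "he", "we", "for", "was", "his"}

def stop_word_share(keywords):
    """Fraction of comma-separated keywords that are stop words."""
    words = [w.strip() for w in keywords.split(",")]
    return sum(w in STOP_WORDS for w in words) / len(words)

# Topics 0-4 from the side-by-side comparison table
baseline = [
    "the, to, of, mr, and",
    "the, to, and, in, he",
    "film, the, and, of, in",
    "the, music, and, in, of",
    "the, in, to, open, was",
]
improved = [
    "tory, blair, ukip, minister, tories",
    "rugby, liverpool, chelsea, united, england",
    "oscar, oscars, nominations, nominated, actress",
    "singer, songs, rap, u2, awards",
    "federer, tennis, wimbledon, mirza, tournament",
]

baseline_share = [stop_word_share(k) for k in baseline]
improved_share = [stop_word_share(k) for k in improved]

print("baseline:", baseline_share)
print("improved:", improved_share)
```

With this list, at least four of the five baseline keywords per topic are stop words, while none of the improved keywords are.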

### Conclusion
We successfully reproduced the BERTopic approach on the BBC News dataset. The default c-TF-IDF representation turned out to be too vague here, because stop words dominated the top keywords. Switching to the `KeyBERTInspired` representation model fixed this issue.

This experiment shows the benefit of BERTopic's modular architecture. `update_topics` kept the same clusters (the topic counts are identical in both tables) and only changed the words used to present them to the user, which makes the model much more useful for real-world analysis.
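
The claim that the clusters stay the same can be checked by comparing the per-document topic assignments before and after the representation update. The sketch below is a minimal illustration with made-up assignment lists; `assignments_unchanged` and `topic_sizes` are hypothetical helpers. In the notebook, one could copy `topic_model_baseline.topics_` into a list before and after calling `update_topics` and pass both lists in.

```python
from collections import Counter

def assignments_unchanged(before, after):
    """True if every document has the same topic id in both runs."""
    return len(before) == len(after) and all(b == a for b, a in zip(before, after))

def topic_sizes(assignments):
    """Number of documents per topic id."""
    return dict(Counter(assignments))

# Made-up topic ids per document (-1 is the outlier topic)
before = [0, 1, 1, 2, -1, 0]
after = [0, 1, 1, 2, -1, 0]

same = assignments_unchanged(before, after)
sizes = topic_sizes(after)
print(same, sizes)
```

If the check fails, the topic words were not the only thing that changed and the two keyword tables are no longer directly comparable.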