# BERTopic improved representation

Topic models are good at finding clusters, but their labels often fail the human eye test because they prioritize frequent words over meaningful ones. This project examines that issue using a BBC News dataset. We start by building a standard BERTopic baseline (replicating the original paper's settings) and then upgrade the representation layer with a KeyBERT-inspired model. By decoupling the clustering from the labeling, we aim to show that semantic embeddings produce cleaner, more interpretable topics than plain frequency counts. We then validate this by computing Topic Coherence and Topic Diversity, the same metrics used in the paper.

In [1]:
!pip install -r requirements.txt



### Choosing the dataset
First we load the dataset. It comes from BBC News, but it is not the same BBC News dataset that was used in the paper.
This dataset has only 5 labels: __business, entertainment, politics, sport, tech__.

In [4]:
from datasets import load_dataset

dataset = load_dataset("SetFit/bbc-news")
docs_train = dataset["train"]["text"]
categories_train = dataset["train"]["label_text"]

### Reproduction on standard BERTopic
Now we train the standard BERTopic model. We set the language to English and provide the embedding model and count vectorizer the same way as in the paper. We pass no representation model, so topics are described by BERTopic's default c-TF-IDF weighting (`ClassTfidfTransformer`). Note that the clustering is unsupervised: we give the model no semantic boundaries for forming clusters and only set `nr_topics` and a few other parameters.
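
To make the default representation concrete, here is a minimal sketch of the c-TF-IDF weight as given in the BERTopic paper: `W(t, c) = tf(t, c) * log(1 + A / f(t))`, where `tf(t, c)` is the count of term `t` in cluster `c`, `f(t)` is its count over all clusters, and `A` is the average number of words per cluster. The function name `c_tf_idf` and the toy token lists are invented for illustration, and BERTopic's own `ClassTfidfTransformer` may differ in details such as normalization.

```python
import math
from collections import Counter

def c_tf_idf(cluster_tokens):
    """Score terms per cluster with W(t, c) = tf(t, c) * log(1 + A / f(t)).

    cluster_tokens maps a cluster id to the tokens of all its documents joined together.
    """
    counts = {c: Counter(tokens) for c, tokens in cluster_tokens.items()}
    # f(t): frequency of each term across all clusters
    total = Counter()
    for per_cluster in counts.values():
        total.update(per_cluster)
    # A: average number of words per cluster
    avg_words = sum(len(tokens) for tokens in cluster_tokens.values()) / len(cluster_tokens)
    return {
        c: {t: tf * math.log(1 + avg_words / total[t]) for t, tf in per_cluster.items()}
        for c, per_cluster in counts.items()
    }

# Toy clusters (illustrative tokens, not taken from the dataset)
toy = {
    "politics": "said labour said election party said".split(),
    "tech": "said mobile said broadband users said".split(),
}
scores = c_tf_idf(toy)
ranked_politics = sorted(scores["politics"], key=scores["politics"].get, reverse=True)
print(ranked_politics)
```

In this toy case `said` still ranks first in the politics cluster even though it is equally common in both clusters, because its high in-cluster count outweighs the penalty for being widespread.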

In [9]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-mpnet-base-v2")
vectorizer_model = CountVectorizer(stop_words="english")

topic_model_baseline = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="english",
    calculate_probabilities=True,
    verbose=True,
    nr_topics=10
)

topics_base, probs_base = topic_model_baseline.fit_transform(docs_train)

2025-12-04 20:45:48,589 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 39/39 [11:16<00:00, 17.35s/it]
2025-12-04 20:57:05,382 - BERTopic - Embedding - Completed ✓
2025-12-04 20:57:05,396 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-04 20:57:06,982 - BERTopic - Dimensionality - Completed ✓
2025-12-04 20:57:06,989 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-04 20:57:07,082 - BERTopic - Cluster - Completed ✓
2025-12-04 20:57:07,083 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-04 20:57:07,425 - BERTopic - Representation - Completed ✓
2025-12-04 20:57:07,426 - BERTopic - Topic reduction - Reducing number of topics
2025-12-04 20:57:07,441 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-04 20:57:07,626 - BERTopic - Representation - Completed ✓
2025-12-04 20:57:07,628 - BERTopic - Topic reduction - Redu

In [10]:
import pandas as pd

# Keep the baseline (c-TF-IDF) topics for the later comparison
old_topics = topic_model_baseline.get_topics()

info_df = topic_model_baseline.get_topic_info()
# Drop the outlier topic (-1)
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    """Join the first six representation words into one string."""
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

pd.set_option('display.max_colwidth', None)
display(display_table.head(5))

# Intertopic distance map
fig_map = topic_model_baseline.visualize_topics()
fig_map.show()

# Bar chart of the top words for the first 8 topics
fig_bar = topic_model_baseline.visualize_barchart(top_n_topics=8)
fig_bar.show()

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 357 | said, mr, government, labour, party, election |
| 1 | 207 | said, people, technology, mobile, users, digital |
| 2 | 173 | england, game, rugby, players, club, wales |
| 3 | 108 | said, company, yukos, firm, mr, oil |
| 4 | 78 | music, song, band, album, best, singer |


## Reproduction with different representation model
Now let's update the topics using the `KeyBERTInspired` representation model. This changes the words that describe each topic and should make their meaning cleaner and more direct. The key difference is that `KeyBERTInspired` ranks candidate words by their semantic similarity to the topic's documents rather than by frequency alone.
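
The idea behind this kind of re-ranking can be sketched as follows: embed the candidate words and the topic's documents, then order the words by cosine similarity to the topic embedding. This is a simplified illustration, not BERTopic's implementation; the function `rerank_by_similarity` and the 2-dimensional vectors are invented stand-ins for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank_by_similarity(word_vectors, topic_vector):
    """Return (word, similarity) pairs, most similar to the topic first."""
    scored = [(word, cosine(vec, topic_vector)) for word, vec in word_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-d "embeddings"; real ones would come from a sentence-transformer
word_vectors = {
    "said": [0.1, 0.9],
    "labour": [0.9, 0.2],
    "election": [0.8, 0.3],
}
# Stand-in for the mean embedding of a topic's documents
topic_vector = [1.0, 0.2]

ranked = rerank_by_similarity(word_vectors, topic_vector)
print([word for word, _ in ranked])
```

With these made-up vectors, a frequent but off-topic word such as `said` drops to the bottom of the ranking regardless of how often it occurs.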

In [11]:
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
topic_model_baseline.update_topics(docs_train, representation_model=representation_model)

In [12]:
info_df_enhanced = topic_model_baseline.get_topic_info()
clean_df_enhanced = info_df_enhanced[info_df_enhanced["Topic"] != -1].copy()

clean_df_enhanced["Top Keywords"] = clean_df_enhanced["Representation"].apply(clean_keywords)

display_table_enhanced = clean_df_enhanced[["Topic", "Count", "Top Keywords"]]

display(display_table_enhanced.head(5))

fig_map_enhanced = topic_model_baseline.visualize_topics()
fig_map_enhanced.show()

fig_bar_enhanced = topic_model_baseline.visualize_barchart(top_n_topics=8)
fig_bar_enhanced.show()

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 357 | labour, tory, blair, tories, election, uk |
| 1 | 207 | phones, technology, phone, broadband, apple, computer |
| 2 | 173 | rugby, liverpool, united, chelsea, arsenal, football |
| 3 | 108 | yukos, bankruptcy, russian, shareholders, russia, firms |
| 4 | 78 | songs, song, music, singer, album, u2 |


## Evaluation of the second part
We can see an improvement in the words representing each topic. In particular, the baseline model had the word `said` among the top keywords of several topics, and it no longer appears in the updated keywords. This reflects the shift from representing clusters by raw frequency to representing them by meaning.

In [13]:
import pandas as pd

new_topics = topic_model_baseline.get_topics()
new_keywords = {}
for topic_id, topic_list in new_topics.items():
    words = [pair[0] for pair in topic_list[:5]]
    new_keywords[topic_id] = ", ".join(words)

old_keywords = {}
for topic_id, topic_list in old_topics.items():
    words = [pair[0] for pair in topic_list[:5]]
    old_keywords[topic_id] = ", ".join(words)

df_compare = pd.DataFrame({
    "Topic": list(new_keywords.keys()),
    "Baseline Keywords": [old_keywords[t] for t in new_keywords.keys()],
    "Improved Keywords": [new_keywords[t] for t in new_keywords.keys()]
})

df_compare = df_compare[df_compare["Topic"] != -1].sort_values("Topic")

pd.set_option('display.max_colwidth', None)
print("### SIDE-BY-SIDE COMPARISON ###")
display(df_compare.head(10))

### SIDE-BY-SIDE COMPARISON ###


| Topic | Baseline Keywords | Improved Keywords |
|---|---|---|
| 0 | said, mr, government, labour, party | labour, tory, blair, tories, election |
| 1 | said, people, technology, mobile, users | phones, technology, phone, broadband, apple |
| 2 | england, game, rugby, players, club | rugby, liverpool, united, chelsea, arsenal |
| 3 | said, company, yukos, firm, mr | yukos, bankruptcy, russian, shareholders, russia |
| 4 | music, song, band, album, best | songs, song, music, singer, album |
| 5 | film, best, oscar, films, director | oscars, oscar, awards, nominees, award |
| 6 | olympic, indoor, world, race, year | iaaf, olympic, athletics, olympics, marathon |
| 7 | open, match, australian, tennis, seed | federer, wimbledon, tennis, agassi, roddick |
| 8 | comedy, tv, bbc, series, celebrity | eviction, itv1, celebrity, housemates, bbc |
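
The repetition of generic words can be checked directly from the top-5 keywords in the comparison table above. The sketch below copies those keyword lists by hand and counts words that occur in more than one topic; the helper name `words_in_multiple_topics` is made up for this illustration.

```python
from collections import Counter

# Top-5 keywords per topic, copied from the comparison table
baseline = [
    ["said", "mr", "government", "labour", "party"],
    ["said", "people", "technology", "mobile", "users"],
    ["england", "game", "rugby", "players", "club"],
    ["said", "company", "yukos", "firm", "mr"],
    ["music", "song", "band", "album", "best"],
    ["film", "best", "oscar", "films", "director"],
    ["olympic", "indoor", "world", "race", "year"],
    ["open", "match", "australian", "tennis", "seed"],
    ["comedy", "tv", "bbc", "series", "celebrity"],
]
improved = [
    ["labour", "tory", "blair", "tories", "election"],
    ["phones", "technology", "phone", "broadband", "apple"],
    ["rugby", "liverpool", "united", "chelsea", "arsenal"],
    ["yukos", "bankruptcy", "russian", "shareholders", "russia"],
    ["songs", "song", "music", "singer", "album"],
    ["oscars", "oscar", "awards", "nominees", "award"],
    ["iaaf", "olympic", "athletics", "olympics", "marathon"],
    ["federer", "wimbledon", "tennis", "agassi", "roddick"],
    ["eviction", "itv1", "celebrity", "housemates", "bbc"],
]

def words_in_multiple_topics(topics):
    """Map each word that occurs in more than one topic to its topic count."""
    counts = Counter(word for topic in topics for word in set(topic))
    return {word: n for word, n in counts.items() if n > 1}

print("Baseline:", words_in_multiple_topics(baseline))
print("Improved:", words_in_multiple_topics(improved))
```

On these lists the baseline repeats `said` (3 topics), `mr` and `best` (2 topics each), while the improved keywords repeat no word across topics.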


In [14]:
# Import the project's evaluation helpers
from calculate_t_coherence_and_diversity import evaluate_bertopic_pmi, TopicModelWrapper

# topic_model_baseline now holds the KeyBERT-inspired representation (after update_topics)
_, npm_imp, div_imp = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline,
    docs=docs_train,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="Improved"
)

# Wrap the c-TF-IDF topics saved before the update to score the baseline
baseline_model_wrapper = TopicModelWrapper(old_topics)

_, npm_base, div_base = evaluate_bertopic_pmi(
    topic_model=baseline_model_wrapper,
    docs=docs_train,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="Baseline"
)

results_df = pd.DataFrame({
    "Metric": ["Coherence (NPMI)", "Diversity"],
    "Baseline": [npm_base, div_base],
    "Improved (KeyBERT)": [npm_imp, div_imp]
})
display(results_df)

Improved Model - NPMI: 0.4031
Improved Model - Diversity: 0.3911
Baseline Model - NPMI: 0.2774
Baseline Model - Diversity: 0.3511


| Metric | Baseline | Improved (KeyBERT) |
|---|---|---|
| Coherence (NPMI) | 0.277397 | 0.403068 |
| Diversity | 0.351111 | 0.391111 |


### Results and Evaluation

To evaluate the impact of changing the representation model, we compared the Baseline (standard c-TF-IDF with stopword removal) against the Improved (KeyBERT-inspired) model. We used two metrics: Topic Coherence (NPMI), which measures semantic consistency, and Topic Diversity, which measures how unique the keywords are across topics.
We then put the results side by side. The expectation was that the **Baseline** keywords would be more generic and the **Improved** keywords more specific and descriptive.

As the table above shows, both metrics improved.

**Coherence**: The KeyBERT-inspired model achieved a coherence score of 0.403, outperforming the baseline (0.277). This indicates that the top words selected by KeyBERT are more semantically related to each other.
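
For reference, NPMI for a word pair is `log(p(a, b) / (p(a) * p(b))) / -log(p(a, b))`, which ranges from -1 (never together) to 1 (always together). The sketch below averages it over all word pairs of a topic using document-level co-occurrence. It is an assumption-laden illustration: the function name `npmi_coherence`, the toy documents, and the handling of edge cases are invented here, and the project's `evaluate_bertopic_pmi` helper may compute coherence differently.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, tokenized_docs):
    """Mean NPMI over all word pairs (expects at least two words)."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every given word
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # words never co-occur: lowest possible NPMI
        elif p12 == 1.0:
            scores.append(1.0)  # assumed convention for the degenerate 0/0 case
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Toy corpus (illustrative, not from the dataset)
docs = [
    ["labour", "election", "party"],
    ["labour", "election"],
    ["tennis", "federer"],
    ["tennis", "federer", "open"],
]
print(npmi_coherence(["labour", "election"], docs))
print(npmi_coherence(["labour", "tennis"], docs))
```

Words that always appear together score near 1, and words that never share a document score -1, so a topic mixing unrelated words is pulled toward a lower average.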

**Diversity**: The diversity score rose slightly, from 0.351 to 0.391. This suggests that while the KeyBERT-inspired model changes which words are selected, it does not make the topics more repetitive than the baseline.
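
Topic Diversity is commonly defined as the share of unique words among the top-k words of all topics. A minimal sketch under that assumption follows; the function name `topic_diversity` and the toy topics are hypothetical, and the project's helper may differ in details.

```python
def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of every topic."""
    words = [word for topic in topics for word in topic[:top_k]]
    return len(set(words)) / len(words)

# Toy topics (illustrative): "said" is shared, so 5 of 6 words are unique
toy_topics = [
    ["said", "mr", "labour"],
    ["said", "mobile", "users"],
]
diversity = topic_diversity(toy_topics, top_k=3)
print(round(diversity, 4))
```

A value of 1.0 means no word is shared between topics, and lower values mean more overlap.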

### Conclusion
This experiment demonstrates that while the clustering mechanism of BERTopic (finding which documents belong together) is powerful, the default c-TF-IDF representation can favor frequent but generic words.

By applying a KeyBERT-inspired model, we:
- Increased the NPMI coherence of the topics by more than 0.12 (from 0.277 to 0.403).
- Generated topic labels that are more descriptive and more useful for human analysts.