# BERTopic improved representation

In [10]:
!pip install -r requirements.txt

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting sklearn (from -r requirements.txt (line 5))
  Using cached sklearn-0.0.post12.tar.gz (2.6 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.

      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package u

### Choosing dataset
First, we load the chosen dataset. It comes from BBC News, but it is not the same BBC dataset that was used in the paper.

This dataset has only 5 labels. They are: __business, entertainment, politics, sport, tech__

In [11]:
from datasets import load_dataset

dataset = load_dataset("SetFit/bbc-news")
docs_train = dataset["train"]["text"]
categories_train = dataset["train"]["label_text"]
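
As a quick sanity check on the five labels, one could look at how the documents are spread across them. The helper below is only an illustrative sketch: the name `label_distribution` and the toy labels are assumptions, not part of the original notebook, and the real class counts are not shown here. In the notebook it would be called on `categories_train`.

```python
from collections import Counter


def label_distribution(labels):
    """Return (label, count, share) tuples sorted by descending count."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Toy labels standing in for `categories_train` (hypothetical data).
toy_labels = ["sport", "tech", "sport", "business", "politics", "sport", "entertainment"]

distribution = label_distribution(toy_labels)
for label, n, share in distribution:
    print(f"{label:<15}{n:>4}{share:>8.1%}")
```

A strongly skewed distribution would be worth knowing about before comparing topic sizes with the original categories.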

### Reproduction on standard BERTopic
Now we train a standard BERTopic model. We choose English as the language and provide the embedding model and the count vectorizer in the same way as in the paper. No representation model is passed, so the topics are described by the default c-TF-IDF weighting (`ClassTfidfTransformer`).
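
To make the default representation more concrete, here is a minimal sketch of the c-TF-IDF idea, assuming the commonly documented form `tf(t, c) * log(1 + A / f(t))`, where `A` is the average number of words per topic and `f(t)` is the total frequency of term `t`. The function `c_tf_idf` and the toy counts are illustrative assumptions; the library's `ClassTfidfTransformer` offers further options that are not reproduced here.

```python
import numpy as np


def c_tf_idf(counts):
    """Simplified c-TF-IDF for a (n_topics, n_terms) matrix of raw term counts.

    Assumes every term occurs at least once and every topic is non-empty.
    """
    counts = np.asarray(counts, dtype=float)
    # Term frequency inside each topic, normalized by the topic's size.
    tf = counts / counts.sum(axis=1, keepdims=True)
    # A: average number of words per topic; term_freq: frequency over all topics.
    avg_words = counts.sum() / counts.shape[0]
    term_freq = counts.sum(axis=0)
    idf = np.log(1 + avg_words / term_freq)
    return tf * idf


# Hypothetical counts for the terms ["said", "election", "rugby"] in two topics.
terms = ["said", "election", "rugby"]
toy_counts = [[80, 20, 0],
              [80, 0, 20]]

scores = c_tf_idf(toy_counts)
for topic_id, row in enumerate(scores):
    ranked = [terms[i] for i in np.argsort(-row)]
    print(topic_id, ranked)
```

In this toy setting a word that is very frequent in every topic still ends up ranked first in both, which is the kind of behavior the rest of this notebook tries to reduce.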

In [18]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-mpnet-base-v2")
vectorizer_model = CountVectorizer(stop_words="english")

topic_model_baseline = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="english",
    calculate_probabilities=True,
    verbose=True,
    nr_topics=10
)

topics_base, probs_base = topic_model_baseline.fit_transform(docs_train)

2025-12-02 15:35:37,477 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 39/39 [01:38<00:00,  2.53s/it]
2025-12-02 15:37:16,236 - BERTopic - Embedding - Completed ✓
2025-12-02 15:37:16,249 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-02 15:37:17,883 - BERTopic - Dimensionality - Completed ✓
2025-12-02 15:37:17,887 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-02 15:37:18,021 - BERTopic - Cluster - Completed ✓
2025-12-02 15:37:18,022 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-02 15:37:18,261 - BERTopic - Representation - Completed ✓
2025-12-02 15:37:18,262 - BERTopic - Topic reduction - Reducing number of topics
2025-12-02 15:37:18,281 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-02 15:37:18,459 - BERTopic - Representation - Completed ✓
2025-12-02 15:37:18,460 - BERTopic - Topic reduction - Redu

In [19]:
import pandas as pd

# Save the baseline topics for the later side-by-side comparison.
old_topics = topic_model_baseline.get_topics()

info_df = topic_model_baseline.get_topic_info()
# Drop the outlier topic (-1).
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    """Join the first six keywords of a topic into one string."""
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

pd.set_option('display.max_colwidth', None)
display(display_table.head(5))

fig_map = topic_model_baseline.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline.visualize_barchart(top_n_topics=8)
fig_bar.show()

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 362 | said, mr, government, labour, party, election |
| 1 | 173 | england, game, rugby, players, club, wales |
| 2 | 172 | people, said, users, mobile, technology, digital |
| 3 | 106 | said, company, yukos, firm, deutsche, oil |
| 4 | 101 | music, song, best, band, album, said |


### Reproduction with a different representation model
Now let's update the topics using the KeyBERTInspired representation model. This should change the topic words and make their meaning cleaner and more straightforward. The key difference is that KeyBERTInspired ranks candidate words by how semantically close they are to the topic's documents rather than by frequency alone.
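
The re-ranking idea can be sketched in a few lines: candidate words are scored by the cosine similarity between their embedding and an embedding of the topic, and the most similar words are kept. This is a simplified illustration, not the library's implementation. The function `rerank_by_similarity` and the two-dimensional vectors are made up for the example; in practice the vectors would come from the sentence embedding model.

```python
import numpy as np


def rerank_by_similarity(candidates, word_vecs, topic_vec, top_n=3):
    """Order candidate words by cosine similarity to a topic embedding."""
    word_vecs = np.asarray(word_vecs, dtype=float)
    topic_vec = np.asarray(topic_vec, dtype=float)
    norms = np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(topic_vec)
    sims = (word_vecs @ topic_vec) / norms
    order = np.argsort(-sims)[:top_n]
    return [(candidates[i], float(sims[i])) for i in order]


# Hypothetical candidates and made-up 2-d embeddings (not real model output).
candidates = ["said", "labour", "election", "mr"]
word_vecs = [[0.1, 0.9], [0.9, 0.2], [0.8, 0.3], [0.2, 0.8]]
topic_vec = [1.0, 0.1]

ranked = rerank_by_similarity(candidates, word_vecs, topic_vec)
for word, sim in ranked:
    print(f"{word:<10}{sim:.3f}")
```

With these toy vectors the frequent but generic word `said` drops out of the top three because its vector points away from the topic vector, regardless of how often it occurs.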

In [20]:
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
topic_model_baseline.update_topics(docs_train, representation_model=representation_model)

In [21]:
info_df_enhanced = topic_model_baseline.get_topic_info()
clean_df_enhanced = info_df_enhanced[info_df_enhanced["Topic"] != -1].copy()

clean_df_enhanced["Top Keywords"] = clean_df_enhanced["Representation"].apply(clean_keywords)

display_table_enhanced = clean_df_enhanced[["Topic", "Count", "Top Keywords"]]

display(display_table_enhanced.head(5))

fig_map_enhanced = topic_model_baseline.visualize_topics()
fig_map_enhanced.show()

fig_bar_enhanced = topic_model_baseline.visualize_barchart(top_n_topics=8)
fig_bar_enhanced.show()

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 362 | blair, tory, labour, tories, election, party |
| 1 | 173 | rugby, liverpool, united, chelsea, arsenal, football |
| 2 | 172 | phones, technology, broadband, phone, internet, music |
| 3 | 106 | yukos, bankruptcy, russian, shareholders, russia, firms |
| 4 | 101 | songs, music, singer, album, awards, song |


### Evaluation of the second part
The words representing each topic have improved. In the baseline model, the generic word `said` appeared among the top keywords of four of the five topics shown above; after the update it appears in none of them. This reflects a shift from representing the clusters by word frequency alone to representing them by meaning.

In [22]:
import pandas as pd

new_topics = topic_model_baseline.get_topics()
new_keywords = {}
for topic_id, topic_list in new_topics.items():
    words = [pair[0] for pair in topic_list[:5]]
    new_keywords[topic_id] = ", ".join(words)

old_keywords = {}
for topic_id, topic_list in old_topics.items():
    words = [pair[0] for pair in topic_list[:5]]
    old_keywords[topic_id] = ", ".join(words)

df_compare = pd.DataFrame({
    "Topic": list(new_keywords.keys()),
    "Baseline Keywords": [old_keywords[t] for t in new_keywords.keys()],
    "Improved Keywords": [new_keywords[t] for t in new_keywords.keys()]
})

df_compare = df_compare[df_compare["Topic"] != -1].sort_values("Topic")

pd.set_option('display.max_colwidth', None)
print("### SIDE-BY-SIDE COMPARISON ###")
display(df_compare.head(10))

### SIDE-BY-SIDE COMPARISON ###


| Topic | Baseline Keywords | Improved Keywords |
|---|---|---|
| 0 | said, mr, government, labour, party | blair, tory, labour, tories, election |
| 1 | england, game, rugby, players, club | rugby, liverpool, united, chelsea, arsenal |
| 2 | people, said, users, mobile, technology | phones, technology, broadband, phone, internet |
| 3 | said, company, yukos, firm, deutsche | yukos, bankruptcy, russian, shareholders, russia |
| 4 | music, song, best, band, album | songs, music, singer, album, awards |
| 5 | film, best, oscar, director, actor | oscars, oscar, awards, nominees, award |
| 6 | olympic, indoor, world, race, year | iaaf, olympic, athletics, olympics, marathon |
| 7 | open, match, australian, seed, tennis | federer, wimbledon, tennis, agassi, roddick |
| 8 | games, game, nintendo, ds, gaming | nintendo, playstation, psp, consoles, sony |
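
The overlap between topics can also be counted directly from the comparison. The sketch below copies the keyword strings from the comparison table and lists every word that appears in more than one topic; the helper name `shared_keywords` is an assumption introduced for this illustration.

```python
from collections import Counter

# Keyword strings copied from the side-by-side comparison (top 5 per topic).
baseline = {
    0: "said, mr, government, labour, party",
    1: "england, game, rugby, players, club",
    2: "people, said, users, mobile, technology",
    3: "said, company, yukos, firm, deutsche",
    4: "music, song, best, band, album",
    5: "film, best, oscar, director, actor",
    6: "olympic, indoor, world, race, year",
    7: "open, match, australian, seed, tennis",
    8: "games, game, nintendo, ds, gaming",
}
improved = {
    0: "blair, tory, labour, tories, election",
    1: "rugby, liverpool, united, chelsea, arsenal",
    2: "phones, technology, broadband, phone, internet",
    3: "yukos, bankruptcy, russian, shareholders, russia",
    4: "songs, music, singer, album, awards",
    5: "oscars, oscar, awards, nominees, award",
    6: "iaaf, olympic, athletics, olympics, marathon",
    7: "federer, wimbledon, tennis, agassi, roddick",
    8: "nintendo, playstation, psp, consoles, sony",
}


def shared_keywords(keywords_by_topic):
    """Map each word that occurs in more than one topic to its topic count."""
    counts = Counter()
    for keywords in keywords_by_topic.values():
        counts.update({word.strip() for word in keywords.split(",")})
    return {word: n for word, n in counts.items() if n > 1}


print("Baseline:", shared_keywords(baseline))
print("Improved:", shared_keywords(improved))
```

Exact string matching is used here, so near-duplicates such as `oscar` and `oscars` count as different words.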


In [23]:
# Import the project's evaluation helper
from calculate_t_coherence_and_diversity import evaluate_bertopic_pmi

df_imp, npm_imp, div_imp = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline,
    docs=docs_train,
    top_k_coherence=10,
    top_k_diversity=25
)

print(f"Improved Model - NPMI: {npm_imp:.4f}")
print(f"Improved Model - Diversity: {div_imp:.4f}")

class TopicModelWrapper:
    def __init__(self, topics):
        self.topics = topics

    def get_topics(self):
        return self.topics

baseline_model_wrapper = TopicModelWrapper(old_topics)

df_base, npm_base, div_base = evaluate_bertopic_pmi(
    topic_model=baseline_model_wrapper,
    docs=docs_train,
    top_k_coherence=10,
    top_k_diversity=25
)

print(f"Baseline Model - NPMI: {npm_base:.4f}")
print(f"Baseline Model - Diversity: {div_base:.4f}")

results_df = pd.DataFrame({
    "Metric": ["Coherence (NPMI)", "Diversity"],
    "Baseline": [npm_base, div_base],
    "Improved (KeyBERT)": [npm_imp, div_imp]
})
display(results_df)

Improved Model - NPMI: 0.4373
Improved Model - Diversity: 0.3867
Baseline Model - NPMI: 0.3087
Baseline Model - Diversity: 0.3556


| Metric | Baseline | Improved (KeyBERT) |
|---|---|---|
| Coherence (NPMI) | 0.308695 | 0.437273 |
| Diversity | 0.355556 | 0.386667 |


### Results and Evaluation

To evaluate the impact of changing the representation model, we compared the Baseline (standard c-TF-IDF with stop-word removal) against the Improved (KeyBERT-inspired) model. We used two standard metrics: Topic Coherence (NPMI), which measures how semantically consistent a topic's top words are, and Topic Diversity, which measures how unique the keywords are across topics.
Placing the results side by side shows the difference. We expected the **Baseline** keywords to be more generic and the **Improved** keywords to be more specific and descriptive.

As the table above shows, the KeyBERT-inspired representation scores higher on both metrics.

**Coherence Surge**: The KeyBERT-inspired model achieved a coherence score of 0.437, outperforming the baseline (0.309). This indicates that the top words selected by KeyBERT are more semantically related to each other.
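
For readers unfamiliar with NPMI, the sketch below shows one common way to compute it: for each pair of top words, the pointwise mutual information is normalized by `-log P(w1, w2)` and the pair scores are averaged. It assumes document-level co-occurrence and whitespace tokenization. The internals of `evaluate_bertopic_pmi` are not shown in this notebook and may differ, so the functions `npmi_pair` and `topic_npmi` and the toy documents are assumptions.

```python
import math
from itertools import combinations


def npmi_pair(w1, w2, doc_sets):
    """NPMI of two words based on document-level co-occurrence."""
    n = len(doc_sets)
    p1 = sum(w1 in d for d in doc_sets) / n
    p2 = sum(w2 in d for d in doc_sets) / n
    p12 = sum(w1 in d and w2 in d for d in doc_sets) / n
    if p12 == 0:
        return -1.0  # the words never co-occur
    if p12 == 1:
        return 1.0  # both words are in every document; avoids dividing by zero
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_npmi(words, docs):
    """Average NPMI over all pairs of a topic's top words."""
    doc_sets = [set(doc.lower().split()) for doc in docs]
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(npmi_pair(a, b, doc_sets) for a, b in pairs) / len(pairs)


# Hypothetical mini corpus, not taken from the BBC dataset.
toy_docs = [
    "labour wins election",
    "labour party election campaign",
    "rugby match england",
    "england rugby squad",
]

coherent = topic_npmi(["labour", "election"], toy_docs)
incoherent = topic_npmi(["labour", "rugby"], toy_docs)
print(coherent, incoherent)
```

Scores range from -1 (words never appear together) to 1 (words always appear together), which is why a higher average is read as a more coherent topic.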

**Stability in Diversity**: The diversity score rose only slightly, from about 0.36 to about 0.39. This suggests that KeyBERT changes which words are selected without making the topics more repetitive than in the baseline.
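
Topic diversity is usually defined as the share of unique words among the top words of all topics. The sketch below follows that definition on three topics taken from the comparison table; the helper `topic_diversity` is an assumption, and the project's `evaluate_bertopic_pmi` may compute the metric differently.

```python
def topic_diversity(topics, top_k=25):
    """Share of unique words among the top_k words of all topics."""
    words = [word for topic in topics for word in topic[:top_k]]
    return len(set(words)) / len(words) if words else 0.0


# Topics 0, 2 and 3 from the comparison table (top 5 keywords each).
baseline_topics = [
    ["said", "mr", "government", "labour", "party"],
    ["people", "said", "users", "mobile", "technology"],
    ["said", "company", "yukos", "firm", "deutsche"],
]
improved_topics = [
    ["blair", "tory", "labour", "tories", "election"],
    ["phones", "technology", "broadband", "phone", "internet"],
    ["yukos", "bankruptcy", "russian", "shareholders", "russia"],
]

div_baseline = topic_diversity(baseline_topics)
div_improved = topic_diversity(improved_topics)
print(f"{div_baseline:.3f} {div_improved:.3f}")
```

One design choice worth noting: this sketch divides by the number of words actually collected. BERTopic keeps 10 words per topic by default, so with `top_k=25` an implementation that divides by `top_k` times the number of topics would report lower values than this one.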

### Conclusion
This experiment demonstrates that while the clustering mechanism of BERTopic (finding which documents belong together) is powerful, the default c-TF-IDF representation can sometimes over-index on frequent but generic words.

By applying a KeyBERT-inspired model, we successfully:
- Increased the semantic coherence of the topics by about 0.13 points (NPMI from 0.309 to 0.437).
- Generated topic labels that are more descriptive and actionable for human analysts.