# **BERTopic on German & French Amazon Reviews**
The following notebook:
- trains BERTopic models on German and French Amazon reviews
- inspects the discovered topics using a KeyBERTInspired representation model

In [9]:
!pip install -r requirements.txt



In [10]:
from datasets import load_dataset

ds_german = load_dataset("SetFit/amazon_reviews_multi_de")
ds_french = load_dataset("SetFit/amazon_reviews_multi_fr")
docs_train_german = ds_german["train"]["text"]
categories_train_german = ds_german["train"]["label_text"]

docs_train_french = ds_french["train"]["text"]
categories_train_french = ds_french["train"]["label_text"]

Repo card metadata block was not found. Setting CardData to empty.


### **Set up**

BERTopic is used here for unsupervised topic modeling of the reviews. We use the KeyBERTInspired representation model because it produced better topic labels in the previous notebook. The German and French models each get their own representation model instance.

##### **Baseline BERTopic model for German reviews**

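Here, we train a BERTopic model on German Amazon reviews. We set `language="multilingual"`, reduce the result to 8 topics (`nr_topics=8`) and require a minimum topic size of 30 documents. Note that BERTopic does not use the `language` argument when an explicit `embedding_model` is passed, so the embeddings below come from `all-mpnet-base-v2`, a sentence-transformers model trained mainly on English text. Likewise, the vectorizer only removes English stop words.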

In [11]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("all-mpnet-base-v2")
vectorizer_model = CountVectorizer(stop_words="english")
representation_model_german = KeyBERTInspired()

topic_model_baseline_german = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=8,
    representation_model=representation_model_german
)

topics_base_german, probs_base_german = topic_model_baseline_german.fit_transform(docs_train_german[:50000])  # all 200k documents would take more than 90 min on a T4 Colab GPU

2025-12-04 19:00:20,350 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/1563 [00:00<?, ?it/s]

2025-12-04 19:04:19,113 - BERTopic - Embedding - Completed ✓
2025-12-04 19:04:19,114 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-04 19:05:36,202 - BERTopic - Dimensionality - Completed ✓
2025-12-04 19:05:36,205 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-04 19:08:33,952 - BERTopic - Cluster - Completed ✓
2025-12-04 19:08:33,953 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-04 19:08:35,546 - BERTopic - Representation - Completed ✓
2025-12-04 19:08:35,548 - BERTopic - Topic reduction - Reducing number of topics
2025-12-04 19:08:35,646 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-04 19:08:38,222 - BERTopic - Representation - Completed ✓
2025-12-04 19:08:38,238 - BERTopic - Topic reduction - Reduced number of topics from 263 to 8


##### **Baseline BERTopic model for French reviews**

We repeat the same setup for the French Amazon reviews, again with `language="multilingual"`, the same embedding model and vectorizer, and a KeyBERTInspired representation.

In [12]:
representation_model_french = KeyBERTInspired()

topic_model_baseline_french = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=8,
    representation_model=representation_model_french
)

topics_base_french, probs_base_french = topic_model_baseline_french.fit_transform(docs_train_french[:50000])

2025-12-04 19:15:50,083 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/1563 [00:00<?, ?it/s]

2025-12-04 19:18:51,509 - BERTopic - Embedding - Completed ✓
2025-12-04 19:18:51,510 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-04 19:19:44,078 - BERTopic - Dimensionality - Completed ✓
2025-12-04 19:19:44,081 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-04 19:21:28,433 - BERTopic - Cluster - Completed ✓
2025-12-04 19:21:28,434 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-04 19:21:29,602 - BERTopic - Representation - Completed ✓
2025-12-04 19:21:29,604 - BERTopic - Topic reduction - Reducing number of topics
2025-12-04 19:21:29,690 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-04 19:21:31,416 - BERTopic - Representation - Completed ✓
2025-12-04 19:21:31,428 - BERTopic - Topic reduction - Reduced number of topics from 224 to 8


### **Inspecting computed topics for German**

In the following part, we retrieve the topic summary for the German model, clean it by removing the outlier topic (-1), and construct a compact table with topic id, document count and the top representative keywords.

In [13]:
import pandas as pd

# Topic overview without the outlier topic (-1)
info_df = topic_model_baseline_german.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    """Join the first six representative keywords into one string."""
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

# Show full keyword strings instead of truncating the column
pd.set_option('display.max_colwidth', None)
display(display_table.head(5))

| | Topic | Count | Top Keywords |
|---|---|---|---|
| 1 | 0 | 35791 | funktioniert, gekauft, ich, diese, verkäufer, oder |
| 2 | 1 | 844 | folie, folien, display, einfach, displays, schlecht |
| 3 | 2 | 532 | hülle, hüllen, rückseite, keine, zurück, hatte |
| 4 | 3 | 354 | hose, hosen, oder, pumpe, pumpen, der |
| 5 | 4 | 201 | matte, matten, nichts, nicht, rutschfest, gekauft |


The generated topics are already quite coherent and meaningful. Most topics contain keywords that are often found in similar contexts (e.g. film, tv and story), which suggests that the embedding and the KeyBERTInspired representation work well and that the model captures semantic patterns.
The keyword clusters also hint at the general themes, which are:
- product quality and functionality
- delivery
- electronics, hardware
- movies, entertainment
- audio

Nevertheless, the keywords also include sentiment words (schlecht, 'bad'), which do not describe a topic but are typical for reviews, and function words (ich, diese, oder), which most likely survive because the vectorizer only removes English stop words. The top keywords also contain different word forms of the same stem (e.g. kaufen, 'to buy', and gekauft, 'bought'). Applying lemmatization before training could reduce this and result in cleaner topic keywords.
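The two cleanup ideas mentioned here can be sketched on the keyword lists from the table. This is a minimal illustration, not what the notebook runs: the function name `tidy_keywords` and the stop word set are hypothetical, the set is a tiny hand-picked sample, and the prefix rule is only a crude stand-in for real lemmatization.

```python
# Hypothetical, hand-picked sample; a real run would use a full German stop word list.
GERMAN_STOP_WORDS = {"ich", "diese", "oder", "der", "nicht", "nichts", "keine"}

def tidy_keywords(keywords, stop_words=GERMAN_STOP_WORDS):
    """Drop stop words, then drop words that merely extend an earlier keyword."""
    kept = []
    for word in keywords:
        lower = word.lower()
        if lower in stop_words:
            continue
        # Crude stand-in for lemmatization: "folien" extends "folie", so skip it.
        if any(lower.startswith(prev) for prev in kept):
            continue
        kept.append(lower)
    return kept

# Keyword lists copied from the German topic table (topics 1 and 3)
topic_1 = ["folie", "folien", "display", "einfach", "displays", "schlecht"]
topic_3 = ["hose", "hosen", "oder", "pumpe", "pumpen", "der"]

print(tidy_keywords(topic_1))  # ['folie', 'display', 'einfach', 'schlecht']
print(tidy_keywords(topic_3))  # ['hose', 'pumpe']
```

The prefix rule would not catch pairs such as kaufen and gekauft, and it can wrongly merge unrelated words that share a prefix, which is why a proper lemmatizer is the better choice for the actual pipeline.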


#### **Visualization of German topic space**

We visualize the German topics in a 2D projection to see how well-separated they are. In the bar chart, we can inspect their relative sizes.

In [14]:
fig_map = topic_model_baseline_german.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_german.visualize_barchart(top_n_topics=8)
fig_bar.show()

The **Intertopic Distance Map** shows how semantically different the topics are from each other. Each bubble represents one topic, and its size corresponds to the number of documents assigned to it. The map shows that Topic 0 is much larger than all others (35,791 documents according to the table above). Its central position also indicates that the other topics share semantically similar vocabulary with it.

The **bar chart** shows the most representative keywords BERTopic assigns to each topic. Overall, the topic structures seem meaningful and coherent, but there are also some redundant words and noise, such as "deutscher", "deutsches" and "deutsche". Still, no clusters seem to overlap strongly, and the keywords are well distributed and describe the topics in a meaningful way.
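To put the size imbalance between the topics into numbers, the counts from the German topic table can be turned into shares of the 50,000 training documents. This is a small arithmetic sketch; only the five topics shown in the table are included, so the shares do not sum to 100% (the outlier topic and the remaining topics are missing).

```python
# Counts copied from the German topic table (top five topics only)
topic_counts = {0: 35791, 1: 844, 2: 532, 3: 354, 4: 201}
n_docs = 50_000  # size of the training slice docs_train_german[:50000]

# Share of all training documents assigned to each topic
shares = {topic: count / n_docs for topic, count in topic_counts.items()}

for topic, share in shares.items():
    print(f"Topic {topic}: {share:.1%}")
# Topic 0: 71.6%
# Topic 1: 1.7%
# Topic 2: 1.1%
# Topic 3: 0.7%
# Topic 4: 0.4%
```

With roughly 72% of the documents in Topic 0, most reviews end up in a single generic cluster, which is worth keeping in mind when reading the distance map.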

### **Inspecting computed topics for French**

We repeat the same steps for the inspection of the topics generated by the French model by building a clean topic overview table with interpretable keyword representations.

In [15]:
import pandas as pd

# Topic overview without the outlier topic (-1)
info_df = topic_model_baseline_french.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    """Join the first six representative keywords into one string."""
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

# Show full keyword strings instead of truncating the column
pd.set_option('display.max_colwidth', None)
display(display_table.head(5))

| | Topic | Count | Top Keywords |
|---|---|---|---|
| 1 | 0 | 26618 | pas, autre, vraiment, avoir, fait, vous |
| 2 | 1 | 3845 | chargeur, autre, pas, fait, une, avoir |
| 3 | 2 | 970 | autre, entre, avec, vraiment, pas, correctement |
| 4 | 3 | 92 | hdmi, vga, cable, câble, 4k, adaptateur |
| 5 | 4 | 85 | jamais, je, répondu, déçu, decu, pas |


Our French BERTopic model also produces mostly coherent and meaningful topics. However, we again see morphological redundancy (français, française, france, francais), and topic 2 can be viewed as a very broad, generic negative-review topic.
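Part of this redundancy is purely orthographic: pairs such as câble/cable and déçu/decu differ only in their accents. The sketch below merges such variants using the standard library's `unicodedata` module. The function names are hypothetical and this step is not part of the notebook's pipeline; it is only meant to show one cheap normalization that could be applied before or after training.

```python
import unicodedata

def strip_accents(word):
    """Return the word with combining accent marks removed (câble -> cable)."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def merge_accent_variants(keywords):
    """Keep the first spelling of each keyword, ignoring case and accents."""
    seen = set()
    kept = []
    for word in keywords:
        key = strip_accents(word.lower())
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return kept

# Keyword lists copied from the French topic table (topics 3 and 4)
topic_3 = ["hdmi", "vga", "cable", "câble", "4k", "adaptateur"]
topic_4 = ["jamais", "je", "répondu", "déçu", "decu", "pas"]

print(merge_accent_variants(topic_3))  # ['hdmi', 'vga', 'cable', '4k', 'adaptateur']
print(merge_accent_variants(topic_4))  # ['jamais', 'je', 'répondu', 'déçu', 'pas']
```

Accent folding does not handle inflection (français vs. française), so it complements lemmatization rather than replacing it.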

#### **Visualization of the French topic space**

Here, we visualize the French topic landscape using the same projections and bar chart as for the German topics. This allows us to compare it qualitatively to the German topics.

In [16]:
fig_map = topic_model_baseline_french.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_french.visualize_barchart(top_n_topics=8)
fig_bar.show()

The **French Intertopic Distance Map** shows that the topics are well separated and do not overlap, which suggests that the model identifies distinct thematic groups within the reviews. We can see a large central cluster that represents a theme related to the general quality of products. The smaller, more distant clusters correspond to more niche themes, like electronics, batteries or language-related comments, which shows that the model also captures more specific product categories.
The second-biggest cluster, which consists of words expressing negative sentiment, is also positioned fairly centrally and was probably created because of an overlap between some of the review themes (e.g. negative sentiment x bad product features).


The **French bar chart** visualizes the most representative keywords for each topic of the French reviews and again shows clear and well-structured themes. Some topics show better internal consistency than others. We can again see repeated morphological variants, like the singular and plural or masculine and feminine forms of the same word, which could be reduced by applying lemmatization.



In [18]:
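# Local helper module from this project (not part of BERTopic)
from calculate_t_coherence_and_diversity import evaluate_bertopic_pmi

# German model; its scores are stored under the *_imp names
_, npm_imp, div_imp = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_german,
    docs=docs_train_german,
    top_k_coherence=10,
    top_k_diversity=25,
    is_baseline=False
)

# French model; its scores are stored under the *_base names
_, npm_base, div_base = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_french,
    docs=docs_train_french,
    top_k_coherence=10,
    top_k_diversity=25,
    is_baseline=True
)

# As written, the "Baseline" column holds the French scores
# and the "Improved (KeyBERT)" column holds the German scores.
results_df = pd.DataFrame({
    "Metric": ["Coherence (NPMI)", "Diversity"],
    "Baseline": [npm_base, div_base],
    "Improved (KeyBERT)": [npm_imp, div_imp]
})
display(results_df)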

Baseline Model - NPMI: 0.1186
Baseline Model - Diversity: 0.3086
Improved Model - NPMI: 0.1125
Improved Model - Diversity: 0.2514


| | Metric | Baseline | Improved (KeyBERT) |
|---|---|---|---|
| 0 | Coherence (NPMI) | 0.112458 | 0.118582 |
| 1 | Diversity | 0.251429 | 0.308571 |
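The implementation of `evaluate_bertopic_pmi` is not shown in this notebook. For orientation, the sketch below uses common textbook definitions of the two metrics: topic diversity as the fraction of unique words among the top-k words of all topics, and NPMI coherence as the average normalized pointwise mutual information over word pairs, estimated from document-level co-occurrence. The function names, the whitespace tokenization and the toy data are assumptions for illustration; the project's helper may differ in any of these details.

```python
import math
from itertools import combinations

def topic_diversity(topics, top_k=25):
    """Fraction of unique words among the top_k words of all topics."""
    words = [w for topic in topics for w in topic[:top_k]]
    return len(set(words)) / len(words) if words else 0.0

def npmi_coherence(topics, docs, top_k=10):
    """Mean NPMI over word pairs per topic, averaged over topics.

    Assumes a non-empty list of docs and simple whitespace tokenization.
    """
    doc_sets = [set(doc.lower().split()) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents that contain all given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    topic_scores = []
    for topic in topics:
        pair_scores = []
        for w1, w2 in combinations(topic[:top_k], 2):
            p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
            if p12 == 0:
                pair_scores.append(-1.0)  # words never co-occur
            elif p12 == 1:
                pair_scores.append(1.0)   # limit case: both words in every document
            else:
                pair_scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
        if pair_scores:
            topic_scores.append(sum(pair_scores) / len(pair_scores))
    return sum(topic_scores) / len(topic_scores) if topic_scores else 0.0

# Toy data, invented for this example
docs = ["hülle passt gut", "hülle passt nicht",
        "folie display gut", "folie display schlecht"]
coherent_topics = [["hülle", "passt"], ["folie", "display"]]
mixed_topics = [["hülle", "folie"]]

print(npmi_coherence(coherent_topics, docs))  # close to 1.0: pairs always co-occur
print(npmi_coherence(mixed_topics, docs))     # -1.0: the pair never co-occurs
print(topic_diversity([["hülle", "passt"], ["hülle", "display"]]))  # 0.75
```

Under this definition of diversity, the values reported above would be consistent with 7 topics of 25 words each (0.251429 ≈ 44/175 and 0.308571 ≈ 54/175), although this is only an inference from the numbers.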


# Evaluation