### **BERTopic on German & French Amazon Reviews**
The following notebook:
- trains BERTopic models on German and French Amazon reviews
- inspects the discovered topics using a KeyBERTInspired representation model

In [2]:
from datasets import load_dataset

ds_german = load_dataset("SetFit/amazon_reviews_multi_de")
ds_french = load_dataset("SetFit/amazon_reviews_multi_fr")
docs_train_german = ds_german["train"]["text"]
categories_train_german = ds_german["train"]["label_text"]

docs_train_french = ds_french["train"]["text"]
categories_train_french = ds_french["train"]["label_text"]

Repo card metadata block was not found. Setting CardData to empty.
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Generating train split: 100%|██████████| 200000/200000 [00:00<00:00, 531028.41 examples/s]
Generating validation split: 100%|██████████| 5000/5000 [00:00<00:00, 220820.25 examples/s]
Generating test split: 100%|██████████| 5000/5000 [00:00<00:00, 212210.80 examples/s]
Generating train split: 100%|██████████| 200000/200000 [00:00<00:00, 652392.13 examples/s]
Generating validation split: 100%|██████████| 5000/5000 [00:00<00:00, 186669.04 examples/s]
Generating test s

In [4]:
!pip install bertopic
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired






### **Set up** 

BERTopic is used here for unsupervised topic modeling of the reviews. We use the KeyBERTInspired representation model so that topic keywords are selected by semantic similarity rather than by term frequency alone. The German and French models each get their own instance.
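To make the idea of similarity-based keywords concrete, here is a minimal sketch that ranks candidate words by cosine similarity to a topic vector. The three-dimensional vectors and the names `topic_vec`, `candidates` and `rank_keywords` are made up for illustration; this is not BERTopic's actual implementation, which works on real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d "embeddings", for illustration only
topic_vec = [0.9, 0.1, 0.0]
candidates = {
    "akku": [0.8, 0.2, 0.1],
    "kabel": [0.7, 0.1, 0.3],
    "gut": [0.1, 0.9, 0.2],
}

def rank_keywords(topic_vec, candidates, top_n=2):
    """Return the top_n candidate words most similar to the topic vector."""
    ranked = sorted(
        candidates,
        key=lambda word: cosine(topic_vec, candidates[word]),
        reverse=True,
    )
    return ranked[:top_n]

ranked = rank_keywords(topic_vec, candidates)
print(ranked)
```

In this toy setup, a frequent but off-topic word such as "gut" is ranked low because its vector points away from the topic vector, regardless of how often it occurs.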

##### **Baseline BERTopic model for German reviews**

Here, we train a BERTopic model on the first 40,000 German Amazon reviews of the training split. We use BERTopic's multilingual embedding setting (`language="multilingual"`), set the minimum topic size to 30 documents and reduce the result to 8 topics (`nr_topics=8`).

In [5]:
representation_model_german = KeyBERTInspired()
topic_model_baseline_german = BERTopic(
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=8,
    representation_model=representation_model_german
)
topics_base_german, probs_base_german = topic_model_baseline_german.fit_transform(docs_train_german[:40000])

2025-11-30 16:34:29,056 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 1250/1250 [14:18<00:00,  1.46it/s]
2025-11-30 16:49:21,076 - BERTopic - Embedding - Completed ✓
2025-11-30 16:49:21,078 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-30 16:50:33,611 - BERTopic - Dimensionality - Completed ✓
2025-11-30 16:50:33,617 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-30 16:51:58,726 - BERTopic - Cluster - Completed ✓
2025-11-30 16:

##### **Baseline BERTopic model for French reviews**

We repeat the same setup for the French Amazon reviews, again using a multilingual model and the KeyBERTInspired representation.
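Because both baselines are meant to use identical settings, the shared configuration could be kept in one place so the two models stay comparable. The following is only a sketch: `BASELINE_PARAMS` and `build_baseline_model` are hypothetical names that do not appear in the notebook; the parameter values are the ones used in the cells here.

```python
# Shared settings for both language models (values taken from this notebook)
BASELINE_PARAMS = {
    "language": "multilingual",
    "calculate_probabilities": True,
    "verbose": True,
    "min_topic_size": 30,
    "nr_topics": 8,
}

def build_baseline_model(params=BASELINE_PARAMS):
    """Create a fresh BERTopic model with its own KeyBERTInspired instance."""
    # Imported lazily so the configuration can be inspected without loading BERTopic
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired

    return BERTopic(representation_model=KeyBERTInspired(), **params)

# Possible usage (not executed here):
# topic_model_baseline_french = build_baseline_model()
```

Each call returns a new model with a new representation model, which matches the choice of separate instances per language.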

In [6]:
representation_model_french = KeyBERTInspired()

topic_model_baseline_french = BERTopic(
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=8,
    representation_model=representation_model_french
)
topics_base_french, probs_base_french = topic_model_baseline_french.fit_transform(docs_train_french[:40000])

2025-11-30 16:52:06,502 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 1250/1250 [12:53<00:00,  1.62it/s]
2025-11-30 17:05:06,992 - BERTopic - Embedding - Completed ✓
2025-11-30 17:05:06,994 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-30 17:05:20,922 - BERTopic - Dimensionality - Completed ✓
2025-11-30 17:05:20,929 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-30 17:06:26,927 - BERTopic - Cluster - Completed ✓
2025-11-30 17:06:26,931 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-11-30 17:06:28,551 - BERTopic - Representation - Completed ✓
2025-11-30 17:06:28,559 - BERTopic - Topic reduction - Reducing number of topics
2025-11-30 17:06:28,771 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-30 17:06:32,687 - BERTopic - Representation - Completed ✓
2025-11-30 17:06:32,701 - BERTopic - Topic reduction - 

##### **Inspecting computed topics for German**

In the following part, we retrieve the topic summary for the German model, clean it by removing the outlier topic (-1) and construct a compact table with topic id, document count and top representative keywords.

In [7]:
import pandas as pd

# Topic overview from the fitted model; drop the outlier topic (-1)
info_df = topic_model_baseline_german.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    """Join the first six representative keywords into one string."""
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

# Show full keyword strings instead of truncating the column
pd.set_option("display.max_colwidth", None)
display(display_table.head(5))

|   | Topic | Count | Top Keywords |
|---|-------|-------|--------------|
| 1 | 0 | 16765 | gekauft, produkt, kaufen, bestellt, kaputt, funktioniert |
| 2 | 1 | 4415 | amazon, lieferung, bestellen, geliefert, händler, verkäufer |
| 3 | 2 | 3555 | iphone, akku, batterie, batterien, kabel, app |
| 4 | 3 | 794 | film, schauspieler, schlecht, tv, filme, story |
| 5 | 4 | 551 | headset, sound, hören, mikrofon, musik, radio |


The generated topics are already quite coherent and meaningful. Most topics contain keywords that often occur in similar contexts (e.g. film, tv and story), which suggests that the embeddings and the KeyBERTInspired representation work well and that the model captures semantic patterns.
The keyword clusters also hint at the general themes, which are:
- product quality and functionality
- delivery
- electronics, hardware
- movies, entertainment
- audio

Nevertheless, the keywords also include sentiment words (schlecht, engl. 'bad'), which do not describe the topic itself but are typical for reviews. In addition, the top keywords contain different word forms of the same stem (e.g. kaufen, engl. 'buy', and gekauft, engl. 'bought'). Applying lemmatization before training could reduce this and result in cleaner topic keywords.
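As a rough illustration of what such a cleanup would achieve, the sketch below tidies an existing keyword list after the fact. It is a deliberately crude, rule-based stand-in and not real lemmatization: `SENTIMENT_WORDS`, `crude_stem` and `tidy_keywords` are hypothetical helpers, and the suffix rules are assumptions that only cover the examples seen here. A proper pipeline would use a real German lemmatizer instead.

```python
# Hypothetical sentiment stoplist; extend it for real data
SENTIMENT_WORDS = {"schlecht", "gut", "super", "toll"}

def crude_stem(word):
    """Very rough German normalization; NOT a real lemmatizer."""
    w = word.lower()
    # Strip the participle prefix "ge-" from longer words (gekauft -> kauft)
    if w.startswith("ge") and len(w) > 5:
        w = w[2:]
    # Strip one common ending, keeping at least four characters
    for suffix in ("en", "er", "es", "e", "t", "n", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 4:
            return w[: -len(suffix)]
    return w

def tidy_keywords(keywords):
    """Drop sentiment words and keep only the first keyword per crude stem."""
    seen, kept = set(), []
    for kw in keywords:
        if kw.lower() in SENTIMENT_WORDS:
            continue
        stem = crude_stem(kw)
        if stem not in seen:
            seen.add(stem)
            kept.append(kw)
    return kept

movie_keywords = ["film", "schauspieler", "schlecht", "tv", "filme", "story"]
print(tidy_keywords(movie_keywords))
```

For the movie topic this removes "schlecht" and the duplicate "filme"; for the first topic it collapses "gekauft" and "kaufen" into one entry.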


**Visualization of German topic space**

We visualize the German topics in a 2D projection to see how well separated they are; the bubble sizes indicate their relative sizes. The bar chart shows the top keywords of each topic.

In [8]:
fig_map = topic_model_baseline_german.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_german.visualize_barchart(top_n_topics=8)
fig_bar.show()

The **Intertopic Distance Map** shows how semantically different the topics are from each other. Each bubble represents one topic, and its size corresponds to the number of documents it contains. The map shows that Topic 0 is much larger than all others (more than 16k documents). Its central position also suggests that the other topics share semantically similar vocabulary with it.

The **bar chart** shows the top keywords per topic and thus how BERTopic differentiates between the themes. Overall, the topics appear meaningful and coherent, but there are also some redundant words and noise, such as "deutscher", "deutsches", "deutsche". No clusters seem to overlap strongly, and the keywords are well distributed and describe their topics in a meaningful way.
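To put the dominance of Topic 0 into numbers, the counts from the German topic table can be related to the 40,000 reviews the model was fitted on. This is plain arithmetic on values already shown in this notebook; `topic_shares` is a hypothetical helper name.

```python
N_DOCS = 40_000  # number of German reviews passed to fit_transform

# Counts copied from the German topic table (top five non-outlier topics)
topic_counts = {0: 16765, 1: 4415, 2: 3555, 3: 794, 4: 551}

def topic_shares(counts, n_docs):
    """Share of all fitted documents per topic, in percent (one decimal)."""
    return {topic: round(100 * count / n_docs, 1) for topic, count in counts.items()}

shares = topic_shares(topic_counts, N_DOCS)
for topic, share in shares.items():
    print(f"Topic {topic}: {share}% of fitted documents")
```

Topic 0 alone covers about 41.9% of the fitted reviews and is nearly four times as large as Topic 1.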

##### **Inspecting computed French topics**

We repeat the same steps for the French model and build a clean topic overview table with interpretable keyword representations.

In [9]:
import pandas as pd

# Topic overview from the fitted model; drop the outlier topic (-1)
info_df = topic_model_baseline_french.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    """Join the first six representative keywords into one string."""
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

# Show full keyword strings instead of truncating the column
pd.set_option("display.max_colwidth", None)
display(display_table.head(5))

|   | Topic | Count | Top Keywords |
|---|-------|-------|--------------|
| 1 | 0 | 10818 | écran, montre, plastique, acheté, mauvaise, dommage |
| 2 | 1 | 8127 | jamais, aucune, aucun, rien, sans, non |
| 3 | 2 | 3118 | téléphone, samsung, iphone, câble, bluetooth, cable |
| 4 | 3 | 1301 | batterie, batteries, chargeur, charger, aspirateur, charge |
| 5 | 4 | 537 | français, française, france, francais, espagnol, langue |


Our French BERTopic model also produces mostly coherent and meaningful topics. However, we again see morphological redundancy (français, française, france, francais), and Topic 1 (jamais, aucune, aucun, rien, sans, non) is a very broad, generic negative-review topic.
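Part of this redundancy is purely orthographic: "français" and "francais", or "câble" and "cable", differ only in accents. A minimal sketch of how such spelling variants could be merged with Python's standard `unicodedata` module follows; `fold_accents` and `dedupe_folded` are hypothetical helper names. Note that this does not merge inflected forms such as "française", which would still need lemmatization.

```python
import unicodedata

def fold_accents(word):
    """Lowercase a word and strip diacritics (e.g. 'français' -> 'francais')."""
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def dedupe_folded(keywords):
    """Keep the first keyword for each accent-folded spelling."""
    seen, kept = set(), []
    for kw in keywords:
        key = fold_accents(kw)
        if key not in seen:
            seen.add(key)
            kept.append(kw)
    return kept

language_keywords = ["français", "française", "france", "francais", "espagnol", "langue"]
print(dedupe_folded(language_keywords))
```

Applied to the language topic, this keeps "français" and drops the unaccented duplicate "francais"; for the phone topic, "cable" is dropped in favor of "câble".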

**Visualization of the French topic space**

Here, we visualize the French topic landscape using the same projections and bar chart as for the German topics. This allows us to compare it qualitatively to the German topics.

In [10]:
fig_map = topic_model_baseline_french.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_french.visualize_barchart(top_n_topics=8)
fig_bar.show()

The **French Intertopic Distance Map** shows topics that are well separated with little overlap, which suggests that the model identifies distinct thematic groups within the reviews. A large central cluster represents a theme related to general product quality. The smaller, more distant clusters correspond to more niche themes, such as electronics, batteries or language-related comments, so the model also captures more specific product categories.
The second-largest cluster, which consists of words expressing negation and negative sentiment, also sits near the center. It probably emerged from overlap between review themes (e.g. negative sentiment combined with bad product features).

The **French bar chart** visualizes the most representative keywords for each topic of the French reviews and again shows clear, well-structured themes. Some topics show better internal consistency than others. We again see repeated morphological variants, such as singular and plural or masculine and feminine forms of the same word, which could be reduced by applying lemmatization.

