# **BERTopic on German & French Amazon Reviews**

In our previous experiment, we found that **KeyBERT-Inspired representations** produce cleaner, more semantic topic descriptions than standard frequency-based labels (c-TF-IDF) on English news.

In this notebook, we test whether the same strategy holds up in a multilingual setting using **Amazon Reviews** in **German** and **French**. Reviews are "messy" data, full of slang, typos, and strong sentiment. We train a separate BERTopic model for each language and apply the KeyBERT-inspired representation to see whether we can extract clear, interpretable product categories from raw customer feedback.


In [38]:
!pip install --upgrade pip
!pip install -r requirements.txt
!python -m spacy download de_core_news_lg
!python -m spacy download fr_core_news_lg

Collecting de-core-news-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_lg-3.8.0/de_core_news_lg-3.8.0-py3-none-any.whl (567.8 MB)
Installing collected packages: de-core-news-lg
Successfully installed de-core-news-lg-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_lg')
Collecting fr-core-news-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_lg-3.8.0/fr_core_news_lg-3.8.0-py3-none-any.whl (571.8 MB)
Installing collected packages: fr-core-news-lg
Successfully installed fr-core-news-lg-3.8.0
✔ Download and installation su

In [39]:
from datasets import load_dataset

ds_german = load_dataset("SetFit/amazon_reviews_multi_de")
ds_french = load_dataset("SetFit/amazon_reviews_multi_fr")
docs_train_german = ds_german["train"]["text"]
categories_train_german = ds_german["train"]["label_text"]

docs_train_french = ds_french["train"]["text"]
categories_train_french = ds_french["train"]["label_text"]

Repo card metadata block was not found. Setting CardData to empty.


### Data Preparation
Based on the results of the previous experiment, we lemmatize the reviews before topic modeling, for the following reasons:
- It reduces noise: inflected forms such as `Produkte`, `Produkten`, and `Produktes` are merged into the single lemma `Produkt`.
- Topics should be more semantically consistent, because fewer variants of the same word compete for keyword slots.
- A smaller vocabulary should make clustering more stable.
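
To make the vocabulary argument concrete, the toy sketch below counts distinct tokens before and after mapping inflected forms to a lemma. The `TOY_LEMMAS` dictionary and the `vocabulary` helper are hypothetical stand-ins for the spaCy lemmatizer used in the next cell; they only illustrate the effect.

```python
# Hypothetical lookup table standing in for a real lemmatizer (illustration only).
TOY_LEMMAS = {
    "produkte": "produkt",
    "produkten": "produkt",
    "produktes": "produkt",
}


def vocabulary(tokens: list[str], lemmatize: bool) -> set[str]:
    """Return the set of distinct lowercase tokens, optionally lemmatized."""
    lowered = [tok.lower() for tok in tokens]
    if lemmatize:
        # Unknown tokens are kept unchanged.
        lowered = [TOY_LEMMAS.get(tok, tok) for tok in lowered]
    return set(lowered)


tokens = ["Produkte", "Produkten", "Produktes", "Produkt", "kaputt"]
raw_vocab = vocabulary(tokens, lemmatize=False)
lemma_vocab = vocabulary(tokens, lemmatize=True)
print(len(raw_vocab), len(lemma_vocab))  # 5 distinct forms shrink to 2 lemmas
```

Four surface forms of the same noun collapse into one entry, so they no longer compete with each other for keyword slots.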

In [40]:
import spacy

nlp_de = spacy.load("de_core_news_lg")
nlp_fr = spacy.load("fr_core_news_lg")

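def lemmatize(text: str, lang: str) -> list[str]:
    """Return the lemmas of all alphabetic tokens in `text`."""
    if lang == "de":
        doc = nlp_de(text)
    elif lang == "fr":
        doc = nlp_fr(text)
    else:
        raise ValueError("lang must be 'de' or 'fr'")
    # is_alpha keeps purely alphabetic tokens, dropping punctuation and numbers
    return [tok.lemma_ for tok in doc if tok.is_alpha]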

lemmatized_german = [lemmatize(text, "de") for text in docs_train_german[:30000]]
print("Original German:", docs_train_german[0])
print("Lemmatized German:", lemmatized_german[0])

lemmatized_french = [lemmatize(text, "fr") for text in docs_train_french[:30000]]
print("Original French:", docs_train_french[0])
print("Lemmatized French:", lemmatized_french[0])

Original German: Armband ist leider nach 1 Jahr kaputt gegangen
Lemmatized German: ['Armband', 'sein', 'leider', 'nach', 'Jahr', 'kaputt', 'gehen']
Original French: A déconseiller - Article n'a fonctionné qu'une fois - Je ne recommande pas du tout ce produit - Je l'ai jeté ...
Lemmatized French: ['a', 'déconseiller', 'article', 'avoir', 'fonctionner', 'un', 'fois', 'je', 'ne', 'recommander', 'pas', 'de', 'tout', 'ce', 'produit', 'je', 'avoir', 'jeter']


In [41]:
def remove_stopwords(lemmas: list[str], lang: str) -> str:
    if lang == "de":
        stopwords = nlp_de.Defaults.stop_words
    elif lang == "fr":
        stopwords = nlp_fr.Defaults.stop_words
    else:
        raise ValueError("lang must be 'de' or 'fr'")

    cleaned = [lemma for lemma in lemmas if lemma.lower() not in stopwords]
    return " ".join(cleaned)

docs_cleaned_german = [remove_stopwords(lemmas, "de") for lemmas in lemmatized_german]
print("Cleaned German:", docs_cleaned_german[0])

docs_cleaned_french = [remove_stopwords(lemmas, "fr") for lemmas in lemmatized_french]
print("Cleaned French:", docs_cleaned_french[0])

Cleaned German: Armband kaputt
Cleaned French: déconseiller article fonctionner fois recommander produit jeter


### **Set up**
##### **Baseline BERTopic model for German reviews**

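Here, we train a BERTopic model on the cleaned German Amazon reviews. We pass `language="multilingual"`, but because an explicit `embedding_model` is supplied, BERTopic ignores that argument: the embeddings come from `all-mpnet-base-v2`, a model trained mainly on English text. We reduce the number of topics to 10 (`nr_topics=10`) with a minimum topic size of 20 documents.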

In [42]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

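# Note: all-mpnet-base-v2 is trained mainly on English text, and an explicit
# embedding_model takes precedence over BERTopic's `language` argument.
sentence_model = SentenceTransformer("all-mpnet-base-v2")
# Note: stop_words="english" filters English stop words only; German and French
# stop words were already removed during data preparation.
vectorizer_model = CountVectorizer(stop_words="english")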

In [43]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

representation_model_german = KeyBERTInspired()

topic_model_baseline_german = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=20,
    nr_topics=10,
    representation_model=representation_model_german
)

topics_base_german, probs_base_german = topic_model_baseline_german.fit_transform(docs_cleaned_german)

2025-12-07 02:14:47,767 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 938/938 [04:06<00:00,  3.81it/s]
2025-12-07 02:18:54,733 - BERTopic - Embedding - Completed ✓
2025-12-07 02:18:54,739 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-07 02:19:04,018 - BERTopic - Dimensionality - Completed ✓
2025-12-07 02:19:04,025 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-07 02:20:18,726 - BERTopic - Cluster - Completed ✓
2025-12-07 02:20:18,732 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-07 02:20:19,328 - BERTopic - Representation - Completed ✓
2025-12-07 02:20:19,329 - BERTopic - Topic reduction - Reducing number of topics
2025-12-07 02:20:19,371 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-07 02:20:31,843 - BERTopic - Representation - Completed ✓
2025-12-07 02:20:31,868 - BERTopic - Topic reduction - Re

##### **Baseline BERTopic model for French reviews**

We repeat the same setup for the French Amazon reviews, reusing the same embedding model and vectorizer together with a fresh KeyBERTInspired representation.

In [47]:
representation_model_french = KeyBERTInspired()

topic_model_baseline_french = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=20,
    nr_topics=10,
    representation_model=representation_model_french
)

topics_base_french, probs_base_french = topic_model_baseline_french.fit_transform(docs_cleaned_french)

2025-12-07 02:22:12,625 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 938/938 [04:37<00:00,  3.38it/s] 
2025-12-07 02:26:50,405 - BERTopic - Embedding - Completed ✓
2025-12-07 02:26:50,410 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-07 02:26:58,978 - BERTopic - Dimensionality - Completed ✓
2025-12-07 02:26:58,984 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-07 02:27:40,550 - BERTopic - Cluster - Completed ✓
2025-12-07 02:27:40,554 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-07 02:27:41,042 - BERTopic - Representation - Completed ✓
2025-12-07 02:27:41,043 - BERTopic - Topic reduction - Reducing number of topics
2025-12-07 02:27:41,078 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-07 02:27:49,721 - BERTopic - Representation - Completed ✓
2025-12-07 02:27:49,735 - BERTopic - Topic reduction - R
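
Both models above embed the reviews with `all-mpnet-base-v2` and use an English stop word list in the vectorizer. A variant that is closer to a truly multilingual setup could look like the sketch below. The function name `build_multilingual_topic_model` is hypothetical, and the checkpoint `paraphrase-multilingual-MiniLM-L12-v2` is only one possible multilingual encoder; whether it improves the topics on these reviews has not been tested here.

```python
def build_multilingual_topic_model(stop_words, nr_topics=10, min_topic_size=20):
    """Sketch: BERTopic with a multilingual encoder and language-specific stop words.

    `stop_words` is any iterable of strings, e.g. nlp_de.Defaults.stop_words.
    """
    # Imported lazily so the heavy dependencies load only when a model is built.
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer

    # Assumption: one possible multilingual checkpoint, not a tuned choice.
    embedding_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    # CountVectorizer accepts a list of stop words for any language.
    vectorizer_model = CountVectorizer(stop_words=sorted(stop_words))

    return BERTopic(
        embedding_model=embedding_model,
        vectorizer_model=vectorizer_model,
        representation_model=KeyBERTInspired(),
        min_topic_size=min_topic_size,
        nr_topics=nr_topics,
        calculate_probabilities=True,
    )
```

A German model would then be built with `build_multilingual_topic_model(nlp_de.Defaults.stop_words)` and fitted with `fit_transform(docs_cleaned_german)`, exactly as in the cells above.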

### **Inspecting computed topics for German**

In [48]:
info_df = topic_model_baseline_german.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(6))

|   | Topic | Count | Top Keywords |
|---|-------|-------|--------------|
| 1 | 0 | 21851 | bewertung, produkt, bild, funktionieren, verarbeitung, bleiben |
| 2 | 1 | 176 | ankommen, gekommen, ankratzen, bekommen, anbieten, ansehen |
| 3 | 2 | 87 | galaxy, solarleuchte, reichen, empfehlen, solarzelle, passen |
| 4 | 3 | 67 | liefern, lieferbar, geliefert, ärgern, kaufen, ständig |
| 5 | 4 | 67 | funktioniert, funktionierte, funktionieren, funktionern, funktionstüchtig, funktionsfähig |
| 6 | 5 | 35 | kaputt, kapputt, kapot, erbitt, kaufen, günstig |


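The German model is dominated by a single catch-all cluster: Topic 0 holds 21,851 of the 30,000 reviews (about 73%) and is described by generic review vocabulary (`bewertung`, `produkt`, `funktionieren`). Among the smaller clusters shown, only Topic 2 names concrete products, and it mixes `galaxy` with solar lights (`solarleuchte`, `solarzelle`). The remaining topics group reviews by what happened to the order rather than by what was bought: arrival (Topic 1: `ankommen`, `bekommen`), delivery (Topic 3: `liefern`, `lieferbar`), whether the item works (Topic 4), and breakage (Topic 5: `kaputt`).

This shows the limitations of working with review data. Unlike the clean news dataset, these topics are shaped by transaction and complaint language (`ärgern`, `kaputt`) rather than by product nouns. Lemmatization also did not merge every variant: Topic 4 lists `funktioniert`, `funktionierte`, and `funktionieren` side by side, and Topic 5 contains the misspellings `kapputt` and `kapot`. The KeyBERT-inspired representation treats each surface form as a separate keyword, which wastes valuable space in the topic representation.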
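
A rough way to quantify the wasted keyword slots is to count how many keywords share a leading prefix with an earlier keyword in the same topic. The helper `redundant_slots` and the six-character prefix are illustrative assumptions; this is a crude proxy, not a real stemmer.

```python
def redundant_slots(keywords: list[str], prefix_len: int = 6) -> int:
    """Count keywords whose first `prefix_len` letters repeat an earlier keyword."""
    seen: set[str] = set()
    redundant = 0
    for word in keywords:
        prefix = word.lower()[:prefix_len]
        if prefix in seen:
            redundant += 1
        seen.add(prefix)
    return redundant


# Topic 4 keywords copied from the German table above.
topic_4 = [
    "funktioniert", "funktionierte", "funktionieren",
    "funktionern", "funktionstüchtig", "funktionsfähig",
]
print(redundant_slots(topic_4))  # 5: every keyword after the first repeats "funkti"
```

By this measure, five of the six slots in Topic 4 carry no new information about the topic.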


#### **Visualization of German topic space**

We visualize the German topics in a 2D projection to see how well-separated they are. In the bar chart, we can inspect particular words in those clusters along with their significance.

In [45]:
fig_map = topic_model_baseline_german.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_german.visualize_barchart(top_n_topics=10)
fig_bar.show()

### **Inspecting computed topics for French**

We repeat the same steps for the inspection of the topics generated by the French model by building a clean topic overview table with interpretable keyword representations.

In [49]:
info_df = topic_model_baseline_french.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(6))

|   | Topic | Count | Top Keywords |
|---|-------|-------|--------------|
| 1 | 0 | 14732 | recevoir, décevoir, vraiment, devoir, pouvoir, voir |
| 2 | 1 | 3572 | batterie, chargeur, recharger, téléphone, telephone, charger |
| 3 | 2 | 250 | tactile, sensibilité, pouvoir, vraiment, protéger, vitre |
| 4 | 3 | 245 | plaque, dentifrice, vraiment, dentiste, médecin, décevoir |
| 5 | 4 | 237 | fonctionner, fonctionnait, fonctionnement, fonctionnr, froisser, foutaise |
| 6 | 5 | 132 | bracelet, anniversaire, recevoir, vraiment, décevoir, devoir |



The French model identifies several specific product niches, even within a noisy dataset. We see distinct clusters for power accessories (Topic 1: `batterie`, `chargeur`, 3,572 reviews), touchscreen protection (Topic 2: `tactile`, `sensibilité`, `vitre`), dental care (Topic 3: `plaque`, `dentifrice`, `dentiste`), and bracelets (Topic 5: `bracelet`, `anniversaire`).

However, noise is still a serious problem. Topic 0 is a catch-all cluster of 14,732 reviews (about 49% of the 30,000) described only by generic verbs such as `recevoir` (to receive), `décevoir` (to disappoint), and `pouvoir` (can). These words do not describe a product; they describe the act of ordering and reviewing. Filler such as `vraiment` (really/truly) also appears in the keyword lists of Topics 2, 3, and 5. Finally, Topic 4 groups reviews around variants of `fonctionner` (to work), including the unmerged form `fonctionnait` and the typo `fonctionnr`.

#### **Visualization of the French topic space**

Here, we visualize the French topic landscape using the same projections and bar chart as for the German topics. This allows us to compare it qualitatively to the German topics.

In [50]:
fig_map = topic_model_baseline_french.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_french.visualize_barchart(top_n_topics=10)
fig_bar.show()

In [51]:
from calculate_t_coherence_and_diversity import evaluate_bertopic_pmi

_, npm_german, div_german = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_german,
    docs=docs_cleaned_german,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="German"
)

_, npm_french, div_french = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_french,
    docs=docs_cleaned_french,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="French"
)

results_df = pd.DataFrame({
    "Metric": ["Coherence (NPMI)", "Diversity"],
    "German": [npm_german, div_german],
    "French": [npm_french, div_french],
})
display(results_df)

German Model - NPMI: 0.1316
German Model - Diversity: 0.3378
French Model - NPMI: 0.1233
French Model - Diversity: 0.3333


|   | Metric | German | French |
|---|--------|--------|--------|
| 0 | Coherence (NPMI) | 0.131616 | 0.123252 |
| 1 | Diversity | 0.337778 | 0.333333 |
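
For readers unfamiliar with the two metrics, the sketch below implements their common textbook definitions: NPMI coherence averaged over keyword pairs using document-level co-occurrence, and diversity as the share of unique words among all topics' top keywords. The local helper `evaluate_bertopic_pmi` is not shown in this notebook and may differ in details (for example, in how co-occurrence is counted), so the function names and logic here are assumptions for illustration only.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words: list[str], docs: list[str]) -> float:
    """Average NPMI over all keyword pairs, using document-level co-occurrence."""
    doc_sets = [set(doc.split()) for doc in docs]
    n_docs = len(doc_sets)
    if n_docs == 0:
        return 0.0

    def prob(*words: str) -> float:
        # Fraction of documents that contain all given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: minimum NPMI by convention
        elif p12 == 1:
            scores.append(1.0)   # co-occur everywhere: avoid dividing by log(1) = 0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


def topic_diversity(topics: list[list[str]], top_k: int = 25) -> float:
    """Share of unique words among the top_k keywords of every topic."""
    words = [w for topic in topics for w in topic[:top_k]]
    return len(set(words)) / len(words) if words else 0.0


toy_docs = ["batterie chargeur", "batterie chargeur", "bracelet cadeau", "bracelet"]
print(npmi_coherence(["batterie", "chargeur"], toy_docs))  # always together
print(npmi_coherence(["batterie", "bracelet"], toy_docs))  # never together

toy_topics = [["recevoir", "vraiment", "batterie"],
              ["recevoir", "vraiment", "bracelet"]]
print(topic_diversity(toy_topics, top_k=3))  # 4 unique words out of 6 slots
```

NPMI ranges from -1 (keywords never co-occur) to 1 (keywords always co-occur). Under this definition of diversity, a value near 0.33 would mean that roughly two thirds of the keyword slots are repeats of words already used elsewhere.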


# Evaluation

This project aimed to replicate the success of our English news experiment on a more challenging, multilingual dataset of Amazon reviews. While we extracted some interpretable product categories for both German and French, the quantitative metrics show that the approach transfers only partially. The NPMI coherence scores are **0.13** for German and **0.12** for French, well below the ~0.40 we achieved with the BBC News dataset.

These lower scores do not necessarily mean the model failed; rather, they reflect the nature of review data. Reviews are short, informal, and heavily saturated with sentiment (e.g., "bad", "perfect"), which makes topic extraction harder. While the KeyBERT-Inspired representation identified specific items such as chargers or solar lights, it could not fully overcome the noise in the text.

To close the gap between these scores and our previous benchmark, future iterations would need a stricter normalization step that also merges the variants and misspellings the current lemmatization missed (e.g., `funktioniert` vs. `funktionieren`, `kaputt` vs. `kapputt`), and a custom stopword list to filter out the generic transaction language that currently dominates the largest clusters.
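
A custom stopword list could be layered on top of the existing `remove_stopwords` step roughly as follows. The set `CUSTOM_STOPWORDS_DE` and the helper `remove_custom_stopwords` are hypothetical; a real list would have to be curated from the corpus, for example from the keywords of the largest cluster.

```python
# Hypothetical domain stop words; a real list would be curated from the corpus.
CUSTOM_STOPWORDS_DE = {"produkt", "bewertung", "kaufen", "bekommen"}


def remove_custom_stopwords(
    lemmas: list[str],
    base_stopwords: set[str],
    custom_stopwords: set[str],
) -> str:
    """Drop lemmas found in either stop word set and join the rest."""
    blocked = base_stopwords | custom_stopwords
    return " ".join(lemma for lemma in lemmas if lemma.lower() not in blocked)


# Tiny stand-in for nlp_de.Defaults.stop_words, to keep the example self-contained.
base = {"sein", "nach", "leider"}
lemmas = ["Produkt", "sein", "leider", "nach", "Jahr", "kaputt", "gehen"]
print(remove_custom_stopwords(lemmas, base, CUSTOM_STOPWORDS_DE))  # Jahr kaputt gehen
```

Filtering such words before embedding changes the documents themselves; an alternative worth comparing is to leave the documents intact and pass the extended list only to the `CountVectorizer`, so that it affects the topic keywords but not the clustering.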