# **BERTopic on German & French Amazon Reviews**
The following notebook:
- trains BERTopic models on German and French Amazon reviews
- inspects the discovered topics using a KeyBERTInspired representation model

In [2]:
!pip install -r requirements.txt
!python -m spacy download de_core_news_sm
!python -m spacy download fr_core_news_sm







In [3]:
from datasets import load_dataset

ds_german = load_dataset("SetFit/amazon_reviews_multi_de")
ds_french = load_dataset("SetFit/amazon_reviews_multi_fr")
docs_train_german = ds_german["train"]["text"]
categories_train_german = ds_german["train"]["label_text"]

docs_train_french = ds_french["train"]["text"]
categories_train_french = ds_french["train"]["label_text"]



In [4]:
!pip install bertopic
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired






#### **Data Preparation**

Adding **lemmatization** and **stop word removal** is expected to improve the BERTopic results for the following reasons:
- removal of inflection noise (e.g. Produkte, Produkten, Produktes -> Produkt)
- topics that are semantically more consistent (because fewer repeated word variants appear)
- improved embedding quality
- a smaller vocabulary, which makes clustering more stable
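
The vocabulary-shrinking effect named in the list can be illustrated with a minimal sketch. The lemma table `TOY_LEMMAS` and the helper names are hypothetical and exist only for this illustration; the notebook itself relies on spaCy's lemmatizer.

```python
from collections import Counter

# Hypothetical lemma lookup for illustration only; the notebook uses the
# spaCy pipelines (de_core_news_sm / fr_core_news_sm) instead.
TOY_LEMMAS = {
    "produkte": "produkt",
    "produkten": "produkt",
    "produktes": "produkt",
}


def vocabulary(tokens):
    """Count lowercased tokens."""
    return Counter(tok.lower() for tok in tokens)


def toy_lemmatize(tokens):
    """Replace a token by its lemma if the toy table knows it, else lowercase it."""
    return [TOY_LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]


tokens = ["Produkte", "Produkten", "Produktes", "Produkt", "Lieferung"]
before = vocabulary(tokens)
after = vocabulary(toy_lemmatize(tokens))

print(len(before), "distinct surface forms ->", len(after), "distinct lemmas")
print(after.most_common(1))
```

In this toy input, four inflected forms collapse into a single entry, so the counts that feed the keyword extraction are concentrated on one term instead of being spread over its variants.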

In [23]:
from datasets import load_dataset
import spacy

nlp_de = spacy.load("de_core_news_sm")
nlp_fr = spacy.load("fr_core_news_sm")

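def lemmatize(text: str, lang: str) -> list[str]:
    # Pick the spaCy pipeline for the requested language.
    if lang == "de":
        doc = nlp_de(text)
    elif lang == "fr":
        doc = nlp_fr(text)
    else:
        raise ValueError("lang must be 'de' or 'fr'")
    # Keep the lemmas of alphabetic tokens only (drops digits and punctuation).
    return [tok.lemma_ for tok in doc if tok.is_alpha]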

lemmatized_german = [lemmatize(text, "de") for text in docs_train_german]
lemmatized_french = [lemmatize(text, "fr") for text in docs_train_french]

print("Original German:", docs_train_german[0])
print("Lemmatized German:", lemmatized_german[0])

print("Original French:", docs_train_french[0])
print("Lemmatized French:", lemmatized_french[0])


Original German: Armband ist leider nach 1 Jahr kaputt gegangen
Lemmatized German: ['Armband', 'sein', 'leider', 'nach', 'Jahr', 'kaputt', 'gehen']
Original French: A déconseiller - Article n'a fonctionné qu'une fois - Je ne recommande pas du tout ce produit - Je l'ai jeté ...
Lemmatized French: ['avoir', 'déconseiller', 'article', 'avoir', 'fonctionner', 'un', 'fois', 'je', 'ne', 'recommander', 'pas', 'de', 'tout', 'ce', 'produit', 'je', 'avoir', 'jeter']


In [25]:
# stop word removal

def remove_stopwords(lemmas: list[str], lang: str) -> str:
    if lang == "de":
        stopwords = nlp_de.Defaults.stop_words
    elif lang == "fr":
        stopwords = nlp_fr.Defaults.stop_words
    else:
        raise ValueError("lang must be 'de' or 'fr'")
    
    cleaned = [lemma for lemma in lemmas if lemma.lower() not in stopwords]
    return " ".join(cleaned)

cleaned_german = [remove_stopwords(lemmas, "de") for lemmas in lemmatized_german]
cleaned_french = [remove_stopwords(lemmas, "fr") for lemmas in lemmatized_french]

print("Cleaned German:", cleaned_german[0])
print("Cleaned French:", cleaned_french[0])


Cleaned German: Armband kaputt
Cleaned French: déconseiller article fonctionner fois recommander produit jeter
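
Short reviews may end up as empty strings after lemmatization and stop word removal, and such documents carry no information for the embedding or the keyword extraction. The following is a minimal sketch of how they could be filtered out while keeping other lists (for example the category labels) aligned. The helper `drop_empty` and the toy data are assumptions for illustration and are not part of the notebook.

```python
def drop_empty(cleaned_docs, *aligned):
    """Drop documents that are empty after cleaning.

    Returns the kept documents, the matching entries of each aligned list,
    and the original indices that were kept.
    """
    keep = [i for i, doc in enumerate(cleaned_docs) if doc.strip()]
    kept_docs = [cleaned_docs[i] for i in keep]
    kept_aligned = [[seq[i] for i in keep] for seq in aligned]
    return kept_docs, kept_aligned, keep


# Toy data (hypothetical): the second and third review lost all their tokens.
toy_docs = ["Armband kaputt", "", "   ", "Lieferung schnell"]
toy_labels = ["a", "b", "c", "d"]

docs_kept, (labels_kept,), kept_idx = drop_empty(toy_docs, toy_labels)
print(docs_kept)
print(labels_kept)
print(kept_idx)
```

Returning the kept indices makes it possible to map topic assignments back to the original reviews later on.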


### **Set up** 

BERTopic is used here for unsupervised topic modeling of the reviews. We use the KeyBERTInspired representation model because it produced better topic labels in the previous notebook. The German and French models each get their own instance.


##### **Baseline BERTopic model for German reviews**

Here, we train a BERTopic model on the cleaned German Amazon reviews. We use the multilingual embedding setting, reduce the result to 8 topics and require a minimum topic size of 30 documents.

In [None]:
representation_model_german = KeyBERTInspired()

topic_model_baseline_german = BERTopic(
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=8,
    representation_model=representation_model_german
)

topics_base_german, probs_base_german = topic_model_baseline_german.fit_transform(cleaned_german)


2025-12-05 14:36:00,236 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 6250/6250 [22:00<00:00,  4.73it/s] 
2025-12-05 14:58:07,757 - BERTopic - Embedding - Completed ✓
2025-12-05 14:58:07,757 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-05 14:59:24,790 - BERTopic - Dimensionality - Completed ✓
2025-12-05 14:59:24,790 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-06 13:41:29,410 - BERTopic - Cluster - Completed ✓
2025-12-06 13:41:29,453 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-06 13:41:33,848 - BERTopic - Representation - Completed ✓
2025-12-06 13:41:33,856 - BERTopic - Topic reduction - Reducing number of topics
2025-12-06 13:41:34,382 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-06 13:41:37,770 - BERTopic - Representation - Completed ✓
2025-12-06 13:41:37,793 - BERTopic - Topic reduction -

##### **Baseline BERTopic model for French reviews**

We repeat the same setup for the French Amazon reviews, again using a multilingual model and the KeyBERTInspired representation.

In [None]:
representation_model_french = KeyBERTInspired()

topic_model_baseline_french = BERTopic(
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=8,
    representation_model=representation_model_french
)

topics_base_french, probs_base_french = topic_model_baseline_french.fit_transform(cleaned_french)


2025-12-06 13:44:30,135 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 6250/6250 [23:24<00:00,  4.45it/s]   
2025-12-06 14:08:05,070 - BERTopic - Embedding - Completed ✓
2025-12-06 14:08:05,071 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-06 14:09:22,754 - BERTopic - Dimensionality - Completed ✓
2025-12-06 14:09:22,754 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-06 18:34:17,507 - BERTopic - Cluster - Completed ✓
2025-12-06 18:34:17,518 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-12-06 18:34:20,589 - BERTopic - Representation - Completed ✓
2025-12-06 18:34:20,594 - BERTopic - Topic reduction - Reducing number of topics
2025-12-06 18:34:21,102 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-06 18:34:23,885 - BERTopic - Representation - Completed ✓
2025-12-06 18:34:23,908 - BERTopic - Topic reduction

### **Inspecting computed topics for German**

In the following part, we retrieve the topic summary for the German model, clean it by removing the outlier topic (-1), and construct a compact table with topic id, document count and top representative keywords.

In [None]:
info_df = topic_model_baseline_german.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(5))

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 45931 | produkt, verpackung, stoff, bild, material, geeignet |
| 1 | 34206 | lieferung, schnell, verpackung, liefern, paket, verarbeiten |
| 2 | 23824 | smartphone, telefon, kabel, akku, batterie, iphone |
| 3 | 6964 | trinken, flasche, wasser, verpackung, kaputt, lecker |
| 4 | 6689 | lautstärke, mikrofon, hören, sound, musik, schauspieler |


The generated topics for the German reviews are considerably more compact, cleaner and semantically more consistent than before lemmatization and stop word removal.
As expected, the stop words and non-topic words (e.g. 'schlecht', engl. *bad*) have disappeared from the displayed keywords, and the topic descriptions now consist mostly of meaningful nouns and infinitives, which gives a better view of what each topic actually means.

Also, redundant variants (e.g. 'kaufen, gekauft', engl. *buy, bought*) have disappeared. Now, the topics appear more distinct and also more domain-specific. We can identify the following topics:

1) product/material quality
2) delivery
3) electronics & hardware
4) beverages & packaging
5) audio & sound equipment
6) commerce & sellers (see topic bar chart)
7) medication (see topic bar chart)

The topics are still similar to those found before lemmatization and stop word removal, but they are now more specific and contain less noise.

#### **Visualization of German topic space**

We visualize the German topics in a 2D projection to see how well-separated they are. In the bar chart, we can inspect their relative sizes.

In [None]:
fig_map = topic_model_baseline_german.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_german.visualize_barchart(top_n_topics=8)
fig_bar.show()

The **Intertopic Distance Map** shows a clearer separation of the clusters after applying lemmatization and stop word removal. The clusters are also tighter and, overall, further apart than before, which suggests that the topics share fewer ambiguous or overlapping words. There is also no longer a dominant topic cluster in the center. Previously, that topic (0) absorbed a lot of the vocabulary; now that the vocabulary is cleaner, UMAP separates the topics more evenly.

The **Topic Word Bar Chart** now contains much cleaner and semantically more precise keywords. Fewer redundant keywords mean better topic interpretability.
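
As a rough numerical companion to the visual impression, the keyword overlap between topics can be measured directly. The sketch below computes the Jaccard overlap between the top-6 keyword lists from the German topic table above. This is a simple proxy chosen for illustration, not the computation behind the Intertopic Distance Map, and the variable names are assumptions.

```python
from itertools import combinations

# Top-6 keywords per topic, copied from the German topic table above.
top_keywords = {
    0: ["produkt", "verpackung", "stoff", "bild", "material", "geeignet"],
    1: ["lieferung", "schnell", "verpackung", "liefern", "paket", "verarbeiten"],
    2: ["smartphone", "telefon", "kabel", "akku", "batterie", "iphone"],
    3: ["trinken", "flasche", "wasser", "verpackung", "kaputt", "lecker"],
    4: ["lautstärke", "mikrofon", "hören", "sound", "musik", "schauspieler"],
}


def jaccard(a, b):
    """Share of keywords two topics have in common (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


overlaps = {
    (i, j): jaccard(top_keywords[i], top_keywords[j])
    for i, j in combinations(sorted(top_keywords), 2)
}

# Print only the topic pairs that share at least one keyword.
for pair, score in sorted(overlaps.items(), key=lambda kv: -kv[1]):
    if score > 0:
        print(pair, round(score, 3))
```

For these five topics, the only shared keyword is 'verpackung', which appears in topics 0, 1 and 3; all other pairs are disjoint.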

#### **Inspecting computed French topics**

We repeat the same steps for the inspection of the topics generated by the French model by building a clean topic overview table with interpretable keyword representations.

In [None]:
info_df = topic_model_baseline_french.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(5))

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 54086 | qualité, couleur, agréable, léger, efficace, emballage |
| 1 | 37850 | qualité, produit, recommander, acheter, prix, achat |
| 2 | 31460 | smartphone, téléphone, batterie, samsung, appareil, qualité |
| 3 | 2486 | bébé, bebe, grossesse, enceinte, petit, confortable |
| 4 | 798 | cd, album, guitar, guitare, musical, musique |


After applying lemmatization and stop word removal to the French dataset as well, we can observe similar improvements.
The topics appear much cleaner, with fewer spelling variants and inflections (e.g. 'français, française, francais', engl. *French*). The topics now show mostly lemmas, i.e. verb infinitives, singular nouns and base forms of adjectives.
Also, the previously generated Topic 2, which could be interpreted as a cluster of meaningless, negative words (e.g. 'aucun, sans, rien', engl. *no, without, nothing*), is no longer displayed. Instead, we can identify clear, domain-specific topics:
1) general product attributes (quality, colour, efficiency)
2) shopping experiences 
3) smartphone & electronics
4) baby products & maternity
5) music & albums
6) pricing/market in European context
7) medication

As with the German dataset, the keywords are meaningful and specific to the content of each cluster.

#### **Visualization of the French topic space**

Here, we visualize the French topic landscape using the same projections and bar chart as for the German topics. This allows us to compare it qualitatively to the German topics.

In [None]:
fig_map = topic_model_baseline_french.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_french.visualize_barchart(top_n_topics=8)
fig_bar.show()

The **Intertopic Distance Map** now shows that the topics are more clearly separated and form distinct clusters without overlaps. This suggests that lemmatization and stop word removal helped to reduce noise and unify the vocabulary, allowing UMAP to position the topics more accurately in the semantic space.
Here, we can also observe that semantically related topics lie closer together, while more specific categories, like baby products or music, are further apart. This indicates that their vocabulary is more distinctive.

The **Topic Word Bar Chart** shows the improved topic keywords already seen in the table above.

In [None]:
# evaluation

import importlib
import calculate_t_coherence_and_diversity as evalmod

importlib.reload(evalmod)

_, npm_german, div_german = evalmod.evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_german,
    docs=docs_train_german,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="German"
)

_, npm_french, div_french = evalmod.evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_french,
    docs=docs_train_french,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="French"
)

# Label each column with the language its scores were computed for.
results_df = pd.DataFrame({
    "Metric": ["Coherence (NPMI)", "Diversity"],
    "German": [npm_german, div_german],
    "French": [npm_french, div_french]
})
display(results_df)

German Model - NPMI: 0.1424
German Model - Diversity: 0.3771
French Model - NPMI: 0.1165
French Model - Diversity: 0.3486


| Metric | German | French |
|---|---|---|
| Coherence (NPMI) | 0.142432 | 0.116486 |
| Diversity | 0.377143 | 0.348571 |
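
To make the two reported metrics easier to interpret, here is a minimal sketch of how NPMI coherence and topic diversity are commonly defined. The module `calculate_t_coherence_and_diversity` is not shown in this notebook, so its implementation may differ in details such as smoothing or windowing. The function names and toy documents below are assumptions for illustration.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, tokenized_docs):
    """Mean NPMI over all word pairs of one topic, based on document co-occurrence."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1 = sum(w1 in d for d in doc_sets) / n_docs
        p2 = sum(w2 in d for d in doc_sets) / n_docs
        p12 = sum(w1 in d and w2 in d for d in doc_sets) / n_docs
        if p12 == 0:
            scores.append(-1.0)  # convention: the pair never co-occurs
        elif p12 == 1:
            scores.append(1.0)   # limit case: the pair occurs in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


def topic_diversity(topics, top_k):
    """Share of unique words among the top_k words of all topics."""
    words = [w for topic in topics for w in topic[:top_k]]
    return len(set(words)) / len(words) if words else 0.0


# Toy corpus (hypothetical), already tokenized.
toy_docs = [
    ["akku", "kabel", "telefon"],
    ["akku", "kabel"],
    ["lieferung", "paket"],
    ["lieferung", "paket", "schnell"],
]

coherent = npmi_coherence(["akku", "kabel"], toy_docs)
incoherent = npmi_coherence(["akku", "paket"], toy_docs)
diversity = topic_diversity([["akku", "kabel"], ["akku", "paket"]], top_k=2)
print(coherent, incoherent, diversity)
```

NPMI ranges from -1 (words never co-occur) to 1 (words always co-occur), and diversity ranges from 0 to 1, where 1 means that no keyword is shared between topics.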


#### **Evaluation**

