In [3]:
!pip install --upgrade pip
!pip install -r requirements.txt
!python -m spacy download de_core_news_sm
!python -m spacy download fr_core_news_sm



In [4]:
from datasets import load_dataset

ds_german = load_dataset("SetFit/amazon_reviews_multi_de")
ds_french = load_dataset("SetFit/amazon_reviews_multi_fr")
docs_train_german = ds_german["train"]["text"]
categories_train_german = ds_german["train"]["label_text"]

docs_train_french = ds_french["train"]["text"]
categories_train_french = ds_french["train"]["label_text"]

Repo card metadata block was not found. Setting CardData to empty.


### Data Preparation
Based on the results of the previous experiment, we apply **lemmatization** this time to improve the BERTopic output, for the following reasons:
- less noise: inflected forms such as `Produkte`, `Produkten`, and `Produktes` are merged into a single lemma, `Produkt`
- topics should be more semantically consistent, because fewer slots are taken up by repeated word variants
- a smaller vocabulary should make the clustering more stable
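To make the vocabulary argument concrete, here is a minimal sketch that counts distinct word forms before and after merging variants. The `TOY_LEMMAS` mapping and the token list are hand-written stand-ins for a real lemmatizer (the notebook itself uses spaCy), so the numbers are purely illustrative.

```python
# Toy illustration: merging inflected forms shrinks the vocabulary.
# TOY_LEMMAS is a hypothetical, hand-written mapping, not spaCy output.
TOY_LEMMAS = {
    "produkte": "produkt",
    "produkten": "produkt",
    "produktes": "produkt",
    "folien": "folie",
}

def toy_lemmatize(tokens):
    """Lower-case each token and map it to its lemma if one is known."""
    return [TOY_LEMMAS.get(tok.lower(), tok.lower()) for tok in tokens]

tokens = ["Produkte", "Produkten", "Produktes", "Produkt", "Folie", "Folien"]
lemmas = toy_lemmatize(tokens)

vocab_before = {tok.lower() for tok in tokens}  # distinct surface forms
vocab_after = set(lemmas)                       # distinct lemmas
print(len(vocab_before), "->", len(vocab_after))
```

Every form that collapses into an existing lemma frees a slot that a topic representation could use for a more informative word.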

In [None]:
import spacy

nlp_de = spacy.load("de_core_news_sm")
nlp_fr = spacy.load("fr_core_news_sm")

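def lemmatize(text: str, lang: str) -> list[str]:
    """Return the lemmas of all alphabetic tokens in `text`."""
    if lang == "de":
        doc = nlp_de(text)
    elif lang == "fr":
        doc = nlp_fr(text)
    else:
        raise ValueError("lang must be 'de' or 'fr'")
    return [tok.lemma_ for tok in doc if tok.is_alpha]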

lemmatized_german = [lemmatize(text, "de") for text in docs_train_german]
lemmatized_french = [lemmatize(text, "fr") for text in docs_train_french]

print("Original German:", docs_train_german[0])
print("Lemmatized German:", lemmatized_german[0])

print("Original French:", docs_train_french[0])
print("Lemmatized French:", lemmatized_french[0])

In [None]:
def remove_stopwords(lemmas: list[str], lang: str) -> str:
    if lang == "de":
        stopwords = nlp_de.Defaults.stop_words
    elif lang == "fr":
        stopwords = nlp_fr.Defaults.stop_words
    else:
        raise ValueError("lang must be 'de' or 'fr'")

    cleaned = [lemma for lemma in lemmas if lemma.lower() not in stopwords]
    return " ".join(cleaned)

docs_cleaned_german = [remove_stopwords(lemmas, "de") for lemmas in lemmatized_german]
docs_cleaned_french = [remove_stopwords(lemmas, "fr") for lemmas in lemmatized_french]

print("Cleaned German:", docs_cleaned_german[0])
print("Cleaned French:", docs_cleaned_french[0])

### **Set up**
##### **Baseline BERTopic model for German reviews**

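Here, we train a BERTopic model on German Amazon reviews. We set `language="multilingual"`, reduce the model to 10 topics (`nr_topics=10`), and require a minimum topic size of 20 documents (`min_topic_size=20`). Because an explicit `embedding_model` is passed, BERTopic embeds the documents with that model (`all-mpnet-base-v2`) rather than choosing one from the `language` setting.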

In [5]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

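# Note: per its model card, all-mpnet-base-v2 is trained mainly on English data.
sentence_model = SentenceTransformer("all-mpnet-base-v2")
# Note: scikit-learn's built-in stop word list covers English only, so German and
# French function words stay in the vocabulary.
vectorizer_model = CountVectorizer(stop_words="english")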

In [6]:
representation_model_german = KeyBERTInspired()

topic_model_baseline_german = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=20,
    nr_topics=10,
    representation_model=representation_model_german
)

topics_base_german, probs_base_german = topic_model_baseline_german.fit_transform(docs_train_german)  # fitting on all 200k reviews would take more than 90 min on a Colab T4

2025-12-04 22:28:21,135 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 1250/1250 [09:47<00:00,  2.13it/s]
2025-12-04 22:38:10,285 - BERTopic - Embedding - Completed ✓
2025-12-04 22:38:10,285 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-04 22:38:31,784 - BERTopic - Dimensionality - Completed ✓
2025-12-04 22:38:31,787 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
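
The `huggingface/tokenizers` warning in the log above does not affect the results, and it names its own remedy: setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the assignment is placed in the first cell of the notebook so that it runs before any tokenizer is used:

```python
import os

# Set this before sentence-transformers / tokenizers is first used in the process;
# "false" disables tokenizer parallelism and silences the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```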

##### **Baseline BERTopic model for French reviews**

We repeat the same setup for the French Amazon reviews, again using the same embedding model and the KeyBERTInspired representation.

In [7]:
representation_model_french = KeyBERTInspired()

topic_model_baseline_french = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="multilingual",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=20,
    nr_topics=10,
    representation_model=representation_model_french
)

# Fit on the raw reviews, as for the German baseline and as in the evaluation cell.
topics_base_french, probs_base_french = topic_model_baseline_french.fit_transform(docs_train_french)

2025-12-04 22:44:38,345 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 1250/1250 [07:35<00:00,  2.75it/s]
2025-12-04 22:52:13,864 - BERTopic - Embedding - Completed ✓
2025-12-04 22:52:13,866 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-04 22:52:24,898 - BERTopic - Dimensionality - Completed ✓
2025-12-04 22:52:24,908 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

### **Inspecting computed topics for German**

In [14]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

def clean_keywords(repr_list):
    """Join the first six representation words into one readable string."""
    return ", ".join(repr_list[:6])

# Drop the outlier topic (-1) and build a compact overview table.
info_df = topic_model_baseline_german.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()
clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]
display(display_table.head(6))

| | Topic | Count | Top Keywords |
|---|---|---|---|
| 1 | 0 | 23893 | zurück, schlecht, der, ich, oder, funktioniert |
| 2 | 1 | 4097 | produkt, gekauft, amazon, der, zurück, oder |
| 3 | 2 | 1034 | lampe, lampen, leds, led, mehr, nicht |
| 4 | 3 | 602 | folie, folien, anleitung, einfach, anbringen, entfernen |
| 5 | 4 | 499 | hülle, hüllen, rückseite, keine, der, hatte |
| 6 | 5 | 173 | patronen, patrone, gedruckt, drucken, drucker, druckerpatronen |


The model identifies specific product categories despite the noisy nature of review data. There are clear clusters for lighting (`lampe`, `leds`), screen protectors (`folie`, `anbringen`), and phone cases (`hülle`, `rückseite`).

However, the limitations of working with raw review data are also visible. Unlike the clean news dataset, these topics are heavily polluted by **sentiment words** (`schlecht`, `zurück`), German function words (`der`, `ich`, `oder`), and morphological variants (e.g., `folie` vs. `folien`). While the KeyBERTInspired representation captures the semantic core of each topic, the lack of lemmatization means that singular and plural forms are treated as completely different keywords, wasting valuable slots in the topic representation.
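
The cost of these variants can be estimated directly from the keyword lists. The sketch below uses keywords copied from the German topic table above and a deliberately crude heuristic of our own: two words count as variants if one is a prefix of the other and they differ by at most two characters. It is an assumption for illustration, not a substitute for real lemmatization.

```python
# Top keywords copied from the German topic table above.
german_topics = {
    2: ["lampe", "lampen", "leds", "led", "mehr", "nicht"],
    3: ["folie", "folien", "anleitung", "einfach", "anbringen", "entfernen"],
    5: ["patronen", "patrone", "gedruckt", "drucken", "drucker", "druckerpatronen"],
}

def is_variant(a, b, max_extra=2):
    """Crude heuristic: the shorter word is a prefix of the longer one,
    and the longer one has at most `max_extra` additional characters."""
    short, long_ = sorted((a, b), key=len)
    return long_.startswith(short) and len(long_) - len(short) <= max_extra

def redundant_slots(keywords):
    """Count keywords that are near-duplicates of an earlier keyword."""
    kept = []
    for word in keywords:
        if not any(is_variant(word, seen) for seen in kept):
            kept.append(word)
    return len(keywords) - len(kept)

for topic_id, keywords in german_topics.items():
    print(topic_id, redundant_slots(keywords), "of", len(keywords), "slots are near-duplicates")
```

The length limit keeps compounds such as `druckerpatronen` from being merged with `drucker`, at the price of missing longer inflections.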


#### **Visualization of German topic space**

We visualize the German topics in a 2D projection to see how well-separated they are. In the bar chart, we can inspect particular words in those clusters along with their significance.

In [9]:
fig_map = topic_model_baseline_german.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_german.visualize_barchart(top_n_topics=10)
fig_bar.show()

### **Inspecting computed topics for French**

We repeat the same steps for the inspection of the topics generated by the French model by building a clean topic overview table with interpretable keyword representations.

In [13]:
import pandas as pd

pd.set_option("display.max_colwidth", None)

def clean_keywords(repr_list):
    """Join the first six representation words into one readable string."""
    return ", ".join(repr_list[:6])

# Drop the outlier topic (-1) and build a compact overview table.
info_df = topic_model_baseline_french.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()
clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]
display(display_table.head(6))

| | Topic | Count | Top Keywords |
|---|---|---|---|
| 1 | 0 | 23890 | pas, fait, autre, comme, vous, vraiment |
| 2 | 1 | 1388 | chargeur, batterie, fait, pas, charge, recharge |
| 3 | 2 | 758 | autre, pas, fait, vitre, avec, vraiment |
| 4 | 3 | 94 | tablette, tablettes, ipad, pas, autre, avoir |
| 5 | 4 | 70 | vraiement, vraiment, pas, couleur, avec, couleurs |
| 6 | 5 | 51 | cigare, électronique, cigarette, dernière, électrique, fait |



The French model also identifies specific hardware niches, even within a noisy dataset. There are distinct clusters for power accessories (Topic 1: `chargeur`, `batterie`), tablets (Topic 3: `ipad`, `tablette`), and even electronic cigarettes (Topic 5: `cigare`, `électronique`).

However, the noise is arguably even more severe here than in the German dataset. Topic 0 and Topic 2 are junk clusters dominated by high-frequency function words such as `pas` (not), `fait` (done/makes), and `autre` (other). These words do not describe a product; they describe the act of reviewing. Furthermore, Topic 4 acts as a sentiment trap: its top keywords are the intensifier `vraiment` (really/truly) and its misspelling `vraiement`, with only `couleur` and `couleurs` hinting at an actual subject.
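
Junk clusters of this kind can be flagged automatically by measuring how many of a topic's top keywords are generic. The sketch below uses keywords copied from the French topic table above; the `GENERIC_WORDS` set and the 0.8 threshold are hand-picked assumptions for illustration (in the notebook, a set such as `nlp_fr.Defaults.stop_words` could play the same role).

```python
# Top keywords copied from the French topic table above.
french_topics = {
    0: ["pas", "fait", "autre", "comme", "vous", "vraiment"],
    1: ["chargeur", "batterie", "fait", "pas", "charge", "recharge"],
    2: ["autre", "pas", "fait", "vitre", "avec", "vraiment"],
    3: ["tablette", "tablettes", "ipad", "pas", "autre", "avoir"],
}

# Hypothetical, hand-picked list of words that carry no product information.
GENERIC_WORDS = {"pas", "fait", "autre", "comme", "vous", "vraiment", "avec", "avoir"}

def generic_share(keywords, generic=GENERIC_WORDS):
    """Fraction of a topic's keywords that appear in the generic word set."""
    if not keywords:
        return 0.0
    return sum(word in generic for word in keywords) / len(keywords)

def flag_junk_topics(topics, threshold=0.8):
    """Return the ids of topics whose generic-word share reaches the threshold."""
    return [tid for tid, kws in topics.items() if generic_share(kws) >= threshold]

print(flag_junk_topics(french_topics))
```

With these inputs the function flags Topics 0 and 2, matching the manual reading of the table.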

#### **Visualization of the French topic space**

Here, we visualize the French topic landscape using the same projections and bar chart as for the German topics. This allows us to compare it qualitatively to the German topics.

In [11]:
fig_map = topic_model_baseline_french.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_french.visualize_barchart(top_n_topics=10)
fig_bar.show()

In [29]:
from calculate_t_coherence_and_diversity import evaluate_bertopic_pmi

_, npm_german, div_german = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_german,
    docs=docs_train_german,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="German"
)

_, npm_french, div_french = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_french,
    docs=docs_train_french,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="French"
)

results_df = pd.DataFrame({
    "Metric": ["Coherence (NPMI)", "Diversity"],
    "German": [npm_german, div_german],
    "French": [npm_french, div_french],
})
display(results_df)

German Model - NPMI: 0.1371
German Model - Diversity: 0.2933
French Model - NPMI: 0.1050
French Model - Diversity: 0.2622


| | Metric | German | French |
|---|---|---|---|
| 0 | Coherence (NPMI) | 0.137134 | 0.105015 |
| 1 | Diversity | 0.293333 | 0.262222 |
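
The helper `evaluate_bertopic_pmi` comes from a local module that is not shown here, so its exact definitions are unknown. For orientation, topic diversity is commonly computed as the share of unique words among the top-k words of all topics. The following is a minimal sketch of that common definition, with hypothetical toy topics; it is not necessarily what the helper does.

```python
def topic_diversity(topics, top_k=25):
    """Share of unique words among the top_k words of all topics.

    1.0 means no word is shared between topics; values near 0 mean
    the topics largely repeat the same words.
    """
    top_words = [word for words in topics for word in words[:top_k]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)

# Hypothetical toy topics: "licht" appears in both, so 5 of 6 slots are unique.
toy_topics = [
    ["lampe", "led", "licht"],
    ["folie", "display", "licht"],
]
print(round(topic_diversity(toy_topics, top_k=3), 4))
```

Under this definition, a diversity of 0.2933 would correspond to, for example, 66 distinct words across 225 keyword slots, which means most slots repeat words already used by another topic.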


### **Evaluation**

This project aimed to replicate the success of our English news experiment on a more challenging, multilingual dataset of Amazon reviews. While we extracted interpretable product categories for both German and French, the quantitative metrics show that the topics are far less coherent. The NPMI coherence scores are **0.137** for German and **0.105** for French, well below the ~0.40 we achieved with the BBC News dataset.
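
For readers unfamiliar with the coherence metric: NPMI rescales pointwise mutual information to the range [-1, 1], where 1 means two words always co-occur and -1 means they never do. The sketch below shows one common document-level formulation on hypothetical toy documents. It is an assumption about the general technique, not a reproduction of the `evaluate_bertopic_pmi` helper, whose implementation is not shown.

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, tokenized_docs):
    """Mean NPMI over all word pairs of one topic (document-level co-occurrence).

    NPMI(w1, w2) = log(P(w1, w2) / (P(w1) * P(w2))) / -log(P(w1, w2))
    """
    n_docs = len(tokenized_docs)
    doc_sets = [set(doc) for doc in tokenized_docs]
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1 = sum(w1 in d for d in doc_sets) / n_docs
        p2 = sum(w2 in d for d in doc_sets) / n_docs
        p12 = sum(w1 in d and w2 in d for d in doc_sets) / n_docs
        if p12 == 0:
            scores.append(-1.0)  # convention: the pair never co-occurs
        elif p12 == 1:
            scores.append(0.0)   # NPMI is undefined here; treated as 0 in this sketch
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical toy corpus of already tokenized reviews.
toy_docs = [
    ["lampe", "led", "hell"],
    ["lampe", "led"],
    ["folie", "display"],
    ["folie", "kratzer"],
]
print(npmi_coherence(["lampe", "led"], toy_docs))
print(npmi_coherence(["lampe", "folie"], toy_docs))
```

In the toy corpus, `lampe` and `led` always appear together, while `lampe` and `folie` never do, so the two calls sit at opposite ends of the scale.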

These lower scores do not necessarily mean that the model failed; rather, they reflect the nature of review data. Reviews are short, informal, and heavily saturated with sentiment (e.g., "bad", "perfect"), which makes them harder to work with in topic extraction. While the KeyBERTInspired representation identified specific items such as "LEDs" or "iPads", it could not fully overcome the noise of the raw text.

To close the gap between the best score here (**0.14**) and our previous benchmark of roughly 0.40 to 0.45, future iterations would require a strict lemmatization pipeline to merge morphological variants (e.g., `folie` vs. `folien`) and a custom stopword list to filter out the generic transaction language that currently dominates the largest clusters.
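
A custom stopword list of this kind could be assembled roughly as follows. This is a minimal sketch: both word sets are small hypothetical starters drawn from the keywords seen in the German topic table, and in the notebook the base set could instead come from spaCy (for example `nlp_de.Defaults.stop_words`).

```python
# Hypothetical starter lists, based on keywords seen in the German topic table.
BASE_STOPWORDS_DE = {"der", "ich", "oder", "nicht", "keine", "mehr"}
DOMAIN_STOPWORDS_DE = {"produkt", "gekauft", "amazon", "zurück"}

def build_stopword_list(*word_sets):
    """Merge several stopword sets into one sorted, lower-cased list."""
    merged = set()
    for words in word_sets:
        merged.update(word.lower() for word in words)
    return sorted(merged)

def filter_tokens(tokens, stopwords):
    """Drop every token whose lower-cased form is in the stopword list."""
    stop = set(stopwords)
    return [tok for tok in tokens if tok.lower() not in stop]

custom_stopwords_de = build_stopword_list(BASE_STOPWORDS_DE, DOMAIN_STOPWORDS_DE)
print(filter_tokens(["Produkt", "zurück", "Lampe", "der", "LED"], custom_stopwords_de))
```

A list like `custom_stopwords_de` can be passed to scikit-learn's `CountVectorizer` through its `stop_words` parameter in place of `"english"`, so that the filtering applies to the topic representations as well.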