# **BERTopic on Italian, German, English and Spanish using the Amazon MASSIVE intent dataset**
In our earlier experiment with English news articles, the KeyBERT-inspired topic representations clearly outperformed basic frequency-based labels (c-TF-IDF), giving us cleaner and more meaningful descriptions.

We then tested the same approach on German and French Amazon reviews, which are much messier due to slang, typos, and strong opinions. We trained a separate BERTopic model for each language and applied the KeyBERT-style method to see whether we could still get clear product-related topics. However, the results were much weaker: both languages reached coherence scores of only 0.10–0.14, far below the ~0.40 we achieved on the BBC News dataset.

**In this notebook,** we test whether the same strategy works in a multilingual setting using the **amazon_massive_intent dataset** in Italian, Spanish, German and English. This dataset is simpler and cleaner than Amazon Reviews, so we want to see whether the model performs better when there is less linguistic noise.

We train a separate BERTopic model for each language, just as we did in the Amazon Reviews experiments, and apply the KeyBERT-inspired representation to check whether we can extract clear and interpretable **assistant command patterns**.


We include the English portion of the dataset as a reference point, since English data and embeddings are generally of higher quality. We also include the German portion to compare how the same language performs on a different, less noisy dataset.


**Goal:** To assess whether BERTopic can consistently uncover similar **intent-related clusters** across four languages (Italian, Spanish, German, English) in the MASSIVE dataset, and to analyze how language-specific phrasing influences cluster structure.



In [2]:
!pip install --upgrade pip
!pip install bertopic datasets sentence_transformers pandas spacy scikit-learn
!python -m spacy download de_core_news_sm
!python -m spacy download en_core_web_sm
!python -m spacy download it_core_news_sm
!python -m spacy download es_core_news_sm


In [3]:
from datasets import load_dataset

ds_italian = load_dataset("SetFit/amazon_massive_intent_it-IT")
docs_train_italian = ds_italian["train"]["text"]
categories_train_italian = ds_italian["train"]["label_text"]


ds_de = load_dataset("SetFit/amazon_massive_intent_de-DE")
docs_train_de = ds_de["train"]["text"]
categories_train_de = ds_de["train"]["label_text"]

ds_english = load_dataset("SetFit/amazon_massive_intent_en-US")
docs_train_english = ds_english["train"]["text"]
categories_train_english = ds_english["train"]["label_text"]


# spanish dataset
ds_es = load_dataset("SetFit/amazon_massive_intent_es-ES")
docs_train_es = ds_es["train"]["text"]
categories_train_es = ds_es["train"]["label_text"]



### Data Preparation
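We decided to use **lemmatization** this time to improve BERTopic, for the following reasons:
- it removes noise: inflected forms such as Produkte, Produkten and Produktes are merged into one lemma (Produkt)

- a smaller vocabulary means more stable clustering

In the cells below, the lemmatized and stopword-filtered text is passed to the evaluation step, while `fit_transform` receives the raw utterances.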
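
As a rough illustration of the vocabulary effect, the sketch below counts distinct tokens before and after mapping inflected forms to a lemma. The `toy_lemmas` lookup and the example sentences are made up for illustration; the notebook itself relies on spaCy's lemmatizer.

```python
# Toy illustration: lemmatization shrinks the vocabulary.
# `toy_lemmas` is a hypothetical hand-written lookup, not spaCy output.
toy_lemmas = {
    "produkte": "produkt",
    "produkten": "produkt",
    "produktes": "produkt",
}

docs = [
    "die produkte sind gut",
    "mit den produkten zufrieden",
    "qualität des produktes",
]

def vocabulary(texts, lemma_map=None):
    """Return the set of distinct tokens, optionally mapped through lemma_map."""
    lemma_map = lemma_map or {}
    return {lemma_map.get(tok, tok) for text in texts for tok in text.split()}

raw_vocab = vocabulary(docs)
lemma_vocab = vocabulary(docs, toy_lemmas)

print(len(raw_vocab), "distinct tokens before lemmatization")
print(len(lemma_vocab), "distinct tokens after lemmatization")
```

The three inflected forms collapse into a single entry, so any count-based statistic for "Produkt" is no longer split across three rare tokens.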

In [4]:
import spacy

nlp_it = spacy.load("it_core_news_sm")

#lemmatization for Italian only
def lemmatize_it(text: str):
    doc = nlp_it(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha]

#apply lemmatization to italian documents
lemmatized_it = [lemmatize_it(text) for text in docs_train_italian[:50000]]

#example
print("Original Italian:", docs_train_italian[38])
print("Lemmatized Italian:", lemmatized_it[38])


Original Italian: voglio del ramen da portare via olly hai qualche consiglio
Lemmatized Italian: ['volere', 'di il', 'ramen', 'da', 'portare', 'via', 'olly', 'avere', 'qualche', 'consiglio']


In [5]:
import spacy


nlp_en = spacy.load("en_core_web_sm")

#lemmatization for English only
def lemmatize_en(text: str):
    doc = nlp_en(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha]

#lemmatization to English documents
lemmatized_en = [lemmatize_en(text) for text in docs_train_english[:50000]]

#example
print("Original English:", docs_train_english[0])
print("Lemmatized English:", lemmatized_en[0])

Original English: wake me up at nine am on friday
Lemmatized English: ['wake', 'I', 'up', 'at', 'nine', 'am', 'on', 'friday']


In [6]:
import spacy


nlp_de = spacy.load("de_core_news_sm")

#lemmatization for german only
def lemmatize_de(text: str):
    doc = nlp_de(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha]

#apply lemmatization to German documents
lemmatized_de = [lemmatize_de(text) for text in docs_train_de[:50000]]

#example
print("Original German:", docs_train_de[0])
print("Lemmatized German:", lemmatized_de[0])

Original German: weck mich am freitag um neun uhr auf
Lemmatized German: ['Weck', 'mich', 'an', 'Freitag', 'um', 'neun', 'Uhr', 'auf']


In [7]:
import spacy

nlp_es = spacy.load("es_core_news_sm")

#lemmatization for Spanish only
def lemmatize_es(text: str):
    doc = nlp_es(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha]


lemmatized_es = [lemmatize_es(text) for text in docs_train_es[:50000]]

#example
print("Original Spanish:", docs_train_es[0])
print("Lemmatized Spanish:", lemmatized_es[0])

Original Spanish: despiértame a las nueve de la mañana el viernes
Lemmatized Spanish: ['despiértame', 'a', 'el', 'nueve', 'de', 'el', 'mañana', 'el', 'viernes']


In [8]:
def remove_stopwords(lemmas: list[str], lang: str) -> str:
    stopwords_map = {
        "it": nlp_it.Defaults.stop_words,
        "de": nlp_de.Defaults.stop_words,
        "en": nlp_en.Defaults.stop_words,
        "es": nlp_es.Defaults.stop_words

    }

    if lang not in stopwords_map:
        raise ValueError("lang must be 'it' or 'en' or 'es' or 'de' " )

    stopwords = stopwords_map[lang]
    cleaned = (lemma for lemma in lemmas if lemma.lower() not in stopwords)
    return " ".join(cleaned)

docs_cleaned_italian = [remove_stopwords(lemmas, "it") for lemmas in lemmatized_it]
print("Cleaned Italian:", docs_cleaned_italian[38])

docs_cleaned_english = [remove_stopwords(lemmas, "en") for lemmas in lemmatized_en]
print("Cleaned English:", docs_cleaned_english[0])

docs_cleaned_de = [remove_stopwords(lemmas, "de") for lemmas in lemmatized_de]
print("Cleaned German:", docs_cleaned_de[0])

docs_cleaned_es = [remove_stopwords(lemmas, "es") for lemmas in lemmatized_es]
print("Cleaned Spanish:", docs_cleaned_es[0])

Cleaned Italian: volere di il ramen portare olly
Cleaned English: wake friday
Cleaned German: Weck Freitag
Cleaned Spanish: despiértame mañana viernes
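
In the Italian output above, `di il` survives the filter. The lemmatizer returned the contraction `del` as the single lemma string `'di il'`, and that two-word string is not itself an entry in the stop-word set. One possible fix, sketched below, is to split each lemma on whitespace before filtering. The function name `remove_stopwords_split` and the tiny stop-word set are assumptions for illustration; in the notebook the set would come from `nlp_it.Defaults.stop_words`.

```python
def remove_stopwords_split(lemmas, stopwords):
    """Drop stop words, splitting multi-word lemmas such as 'di il' first."""
    cleaned = []
    for lemma in lemmas:
        # a lemma like "di il" becomes ["di", "il"], so each part is checked
        for part in lemma.split():
            if part.lower() not in stopwords:
                cleaned.append(part)
    return " ".join(cleaned)

# Hypothetical mini stop-word set, standing in for nlp_it.Defaults.stop_words
toy_stopwords = {"di", "il", "da", "via", "avere", "qualche", "consiglio"}

# Lemma list copied from the Italian example printed earlier
lemmas = ["volere", "di il", "ramen", "da", "portare", "via",
          "olly", "avere", "qualche", "consiglio"]

cleaned_text = remove_stopwords_split(lemmas, toy_stopwords)
print(cleaned_text)
```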


### **Set up**
##### **BERTopic model for Italian**

We trained one BERTopic model per language (Italian, German, English, Spanish) on the Amazon MASSIVE intent dataset, all sharing the same multilingual sentence-embedding model. Instead of restricting the model to 10 predefined topics, we allowed BERTopic to learn the topic structure organically, specifying only a minimum topic size of 30 documents.

In [9]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

vectorizer_model = CountVectorizer(stop_words=None)


In [10]:
representation_model_italian = KeyBERTInspired()

topic_model_baseline_italian = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="italian",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=None,
    representation_model=representation_model_italian
)

topics_base_italian, probs_base_italian = topic_model_baseline_italian.fit_transform(docs_train_italian)

2025-12-08 11:21:04,461 - BERTopic - Embedding - Transforming documents to embeddings.



2025-12-08 11:21:14,749 - BERTopic - Embedding - Completed ✓
2025-12-08 11:21:14,750 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-08 11:21:44,603 - BERTopic - Dimensionality - Completed ✓
2025-12-08 11:21:44,604 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-08 11:21:47,764 - BERTopic - Cluster - Completed ✓
2025-12-08 11:21:47,778 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-08 11:21:49,902 - BERTopic - Representation - Completed ✓


##### **BERTopic model for German**

We repeated the same setup for German. Instead of the German Amazon multilingual review dataset used previously, we used the German portion of the Amazon MASSIVE intent dataset and followed the same procedure.

In [14]:
representation_model_de = KeyBERTInspired()

topic_model_baseline_de = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="german",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=None,
    representation_model=representation_model_de
)

topics_de, probs_de = topic_model_baseline_de.fit_transform(docs_train_de)

2025-12-08 11:27:06,939 - BERTopic - Embedding - Transforming documents to embeddings.



2025-12-08 11:27:17,885 - BERTopic - Embedding - Completed ✓
2025-12-08 11:27:17,886 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-08 11:27:26,791 - BERTopic - Dimensionality - Completed ✓
2025-12-08 11:27:26,793 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-08 11:27:29,575 - BERTopic - Cluster - Completed ✓
2025-12-08 11:27:29,581 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-08 11:27:32,115 - BERTopic - Representation - Completed ✓


##### **BERTopic model for English**

We repeat the same setup for the English dataset.

In [17]:
representation_model_en = KeyBERTInspired()

topic_model_baseline_en = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="english",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=None,
    representation_model=representation_model_en
)

topics_en, probs_en = topic_model_baseline_en.fit_transform(docs_train_english)

2025-12-08 11:28:37,915 - BERTopic - Embedding - Transforming documents to embeddings.



2025-12-08 11:28:48,201 - BERTopic - Embedding - Completed ✓
2025-12-08 11:28:48,202 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-08 11:28:57,210 - BERTopic - Dimensionality - Completed ✓
2025-12-08 11:28:57,212 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-08 11:28:59,537 - BERTopic - Cluster - Completed ✓
2025-12-08 11:28:59,543 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-08 11:29:01,481 - BERTopic - Representation - Completed ✓


##### **BERTopic model for Spanish**

We repeat the same setup for the Spanish dataset.

In [11]:
representation_model_spanish = KeyBERTInspired()

topic_model_baseline_spanish = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="spanish",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=None,
    representation_model=representation_model_spanish
)

topics_base_spanish, probs_base_spanish = topic_model_baseline_spanish.fit_transform(docs_train_es)


2025-12-08 11:24:15,360 - BERTopic - Embedding - Transforming documents to embeddings.



2025-12-08 11:24:25,522 - BERTopic - Embedding - Completed ✓
2025-12-08 11:24:25,522 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-12-08 11:24:34,634 - BERTopic - Dimensionality - Completed ✓
2025-12-08 11:24:34,636 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-12-08 11:24:37,574 - BERTopic - Cluster - Completed ✓
2025-12-08 11:24:37,580 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-12-08 11:24:39,858 - BERTopic - Representation - Completed ✓


### **Inspecting computed topics for Italian**

In [15]:
info_df = topic_model_baseline_italian.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(8))

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 900 | sono, favore, lette, se, chiocciola, alla |
| 1 | 791 | mettere, metti, favore, alla, brani, riprodurre |
| 2 | 443 | casa, spegni, soggiorno, salotto, accendi, sfumatura |
| 3 | 373 | momento, metti, se, suona, mettere, alzarmi |
| 4 | 348 | aggiornami, sono, testa, stati, tav, sulla |
| 5 | 337 | giorno, dammi, sono, dimmi, settimana, per |
| 6 | 324 | tra, favore, della, alla, sera, fissa |
| 7 | 302 | mese, favore, vai, dagli, gli, per |


The model identifies some meaningful intent-related themes in the Italian MASSIVE dataset. For example, Topic 1 shows a fairly clear structure, capturing media-playback actions through verbs such as **“mettere”** and **“metti”** (put/play) and nouns such as **“brani”** (songs). However, the output also highlights key limitations: several clusters are mixed or noisy, which indicates difficulty in extracting clean intent categories from short Italian utterances. For instance, Topic 0 is dominated by extremely frequent conversational tokens such as **“sono”** (I am), **“favore”** (please), and even **“chiocciola”** (the @ symbol).

Together, these examples show that BERTopic struggles to form clean intent clusters when input texts are short, formulaic, and dominated by high-frequency filler words.

### **Inspecting computed topics for English**

We repeat the same inspection steps for the topics generated by the English model, building a clean topic overview table with interpretable keyword representations.

In [18]:
info_df = topic_model_baseline_en.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(8))

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 880 | let, hour, read, late, any, please |
| 1 | 517 | tonight, olly, eleven, two, show, list |
| 2 | 458 | kitchen, house, room, roomba, garage, porch |
| 3 | 372 | show, pa, now, on, check, next |
| 4 | 310 | rest, busy, am, schedule, clean, scheduled |
| 5 | 290 | on, five, olly, ten, too, be |
| 6 | 224 | garner, chan, parton, keanu, channing, beiber |
| 7 | 188 | twitter, via, on, and, too, my |


The English BERTopic model produces some clean and semantically consistent clusters. For example, Topic 2 captures a coherent smart-home theme with location and device keywords such as **“kitchen,”** **“house,”** **“room,”** **“roomba,”** **“garage,”** and **“porch”**, which aligns well with typical smart-home commands.

However, some clusters still suffer from noise and mixed semantics. Topic 6, for instance, consists almost entirely of celebrity names (**“garner,” “parton,” “keanu,” “channing,” “beiber”**), so it groups utterances by the entities they mention rather than by intent, and Topic 0 mixes unrelated words such as **“let,”** **“hour,”** **“read,”** and **“please”**. This indicates that BERTopic occasionally forms clusters based on accidental lexical co-occurrence rather than true intent similarity.

### **Inspecting computed topics for German**

We repeat the same inspection steps for the topics generated by the German model, building a clean topic overview table with interpretable keyword representations.

In [19]:
info_df = topic_model_baseline_de.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(8))

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 835 | stunde, posteingang, sende, bitte, auf, irgendwelche |
| 1 | 831 | öffne, mir, hörbuch, möchte, auf, spielt |
| 2 | 443 | viel, uhrzeit, wird, weit, mir, statt |
| 3 | 423 | auf, schalten, ausmachen, aller, bitte, der |
| 4 | 379 | prüfe, brauche, weckalarm, bestätige, tech, stunde |
| 5 | 325 | bitte, heutige, schön, gerade, samstag, tage |
| 6 | 211 | sprich, dich, bitte, kopfhörer, machen, sehr |
| 7 | 210 | sofort, anmachen, ausmachen, gut, rede, hoch |


The German BERTopic model shows mixed performance.
On the positive side, Topic 0 surfaces email-related actions, with keywords such as **“posteingang”** (inbox) and **“sende”** (send), indicating that the model picks up a communication-intent theme.


However, a clear limitation appears with the repeated use of the politeness word **“bitte”** (please). It shows up in Topic 0, Topic 3, Topic 5, and Topic 6, even though these topics correspond to different intents such as email, device control, and daily scheduling. Because “bitte” is so common in short German commands, BERTopic overweights it, which mixes polite filler words with the actual intent-related verbs and reduces the overall clarity of the topics.
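
The overlap can be checked directly by counting in how many topics each keyword appears. The sketch below uses the top keywords copied from the German table above; the helper name `keyword_topic_counts` is our own and not part of BERTopic.

```python
from collections import Counter

# Top keywords per topic, copied from the German table above
german_topics = {
    0: ["stunde", "posteingang", "sende", "bitte", "auf", "irgendwelche"],
    1: ["öffne", "mir", "hörbuch", "möchte", "auf", "spielt"],
    2: ["viel", "uhrzeit", "wird", "weit", "mir", "statt"],
    3: ["auf", "schalten", "ausmachen", "aller", "bitte", "der"],
    4: ["prüfe", "brauche", "weckalarm", "bestätige", "tech", "stunde"],
    5: ["bitte", "heutige", "schön", "gerade", "samstag", "tage"],
    6: ["sprich", "dich", "bitte", "kopfhörer", "machen", "sehr"],
    7: ["sofort", "anmachen", "ausmachen", "gut", "rede", "hoch"],
}

def keyword_topic_counts(topics):
    """Count the number of topics in which each keyword occurs."""
    counts = Counter()
    for words in topics.values():
        counts.update(set(words))  # set(): count each topic at most once
    return counts

counts = keyword_topic_counts(german_topics)

# Keywords shared by more than one topic, most widespread first
shared = sorted(
    ((w, n) for w, n in counts.items() if n > 1),
    key=lambda item: (-item[1], item[0]),
)
print(shared)
```

Keywords that appear in several topics are natural candidates for a custom stop-word list passed to the vectorizer.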

### **Inspecting computed topics for Spanish**

We repeat the same steps for the inspection of the topics generated by the Spanish model by building a clean topic overview table with interpretable keyword representations.

In [20]:
info_df = topic_model_baseline_spanish.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(8))

| Topic | Count | Top Keywords |
|---|---|---|
| 0 | 925 | ver, algún, para, este, su, favor |
| 1 | 760 | guardar, poner, ponme, modo, guarda, toca |
| 2 | 438 | casa, salón, porche, habitación, apagar, muebles |
| 3 | 372 | reuniré, tengo, para, próxima, viene, seis |
| 4 | 275 | quiero, hora, horas, hoy, está, pasando |
| 5 | 273 | días, día, cita, quiero, hoy, los |
| 6 | 272 | compaq, pasa, dime, rápido, gonzález, quiero |
| 7 | 224 | actualmente, hora, momento, serán, ahora, hace |


As in the other languages, some topics capture real assistant-style intents. For example, Topic 2 groups home-related commands with keywords like “casa” (house), “salón” (living room), “porche” (porch), and “apagar” (turn off), indicating a coherent cluster focused on smart-home control. However, several limitations also appear. Topic 0 contains very general words like “ver” (see), “algún” (some), and “favor” (please), which do not correspond to a specific intent. High-frequency filler terms such as “quiero” (I want) and “para” (for) show up across multiple clusters, reducing their distinctiveness.

### Visualization

#### **Visualization of Italian topic space**

Similar to the previous notebook, we visualize the topics in a 2D projection to assess how well separated they are. We do this for all four languages of the Amazon MASSIVE intent dataset (Italian, English, German, Spanish). The accompanying bar chart lets us inspect the keywords within each cluster along with their relative importance.

In [22]:
fig_map = topic_model_baseline_italian.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_italian.visualize_barchart(top_n_topics=10)
fig_bar.show()

#### **Visualization of the English topic space**

In [23]:
fig_map = topic_model_baseline_en.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_en.visualize_barchart(top_n_topics=10)
fig_bar.show()

#### **Visualization of the German topic space**

In [24]:
fig_map = topic_model_baseline_de.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_de.visualize_barchart(top_n_topics=10)
fig_bar.show()

#### **Visualization of the Spanish topic space**

In [25]:
fig_map = topic_model_baseline_spanish.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_spanish.visualize_barchart(top_n_topics=10)
fig_bar.show()

### Evaluation

In [28]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import itertools
from math import log

class TopicModelWrapper:
    def __init__(self, topics):
        self.topics = topics

    def get_topics(self):
        return self.topics

def evaluate_bertopic_pmi(
    topic_model,
    docs,
    top_k_coherence: int = 5,
    top_k_diversity: int = 10,
    skip_outlier: bool = True,
    tag: str = "Baseline",
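):
    """Compute mean NPMI coherence and topic diversity for a fitted BERTopic model.

    Probabilities are document-level: a word counts once per document it appears in.
    """
    raw_topics = topic_model.get_topics()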

    new_keywords = {
        topic_id: [word for word, _ in word_scores]
        for topic_id, word_scores in raw_topics.items()
    }


    if skip_outlier and -1 in new_keywords:
        new_keywords = {tid: kws for tid, kws in new_keywords.items() if tid != -1}


    all_keywords = sorted({w for kws in new_keywords.values() for w in kws})

    vectorizer = CountVectorizer(vocabulary=all_keywords, lowercase=True)
    X = vectorizer.fit_transform(docs)

    n_docs, n_terms = X.shape

    X_bin = (X > 0).astype(int)

    word_doc_counts = np.asarray(X_bin.sum(axis=0)).ravel()
    p_w = word_doc_counts / n_docs

    cooc_counts = (X_bin.T @ X_bin).toarray()
    p_ij = cooc_counts / n_docs

    vocab = np.array(vectorizer.get_feature_names_out())
    word2id = {w: i for i, w in enumerate(vocab)}


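    def npmi_pair(w1, w2):
        # look up vocabulary indices; words missing from the vocabulary are skipped
        i = word2id.get(w1)
        j = word2id.get(w2)
        if i is None or j is None:
            return None

        # pairs that never co-occur are skipped rather than scored as -1;
        # pij == 1 is skipped too, since -log(1) would divide by zero
        pij = p_ij[i, j]
        if pij == 0 or pij == 1:
            return None

        pi = p_w[i]
        pj = p_w[j]

        # NPMI = PMI / -log p(w1, w2), which lies in [-1, 1]
        pmi = log(pij / (pi * pj))
        return pmi / (-log(pij))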

    def topic_npmi_coherence(topic_words, top_k=None):
        if top_k is not None:
            topic_words = topic_words[:top_k]

        scores = []
        for w1, w2 in itertools.combinations(topic_words, 2):
            score = npmi_pair(w1, w2)
            if score is not None:
                scores.append(score)

        if not scores:
            return float("nan")
        return float(np.mean(scores))

    topic_scores = {
        topic_id: topic_npmi_coherence(words, top_k=top_k_coherence)
        for topic_id, words in new_keywords.items()
    }

    coherence_df = pd.DataFrame(
        {
            "Topic": list(topic_scores.keys()),
            "NPMI": list(topic_scores.values()),
        }
    ).sort_values("Topic")

    mean_npmi = float(np.nanmean(coherence_df["NPMI"]))


    def topic_diversity(topics_dict, k=10):
        # collect top-k words for each topic
        topk_words = []
        for tid, words in topics_dict.items():
            topk_words.extend(words[:k])

        if not topk_words:
            return float("nan")

        unique_words = set(topk_words)
        T = len(topics_dict)
        # the denominator assumes every topic supplies k words; if a topic
        # stores fewer than k keywords, the score stays below 1 even without overlap
        total_words = T * k

        return len(unique_words) / total_words

    diversity = float(topic_diversity(new_keywords, k=top_k_diversity))

    print(f"{tag} Model - NPMI: {mean_npmi:.4f}")
    print(f"{tag} Model - Diversity: {diversity:.4f}")

    return coherence_df, mean_npmi, diversity
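
To make the coherence computation concrete, here is a small worked example of document-level NPMI on a toy corpus. It mirrors the convention used in the function above of skipping pairs that never co-occur. The corpus and the helper names `doc_prob` and `npmi` are made up for illustration.

```python
from math import log

# Toy corpus of four short commands (hypothetical data)
docs = ["play music", "play song", "turn light off", "light on"]
doc_sets = [set(d.split()) for d in docs]
n_docs = len(doc_sets)

def doc_prob(*words):
    """Fraction of documents that contain all of the given words."""
    return sum(all(w in s for w in words) for s in doc_sets) / n_docs

def npmi(w1, w2):
    """Normalized PMI in [-1, 1]; None if the pair cannot be scored."""
    p12 = doc_prob(w1, w2)
    if p12 == 0 or p12 == 1:
        return None  # never co-occur, or -log(1) == 0 in the denominator
    pmi = log(p12 / (doc_prob(w1) * doc_prob(w2)))
    return pmi / (-log(p12))

# p(play) = 2/4, p(music) = 1/4, p(play, music) = 1/4
# PMI = log(0.25 / 0.125) = log 2, and -log(0.25) = 2 log 2, so NPMI is about 0.5
score = npmi("play", "music")
print(score)

# "play" and "light" never share a document, so the pair is skipped
print(npmi("play", "light"))
```

Because skipped pairs do not enter the mean, a topic whose keywords rarely co-occur is scored only on the pairs that do, which tends to push the average upward compared with assigning -1 to such pairs.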

In [31]:


_, npm_german, div_german = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_de,
    docs=docs_cleaned_de,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="German"
)

_, npm_italian, div_italian = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_italian,
    docs=docs_cleaned_italian,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="Italian"
)
_, npm_en, div_en = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_en,
    docs=docs_cleaned_english,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="English"
)
_, npm_es, div_es = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_spanish,
    docs=docs_cleaned_es,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="Spanish"
)

results_df = pd.DataFrame({
    "Metric": ["Coherence (NPMI)", "Diversity"],
    "German": [npm_german, div_german],
    "Italian": [npm_italian, div_italian],
    "English": [npm_en, div_en],
    "Spanish": [npm_es, div_es]
})
display(results_df)

German Model - NPMI: 0.2415
German Model - Diversity: 0.2254
Italian Model - NPMI: 0.1377
Italian Model - Diversity: 0.1742
English Model - NPMI: 0.1996
English Model - Diversity: 0.2099
Spanish Model - NPMI: 0.1848
Spanish Model - Diversity: 0.1926


| Metric | German | Italian | English | Spanish |
|---|---|---|---|---|
| Coherence (NPMI) | 0.241506 | 0.137677 | 0.199629 | 0.184774 |
| Diversity | 0.225405 | 0.174177 | 0.209855 | 0.192593 |
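
One caveat when reading the diversity row: `topic_diversity` divides by `T * k` with `k = 25`, while BERTopic stores `top_n_words` keywords per topic (10 by default, and the models above do not override it). If fewer than `k` keywords are available per topic, the score cannot reach 1 even when no keyword is shared. A possible alternative, sketched here under that assumption, normalizes by the number of keywords actually collected. The function name `topic_diversity_collected` and the toy topics are hypothetical.

```python
def topic_diversity_collected(topics_dict, k=10):
    """Unique top-k words divided by the number of words actually collected."""
    topk_words = []
    for words in topics_dict.values():
        topk_words.extend(words[:k])
    if not topk_words:
        return float("nan")
    return len(set(topk_words)) / len(topk_words)

# Toy topics with 4 keywords each and no overlap (hypothetical data)
toy_topics = {
    0: ["play", "music", "song", "radio"],
    1: ["light", "lamp", "kitchen", "dim"],
}

k = 10
unique_words = {w for words in toy_topics.values() for w in words[:k]}

# Fixed denominator T * k, as in the evaluation function
fixed_denominator = len(unique_words) / (len(toy_topics) * k)

# Denominator based on the words that were actually collected
collected_denominator = topic_diversity_collected(toy_topics, k=k)

print(fixed_denominator)
print(collected_denominator)
```

With the fixed denominator the toy topics score well below 1 despite sharing no keywords, so diversity values computed with `k` larger than the stored keyword count are best compared only against each other.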


# Overall Summary for four Languages

> Italian, Spanish, English, German



The goal of this project was to replicate the success observed in our English news experiment and, to some extent, in the more challenging multilingual Amazon Reviews dataset, **using a simpler dataset to evaluate whether the model would perform better.** Additionally, we wanted to compare how Italian and Spanish behave relative to the other languages. Previously, we experimented with the German Amazon multilingual review dataset, but the data proved too noisy due to its nature as a review corpus.

Although we were able to extract intent-related themes for Italian, Spanish, German, and English, the quantitative metrics indicate that the results were again considerably less successful. The coherence scores for German, Italian and Spanish were 0.24, 0.14 and 0.18, respectively. We expected the English portion of the Amazon MASSIVE Intent dataset to perform better; however, it produced a coherence score of 0.20, which is also very low. In contrast, the BBC News dataset achieved a coherence score of about 0.40, highlighting the performance gap. Relative to the 0.26 we reported for the German Amazon Reviews model, the German score of 0.24 is essentially unchanged.



With the Amazon Reviews dataset, we initially hypothesized that the model underperformed due to the nature of review data: texts that are short, informal, and heavily saturated with sentiment (e.g., “bad,” “perfect”), which complicates topic extraction. However, the Amazon MASSIVE Intent dataset is simple and clean, without slang, typos, or strong sentiment. Despite this, the model still performed poorly.

These results suggest that the multilingual BERTopic model may still have notable limitations, particularly when applied to intent-based datasets across different languages. At the same time, suitable high-quality datasets in multiple languages are scarce, which further constrains our ability to comprehensively evaluate BERTopic’s multilingual performance.