# **BERTopic on Italian, Spanish, German and English with the amazon_massive_intent dataset**
In our earlier experiment with English news articles, the KeyBERT-inspired topic representations clearly outperformed basic frequency-based labels (c-TF-IDF), giving us cleaner and more meaningful descriptions.

We then tested the same approach on German and French Amazon reviews, which are much messier due to slang, typos, and strong opinions. We trained separate BERTopic models for each language and applied the KeyBERT-style method to see if we could still get clear product-related topics. However, the results were much weaker: both languages reached coherence scores of only 0.10–0.14, far below the ~0.40 we achieved on the BBC News dataset.

**In this notebook,** we test whether the same strategy works in a multilingual setting using the **amazon_massive_intent dataset** in Italian, Spanish, German and English. This dataset is simpler and cleaner than Amazon Reviews, so we want to see if the model performs better when there is less linguistic noise.

We train separate BERTopic models for each language, just like we did with the Amazon Reviews experiments, and apply the KeyBERT-inspired representation to check whether we can extract clear and interpretable **assistant command patterns**.


We included the English split of the dataset as a reference point for comparison, since English data and embeddings are generally of higher quality. We also included the German split to compare how the same language performs on a different, less noisy dataset.


**Goal:** To assess whether BERTopic can consistently uncover similar **intent-related clusters** across four languages (Italian, Spanish, German, and English) in the MASSIVE dataset, and to analyze how language-specific phrasing influences cluster structure.

In [1]:
!pip install --upgrade pip
!pip install bertopic datasets sentence_transformers pandas spacy scikit-learn
!python -m spacy download de_core_news_sm
!python -m spacy download en_core_web_sm
!python -m spacy download it_core_news_sm
!python -m spacy download es_core_news_sm

Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 25.1
    Uninstalling pip-25.1:
      Successfully uninstalled pip-25.1
Successfully installed pip-25.3
Collecting bertopic
  Downloading bertopic-0.17.4-py3-none-any.whl.metadata (24 kB)
Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting sentence_transformers
  Downloading sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Collecting spacy
  Downloading spacy-3.8.11-cp313-cp313-macosx_10_13_x86_64.whl.metadata (27 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.40.tar.gz (6.9 MB)

In [None]:
from datasets import load_dataset

ds_italian = load_dataset("SetFit/amazon_massive_intent_it-IT")
docs_train_italian = ds_italian["train"]["text"]
categories_train_italian = ds_italian["train"]["label_text"]


ds_de = load_dataset("SetFit/amazon_massive_intent_de-DE")
docs_train_de = ds_de["train"]["text"]
categories_train_de = ds_de["train"]["label_text"]

ds_english = load_dataset("SetFit/amazon_massive_intent_en-US")
docs_train_english = ds_english["train"]["text"]
categories_train_english = ds_english["train"]["label_text"]


# spanish dataset
ds_es = load_dataset("SetFit/amazon_massive_intent_es-ES")
docs_train_es = ds_es["train"]["text"]
categories_train_es = ds_es["train"]["label_text"]

### Data Preparation

In [None]:
# Italian Dataset
import spacy

nlp_it = spacy.load("it_core_news_sm")


def lemmatize_it(text: str):
    doc = nlp_it(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha]


lemmatized_it = [lemmatize_it(text) for text in docs_train_italian[:50000]]

#example
print("Original Italian:", docs_train_italian[38])
print("Lemmatized Italian:", lemmatized_it[38])

In [None]:
# English Dataset
import spacy


nlp_en = spacy.load("en_core_web_sm")


def lemmatize_en(text: str):
    doc = nlp_en(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha]

lemmatized_en = [lemmatize_en(text) for text in docs_train_english[:50000]]


print("Original English:", docs_train_english[0])
print("Lemmatized English:", lemmatized_en[0])

In [None]:
# German Dataset
import spacy


nlp_de = spacy.load("de_core_news_sm")


def lemmatize_de(text: str):
    doc = nlp_de(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha]


lemmatized_de = [lemmatize_de(text) for text in docs_train_de[:50000]]


print("Original German:", docs_train_de[0])
print("Lemmatized German:", lemmatized_de[0])

In [None]:
# Spanish Dataset
import spacy

nlp_es = spacy.load("es_core_news_sm")

#lemmatization for Spanish only
def lemmatize_es(text: str):
    doc = nlp_es(text)
    return [tok.lemma_ for tok in doc if tok.is_alpha]


lemmatized_es = [lemmatize_es(text) for text in docs_train_es[:50000]]

#example
print("Original Spanish:", docs_train_es[0])
print("Lemmatized Spanish:", lemmatized_es[0])

In [None]:
def remove_stopwords(lemmas: list[str], lang: str) -> str:
    stopwords_map = {
        "it": nlp_it.Defaults.stop_words,
        "de": nlp_de.Defaults.stop_words,
        "en": nlp_en.Defaults.stop_words,
        "es": nlp_es.Defaults.stop_words

    }

    if lang not in stopwords_map:
        raise ValueError("lang must be 'it' or 'en' or 'es' or 'de' " )

    stopwords = stopwords_map[lang]
    cleaned = (lemma for lemma in lemmas if lemma.lower() not in stopwords)
    return " ".join(cleaned)

docs_cleaned_italian = [remove_stopwords(lemmas, "it") for lemmas in lemmatized_it]
print("Cleaned Italian:", docs_cleaned_italian[38])

docs_cleaned_english = [remove_stopwords(lemmas, "en") for lemmas in lemmatized_en]
print("Cleaned English:", docs_cleaned_english[0])

docs_cleaned_de = [remove_stopwords(lemmas, "de") for lemmas in lemmatized_de]
print("Cleaned German:", docs_cleaned_de[0])

docs_cleaned_es = [remove_stopwords(lemmas, "es") for lemmas in lemmatized_es]
print("Cleaned Spanish:", docs_cleaned_es[0])

### **Set up**
##### **BERTopic model for Italian**

We trained a separate BERTopic model for each language of the Amazon MASSIVE intent dataset (Italian, German, English, and Spanish), all sharing the same multilingual sentence-embedding model. Instead of restricting the model to 10 predefined topics, we allowed BERTopic to learn the topic structure naturally, specifying only a minimum topic size of 30 documents.

In [None]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

vectorizer_model = CountVectorizer(stop_words=None)

In [None]:
representation_model_italian = KeyBERTInspired()

topic_model_baseline_italian = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="italian",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=None,
    representation_model=representation_model_italian
)

topics_base_italian, probs_base_italian = topic_model_baseline_italian.fit_transform(docs_train_italian)

##### **BERTopic model for German**

We repeated the same setup for German. Instead of the Amazon multilingual review dataset, we used the German split of the Amazon MASSIVE intent dataset and followed the same procedure.

In [None]:
representation_model_de = KeyBERTInspired()

topic_model_baseline_de = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="german",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=None,
    representation_model=representation_model_de
)

topics_de, probs_de = topic_model_baseline_de.fit_transform(docs_train_de)

##### **BERTopic model for English**

We repeat the same setup for the English Dataset.

In [None]:
representation_model_en = KeyBERTInspired()

topic_model_baseline_en = BERTopic(
    embedding_model=sentence_model,      
    vectorizer_model=vectorizer_model,   
    language="english",                  
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=None,                      
    representation_model=representation_model_en
)

topics_en, probs_en = topic_model_baseline_en.fit_transform(docs_train_english)

##### **BERTopic model for Spanish**

We repeat the same setup for the Spanish Dataset.

In [None]:
representation_model_spanish = KeyBERTInspired()

topic_model_baseline_spanish = BERTopic(
    embedding_model=sentence_model,
    vectorizer_model=vectorizer_model,
    language="spanish",
    calculate_probabilities=True,
    verbose=True,
    min_topic_size=30,
    nr_topics=None,
    representation_model=representation_model_spanish
)

topics_base_spanish, probs_base_spanish = topic_model_baseline_spanish.fit_transform(docs_train_es)

### **Inspecting computed topics for Italian**

In [None]:
info_df = topic_model_baseline_italian.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(8))

Findings: The model identifies some meaningful intent-related themes in the Italian MASSIVE dataset. For example, Topic 1 shows a fairly clear structure, capturing media-playback actions through verbs like **“mettere”** and **“metti”** (put/play) and nouns such as **“brani”** (songs). However, the output also highlights key limitations: BERTopic produced several mixed or noisy clusters on the Italian dataset, indicating difficulty in extracting clean intent categories from short Italian utterances. For instance, Topic 0 is dominated by extremely frequent conversational tokens such as **“sono”** (I am), **“favore”** (please), and even **“chiocciola”** (the @ symbol).

Together, these examples show that BERTopic struggles to form clean intent clusters when input texts are short, formulaic, and dominated by high-frequency filler words.

### **Inspecting computed topics for English**

We repeat the same steps to inspect the topics generated by the English model, building a clean topic overview table with interpretable keyword representations.

In [None]:
info_df = topic_model_baseline_en.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(8))

As predicted, the English BERTopic model performs noticeably better than the Italian and German versions, producing cleaner and more semantically consistent clusters. For example, Topic 0 captures a coherent lighting-control intent with keywords such as **“darken,”** **“lighting,”** **“lamp,”** **“brighten,”** and **“lights”**, which aligns well with typical smart-home commands.

However, some clusters still suffer from noise and mixed semantics. Topic 5, for instance, blends celebrity names (**“kim,”** **“elvis,”** **“miley”**) with unrelated temporal words (**“when”**), indicating that BERTopic occasionally forms clusters based on accidental lexical co-occurrence rather than true intent similarity.

### **Inspecting computed topics for German**

We repeat the same steps to inspect the topics generated by the German model, building a clean topic overview table with interpretable keyword representations.

In [None]:
info_df = topic_model_baseline_de.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(8))

Findings: The German BERTopic model shows mixed performance.
On the positive side, Topic 2 cleanly captures email-related actions, with keywords like **“posteingang”** (inbox), **“öffne”** (open), **“erhalten”** (receive), and **“schicke”** (send), indicating that the model successfully extracts a coherent communication-intent topic.

However, a clear limitation appears with the repeated use of the politeness word **“bitte”** (please). It shows up in Topic 0, Topic 3, Topic 5, and Topic 6, even though these topics correspond to different intents such as media playback, device control, and daily schedules. Because “bitte” is so common in short German commands, BERTopic overweights it, which mixes polite filler words with the actual intent-related verbs and reduces the overall clarity of the topics.
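
The overlap described above can be checked programmatically. The following is a minimal sketch, not part of the original pipeline: the helper name `shared_keywords` and the toy keyword lists are hypothetical and only illustrate the idea of listing every keyword that appears in the top-k representation of more than one topic.

```python
from collections import defaultdict


def shared_keywords(topics, top_k=10, min_topics=2):
    """Map each keyword to the sorted topic ids whose top-k list contains it.

    `topics` is a dict of topic_id -> list of keywords (best first).
    Only keywords found in at least `min_topics` topics are returned.
    """
    hits = defaultdict(set)
    for topic_id, words in topics.items():
        if topic_id == -1:  # skip the outlier topic
            continue
        for word in words[:top_k]:
            hits[word.lower()].add(topic_id)
    return {w: sorted(ids) for w, ids in hits.items() if len(ids) >= min_topics}


# Hypothetical toy keyword lists for illustration, NOT the model's real output.
toy_topics = {
    0: ["bitte", "spiele", "musik"],
    1: ["wetter", "morgen", "regen"],
    2: ["posteingang", "öffne", "schicke"],
    3: ["bitte", "licht", "schalte"],
}

print(shared_keywords(toy_topics))
```

For the fitted German model, the input could be built the same way the evaluation cell does it, e.g. `{tid: [w for w, _ in ws] for tid, ws in topic_model_baseline_de.get_topics().items()}`.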

### **Inspecting computed topics for Spanish**

We repeat the same steps for the inspection of the topics generated by the Spanish model by building a clean topic overview table with interpretable keyword representations.

In [None]:
info_df = topic_model_baseline_spanish.get_topic_info()
clean_df = info_df[info_df["Topic"] != -1].copy()

def clean_keywords(repr_list):
    return ", ".join(repr_list[:6])

clean_df["Top Keywords"] = clean_df["Representation"].apply(clean_keywords)

display_table = clean_df[["Topic", "Count", "Top Keywords"]]

import pandas as pd
pd.set_option('display.max_colwidth', None)
display(display_table.head(8))

Findings: As in the other languages, some topics capture real assistant-style intents. For example, Topic 3 groups together home-related commands with keywords like “casa” (house), “salón” (living room), “porche” (porch), and “apagar” (turn off), indicating a coherent cluster focused on smart-home control. However, several limitations also appear. Topic 0 contains extremely general words like “ver” (see), “algún” (some), and “favor” (please), which do not correspond to a specific intent. High-frequency filler terms such as “quiero” (I want) and “para” (for) show up across multiple clusters, reducing their distinctiveness.

### Visualization

#### **Visualization of Italian topic space**

Similar to the previous notebook, we visualize the topics in a 2D projection to assess how well separated they are. We do this for all four languages of the Amazon MASSIVE intent dataset: Italian, English, German, and Spanish. The accompanying bar chart allows us to inspect the key words within each cluster along with their relative importance.

In [None]:
fig_map = topic_model_baseline_italian.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_italian.visualize_barchart(top_n_topics=10)
fig_bar.show()

#### **Visualization of the English topic space**

In [None]:
fig_map = topic_model_baseline_en.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_en.visualize_barchart(top_n_topics=10)
fig_bar.show()

#### **Visualization of the German topic space**

In [None]:
fig_map = topic_model_baseline_de.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_de.visualize_barchart(top_n_topics=10)
fig_bar.show()

#### **Visualization of the Spanish topic space**

In [None]:
fig_map = topic_model_baseline_spanish.visualize_topics()
fig_map.show()

fig_bar = topic_model_baseline_spanish.visualize_barchart(top_n_topics=10)
fig_bar.show()

### Evaluation

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import itertools
from math import log

class TopicModelWrapper:
    def __init__(self, topics):
        self.topics = topics

    def get_topics(self):
        return self.topics

def evaluate_bertopic_pmi(
    topic_model,
    docs,
    top_k_coherence: int = 5,
    top_k_diversity: int = 10,
    skip_outlier: bool = True,
    tag: str = "Baseline",
):

    
    raw_topics = topic_model.get_topics()

    new_keywords = {
        topic_id: [word for word, _ in word_scores]
        for topic_id, word_scores in raw_topics.items()
    }

    if skip_outlier and -1 in new_keywords:
        new_keywords = {tid: kws for tid, kws in new_keywords.items() if tid != -1}


    all_keywords = sorted({w for kws in new_keywords.values() for w in kws})

    vectorizer = CountVectorizer(vocabulary=all_keywords, lowercase=True)
    X = vectorizer.fit_transform(docs)  # shape: (n_docs, n_terms)

    n_docs, n_terms = X.shape

    X_bin = (X > 0).astype(int)

    word_doc_counts = np.asarray(X_bin.sum(axis=0)).ravel()
    p_w = word_doc_counts / n_docs

    cooc_counts = (X_bin.T @ X_bin).toarray()
    p_ij = cooc_counts / n_docs

    vocab = np.array(vectorizer.get_feature_names_out())
    word2id = {w: i for i, w in enumerate(vocab)}


    def npmi_pair(w1, w2):
        i = word2id.get(w1)
        j = word2id.get(w2)
        if i is None or j is None:
            return None

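        pij = p_ij[i, j]
        if pij == 0:
            # Pairs that never co-occur are skipped (not scored as -1),
            # so they do not pull the topic's mean NPMI down.
            return None

        pi = p_w[i]
        pj = p_w[j]

        # NPMI = PMI(w1, w2) / -log p(w1, w2), based on document co-occurrence
        pmi = log(pij / (pi * pj))
        return pmi / (-log(pij))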

    def topic_npmi_coherence(topic_words, top_k=None):
        if top_k is not None:
            topic_words = topic_words[:top_k]

        scores = []
        for w1, w2 in itertools.combinations(topic_words, 2):
            score = npmi_pair(w1, w2)
            if score is not None:
                scores.append(score)

        if not scores:
            return float("nan")
        return float(np.mean(scores))

    topic_scores = {
        topic_id: topic_npmi_coherence(words, top_k=top_k_coherence)
        for topic_id, words in new_keywords.items()
    }

    coherence_df = pd.DataFrame(
        {
            "Topic": list(topic_scores.keys()),
            "NPMI": list(topic_scores.values()),
        }
    ).sort_values("Topic")

    mean_npmi = float(np.nanmean(coherence_df["NPMI"]))


    def topic_diversity(topics_dict, k=10):
        # collect top-k words for each topic
        topk_words = []
        for tid, words in topics_dict.items():
            topk_words.extend(words[:k])

        if not topk_words:
            return float("nan")

        unique_words = set(topk_words)
        T = len(topics_dict)
        total_words = T * k

        return len(unique_words) / total_words

    diversity = float(topic_diversity(new_keywords, k=top_k_diversity))

    print(f"{tag} Model - NPMI: {mean_npmi:.4f}")
    print(f"{tag} Model - Diversity: {diversity:.4f}")

    return coherence_df, mean_npmi, diversity
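
To make the two metrics concrete, here is a small worked example on a toy corpus. It is a simplified sketch rather than the evaluation function itself: the names `doc_npmi`, `toy_docs`, and `toy_topics` are hypothetical, and plain whitespace tokenization is assumed in place of `CountVectorizer`. As in the function above, probabilities are document-level and pairs that never co-occur return `None`.

```python
from math import log


def doc_npmi(docs, w1, w2):
    """Document-level NPMI of two words; None if they never co-occur.

    Assumes the pair does not occur in every document (p12 < 1).
    """
    n_docs = len(docs)
    token_sets = [set(doc.lower().split()) for doc in docs]
    p1 = sum(w1 in tokens for tokens in token_sets) / n_docs
    p2 = sum(w2 in tokens for tokens in token_sets) / n_docs
    p12 = sum(w1 in tokens and w2 in tokens for tokens in token_sets) / n_docs
    if p12 == 0:
        return None  # skipped, mirroring the evaluation function
    return log(p12 / (p1 * p2)) / (-log(p12))


def topic_diversity(topics, k):
    """Share of unique words among the top-k words of all topics."""
    top_words = [w for words in topics.values() for w in words[:k]]
    return len(set(top_words)) / (len(topics) * k)


# Toy data for illustration only.
toy_docs = [
    "turn on the lights",
    "dim the lights",
    "play some music",
    "play the next song",
]
toy_topics = {0: ["lights", "lamp"], 1: ["lights", "music"]}

# p(dim) = 1/4, p(lights) = 2/4, p(dim, lights) = 1/4
# PMI = log(0.25 / 0.125) = log 2, NPMI = log 2 / log 4 = 0.5
print(doc_npmi(toy_docs, "dim", "lights"))
print(doc_npmi(toy_docs, "play", "lights"))   # never co-occur -> None
print(topic_diversity(toy_topics, k=2))       # 3 unique words out of 4 slots
```

In this toy case, the shared word "lights" lowers diversity to 3/4, and a word pair that always appears together ("turn", "on") would reach the maximum NPMI of 1.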

In [None]:


_, npm_german, div_german = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_de,
    docs=docs_cleaned_de,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="German"
)

_, npm_italian, div_italian = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_italian,
    docs=docs_cleaned_italian,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="Italian"
)
_, npm_en, div_en = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_en,
    docs=docs_cleaned_english,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="English"
)
_, npm_es, div_es = evaluate_bertopic_pmi(
    topic_model=topic_model_baseline_spanish,
    docs=docs_cleaned_es,
    top_k_coherence=10,
    top_k_diversity=25,
    tag="Spanish"
)

results_df = pd.DataFrame({
    "Metric": ["Coherence (NPMI)", "Diversity"],
    "German": [npm_german, div_german],
    "Italian": [npm_italian, div_italian],
    "English": [npm_en, div_en],
    "Spanish": [npm_es, div_es]
})
display(results_df)

# Overall Summary:

The goal of this project was to replicate the success observed in our English news experiment and, to some extent, in the more challenging multilingual Amazon Reviews dataset, **using a simpler dataset to evaluate whether the model would perform better.** Additionally, we sought to compare how Italian and Spanish behave relative to the other languages. Previously, we experimented with the German Amazon multilingual review dataset, but the data proved too noisy due to its nature as a review corpus.

Although we were able to extract intent-related themes for Italian, Spanish, German, and English, the quantitative metrics indicate that the results were again considerably less successful. The coherence scores for German, Italian, and Spanish were 0.29, 0.18, and 0.19, respectively. We expected the English portion of the Amazon MASSIVE Intent dataset to perform better; however, it produced a coherence score of 0.19, which is also very low. In contrast, the BBC News dataset achieved a coherence score of 0.40, highlighting the performance gap. Although the coherence score for the German dataset increased slightly from 0.26 to 0.29, this improvement is minimal.



With the Amazon Reviews dataset, we initially hypothesized that the model underperformed due to the nature of review data: texts that are short, informal, and heavily saturated with sentiment (e.g., “bad,” “perfect”), which complicates topic extraction. However, the Amazon MASSIVE Intent dataset is simple and clean, without slang, typos, or strong sentiment. Despite this, the model still performed poorly.

These results suggest that the multilingual BERTopic model may still have notable limitations, particularly when applied to intent-based datasets across different languages. At the same time, the availability of suitable, high-quality datasets in multiple languages is limited, which further constrains our ability to comprehensively evaluate BERTopic’s multilingual performance.