# **Taiwanese and Indonesian Presidential Inauguration Speeches: What Does the Future Hold?**

**Inauguration Speeches: The Case of Taiwan and Indonesia**

Course Instructor: Dr. Maciej Światła


> I Putu Agastya Harta Pratama - 472876; Si Tang Lin - 476912  
Faculty of Economic Sciences  
University of Warsaw  
Poland  
2026

## **Introduction**

This project analyzes presidential inauguration speeches from Taiwan and Indonesia. Both countries are Asian democracies, but they differ in political systems, culture, and historical background. By comparing their inauguration speeches, we aim to understand whether they share similar political themes or express different visions for the future.

We use BERTopic, a topic modeling approach built on transformer embeddings, with a multilingual sentence-embedding model. This places texts in different languages in the same semantic space, which makes cross-country comparison of political discourse possible.

## **Specific Project Objective**

- To apply topic modeling separately to the Taiwanese and Indonesian inauguration speeches in order to identify the main themes in each country.
- To select the better-performing model and then combine both corpora to compare topics using cosine similarity, measuring how similar or different the themes are across countries.

To ensure model quality, we test two configurations, v2 and H384. For each model, we conduct parameter tuning and diagnostic evaluation to examine trade-offs between coherence, topic separation, and topic diversity. Based on these results, we select the more suitable model for cross-country analysis.

In this notebook, we present the full workflow, including text preprocessing, construction of the combined corpus, topic modeling with the v2 model (which yields the best results), parameter tuning, and diagnostic analysis. This workflow forms the basis for the final comparative study.
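
The cosine-similarity comparison named in the objectives can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the notebook's implementation: `cosine_similarity` and the toy vectors are hypothetical, and in practice the inputs would be topic embeddings taken from the fitted model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)

# Toy vectors standing in for one Taiwanese and one Indonesian topic embedding
topic_tw = [0.2, 0.7, 0.1]
topic_id = [0.1, 0.8, 0.3]
print(round(cosine_similarity(topic_tw, topic_id), 3))
```

Values close to 1 indicate topics pointing in nearly the same direction in the embedding space; values near 0 indicate unrelated topics.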

# **1. Import Libraries**

In [None]:
!pip install requests nltk spacy bertopic gensim

Collecting bertopic
  Downloading bertopic-0.17.4-py3-none-any.whl.metadata (24 kB)
Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading bertopic-0.17.4-py3-none-any.whl (154 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim, bertopic
Successfully installed bertopic-0.17.4 gensim-4.4.0


In [None]:
# for scraping
import requests
from bs4 import BeautifulSoup
import os
import re

# others
from collections import Counter
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import math
import statistics
import pickle

# topic modelling

from nltk.tokenize import RegexpTokenizer
from itertools import combinations
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
from transformers import AutoTokenizer, AutoModel
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

$\max \{ \mathrm{core}_k(a),\ \mathrm{core}_k(b),\ \tfrac{1}{\alpha}\, d(a,b) \}$
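
The expression above has the form of the mutual reachability distance that HDBSCAN clusters on, where $\mathrm{core}_k(x)$ is the distance from $x$ to its $k$-th nearest neighbour and $\alpha$ scales the raw distance. The sketch below evaluates it on toy points. The helper names are hypothetical, and the neighbour count here excludes the point itself, which is an assumption; library implementations may count differently.

```python
import math

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour (excluding itself)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, a, b, k=2, alpha=1.0):
    """max{core_k(a), core_k(b), d(a, b) / alpha} for two point indices."""
    return max(
        core_distance(points, a, k),
        core_distance(points, b, k),
        math.dist(points[a], points[b]) / alpha,
    )

# Three nearby points and one distant point
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(mutual_reachability(points, 0, 1))  # pair inside the dense group
print(mutual_reachability(points, 0, 3))  # pair involving the distant point
```

Points in sparse regions have large core distances, so they are pushed away from everything else, which is how isolated chunks end up in the outlier topic `-1`.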


# **2. Import Indonesian presidential inauguration speech data**

- Dataset: Presidential inauguration speech (2024)


Key Details

- Source: Ministry of State Secretariat of the Republic of Indonesia

- URL: https://www.setneg.go.id/baca/index/pidato_presiden_prabowo_subianto_pada_sidang_paripurna_mpr_ri_dalam_rangka_pelantikan_presiden_dan_wakil_presiden_ri_terpilih_periode_2024_2029

- Method of collection: Scraping

- Number of tokens:

- Feature types: Texts

- Subject area: Political Science

In [None]:
# data scraping
url = "https://www.setneg.go.id/baca/index/pidato_presiden_prabowo_subianto_pada_sidang_paripurna_mpr_ri_dalam_rangka_pelantikan_presiden_dan_wakil_presiden_ri_terpilih_periode_2024_2029"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, "html.parser")

# the speech body sits in <div class="reading_text">
speech_div = soup.find("div", class_="reading_text")
if speech_div is None:
    raise ValueError("missing div.reading_text")

with open("full_speech_indonesian.txt", "w", encoding="utf-8") as f:
    f.write(speech_div.get_text(separator="\n", strip=True))

In [None]:
# data import
with open("full_speech_indonesian.txt", "r", encoding="utf-8") as fp:
    speech_uncleaned = fp.read()

data_indo = [speech_uncleaned]  # one-element list holding the full speech
text_indo = data_indo[0]        # the speech as a single string

In [None]:
text_indo

# **2.1. Data cleaning: Indonesian presidential speech**

In [None]:
# Normalize newlines
speech_clean = speech_uncleaned.replace("\r\n", "\n").strip()

# Split into non-empty lines
lines = [line.strip() for line in speech_clean.split("\n") if line.strip()]

# === Cut ceremonial greetings ===
# The first 67 lines are treated as greetings; the index is specific to this scrape.
content_lines = lines[67:]

# Join the body into a single text
text_body = " ".join(content_lines)
text_body = re.sub(r"\s+", " ", text_body).strip()

# === Sentence segmentation (Indonesian) ===
raw_sentences = re.split(r'(?<=[.!?])\s+', text_body)

# Keep sentences with at least three words
sentences_indo = [
    s.strip() for s in raw_sentences
    if len(s.split()) >= 3
]

print("Number of Indonesian sentences:", len(sentences_indo))


Number of Indonesian sentences: 150


# **3. Import Taiwanese president inauguration speech data**

- Dataset: Presidential inauguration speech (2024)


Key Details

- Source: Office of the President, Republic of China (Taiwan)

- URL: https://www.president.gov.tw/Page/700

- Method of collection: Scraping

- Number of tokens: 5297 (each token is a single Traditional Chinese character)

- Feature types: Texts

- Subject area: Political Science

In [None]:

url = "https://www.president.gov.tw/NEWS/28428"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.encoding = "utf-8"

soup = BeautifulSoup(response.text, "html.parser")

speech_div = soup.find("div", class_="article1")

if speech_div is None:
    raise ValueError("missing div.article1")

text_zh = speech_div.get_text(separator="\n", strip=True)

with open("full_speech_chinese.txt", "w", encoding="utf-8") as f:
    f.write(text_zh)

unwanted = "中華民國第16任總統賴清德伉儷及副總統蕭美琴今（20）日上午參加在總統府府前廣場舉行的就職慶祝大會，總統並以「打造民主和平繁榮的新臺灣」為題，發表就職演說，演說全文為："

text_zh = text_zh.replace(unwanted, "").strip()


text_zh

'蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收看直播的好朋友，全體國人同胞，大家好：\n我年輕的時候，立志行醫救人。我從政的時候，立志改變臺灣。現在，站在這裡，我立志壯大國家！\n我以無比堅定的心情，接受人民的託付，就職中華民國第十六任總統，我將依據中華民國憲政體制，肩負起帶領國家勇往前進的重責大任。\n回想1949年的今天，臺灣實施戒嚴，全面進入專制的黑暗年代。\n1996年的今天，臺灣第一位民選總統宣誓就職，向國際社會傳達，中華民國臺灣是一個主權獨立的國家、主權在民。\n2024年的今天，臺灣在完成三次政黨輪替之後，第一次同一政黨連續執政，正式展開第三任期！臺灣也揚帆進入一個充滿挑戰，又孕育無限希望的新時代。\n這段歷程，是這塊土地上的人們，前仆後繼、犧牲奉獻所帶來的結果。雖然艱辛，但我們做到了！\n此時此刻，我們不只見證新政府的開始，也是再一次迎接得來不易的民主勝利！\n許多人將我和蕭美琴副總統的當選，解讀為「打破八年政黨輪替魔咒」。事實上，民主就是人民作主，每一次的選舉，虛幻的魔咒並不存在，只有人民對執政黨最嚴格的檢驗、對國家未來最真實的選擇。\n我要感謝，過去八年來，蔡英文前總統、陳建仁前副總統和行政團隊的努力，為臺灣的發展，打下堅實的基礎。也請大家一起給他們一個最熱烈的掌聲！\n我也要感謝國人同胞大家的支持，不受外來勢力的影響，堅定守護民主，向前走；不回頭，為臺灣翻開歷史的新頁！\n在未來任期的每一天，我將「行公義，好憐憫，存謙卑的心」，「視民如親」，不愧於每一分信賴與託付。新政府也將兢兢業業，拿出最好的表現，來接受全民的檢驗。我們的施政更要不斷革新，開創臺灣政治的新風貌。\n一、行政立法協調合作，共同推動國政\n今年二月上任的新國會，是臺灣時隔十六年後，再次出現「三黨不過半」的立法院。面對這個政治新局，有些人抱持期待，也有些人感到憂心。\n我要告訴大家，這是全民選擇的新模式，當我們以新思維看待「三黨不過半」，這代表著，朝野政黨都能分享各自的理念，也將共同承擔國家的種種挑戰。\n然而，全民對於政黨的理性問政，也有很大的期待。政黨在競爭之外，也應該有合作的信念，國家才能踏出穩健的步伐。\n立法院的議事運作，應該遵守程序正義，多數尊重少數，少數服從多數，才能避免衝突，維持社會的安定和諧。\n在民主社會，人民的利益

# **3.1. Data cleaning: Taiwanese presidential speech**


- Removed URLs and HTML tags from the text.
- Normalized whitespace by collapsing repeated spaces, tabs, and newlines and trimming leading and trailing whitespace.


In [None]:
def clean_text_zh(text):
    text = str(text)

    text = re.sub(r"http\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n+", "\n", text)

    return text.strip()


In [None]:
def split_sentences_zh(text):
    raw_sentences = re.split(r"[。！？]", text)
    sentences = [
        s.strip() for s in raw_sentences
        if len(s.strip()) >= 6
    ]
    return sentences
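
The splitter above discards the sentence-final marks, which is why the Chinese chunks printed later contain no 。！？. If the punctuation should be kept, one option is a zero-width lookbehind split. This is a sketch under assumptions: `split_sentences_keep_punct` is a hypothetical variant, not the function used in this notebook, and because the kept mark counts as a character, the length filter is slightly more permissive than the original.

```python
import re

def split_sentences_keep_punct(text, min_chars=6):
    """Split after 。！？ while keeping the mark attached to its sentence."""
    # Zero-width lookbehind split; needs Python 3.7 or newer.
    parts = re.split(r"(?<=[。！？])", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]

sample = "我年輕的時候，立志行醫救人。我從政的時候，立志改變臺灣。好！"
for sentence in split_sentences_keep_punct(sample):
    print(sentence)
```

The very short final fragment in the sample is dropped by the length filter, mirroring the behaviour of `split_sentences_zh`.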

# **4. Split into Chunks**

### What was done

- A document is defined as a contiguous chunk of 2–4 sentences.
- Chunks are segmented along sentence boundaries.
- The chunks from both corpora are combined into one document list that is fed to the model.

### Why this step is needed


Before combining the Taiwanese and Indonesian speech corpora, sentence chunking is a necessary preprocessing step. The two languages differ greatly in sentence structure and length. If we directly use full speeches or single sentences as inputs, the model may be influenced more by text length and structure than by actual semantic content.

By splitting the texts into fixed-size chunks (e.g., three sentences per chunk), we create more comparable semantic units across both languages. This helps reduce bias caused by differences in speech length and ensures that the combined corpus is more balanced. As a result, later topic modeling and cosine similarity analysis are based on content rather than formatting differences.



In [None]:
# Chunking helper, used for both the Indonesian and the Chinese text
def make_chunks(sentences, chunk_size=3):
    chunks = []
    i = 0
    while i < len(sentences):
        chunk_sents = sentences[i:i+chunk_size]

        # If only one sentence remains at the end, merge it into the previous chunk.
        if len(chunk_sents) < 2 and chunks:
            chunks[-1] += " " + " ".join(chunk_sents)
        else:
            chunks.append(" ".join(chunk_sents))

        i += chunk_size

    return chunks


In [None]:
# Indonesian sentences are longer than the Chinese ones, so we use 2 sentences per chunk
chunks_indo = make_chunks(sentences_indo, chunk_size=2)

# Indonesian lowercasing
chunks_indo = [c.lower() for c in chunks_indo]

print("Number of chunks:", len(chunks_indo))
print("First three chunks:")
for c in chunks_indo[:3]:
    print("-", c)


Number of chunks: 75
First three chunks:
- saudara-saudara sekalian, beberapa saat yang lalu di hadapan majelis yang terhormat ini, di hadapan seluruh rakyat indonesia, dan yang terpenting dihadapan tuhan yang mahakuasa allah swt., saya prabowo subianto dan saudara gibran rakabuming raka, telah mengucapkan sumpah untuk mempertahankan undang-undang dasar kita, untuk menjalankan semua undang-undang dan peraturan yang berlaku, untuk berbakti pada negara dan bangsa. sumpah tersebut akan kami jalankan dengan sebaik-baiknya, dengan penuh rasa tanggung jawab dan dengan semua kekuatan yang ada pada jiwa dan raga kami.
- kami akan menjalankan kepemimpinan pemerintah republik indonesia, kepemimpinan negara dan bangsa indonesia dengan tulus, dengan mengutamakan kepentingan seluruh rakyat indonesia, termasuk mereka-mereka yang tidak memilih kami. kami akan mengutamakan kepentingan bangsa indonesia, kepentingan rakyat indonesia di atas segala kepentingan, di atas segala golongan, apalagi kepentinga


In [None]:

cleaned_zh = clean_text_zh(text_zh)


sentences_zh = split_sentences_zh(cleaned_zh)
chunks_zh = make_chunks(sentences_zh, chunk_size=3)

print("Number of chunks:", len(chunks_zh))
print("First three chunks:")
for c in chunks_zh[:3]:
    print("-", c)



Number of chunks: 40
First three chunks:
- 蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收看直播的好朋友，全體國人同胞，大家好：
我年輕的時候，立志行醫救人 我從政的時候，立志改變臺灣 現在，站在這裡，我立志壯大國家
- 我以無比堅定的心情，接受人民的託付，就職中華民國第十六任總統，我將依據中華民國憲政體制，肩負起帶領國家勇往前進的重責大任 回想1949年的今天，臺灣實施戒嚴，全面進入專制的黑暗年代 1996年的今天，臺灣第一位民選總統宣誓就職，向國際社會傳達，中華民國臺灣是一個主權獨立的國家、主權在民
- 2024年的今天，臺灣在完成三次政黨輪替之後，第一次同一政黨連續執政，正式展開第三任期 臺灣也揚帆進入一個充滿挑戰，又孕育無限希望的新時代 這段歷程，是這塊土地上的人們，前仆後繼、犧牲奉獻所帶來的結果


In [None]:
print("ZH chunks:", len(chunks_zh))
print("ID chunks:", len(chunks_indo))

print("Avg length ZH:",
      sum(len(c) for c in chunks_zh)/len(chunks_zh))
print("Avg length ID:",
      sum(len(c) for c in chunks_indo)/len(chunks_indo))


ZH chunks: 40
ID chunks: 75
Avg length ZH: 132.3
Avg length ID: 273.0133333333333



### Why the results are valid (language-specific perspective)

The Chinese (ZH) corpus yields 40 chunks averaging about 132 characters, while the Indonesian (ID) corpus yields 75 chunks averaging about 273 characters. Part of this gap comes from the measure itself: `len()` counts characters, and a Chinese character usually carries more content than a single Latin letter, so the two averages are not directly comparable. The pattern is therefore expected and reflects differences between the languages and writing systems rather than preprocessing errors.

The two speeches also differ in style. The Chinese speech tends to be information-dense, often expressing several ideas within a single sentence. The Indonesian speech, in contrast, frequently uses longer sentences and repetition to emphasize key messages. Even with the same chunking rule, these characteristics lead to different chunk lengths.

Because this project uses multilingual sentence embeddings, which encode semantic meaning rather than surface-level word counts, these length differences do not by themselves invalidate the analysis. Instead, they preserve language-specific expression while still allowing meaningful cross-lingual comparison.



### Methodological note

This preprocessing step balances two goals: keeping the chunks comparable across languages and preserving each language's natural rhetorical structure. On that basis, we treat the combined-corpus topic modeling as a reasonable and interpretable starting point for the comparison.
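
Averages alone can hide a skewed length distribution, so it may help to look at a few more statistics per language. The helper below is a hypothetical sketch and not part of the original notebook. It reports character-based lengths, which remain only loosely comparable between Chinese and Indonesian.

```python
import statistics

def length_summary(chunks):
    """Return character-length statistics for a list of text chunks."""
    lengths = [len(c) for c in chunks]
    return {
        "n": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }

# Toy chunks; in the notebook these would be chunks_zh and chunks_indo
toy_zh = ["我立志壯大國家", "民主就是人民作主"]
toy_id = ["kami akan mengutamakan kepentingan bangsa",
          "sumpah tersebut akan kami jalankan"]
print(length_summary(toy_zh))
print(length_summary(toy_id))
```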

# **5. Model Training - MiniLM-L12-v2**


In [None]:
!pip install bertopic sentence-transformers hdbscan umap-learn




In [None]:
docs = chunks_zh + chunks_indo

meta = (
    [{"country": "TW", "lang": "zh"} for _ in chunks_zh] +
    [{"country": "ID", "lang": "id"} for _ in chunks_indo]
)


In [None]:
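from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
import hdbscan

# Combined corpus: Taiwanese chunks first, then Indonesian chunks
docs = chunks_zh + chunks_indo

# Small hand-picked stop-word lists for each language
stop_id = {"yang","dan","di","ke","dari","untuk","pada","dengan","ini","itu","kita","saudara","harus"}
stop_zh = {"我們","大家","各位","以及","因此","這些","這個","進行","推動","持續","將","也","更"}

# Note: the default token pattern splits on whitespace and punctuation, so
# unsegmented Chinese text stays as whole clauses and stop_zh only removes
# tokens that match one of these words exactly.
vectorizer_model = CountVectorizer(stop_words=list(stop_id | stop_zh), ngram_range=(1,2))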
embedding_model = SentenceTransformer(
    "paraphrase-multilingual-MiniLM-L12-v2"
)

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.1,
    metric="cosine",
    random_state=42
)

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=3,
    min_samples=3,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True
)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    language="multilingual",
    calculate_probabilities=True
)

topics, probabilities = topic_model.fit_transform(docs)

topic_model.get_topic_info()



| | Topic | Count | Name | Representation | Representative_Docs |
| --- | --- | --- | --- | --- | --- |
| 0 | -1 | 1 | -1_rakyat_karena_dunia karena_karena mendukung | [rakyat, karena, dunia karena, karena mendukun... | [saudara-saudara, karena itu kita punya prinsi... |
| 1 | 0 | 80 | 0_tidak_bangsa_sekalian_pemimpin | [tidak, bangsa, sekalian, pemimpin, rakyat, be... | [sekarang saya mengajak saudara-saudara teruta... |
| 2 | 1 | 34 | 1_indonesia_kepada_sekalian_presiden | [indonesia, kepada, sekalian, presiden, bangsa... | [saudara-saudara sekalian, mereka semua dengan... |
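
One factor that may contribute to the Indonesian-only topic labels above: scikit-learn's default token pattern keeps an unsegmented Chinese clause as a single token, so individual Chinese words rarely repeat across chunks. The sketch below shows one dependency-free workaround that splits Chinese runs into character bigrams. `mixed_tokenizer` is a hypothetical helper and not part of the original notebook; a dedicated Chinese word segmenter would likely be a better choice in practice.

```python
import re

# Either a run of CJK unified ideographs or a run of other word characters
_TOKEN = re.compile(r"[\u4e00-\u9fff]+|[^\W\u4e00-\u9fff]+")

def mixed_tokenizer(text):
    """Word tokens for Latin script, character bigrams for Chinese runs.

    Hypothetical helper for illustration only.
    """
    tokens = []
    for match in _TOKEN.finditer(text.lower()):
        piece = match.group()
        if "\u4e00" <= piece[0] <= "\u9fff":
            if len(piece) == 1:
                tokens.append(piece)
            else:
                # overlapping two-character windows over the Chinese run
                tokens.extend(piece[i:i + 2] for i in range(len(piece) - 1))
        else:
            tokens.append(piece)
    return tokens

print(mixed_tokenizer("民主臺灣 dan demokrasi"))
```

A function like this could be passed to `CountVectorizer` through its `tokenizer` argument, in which case the stop-word lists would need to be expressed in the same units.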


### **Short Conclusion**


With the initial settings, the combined model returns only two topics (80 and 34 chunks) plus a single outlier chunk, and both topics are labelled by Indonesian terms. When the Taiwanese and Indonesian inauguration speeches are analyzed together, the model therefore captures broad, dominant themes rather than country-specific details. The largest topics are driven mainly by frequently used rhetorical expressions, such as addressing the people, national unity, and leadership, which are common in presidential speeches.

This result suggests that combining the corpora highlights shared political language and general style across countries, while finer differences may be weakened or absorbed into larger themes.
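
Whether a topic is genuinely shared or dominated by one speech can be checked by counting chunks per topic and country. The sketch below assumes that the topic assignments and the metadata list are aligned one-to-one, as `topics` and `meta` are in this notebook; `topic_by_country` itself is a hypothetical helper.

```python
from collections import Counter

def topic_by_country(topics, meta):
    """Count how many chunks of each country fall into each topic."""
    counts = Counter()
    for topic, info in zip(topics, meta):
        counts[(topic, info["country"])] += 1
    return counts

# Toy assignments; in the notebook these would be `topics` and `meta`
toy_topics = [0, 0, 1, 1, -1]
toy_meta = [{"country": "TW"}, {"country": "ID"}, {"country": "ID"},
            {"country": "ID"}, {"country": "TW"}]
for (topic, country), n in sorted(topic_by_country(toy_topics, toy_meta).items()):
    print(topic, country, n)
```

A topic whose chunks come almost entirely from one country reflects language or speaker rather than a shared theme.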

# **6. Parameter Tuning**

In [None]:
import numpy as np
from itertools import combinations

# Prepare documents as sets of tokens

analyzer = vectorizer_model.build_analyzer()

docs_as_sets = [set(analyzer(doc)) for doc in docs]
N = len(docs_as_sets)  # total number of documents

# count the number of documents containing a given word
def doc_freq(word):
    return sum(1 for doc in docs_as_sets if word in doc)

# count the number of documents containing both w1 and w2
def co_doc_freq(w1, w2):
    return sum(1 for doc in docs_as_sets if w1 in doc and w2 in doc)

# UMass coherence: log((D(w1, w2) + eps) / D(w2)), averaged over all word pairs
def umass_coherence(words, eps=1e-12):
    scores_sum = 0.0
    n_pairs = 0

    for w1, w2 in combinations(words, 2):
        D_w1w2 = co_doc_freq(w1, w2)
        D_w2 = doc_freq(w2)

        if D_w2 == 0:
            continue

        scores_sum += np.log((D_w1w2 + eps) / D_w2)
        n_pairs += 1

    return (
        scores_sum * 2 / (len(words) * (len(words) - 1))
        if n_pairs > 0 else 0.0
    )
# UCI coherence: pointwise mutual information, averaged over all word pairs
def uci_coherence(words, eps=1e-12):
    scores_sum = 0.0
    n_pairs = 0

    for w1, w2 in combinations(words, 2):
        p_w1 = doc_freq(w1) / N
        p_w2 = doc_freq(w2) / N
        p_w1w2 = co_doc_freq(w1, w2) / N

        if p_w1 == 0 or p_w2 == 0:
            continue

        scores_sum += np.log((p_w1w2 + eps) / (p_w1 * p_w2))
        n_pairs += 1

    return (
        scores_sum * 2 / (len(words) * (len(words) - 1))
        if n_pairs > 0 else 0.0
    )
# NPMI coherence: PMI normalised by -log p(w1, w2), averaged over all word pairs
def uci_npmi_coherence(words, eps=1e-12):
    scores_sum = 0.0
    n_pairs = 0

    for w1, w2 in combinations(words, 2):
        p_w1 = doc_freq(w1) / N
        p_w2 = doc_freq(w2) / N
        p_w1w2 = co_doc_freq(w1, w2) / N

        if p_w1w2 == 0:
            continue

        pmi = np.log((p_w1w2 + eps) / (p_w1 * p_w2))
        npmi = pmi / (-np.log(p_w1w2 + eps))

        scores_sum += npmi
        n_pairs += 1

    return (
        scores_sum * 2 / (len(words) * (len(words) - 1))
        if n_pairs > 0 else 0.0
    )
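
Written out, the three scores computed above for a topic with top words $w_1, \dots, w_n$ are as follows. Here $D(\cdot)$ counts the chunks containing the given word or word pair, $N$ is the number of chunks, $p(\cdot) = D(\cdot)/N$, and $\epsilon = 10^{-12}$. This is a transcription of the code as written, including its $\epsilon$ smoothing; pairs that the code skips contribute zero to the sums.

$$
\begin{aligned}
C_{\mathrm{UMass}} &= \frac{2}{n(n-1)} \sum_{i<j} \log \frac{D(w_i, w_j) + \epsilon}{D(w_j)} \\
C_{\mathrm{UCI}} &= \frac{2}{n(n-1)} \sum_{i<j} \log \frac{p(w_i, w_j) + \epsilon}{p(w_i)\, p(w_j)} \\
C_{\mathrm{NPMI}} &= \frac{2}{n(n-1)} \sum_{i<j} \frac{\log \dfrac{p(w_i, w_j) + \epsilon}{p(w_i)\, p(w_j)}}{-\log\bigl(p(w_i, w_j) + \epsilon\bigr)}
\end{aligned}
$$

Because $\epsilon$ is very small, a word pair that never co-occurs contributes roughly $\log 10^{-12} \approx -27.6$ to the UMass and UCI sums, which explains the strongly negative values that appear in the tuning results.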



In [None]:
import time
import pickle
import numpy as np
from itertools import combinations

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from umap import UMAP
import hdbscan
from nltk.tokenize import RegexpTokenizer


import os

os.makedirs("outputs", exist_ok=True)  # folder for the pickled tuning results

start = time.time()
results = []

# --------------------------------------------------
# Components initialised (following the course template)
# --------------------------------------------------
tokenizer = RegexpTokenizer(r'\w+')
ctfidf_model = ClassTfidfTransformer()
representation_model = KeyBERTInspired()

# --------------------------------------------------
# Documents (combined Taiwanese and Indonesian chunks)
# --------------------------------------------------
docs = chunks_zh + chunks_indo

# prepare documents as sets of words (for coherence);
# this replaces the docs_as_sets used by the coherence functions above
docs_as_sets = [set(tokenizer.tokenize(doc.lower())) for doc in docs]
N = len(docs_as_sets)


# --------------------------------------------------
# Embedding models (adapted to multilingual setting)
# --------------------------------------------------
for model in [
    "paraphrase-multilingual-MiniLM-L12-v2"
]:
    embedding_model = SentenceTransformer(model)

    for ngram_range in [1]:
        for max_df in [1.0]:
            for min_df in [1, 2]:
                for vectoriser in ['tf', 'tfidf']:

                    # --------------------------------------
                    # Vectorizer
                    # --------------------------------------
                    if vectoriser == 'tf':
                        vectorizer = CountVectorizer(
                            encoding='utf-8',
                            decode_error='strict',
                            strip_accents=None,
                            lowercase=True,
                            ngram_range=(1, ngram_range),
                            max_df=max_df,
                            min_df=min_df,
                            max_features=None,
                            tokenizer=tokenizer.tokenize
                        )

                    if vectoriser == 'tfidf':
                        vectorizer = TfidfVectorizer(
                            encoding='utf-8',
                            decode_error='strict',
                            strip_accents=None,
                            lowercase=True,
                            ngram_range=(1, ngram_range),
                            max_df=max_df,
                            min_df=min_df,
                            max_features=None,
                            tokenizer=tokenizer.tokenize
                        )

                    fit = vectorizer.fit_transform(docs)

                    # --------------------------------------
                    # UMAP
                    # --------------------------------------
                    for n_neighbors in [15]:
                        for n_components in [5]:
                            for metric_umap in ['cosine']:
                                for min_dist in [0.0, 0.05]:

                                    umap_model = UMAP(
                                        n_neighbors=n_neighbors,
                                        n_components=n_components,
                                        metric=metric_umap,
                                        min_dist=min_dist,
                                        random_state=42
                                    )

                                    # --------------------------------------
                                    # HDBSCAN
                                    # --------------------------------------
                                    for min_cluster_size in [2, 5]:
                                        for metric_hdbscan in ['euclidean']:
                                            for cluster_selection_method in ['eom']:

                                                hdbscan_model = hdbscan.HDBSCAN(
                                                    min_cluster_size=min_cluster_size,
                                                    metric=metric_hdbscan,
                                                    cluster_selection_method=cluster_selection_method,
                                                    prediction_data=True
                                                )

                                                # --------------------------------------
                                                # Optional topic merging
                                                # --------------------------------------
                                                for nr_topics in [True, False]:

                                                    if nr_topics:
                                                        topic_model = BERTopic(
                                                            embedding_model=embedding_model,
                                                            umap_model=umap_model,
                                                            hdbscan_model=hdbscan_model,
                                                            vectorizer_model=vectorizer,
                                                            ctfidf_model=ctfidf_model,
                                                            representation_model=representation_model,
                                                            language="multilingual",
                                                            top_n_words=10,
                                                            calculate_probabilities=True,
                                                            nr_topics="auto"
                                                        )
                                                    else:
                                                        topic_model = BERTopic(
                                                            embedding_model=embedding_model,
                                                            umap_model=umap_model,
                                                            hdbscan_model=hdbscan_model,
                                                            vectorizer_model=vectorizer,
                                                            ctfidf_model=ctfidf_model,
                                                            representation_model=representation_model,
                                                            language="multilingual",
                                                            top_n_words=10,
                                                            calculate_probabilities=True
                                                        )

                                                    try:
                                                        # --------------------------------------
                                                        # Fit model
                                                        # --------------------------------------
                                                        topics, probabilities = topic_model.fit_transform(docs)

                                                        topic_info = topic_model.get_topic_info()
                                                        topics_list = topic_info.Topic[
                                                            topic_info.Topic != -1
                                                        ]

                                                        n_topics = len(topics_list)

                                                        topic_words = {}
                                                        for t in topics_list:
                                                            words = [w for w, _ in topic_model.get_topic(t)]
                                                            topic_words[t] = words

                                                        # --------------------------------------
                                                        # Coherence computation
                                                        # --------------------------------------
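                                                        # Note: umass, uci and npmi are overwritten on every pass, so the
                                                        # row appended below holds the scores of the last topic only.
                                                        for t, words in topic_words.items():
                                                            umass = umass_coherence(words)
                                                            uci = uci_coherence(words)
                                                            npmi = uci_npmi_coherence(words)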

                                                        results.append([
                                                            model, ngram_range, max_df, min_df, vectoriser,
                                                            n_neighbors, n_components, metric_umap, min_dist,
                                                            min_cluster_size, metric_hdbscan, cluster_selection_method,
                                                            nr_topics, umass, uci, npmi, n_topics
                                                        ])

                                                        with open(
                                                            "outputs/bertopic_multilingual_tuning_results.pkl",
                                                            "wb"
                                                        ) as f:
                                                            pickle.dump(results, f)

                                                        print(
                                                            model, ngram_range, max_df, min_df, vectoriser,
                                                            n_neighbors, n_components, metric_umap, min_dist,
                                                            min_cluster_size, metric_hdbscan, cluster_selection_method,
                                                            nr_topics, umass, uci, npmi, n_topics
                                                        )

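                                                    except Exception as exc:
                                                        # report the failed configuration instead of skipping it silently
                                                        print("Skipped configuration:", exc)
                                                        continue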


end = time.time()
elapsed_min = (end - start) / 60
print(f"Elapsed time: {elapsed_min:.2f} minutes")


Elapsed time: 10.15 minutes
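
The tuning loop stores one set of coherence scores per configuration, taken from the final pass of its inner topic loop. If a per-model average over all topics is preferred, one option is a small helper like the one below. This is a sketch under assumptions: `mean_topic_coherence` is hypothetical, and `score_fn` would be one of `umass_coherence`, `uci_coherence`, or `uci_npmi_coherence`.

```python
from statistics import mean

def mean_topic_coherence(topic_words, score_fn):
    """Average score_fn(words) over all topics; 0.0 if there are no topics."""
    scores = [score_fn(words) for words in topic_words.values()]
    return mean(scores) if scores else 0.0

# Toy example with a stand-in scoring function (the length of the word list)
toy_topic_words = {0: ["bangsa", "negara"], 1: ["民主", "未來", "臺灣", "政府"]}
print(mean_topic_coherence(toy_topic_words, len))
```

Averaging makes configurations with different numbers of topics easier to compare, at the cost of hiding a single very weak topic.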


In [None]:
# Topic overview of the last model fitted in the tuning loop
topic_model.get_topic_info()


| | Topic | Count | Name | Representation | Representative_Docs |
| --- | --- | --- | --- | --- | --- |
| 0 | -1 | 11 | -1_民主臺灣_demokrasi_bangsa_presiden | [民主臺灣, demokrasi, bangsa, presiden, 未來, pimpin... | [tapi begitu beliau menang, beliau mengajak sa... |
| 1 | 0 | 46 | 0_politik_demokrasi_bangsa_negara | [politik, demokrasi, bangsa, negara, menjadi, ... | [saudara-saudara sekalian, kita harus berani m... |
| 2 | 1 | 33 | 1_bangsa_pemerintah_negara_kemerdekaan | [bangsa, pemerintah, negara, kemerdekaan, indo... | [saudara-saudara sekalian, mereka semua dengan... |
| 3 | 2 | 25 | 2_民主臺灣_未來_三黨不過半_ | [民主臺灣, 未來, 三黨不過半, , , , , , , ] | [未來，政府會跟產業界密切合作，把握三大方向，推動臺灣的發展 第一個方向是，「前瞻未來，智慧... |


In [None]:
columns = [
    "model",
    "ngram_range",
    "max_df",
    "min_df",
    "vectoriser",
    "n_neighbors",
    "n_components",
    "metric_umap",
    "min_dist",
    "min_cluster_size",
    "metric_hdbscan",
    "cluster_selection_method",
    "nr_topics",
    "umass",
    "uci",
    "npmi",
    "n_topics"
]

results_df  = pd.DataFrame(results, columns=columns)

results_df

| | model | ngram_range | max_df | min_df | vectoriser | n_neighbors | n_components | metric_umap | min_dist | min_cluster_size | metric_hdbscan | cluster_selection_method | nr_topics | umass | uci | npmi | n_topics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tf | 15 | 5 | cosine | 0.0 | 2 | euclidean | eom | True | -5.766274 | -2.18487 | 0.322243 | 12 |
| 1 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tf | 15 | 5 | cosine | 0.0 | 2 | euclidean | eom | False | -6.733007 | -3.062819 | 0.304116 | 12 |
| 2 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tf | 15 | 5 | cosine | 0.0 | 5 | euclidean | eom | True | -26.541605 | -17.401256 | 0.041198 | 3 |
| 3 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tf | 15 | 5 | cosine | 0.0 | 5 | euclidean | eom | False | -25.236543 | -16.245467 | 0.088889 | 3 |
| 4 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tf | 15 | 5 | cosine | 0.05 | 2 | euclidean | eom | True | -18.980562 | -11.469959 | 0.263125 | 5 |
| 5 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tf | 15 | 5 | cosine | 0.05 | 2 | euclidean | eom | False | -3.603823 | -0.625707 | 0.357433 | 10 |
| 6 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tf | 15 | 5 | cosine | 0.05 | 5 | euclidean | eom | True | -26.541605 | -17.401256 | 0.041198 | 3 |
| 7 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tf | 15 | 5 | cosine | 0.05 | 5 | euclidean | eom | False | -25.236543 | -16.245467 | 0.088889 | 3 |
| 8 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tfidf | 15 | 5 | cosine | 0.0 | 2 | euclidean | eom | True | -5.766274 | -2.18487 | 0.322243 | 12 |
| 9 | paraphrase-multilingual-MiniLM-L12-v2 | 1 | 1.0 | 1 | tfidf | 15 | 5 | cosine | 0.0 | 2 | euclidean | eom | False | -6.733007 | -3.062819 | 0.304116 | 12 |


In [None]:
# coherence metrics used for ranking (higher is better for all three)
metrics = ['umass','uci','npmi']

# normalize each metric using min-max normalization
normalized = results_df[metrics].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

# create a new column as the sum of the normalized metrics
results_df['sum_normalised'] = normalized.sum(axis=1)

results_df.sort_values('sum_normalised', ascending=False).head(5)

Unnamed: 0,model,ngram_range,max_df,min_df,vectoriser,n_neighbors,n_components,metric_umap,min_dist,min_cluster_size,metric_hdbscan,cluster_selection_method,nr_topics,umass,uci,npmi,n_topics,sum_normalised
5,paraphrase-multilingual-MiniLM-L12-v2,1,1.0,1,tf,15,5,cosine,0.05,2,euclidean,eom,False,-3.603823,-0.625707,0.357433,10,2.916703
13,paraphrase-multilingual-MiniLM-L12-v2,1,1.0,1,tfidf,15,5,cosine,0.05,2,euclidean,eom,False,-3.603823,-0.625707,0.357433,10,2.916703
25,paraphrase-multilingual-MiniLM-L12-v2,1,1.0,2,tfidf,15,5,cosine,0.0,2,euclidean,eom,False,-3.300558,-0.382478,0.331345,12,2.870318
17,paraphrase-multilingual-MiniLM-L12-v2,1,1.0,2,tf,15,5,cosine,0.0,2,euclidean,eom,False,-3.300558,-0.382478,0.331345,12,2.870318
8,paraphrase-multilingual-MiniLM-L12-v2,1,1.0,1,tfidf,15,5,cosine,0.0,2,euclidean,eom,True,-5.766274,-2.18487,0.322243,12,2.638869


### **Parameter tuning result**


To select the best topic model configuration, we compare parameter settings using three coherence metrics: **UMass**, **UCI**, and **NPMI**. Because these metrics are on different scales, we apply **min–max normalization** to each metric and then compute a combined score by summing the normalized values, so that each metric contributes equally to the ranking. For all three metrics, higher values indicate better coherence, so the configuration with the highest combined score is ranked first.
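The scoring rule described above can be sketched on a small toy table. This is a minimal illustration, not the notebook's tuning results: the numbers in `toy` are invented, and the helper name `combined_score` is hypothetical. The guard for constant columns is an added assumption that avoids a division by zero when every configuration yields the same value for a metric.

```python
import pandas as pd

def combined_score(df, metrics):
    """Sum of min-max normalised metrics; constant columns contribute 0."""
    col_min = df[metrics].min()
    span = df[metrics].max() - col_min
    # Replace a zero range with 1 so that constant columns normalise to 0
    normalised = (df[metrics] - col_min) / span.replace(0, 1)
    return normalised.sum(axis=1)

# Toy values for illustration only (higher is better for every metric)
toy = pd.DataFrame({
    "umass": [-5.0, -25.0, -3.0],
    "uci": [-2.0, -17.0, -0.5],
    "npmi": [0.32, 0.04, 0.36],
})
toy["sum_normalised"] = combined_score(toy, ["umass", "uci", "npmi"])
best = toy["sum_normalised"].idxmax()
```

With three metrics the combined score lies between 0 and 3. A row that is best on every metric scores 3, and a row that is worst on every metric scores 0.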


Although the absolute coherence values remain low, this is expected for multilingual political speech data. The top-ranked configurations nevertheless outperform the alternatives under the same evaluation framework. Therefore, model selection is based on **relative improvement and stability**, rather than absolute coherence scores.
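One possible reason for low coherence in a bilingual corpus is that words from different languages rarely appear in the same chunk. The sketch below illustrates this with a simplified, document-level NPMI on invented toy chunks. The helpers `npmi_pair` and `topic_npmi` are hypothetical and are not the implementation used in this notebook; library implementations typically rely on sliding windows and smoothing instead.

```python
import math
from itertools import combinations

def npmi_pair(w1, w2, docs):
    """Document-level NPMI of two words; docs is a list of token sets."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # convention: the words never co-occur
    if p12 == 1:
        return 1.0   # both words occur in every document; avoids 0/0
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_npmi(words, docs):
    """Average NPMI over all word pairs of a topic."""
    pairs = list(combinations(words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)

# Toy chunks for illustration only: two Indonesian, two Chinese
toy_docs = [
    {"demokrasi", "bangsa", "negara"},
    {"demokrasi", "bangsa"},
    {"民主", "未來", "臺灣"},
    {"民主", "未來"},
]
mono = topic_npmi(["demokrasi", "bangsa"], toy_docs)   # same-language pair
mixed = topic_npmi(["demokrasi", "民主"], toy_docs)     # cross-language pair
```

In this toy setting the same-language pair reaches the maximum NPMI, while the cross-language pair receives the minimum because the two words never share a chunk. A topic that mixes both languages is therefore penalised even when its words are semantically related.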

# **8. Model training with the optimal parameters**

In [None]:
from bertopic import BERTopic
from umap import UMAP
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

embedding_model = SentenceTransformer(
    "paraphrase-multilingual-MiniLM-L12-v2"
)

vectorizer_model = CountVectorizer(
    ngram_range=(1, 1),
    min_df=1,
    max_df=1.0
)

umap_model = UMAP(
    n_neighbors=15,
    n_components=5,
    min_dist=0.05,
    metric="cosine",
    random_state=42
)

hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=2,
    min_samples=2,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True
)

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    language="multilingual",
    calculate_probabilities=True,
    nr_topics="auto"     # as in the "optimal specification"
)

topics, probabilities = topic_model.fit_transform(docs)

topic_model.get_topic_info()


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,14,-1_demokrasi_kita_harus_yang,"[demokrasi, kita, harus, yang, bagaimana, masi...","[dalam keadaan ketegangan, dalam keadaan kemun..."
1,0,50,0_yang_kita_dan_indonesia,"[yang, kita, dan, indonesia, saudara, di, deng...",[tapi mereka yang membayar saham kemerdekaan d...
2,1,28,1_kita_yang_saudara_dari,"[kita, yang, saudara, dari, bangsa, pemimpin, ...",[sekarang saya mengajak saudara-saudara teruta...
3,2,10,2_未來的臺灣_布局全球_行銷全世界_未來,"[未來的臺灣, 布局全球, 行銷全世界, 未來, ai浪潮席捲而來, 不只是我們國家的未來,...",[未來的臺灣，會有更多元發展的創新經濟，會有更普及的數位科技應用，會有更好的競爭力和雙語力，...
4,3,9,3_kita_yang_itu_harus,"[kita, yang, itu, harus, tanaman, energi, subs...",[karena itu kita harus swasembada energi dan k...
5,4,4,4_saya_beliau_lockdown_menang,"[saya, beliau, lockdown, menang, bertanding, t...","[bertanding semangat, sesudah bertanding mari ..."


In [None]:
df = pd.DataFrame({
    "topic": topics,
    "lang": [m["lang"] for m in meta]
})

pd.crosstab(df["topic"], df["lang"], normalize="index")


topic,id,zh
-1,0.5,0.5
0,0.6,0.4
1,0.928571,0.071429
2,0.0,1.0
3,1.0,0.0
4,0.75,0.25


In [None]:
result = (
    topic_info
    .merge(
        lang_dist.reset_index(),
        left_on="Topic",
        right_on="topic",
        how="left"
    )
    .drop(columns="topic")
)
result

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs,index,id,zh
0,-1,11,-1_民主臺灣_demokrasi_bangsa_presiden,"[民主臺灣, demokrasi, bangsa, presiden, 未來, pimpin...","[tapi begitu beliau menang, beliau mengajak sa...",0,0.5,0.5
1,0,46,0_politik_demokrasi_bangsa_negara,"[politik, demokrasi, bangsa, negara, menjadi, ...","[saudara-saudara sekalian, kita harus berani m...",1,0.6,0.4
2,1,33,1_bangsa_pemerintah_negara_kemerdekaan,"[bangsa, pemerintah, negara, kemerdekaan, indo...","[saudara-saudara sekalian, mereka semua dengan...",2,0.928571,0.071429
3,2,25,2_民主臺灣_未來_三黨不過半_,"[民主臺灣, 未來, 三黨不過半, , , , , , , ]",[未來，政府會跟產業界密切合作，把握三大方向，推動臺灣的發展 第一個方向是，「前瞻未來，智慧...,3,0.0,1.0


### **Cross-language Topic Distribution Result**

An interesting result from the combined-corpus analysis is that topics are not evenly shared between languages. Some topics appear to be **language-specific**, while others are **shared across Chinese (ZH) and Indonesian (ID)**.

From the topic–language distribution table, we can observe three patterns:
1. **Chinese-only topics**: Topic 2 consists entirely of Chinese chunks (ZH share of 1.0), indicating themes that are specific to Taiwanese political discourse (for example, Taiwan’s future, global positioning, or domestic policy context).
2. **Indonesian-only topics**: Topic 3 consists entirely of Indonesian chunks and Topic 1 is about 93% Indonesian, reflecting local rhetoric, leadership style, or country-specific political narratives.
3. **Shared topics**: Topic 0 (60% ID, 40% ZH) and the much smaller Topic 4 (75% ID, 25% ZH) contain a mix of Chinese and Indonesian chunks, suggesting common political themes such as democracy, addressing the people, national unity, or leadership responsibility.

This result shows that multilingual topic modeling can capture both **shared political language** and **country-specific discourse** at the same time. It also supports the validity of using a combined corpus: the model does not force all texts into the same topics, but instead allows natural overlap where themes are truly similar, while keeping distinct topics separate when the discourse differs.
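The three patterns above can be made explicit by labelling each topic from its language shares. The sketch below reuses the shares from the topic–language table (excluding the outlier topic -1). The helper `label_topics` and the 0.9 threshold are hypothetical choices for illustration, not part of the original analysis.

```python
import pandas as pd

def label_topics(lang_share, threshold=0.9):
    """Label a topic '<lang>-specific' if one language reaches the threshold, else 'shared'."""
    labels = {}
    for topic, row in lang_share.iterrows():
        top_lang = row.idxmax()
        if row[top_lang] >= threshold:
            labels[topic] = f"{top_lang}-specific"
        else:
            labels[topic] = "shared"
    return pd.Series(labels, name="label")

# Row-normalised language shares per topic, copied from the crosstab output
lang_share = pd.DataFrame(
    {
        "id": [0.6, 0.928571, 0.0, 1.0, 0.75],
        "zh": [0.4, 0.071429, 1.0, 0.0, 0.25],
    },
    index=pd.Index([0, 1, 2, 3, 4], name="topic"),
)
labels = label_topics(lang_share)
```

The labels depend on the chosen threshold, and shares for very small topics (Topic 4 has only four chunks) change a lot when a single chunk is reassigned, so they should be read with caution.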

## **Conclusion**

This study uses multilingual topic modeling to compare presidential inauguration speeches from Taiwan and Indonesia. The results show that while both countries share some common political themes, such as democracy, leadership, and national unity, they also maintain distinct, country-specific topics. Some topics appear only in Chinese or only in Indonesian, while others are clearly shared across languages.

These findings suggest that multilingual BERT-based topic modeling is effective for cross-country political analysis. It can identify shared discourse without removing local political and cultural differences. Overall, the results highlight both similarities and differences in how Taiwan and Indonesia present their visions for the future, showing that political language can be comparable across countries while still reflecting national context.