# **Taiwanese and Indonesian Presidential Inauguration Speech: What Does the Future Hold?**

**Inaugural Speeches: The Case of Taiwan and Indonesia**

Course Instructor: Dr. Maciej Światła


> I Putu Agastya Harta Pratama - 472876; Si Tang Lin - 476912  
Faculty of Economic Sciences  
University of Warsaw  
Poland  
2026


## **Introduction**

This project analyzes presidential inauguration speeches from Taiwan and Indonesia. Both countries are Asian democracies, but they differ in political systems, culture, and historical background. By comparing the underlying topics in each country's presidential inauguration speeches, we aim to understand whether they share similar political themes or express different visions for the future.

We use a topic modeling approach based on multilingual BERT, which allows texts in different languages to be analyzed within the same semantic space. This makes cross-country comparison of political discourse possible.

## **Specific Objective on Topic Similarity**


In this part of the project, we examine whether the topics extracted from the two corpora of presidential speeches (Indonesian and Taiwanese) show any similarities.

We first load the pre-trained BERTopic model for each language and extract the embedding of each topic. We then compute the cosine similarity between every pair of Indonesian and Taiwanese topic embeddings, forming a similarity matrix. To establish a one-to-one correspondence between topics, we apply the Hungarian algorithm, which selects the assignment that maximizes the total similarity across all matched pairs; an individual topic may therefore be paired with a counterpart that is not its single nearest neighbour. Finally, we use a data-driven threshold (the median plus one median absolute deviation of the matched similarities, clipped to the range 0.55 to 0.75) to separate 'shared' topics, which show strong semantic overlap, from weaker or 'forced' matches.
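As a minimal sketch of the matching step described above, the toy example below uses made-up 2-D vectors in place of real topic embeddings (the values and the helper name `l2_normalise` are assumptions for illustration, not part of the project's models). It contrasts nearest-neighbour matching, which can map several topics to the same counterpart, with the one-to-one assignment returned by `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical 2-D "topic embeddings" (three topics per corpus), illustration only.
id_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
tw_vecs = np.array([[1.0, 0.1], [0.7, 0.7], [0.1, 1.0]])


def l2_normalise(X):
    """Scale every row to unit length so that dot products equal cosine similarity."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)


# Cosine similarity matrix: rows are Indonesian topics, columns are Taiwanese topics.
S = l2_normalise(id_vecs) @ l2_normalise(tw_vecs).T

# Nearest-neighbour matching: each row independently picks its best column.
greedy = S.argmax(axis=1)

# One-to-one matching: maximise total similarity by minimising the negated matrix.
rows, cols = linear_sum_assignment(-S)

print("nearest neighbour:", greedy.tolist())
print("one-to-one:       ", cols.tolist())
```

In this toy case the first two Indonesian vectors both prefer the first Taiwanese vector under nearest-neighbour matching, while the one-to-one assignment gives the second of them its next-best partner so that no Taiwanese topic is used twice.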

## **Library Imports**

In [None]:
import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from bertopic import BERTopic
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Re-define the jieba tokenizer exactly as it was during training
# (it must exist before the saved Taiwanese model is loaded)
def jieba_tokenizer(text):
    return jieba.lcut(text)

# --- Load the two saved models ---
tw_model = BERTopic.load("topic_model_tw_v2")
id_model = BERTopic.load("topic_model_id_v3")

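def extract_topic_embeddings(model):
    """Return (topic_ids, L2-normalised embeddings), excluding the outlier topic -1."""
    info = model.get_topic_info()
    all_topics = info["Topic"].to_list()

    # Rows of topic_embeddings_ follow the order of the topic IDs in
    # get_topic_info(); when the outlier topic -1 is present it occupies the
    # first row, so map each topic ID to its row instead of indexing by ID.
    topic_to_row = {t: i for i, t in enumerate(all_topics)}
    kept = [t for t in all_topics if t != -1]

    embs = np.asarray(model.topic_embeddings_)[[topic_to_row[t] for t in kept]]
    # normalise for cosine stability
    embs = embs / (np.linalg.norm(embs, axis=1, keepdims=True) + 1e-12)
    return kept, embs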

# --- Topic embedding matrices ---
id_topics, id_embs = extract_topic_embeddings(id_model)
tw_topics, tw_embs = extract_topic_embeddings(tw_model)

# --- Similarity matrix (ID topics x TW topics) ---
S = cosine_similarity(id_embs, tw_embs)

# --- One-to-one global best matching (Hungarian algorithm) ---
ri, ci = linear_sum_assignment(-S)

rows = []
for i, j in zip(ri, ci):
    t_id, t_tw = id_topics[i], tw_topics[j]
    sim = float(S[i, j])
    rows.append({
        "topic_ID": t_id,
        "topic_TW": t_tw,
        "similarity": sim,
        "ID_top_words": ", ".join([w for w, _ in (id_model.get_topic(t_id) or [])][:10]),
        "TW_top_words": ", ".join([w for w, _ in (tw_model.get_topic(t_tw) or [])][:10]),
    })

match_1to1 = pd.DataFrame(rows).sort_values("similarity", ascending=False)

# --- Thresholding to separate real matches vs forced matches ---
s = match_1to1["similarity"].to_numpy()
med = float(np.median(s))
mad = float(np.median(np.abs(s - med)) + 1e-9)

# sane default band; adjust later if needed
threshold = float(np.clip(med + 1.0 * mad, 0.55, 0.75))

shared = match_1to1[match_1to1["similarity"] >= threshold].copy()
forced = match_1to1[match_1to1["similarity"] <  threshold].copy()

print("Indonesia topics:", len(id_topics), " | Taiwan topics:", len(tw_topics))
print("Threshold used:", threshold)
print("Shared pairs:", len(shared), " | Forced/weak pairs:", len(forced))

shared.head(15)
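To make the thresholding rule in the cell above easier to follow, here is a small worked sketch. The function name `mad_threshold` and both score lists are hypothetical and chosen only for illustration; the rule itself (median plus one median absolute deviation, clipped to the band 0.55 to 0.75) mirrors the code above.

```python
import numpy as np


def mad_threshold(scores, k=1.0, floor=0.55, ceiling=0.75):
    """Median + k * MAD of the scores, clipped to the band [floor, ceiling]."""
    s = np.asarray(scores, dtype=float)
    med = float(np.median(s))
    mad = float(np.median(np.abs(s - med)) + 1e-9)
    return float(np.clip(med + k * mad, floor, ceiling))


# Hypothetical low similarities: median + MAD is about 0.25, so the floor of 0.55 applies.
low_scores = [0.28, 0.25, 0.20, 0.16, 0.07]
low_thr = mad_threshold(low_scores)

# Hypothetical higher similarities: median 0.60 + MAD 0.12 falls inside the band.
mid_scores = [0.82, 0.74, 0.62, 0.58, 0.50, 0.40]
mid_thr = mad_threshold(mid_scores)

n_shared_low = int((np.asarray(low_scores) >= low_thr).sum())
n_shared_mid = int((np.asarray(mid_scores) >= mid_thr).sum())

print(f"low scores: threshold={low_thr:.2f}, shared pairs={n_shared_low}")
print(f"mid scores: threshold={mid_thr:.2f}, shared pairs={n_shared_mid}")
```

When all similarities are low, the clipping floor rather than the data decides the cut-off, so no pair is labelled as shared.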


In [None]:
shared_id = set(shared["topic_ID"])
shared_tw = set(shared["topic_TW"])

id_only = [t for t in id_topics if t not in shared_id]
tw_only = [t for t in tw_topics if t not in shared_tw]

id_only_table = pd.DataFrame({
    "topic_ID": id_only,
    "ID_top_words": [
        ", ".join([w for w, _ in (id_model.get_topic(t) or [])][:10]) for t in id_only
    ]
})

tw_only_table = pd.DataFrame({
    "topic_TW": tw_only,
    "TW_top_words": [
        ", ".join([w for w, _ in (tw_model.get_topic(t) or [])][:10]) for t in tw_only
    ]
})

id_only_table.head(15), tw_only_table.head(15)


In [None]:
match_1to1["similarity"].describe()

In [None]:
match_1to1

In [None]:
tw_model = BERTopic.load("topic_model_tw_v2")
id_model = BERTopic.load("topic_model_id_v3")

def get_topic_vectors(model):
    # Retrieve topic information from the BERTopic model
    info = model.get_topic_info()
    # Exclude the outlier topic (-1) from the topic information
    info = info[info.Topic != -1].copy()

    # Get the raw topic embeddings from the model
    E = model.topic_embeddings_
    if E is None:
        raise ValueError("model.topic_embeddings_ is None. Check model loading.")

    # Create a mapping from topic ID to its row index in the full embeddings array
    full_info = model.get_topic_info()
    topic_to_row = {t: i for i, t in enumerate(full_info.Topic.tolist())}

    # Get the list of topic IDs to be included
    topic_ids = info.Topic.tolist()
    # Map these topic IDs to their corresponding row indices in the embeddings array
    rows = [topic_to_row[t] for t in topic_ids]
    # Extract the relevant embeddings using the mapped row indices
    V = E[rows]

    # Normalize the topic vectors to ensure stable cosine similarity calculations
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    # Return the topic IDs, normalized vectors, and a dictionary of topic sizes
    return topic_ids, V, info.set_index("Topic")["Count"].to_dict()

# Extract topic IDs, vectors, and sizes for the Taiwanese model
tw_ids, tw_V, tw_sizes = get_topic_vectors(tw_model)
# Extract topic IDs, vectors, and sizes for the Indonesian model
id_ids, id_V, id_sizes = get_topic_vectors(id_model)

# Calculate the cosine similarity matrix between Indonesian and Taiwanese topic vectors
# The shape of S will be (number_of_id_topics, number_of_tw_topics)
S = cosine_similarity(id_V, tw_V)

# For each Indonesian topic, find the Taiwanese topic with the highest similarity
best_match = np.argmax(S, axis=1)
# Get the similarity score for each best match
best_sim = S[np.arange(S.shape[0]), best_match]

# Create a list of (Indonesian topic ID, Taiwanese topic ID, similarity) tuples
pairs = sorted(
    [(id_ids[i], tw_ids[best_match[i]], float(best_sim[i])) for i in range(len(id_ids))],
    key=lambda x: x[2],
    reverse=True
)

# Print the top 20 strongest topic overlaps with their keywords and sizes
for id_t, tw_t, sim in pairs[:20]:
    print(f"ID topic {id_t} (n={id_sizes[id_t]})  ↔  TW topic {tw_t} (n={tw_sizes[tw_t]})  sim={sim:.3f}")
    print("  ID keywords:", [w for w, _ in id_model.get_topic(id_t)[:10]])
    print("  TW keywords:", [w for w, _ in tw_model.get_topic(tw_t)[:10]])
    print()

ID topic 2 (n=10)  ↔  TW topic 6 (n=6)  sim=0.279
  ID keywords: ['kedaulatan', 'berdaulat', 'bangsa', 'negara', 'rakyatnya', 'kemerdekaan', 'menjadi', 'berkuasa', 'kekuasaan', 'rakyat']
  TW keywords: ['的 檢驗', '魔咒', '作主', '人民 作主', '檢驗', '最', '民主 國家', '每', '。   臺', '共同']

ID topic 0 (n=32)  ↔  TW topic 4 (n=6)  sim=0.279
  ID keywords: ['indonesia', 'bersyukur', 'presiden', 'bangsa', 'republik', 'bagi', 'negara', 'untuk', 'kepemimpinan', 'bapak']
  TW keywords: ['， 臺 灣是', '灣是', '臺 灣是', '今天', '南', '第一', '年 ，', '臺 南', '戰爭', '的 今天 ，']

ID topic 5 (n=6)  ↔  TW topic 6 (n=6)  sim=0.246
  ID keywords: ['pemimpin', 'kepemimpinan', 'pimpinan', 'pemerintahan', 'dipimpin', 'pejabat', 'perwakilan', 'diri', 'harus', 'contoh']
  TW keywords: ['的 檢驗', '魔咒', '作主', '人民 作主', '檢驗', '最', '民主 國家', '每', '。   臺', '共同']

ID topic 7 (n=4)  ↔  TW topic 6 (n=6)  sim=0.196
  ID keywords: ['pengorbanan', 'kemerdekaan', 'hadiah', 'memberi', 'karunia', 'mahakuasa', 'rakyat', 'diberikan', 'sangat', 'mereka']
  TW ke

In [None]:
def top_words(model, topic_id, n=10):
    return [w for w, _ in model.get_topic(topic_id)[:n]]

rows = []
for id_t, tw_t, sim in pairs:
    rows.append({
        "id_topic": id_t,
        "id_n": int(id_sizes.get(id_t, 0)),
        "tw_topic": tw_t,
        "tw_n": int(tw_sizes.get(tw_t, 0)),
        "sim": float(sim),
        "id_keywords": ", ".join(top_words(id_model, id_t, 10)),
        "tw_keywords": ", ".join(top_words(tw_model, tw_t, 10)),
    })

df_align = pd.DataFrame(rows).sort_values("sim", ascending=False).reset_index(drop=True)
df_align

| | id_topic | id_n | tw_topic | tw_n | sim | id_keywords | tw_keywords |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 10 | 6 | 6 | 0.279192 | kedaulatan, berdaulat, bangsa, negara, rakyatn... | 的 檢驗, 魔咒, 作主, 人民 作主, 檢驗, 最, 民主 國家, 每, 。 臺, 共同 |
| 1 | 0 | 32 | 4 | 6 | 0.278848 | indonesia, bersyukur, presiden, bangsa, republ... | ， 臺 灣是, 灣是, 臺 灣是, 今天, 南, 第一, 年 ，, 臺 南, 戰爭, 的 今天 ， |
| 2 | 5 | 6 | 6 | 6 | 0.246083 | pemimpin, kepemimpinan, pimpinan, pemerintahan... | 的 檢驗, 魔咒, 作主, 人民 作主, 檢驗, 最, 民主 國家, 每, 。 臺, 共同 |
| 3 | 7 | 4 | 6 | 6 | 0.195758 | pengorbanan, kemerdekaan, hadiah, memberi, kar... | 的 檢驗, 魔咒, 作主, 人民 作主, 檢驗, 最, 民主 國家, 每, 。 臺, 共同 |
| 4 | 6 | 5 | 6 | 6 | 0.182481 | berdemokrasi, demokrasi, bangsa, kehendak, men... | 的 檢驗, 魔咒, 作主, 人民 作主, 檢驗, 最, 民主 國家, 每, 。 臺, 共同 |
| 5 | 3 | 6 | 6 | 6 | 0.156256 | kemiskinan, penyimpangan, penyelewengan, korup... | 的 檢驗, 魔咒, 作主, 人民 作主, 檢驗, 最, 民主 國家, 每, 。 臺, 共同 |
| 6 | 8 | 4 | 8 | 5 | 0.155424 | kemakmuran, kekayaan, ekonomi, penting, komodi... | 世界 ，, 鏈 」 的, ai, 要 向, 的 中心, 們, 你 們, 鏈 」, 你, 中心 |
| 7 | 4 | 6 | 6 | 6 | 0.143695 | melihat, merasakan, tapi, realita, sadar, sung... | 的 檢驗, 魔咒, 作主, 人民 作主, 檢驗, 最, 民主 國家, 每, 。 臺, 共同 |
| 8 | 1 | 14 | 6 | 6 | 0.127112 | berani, menghadapi, tantangan, marilah, rintan... | 的 檢驗, 魔咒, 作主, 人民 作主, 檢驗, 最, 民主 國家, 每, 。 臺, 共同 |
| 9 | 9 | 4 | 6 | 6 | 0.073926 | energi, geotermal, mampu, harus, karena, menca... | 的 檢驗, 魔咒, 作主, 人民 作主, 檢驗, 最, 民主 國家, 每, 。 臺, 共同 |


The topic alignment shows low overall similarity between the topics of the Indonesian and Taiwanese inauguration speeches: the highest cosine similarity is about 0.28 and the lowest about 0.07. Every Indonesian topic receives a nearest Taiwanese topic by construction, but the semantic overlap between the dominant themes in the two countries' speeches is weak.

Notably, eight of the ten Indonesian topics, including those on sovereignty (ID topic 2), leadership (ID topic 5), and democracy (ID topic 6), found their best, albeit weak, match in a single Taiwanese topic (TW topic 6), whose keywords include '民主 國家' (democratic country) and '人民 作主' (the people as masters). Because this step takes each Indonesian topic's nearest neighbour rather than a one-to-one assignment, several Indonesian topics can share the same Taiwanese match.
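The concentration of matches on one Taiwanese topic can be checked with a short tally. The sketch below copies the `(id_topic, tw_topic)` pairs by hand from the alignment table above; the variable names are illustrative only.

```python
from collections import Counter

# (id_topic, tw_topic) pairs copied from the alignment table, in table order.
id_tw_pairs = [
    (2, 6), (0, 4), (5, 6), (7, 6), (6, 6),
    (3, 6), (8, 8), (4, 6), (1, 6), (9, 6),
]

# Count how many Indonesian topics chose each Taiwanese topic as nearest neighbour.
tw_counts = Counter(tw for _, tw in id_tw_pairs)
most_common_tw, n_hits = tw_counts.most_common(1)[0]
share = n_hits / len(id_tw_pairs)

print(dict(tw_counts))
print(f"TW topic {most_common_tw} is the best match for {n_hits} of "
      f"{len(id_tw_pairs)} Indonesian topics ({share:.0%})")
```

A tally like this is a quick way to spot a "hub" topic that attracts most matches, which is worth checking before reading much into any single pair.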