# **Taiwanese and Indonesian Presidential Inauguration Speeches: What Does the Future Hold?**

## **Traditional Chinese - Model H384**



## **Introduction**

This project analyzes presidential inauguration speeches from Taiwan and Indonesia. Both countries are Asian democracies, but they differ in political systems, culture, and historical background. By comparing their inauguration speeches, we aim to understand whether they share similar political themes or express different visions for the future.

We use a topic modeling approach based on multilingual BERT, which allows texts in different languages to be analyzed within the same semantic space. This makes cross-country comparison of political discourse possible.

## **Purpose of the project**

- To apply topic modeling separately to Taiwanese and Indonesian inauguration speeches in order to identify the main themes in each country
- To select the better-performing model and then combine both corpora to compare topics using cosine similarity, measuring how similar or different the themes are across countries.

To ensure model quality, we test two configurations, v2 and H384. For each model, we conduct parameter tuning and diagnostic evaluation to examine trade-offs between coherence, topic separation, and topic diversity. Based on these results, we select the more suitable model for cross-country analysis.

In this notebook, we present the full workflow, including text preprocessing, topic modeling with the H384 model, parameter tuning, and diagnostic analysis. This workflow forms the basis for the final comparative study.


# **1. Import Libraries**

In [None]:
!pip install requests nltk spacy bertopic gensim

Collecting bertopic
  Downloading bertopic-0.17.4-py3-none-any.whl.metadata (24 kB)
Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading bertopic-0.17.4-py3-none-any.whl (154 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim, bertopic
Successfully installed bertopic-0.17.4 gensim-4.4.0


In [None]:
# for the scraper
import requests
from bs4 import BeautifulSoup
import os
import re

# general utilities
from collections import Counter
from itertools import combinations
import math
import pickle
import statistics

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd

# topic modelling
import torch  # required by encode_H (torch.no_grad)
from nltk.tokenize import RegexpTokenizer
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import CountVectorizer
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from umap import UMAP
from hdbscan import HDBSCAN
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance



# **2. Import Data**

- Dataset: Presidential inauguration speech (2024)


Key Details

- Source: Office of the President, Republic of China (Taiwan)

- URL: https://www.president.gov.tw/Page/700

- Method of collection: Web scraping

- Number of tokens: 5297 (each token is a single Traditional Chinese character)

- Feature types: Texts

- Subject area: Political Science

In [None]:
# Download the speech page and extract the article text

url = "https://www.president.gov.tw/NEWS/28428"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
response.raise_for_status()  # stop early on HTTP errors
response.encoding = "utf-8"

soup = BeautifulSoup(response.text, "html.parser")

speech_div = soup.find("div", class_="article1")

if speech_div is None:
    raise ValueError("Could not find the speech container (div.article1)")

# Note: despite its name, `df` is a plain string here; it becomes a DataFrame in Section 3.
df = speech_div.get_text(separator="\n", strip=True)

with open("full_speech_chinese.txt", "w", encoding="utf-8") as f:
    f.write(df)
unwanted = "中華民國第16任總統賴清德伉儷及副總統蕭美琴今（20）日上午參加在總統府府前廣場舉行的就職慶祝大會，總統並以「打造民主和平繁榮的新臺灣」為題，發表就職演說，演說全文為："

df = df.replace(unwanted, "").strip()

df

'蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收看直播的好朋友，全體國人同胞，大家好：\n我年輕的時候，立志行醫救人。我從政的時候，立志改變臺灣。現在，站在這裡，我立志壯大國家！\n我以無比堅定的心情，接受人民的託付，就職中華民國第十六任總統，我將依據中華民國憲政體制，肩負起帶領國家勇往前進的重責大任。\n回想1949年的今天，臺灣實施戒嚴，全面進入專制的黑暗年代。\n1996年的今天，臺灣第一位民選總統宣誓就職，向國際社會傳達，中華民國臺灣是一個主權獨立的國家、主權在民。\n2024年的今天，臺灣在完成三次政黨輪替之後，第一次同一政黨連續執政，正式展開第三任期！臺灣也揚帆進入一個充滿挑戰，又孕育無限希望的新時代。\n這段歷程，是這塊土地上的人們，前仆後繼、犧牲奉獻所帶來的結果。雖然艱辛，但我們做到了！\n此時此刻，我們不只見證新政府的開始，也是再一次迎接得來不易的民主勝利！\n許多人將我和蕭美琴副總統的當選，解讀為「打破八年政黨輪替魔咒」。事實上，民主就是人民作主，每一次的選舉，虛幻的魔咒並不存在，只有人民對執政黨最嚴格的檢驗、對國家未來最真實的選擇。\n我要感謝，過去八年來，蔡英文前總統、陳建仁前副總統和行政團隊的努力，為臺灣的發展，打下堅實的基礎。也請大家一起給他們一個最熱烈的掌聲！\n我也要感謝國人同胞大家的支持，不受外來勢力的影響，堅定守護民主，向前走；不回頭，為臺灣翻開歷史的新頁！\n在未來任期的每一天，我將「行公義，好憐憫，存謙卑的心」，「視民如親」，不愧於每一分信賴與託付。新政府也將兢兢業業，拿出最好的表現，來接受全民的檢驗。我們的施政更要不斷革新，開創臺灣政治的新風貌。\n一、行政立法協調合作，共同推動國政\n今年二月上任的新國會，是臺灣時隔十六年後，再次出現「三黨不過半」的立法院。面對這個政治新局，有些人抱持期待，也有些人感到憂心。\n我要告訴大家，這是全民選擇的新模式，當我們以新思維看待「三黨不過半」，這代表著，朝野政黨都能分享各自的理念，也將共同承擔國家的種種挑戰。\n然而，全民對於政黨的理性問政，也有很大的期待。政黨在競爭之外，也應該有合作的信念，國家才能踏出穩健的步伐。\n立法院的議事運作，應該遵守程序正義，多數尊重少數，少數服從多數，才能避免衝突，維持社會的安定和諧。\n在民主社會，人民的利益

In [None]:

#! pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-chinese",
    use_fast=True
)
text = df

token_count = len(tokenizer.tokenize(text))

print("Token count =", token_count)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/110k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/269k [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (5297 > 512). Running this sequence through the model will result in indexing errors


Token count = 5297


# **3. Text Cleaning**

- Removed URLs and HTML tags from the text.
- Normalized whitespace by collapsing runs of whitespace into a single space and trimming leading and trailing whitespace.
- Applied the cleaning function to all text entries.

In [None]:
# Pre-cleaning. The raw text is kept in `texts` for paragraph splitting: Chinese has
# no clear sentence delimiter, so the line breaks are dropped only after splitting.

df = pd.DataFrame({
    "text": [df]
})

texts = df["text"].tolist()

def clean_text(text):
    text = re.sub(r"http\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["cleaned"] = df["text"].apply(clean_text)
print(df.head(5))

df["cleaned"] = df["cleaned"].astype(str)
df = df[df["cleaned"].str.strip() != ""]
df = df[df["cleaned"].str.lower() != "nan"]

                                                text  \
0  蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收看...   

                                             cleaned  
0  蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收看...  


In [None]:
print("df type:", type(df))
print("texts type:", type(texts))
print("first element type:", type(texts[0]))
print("sample text preview:", texts[0][:50])

df type: <class 'pandas.core.frame.DataFrame'>
texts type: <class 'list'>
first element type: <class 'str'>
sample text preview: 蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收看直播的好


# **4. Splitting into Documents**

### What was done

- Split the text into paragraphs at line breaks.
- Trimmed leading and trailing whitespace for each paragraph.
- Filtered out very short paragraphs (fewer than 20 characters).
- Computed token counts to check whether any paragraph exceeds the input limit of the BERT model.

### Why this step is needed

For Traditional Chinese text, tokenization behavior differs significantly from Latin alphabet languages, making it necessary to control input length and chunking structure before downstream processing. Splitting long text into paragraph-level documents helps ensure that each unit remains within practical token limits while preserving coherent semantic boundaries.

In [None]:
def split_paragraphs(text, min_len=20):
    """
    Split text into natural paragraphs using line breaks.
    Filter out very short paragraphs.
    """
    paras = [p.strip() for p in text.split("\n") if len(p.strip()) >= min_len]
    return paras

documents = []
for t in texts:
    documents.extend(split_paragraphs(t))

In [None]:
print("Number of documents (paragraphs):", len(documents))
for i, d in enumerate(documents[:5]):
    print(f"\n--- Paragraph {i} ---")
    print(d[:200])

Number of documents (paragraphs): 79

--- Paragraph 0 ---
蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收看直播的好朋友，全體國人同胞，大家好：

--- Paragraph 1 ---
我年輕的時候，立志行醫救人。我從政的時候，立志改變臺灣。現在，站在這裡，我立志壯大國家！

--- Paragraph 2 ---
我以無比堅定的心情，接受人民的託付，就職中華民國第十六任總統，我將依據中華民國憲政體制，肩負起帶領國家勇往前進的重責大任。

--- Paragraph 3 ---
回想1949年的今天，臺灣實施戒嚴，全面進入專制的黑暗年代。

--- Paragraph 4 ---
1996年的今天，臺灣第一位民選總統宣誓就職，向國際社會傳達，中華民國臺灣是一個主權獨立的國家、主權在民。


In [None]:

doc_token_counts = [
    len(tokenizer.encode(doc, add_special_tokens=True))
    for doc in documents
]

df_token_stats = pd.DataFrame({
    "document": documents,
    "token_count": doc_token_counts
})

In [None]:
df_token_stats["token_count"].describe()

| statistic | token_count |
|---|---|
| count | 79.0 |
| mean | 66.924051 |
| std | 25.685361 |
| min | 22.0 |
| 25% | 49.0 |
| 50% | 66.0 |
| 75% | 79.5 |
| max | 154.0 |
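
With a maximum of 154 tokens per paragraph, every document fits well within the 512-token limit mentioned in the tokenizer warning, so no further chunking is needed for this speech. For longer texts, a paragraph that exceeds the limit could be split at sentence-final punctuation and repacked. The sketch below shows one way to do that under stated assumptions: `split_long_paragraph` and `max_chars` are hypothetical names that do not appear in the notebook, and character count is used as a rough proxy for token count (reasonable for `bert-base-chinese`, where most tokens are single characters).

```python
import re

# Hypothetical helper, not part of the notebook.
# Splits after Chinese sentence-final punctuation and greedily repacks
# sentences into chunks of at most `max_chars` characters.
SENTENCE_END = re.compile(r"(?<=[。！？])")

def split_long_paragraph(paragraph, max_chars=500):
    if len(paragraph) <= max_chars:
        return [paragraph]
    sentences = [s for s in SENTENCE_END.split(paragraph) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks

example = "我年輕的時候，立志行醫救人。我從政的時候，立志改變臺灣。現在，站在這裡，我立志壯大國家！"
chunks = split_long_paragraph(example, max_chars=20)
print(chunks)
```

A single sentence longer than `max_chars` is kept whole in this sketch, so a real pipeline would still rely on the tokenizer's `truncation=True` as a last resort.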


# **5. Model Training - MiniLM-L12-H384**

In [None]:
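import torch  # needed for torch.no_grad() below

MODEL_NAME_H = "microsoft/Multilingual-MiniLM-L12-H384"

tokenizer_H = AutoTokenizer.from_pretrained(MODEL_NAME_H, use_fast=False)
model_H = AutoModel.from_pretrained(MODEL_NAME_H)
model_H.eval()  # inference mode: disables dropout

def encode_H(texts):
    """Return one embedding per text by averaging the last hidden states."""
    inputs = tokenizer_H(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model_H(**inputs)
    # Plain mean over the sequence axis; padded positions are included in the average
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.cpu().numpy()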

tokenizer_config.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/430 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/471M [00:00<?, ?B/s]

In [None]:
#pip install jieba
import jieba

def jieba_tokenizer(text):
    return list(jieba.cut(text))



### Jieba Tokenization and Stop Words

Chinese text does not contain clear word boundaries, so tokenization must be handled explicitly before vectorization. Jieba is used as a dictionary-based Chinese tokenizer to segment raw text into meaningful lexical units. This step allows downstream components such as `CountVectorizer` to operate on word-level or phrase-level features instead of individual characters.

A custom stop word list is applied to remove pronouns, demonstratives, and connective terms.
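
In the cells shown here, `CountVectorizer` receives `jieba_tokenizer` directly and no stop word list, which is consistent with punctuation and function words such as `，` and `都` appearing in the resulting topic table. A minimal sketch of a filtering tokenizer follows. The names `make_filtered_tokenizer`, `STOP_WORDS`, and `demo_segment`, as well as the short word list, are illustrative assumptions and not the project's actual list; in the notebook, `segment` would be `jieba.cut`.

```python
import unicodedata

# Illustrative stop words only (pronouns, demonstratives, connectives);
# this is NOT the project's actual list.
STOP_WORDS = {"我", "我們", "這", "那", "的", "也", "都", "和", "與", "是", "在"}

def is_punct_or_space(token):
    """True if every character is punctuation, a symbol, or whitespace."""
    return all(
        ch.isspace() or unicodedata.category(ch)[0] in ("P", "S")
        for ch in token
    )

def make_filtered_tokenizer(segment, stop_words):
    """Wrap a segmenter (e.g. jieba.cut) so it drops stop words and punctuation."""
    def tokenizer(text):
        return [
            tok for tok in segment(text)
            if tok and not is_punct_or_space(tok) and tok not in stop_words
        ]
    return tokenizer

# Stand-in segmenter so the sketch runs without jieba: it expects
# tokens already separated by "/" (hypothetical pre-segmented input).
def demo_segment(text):
    return text.split("/")

tokenize = make_filtered_tokenizer(demo_segment, STOP_WORDS)
tokens = tokenize("我們/都/是/這個/國家/的/主人/，/民主/！")
print(tokens)
```

Passing `make_filtered_tokenizer(jieba.cut, STOP_WORDS)` as the `tokenizer` argument of `CountVectorizer` would apply the same filter before the topic terms are counted.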

In [None]:
umap_model_H = UMAP(
    n_neighbors=10,
    n_components=5,
    min_dist=0.1,
    metric="cosine",
    random_state=42
)

hdbscan_model_H = HDBSCAN(
    min_cluster_size=2,
    min_samples=1,
    metric="euclidean",
    cluster_selection_method="leaf",
    prediction_data=True
)
tf_vectorizer_H = CountVectorizer(
    tokenizer=jieba_tokenizer,
    ngram_range=(1, 2),
    min_df=1,
    max_df=0.5,
    max_features=5000
)

rep_model_H = KeyBERTInspired()
mmr_H = MaximalMarginalRelevance(diversity=0.3)

representation_model_H = {
    "KeyBERT": rep_model_H,
    "MMR": mmr_H
}
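topic_model_H384 = BERTopic(
    # NOTE: BERTopic expects a supported embedding backend, such as a
    # SentenceTransformer instance or a bertopic.backend.BaseEmbedder subclass.
    # A plain function like encode_H may be ignored in favour of a default
    # sentence-transformers model; the download log printed after this cell is
    # consistent with that, so check which embeddings are actually used.
    embedding_model=encode_H,
    representation_model=representation_model_H,
    umap_model=umap_model_H,
    hdbscan_model=hdbscan_model_H,
    vectorizer_model=tf_vectorizer_H,
    calculate_probabilities=True,
    verbose=True
)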

topics_H, probs_H = topic_model_H384.fit_transform(documents)
topic_info_H = topic_model_H384.get_topic_info()


2026-01-20 18:27:33,682 - BERTopic - Embedding - Transforming documents to embeddings.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/471M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

2026-01-20 18:27:45,930 - BERTopic - Embedding - Completed ✓
2026-01-20 18:27:45,932 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-20 18:27:54,530 - BERTopic - Dimensionality - Completed ✓
2026-01-20 18:27:54,531 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-20 18:27:54,552 - BERTopic - Cluster - Completed ✓
2026-01-20 18:27:54,568 - BERTopic - Representation - Fine-tuning topics using representation models.
Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.813 seconds.
Prefix dict has been built successfully.
2026-01-20 18:28:10,977 - BERTopic - Representation - Completed ✓


In [None]:
topic_info_H

Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,Representative_Docs
0,-1,10,-1_中華民國_， 都_都_國家,"[中華民國, ， 都, 都, 國家, 人民, 為, 和平, 是, 有, 灣 ，]","[根據 中華民國, 就 職中華民國, 憲政體制 ，, 有主權 才, 是 中華民國, 愛護國家...","[中華民國, ， 都, 國家, 人民, 灣 ，, 力量 ，, 能力 ，, 健康, 民主, 國民]",[全體國人不分族群，也不論先來後到，只要認同臺灣，都是這個國家的主人。無論是中華民國、中華民...
1,0,5,0_和平 穩定_的 和平_安全_穩定,"[和平 穩定, 的 和平, 安全, 穩定, 文化, 海 的, 好, 臺 海, 海, 穩定 ，]","[文攻武 嚇, 文化 永續, 印太區域 額外, 文化 ，, 撥款 法案, 的 文化, 、 文...","[和平 穩定, 的 和平, 安全, 文化, 臺 海, 代表 、, 也 完成, 不可或缺 的,...",[蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收...
2,1,5,1_會 有_會_， 會_有 更,"[會 有, 會, ， 會, 有 更, ；, 保障, 有, 更, 安全, 社會]","[公共 支持, 、 社會, 受到 保障, 保障 ，, 公共, 互相合作 、, 協定 ，, 社...","[會 有, 保障, 安全, 社會, 安全 、, 運作 ，, 加入, 、 社會, 和, ！ ]",[臺灣已經申請加入CPTPP，我們會積極爭取加入區域經濟整合；跟世界民主國家簽訂雙邊投資保障...
3,2,4,2_， 臺_在_世界_灣是,"[， 臺, 在, 世界, 灣是, 臺 灣是, 年, 第一, 年 ，, 臺 南, 南]","[灣的 國際, 國內 外賓客, 世界 迎接, 各國 已, 世界 和平, 當 世界, 國際 情...","[世界, 年 ，, 臺 南, 南, 國際, 」 的, 先生, 世界 迎接, 世界 和平, 今...",[當世界上有越來越多國家，公開支持臺灣的國際參與，在在證明了，臺灣是世界的臺灣，臺灣是全球和...
4,3,4,3_對_」_「_人民,"[對, 」, 「, 人民, 取代, 政府 ，, 作主, 存在, 人民 作主, 拿出]","[對 執政黨, 政府 ，, 接受 全民, 臺 灣民選, 就是 人民, 與 公共政策, 視民如...","[取代, 政府 ，, 作主, 人民 作主, 事實, ， 拿出, 政府, 新政府 也, 新 風...",[在未來任期的每一天，我將「行公義，好憐憫，存謙卑的心」，「視民如親」，不愧於每一分信賴與託...
5,4,3,4_立志_我_可以_時候 ，,"[立志, 我, 可以, 時候 ，, 時候, 的 時候, ， 立志, 我, 在, 人]","[我 從政, 從政 的, 政府 的, 我要 建立, 從政, 政府, 政府 都, 全民 所, ...","[我, 的 時候, ， 立志, 政府, 人生 的, 可以 擁有, 全力以赴 。, 了解 國人...",[我了解國人對生活的煩惱和期待，凡是各位關心的議題、社會需要的改革，政府都會積極以對，全力以...
6,5,3,5_新政府_」 ，_「 四個_一次 迎接,"[新政府, 」 ，, 「 四個, 一次 迎接, 不 只, 也將, 也將 共同, 以 新思維,...","[承接 民主化, 見證 新政府, 這是 全民, 新政府 將, 全民 選擇, 新政府 的, 得...","[新政府, 「 四個, 一次 迎接, 也將 共同, 以 新思維, 不卑不亢 ，, 勝利 ！,...","[此時此刻，我們不只見證新政府的開始，也是再一次迎接得來不易的民主勝利！, 由於兩岸的未來，..."
7,6,3,6_她_我_她 ，_國人 ，,"[她, 我, 她 ，, 國人 ，, 國人, 和, 再次, 我 也, 面對, 國際]","[偉大國家 ！, 感謝 國際, 國際 社會, 國家 可以, 一位 國人, 一個 國家, 偉大...","[她 ，, 國人 ，, 我 也, 面對, 國際, 社會, 0403 災後的, 一位 國人, ...",[親愛的國人同胞，國家的未來發展，需要每一分力量。面對全球化、全面性競爭的時代，沒有一個國家...
8,7,3,7_更好_我會_， 我會_ 大家,"[更好, 我會, ， 我會, 大家, 大家, 中央政府, 285.5, 、 槍, 285...","[中央政府 已經, 中央政府, 幫助 花蓮民眾, 大家 期待, 大家 希望, 花蓮民眾 可以...","[， 我會, 中央政府, 285.5 億元, 大家 希望, 億元 ，, 幫助 花蓮民眾, 中...","[大家希望收入更高，我會推動產業升級，創造更好的薪資環境。, 大家期待治安更好，我會積極打擊..."
9,8,3,8_掌聲_的 掌聲_一步_感謝,"[掌聲, 的 掌聲, 一步, 感謝, 總統, ！ , 前 總統, 一步 ，, 一個 最熱烈...","[大展身手 ，, 我們 是不是, 過去 八年, 的 推手, 的 僑胞, 的 掌聲, 的 努力...","[的 掌聲, 今天 現場, 世界 繁榮, 一步 ！, 以實際 行動, 他們 ！, 八年 來,...","[我們要走對的路，產業要大展身手，做世界繁榮的推手，讓臺灣每前進一步，世界就向前一步！, 今..."


### **Short Conclusion**

The topic model extracts a set of coherent, policy-oriented themes dominated by governance, democracy, national identity, economic development, technology (AI, semiconductors), social welfare, and international relations.
High-frequency topics emphasize government action, democratic values, and future-oriented economic/industrial transformation, while lower-weight topics capture historical references, national security, and innovation narratives.

Overall, the results suggest a discourse structure centered on state leadership and democratic legitimacy, with strong secondary focus on economic modernization and technological competitiveness.


# **6. Topic Result Diagnostic**
## **Topic Coherence**

In [None]:

top_words_dict_H = {}

for topic_id, words_scores in topic_model_H384.get_topics().items():
    if topic_id == -1:
        continue
    top_words_dict_H[topic_id] = [w for w, _ in words_scores]

In [None]:
vectorizer_H = topic_model_H384.vectorizer_model

texts_cleaned_H = [
    vectorizer_H.build_tokenizer()(doc)
    for doc in documents
]
from gensim.corpora import Dictionary

dictionary_H = Dictionary(texts_cleaned_H)
dictionary_H.filter_extremes(no_below=1, no_above=0.9)

dictionary_vocab_H = set(dictionary_H.token2id.keys())
topics_for_gensim_H = []
topic_ids_for_gensim_H = []

for topic_id, words_scores in topic_model_H384.get_topics().items():
    if topic_id == -1:
        continue

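    # Keep the top 10 terms, then drop any term missing from the gensim dictionary.
    # The dictionary is built from single jieba tokens, so bigram terms are removed here.
    words_H = [w for w, _ in words_scores][:10]
    words_H = [w for w in words_H if w in dictionary_vocab_H]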

    if len(words_H) < 2:
        continue

    topic_ids_for_gensim_H.append(topic_id)
    topics_for_gensim_H.append(words_H)

In [None]:
from gensim.models import CoherenceModel
import pandas as pd
cm_umass_H = CoherenceModel(
    topics=topics_for_gensim_H,
    texts=texts_cleaned_H,
    dictionary=dictionary_H,
    coherence="u_mass"
)

cm_uci_H = CoherenceModel(
    topics=topics_for_gensim_H,
    texts=texts_cleaned_H,
    dictionary=dictionary_H,
    coherence="c_uci"
)

cm_npmi_H = CoherenceModel(
    topics=topics_for_gensim_H,
    texts=texts_cleaned_H,
    dictionary=dictionary_H,
    coherence="c_npmi"
)

In [None]:
coherence_df_H = (
    pd.DataFrame({
        "topic_id": topic_ids_for_gensim_H,
        "umass": cm_umass_H.get_coherence_per_topic(),
        "uci": cm_uci_H.get_coherence_per_topic(),
        "npmi": cm_npmi_H.get_coherence_per_topic(),
    })
    .set_index("topic_id")
    .sort_index()
)

print(coherence_df_H)
avg_umass_H = coherence_df_H["umass"].mean()
avg_uci_H = coherence_df_H["uci"].mean()
avg_npmi_H = coherence_df_H["npmi"].mean()

print(f"[H] Average UMass coherence: {avg_umass_H:.4f}")
print(f"[H] Average UCI coherence: {avg_uci_H:.4f}")
print(f"[H] Average NPMI coherence: {avg_npmi_H:.4f}")

              umass        uci      npmi
topic_id                                
0        -17.172901 -11.533358 -0.288517
1         -7.023148  -6.680094 -0.112552
2         -4.918803  -2.593899  0.067044
3         -5.095375 -10.387327 -0.297733
4         -7.299734  -9.259645 -0.199511
5        -24.647868 -17.524556 -0.634235
6         -5.735372 -11.283855 -0.318646
7        -15.098943  -8.286370 -0.149408
8        -12.391511 -11.215555 -0.340490
9         -0.753772   0.833703  0.184688
10        -6.389306  -9.307432 -0.284931
11        -4.335365  -8.487702 -0.139808
12        -4.376218 -11.542899 -0.280454
13       -15.507716 -15.743408 -0.569773
14        -7.955445  -7.990797 -0.155481
15        -5.761428  -3.360637  0.141933
16        -0.767934  -5.974519 -0.013592
17       -16.702222 -17.843330 -0.645772
18       -24.647868 -16.893929 -0.611412
19        -1.666239  -6.028172 -0.121835
20        -1.535502  -6.822344 -0.126340
21        -7.525132 -11.342034 -0.307155
22        -2.806

## **Topic Similarity**

In [None]:
topics_H = {
    topic_id: words
    for topic_id, words in topic_model_H384.get_topics().items()
    if topic_id != -1
}

In [None]:
from sklearn.preprocessing import normalize
import numpy as np

vocab_H = list(dictionary_vocab_H)
vocab_index_H = {w: i for i, w in enumerate(vocab_H)}
def topic_word_matrix_H(topic_dict, vocab_index):
    mat, topic_ids = [], []

    for topic_id, words in topic_dict.items():
        vec = np.zeros(len(vocab_index))
        for w, weight in words:
            if w in vocab_index:
                vec[vocab_index[w]] = weight
        mat.append(vec)
        topic_ids.append(topic_id)

    mat = normalize(np.array(mat), norm="l1")
    return mat, topic_ids


topic_matrix_H, topic_ids_H = topic_word_matrix_H(
    topics_H, vocab_index_H
)

In [None]:
from itertools import combinations
from scipy.spatial.distance import cosine
import numpy as np

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def kl_divergence(p, q, eps=1e-12):
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * np.log(p / q))

def js_divergence(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m, eps) + 0.5 * kl_divergence(q, m, eps)
similarity_results_H = []

for i, j in combinations(range(topic_matrix_H.shape[0]), 2):
    p = topic_matrix_H[i]
    q = topic_matrix_H[j]

    similarity_results_H.append({
        "topic_1": topic_ids_H[i],
        "topic_2": topic_ids_H[j],
        "hellinger": hellinger(p, q),
        "kl": kl_divergence(p, q),
        "js": js_divergence(p, q),
        "cosine": 1 - cosine(p, q)
    })
similarity_df_H = pd.DataFrame(similarity_results_H)
similarity_df_H
avg_hellinger_H = similarity_df_H["hellinger"].mean()
avg_kl_H = similarity_df_H["kl"].mean()
avg_js_H = similarity_df_H["js"].mean()
avg_cosine_H = similarity_df_H["cosine"].mean()

print(f"[H] Average Hellinger distance: {avg_hellinger_H:.4f}")
print(f"[H] Average KL divergence: {avg_kl_H:.4f}")
print(f"[H] Average JS divergence: {avg_js_H:.4f}")
print(f"[H] Average cosine similarity: {avg_cosine_H:.4f}")

[H] Average Hellinger distance: 0.9906
[H] Average KL divergence: 25.6223
[H] Average JS divergence: 0.6811
[H] Average cosine similarity: 0.0162


## **Topic Diversity**

In [None]:
def topic_diversity_H(topic_model_H384, top_n=10):
    """
    Topic diversity = unique words / total words
    Computed from topic_model_H384 (exclude outlier topic -1)
    """
    topics_H = []

    for topic_id in topic_model_H384.get_topics().keys():
        if topic_id == -1:
            continue

        topic_words = topic_model_H384.get_topic(topic_id)
        if topic_words is None:
            continue

        words = [w for w, _ in topic_words[:top_n]]
        if not words:
            continue

        topics_H.append(words)

    if not topics_H:
        return np.nan

    all_words = sum(topics_H, [])
    return len(set(all_words)) / len(all_words)
diversity_H = topic_diversity_H(topic_model_H384, top_n=10)
print(f"[H] Topic diversity (top-10): {diversity_H:.4f}")

[H] Topic diversity (top-10): 0.9000


### **Quantitative Evaluation Summary**

- **Coherence:** The UMass, UCI, and NPMI scores are low and mostly negative, indicating weak cohesion within topics. Topics are interpretable at a high level, but word co-occurrence strength is limited.
- **Separation:** The very high Hellinger distance (0.99), high KL and JS divergence (25.62 and 0.68), and near-zero cosine similarity (0.02) indicate strong separation between topics, with minimal overlap.
- **Diversity:** A topic diversity of 0.90 (top-10 words) suggests low redundancy across topics; the model covers a broad vocabulary space.

Net assessment:
The model prioritizes distinctness and coverage over intra-topic coherence. It is suitable for mapping broad thematic structure, but less suitable for fine-grained semantic interpretation without further tuning.
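
To put the separation figures in context: for probability vectors, the Hellinger distance is bounded by 1 and the Jensen-Shannon divergence (natural logarithm) by ln 2 ≈ 0.693, and both bounds are reached when two topics share no terms. The reported averages of 0.9906 and 0.6811 therefore sit close to the maximum. The sketch below checks those bounds on two toy distributions. The vectors are made up for illustration, and the functions mirror the definitions in the similarity cell but are rewritten with the standard library only.

```python
import math

def hellinger(p, q):
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q, eps=1e-12):
    # Clip to eps, as in the notebook, to avoid log(0)
    return sum(max(a, eps) * math.log(max(a, eps) / max(b, eps)) for a, b in zip(p, q))

def js_divergence(p, q, eps=1e-12):
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m, eps) + 0.5 * kl_divergence(q, m, eps)

# Toy topic-word distributions (made up): the two topics share no terms
p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]

h_disjoint = hellinger(p, q)
js_disjoint = js_divergence(p, q)
h_same = hellinger(p, p)
js_same = js_divergence(p, p)

print(f"disjoint:  Hellinger={h_disjoint:.4f}, JS={js_disjoint:.4f}, ln2={math.log(2):.4f}")
print(f"identical: Hellinger={h_same:.4f}, JS={js_same:.4f}")
```

The KL divergence has no such upper bound: with the `eps` clipping used here its size depends mostly on how many terms one topic has that the other lacks, which is why it is harder to read on its own.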

# **7. Parameter Tuning**

We tune the hyperparameters to check whether a different configuration yields better coherence.

In [None]:
#Model embedding

MODEL_NAME = "microsoft/Multilingual-MiniLM-L12-H384"

tokenizer_H = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
model_H = AutoModel.from_pretrained(MODEL_NAME)
model_H.eval()

def encode_H(texts):
    inputs = tokenizer_H(
        texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model_H(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.cpu().numpy()

In [None]:
# Coherence
tokenizer_coh = RegexpTokenizer(r"\w+")

docs_as_sets = [
    set(tokenizer_coh.tokenize(doc.lower()))
    for doc in documents
]
N = len(docs_as_sets)
def doc_freq(word):
    return sum(1 for doc in docs_as_sets if word in doc)

def co_doc_freq(w1, w2):
    return sum(1 for doc in docs_as_sets if w1 in doc and w2 in doc)

def umass_coherence(words, eps=1e-12):
    scores = []
    for w1, w2 in combinations(words, 2):
        D_w1w2 = co_doc_freq(w1, w2)
        D_w2 = doc_freq(w2)
        if D_w2 > 0:
            scores.append(np.log((D_w1w2 + eps) / D_w2))
    return np.mean(scores) if scores else np.nan

def uci_coherence(words, eps=1e-12):
    scores = []
    for w1, w2 in combinations(words, 2):
        p_w1 = doc_freq(w1) / N
        p_w2 = doc_freq(w2) / N
        p_w1w2 = co_doc_freq(w1, w2) / N
        if p_w1 > 0 and p_w2 > 0:
            scores.append(np.log((p_w1w2 + eps) / (p_w1 * p_w2)))
    return np.mean(scores) if scores else np.nan




In [None]:
import jieba

def doc_to_set_zh(doc: str):
    return set([w for w in jieba.cut(doc) if w.strip()])

docs_as_sets = [doc_to_set_zh(d) for d in documents]   # note: `documents` holds the paragraphs; this replaces the regex-based sets defined earlier
N = len(docs_as_sets)

def doc_freq(word):
    return sum(1 for doc in docs_as_sets if word in doc)

def co_doc_freq(w1, w2):
    return sum(1 for doc in docs_as_sets if w1 in doc and w2 in doc)

import numpy as np
from itertools import combinations

def npmi_coherence(words, eps=1e-12, min_cooccur=1):
    scores = []
    for w1, w2 in combinations(words, 2):
        D_w1w2 = co_doc_freq(w1, w2)
        if D_w1w2 < min_cooccur:
            continue

        p_w1 = doc_freq(w1) / N
        p_w2 = doc_freq(w2) / N
        p_w1w2 = D_w1w2 / N

        pmi = np.log((p_w1w2 + eps) / (p_w1 * p_w2 + eps))
        scores.append(pmi / (-np.log(p_w1w2 + eps)))

    return float(np.mean(scores)) if scores else np.nan
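
For reference, the three hand-written coherence functions in this section correspond to the document-level definitions below, where $D(\cdot)$ counts the paragraphs containing the given word or word pair, $N$ is the number of paragraphs, $p(\cdot) = D(\cdot)/N$, and each topic score is the mean over its word pairs. This is a transcription of the code above, not an independent specification of the metrics.

$$
\begin{aligned}
\mathrm{UMass}(w_1, w_2) &= \log \frac{D(w_1, w_2) + \epsilon}{D(w_2)}, \\
\mathrm{UCI}(w_1, w_2) &= \log \frac{p(w_1, w_2) + \epsilon}{p(w_1)\, p(w_2)}, \\
\mathrm{NPMI}(w_1, w_2) &= \frac{\log \dfrac{p(w_1, w_2) + \epsilon}{p(w_1)\, p(w_2) + \epsilon}}{-\log\bigl(p(w_1, w_2) + \epsilon\bigr)}
\quad \text{for } D(w_1, w_2) \ge \texttt{min\_cooccur}.
\end{aligned}
$$

Because `npmi_coherence` skips pairs that never co-occur instead of scoring them, its averages are taken over co-occurring pairs only. They are therefore not directly comparable with the gensim `c_npmi` values in Section 6, and topics with few co-occurring pairs can receive high scores.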

In [None]:
# Parameter grid
UMAP_GRID = {
    "n_neighbors": [10, 15],
    "min_dist": [0.0, 0.1]
}

HDBSCAN_GRID = {
    "min_cluster_size": [2, 3, 5]
}

VECTORIZER_GRID = {
    "ngram_range": [(1, 1)],
    "min_df": [1],
    "max_df": 1.0
}

REPRESENTATION_GRID = [
    {"KeyBERT": KeyBERTInspired()},
    {
        "KeyBERT": KeyBERTInspired(),
        "MMR": MaximalMarginalRelevance(diversity=0.3)
    }
]
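
The grids above are consumed by nested loops in the next cell. An equivalent flat enumeration with `itertools.product` is sketched here as an alternative. `iter_configs` and `GRID` are hypothetical names, and the dictionary simply repeats the UMAP and HDBSCAN values defined above.

```python
from itertools import product

# Same UMAP and HDBSCAN values as the grids in this section
GRID = {
    "n_neighbors": [10, 15],
    "min_dist": [0.0, 0.1],
    "min_cluster_size": [2, 3, 5],
}

def iter_configs(grid):
    """Yield one dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(GRID))
print(len(configs), "UMAP/HDBSCAN combinations")
print(configs[0])
```

Combined with the two representation settings, these 12 combinations account for the 24 rows of the results table.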

In [None]:
results = []

for n_neighbors in UMAP_GRID["n_neighbors"]:
    for min_dist in UMAP_GRID["min_dist"]:

        umap_model = UMAP(
            n_neighbors=n_neighbors,
            n_components=5,
            min_dist=min_dist,
            metric="cosine",
            random_state=42
        )

        for min_cluster_size in HDBSCAN_GRID["min_cluster_size"]:

            hdbscan_model = HDBSCAN(
                min_cluster_size=min_cluster_size,
                min_samples=1,
                metric="euclidean",
                cluster_selection_method="leaf",
                prediction_data=True
            )

            for ngram_range in VECTORIZER_GRID["ngram_range"]:
                for min_df in VECTORIZER_GRID["min_df"]:

                    vectorizer = CountVectorizer(
                        tokenizer=lambda x: list(jieba.cut(x)),
                        ngram_range=ngram_range,
                        min_df=min_df,
                        max_df=VECTORIZER_GRID["max_df"],
                        max_features=5000
                    )

                    for representation_model in REPRESENTATION_GRID:

                        try:
                            topic_model = BERTopic(
                                embedding_model=encode_H,
                                umap_model=umap_model,
                                hdbscan_model=hdbscan_model,
                                vectorizer_model=vectorizer,
                                representation_model=representation_model,
                                calculate_probabilities=False,
                                verbose=False
                            )

                            topics, _ = topic_model.fit_transform(documents)

                            topic_info = topic_model.get_topic_info()
                            topic_ids = topic_info.Topic[topic_info.Topic != -1]
                            n_topics = len(topic_ids)

                            umass, uci, npmi = [], [], []

                            for t in topic_ids:
                                words = [w for w, _ in topic_model.get_topic(t)]
                                umass.append(umass_coherence(words))
                                uci.append(uci_coherence(words))
                                npmi.append(npmi_coherence(words))

                            results.append({
                                "model": "H384",
                                "n_neighbors": n_neighbors,
                                "min_dist": min_dist,
                                "min_cluster_size": min_cluster_size,
                                "ngram_range": ngram_range,
                                "min_df": min_df,
                                "representation": list(representation_model.keys()),
                                "n_topics": n_topics,
                                "umass": np.nanmean(umass),
                                "uci": np.nanmean(uci),
                                "npmi": np.nanmean(npmi)
                            })

                            print("OK →",
                                  n_neighbors, min_dist, min_cluster_size,
                                  ngram_range, min_df,
                                  list(representation_model.keys()),
                                  "topics:", n_topics)

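                        except Exception as exc:
                            # Report failed configurations instead of skipping them silently
                            print("FAILED →",
                                  n_neighbors, min_dist, min_cluster_size,
                                  repr(exc))
                            continue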



OK → 10 0.0 2 (1, 1) 1 ['KeyBERT'] topics: 24
OK → 10 0.0 2 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 24
OK → 10 0.0 3 (1, 1) 1 ['KeyBERT'] topics: 17
OK → 10 0.0 3 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 17
OK → 10 0.0 5 (1, 1) 1 ['KeyBERT'] topics: 10
OK → 10 0.0 5 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 10
OK → 10 0.1 2 (1, 1) 1 ['KeyBERT'] topics: 26
OK → 10 0.1 2 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 26
OK → 10 0.1 3 (1, 1) 1 ['KeyBERT'] topics: 16
OK → 10 0.1 3 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 16
OK → 10 0.1 5 (1, 1) 1 ['KeyBERT'] topics: 10
OK → 10 0.1 5 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 10
OK → 15 0.0 2 (1, 1) 1 ['KeyBERT'] topics: 27
OK → 15 0.0 2 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 27
OK → 15 0.0 3 (1, 1) 1 ['KeyBERT'] topics: 15
OK → 15 0.0 3 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 15
OK → 15 0.0 5 (1, 1) 1 ['KeyBERT'] topics: 10
OK → 15 0.0 5 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 10
OK → 15 0.1 2 (1, 1) 1 ['KeyBERT'] topics: 27
OK → 15 0.1 2 (1, 1) 1 ['KeyBERT', 'MMR'] topics: 27
OK → 15 0.

In [None]:
results_df_H3 = pd.DataFrame(results)
print(results_df_H3.columns)
results_df_H3.head()

Index(['model', 'n_neighbors', 'min_dist', 'min_cluster_size', 'ngram_range',
       'min_df', 'representation', 'n_topics', 'umass', 'uci', 'npmi'],
      dtype='object')


Unnamed: 0,model,n_neighbors,min_dist,min_cluster_size,ngram_range,min_df,representation,n_topics,umass,uci,npmi
0,H384,10,0.0,2,"(1, 1)",1,[KeyBERT],24,-9.963171,-5.107034,0.543443
1,H384,10,0.0,2,"(1, 1)",1,"[KeyBERT, MMR]",24,-9.963171,-5.107034,0.543443
2,H384,10,0.0,3,"(1, 1)",1,[KeyBERT],17,-6.639072,-3.142447,0.244403
3,H384,10,0.0,3,"(1, 1)",1,"[KeyBERT, MMR]",17,-6.639072,-3.142447,0.244403
4,H384,10,0.0,5,"(1, 1)",1,[KeyBERT],10,-4.615859,-2.05107,0.118986


In [None]:
# Min-max normalize the coherence metrics and rank configurations by their sum
results_df_H3 = pd.DataFrame(results)
print(results_df_H3.columns)
results_df_H3.head()

metrics = ['umass','uci','npmi']
normalized = results_df_H3[metrics].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

results_df_H3['sum_normalised'] = normalized.sum(axis=1)

results_df_H3.sort_values('sum_normalised', ascending=False).head(5)

Index(['model', 'n_neighbors', 'min_dist', 'min_cluster_size', 'ngram_range',
       'min_df', 'representation', 'n_topics', 'umass', 'uci', 'npmi'],
      dtype='object')


| | model | n_neighbors | min_dist | min_cluster_size | ngram_range | min_df | representation | n_topics | umass | uci | npmi | sum_normalised |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | H384 | 15 | 0.1 | 5 | (1, 1) | 1 | [KeyBERT] | 9 | -3.783597 | -1.733191 | 0.127563 | 2.025434 |
| 23 | H384 | 15 | 0.1 | 5 | (1, 1) | 1 | [KeyBERT, MMR] | 9 | -3.783597 | -1.733191 | 0.127563 | 2.025434 |
| 10 | H384 | 10 | 0.1 | 5 | (1, 1) | 1 | [KeyBERT] | 10 | -4.051509 | -1.645768 | 0.102008 | 1.957082 |
| 11 | H384 | 10 | 0.1 | 5 | (1, 1) | 1 | [KeyBERT, MMR] | 10 | -4.051509 | -1.645768 | 0.102008 | 1.957082 |
| 16 | H384 | 15 | 0.0 | 5 | (1, 1) | 1 | [KeyBERT] | 10 | -4.353723 | -1.908712 | 0.119112 | 1.866628 |
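
The ranking above depends on min-max scaling, so a small worked example may help. The sketch below applies the same idea to rows 0, 2 and 4 of the first results table using only the standard library. The helper name `min_max` is illustrative, and mapping a constant column to 0.0 is an assumption of this sketch (the pandas expression in the notebook would produce NaN in that case).

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# (umass, uci, npmi) for rows 0, 2 and 4 of the results table
rows = {
    0: (-9.963171, -5.107034, 0.543443),
    2: (-6.639072, -3.142447, 0.244403),
    4: (-4.615859, -2.051070, 0.118986),
}

ids = list(rows)
columns = list(zip(*rows.values()))            # one tuple per metric
scaled = [min_max(list(col)) for col in columns]

# Sum the three scaled metrics per row; all three are "higher is better".
scores = {rid: sum(col[i] for col in scaled) for i, rid in enumerate(ids)}
best = max(scores, key=scores.get)
print(scores, "best row:", best)
```

Because the scaling is relative to the rows being compared, the summed score of a configuration changes whenever configurations are added to or removed from the grid. It is a ranking device rather than an absolute quality measure.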


### **Parameter Tuning and Model Selection**

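Model diagnostics and selection takeaway for model H384.

Best-performing configuration (by summed min-max-normalized coherence):
- n_neighbors = 15
- min_dist = 0.1
- min_cluster_size = 5
- ngram_range = (1,1)
- min_df = 1
- Representation: KeyBERT (adding MMR leaves all three coherence scores unchanged in the rows shown)
- Resulting n_topics ≈ 9–10

The tuning results imply that increasing min_cluster_size from very small values materially improves UMass and UCI coherence. With n_neighbors = 10 and min_dist = 0.0, UMass rises from -9.96 at min_cluster_size = 2 to -4.62 at min_cluster_size = 5 and UCI rises from -5.11 to -2.05, although NPMI falls from 0.54 to 0.12 over the same range. A higher n_neighbors (15 vs. 10) stabilizes topic formation and reduces noise. Fewer topics (≈9–10) yield cleaner, more interpretable themes.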
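
One way to check statements like these is to average each metric per hyperparameter value. The sketch below does so for the six configurations whose scores are printed above (KeyBERT rows only), so it illustrates the calculation rather than summarizing the full grid. The helper `mean_by` is a hypothetical name introduced here.

```python
from collections import defaultdict
from statistics import mean

# (n_neighbors, min_dist, min_cluster_size, umass, npmi) from the printed rows
rows = [
    (10, 0.0, 2, -9.963171, 0.543443),
    (10, 0.0, 3, -6.639072, 0.244403),
    (10, 0.0, 5, -4.615859, 0.118986),
    (10, 0.1, 5, -4.051509, 0.102008),
    (15, 0.0, 5, -4.353723, 0.119112),
    (15, 0.1, 5, -3.783597, 0.127563),
]

def mean_by(rows, key_index, value_index):
    """Average the value column within each distinct value of the key column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[value_index])
    return {key: mean(values) for key, values in sorted(groups.items())}

umass_by_cluster_size = mean_by(rows, key_index=2, value_index=3)
npmi_by_cluster_size = mean_by(rows, key_index=2, value_index=4)

print("mean UMass by min_cluster_size:", umass_by_cluster_size)
print("mean NPMI by min_cluster_size:", npmi_by_cluster_size)
```

On the full results table, the same grouping could be done with `results_df_H3.groupby("min_cluster_size")[["umass", "uci", "npmi"]].mean()`.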


# **8. Model Training with the Optimal Parameters**

In [None]:
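# Note: the grid search ranks n_neighbors=15, min_dist=0.1 highest, whereas this
# run uses n_neighbors=10, min_dist=0.0 (10 topics at min_cluster_size=5).
umap_best = UMAP(
    n_neighbors=10,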
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42
)

hdbscan_best = HDBSCAN(
    min_cluster_size=5,
    min_samples=1,
    metric="euclidean",
    cluster_selection_method="leaf",
    prediction_data=True
)

vectorizer_best = CountVectorizer(
    tokenizer=lambda x: list(jieba.cut(x)),
    ngram_range=(1, 1),
    min_df=1,
    max_df=1.0,
    max_features=5000
)

representation_best = KeyBERTInspired()

final_topic_model = BERTopic(
    embedding_model=encode_H,
    umap_model=umap_best,
    hdbscan_model=hdbscan_best,
    vectorizer_model=vectorizer_best,
    representation_model=representation_best,
    calculate_probabilities=True,
    verbose=True
)

final_topics, final_probs = final_topic_model.fit_transform(documents)

2026-01-20 18:33:54,703 - BERTopic - Embedding - Transforming documents to embeddings.



2026-01-20 18:34:01,247 - BERTopic - Embedding - Completed ✓
2026-01-20 18:34:01,248 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2026-01-20 18:34:01,407 - BERTopic - Dimensionality - Completed ✓
2026-01-20 18:34:01,409 - BERTopic - Cluster - Start clustering the reduced embeddings
2026-01-20 18:34:01,424 - BERTopic - Cluster - Completed ✓
2026-01-20 18:34:01,430 - BERTopic - Representation - Fine-tuning topics using representation models.
2026-01-20 18:34:03,461 - BERTopic - Representation - Completed ✓


In [None]:
final_topic_info = final_topic_model.get_topic_info()
final_topic_info

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 5 | -1_憲政體制_職中華民國_政府_領國家勇 | [憲政體制, 職中華民國, 政府, 領國家勇, 守護國家, 立足, 壯志, 自中國, 鼎力相... | [我們立定目標，要讓臺灣成為無人機民主供應鏈的亞洲中心，也要發展下一個世代通訊的中低軌道衛星... |
| 1 | 0 | 15 | 0_民主自由_灣的_住民_我們 | [民主自由, 灣的, 住民, 我們, 評比, 全面, 人類, 人權, 回想, 今晚] | [臺灣位居「第一島鏈」的戰略位置，牽動著世界地緣政治的發展。早在1921年，蔣渭水先生就指出... |
| 2 | 1 | 9 | 1_執政黨_勤政_問政_公共政策 | [執政黨, 勤政, 問政, 公共政策, 政黨, 政府, 新政府, 全民, 創新思維, 作主] | [許多人將我和蕭美琴副總統的當選，解讀為「打破八年政黨輪替魔咒」。事實上，民主就是人民作主，... |
| 3 | 2 | 8 | 2_政府_灣社會_社會_令人尊敬 | [政府, 灣社會, 社會, 令人尊敬, 偉大國家, 壯大國家, 全力以赴, 全面性, 用行,... | [我也要邀請每一位國人，和我一起為孕育你我的母親臺灣喝采，我們一起用行動守護她、榮耀她，讓世... |
| 4 | 3 | 7 | 3_有歸國_人們_交通安全_是不是 | [有歸國, 人們, 交通安全, 是不是, 建仁前, 惡名, 我們, 大展身手, 符合, 更新] | [迎向未來，我們都期待一個更強韌的臺灣，可以妥善因應傳染病、天災地變等各類型災害，以及加速都... |
| 5 | 4 | 7 | 4_護主權_主權屬_政權_走向世界 | [護主權, 主權屬, 政權, 走向世界, 犧牲國, 或國際, 國民, 愛護國家, 並不會, ... | [各位國人同胞，當我們主張，中華民國臺灣的未來，由兩千三百萬人民共同決定。我們決定的未來，不... |
| 6 | 5 | 6 | 5_穩定_安控_統的_矽島 | [穩定, 安控, 統的, 矽島, 人工智慧, 面對, 促成, 第二次, 最大, 算力] | [我們也必須發展創新驅動的經濟模式，透過數位轉型，以及淨零轉型的雙軸力量，來協助中小企業升級... |
| 7 | 6 | 6 | 6_文攻武_文化_不可或缺_印太區域 | [文攻武, 文化, 不可或缺, 印太區域, 還有線, 完成, 各國, 所有, 國際間, 全世界] | [蕭美琴副總統、各位友邦的元首與貴賓、各國駐臺使節代表、現場所有的嘉賓，電視機前、還有線上收... |
| 8 | 7 | 6 | 7_活得長_互相合作_強大_平等 | [活得長, 互相合作, 強大, 平等, 還有, 協定, 發展部, 公共, 和諧, 食品安全] | [臺灣已經申請加入CPTPP，我們會積極爭取加入區域經濟整合；跟世界民主國家簽訂雙邊投資保障... |
| 9 | 8 | 5 | 8_執政_政黨_民主化_新政府 | [執政, 政黨, 民主化, 新政府, 新思維, 完成, 當我們, 立法院, 將是, 灣也] | [2024年的今天，臺灣在完成三次政黨輪替之後，第一次同一政黨連續執政，正式展開第三任期！臺... |


### **Interpretation of Tuned Topic Modeling Results**

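- Topic structure: The model converges to 10 substantive topics (topics 0–9, plus the outlier topic -1), each with clear semantic anchors and reasonable document counts, indicating stable clustering after tuning.

Grouping related topics gives six main themes:

1. Democracy & civil rights: 民主自由 (democracy and freedom), 人權 (human rights), 人民 (the people), 憲政 (constitutional government)
2. Governance & public policy execution: 執政 (governing), 政府 (government), 公共政策 (public policy), 行政協調 (administrative coordination)
3. National identity & sovereignty: 主權 (sovereignty), 國家 (nation), 走向世界 (going out to the world), 國際 (international)
4. Security & stability: 安控 (security control), 穩定 (stability), 韌性 (resilience), 科技/AI (technology/AI)
5. Social development & welfare: 社會 (society), 交通安全 (traffic safety), 民生 (people's livelihood)
6. Culture & diplomacy: 文化 (culture), 文攻武嚇 (verbal attacks and military intimidation), 印太區域 (Indo-Pacific region), 國際互動 (international engagement)

Bottom line: this tuned model is suitable for thematic analysis and discourse framing, rather than micro-level lexical inference.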

# **9. Topic Diagnostic after Tuning**

### **Topic Coherence**

In [None]:
# Top words per topic from the tuned model (outlier topic -1 excluded)
top_words_dict = {}

for topic_id, words_scores in final_topic_model.get_topics().items():
    if topic_id == -1:
        continue
    top_words_dict[topic_id] = [w for w, _ in words_scores]

In [None]:
vectorizer = final_topic_model.vectorizer_model

texts_cleaned_H384 = [
    vectorizer.build_tokenizer()(doc)
    for doc in documents
]

dictionary_H384 = Dictionary(texts_cleaned_H384)
dictionary_H384.filter_extremes(no_below=1, no_above=0.9)

In [None]:
topics_for_gensim_H384 = []
topic_ids_for_gensim_H384 = []

dictionary_vocab_H384 = set(dictionary_H384.token2id.keys())

for topic_id, words_scores in final_topic_model.get_topics().items():
    if topic_id == -1:
        continue

    words_H384 = [w for w, _ in words_scores][:10]
    words_H384 = [w for w in words_H384 if w in dictionary_vocab_H384]

    if len(words_H384) < 2:
        continue

    topic_ids_for_gensim_H384.append(topic_id)
    topics_for_gensim_H384.append(words_H384)

In [None]:
cm_umass_H384 = CoherenceModel(
    topics=topics_for_gensim_H384,
    texts=texts_cleaned_H384,
    dictionary=dictionary_H384,
    coherence="u_mass"
)

cm_uci_H384 = CoherenceModel(
    topics=topics_for_gensim_H384,
    texts=texts_cleaned_H384,
    dictionary=dictionary_H384,
    coherence="c_uci"
)

cm_npmi_H384 = CoherenceModel(
    topics=topics_for_gensim_H384,
    texts=texts_cleaned_H384,
    dictionary=dictionary_H384,
    coherence="c_npmi"
)

In [None]:
coherence_df_H384 = (
    pd.DataFrame({
        "topic_id": topic_ids_for_gensim_H384,
        "umass": cm_umass_H384.get_coherence_per_topic(),
        "uci": cm_uci_H384.get_coherence_per_topic(),
        "npmi": cm_npmi_H384.get_coherence_per_topic(),
    })
    .set_index("topic_id")
    .sort_index()
)

print(coherence_df_H384)

              umass        uci      npmi
topic_id                                
0        -19.625017 -14.329037 -0.487998
1        -16.565213 -15.701420 -0.538534
2        -19.410229 -15.196751 -0.522209
3        -18.828981 -15.112915 -0.524023
4        -14.003154 -13.516868 -0.416207
5        -15.629043 -14.769229 -0.485088
6        -20.892600 -15.772562 -0.558601
7        -20.676954 -15.624473 -0.549227
8        -17.700491 -15.056345 -0.488627
9        -14.608313 -13.077001 -0.417530


In [None]:
avg_umass_H384 = coherence_df_H384["umass"].mean()
avg_uci_H384 = coherence_df_H384["uci"].mean()
avg_npmi_H384 = coherence_df_H384["npmi"].mean()

print(f"[H384] Average UMass coherence: {avg_umass_H384:.4f}")
print(f"[H384] Average UCI coherence: {avg_uci_H384:.4f}")
print(f"[H384] Average NPMI-UCI coherence: {avg_npmi_H384:.4f}")


[H384] Average UMass coherence: -17.7940
[H384] Average UCI coherence: -14.8157
[H384] Average NPMI-UCI coherence: -0.4988
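
To make the negative NPMI values easier to read, here is a minimal sketch of NPMI for a single word pair, assuming probabilities are estimated from whole-document co-occurrence. This is a simplification: gensim's `c_npmi` estimates them from sliding windows, so its numbers will differ. The toy documents and the function name `npmi` are illustrative only, and the function assumes both words occur at least once.

```python
import math

def npmi(word_a, word_b, docs, eps=1e-12):
    """NPMI of two words from document-level co-occurrence.

    docs is a list of token sets; eps keeps the logarithm finite
    when the two words never appear in the same document.
    """
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    pmi = math.log((p_ab + eps) / (p_a * p_b))
    return pmi / -math.log(p_ab + eps)

# Toy corpus: four tokenized documents (hypothetical).
docs = [
    {"democracy", "freedom"},
    {"democracy", "freedom"},
    {"economy"},
    {"democracy"},
]

together = npmi("democracy", "freedom", docs)   # words that co-occur
apart = npmi("democracy", "economy", docs)      # words that never co-occur
print(f"co-occurring pair: {together:.3f}, disjoint pair: {apart:.3f}")
```

A pair that never co-occurs scores close to -1, so a topic average near -0.5 would be consistent with many of the top-word pairs rarely appearing together in the reference texts.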


### **Topic Similarity**

In [None]:
topics_H384 = {
    topic_id: words
    for topic_id, words in final_topic_model.get_topics().items()
    if topic_id != -1
}

In [None]:
from sklearn.preprocessing import normalize
import numpy as np

vocab_H384 = list(dictionary_vocab_H384)
vocab_index_H384 = {w: i for i, w in enumerate(vocab_H384)}

def topic_word_matrix_H384(topic_dict, vocab_index):
    mat = []
    topic_ids = []

    for topic_id, words in topic_dict.items():
        vec = np.zeros(len(vocab_index))
        for w, weight in words:
            if w in vocab_index:
                vec[vocab_index[w]] = weight
        mat.append(vec)
        topic_ids.append(topic_id)

    mat = normalize(np.array(mat), norm="l1")
    return mat, topic_ids

topic_matrix_H384, topic_ids_H384 = topic_word_matrix_H384(
    topics_H384, vocab_index_H384
)

In [None]:
from itertools import combinations
from scipy.spatial.distance import cosine

def hellinger(p, q):
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def kl_divergence(p, q, eps=1e-12):
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * np.log(p / q))

def js_divergence(p, q, eps=1e-12):
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m, eps) + 0.5 * kl_divergence(q, m, eps)

similarity_results_H384 = []

for i, j in combinations(range(topic_matrix_H384.shape[0]), 2):
    p = topic_matrix_H384[i]
    q = topic_matrix_H384[j]

    similarity_results_H384.append({
        "topic_1": topic_ids_H384[i],
        "topic_2": topic_ids_H384[j],
        "hellinger": hellinger(p, q),
        "kl": kl_divergence(p, q),
        "js": js_divergence(p, q),
        "cosine": 1 - cosine(p, q)
    })

similarity_df_H384 = pd.DataFrame(similarity_results_H384)
similarity_df_H384

| | topic_1 | topic_2 | hellinger | kl | js | cosine |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1.0 | 25.331068 | 0.693147 | 0.0 |
| 1 | 0 | 2 | 1.0 | 25.331068 | 0.693147 | 0.0 |
| 2 | 0 | 3 | 0.95054 | 22.776253 | 0.62631 | 0.092471 |
| 3 | 0 | 4 | 1.0 | 25.331068 | 0.693147 | 0.0 |
| 4 | 0 | 5 | 1.0 | 25.331068 | 0.693147 | 0.0 |
| 5 | 0 | 6 | 1.0 | 25.331068 | 0.693147 | 0.0 |
| 6 | 0 | 7 | 1.0 | 25.331068 | 0.693147 | 0.0 |
| 7 | 0 | 8 | 1.0 | 25.331068 | 0.693147 | 0.0 |
| 8 | 0 | 9 | 0.947465 | 22.514783 | 0.622347 | 0.104032 |
| 9 | 1 | 2 | 0.941213 | 22.696223 | 0.614226 | 0.12885 |


In [None]:
avg_hellinger_H384 = similarity_df_H384["hellinger"].mean()
avg_kl_H384 = similarity_df_H384["kl"].mean()
avg_js_H384 = similarity_df_H384["js"].mean()
avg_cosine_H384 = similarity_df_H384["cosine"].mean()

print(f"[H384] Average Hellinger distance: {avg_hellinger_H384:.4f}")
print(f"[H384] Average Kullback-Leibler divergence: {avg_kl_H384:.4f}")
print(f"[H384] Average Jensen-Shannon divergence: {avg_js_H384:.4f}")
print(f"[H384] Average cosine similarity: {avg_cosine_H384:.4f}")


[H384] Average Hellinger distance: 0.9904
[H384] Average Kullback-Leibler divergence: 24.8563
[H384] Average Jensen-Shannon divergence: 0.6803
[H384] Average cosine similarity: 0.0193
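
The recurring values in the pairwise table (Hellinger 1.0, JS 0.693147, KL ≈ 25.33) are what these formulas return when two topics share no top words. The sketch below reproduces that case with two hypothetical topics, each uniform over 10 different words, using pure-Python versions of the functions above. The uniform weights are an assumption made for illustration.

```python
import math

def hellinger(p, q):
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def kl_divergence(p, q, eps=1e-12):
    # Zeros are clipped to eps so the logarithm stays finite.
    p = [max(a, eps) for a in p]
    q = [max(b, eps) for b in q]
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def js_divergence(p, q, eps=1e-12):
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m, eps) + 0.5 * kl_divergence(q, m, eps)

# Two topics over a 20-word vocabulary with disjoint top-10 word sets.
p = [0.1] * 10 + [0.0] * 10
q = [0.0] * 10 + [0.1] * 10

h = hellinger(p, q)        # upper bound of the Hellinger distance
js = js_divergence(p, q)   # upper bound ln 2 (natural logarithm)
kl = kl_divergence(p, q)   # magnitude set by eps, here about ln(0.1 / 1e-12)
print(f"Hellinger={h:.4f}  JS={js:.6f}  KL={kl:.4f}")
```

For disjoint topics, then, the KL figure mostly reflects the clipping constant `eps` rather than the data. Hellinger, JS and cosine are bounded and easier to compare across models.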


### **Topic Diversity**

In [None]:
def topic_diversity_H384(final_topic_model, top_n=10):
    """
    Topic diversity = unique words / total words
    Computed from final_topic_model (exclude outlier topic -1)
    """
    topics_H384 = []

    for topic_id in final_topic_model.get_topics().keys():
        if topic_id == -1:
            continue

        topic_words = final_topic_model.get_topic(topic_id)
        if topic_words is None:
            continue

        words = [w for w, _ in topic_words[:top_n]]
        if len(words) == 0:
            continue

        topics_H384.append(words)

    if not topics_H384:
        return np.nan

    all_words = sum(topics_H384, [])
    return len(set(all_words)) / len(all_words)

diversity_H384 = topic_diversity_H384(final_topic_model, top_n=10)
print(f"[H384] Topic diversity (top-10): {diversity_H384:.4f}")


[H384] Topic diversity (top-10): 0.9300
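
To see which words pull diversity below 1.0, it can help to list the words that occur in more than one topic. The sketch below does this for topics 1 and 8, with top words copied from the topic table above. It is a standard-library illustration of the same ratio, not part of the notebook's pipeline.

```python
from collections import Counter

# Top-10 words of topics 1 and 8, copied from the topic info table.
topics = {
    1: ["執政黨", "勤政", "問政", "公共政策", "政黨", "政府", "新政府", "全民", "創新思維", "作主"],
    8: ["執政", "政黨", "民主化", "新政府", "新思維", "完成", "當我們", "立法院", "將是", "灣也"],
}

all_words = [word for words in topics.values() for word in words]

# Topic diversity = unique words / total words
diversity = len(set(all_words)) / len(all_words)

# Words that appear in more than one topic
repeated = {word: n for word, n in Counter(all_words).items() if n > 1}

print(f"diversity: {diversity:.2f}")
print("repeated words:", repeated)
```

Here the two governance-related topics share 政黨 (political party) and 新政府 (new government), which is the kind of overlap the diversity score penalizes.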


### **Interpretation of the New Results and Degradation After Tuning**


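1. Topic coherence: worse after tuning

- The tuned model averages -17.79 (UMass), -14.82 (UCI), and -0.50 (NPMI) across its 10 topics.
- Political speech data is inherently repetitive but context-shifting, which penalizes coherence metrics that rely on word co-occurrence within documents or local windows.
- KeyBERT-style representations favor semantic coverage over tight co-occurrence, further depressing coherence scores.


2. Topic separation: essentially unchanged (still strong)


- Parameter tuning did not collapse clusters or increase overlap.
- Topics remain well isolated in top-word space: the average cosine similarity is 0.0193, and the average Hellinger distance (0.9904) and Jensen-Shannon divergence (0.6803) are close to their maxima of 1 and ln 2 ≈ 0.693.


3. Topic diversity: better after tuning

- Topic diversity reaches 0.93 on the top-10 words of each topic.
- Fewer topics and larger clusters reduce keyword reuse.
- Key terms are distributed more cleanly across topics.
- Less fragmentation means fewer near-duplicate topics.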



## **Conclusion**

Overall, this project shows a clear trade-off in topic modeling.
Before tuning, the model had more topics and slightly better coherence scores, but the topics were scattered and harder to understand. After tuning (H384), the number of topics became smaller and the themes were clearer and more stable. Even though coherence scores dropped, the topics stayed well separated and became less repetitive.

This means that coherence scores alone are not enough to judge topic quality, especially for political speeches where the same words are used in different contexts. The tuned model works better for understanding main themes and overall narratives, even if the numbers look worse. In short, the final model is more useful for analysis, while the earlier model mainly looks better on metrics.
