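# YouTube Trending Video Topic Modeling (BERTopic) Pipeline

This pipeline focuses on Chinese-language analysis. It uses `shibing624/text2vec-base-chinese` to generate embeddings
and outputs the topic structure (Hierarchy / Bubble) and a Sankey diagram (Topic ↔ Search keyword).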

In [46]:
%load_ext autoreload
%autoreload 2
from path_setup import setup_project_root
root = setup_project_root()

import os
import sys

import hdbscan
import jieba
import pandas as pd
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Local modules
from etl_showcase.infrastructure.cleaning.data_preprocessor import (
    google_sheet_to_dataframe, 
    preprocess_dataframe,
)
from etl_showcase.infrastructure.cleaning.text_cleaner import (
    remove_all_punctuation
)
from etl_showcase.infrastructure.cleaning.text_tokenizer import (
    jieba_tokenizer
)
from etl_showcase.infrastructure.reporting.topic_visualizer import (
    visualize_topics_bubble, 
    visualize_hierarchical_clustering, 
    visualize_sankey, 
    save_plotly_html,
)
from etl_showcase.config.youtube import (
    YOUTUBE_SPREADSHEET_ID,
    YOUTUBE_SEARCH_VIDEOS_FUNCTION_NAME,
    VIDEO_COLUMN_ORDER,
)
from etl_showcase.infrastructure.datasource.google_sheets_api import (
    write_secret_json,
    delete_secret_json,
    get_full_google_sheet,
)

def data_source_text(keywords: str) -> str:
    """Return the data-source footnote (an HTML snippet) shown under each chart."""
    return f"資料來源：蒐集 YouTube 上相關關鍵字的前 300 筆影片標題與描述，並使用 BERTopic 生成熱門議題。<br />本次搜尋關鍵字為：{keywords}。"

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Using project root: C:\My data\0.change jobs\data_science_practice


In [40]:
youtube_videos = [[]]
write_secret_json()
try:
    youtube_videos = get_full_google_sheet(
        spreadsheet_id=YOUTUBE_SPREADSHEET_ID,
        sheet_name=YOUTUBE_SEARCH_VIDEOS_FUNCTION_NAME
    )
finally:
    delete_secret_json()
df_raw = google_sheet_to_dataframe(youtube_videos, VIDEO_COLUMN_ORDER)
print("Loaded rows:", len(df_raw))
df_raw.head()

Loaded rows: 3018


Unnamed: 0,Category,Search keyword,Video ID,Video title,Video description,Publish datetime,Channel name,Channel ID,Thumbnail URL
0,心理領域,心理學,HzcS6Vv4iFQ,🦡心理学上有个词叫：蜜獾效应｜你越是不怕失去，世界越拿你没办法,视频简介（中英文双语） 简介（中文）： 在这个看似和平，实则暗潮汹涌的世界里，你越是在意、越...,2025-08-09T12:00:32Z,超级语录-Quotes,UCOrNp79J26wFtULRo9T17ag,https://i.ytimg.com/vi/HzcS6Vv4iFQ/hqdefault.jpg
1,心理領域,心理學,fO31IpWhyy8,第一句話就讓人聽喜歡聽下去！高情商說話技巧！非暴力溝通的心理學【人際心理學】| 維思維,想學好英文，但面對開口說英文的障礙？試看看： #speaker 專屬下載連結➡️ https...,2025-08-23T12:01:14Z,維思維WeisWay,UCcU6CC2Gkc18aBUfEtdjaAA,https://i.ytimg.com/vi/fO31IpWhyy8/hqdefault.jpg
2,心理領域,心理學,boEjQfZD_DE,🐜 心理学上有个词叫：蚂蚁效应｜所有的积累都在别人看不见的时候发生,成为此频道的会员即可获享以下福利：https://www.youtube.com/chann...,2025-08-21T23:05:07Z,超级语录-Quotes,UCOrNp79J26wFtULRo9T17ag,https://i.ytimg.com/vi/boEjQfZD_DE/hqdefault.jpg
3,心理領域,心理學,GRWaSdjufdU,荣格心理学，中年后如何找回自己？,歡迎訂閱我們的頻道： https://www.youtube.com/@Reading_Wi...,2025-08-17T01:17:38Z,閱讀智慧,UChg0dXcajouLjSoJf5F8oOg,https://i.ytimg.com/vi/GRWaSdjufdU/hqdefault.jpg
4,心理領域,心理學,oWMXCziVF6E,這不是童話，是一趟心理自救！憂鬱和心理諮商的真相《蛤蟆先生去看心理師》EP1｜自繪動畫版｜心...,如果你也準備旅行，可以試試Saily，出國旅行真的不能沒網路！ 下載Saily 應用程式或前...,2025-08-17T04:17:01Z,維思維WeisWay,UCcU6CC2Gkc18aBUfEtdjaAA,https://i.ytimg.com/vi/oWMXCziVF6E/hqdefault.jpg
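
The cell above writes the Google Sheets credentials to disk, reads the sheet, and deletes the file in a `finally` block so the secret does not linger after a failure. The same pattern can be wrapped in a context manager. This is a minimal sketch, not the project's implementation: `temporary_secret_file` and its `credentials` argument are hypothetical names, and the real `write_secret_json` / `delete_secret_json` helpers may locate and name the file differently.

```python
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_secret_file(credentials: dict):
    """Write credentials to a temporary JSON file and always delete it afterwards.

    Hypothetical helper: it only illustrates the write / use / delete pattern.
    """
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(credentials, f)
        yield path
    finally:
        # Runs on success and on error, like the try/finally in the cell.
        if os.path.exists(path):
            os.remove(path)


# Usage: the file exists only inside the `with` block.
with temporary_secret_file({"type": "service_account"}) as secret_path:
    print("secret available at:", secret_path)
```

Compared with calling the two helpers by hand, the context manager makes it harder to forget the cleanup step when the loading code grows.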


In [5]:
# ==== Preprocessing ====
# Adding Search keyword to the modeled text makes the content too repetitive, so an empty column is merged in its place
df_raw['Empty keywords'] = ''
df = preprocess_dataframe(
    df_raw, 
    target_variant="zh-TW",
    merge_fields=["Video title", "Video description", "Empty keywords"]
)
df = df.dropna(subset=["text_translated"]).reset_index(drop=True)
# For this analysis, drop commas, periods, and other punctuation even though they carry some meaning
df['text_translated'] = df['text_translated'].apply(remove_all_punctuation)
print(df.head()['text_translated'])

開始合併文字欄位，共 3018 筆資料
開始清理雜訊
開始批次翻譯
偵測所有文本所屬語言
分離目標語言文本和非目標語言文本


批量翻譯非目標語言文本 (共 1418 筆):   0%|          | 0/29 [00:00<?, ?it/s]

開始統一中文變體（繁體/簡體）
0    心理學上有個詞叫 蜜獾效應 你越是不怕失去 世界越拿你沒辦法 視頻簡介 中英文雙語 簡介 中...
1    第一句話就讓人聽喜歡聽下去 高情商說話技巧 非暴力溝通的心理學 人際心理學 維思維 想學好英...
2    心理學上有個詞叫 螞蟻效應 所有的積累都在別人看不見的時候發生 成爲此頻道的會員即可獲享以下...
3                            榮格心理學 中年後如何找回自己 歡迎訂閱我們的頻道
4    這不是童話 是一趟心理自救 憂鬱和心理諮商的真相 蛤蟆先生去看心理師 EP1 自繪動畫版 心...
Name: text_translated, dtype: object
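
In the printed sample, punctuation such as `，`, `｜`, and `《》` has been replaced by spaces. The source of `remove_all_punctuation` is not shown in this notebook, so the following is only a sketch of one way to get a similar result with the standard library. The function name `strip_punctuation` is hypothetical, and treating every Unicode punctuation (`P*`) and symbol (`S*`) character as removable is an assumption, which also drops emoji.

```python
import re
import unicodedata


def strip_punctuation(text: str) -> str:
    """Replace punctuation and symbol characters with spaces, then collapse whitespace."""
    chars = [
        # Unicode general categories starting with "P" are punctuation and
        # those starting with "S" are symbols (including most emoji).
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch
        for ch in text
    ]
    return re.sub(r"\s+", " ", "".join(chars)).strip()


print(strip_punctuation("榮格心理學，中年後如何找回自己？"))
```

Replacing with a space rather than deleting keeps phrases on either side of a separator apart, which matters because the tokenizer later sees the text as one string.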


In [52]:
embedding_model = SentenceTransformer("shibing624/text2vec-base-chinese")
vectorizer = CountVectorizer(tokenizer=jieba_tokenizer, lowercase=False, min_df=2)
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,   # minimum number of samples required to form a cluster
)

# ==== Model each Category separately and generate its own reports ====
grouped_by_category = df.groupby('Category')
for category_name, group_df in grouped_by_category:
    print(f"--- 處理領域: {category_name} ---")
    
    docs = group_df["text_translated"].tolist()
    if not docs:
        print(f"領域 '{category_name}' 沒有資料，跳過。")
        continue

    # ==== Shared variables and setup for all charts ====
    # Copy the DataFrame to avoid SettingWithCopyWarning
    group_df = group_df[['Category', 'Search keyword', 'merged_text', 'cleaned_text', 'text_translated']].copy()
    search_keywords = '、'.join(group_df['Search keyword'].unique())

    # Train a new model on the data of each loop iteration
    topic_model = BERTopic(
        embedding_model=embedding_model,
        language="chinese",
        vectorizer_model=vectorizer,
        hdbscan_model=clusterer,
        verbose=True
    )
    topics, probs = topic_model.fit_transform(docs)
    group_df["topic_id"] = topics
    # Drop outlier documents (topic_id == -1), which were not assigned to any topic
    group_df = group_df[group_df['topic_id'] > -1]

    topic_info = topic_model.get_topic_info()    
    topic_info = topic_info[topic_info['Topic'] > -1]

    docs_dir = os.path.join(os.getcwd(), '..', 'docs', category_name)
    os.makedirs(docs_dir, exist_ok=True)
    bubble_path = os.path.join(docs_dir, "topics_bubble.html")
    hier_path = os.path.join(docs_dir, "topics_hierarchy.html")
    sankey_path = os.path.join(docs_dir, "topic_keyword_sankey.html")

    print(len(topic_info))
    if len(topic_info) < 3:
        print('領域數量太少，無法繪圖')

        warning_html = """
        <!DOCTYPE html>
        <html>
        <head>
            <meta charset="UTF-8">
            <title>資料不足</title>
            <style>
                body { font-family: sans-serif; display: flex; justify-content: center; align-items: center; height: 100vh; }
                p { font-size: 1rem; color: gray; }
            </style>
        </head>
        <body>
            <p>該領域分析出來的議題數量過少，無法視覺化。</p>
        </body>
        </html>
        """
        for path in [bubble_path, hier_path, sankey_path]:
            with open(path, "w", encoding="utf-8") as f:
                f.write(warning_html)
        print("Saved:", bubble_path, hier_path, sankey_path)
        
        continue

    # ==== Bubble chart ====
    fig_bubble = visualize_topics_bubble(topic_model)
    fig_bubble.update_layout(title="")
    save_plotly_html(fig_bubble, bubble_path)
    
    # Prepare the topic ID vs. topic keywords lookup table for the HTML page
    topic_table_df = pd.DataFrame({
        'Topic': topic_info['Topic'],
        'Keywords': topic_info['Name'].apply(lambda x: ' | '.join(x.split('_')[:5]))
    })
    table_data = go.Table(
        header=dict(values=["議題代號", "議題內容"],
                    fill_color='paleturquoise',
                    align='left'),
        cells=dict(values=[topic_table_df.Topic, topic_table_df.Keywords],
                    fill_color='lavender',
                    align='left')
    )
    # Prepare the Bubble Chart data for the HTML page
    bubble_data = fig_bubble.to_json()
    # Assemble and write the new HTML content
    html_content = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <title>Topic Visualization</title>
        <script src="https://cdn.plot.ly/plotly-2.35.2.min.js"></script>
        <style>
            body {{ font-family: sans-serif; }}
            .container {{ display: flex; width: 100%; height: 80vh; }}
            #bubble-chart {{ flex: 2; border: 1px solid #ccc; box-sizing: border-box; }}
            #topic-table {{ flex: 1; overflow-y: auto; box-sizing: border-box; }}
            .hover-cell {{ cursor: pointer; font-weight: bold; text-decoration: underline; }}
            h3.title {{ margin-block-end: .1em; font-weight: 320; }}
            p.subtitle {{ margin-block-start: 0em; margin-block-end: .4em; font-size: .9rem; font-weight: 300; }}
            p.source {{ font-size: .8rem; font-weight: 300; color: gray; }}
        </style>
    </head>
    <body>
        <h3 class="title">熱門議題聲量與相似度分佈（Bubble Chart）</h3>
        <p class="subtitle">議題氣泡愈大，代表相關影片數量愈多；議題氣泡間愈近，代表相似性愈高。</p>
        <div class="container">
            <div id="bubble-chart"></div>
            <div id="topic-table"></div>
        </div>
        <div><p class="source">{data_source_text(search_keywords)}</p></div>
        <script>
            document.addEventListener('DOMContentLoaded', function() {{
                var bubbleData = {bubble_data};
                var tableData = {table_data.to_json()};
                
                // Fix for topic_id being null
                for (var i = 0; i < bubbleData.data.length; i++) {{
                    if (bubbleData.data[i].topic_id === undefined) {{
                        bubbleData.data[i].topic_id = bubbleData.data[i].name;
                    }}
                }}
    
                Plotly.newPlot('bubble-chart', bubbleData.data, bubbleData.layout);
    
                var tableDiv = document.getElementById('topic-table');
                Plotly.newPlot(tableDiv, [tableData], {{}});
    
                tableDiv.on('plotly_hover', function(data) {{
                    if (data.points.length > 0) {{
                        var hoveredTopicId = data.points[0].cells.values[0][data.points[0].pointIndex];
                        var bubblePlot = document.getElementById('bubble-chart');
                        var traceIndex = bubblePlot.data.findIndex(d => d.topic_id === hoveredTopicId);
                        
                        if (traceIndex !== -1) {{
                            Plotly.restyle(bubblePlot, {{
                                'marker.line.width': [3],
                                'marker.line.color': ['black']
                            }}, [traceIndex]);
                        }}
                    }}
                }});
    
                tableDiv.on('plotly_unhover', function(data) {{
                    if (data.points.length > 0) {{
                        var unhoveredTopicId = data.points[0].cells.values[0][data.points[0].pointIndex];
                        var bubblePlot = document.getElementById('bubble-chart');
                        var traceIndex = bubblePlot.data.findIndex(d => d.topic_id === unhoveredTopicId);
                        
                        if (traceIndex !== -1) {{
                            Plotly.restyle(bubblePlot, {{
                                'marker.line.width': [1],
                                'marker.line.color': ['rgba(0,0,0,0)']
                            }}, [traceIndex]);
                        }}
                    }}
                }});
            }});
        </script>
    </body>
    </html>
    """
    with open(bubble_path, "w", encoding="utf-8") as f:
        f.write(html_content)
        
    # ==== Hierarchy chart ====
    fig_h = visualize_hierarchical_clustering(topic_model)
    fig_h.update_layout(
        title="熱門議題層級關聯（Hierarchical Clustering）",
        annotations=[dict(
            text=data_source_text(search_keywords),
            x=0.5, y=-0.15 if len(topic_info)<30 else -0.1,
            xref="paper", yref="paper",
            showarrow=False,
            font=dict(size=12, color="gray")
        )]
    )
    save_plotly_html(fig_h, hier_path)
    
    # ==== Sankey diagram ====
    # Build the topic labels shown in the chart, formatted as "ID_keyword1_keyword2..."
    topic_label_map = {}
    for _, row in topic_info.iterrows():
        topic_id = row['Topic']
        keywords = '_'.join(row['Name'].split('_')[1:])
        topic_label_map[topic_id] = f"{topic_id}_{keywords}"
    group_df["topic_label"] = group_df["topic_id"].map(topic_label_map)    
    fig_sk = visualize_sankey(
        group_df, 
        topic_column="topic_label", 
        keyword_column="Search keyword",
        topic_prefix="",
    )
    fig_sk.update_layout(
        title=go.layout.Title(
            text='熱門議題與搜尋關鍵字之關聯（Sankey Diagram）<br /><sub>左側為熱門議題，右側為搜尋關鍵字</sub>'
        ),
        annotations=[dict(
            text=data_source_text(search_keywords),
            x=0.5, y=-0.1,
            xref="paper", yref="paper",
            showarrow=False,
            font=dict(size=12, color="gray")
        )],
    )
    save_plotly_html(fig_sk, sankey_path)
    
    # ==== Save the model and the labeled data ====
    topic_model.save(os.path.join(docs_dir, "bertopic_model"), serialization="safetensors")
    group_df.to_csv(os.path.join(docs_dir, "docs_with_topics.csv"), index=False, encoding="utf-8-sig")
    
    print("Saved:", bubble_path, hier_path, sankey_path)

2025-09-04 05:54:28,244 - BERTopic - Embedding - Transforming documents to embeddings.


--- 處理領域: 心理領域 ---


Batches:   0%|          | 0/29 [00:00<?, ?it/s]

2025-09-04 05:54:52,086 - BERTopic - Embedding - Completed ✓
2025-09-04 05:54:52,086 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-04 05:54:52,921 - BERTopic - Dimensionality - Completed ✓
2025-09-04 05:54:52,921 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-04 05:54:52,931 - BERTopic - Cluster - Completed ✓
2025-09-04 05:54:52,931 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-09-04 05:54:53,170 - BERTopic - Representation - Completed ✓
2025-09-04 05:54:53,565 - BERTopic - Embedding - Transforming documents to embeddings.


41
Saved: C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\心理領域\topics_bubble.html C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\心理領域\topics_hierarchy.html C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\心理領域\topic_keyword_sankey.html
--- 處理領域: 社會弱勢領域 ---


Batches:   0%|          | 0/29 [00:00<?, ?it/s]

2025-09-04 05:55:26,165 - BERTopic - Embedding - Completed ✓
2025-09-04 05:55:26,165 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-04 05:55:26,985 - BERTopic - Dimensionality - Completed ✓
2025-09-04 05:55:26,985 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-04 05:55:27,008 - BERTopic - Cluster - Completed ✓
2025-09-04 05:55:27,010 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-09-04 05:55:27,343 - BERTopic - Representation - Completed ✓
2025-09-04 05:55:27,820 - BERTopic - Embedding - Transforming documents to embeddings.


49
Saved: C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\社會弱勢領域\topics_bubble.html C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\社會弱勢領域\topics_hierarchy.html C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\社會弱勢領域\topic_keyword_sankey.html
--- 處理領域: 社會領域 ---


Batches:   0%|          | 0/19 [00:00<?, ?it/s]

2025-09-04 05:55:45,438 - BERTopic - Embedding - Completed ✓
2025-09-04 05:55:45,451 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-04 05:55:45,873 - BERTopic - Dimensionality - Completed ✓
2025-09-04 05:55:45,873 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-04 05:55:45,881 - BERTopic - Cluster - Completed ✓
2025-09-04 05:55:45,881 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-09-04 05:55:46,058 - BERTopic - Representation - Completed ✓
2025-09-04 05:55:46,360 - BERTopic - Embedding - Transforming documents to embeddings.


41
Saved: C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\社會領域\topics_bubble.html C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\社會領域\topics_hierarchy.html C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\社會領域\topic_keyword_sankey.html
--- 處理領域: 科技領域 ---


Batches:   0%|          | 0/19 [00:00<?, ?it/s]

2025-09-04 05:56:03,704 - BERTopic - Embedding - Completed ✓
2025-09-04 05:56:03,704 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-09-04 05:56:04,183 - BERTopic - Dimensionality - Completed ✓
2025-09-04 05:56:04,184 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-09-04 05:56:04,195 - BERTopic - Cluster - Completed ✓
2025-09-04 05:56:04,197 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-09-04 05:56:04,374 - BERTopic - Representation - Completed ✓


27
Saved: C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\科技領域\topics_bubble.html C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\科技領域\topics_hierarchy.html C:\My data\0.change jobs\data_science_practice\etl_showcase\application\..\docs\科技領域\topic_keyword_sankey.html
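
The Sankey step links each topic label to the search keywords of the videos assigned to it. The source of `visualize_sankey` is not shown here, so the following is only a sketch of how such links could be derived from (topic label, search keyword) pairs. The function name `build_sankey_links` and the sample pairs are hypothetical, and the real helper may order or weight the links differently.

```python
from collections import Counter


def build_sankey_links(pairs):
    """Turn (topic_label, search_keyword) pairs into Sankey node and link lists.

    Topics come first in the node list, followed by keywords; each link's value
    is the number of documents that share that topic and keyword.
    """
    counts = Counter(pairs)
    topics = sorted({topic for topic, _ in counts})
    keywords = sorted({keyword for _, keyword in counts})
    nodes = topics + keywords

    topic_index = {topic: i for i, topic in enumerate(topics)}
    keyword_index = {kw: len(topics) + i for i, kw in enumerate(keywords)}

    source, target, value = [], [], []
    for (topic, keyword), n in sorted(counts.items()):
        source.append(topic_index[topic])
        target.append(keyword_index[keyword])
        value.append(n)
    return nodes, {"source": source, "target": target, "value": value}


# Made-up sample: one pair per document.
sample_pairs = [
    ("0_a_b", "心理學"),
    ("0_a_b", "心理學"),
    ("1_c_d", "心理學"),
    ("1_c_d", "諮商"),
]
nodes, links = build_sankey_links(sample_pairs)
print(nodes)
print(links)
```

The two return values have the shape Plotly expects, so they can be passed as `go.Sankey(node=dict(label=nodes), link=links)`.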
