<a href="https://colab.research.google.com/github/ayakow1/ttic31220-japanparliament-analysis/blob/main/BERTopic_DTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic DTM analysis for parties

## Setup

In [3]:
%%capture
!pip install bertopic

In [4]:
%%capture
!pip install googletrans==3.1.0a0

In [1]:
from googletrans import Translator

In [27]:
import pandas as pd
from typing import List
import plotly.graph_objects as go
from sklearn.preprocessing import normalize


def visualize_topics_over_time(topic_model,
                               topics_over_time: pd.DataFrame,
                               top_n_topics: int = None,
                               topics: List[int] = None,
                               normalize_frequency: bool = False,
                               custom_labels: bool = False,
                               title: str = "<b>Topics over Time</b>",
                               topic_lst: List[str] = None, 
                               width: int = 1250,
                               height: int = 450) -> go.Figure:
    """ Visualize topics over time

    Arguments:
        topic_model: A fitted BERTopic instance.
        topics_over_time: The topics you would like to be visualized with the
                          corresponding topic representation
        top_n_topics: To visualize the most frequent topics instead of all
        topics: Select which topics you would like to be visualized
        normalize_frequency: Whether to normalize each topic's frequency individually
        custom_labels: Whether to use custom topic labels that were defined using 
                       `topic_model.set_topic_labels`.
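        title: Title of the plot.
        topic_lst: Optional legend names, one per plotted topic in ascending
                   topic order. If omitted, the top word of each topic is
                   machine-translated with googletrans and used instead.
        width: The width of the figure.
        height: The height of the figure.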

    Returns:
        A plotly.graph_objects.Figure including all traces

    Examples:

    To visualize the topics over time, simply run:

    ```python
    topics_over_time = topic_model.topics_over_time(docs, timestamps)
    topic_model.visualize_topics_over_time(topics_over_time)
    ```

    Or if you want to save the resulting figure:

    ```python
    fig = topic_model.visualize_topics_over_time(topics_over_time)
    fig.write_html("path/to/file.html")
    ```
    """
    colors = ['#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99']


    # Select topics based on top_n and topics args
    freq_df = topic_model.get_topic_freq()
    freq_df = freq_df.loc[freq_df.Topic != -1, :]
    if topics is not None:
        selected_topics = list(topics)
    elif top_n_topics is not None:
        selected_topics = sorted(freq_df.Topic.to_list()[:top_n_topics])
    else:
        selected_topics = sorted(freq_df.Topic.to_list())

    # Prepare data
    if topic_model.custom_labels_ is not None and custom_labels:
        topic_names = {key: topic_model.custom_labels_[key + topic_model._outliers] for key, _ in topic_model.topic_labels_.items()}
    else:
        topic_names = {key: value[:40] + "..." if len(value) > 40 else value
                       for key, value in topic_model.topic_labels_.items()}
    topics_over_time["Name"] = topics_over_time.Topic.map(topic_names)
    data = topics_over_time.loc[topics_over_time.Topic.isin(selected_topics), :].sort_values(["Topic", "Timestamp"])

    translator = Translator()
    # Add traces
    fig = go.Figure()
    for index, topic in enumerate(data.Topic.unique()):
        trace_data = data.loc[data.Topic == topic, :]
        if not topic_lst:
            # Default names look like "0_word1_word2_..."; translate only the
            # top word instead of translating every word and discarding the rest.
            top_word = trace_data.Name.values[0].split('_')[1]
            topic_name = translator.translate(top_word).text
        else:
            topic_name = topic_lst[index]
        words = trace_data.Words.values
        if normalize_frequency:
            y = normalize(trace_data.Frequency.values.reshape(1, -1))[0]
        else:
            y = trace_data.Frequency
        fig.add_trace(go.Scatter(x=trace_data.Timestamp, y=y,
                                 mode='lines',
                                 marker_color=colors[index % 5],
                                 hoverinfo="text",
                                 name=topic_name,
                                 hovertext=[f'<b>Topic {topic}</b><br>Words: {word}' for word in words]))

    # Styling of the visualization
    fig.update_xaxes(showgrid=True)
    fig.update_yaxes(showgrid=True)
    fig.update_layout(
        yaxis_title="Normalized Frequency" if normalize_frequency else "Frequency",
        title={
            'text': f"{title}",
            'y': .95,
            'x': 0.40,
            'xanchor': 'center',
            'yanchor': 'top',
            'font': dict(
                size=22,
                color="Black")
        },
        template="simple_white",
        width=width,
        height=height,
        hoverlabel=dict(
            bgcolor="white",
            font_size=16,
            font_family="Rockwell"
        ),
        legend=dict(
            title="<b>Global Topic Representation",
        )
    )
    return fig
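
The `normalize_frequency` option relies on scikit-learn's `normalize`, which defaults to the L2 norm. The following is a minimal pure-Python sketch of that scaling, not the library's implementation; the helper name `l2_normalize` and the sample counts are made up for illustration.

```python
import math


def l2_normalize(values):
    """Scale a sequence so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(v * v for v in values))
    if norm == 0:
        # An all-zero series cannot be scaled; return it unchanged.
        return list(values)
    return [v / norm for v in values]


# Hypothetical yearly counts for two topics of very different size.
big_topic = [300, 400]
small_topic = [3, 4]

# Both series end up with the same shape after normalization.
print(l2_normalize(big_topic))
print(l2_normalize(small_topic))
```

Because each topic is scaled separately, normalized curves show when a topic peaks but no longer show which topic is larger.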

## Restart the Notebook
Installing BERTopic updates some packages that were already loaded. Restart the notebook before continuing so that the updated versions are used.

From the Menu:

Runtime → Restart Runtime

## Data

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [4]:
# import gensim
# import gensim.corpora as corpora
# from gensim.models.coherencemodel import CoherenceModel
import sqlite3
import pandas as pd
from bertopic import BERTopic
import numpy as np


In [5]:
# Import Data
use_raw = True
if use_raw:
    name = "raw_speech"
else:
    name = "speech"

## Liberal Democratic Party 自由民主党

### Topic modeling

In [6]:
conn = sqlite3.connect(f'/content/drive/MyDrive/議事録/{name}.db')
all = pd.read_sql_query(f'''SELECT * FROM {name} WHERE house ='衆議院' AND party like '%自由民主党%' ''', conn)
docs = all['speech'].to_list()
conn.close()


In [None]:
# all = all[all['speech_date']>='2012-01-01']
all.head()

,id,house,committee,vol,speech_date,speaker,party,speech,morpheme
0,115405254X00320020129_003,衆議院,本会議,第3号,2002-01-29,津島雄二,自由民主党,この 補正予算 二 案 は 、 去る 一月二十一日 本 委員会 に 付託 さ れ 、 一月二...,連体詞* 名詞固有名詞 名詞数 名詞接尾 助詞係助詞 記号読点 連体詞* 名詞固有名詞 接頭...
1,115405254X00320020129_010,衆議院,本会議,第3号,2002-01-29,坂本剛二,自由民主党,本案 は 、 先般 、 政府 により 策定 さ れ た 緊急 対応 プログラム において 推...,名詞一般 助詞係助詞 記号読点 名詞副詞可能 記号読点 名詞一般 助詞格助詞 名詞サ変接続 ...
2,115405261X00520020128_201,衆議院,予算委員会,第5号,2002-01-28,北村直人,自由民主党,両 案 に対する 質疑 は 終局 し 、 直ちに 採決 さ れん こと を 望み ます 。,接頭詞名詞接続 名詞一般 助詞格助詞 名詞サ変接続 助詞係助詞 名詞一般 動詞自立 記号読点...
3,115404376X00220020128_301,衆議院,財務金融委員会,第2号,2002-01-28,山本幸三,自由民主党,ただいま 議題 と なっ て おり ます 日本電信電話株式会社 の 株式 の 売 払 収入 ...,感動詞* 名詞一般 助詞格助詞 動詞自立 助詞接続助詞 動詞非自立 助動詞* 名詞固有名詞 ...
4,115404376X00120020125_147,衆議院,財務金融委員会,第1号,2002-01-25,増原義剛,自由民主党,先 ほど 財務大臣 より お話 ござい まし た 、 締めて 九 十 四 本 で ござい ま...,名詞一般 助詞副助詞 名詞固有名詞 助詞格助詞 名詞サ変接続 助動詞* 助動詞* 助動詞* ...


In [None]:
docs = all['speech'].to_list()
len(docs)

92755

In [None]:
topic_model = BERTopic(language="multilingual",
                       verbose=True,
                       embedding_model="paraphrase-multilingual-mpnet-base-v2") #https://tech.yellowback.net/posts/sentence-transformers-japanese-models

topics, probs = topic_model.fit_transform(docs) # Input is list type
topic_model.save(f"/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_jimin")
# https://github.com/UKPLab/sentence-transformers/issues/1915 ("multilingual" raises an error)

Batches:   0%|          | 0/2899 [00:00<?, ?it/s]

2023-05-18 19:17:58,838 - BERTopic - Transformed documents to Embeddings
2023-05-18 19:20:08,705 - BERTopic - Reduced dimensionality
2023-05-18 19:20:17,064 - BERTopic - Clustered reduced embeddings
  self._set_arrayXarray(i, j, x)


In [None]:
freq = topic_model.get_topic_info(); freq.head(30)

,Topic,Count,Name
0,-1,49173,-1_です_という_ない_こと
1,0,2074,0_銀行_金融機関_金融_金利
2,1,1344,1_教育_学校_教員_教育基本法
3,2,1058,2_年金_社会保障_保険料_給付
4,3,1054,3_感謝_皆様_質問_機会
5,4,983,4_北朝鮮_韓国_拉致問題_拉致
6,5,849,5____
7,6,837,6_道路_高速道路_国道_交通
8,7,613,7_雇用_労働者_労働_派遣
9,8,485,8_合併_市町村_地方_道州制


Topic -1 collects all outliers and should typically be ignored. Next, let's look at the most frequent topic that was generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic

[('銀行', 0.006093827701037269),
 ('金融機関', 0.004965408647422861),
 ('金融', 0.004804234079759255),
 ('金利', 0.004027780144501547),
 ('国債', 0.003811807440085479),
 ('デフレ', 0.003757975705095921),
 ('金融庁', 0.0032315158430564163),
 ('日銀', 0.003178906341293303),
 ('株式', 0.0030534388106305843),
 ('融資', 0.0029848166101351468)]
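
Since topic -1 is ignored, it can help to know how much of the corpus it covers. A quick sketch using only the two counts printed in the outputs above (92,755 documents, 49,173 of them in topic -1); the variable names are illustrative.

```python
# Counts copied from the outputs above.
total_docs = 92755      # len(docs)
outlier_docs = 49173    # Count for topic -1 in get_topic_info()

outlier_share = outlier_docs / total_docs
clustered_docs = total_docs - outlier_docs

print(f"outlier share: {outlier_share:.1%}")
print(f"documents assigned to a real topic: {clustered_docs}")
```

Roughly half of the speeches fall outside any topic, so the time series that follow describe only the clustered remainder.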

### Time series

In [7]:
topic_model = BERTopic.load("/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_jimin")

In [8]:
timestamps = all["speech_date"].to_list()
len(docs)

92755

In [None]:
# timestamps_ym = []
# for d in timestamps:
#     timestamps_ym.append("-".join(d.split("-")[:-1]))


In [9]:
timestamps_y = []
for d in timestamps:
    timestamps_y.append(d[:4])


In [10]:
topics_over_time_y = topic_model.topics_over_time(docs, timestamps_y)

22it [01:14,  3.37s/it]
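
The yearly bins come from keeping the first four characters of each `speech_date`. Below is a minimal sketch of how that slicing groups documents by year, using made-up dates rather than the real corpus. The `22it` in the progress output would be consistent with one step per distinct year, though that reading is an assumption about BERTopic's internal loop.

```python
from collections import Counter


def to_year(date_string):
    """Return the year part of an ISO-formatted 'YYYY-MM-DD' string."""
    return date_string[:4]


# Hypothetical sample dates in the same format as the speech_date column.
sample_dates = ["2002-01-29", "2002-01-28", "2003-05-10", "2023-04-12"]

years = [to_year(d) for d in sample_dates]
docs_per_year = Counter(years)

print(sorted(docs_per_year.items()))
print("distinct bins:", len(docs_per_year))
```

Coarser bins give smoother curves and fewer steps; the commented-out year-month variant in this notebook would produce roughly twelve times as many bins.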


In [28]:
visualize_topics_over_time(topic_model,
                           topics_over_time_y, 
                           topics=[0, 1, 2, 4, 6],  # topics 0-6 minus 3 (courtesy phrases) and 5 (empty keywords)
                           topic_lst = ["Bank","Education","Pension","North Korea","Road"],
                           title="Liberal Democratic Party",
                           width=800)

## Check which parties are relevant

In [29]:
conn = sqlite3.connect(f'/content/drive/MyDrive/議事録/{name}.db')
all2 = pd.read_sql_query(f'''SELECT * FROM {name} WHERE house ='衆議院' AND party not like '%自由民主党%' ''', conn)
conn.close()

In [30]:
all2["year"] = all2["speech_date"].str.slice(start=0,stop=4)

In [31]:
all2[["party","year","id"]].groupby(["party","year"]).count()

party,year,id
おおさか維新の会,2015,7
おおさか維新の会,2016,1506
たちあがれ日本,2011,3
たちあがれ日本,2012,2
たちあがれ日本・新党改革,2011,3
...,...,...
自由党,2002,3654
自由党,2003,2777
自由党,2016,27
自由党,2017,245


In [None]:
all2[["party","id"]].groupby("party").count().sort_values(by="id",ascending=False)

party,id
民主党・無所属クラブ,134720
日本共産党,63669
公明党,42286
日本維新の会,23664
立憲民主党・無所属,21490
...,...
日本を元気にする会・無所属会,2
民進党・新緑風会,2
新党改革・無所属の会,1
立憲民主党・民友会,1
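
The party column holds caucus-style names such as 民主党・無所属クラブ, so an exact match and a substring match can select different rows. The sketch below shows the difference on an in-memory SQLite table. The table contents are made up for illustration, and `count_speeches` is a hypothetical helper, not part of the notebook.

```python
import sqlite3

# In-memory database with a few illustrative rows (not the real corpus).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE speech (id INTEGER PRIMARY KEY, party TEXT)")
rows = [("公明党",), ("公明党",), ("公明党・改革クラブ",), ("日本共産党",)]
conn.executemany("INSERT INTO speech (party) VALUES (?)", rows)


def count_speeches(where_clause, value):
    """Count rows matching a WHERE clause that has one '?' placeholder."""
    query = f"SELECT COUNT(*) FROM speech WHERE {where_clause}"
    return conn.execute(query, (value,)).fetchone()[0]


exact = count_speeches("party = ?", "公明党")          # exact name only
substring = count_speeches("party LIKE ?", "%公明党%")  # any name containing it

print("exact match:", exact)
print("substring match:", substring)
conn.close()
```

Which form is appropriate depends on whether joint caucuses should count toward the party; the grouped counts above are a way to check which variants actually occur.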


## New Komeito 公明党

In [32]:
conn = sqlite3.connect(f'/content/drive/MyDrive/議事録/{name}.db')
all3 = pd.read_sql_query(f'''SELECT * FROM {name} WHERE house ='衆議院' AND party = '公明党' ''', conn)
conn.close()

In [33]:
docs = all3['speech'].to_list()
len(docs)

42286

In [None]:
topic_model3 = BERTopic(language="multilingual",
                       verbose=True,
                       embedding_model="paraphrase-multilingual-mpnet-base-v2") #https://tech.yellowback.net/posts/sentence-transformers-japanese-models

topics3, probs3 = topic_model3.fit_transform(docs) # Input is list type
topic_model3.save(f"/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_koumei")
# https://github.com/UKPLab/sentence-transformers/issues/1915 ("multilingual" raises an error)

Batches:   0%|          | 0/1322 [00:00<?, ?it/s]

2023-05-19 14:57:59,683 - BERTopic - Transformed documents to Embeddings
2023-05-19 14:58:55,914 - BERTopic - Reduced dimensionality
2023-05-19 14:59:00,259 - BERTopic - Clustered reduced embeddings

Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.



In [None]:
freq = topic_model3.get_topic_info(); freq.head(30)

,Topic,Count,Name
0,-1,20225,-1_という_ます_です_こと
1,0,876,0_金融機関_銀行_金融_金利
2,1,734,1_災害_地震_被害_避難
3,2,730,2_学校_教育_子供たち_教員
4,3,681,3_病院_患者_医療_医師
5,4,601,4_年金_保険料_制度_社会保障
6,5,599,5_ありがとう_ござい_まし_拍手
7,6,591,6_市町村_地方_地域_自治体
8,7,561,7_参考人_先生_貴重_意見
9,8,458,8_原子力_原発_規制_事故


In [34]:
topic_model3 = BERTopic.load("/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_koumei")

In [35]:
timestamps = all3["speech_date"].to_list()
len(docs)

42286

In [36]:
timestamps_y = []
for d in timestamps:
    timestamps_y.append(d[:4])

In [37]:
topics_over_time_y = topic_model3.topics_over_time(docs, timestamps_y)

22it [00:29,  1.34s/it]


In [42]:
visualize_topics_over_time(topic_model3,
                           topics_over_time_y, 
                           topics=[i for i in range(5)],
                           title="New Komeito",
                           topic_lst = ["Bank", "Disaster", "School", "Hospital", "Pension"],
                           width=800)

## Japanese Communist Party 日本共産党

In [43]:
import gc

del all
del all2
del all3

gc.collect()

358

In [44]:
conn = sqlite3.connect(f'/content/drive/MyDrive/議事録/{name}.db')
all4 = pd.read_sql_query(f'''SELECT * FROM {name} WHERE house ='衆議院' AND party = '日本共産党' ''', conn)
conn.close()

In [45]:
docs = all4['speech'].to_list()
len(docs)

63669

In [None]:
topic_model4 = BERTopic(language="multilingual",
                       verbose=True,
                       embedding_model="paraphrase-multilingual-mpnet-base-v2") #https://tech.yellowback.net/posts/sentence-transformers-japanese-models

topics4, probs4 = topic_model4.fit_transform(docs) # Input is list type
topic_model4.save(f"/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_kyousan")
# https://github.com/UKPLab/sentence-transformers/issues/1915 ("multilingual" raises an error)

Batches:   0%|          | 0/1990 [00:00<?, ?it/s]

2023-05-19 15:11:11,996 - BERTopic - Transformed documents to Embeddings
2023-05-19 15:12:05,018 - BERTopic - Reduced dimensionality
2023-05-19 15:12:10,586 - BERTopic - Clustered reduced embeddings

Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.



In [None]:
freq = topic_model4.get_topic_info(); freq.head(30)

,Topic,Count,Name
0,-1,33989,-1_です_という_ない_こと
1,0,2165,0_労働者_雇用_労働_賃金
2,1,1799,1_言っ_大臣_です_ない
3,2,917,2_消費税_増税_減税_転嫁
4,3,913,3_学校_教育_学級_教員
5,4,748,4_保育_子供_保育所_保育士
6,5,726,5_飛行_訓練_米軍機_米軍
7,6,641,6_事件_裁判_警察_犯罪
8,7,600,7_銀行_金融機関_金利_融資
9,8,480,8_住宅_家賃_公営住宅_入居


In [46]:
topic_model4 = BERTopic.load("/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_kyousan")

In [47]:
timestamps = all4["speech_date"].to_list()
len(docs)

63669

In [48]:
timestamps_y = []
for d in timestamps:
    timestamps_y.append(d[:4])

In [49]:
topics_over_time_y = topic_model4.topics_over_time(docs, timestamps_y)

22it [00:33,  1.54s/it]


In [52]:
visualize_topics_over_time(topic_model4,
                           topics_over_time_y,
                           topics=[0, 2, 3, 4, 5],  # topics 0-5 minus 1 (generic debate words)
                           topic_lst=["Labor", "Sales Tax", "School", "Daycare", "U.S. Army"],
                           title="Japanese Communist Party",
                           width=800)

## Democratic Party of Japan + Constitutional Democratic Party of Japan 民主党+立憲民主党

In [53]:
conn = sqlite3.connect(f'/content/drive/MyDrive/議事録/{name}.db')
all5 = pd.read_sql_query(f'''SELECT * FROM {name} WHERE house ='衆議院' AND (party = '民主党・無所属クラブ' OR party = '立憲民主党・無所属')''', conn)
conn.close()

In [54]:
docs = all5['speech'].to_list()
len(docs)

156210

In [None]:
topic_model5 = BERTopic(language="multilingual",
                       verbose=True,
                       embedding_model="paraphrase-multilingual-mpnet-base-v2") #https://tech.yellowback.net/posts/sentence-transformers-japanese-models

topics5, probs5 = topic_model5.fit_transform(docs) # Input is list type
topic_model5.save(f"/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_minshu")
# https://github.com/UKPLab/sentence-transformers/issues/1915 ("multilingual" raises an error)

In [55]:
topic_model5 = BERTopic.load("/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_minshu")

In [59]:
freq = topic_model5.get_topic_info(); freq.head(30)

,Topic,Count,Name
0,-1,90112,-1_です_という_ない_こと
1,0,4210,0_警察_裁判_裁判官_捜査
2,1,3948,1_農業_農家_農地_農協
3,2,2067,2_消費税_税制_課税_交付税
4,3,1822,3_年金_国民年金_厚生年金_保険料
5,4,1650,4_雇用_労働_労働者_正社員
6,5,1532,5_北朝鮮_韓国_拉致_拉致問題
7,6,1416,6_学校_教育_教育基本法_教員
8,7,1351,7_ありがとう_ござい_まし_拍手
9,8,1223,8_数字_データ_ぐらい_人口


In [56]:
timestamps = all5["speech_date"].to_list()
len(docs)

156210

In [57]:
timestamps_y = []
for d in timestamps:
    timestamps_y.append(d[:4])

In [58]:
topics_over_time_y = topic_model5.topics_over_time(docs, timestamps_y)

17it [01:24,  4.95s/it]


In [62]:
visualize_topics_over_time(topic_model5,
                           topics_over_time_y,
                           topics=[i for i in range(5)],
                           title="Democratic Party of Japan",
                           topic_lst=["Police", "Agriculture", "Sales Tax", "Pension", "Labor"],
                           width=800)

## Train an additional model for the next analysis

In [None]:
conn = sqlite3.connect(f'/content/drive/MyDrive/議事録/{name}.db')
all6 = pd.read_sql_query(f'''SELECT * FROM {name} WHERE house ='衆議院' AND speech_date >= '2020-01-01' AND speech_date <= '2023-04-30' ''', conn)
conn.close()
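
The date filter works on text because ISO `YYYY-MM-DD` strings sort in chronological order. The sketch below illustrates this with made-up sample dates. It also shows that a text bound is never validated: an impossible day such as `2023-04-31` (used here purely as an illustration) still compares without error, even though it does not parse as a real date.

```python
from datetime import date

# Illustrative bounds; the upper one is deliberately not a real calendar day.
lower, upper = "2020-01-01", "2023-04-31"
samples = ["2019-12-31", "2020-01-01", "2023-04-30", "2023-05-01"]

# Plain string comparison is enough for zero-padded ISO dates.
in_range = [d for d in samples if lower <= d <= upper]
print(in_range)


def is_real_date(text):
    """Return True if text parses as an ISO calendar date."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


print(is_real_date("2023-04-30"), is_real_date("2023-04-31"))
```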

In [None]:
len(all6)

71357

In [None]:
docs = all6['speech'].to_list()

In [None]:
len(docs)

71357

In [None]:
topic_model5 = BERTopic(language="multilingual",
                       verbose=True,
                       embedding_model="paraphrase-multilingual-mpnet-base-v2") #https://tech.yellowback.net/posts/sentence-transformers-japanese-models

topics5, probs5 = topic_model5.fit_transform(docs) # Input is list type
topic_model5.save(f"/content/drive/MyDrive/議事録/BERTopic/model_shuugiin_3y")

Batches:   0%|          | 0/2230 [00:00<?, ?it/s]

2023-05-19 00:15:13,000 - BERTopic - Transformed documents to Embeddings
2023-05-19 00:17:28,938 - BERTopic - Reduced dimensionality
2023-05-19 00:17:37,998 - BERTopic - Clustered reduced embeddings
  self._set_arrayXarray(i, j, x)


In [None]:
freq = topic_model5.get_topic_info(); freq.head(30)

,Topic,Count,Name
0,-1,34792,-1_です_という_こと_ない
1,0,3135,0_ください_でしょ_よろしい_大臣
2,1,1425,1_雇用_賃上げ_賃金_最低賃金
3,2,1386,2_接種_ワクチン_副反応_高齢者
4,3,1006,3_消費税_課税_税制_法人税
5,4,859,4_病院_医療_医師_患者
6,5,725,5_ありがとう_ござい_まし_拍手
7,6,669,6_予算_補正予算_予備費_経費
8,7,641,7_学校_教員_学級_研修
9,8,629,8_原発_原子力_規制_稼働
