### Create yearly CSV file
This notebook creates CSV output files containing the original file name, the candidate's party, the candidate's total votes, the manifesto text, and whether the candidate won the election. The output excludes individuals who are not considered serious candidates, meaning those who both received fewer than 10,000 votes and were not nominated by a major party.
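
The selection rule can be sketched as a small predicate that matches the filter applied later in the notebook. This is an illustration only: `is_serious`, the party IDs, and the sample rows are hypothetical, and the notebook itself applies the rule with pandas.

```python
# Illustrative sketch of the "serious candidate" rule (hypothetical names and data).
VOTE_THRESHOLD = 10_000

def is_serious(party_id, votes, major_party_ids):
    """Keep a candidate nominated by a major party OR with at least 10,000 votes."""
    return party_id in major_party_ids or votes >= VOTE_THRESHOLD

major_party_ids = {1, 2, 3}  # assumed IDs, for illustration only

candidates = [
    {"name": "A", "party_id": 1, "votes": 4_000},    # major party, few votes: kept
    {"name": "B", "party_id": 99, "votes": 25_000},  # minor party, many votes: kept
    {"name": "C", "party_id": 99, "votes": 3_000},   # minor party, few votes: dropped
]

kept = [c["name"] for c in candidates
        if is_serious(c["party_id"], c["votes"], major_party_ids)]
print(kept)  # ['A', 'B']
```

Only candidate C is dropped, because it fails both conditions at once.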

In [1]:
# Import packages
import pandas as pd
import numpy as np
import os
import csv
import codecs # read Japanese files
import chardet # detect file encoding
import pykakasi # romanize Japanese text (kana and kanji)
years = [1986, 1990, 1993, 1996, 2000, 2003, 2005, 2009, 2012]

# Create a function that romanizes Japanese text (kana and kanji) in Hepburn style
kks = pykakasi.kakasi()  # build the converter once and reuse it

def japanese_to_english(name):
    result = kks.convert(name)
    # Concatenate the Hepburn reading of each segment
    new_name = "".join(item['hepburn'] for item in result)
    return new_name.strip()

In [2]:
# Detect the encoding of a sample txt file
with open('/Users/deankuo/Desktop/python/dissertation_replicate/txt_version/1986/1986_Aichi_愛知県第１区_丹羽章夫.txt', 'rb') as f:
    result = chardet.detect(f.read())
    print(result)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
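
chardet reports a guess with a confidence score rather than a guarantee. If some files might use a legacy Japanese encoding, one option is to try a short list of candidate encodings in order. The sketch below uses only the standard library. `decode_with_fallback` and the encoding list are assumptions, not part of the original pipeline.

```python
# Sketch: decode raw bytes by trying candidate encodings in order.
# The encoding list is an assumption; adjust it to the corpus at hand.
def decode_with_fallback(raw, encodings=("utf-8", "shift_jis", "euc_jp")):
    """Return (text, encoding) for the first encoding that decodes raw without error."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")

sample = "愛知県"
print(decode_with_fallback(sample.encode("utf-8"))[1])      # utf-8
print(decode_with_fallback(sample.encode("shift_jis"))[1])  # shift_jis
```

UTF-8 is tried first because it is strict. Legacy encodings can sometimes decode the wrong bytes without raising an error, so the order of the list matters.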


In [3]:
# Specify the folder path
folder_path = '/Users/deankuo/Desktop/python/dissertation_replicate/txt_version/'

# Create one CSV file per election year, with one row per manifesto file
for year in years:
    csv_file_path = f'/Users/deankuo/Desktop/python/dissertation_replicate/{year}.csv'
    with open(csv_file_path, mode='w', encoding="utf-8", newline='') as csv_file:
        fieldnames = ['file_name', 'content']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for file_name in os.listdir(f"{folder_path}{year}"):
            file_path = os.path.join(f"{folder_path}{year}", file_name)
            if os.path.isfile(file_path):
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                    writer.writerow({'file_name': file_name, 'content': content})
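
On macOS, `os.listdir` can also return hidden entries such as `.DS_Store`, which are not manifesto text. A defensive variant of the loop above is sketched here against a temporary directory with made-up file names. It filters on the `.txt` extension, sorts for a reproducible row order, and opens the CSV with `newline=''` as the `csv` module documentation recommends. The helper name `write_year_csv` is hypothetical.

```python
import csv
import os
import tempfile

def write_year_csv(txt_dir, csv_path):
    """Write one row per .txt file in txt_dir; return the number of rows written."""
    rows = 0
    with open(csv_path, mode="w", encoding="utf-8", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["file_name", "content"])
        writer.writeheader()
        for file_name in sorted(os.listdir(txt_dir)):
            file_path = os.path.join(txt_dir, file_name)
            # Skip directories and non-text entries such as .DS_Store
            if not (os.path.isfile(file_path) and file_name.endswith(".txt")):
                continue
            with open(file_path, "r", encoding="utf-8") as f:
                writer.writerow({"file_name": file_name, "content": f.read()})
            rows += 1
    return rows

# Demo with throwaway files (names and contents are invented)
with tempfile.TemporaryDirectory() as tmp:
    txt_dir = os.path.join(tmp, "1986")
    os.mkdir(txt_dir)
    with open(os.path.join(txt_dir, "1986_Aichi_district_name.txt"), "w", encoding="utf-8") as f:
        f.write("sample manifesto")
    with open(os.path.join(txt_dir, ".DS_Store"), "wb") as f:
        f.write(b"\x00\x01")
    csv_path = os.path.join(tmp, "1986.csv")
    n_rows = write_year_csv(txt_dir, csv_path)
    with open(csv_path, encoding="utf-8", newline="") as f:
        written = list(csv.DictReader(f))

print(n_rows, written[0]["file_name"])
```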

In [4]:
# Read csv file of manifesto contents created previously
dfs = {}

for year in years:
    dfs[f"df_{year}"] = pd.read_csv(f'/Users/deankuo/Desktop/python/dissertation_replicate/{year}.csv')


In [13]:
def divide_file_name(years):
    """
    Splits each file name into year, prefecture (stored as 'state'), district ('ku'), and candidate name for each year's dataframe.
    """
    for year in years:
        df_year = dfs[f"df_{year}"]
        for i in df_year.index:
            file_name = df_year.loc[i, 'file_name']
            year_name, state_name, district_name, name = file_name.split('_')[:4]
            df_year.loc[i, 'year'] = year_name
            df_year.loc[i, 'state'] = state_name
            df_year.loc[i, 'ku'] = district_name
            df_year.loc[i, 'name'] = name[:-4]  # strip the '.txt' extension

divide_file_name(years)
dfs['df_1986'].head(5)

Unnamed: 0,file_name,content,year,state,ku,name
0,1986_Okinawa_年沖縄県第１区_せなが亀次郎.txt,ごあいさつ。日本共産党とセナガ亀次郎に日ごろから御支援、御協力いただき、心から感謝申しあげま...,1986,Okinawa,年沖縄県第１区,せなが亀次郎
1,1986_Kagosima_鹿児島県第１区_川崎寛治.txt,まず減税、降灰対策。\n社・公・民が力を合わせ自民党を小さくしよう。中曽根首相は、党利、党略...,1986,Kagosima,鹿児島県第１区,川崎寛治
2,1986_Aichi_愛知第６区_片岡武司.txt,私は急逝された恩師水平豊彦先生の悲しみを越えて、その教えを守り、その政治信条を継承して必死に...,1986,Aichi,愛知第６区,片岡武司
3,1986_Kanagawa_神奈川県第４区_田中けいしゅう.txt,横浜市民の皆さん、この度の衆議院議員選挙に、再び立候補いたしました田中けいしゅうです。前回の...,1986,Kanagawa,神奈川県第４区,田中けいしゅう
4,1986_Iwate_岩手県第１区_斉藤信.txt,悪政と対決する真の革新の党・日本共産党を。\n党利党略・国民だます中曽根自民党。冒頭解散の臨...,1986,Iwate,岩手県第１区,斉藤信
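
Since the file names shown above follow the pattern `year_prefecture_district_candidate.txt`, the split can also be written as a small pure function that is easy to test on its own. The sketch below is an illustration: `parse_file_name` is a hypothetical helper. It uses `os.path.splitext` instead of slicing off the last four characters, so the extension length is not hard-coded.

```python
import os

def parse_file_name(file_name):
    """Split 'year_prefecture_district_candidate.txt' into its four parts."""
    stem, _ext = os.path.splitext(file_name)
    parts = stem.split("_")
    if len(parts) < 4:
        raise ValueError(f"unexpected file name: {file_name!r}")
    year, prefecture, district, name = parts[:4]
    # Keys mirror the column names used in the notebook
    return {"year": year, "state": prefecture, "ku": district, "name": name}

print(parse_file_name("1986_Iwate_岩手県第１区_斉藤信.txt"))
# {'year': '1986', 'state': 'Iwate', 'ku': '岩手県第１区', 'name': '斉藤信'}
```

Raising on malformed names makes stray files visible early instead of producing partially filled rows.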


In [18]:
# Define a function to romanize candidate name
def romanize_name(df, column='name'):
    return df.apply(lambda x: japanese_to_english(x[column]), axis=1)

# Romanize each candidate's name 
for year in years:
    df_year = dfs[f"df_{year}"]
    df_year['name_en'] = romanize_name(df_year)

dfs['df_1986'].head(5)

Unnamed: 0,file_name,content,year,state,ku,name,name_en
0,1986_Okinawa_年沖縄県第１区_せなが亀次郎.txt,ごあいさつ。日本共産党とセナガ亀次郎に日ごろから御支援、御協力いただき、心から感謝申しあげま...,1986,Okinawa,年沖縄県第１区,せなが亀次郎,senagakamejirou
1,1986_Kagosima_鹿児島県第１区_川崎寛治.txt,まず減税、降灰対策。\n社・公・民が力を合わせ自民党を小さくしよう。中曽根首相は、党利、党略...,1986,Kagosima,鹿児島県第１区,川崎寛治,kawasakikanji
2,1986_Aichi_愛知第６区_片岡武司.txt,私は急逝された恩師水平豊彦先生の悲しみを越えて、その教えを守り、その政治信条を継承して必死に...,1986,Aichi,愛知第６区,片岡武司,kataokatakeshi
3,1986_Kanagawa_神奈川県第４区_田中けいしゅう.txt,横浜市民の皆さん、この度の衆議院議員選挙に、再び立候補いたしました田中けいしゅうです。前回の...,1986,Kanagawa,神奈川県第４区,田中けいしゅう,tanakakeishuu
4,1986_Iwate_岩手県第１区_斉藤信.txt,悪政と対決する真の革新の党・日本共産党を。\n党利党略・国民だます中曽根自民党。冒頭解散の臨...,1986,Iwate,岩手県第１区,斉藤信,saitoushin


### Candidates' vote data  
The data is from Smith, Daniel M.; Reed, Steven R., 2018, "The Reed-Smith Japanese House of Representatives Elections Dataset", https://doi.org/10.7910/DVN/QFEPXD, Harvard Dataverse.

In [16]:
# Load the vote data for each candidate
candidate_df = pd.read_csv('/Users/deankuo/Desktop/python/dissertation_replicate/candidate_votes/Reed-Smith-JHRED-CANDIDATES.csv')
candidate_df.head(5)


Unnamed: 0,pid,name_jp,legis,year,yr,ken,kunr,kuname,kucode,kucoder,...,exp,limit,totexp,cabappt,cabexp,junappt,junexp,pm,speaker,vicespeaker
0,10001,がくし勇三郎,27,1955,5.0,Tokyo,6,Tokyo 6,1306,1306.0,...,,,,0,0,0,0,0,0,0
1,10002,ガッツ石松,41,1996,19.0,Tokyo,9,Tokyo 9,1309,1309.0,...,9633649.0,25019200.0,42691256.0,0,0,0,0,0,0,0
2,10003,さとうふみや,45,2009,23.0,,0,Tokyo bloc,0,0.0,...,,,,0,0,0,0,0,0,0
3,10004,ツルネン・マルテイ,42,2000,20.0,Kanagawa,17,Kanagawa 17,1417,1417.0,...,6841924.0,25402800.0,27775354.0,0,0,0,0,0,0,0
4,10005,トクマ,47,2014,25.0,,0,Tokyo bloc,0,0.0,...,,,0.0,0,0,0,0,0,0,0


In [19]:
# Define a function to select the columns needed
def select_columns(df):
    return df[["name_jp", "year", "kuname", "result", "inc", "party_jp", "party_en", "party_id", "ku_vote", "ku_totvote", "ku_rank", "ku_ncand", "byelection", "totcruns", "female"]]

# Select the needed columns: candidate name, election year, district name, election result, party, and votes
select_column = select_columns(candidate_df)

# Keep only the election years of interest (1986 to 2012)
for year in years:
    candidate_df_year = select_column[select_column["year"] == year].copy()
    candidate_df_year['name_en'] = romanize_name(candidate_df_year, column='name_jp')
    dfs[f"candidate_df_{year}"] = candidate_df_year

dfs['candidate_df_1986'].head(5)

Unnamed: 0,name_jp,year,kuname,result,inc,party_jp,party_en,party_id,ku_vote,ku_totvote,ku_rank,ku_ncand,byelection,totcruns,female,name_en
64,三ツ林弥太郎,1986,Saitama 4,1,1,自民党,LDP,1.0,120716.0,617622.0,2.0,7.0,0,8,0,santsuhayashiyatarou
103,三原朝彦,1986,Fukuoka 2,1,0,自民党,LDP,1.0,83204.0,567263.0,4.0,6.0,0,1,0,miharaasahiko
143,三塚博,1986,Miyagi 1,1,1,自民党,LDP,1.0,159417.0,765325.0,1.0,8.0,0,6,0,mitsuzukahiroshi
186,三富要,1986,Fukushima 2,0,0,共産党,JCP,5.0,13130.0,432086.0,7.0,8.0,0,4,0,mitomiyou
229,三木武夫,1986,Tokushima 1,1,1,自民党,LDP,1.0,73834.0,426423.0,2.0,8.0,0,16,0,mikitakeo


### Merge data

In [21]:
def merge_dataframes(years):
    """
    Merges two dataframes based on romanization for each year in the list of years.
    Drops duplicate columns, renames columns and prints the shape of the resulting dataframe.
    """
    for year in years:
        df_year = dfs[f"df_{year}"]
        candidate_df_year = dfs[f"candidate_df_{year}"]
        merge_df = pd.merge(df_year, candidate_df_year, on='name_en')
        merge_df.drop(columns=['year_y', 'state', 'ku', 'name', 'name_en'], inplace=True)
        merge_df.rename(columns={'year_x':'year'}, inplace=True)
        dfs[f"merge_df_{year}"] = merge_df
        print(merge_df.shape)

merge_dataframes(years)

(706, 17)
(675, 17)
(705, 17)
(946, 17)
(925, 17)
(844, 17)
(805, 17)
(668, 17)
(998, 17)
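
Because the join key is only the romanized name, an inner merge can silently drop manifestos that have no matching candidate, and it can multiply rows when a romanization appears more than once on either side. A quick check before merging is sketched here in plain Python on invented names. `audit_join_keys` is a hypothetical helper that lists both cases.

```python
from collections import Counter

def audit_join_keys(left_keys, right_keys):
    """Report keys that an inner join on these lists would drop or duplicate."""
    left_counts, right_counts = Counter(left_keys), Counter(right_keys)
    # Left keys with no partner on the right disappear from an inner join
    unmatched_left = sorted(set(left_counts) - set(right_counts))
    # A key repeated on either side yields left_count * right_count rows
    duplicated = sorted(
        k for k in set(left_counts) & set(right_counts)
        if left_counts[k] * right_counts[k] > 1
    )
    inner_rows = sum(left_counts[k] * right_counts[k] for k in left_counts)
    return {"unmatched_left": unmatched_left, "duplicated": duplicated, "inner_rows": inner_rows}

manifesto_names = ["tanakataro", "satohanako", "suzukiichiro"]  # invented
candidate_names = ["tanakataro", "tanakataro", "satohanako"]    # invented

report = audit_join_keys(manifesto_names, candidate_names)
print(report)
# {'unmatched_left': ['suzukiichiro'], 'duplicated': ['tanakataro'], 'inner_rows': 3}
```

In pandas, comparable checks are available through the `indicator=True` and `validate=` arguments of `pd.merge`. Adding the district to the join key would be another way to reduce ambiguous matches, though that is a design choice outside the original notebook.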


In [22]:
# Define major party lists for each electoral system
major_parties = {
    'SNTV-MMD': [1, 2, 3, 4, 5, 12, 13, 14],
    'MMM': [1, 2, 3, 5, 12, 15, 16, 25, 26, 31, 34]
}

# Define the years for each electoral system
# Note: 2012 is in `years` but in neither list below, so merge_df_2012 is not filtered here
sntv_years = [1986, 1990, 1993]
mmm_years = [1996, 2000, 2003, 2005, 2009]

total_candidates = 0

# Define a function to sift serious candidates
def sift_candidates(df, major_party_list):
    return df[(df['party_id'].isin(major_party_list)) | (df['ku_vote'] >= 10000)]

# Keep serious candidates (either nominated by a major party or received at least 10,000 votes)
for year in sntv_years:
    merge_df_year = dfs[f"merge_df_{year}"]
    merge_df_year = sift_candidates(merge_df_year, major_parties['SNTV-MMD'])
    dfs[f"merge_df_{year}"] = merge_df_year
    total_candidates += len(merge_df_year.index)
    print(str(year) + ":", end=" ")
    print(merge_df_year.shape)

for year in mmm_years:
    merge_df_year = dfs[f"merge_df_{year}"]
    merge_df_year = sift_candidates(merge_df_year, major_parties['MMM'])
    dfs[f"merge_df_{year}"] = merge_df_year
    total_candidates += len(merge_df_year.index)
    print(str(year) + ":", end=" ")
    print(merge_df_year.shape)

print("The total candidate number collected is:", total_candidates) # 6159


1986: (699, 17)
1990: (666, 17)
1993: (702, 17)
1996: (925, 17)
2000: (896, 17)
2003: (831, 17)
2005: (786, 17)
2009: (654, 17)
The total candidate number collected is: 6159


In [23]:
# Count the manifestos from by-elections in each year
for year in years:
    merge_df_year = dfs[f"merge_df_{year}"]
    print(str(year) + ": " + str(merge_df_year['byelection'].sum()))

1986: 0
1990: 0
1993: 0
1996: 0
2000: 4
2003: 4
2005: 5
2009: 0
2012: 3


In [24]:
# Reorder the columns of each dataframe
for year in years:
    merge_df_year = dfs[f"merge_df_{year}"]
    merge_df_year = merge_df_year[["year", "name_jp", "kuname", "result", "party_jp", "party_en", "party_id", "ku_vote", "ku_totvote", "ku_rank", "inc", "ku_ncand", "byelection", "totcruns", "female", "file_name", "content"]]
    dfs[f"merge_df_{year}"] = merge_df_year

dfs['merge_df_2009'].head(5) # type: ignore


Unnamed: 0,year,name_jp,kuname,result,party_jp,party_en,party_id,ku_vote,ku_totvote,ku_rank,inc,ku_ncand,byelection,totcruns,female,file_name,content
0,2009,斎藤愛子,Aichi 2,0,共産党,JCP,5.0,18908.0,243557.0,3.0,0,4.0,0,2,1,2009_Aichi_愛知県第２区_さいとう愛子.txt,市民運動三十年、暮らしに全力\n私は「中学校に学校給食を」や藤前干潟、海上の森を守る運動、子...
1,2009,山下京子,Osaka 11,0,共産党,JCP,5.0,30680.0,262032.0,3.0,0,4.0,0,4,1,2009_Osaka_大阪府第１１区_山下京子.txt,いのち大切にする政治につくりかえます\n自公政権のあとには「建設的野党」が必要です\nこんに...
2,2009,川条志嘉,Osaka 2,0,自民党,LDP,1.0,35417.0,229171.0,3.0,1,5.0,0,2,1,2009_Osaka_大阪府第２区_川条しか.txt,大阪に活力くらしに全力\n私は、親が政治家でもなければ、大金持ちでもない、普通の家庭から、「...
3,2009,青木愛,Tokyo 12,1,民主党,DPJ,16.0,118753.0,262720.0,1.0,0,4.0,0,3,1,2009_Tokyo_東京都第１２区_青木愛.txt,政権交代で、暮らしを守る。\n今度の総選挙で日本の行く末が決まると言っても過言ではありません...
4,2009,葉梨康弘,Ibaraki 3,0,自民党,LDP,1.0,103228.0,257237.0,2.0,1,3.0,0,3,0,2009_Ibaraki_茨城県第３区_はなし康弘.txt,「与党トップの働きマン」、「４９歳の若手改革派」葉梨康弘が「新しい政治」を創ります\n３区は...


### Save data files

In [14]:
# Concatenate all yearly dataframes and export to a CSV file
dfs_list = [dfs[f"merge_df_{year}"] for year in years]
final_df = pd.concat(dfs_list, join='inner')
final_df.to_csv(r'/Users/deankuo/Desktop/python/dissertation_replicate/dean_final.csv')


In [15]:
# Keep candidate name, manifesto content, year, and district for building a term-document matrix (TDM)
final_txt_df = final_df[['name_jp', 'content', 'year', 'kuname']]
final_txt_df.to_csv(r'/Users/deankuo/Desktop/python/dissertation_replicate/dean_final_txt.csv')

In [16]:
# Extract manifesto content of each year
for year in years:
    temp = dfs[f"merge_df_{year}"][['name_jp', 'content', 'year', 'kuname']]
    temp.to_csv(f'/Users/deankuo/Desktop/python/dissertation_replicate/excel_version/{year}.csv')
