### Script purpose: Ming office title coding

1. General principles:
    - A complete ontological structure for an office title has four parts: `Classification + Administrative Unit (optional) + Function (optional) + Title`
    - Each part corresponds to one table.
    - Keep `coding_value` and `raw_value` separate.
        - `raw_value`: the string as it appears in the original book text.
        - `coding_value`: the revised string that can be coded successfully.
    - Replace longer strings before shorter ones.
    - Replacement priority: titles (`T`) first, appointment types (`P`) last.

2. Notes:
    - The `Office title by LENGTH` table merges the CBDB Ming office titles with the UCI table. Duplicates from the CBDB table have been removed, so this is the clean table used below.
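
The "longer strings first" principle can be illustrated with a minimal sketch. The vocabulary and IDs below are made up for illustration and are not taken from the actual tables, and `code_longest_first` is a hypothetical helper, not a function used later in this notebook.

```python
def code_longest_first(text, vocab, prefix):
    """Replace every vocabulary entry found in `text` with prefix + id,
    trying longer entries before shorter ones."""
    by_length = sorted(vocab.items(), key=lambda kv: len(kv[0]), reverse=True)
    for term, term_id in by_length:
        text = text.replace(term, prefix + str(term_id))
    return text


# Hypothetical vocabulary: the short entry is a substring of the long one.
vocab = {'都督府': 1, '左軍都督府': 2}
coded = code_longest_first('左軍都督府同知', vocab, 'A')
print(coded)
```

If the shorter entry were replaced first, the longer unit could no longer be matched, which is why the order matters.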

### TODO:
- [×] English words;
- [×] Forward slash.

In [1]:
%matplotlib inline
import sqlite3
import pandas as pd
import networkx as nx
import xlrd
import matplotlib.pyplot as plt
import math
import warnings
from tqdm import tqdm
import re
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

### `c_office_chn` from UCI.

In [2]:
df_uci_office_ming=pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSCmhbCk1B-9jjINMhy_VwikM6_Sn7bjdO7b_vaZJkVcYCCYlWVlhYVCFtAs0fPX-UEO62GWxaX1qAS/pub?gid=630627340&single=true&output=tsv',
                                    sep='\t')
df_uci_office_ming=df_uci_office_ming[['c_office_id（Dictionary Ser#)','Institution 1', 'Institution 2', 'Institution 3', 'c_office_chn']].rename(columns={'c_office_id（Dictionary Ser#)':'c_office_id'})
df_uci_office_ming['c_office_chn']=[s.replace('/', '') for s in df_uci_office_ming['c_office_chn']]
df_uci_office_ming.sample(3)

,c_office_id,Institution 1,Institution 2,Institution 3,c_office_chn
1280,70581,京衛京營與中央軍事官署類 Central and Capital Militaries,京營門 Military Training Units in the Capital,京營 Capital Training Divisions (1550-),提督總兵官
3249,2522,牧鹽舶政類 Horse/Salt Business and Maritime Trade,鹽課鹽運門 Salt Business,煎鹽提舉司 Salt Boiling Supervisorates,典史
893,1721,中央輔佐官署類 Central Administration Assistance,寺監門 Courts and Directorates,太醫院 The Imperial Academy of Medicine,御藥局藥童


In [3]:
df_uci_office_ming.sample(10)

,c_office_id,Institution 1,Institution 2,Institution 3,c_office_chn
3292,186,皇族宮廷類 Imperial Family and Royal Court,宗室門 The Imperial Clan,公主府 Princess Establishment,家令
2182,71273,地方官署類 Regional and Local Governance,府官門 Prefectural Governance,府官 Prefectural Officers,水馬驛驛丞
1490,71264,京衛京營與中央軍事官署類 Central and Capital Militaries,大都督府門 The Chief Military Commissions,左軍都督府 The Chief Military Commission of the Left,行在左軍都督同知
622,1765,中央輔佐官署類 Central Administration Assistance,僧道官門 Buddhist and Daoist Registries,道錄司 The Central Taoist Registry,右至靈
2373,3017,地方軍事與治安機構類 Regional and Local Military Units,(行)都指揮使司門 (Auxiliary) Regional and Local Milit...,衛指揮使司 Guard Military Commands,經歷司吏目
2061,2592,地方官署類 Regional and Local Governance,京府門 Superior Prefectural Governance,應天府 Yingtian Superior Prefecture,京縣縣丞
932,1618,中央輔佐官署類 Central Administration Assistance,寺監門 Courts and Directorates,殿庭儀禮司 The Palace Ceremonial Office (Hongwu 9 (...,右司副
2571,2960,地方軍事與治安機構類 Regional and Local Military Units,地區軍官門 Regional Military Officials,江西軍官 Military Officials in the Jiangxi Region,把總
2291,2676,地方官署類 Regional and Local Governance,省官門 Provincial Governance,提刑按察使司 Provincial Surveillance Commission,試僉事
3798,498,皇族宮廷類 Imperial Family and Royal Court,宦官門 Eunuch Offices,外差宦官 Eunuchs on Secondment,內承運庫監工


In [4]:
df_uci_office_ming['inst_1_chn']=[str(s).split()[0].replace('nan', '') for s in df_uci_office_ming['Institution 1']]
df_uci_office_ming['inst_2_chn']=[str(s).split()[0].replace('nan', '') for s in df_uci_office_ming['Institution 2']]
df_uci_office_ming['inst_3_chn']=[str(s).split()[0].replace('nan', '') for s in df_uci_office_ming['Institution 3']]
df_uci_office_ming['uci_value']=df_uci_office_ming['inst_1_chn']+df_uci_office_ming['inst_2_chn']+'_'+df_uci_office_ming['inst_3_chn']+'_'+df_uci_office_ming['c_office_chn']
df_uci_office_ming['c_office_id']=pd.to_numeric(df_uci_office_ming['c_office_id'], errors='coerce')
df_uci_office_ming.drop(['inst_1_chn', 'inst_2_chn', 'inst_3_chn', 'Institution 1', 'Institution 2', 'Institution 3', 'c_office_chn'], axis=1, inplace=True)
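
As a rough illustration of what the cell above builds, the following sketch assembles one `uci_value` from plain strings. The helper names `chinese_label` and `build_uci_value` are hypothetical, and the sketch assumes that the Chinese label is always the first whitespace-separated token of an `Institution` cell.

```python
def chinese_label(cell):
    """Return the leading Chinese token of an 'Institution' cell, or '' if missing."""
    if cell is None or cell != cell:  # NaN is the only value not equal to itself
        return ''
    parts = str(cell).split()
    return parts[0] if parts else ''


def build_uci_value(inst_1, inst_2, inst_3, office):
    """Join the labels as '<inst 1><inst 2>_<inst 3>_<office title>'."""
    return (chinese_label(inst_1) + chinese_label(inst_2)
            + '_' + chinese_label(inst_3) + '_' + office)


value = build_uci_value('地方官署類 Regional and Local Governance',
                        '省官門 Provincial Governance',
                        float('nan'),  # missing third-level institution
                        '試僉事')
print(value)
```

A missing institution leaves an empty segment, so the string always keeps its three underscore-separated parts.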

In [5]:
df_uci_office_ming[df_uci_office_ming['c_office_id'].duplicated()]

,c_office_id,uci_value
1130,71508.0,中央輔佐官署類秘書門_翰林院_直文淵閣侍講學士
1195,71503.0,中央輔佐官署類考官門_會試官_知貢舉官
1219,72165.0,中央輔佐官署類考官門_鄉試官_順天同考官
1282,,京衛京營與中央軍事官署類京營門_京營_京營總兵官
2314,71504.0,地方官署類省官門_行中書省_理問所知事
2718,71274.0,地方軍事與治安機構類招討經略安撫使門_宣撫司_宣撫司經歷
2821,,文武散階勛爵類勛爵門_伯_平涼伯
2842,,文武散階勛爵類勛爵門_伯_新城伯
2862,,文武散階勛爵類勛爵門_伯_永定伯
2882,,文武散階勛爵類勛爵門_伯_鎮遠伯


In [6]:
df_uci_office_ming['uci_value']=[s.replace('/', '') for s in df_uci_office_ming['uci_value']]
df_uci_office_ming['uci_value']=[s.replace('／', '') for s in df_uci_office_ming['uci_value']]
df_uci_office_ming['uci_value']=[s.replace('、', '') for s in df_uci_office_ming['uci_value']]
df_uci_office_ming['uci_value']=[re.sub(r'[a-zA-Z]', string=s, repl='') for s in df_uci_office_ming['uci_value']]
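
The four replacements above could also be expressed as one character class. This is only a sketch; `clean_value` is a hypothetical helper and the input string is invented for the example.

```python
import re

# Characters removed by the cell above: ASCII slash, full-width slash,
# ideographic comma, and Latin letters.
_NOISE = re.compile(r'[/／、a-zA-Z]')


def clean_value(s):
    """Strip slashes, ideographic commas, and English letters from a value."""
    return _NOISE.sub('', s)


cleaned = clean_value('左/右Left司副、')
print(cleaned)
```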

In [7]:
df_uci_office_ming.drop(df_uci_office_ming[df_uci_office_ming['c_office_id'].duplicated()].index, inplace=True)
df_uci_office_ming.set_index('c_office_id', inplace=True)
df_uci_office_ming.sample(3)

c_office_id,uci_value
256.0,皇族宮廷類宦官門_司設監_右少監
1802.0,司法監察機構類監察門_都察院_北平道監察御史
70463.0,中央中樞官署類六部門_戶部_河南清吏司員外郎


### `c_office_chn` from CBDB (uncleaned), merged with UCI.

In [8]:
conn = sqlite3.connect('../../SQL/sqlite_20180302.db')
# Keep only Ming rows (c_dy == 19); filtering in SQL avoids running the query twice.
df_cbdb_office_ming=pd.read_sql_query("SELECT * FROM OFFICE_CODES WHERE c_dy = 19", conn).set_index('c_office_id')
df_cbdb_office_ming.sample(3)

c_office_id,tts_sysno,c_dy,c_office_pinyin,c_office_chn,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,c_notes,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old
71252,15989.0,19,xing zai xing bu lang zhong,行在刑部郎中,,,Auxiliary Director of a Bureau in the Ministry of,,,,,Auxiliary Director of a Bureau in the Ministry of,,,,0.0
71706,16443.0,19,an yang hou,安陽侯,,,[Not Yet Translated],爵,,,,[Not Yet Translated],,,,0.0
70463,15200.0,19,hu bu he nan si yuan wai lang,戶部河南司員外郎,,,Vice Director of the Henan Bureau of the Ministry,,,,,Vice Director of the Henan Bureau of the Ministry,,,,0.0


In [9]:
for index in tqdm(df_uci_office_ming.index):
    if index in df_cbdb_office_ming.index:
        df_uci_office_ming.loc[index, 'cbdb_value']=df_cbdb_office_ming.loc[index, 'c_office_chn']
        df_uci_office_ming.loc[index, 'tts_sysno']=df_cbdb_office_ming.loc[index, 'tts_sysno']
        df_uci_office_ming.loc[index, 'c_office_pinyin']=df_cbdb_office_ming.loc[index, 'c_office_pinyin']
        df_uci_office_ming.loc[index, 'c_office_pinyin_alt']=df_cbdb_office_ming.loc[index, 'c_office_pinyin_alt']
        df_uci_office_ming.loc[index, 'c_office_chn_alt']=df_cbdb_office_ming.loc[index, 'c_office_chn_alt']
        df_uci_office_ming.loc[index, 'c_office_trans']=df_cbdb_office_ming.loc[index, 'c_office_trans']
        df_uci_office_ming.loc[index, 'c_office_trans_alt']=df_cbdb_office_ming.loc[index, 'c_office_trans_alt']
        df_uci_office_ming.loc[index, 'c_source']=df_cbdb_office_ming.loc[index, 'c_source']
        df_uci_office_ming.loc[index, 'c_pages']=df_cbdb_office_ming.loc[index, 'c_pages']
        df_uci_office_ming.loc[index, 'c_notes']=df_cbdb_office_ming.loc[index, 'c_notes']
        df_uci_office_ming.loc[index, 'c_category_1']=df_cbdb_office_ming.loc[index, 'c_category_1']
        df_uci_office_ming.loc[index, 'c_category_2']=df_cbdb_office_ming.loc[index, 'c_category_2']
        df_uci_office_ming.loc[index, 'c_category_3']=df_cbdb_office_ming.loc[index, 'c_category_3']
        df_uci_office_ming.loc[index, 'c_category_4']=df_cbdb_office_ming.loc[index, 'c_category_4']
        df_uci_office_ming.loc[index, 'c_office_id_old']=df_cbdb_office_ming.loc[index, 'c_office_id_old']
# Note: this line is outside the loop, so it runs once and sets `c_dy` only for the last `index`.
df_uci_office_ming.loc[index, 'c_dy']=19

100%|██████████| 4304/4304 [01:02<00:00, 69.24it/s]
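
The row-by-row copy above could probably be replaced by a single index join. The sketch below uses two toy frames with made-up rows as stand-ins for `df_uci_office_ming` and `df_cbdb_office_ming`, and assumes that `c_office_id` is unique in both.

```python
import pandas as pd

# Toy stand-ins for the UCI and CBDB tables (rows are invented).
uci = pd.DataFrame({'uci_value': ['x_y_甲', 'x_y_乙']},
                   index=pd.Index([1, 2], name='c_office_id'))
cbdb = pd.DataFrame({'c_office_chn': ['甲', '丙'],
                     'tts_sysno': [100.0, 300.0]},
                    index=pd.Index([1, 3], name='c_office_id'))

# A left join keeps every UCI row; IDs missing from CBDB get NaN.
merged = uci.join(cbdb.rename(columns={'c_office_chn': 'cbdb_value'}), how='left')
print(merged)
```

A left join keeps the UCI table as the reference, which matches the loop's behaviour of leaving unmatched rows empty.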


In [10]:
df_office_ming_merged=df_uci_office_ming
df_office_ming_merged.sample(3)

c_office_id,uci_value,cbdb_value,tts_sysno,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,c_notes,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old,c_dy
72144.0,中央輔佐官署類僧道官門_僧錄司_左覺義,左覺義,16880.0,zuo jue yi,,,Buddhist Rectifier of the Left,,,,,Buddhist Rectifier of the Left,,,,0.0,
2259.0,南京官署類南京六部門_南京兵部_會同館大使,,,,,,,,,,,,,,,,
71815.0,文武散階勛爵類勛爵門_伯_海寧伯,海寧伯,16552.0,hai ning bo,,,[Not Yet Translated],爵,,,,[Not Yet Translated],,,,0.0,


### Coding `c_office_chn`.

### TODO (done):
    - [×] Subtract titles from right.
    - [×] Add appointment type.
    - [×] Use online revised CLS table.

In [11]:
df_adm=pd.read_csv('../data_output/C_OT_ADM.tsv', sep='\t').set_index('c_ot_adm_id')
df_cls=pd.read_csv('../data_output/C_OT_CLS.tsv', sep='\t').set_index('c_ot_cls_id')
df_tit=pd.read_csv('../data_output/C_OT_TIT.tsv', sep='\t').set_index('c_ot_tit_id')
df_func=pd.read_csv('../data_output/C_OT_FUNC.tsv', sep='\t').set_index('c_ot_func_id')
df_app_ty=pd.read_csv('../data_output/APPOINTMENT_TYPE_CODES.tsv', sep='\t').set_index('c_appt_type_code')
df_txt_code=pd.read_csv('../data_output/TEXT_CODES.tsv', sep='\t').set_index('c_textid')

In [12]:
df_tit.sample(3)

c_ot_tit_id,c_ot_tit_chinm,value_to_run,c_ot_tit_desc,c_ot_tit_start,c_ot_tit_end,length
783,賓客,1.0,,,,2
327,大學士,1.0,,,,3
241,掌寺事,1.0,,,,3


### Choose one of the two options below.

#### Use coding_value.

In [13]:
df_coding_value=pd.read_excel('https://docs.google.com/spreadsheets/d/e/2PACX-1vQwXjRmlMR9w2ZV2tcenPSz9UgE7WAgeumGxxCJlceQOZRQFgm6_mgMCAlC_GzM0yxxNsDOlU1-5aH-/pub?output=xlsx',
                              sheet_name='merged_tbl_coding'
                             )[['c_office_id', 'coding_value']].set_index('c_office_id')

In [14]:
df_coding_value.sample(3)

c_office_id,coding_value
148.0,皇族宮廷類宗室門王府長史司典祠
1666.0,中央輔佐官署類寺監門司天監保章正
70477.0,中央中樞官署類六部門戶部尚書


In [15]:
for c_office_id in tqdm(df_office_ming_merged.index):
    df_office_ming_merged.loc[c_office_id, 'c_ot_coding']=df_coding_value.loc[c_office_id, 'coding_value']

100%|██████████| 4304/4304 [00:05<00:00, 828.80it/s]


#### Use UCI value.

In [16]:
df_office_ming_merged['c_ot_coding']=df_office_ming_merged['uci_value']

### Begin to replace.

In [17]:
# Replace titles (only one title in an office title string).
for ming_ot_index in tqdm(df_office_ming_merged.index):
    ming_ot = df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']
    ming_ot_done=[]
    for tit_index in df_tit.index:
        tit=df_tit.loc[tit_index, 'c_ot_tit_chinm']
        if ming_ot.endswith(tit) and ming_ot not in ming_ot_done:
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_tit_chinm']=tit
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']=re.findall(r'(.+)'+tit, ming_ot)[0]+'T'+str(tit_index)
            ming_ot_done.append(ming_ot)
df_office_ming_merged.sample(3)

100%|██████████| 4304/4304 [04:43<00:00, 15.16it/s]


c_office_id,uci_value,cbdb_value,tts_sysno,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,c_notes,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old,c_dy,c_ot_coding,c_ot_tit_chinm
352.0,皇族宮廷類宦官門_都知監_掌印太監,,,,,,,,,,,,,,,,,皇族宮廷類宦官門_都知監_T1381,掌印太監
885.0,中央中樞官署類中書省門_左司_檢校,,,,,,,,,,,,,,,,,中央中樞官署類中書省門_左司_T924,檢校
2612.0,地方官署類京府門_應天府_關副使,,,,,,,,,,,,,,,,,地方官署類京府門_應天府_關T1047,副使
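
The title step boils down to "find the longest title the string ends with and replace it with `T<id>`". A minimal sketch of that idea follows. `code_title` is a hypothetical helper; the ID 1381 for 掌印太監 comes from the sample output above, while the ID for 太監 is invented. Slicing the suffix off avoids building a regular expression from the title text.

```python
def code_title(office, titles):
    """Replace the longest title that `office` ends with by 'T<id>'.

    `titles` maps title string -> id. Returns (coded_string, matched_title),
    or (office, None) when no title matches.
    """
    by_length = sorted(titles.items(), key=lambda kv: len(kv[0]), reverse=True)
    for tit, tit_id in by_length:
        if office.endswith(tit):
            return office[:len(office) - len(tit)] + 'T' + str(tit_id), tit
    return office, None


titles = {'太監': 9999, '掌印太監': 1381}  # 9999 is a made-up id
coded, matched = code_title('皇族宮廷類宦官門_都知監_掌印太監', titles)
print(coded, matched)
```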


#### Run on first part.

In [18]:
# Replace Classifications (can have multiple units in an office title string).
for ming_ot_index in tqdm(df_office_ming_merged.index):
    cls_list=[]
    for cls_index in df_cls.index:
        cls=df_cls.loc[cls_index, 'c_ot_cls_chinm']
        c_ot_coding = df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding'].split('_') # Split into parts; only the first part (classifications) is matched here.
        if cls in c_ot_coding[0]:
            cls_list.append(cls)
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding[0].replace(cls, 'C'+str(cls_index))+'_'+c_ot_coding[1]+'_'+c_ot_coding[2] # Re-attach the remaining parts.
    if cls_list!=[]:
        df_office_ming_merged.loc[ming_ot_index, 'c_ot_cls_chinm']='#'.join(cls_list)
df_office_ming_merged.sample(3)

100%|██████████| 4304/4304 [01:09<00:00, 61.75it/s]


c_office_id,uci_value,cbdb_value,tts_sysno,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,c_notes,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old,c_dy,c_ot_coding,c_ot_tit_chinm,c_ot_cls_chinm
744.0,皇族宮廷類宦官門_王府官（洪武舊制）_司藥,,,,,,,,,,,,,,,,,C23C47_王府官（洪武舊制）_T1330,司藥,皇族宮廷類#宦官門
1729.0,中央輔佐官署類寺監門_醫學提舉司_醫學教授,,,,,,,,,,,,,,,,,C8C44_醫學提舉司_T1549,醫學教授,中央輔佐官署類#寺監門
70084.0,京衛京營與中央軍事官署類京營門_五軍營_參將,參將銜,14821.0,can jiang xian,,,Nominal Assistant Regional Commander,,,,,Nominal Assistant Regional Commander,,,,0.0,,C0C40_五軍營_T960,參將,京衛京營與中央軍事官署類#京營門
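
The classification, administrative-unit, function, and appointment-type cells all follow the same pattern: split on `_`, replace matches inside one part, and join the parts again. A hypothetical helper `code_part` sketches that shared pattern; the IDs 23 and 47 are taken from the sample output above, and the helper is not part of the original notebook.

```python
def code_part(coding, part_no, vocab, prefix):
    """Replace vocabulary terms inside one underscore-separated part.

    Returns the new coding string and the matched terms joined by '#'.
    """
    parts = coding.split('_')
    found = []
    for term, term_id in vocab.items():
        if term in parts[part_no]:
            found.append(term)
            parts[part_no] = parts[part_no].replace(term, prefix + str(term_id))
    return '_'.join(parts), '#'.join(found)


cls_vocab = {'皇族宮廷類': 23, '宦官門': 47}
coded, matched = code_part('皇族宮廷類宦官門_都知監_T1381', 0, cls_vocab, 'C')
print(coded, matched)
```

Passing the part number and prefix as arguments would let one function serve every step, for example `code_part(coding, 1, adm_vocab, 'A')` for administrative units, where `adm_vocab` is again an assumed mapping.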


#### Run on second part.

In [19]:
# Replace admin units (can have multiple units in an office title string).
for ming_ot_index in tqdm(df_office_ming_merged.index):
    adm_list=[]
    for adm_index in df_adm.index:
        adm=df_adm.loc[adm_index, 'c_ot_adm_chinm']
        c_ot_coding = df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding'].split('_')
        if adm in c_ot_coding[1]:
            adm_list.append(adm)
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding[0]+'_'+c_ot_coding[1].replace(adm, 'A'+str(adm_index))+'_'+c_ot_coding[2]
    if adm_list!=[]:
        df_office_ming_merged.loc[ming_ot_index, 'c_ot_adm_chinm']='#'.join(adm_list)
df_office_ming_merged.sample(3)

100%|██████████| 4304/4304 [03:47<00:00, 18.94it/s]


c_office_id,uci_value,cbdb_value,tts_sysno,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,...,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old,c_dy,c_ot_coding,c_ot_tit_chinm,c_ot_cls_chinm,c_ot_adm_chinm
1315.0,中央中樞官署類六部門_工部_水部郎中,,,,,,,,,,...,,,,,,,C11C45_A760_水部T1326,郎中,中央中樞官署類#六部門,工部
174.0,皇族宮廷類宗室門_王相府_典膳副,,,,,,,,,,...,,,,,,,C23C55_A1023_T438,典膳副,皇族宮廷類#宗室門,王相府
169.0,皇族宮廷類宗室門_王相府_典寶正,,,,,,,,,,...,,,,,,,C23C55_A1023_T325,典寶正,皇族宮廷類#宗室門,王相府


In [20]:
# Run Classifications on second part.
for ming_ot_index in tqdm(df_office_ming_merged.index):
    cls_list=[]
    for cls_index in df_cls.index:
        cls=df_cls.loc[cls_index, 'c_ot_cls_chinm']
        c_ot_coding = df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding'].split('_') # Split into parts; only the second part is matched here.
        if cls in c_ot_coding[1]:
            cls_list.append(cls)
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding[0]+'_'+c_ot_coding[1].replace(cls, 'C'+str(cls_index))+'_'+c_ot_coding[2] # Re-attach the other parts.
    if cls_list!=[]:
        df_office_ming_merged.loc[ming_ot_index, 'c_ot_cls_chinm']='#'.join(cls_list)
df_office_ming_merged.sample(3)

100%|██████████| 4304/4304 [00:44<00:00, 96.67it/s]


c_office_id,uci_value,cbdb_value,tts_sysno,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,...,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old,c_dy,c_ot_coding,c_ot_tit_chinm,c_ot_cls_chinm,c_ot_adm_chinm
70556.0,地方軍事與治安機構類地區軍官門_福建軍官_建總兵官,建總兵官,15293.0,jian zong bing guan,,,[Not Yet Translated],,,,...,[Not Yet Translated],,,,0.0,,C1C14_C103_建T258,總兵官,福建軍官,
70174.0,中央中樞官署類內閣門__殿大學士,殿大學士,14911.0,dian da xue shi,,,Grand Secretary of a Hall,,,,...,Grand Secretary of a Hall,,,,0.0,,C11C53__殿T327,大學士,中央中樞官署類#內閣門,
1239.0,中央中樞官署類六部門_刑部_司獄司司獄,,,,,,,,,,...,,,,,,,C11C45_A692_司獄司T1106,司獄,中央中樞官署類#六部門,刑部


#### Run on third part.

In [21]:
# Replace admin units (can have multiple units in an office title string).
for ming_ot_index in tqdm(df_office_ming_merged.index):
    adm_list=[]
    for adm_index in df_adm.index:
        adm=df_adm.loc[adm_index, 'c_ot_adm_chinm']
        c_ot_coding = df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding'].split('_')
        if adm in c_ot_coding[2]:
            adm_list.append(adm)
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding[0]+'_'+c_ot_coding[1]+'_'+c_ot_coding[2].replace(adm, 'A'+str(adm_index))
    if adm_list!=[]:
        df_office_ming_merged.loc[ming_ot_index, 'c_ot_adm_chinm']='#'.join(adm_list)
df_office_ming_merged.sample(3)

100%|██████████| 4304/4304 [03:38<00:00, 19.74it/s]


c_office_id,uci_value,cbdb_value,tts_sysno,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,...,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old,c_dy,c_ot_coding,c_ot_tit_chinm,c_ot_cls_chinm,c_ot_adm_chinm
70116.0,中央輔佐官署類秘書門_典籍實錄修纂_承天大誌總裁官,承天大誌總裁官,14853.0,cheng tian da zhi zong cai guan,,,Director-general,,,,...,[Not Yet Translated],,,,0.0,,C8C48_典籍實錄修纂_承天大誌T144,總裁官,中央輔佐官署類#秘書門,
318.0,皇族宮廷類宦官門_印綬監_僉書太監,,,,,,,,,,...,,,,,,,C23C47_A1042_T1385,僉書太監,皇族宮廷類#宦官門,印綬監
71631.0,中央輔佐官署類秘書門_翰林院_纂修兼校正官,纂修兼校正官,16368.0,zuan xiu jian jiao zheng guan,,,Compiler and Editor,,,,...,Compiler and Editor,,,,0.0,,C8C48_A347_纂修兼T563,校正官,中央輔佐官署類#秘書門,翰林院


In [22]:
# Replace functional units (can have multiple units in an office title string).
for ming_ot_index in tqdm(df_office_ming_merged.index):
    func_list=[]
    for func_index in df_func.index:
        func=df_func.loc[func_index, 'c_ot_func_chinm']
        c_ot_coding = df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding'].split('_')
        if func in c_ot_coding[2]:
            func_list.append(func)
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding[0]+'_'+c_ot_coding[1]+'_'+c_ot_coding[2].replace(func, 'F'+str(func_index))
    if func_list!=[]:
        df_office_ming_merged.loc[ming_ot_index, 'c_ot_func_chinm']='#'.join(func_list)
df_office_ming_merged.sample(3)

100%|██████████| 4304/4304 [00:32<00:00, 131.67it/s]


c_office_id,uci_value,cbdb_value,tts_sysno,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,...,c_category_2,c_category_3,c_category_4,c_office_id_old,c_dy,c_ot_coding,c_ot_tit_chinm,c_ot_cls_chinm,c_ot_adm_chinm,c_ot_func_chinm
1726.0,中央輔佐官署類寺監門_醫學提舉司_提舉,,,,,,,,,,...,,,,,,C8C44_A1214_T1228,提舉,中央輔佐官署類#寺監門,醫學提舉司,
70278.0,地方軍事與治安機構類地區軍官門_掛印將軍_奉命大將軍,奉命大將軍,15015.0,feng ming da jiang jun,,,General-in-Chief by Order,,,,...,,,,0.0,,C1C14_C81_T1436,奉命大將軍,掛印將軍,,
2880.0,地方軍事與治安機構類地區軍官門_山西軍官_遊擊將軍,,,,,,,,,,...,,,,,,C1C14_C92_T1682,遊擊將軍,山西軍官,,


In [None]:
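# Replace text code (not executed in this run).
# Note: matches are written to `c_ot_func_chinm`, which would overwrite the functional units found above.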
for ming_ot_index in tqdm(df_office_ming_merged.index):
    txt_list=[]
    for txt_index in df_txt_code.index:
        txt=df_txt_code.loc[txt_index, 'c_title_chn']
        c_ot_coding = df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']
        if txt in c_ot_coding:
            txt_list.append(txt)
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding.replace(txt, 'B'+str(txt_index))
    if txt_list!=[]:
        df_office_ming_merged.loc[ming_ot_index, 'c_ot_func_chinm']='#'.join(txt_list)
df_office_ming_merged.sample(3)

In [23]:
# Replace appointment type.
for ming_ot_index in tqdm(df_office_ming_merged.index):
    app_ty_list=[]
    for app_ty_index in df_app_ty.index:
        app_ty=df_app_ty.loc[app_ty_index, 'c_appt_type_desc_chn']
        c_ot_coding = df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding'].split('_')
        if app_ty in c_ot_coding[2]:
            app_ty_list.append(app_ty)
            df_office_ming_merged.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding[0]+'_'+c_ot_coding[1]+'_'+c_ot_coding[2].replace(app_ty, 'P'+str(app_ty_index))
    if app_ty_list!=[]:
        df_office_ming_merged.loc[ming_ot_index, 'c_ot_app_chinm']='#'.join(app_ty_list)
df_office_ming_merged.sample(3)

100%|██████████| 4304/4304 [00:30<00:00, 140.07it/s]


c_office_id,uci_value,cbdb_value,tts_sysno,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,...,c_category_3,c_category_4,c_office_id_old,c_dy,c_ot_coding,c_ot_tit_chinm,c_ot_cls_chinm,c_ot_adm_chinm,c_ot_func_chinm,c_ot_app_chinm
1907.0,司法監察機構類司法門_大理寺_右司都評事,,,,,,,,,,...,,,,,C7C46_A278_A1196T471,都評事,司法監察機構類#司法門,右司,,
2625.0,地方官署類省官門_承宣布政使司_照磨所檢校,,,,,,,,,,...,,,,,C17C54_A1132_A562T924,檢校,地方官署類#省官門,照磨所,,
70283.0,牧鹽舶政類市舶司門_市舶提舉司_副提舉,巿舶司副提舉,15020.0,po bo si fu ti ju,,,Vice Maritime Trade Superisorate,,,,...,,,0.0,,C21C29_A1131_T732,副提舉,牧鹽舶政類#市舶司門,市舶提舉司,,


In [24]:
for index in tqdm(df_office_ming_merged.index):
    c_ot_coding=df_office_ming_merged.loc[index, 'c_ot_coding']
    if re.sub(r'A|C|_|T|F|P|（|）|B|\d', '', string=c_ot_coding)!='':
        df_office_ming_merged.loc[index, 'pass']='F'
    else:
        df_office_ming_merged.loc[index, 'pass']='T'

100%|██████████| 4304/4304 [00:07<00:00, 545.48it/s]
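
The pass/fail check above treats a string as fully coded when nothing remains after removing the code prefixes, underscores, digits, and full-width parentheses. The sketch below shows the same test on two strings taken from the sample outputs above; `is_fully_coded` is a hypothetical helper name.

```python
import re

# Everything a fully coded string may contain: code prefixes, underscores,
# full-width parentheses, and digits.
_CODE_CHARS = re.compile(r'[ACTFPB_（）\d]')


def is_fully_coded(coding):
    """Return True when no uncoded characters are left in `coding`."""
    return _CODE_CHARS.sub('', coding) == ''


print(is_fully_coded('C23C47_A1042_T1385'))
print(is_fully_coded('C11C45_A760_水部T1326'))
```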


In [25]:
# Retain the 'type' column and the other manually maintained columns from the online sheet.
df_office_ming_merged_coded=pd.read_excel('https://docs.google.com/spreadsheets/d/e/2PACX-1vQwXjRmlMR9w2ZV2tcenPSz9UgE7WAgeumGxxCJlceQOZRQFgm6_mgMCAlC_GzM0yxxNsDOlU1-5aH-/pub?output=xlsx',
                                          sheet_name='merged_tbl_coding'
                                         )
df_office_ming_merged_coded.set_index('c_office_id', inplace=True)
for c_office_id in df_office_ming_merged.index:
    df_office_ming_merged.loc[c_office_id, 'type']=df_office_ming_merged_coded.loc[c_office_id, 'type']
    df_office_ming_merged.loc[c_office_id, 'c_title_chn']=df_office_ming_merged_coded.loc[c_office_id, 'c_title_chn']
    df_office_ming_merged.loc[c_office_id, 'book_raw']=df_office_ming_merged_coded.loc[c_office_id, 'book_raw']
    df_office_ming_merged.loc[c_office_id, 'c_textid']=df_office_ming_merged_coded.loc[c_office_id, 'c_textid']

In [26]:
df_office_ming_merged.to_excel('../data_output/ming_office_title_merged_coding.xlsx')