### Script purpose: Ming office title coding

1. General principles:
    - A complete ontological structure for an office title has four parts: `Classification + Administrative Unit (optional) + Function (optional) + Title`
    - Each part corresponds to a table.
    - Keep `coding_value` and `raw_value` separate.
        - `raw_value`: the string as it appears in the original book text.
        - `coding_value`: the revised string that can be coded successfully.

2. Notes:
    - The `Office title by LENGTH` table merges the CBDB Ming office titles with the UCI table. Duplicates in the CBDB table have been removed from it, so this is the clean table we are going to use.
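
As a rough illustration of the principles above, the sketch below keeps the raw string, the revised string, and the resulting code side by side, and splits a coded string back into its parts. The class `OfficeTitle` and the helper `parse_coding` are hypothetical names, not part of the notebook. The prefixes `C`, `A` and `T` are the ones used by the coding cells further down; no prefix for the Function part appears in this notebook, so none is assumed here.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OfficeTitle:
    raw_value: str      # string as printed in the original book text
    coding_value: str   # revised string that the coding step can handle
    c_ot_coding: str    # coded result, e.g. 'A76T608'


# One token is a prefix letter (C, A or T) followed by a numeric table id.
TOKEN = re.compile(r'([CAT])(\d+)')


def parse_coding(c_ot_coding: str) -> List[Tuple[str, int]]:
    """Return the (prefix, id) pairs found in a coded string, in order."""
    return [(prefix, int(num)) for prefix, num in TOKEN.findall(c_ot_coding)]


ot = OfficeTitle(raw_value='交阯清吏司員外郎',
                 coding_value='交阯清吏司員外郎',
                 c_ot_coding='A76T608')
print(parse_coding(ot.c_ot_coding))
```

Any uncoded Chinese text left in the string (as in `儀鸞司T1214`) is simply skipped by the pattern, which makes partially coded rows easy to spot.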

In [1]:
%matplotlib inline
import sqlite3
import pandas as pd
import networkx as nx
import xlrd
import matplotlib.pyplot as plt
import math
import warnings
from tqdm import tqdm
import re
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

### `c_office_chn` from UCI.

In [132]:
df_uci_office_ming=pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSCmhbCk1B-9jjINMhy_VwikM6_Sn7bjdO7b_vaZJkVcYCCYlWVlhYVCFtAs0fPX-UEO62GWxaX1qAS/pub?gid=630627340&single=true&output=tsv',
                                    sep='\t')
df_uci_office_ming=df_uci_office_ming[['c_office_id（Dictionary Ser#)','Institution 1', 'Institution 2', 'Institution 3', 'c_office_chn']].rename(columns={'c_office_id（Dictionary Ser#)':'c_office_id'})
# Strip the '/' separators; the vectorized .str accessor also tolerates missing values.
df_uci_office_ming['c_office_chn']=df_uci_office_ming['c_office_chn'].str.replace('/', '', regex=False)
df_uci_office_ming.sample(3)

Unnamed: 0,c_office_id,Institution 1,Institution 2,Institution 3,c_office_chn
164,71153,中央中樞官署類 The Central Government,六部門 Six Ministries,刑部 The Ministry of Justice,浙江清吏司員外郎
981,1418,中央輔佐官署類 Central Administration Assistance,秘書門 Secretary Offices,中書科 The Central Drafting Office,內閣誥敕房中書舍人
1926,71056,司法監察機構類 Legislation and Censorship,監察門 Censorate,總督巡撫官 Supreme Commanders and Grand Coordinators,五省總督


In [133]:
df_uci_office_ming['inst_1_chn']=[str(s).split()[0].replace('nan', '') for s in df_uci_office_ming['Institution 1']]
df_uci_office_ming['inst_2_chn']=[str(s).split()[0].replace('nan', '') for s in df_uci_office_ming['Institution 2']]
df_uci_office_ming['inst_3_chn']=[str(s).split()[0].replace('nan', '') for s in df_uci_office_ming['Institution 3']]
df_uci_office_ming['uci_value']=df_uci_office_ming['inst_1_chn']+df_uci_office_ming['inst_2_chn']+df_uci_office_ming['inst_3_chn']+df_uci_office_ming['c_office_chn']
df_uci_office_ming['c_office_id']=pd.to_numeric(df_uci_office_ming['c_office_id'], errors='coerce')
df_uci_office_ming.drop(['inst_1_chn', 'inst_2_chn', 'inst_3_chn', 'Institution 1', 'Institution 2', 'Institution 3', 'c_office_chn'], axis=1, inplace=True)
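
The cell above keeps only the Chinese part of each institution label (the text before the first space) and concatenates the parts into `uci_value`. A minimal vectorized sketch of the same idea follows; `chinese_part` is a hypothetical helper and the two-row frame is toy data, with values borrowed from the sample output above.

```python
import pandas as pd


def chinese_part(col: pd.Series) -> pd.Series:
    """Keep the text before the first whitespace; missing values become ''."""
    # An empty string splits to an empty list, so .str[0] yields NaN there;
    # the final fillna turns that back into an empty string.
    return col.fillna('').astype(str).str.split().str[0].fillna('')


demo = pd.DataFrame({
    'Institution 1': ['中央中樞官署類 The Central Government', None],
    'c_office_chn': ['浙江清吏司員外郎', '皇后'],
})
demo['uci_value'] = chinese_part(demo['Institution 1']) + demo['c_office_chn']
print(demo['uci_value'].tolist())
```

Filling missing values before splitting avoids the `str(s)` / `replace('nan', '')` round trip, which would also strip a literal `nan` from a genuine label.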

In [134]:
df_uci_office_ming[df_uci_office_ming['c_office_id'].duplicated()]

Unnamed: 0,c_office_id,uci_value
1130,71508.0,中央輔佐官署類秘書門翰林院直文淵閣侍講學士
1195,71503.0,中央輔佐官署類考官門會試官知貢舉官
1219,72165.0,中央輔佐官署類考官門鄉試官順天同考官
1282,,京衛京營與中央軍事官署類京營門京營京營總兵官
2314,71504.0,地方官署類省官門行中書省理問所知事
2718,71274.0,地方軍事與治安機構類招討經略安撫使門宣撫司宣撫司經歷
2821,,文武散階勛爵類勛爵門伯平涼伯
2842,,文武散階勛爵類勛爵門伯新城伯
2862,,文武散階勛爵類勛爵門伯永定伯
2882,,文武散階勛爵類勛爵門伯鎮遠伯


In [135]:
df_uci_office_ming.drop(df_uci_office_ming[df_uci_office_ming['c_office_id'].duplicated()].index, inplace=True)
df_uci_office_ming.set_index('c_office_id', inplace=True)
df_uci_office_ming.sample(3)

c_office_id,uci_value
70306.0,地方官署類府官門府官同知
1886.0,司法監察機構類監察門總督巡撫官鄖陽撫治
71913.0,文武散階勛爵類勛爵門公寧國公
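
Note that `duplicated()` treats missing ids as duplicates of one another, which is why the rows with an empty `c_office_id` show up in the listing above and all but the first of them are dropped. The toy series below (made-up ids) sketches the difference between the default and `keep=False`, which can help when inspecting every member of a duplicate group before dropping anything.

```python
import numpy as np
import pandas as pd

# Toy ids: one repeated value and two missing values.
ids = pd.Series([1.0, 2.0, 2.0, np.nan, np.nan])

# Default keep='first': only the later occurrences are flagged.
later_only = ids.duplicated().tolist()

# keep=False: every member of a duplicate group is flagged.
all_members = ids.duplicated(keep=False).tolist()

print(later_only)
print(all_members)
```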


### `c_office_chn` from CBDB (uncleaned).

In [137]:
conn = sqlite3.connect('../../SQL/20170424CBDBauUserSqlite.db')
# Read the table once, then keep only the Ming rows (c_dy == 19), indexed by office id.
df_cbdb_office=pd.read_sql_query("SELECT * FROM OFFICE_CODES", conn)
df_cbdb_office_ming=df_cbdb_office[df_cbdb_office.c_dy==19].set_index('c_office_id')
df_cbdb_office_ming.sample(3)

c_office_id,tts_sysno,c_dy,c_office_pinyin,c_office_chn,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,c_notes,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old
70230,16743,19.0,du zhi hui qian shi,都指揮僉事,,,Regional Assistant Military Commisioner,,,,,Regional Assistant Military Commisioner,,,,0
70899,17412,19.0,tai shi ling,太史令,,,Director of the Directorate of Astronomy,,,,,Director of the Directorate of Astronomy,,,,0
70418,16963,19.0,han lin yuan zuan xiu jian jiao zheng guan,翰林院纂修兼校正官,,,Compiler and Editor of the Hanlin Academy,,,,,Compiler and Editor of the Hanlin Academy,,,,0
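
The dynasty filter can also be pushed into the SQL query so that only the Ming rows are loaded. The sketch below uses an in-memory SQLite database with a cut-down, made-up `OFFICE_CODES` table (three columns, two toy rows) purely for illustration; the real table has many more columns.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the CBDB database; the rows are illustrative only.
conn_demo = sqlite3.connect(':memory:')
conn_demo.execute(
    "CREATE TABLE OFFICE_CODES (c_office_id INTEGER, c_dy INTEGER, c_office_chn TEXT)")
conn_demo.executemany(
    "INSERT INTO OFFICE_CODES VALUES (?, ?, ?)",
    [(70511, 19, '皇后'), (1, 15, 'placeholder')])
conn_demo.commit()

# Filter on the dynasty code in SQL and index by office id in one call.
df_ming_demo = pd.read_sql_query(
    "SELECT * FROM OFFICE_CODES WHERE c_dy = ?",
    conn_demo, params=(19,), index_col='c_office_id')
print(df_ming_demo)
```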


In [140]:
for index in df_uci_office_ming.index:
    if index in df_cbdb_office_ming.index:
        df_uci_office_ming.loc[index, 'cbdb_value']=df_cbdb_office_ming.loc[index, 'c_office_chn']
        df_uci_office_ming.loc[index, 'tts_sysno']=df_cbdb_office_ming.loc[index, 'tts_sysno']
        df_uci_office_ming.loc[index, 'c_office_pinyin']=df_cbdb_office_ming.loc[index, 'c_office_pinyin']
        df_uci_office_ming.loc[index, 'c_office_pinyin_alt']=df_cbdb_office_ming.loc[index, 'c_office_pinyin_alt']
        df_uci_office_ming.loc[index, 'c_office_chn_alt']=df_cbdb_office_ming.loc[index, 'c_office_chn_alt']
        df_uci_office_ming.loc[index, 'c_office_trans']=df_cbdb_office_ming.loc[index, 'c_office_trans']
        df_uci_office_ming.loc[index, 'c_office_trans_alt']=df_cbdb_office_ming.loc[index, 'c_office_trans_alt']
        df_uci_office_ming.loc[index, 'c_source']=df_cbdb_office_ming.loc[index, 'c_source']
        df_uci_office_ming.loc[index, 'c_pages']=df_cbdb_office_ming.loc[index, 'c_pages']
        df_uci_office_ming.loc[index, 'c_notes']=df_cbdb_office_ming.loc[index, 'c_notes']
        df_uci_office_ming.loc[index, 'c_category_1']=df_cbdb_office_ming.loc[index, 'c_category_1']
        df_uci_office_ming.loc[index, 'c_category_2']=df_cbdb_office_ming.loc[index, 'c_category_2']
        df_uci_office_ming.loc[index, 'c_category_3']=df_cbdb_office_ming.loc[index, 'c_category_3']
        df_uci_office_ming.loc[index, 'c_category_4']=df_cbdb_office_ming.loc[index, 'c_category_4']
        df_uci_office_ming.loc[index, 'c_office_id_old']=df_cbdb_office_ming.loc[index, 'c_office_id_old']
        df_uci_office_ming.loc[index, 'c_dy']=19
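
The row-by-row copy above behaves like a left join on `c_office_id`. A minimal sketch of that alternative follows, on toy frames with only a few of the CBDB columns; it assumes both indexes hold comparable id values and is not meant as a drop-in replacement for the cell above.

```python
import pandas as pd

# Toy UCI side: one id that exists in CBDB and one that does not.
uci_demo = pd.DataFrame(
    {'uci_value': ['皇族宮廷類帝后門后皇后', '皇族宮廷類女官門尚儀局司籍司掌籍']},
    index=pd.Index([70511, 766], name='c_office_id'))

# Toy CBDB side with a subset of the real columns.
cbdb_demo = pd.DataFrame(
    {'c_office_chn': ['皇后'], 'c_office_pinyin': ['huang hou'], 'c_dy': [19]},
    index=pd.Index([70511], name='c_office_id'))

# A left join keeps every UCI row; ids missing from CBDB get NaN in the CBDB columns.
merged_demo = uci_demo.join(
    cbdb_demo.rename(columns={'c_office_chn': 'cbdb_value'}), how='left')
print(merged_demo)
```

Unmatched ids such as 766 keep their `uci_value` and stay empty elsewhere, matching the third row of the merged sample shown below.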

In [143]:
df_office_ming_merged=df_uci_office_ming

In [145]:
df_office_ming_merged.sample(3)

c_office_id,uci_value,cbdb_value,tts_sysno,c_dy,c_office_pinyin,c_office_pinyin_alt,c_office_chn_alt,c_office_trans,c_office_trans_alt,c_source,c_pages,c_notes,c_category_1,c_category_2,c_category_3,c_category_4,c_office_id_old
70997.0,文武散階勛爵類尉世職,尉世職,17510.0,19.0,wei shi zhi,,,Commandant by Heritage,,,,,Commandant by Heritage,,,,0.0
70511.0,皇族宮廷類帝后門后皇后,皇后,17056.0,19.0,huang hou,,,Empress,,,,,Empress,,,,0.0
766.0,皇族宮廷類女官門尚儀局司籍司掌籍,,,,,,,,,,,,,,,,


In [146]:
# .xlsx files are always UTF-8; recent pandas releases no longer accept an `encoding` argument here.
df_office_ming_merged.to_excel('../data_output/df_office_ming_merged.xlsx')

### Coding `c_office_chn`.

In [4]:
df_adm=pd.read_csv('../data_dict/C_OT_ADM.tsv', sep='\t').set_index('c_ot_adm_id')
df_cls=pd.read_csv('../data_dict/C_OT_CLS.tsv', sep='\t').set_index('c_ot_cls_id')
df_tit=pd.read_csv('../data_dict/C_OT_TIT.tsv', sep='\t').set_index('c_ot_tit_id')

In [5]:
df_tit.sample(3)

c_ot_tit_id,c_ot_tit_chinm,c_ot_tit_engnm,c_ot_tit_desc,c_ot_tit_start,c_ot_tit_end
106,漕運使,,,,
1092,前鋒,,,,
387,左監正,,,,


In [6]:
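# df_office_ming_drop_col (columns: c_office_id, c_office_chn) must already be loaded; its definition is not shown in this notebook.
df_office_ming_drop_col['c_ot_coding']=df_office_ming_drop_col['c_office_chn']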

In [7]:
# Replace the title (an office title string ends with at most one title).
for ming_ot_index in tqdm(df_office_ming_drop_col.index):
    ming_ot = df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']
    for tit_index in df_tit.index:
        tit=df_tit.loc[tit_index, 'c_ot_tit_chinm']
        if ming_ot.endswith(tit):
            df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_tit_chinm']=tit
            # Swap only the trailing title for its code; earlier occurrences stay untouched.
            df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']=ming_ot[:-len(tit)]+'T'+str(tit_index)
            # Stop at the first matching title in table order.
            break
df_office_ming_drop_col.sample(5)

100%|██████████| 4318/4318 [02:17<00:00, 31.46it/s]


Unnamed: 0,c_office_id,c_office_chn,c_ot_coding,c_ot_tit_chinm
2256,70071,右參議,T212,右參議
2628,2813,右參將,T711,右參將
1388,2150,儀鸞司大使,儀鸞司T1214,大使
2090,2600,都稅司副使,都稅司T1018,副使
4176,389,右司副,T739,右司副
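
The loop above takes the first title in table order that matches the end of the string. One alternative, sketched below under the assumption that the most specific title should win, is to try the longest titles first. `code_title` is a hypothetical helper, and the small dictionary reuses title/id pairs from the sample outputs in this notebook.

```python
from typing import Dict, Optional, Tuple


def code_title(office: str, titles: Dict[str, int]) -> Tuple[str, Optional[str]]:
    """Replace the trailing title with 'T<id>', preferring the longest match."""
    for tit in sorted(titles, key=len, reverse=True):
        # Skip empty names: every string "ends with" the empty string.
        if tit and office.endswith(tit):
            return office[:-len(tit)] + 'T' + str(titles[tit]), tit
    return office, None


titles_demo = {'參將': 931, '右參將': 711, '大使': 1214}
print(code_title('南路參將', titles_demo))
print(code_title('右參將', titles_demo))
print(code_title('儀鸞司大使', titles_demo))
```

Building a plain dictionary once also avoids the repeated `.loc` lookups inside the inner loop, which is where most of the two-minute runtime above is likely spent.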


In [8]:
# Replace administrative units (an office title string can contain several).
for ming_ot_index in tqdm(df_office_ming_drop_col.index):
    adm_list=[]
    for adm_index in df_adm.index:
        adm=df_adm.loc[adm_index, 'c_ot_adm_chinm']
        c_ot_coding = df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']
        if adm in c_ot_coding:
            adm_list.append(adm)
            df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding.replace(adm, 'A'+str(adm_index))
    if adm_list!=[]:
        df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_adm_chinm']='#'.join(adm_list)
df_office_ming_drop_col.sample(5)

100%|██████████| 4318/4318 [03:03<00:00, 23.54it/s]


Unnamed: 0,c_office_id,c_office_chn,c_ot_coding,c_ot_tit_chinm,c_ot_adm_chinm
364,989,交阯清吏司員外郎,A76T608,員外郎,交阯清吏司
1470,1982,大都督,T300,大都督,
2803,72024,咸寧伯,T664,咸寧伯,
1429,71544,左都督,T462,左都督,
2600,70718,南路參將,A837T931,參將,南路
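
Because a string may contain several units, the order in which unit names are tried matters: a short name that is a substring of a longer one can consume part of it first. The sketch below tries longer names first; `code_units` is a hypothetical helper, the same function would serve for classifications with prefix `'C'`, and the id 999 for `清吏司` is made up for the example.

```python
from typing import Dict, List, Tuple


def code_units(coding: str, units: Dict[str, int], prefix: str) -> Tuple[str, List[str]]:
    """Replace every unit name found in `coding` with '<prefix><id>', longest names first."""
    found = []
    for name in sorted(units, key=len, reverse=True):
        if name and name in coding:
            coding = coding.replace(name, prefix + str(units[name]))
            found.append(name)
    return coding, found


# '清吏司' is contained in '交阯清吏司'; its id 999 is a made-up placeholder.
adm_demo = {'交阯清吏司': 76, '清吏司': 999, '南路': 837}
print(code_units('交阯清吏司T608', adm_demo, 'A'))
print(code_units('南路T931', adm_demo, 'A'))
```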


In [9]:
# Replace classifications (an office title string can contain several).
for ming_ot_index in tqdm(df_office_ming_drop_col.index):
    cls_list=[]
    for cls_index in df_cls.index:
        cls=df_cls.loc[cls_index, 'c_ot_cls_chinm']
        c_ot_coding = df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']
        if cls in c_ot_coding:
            cls_list.append(cls)
            df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding.replace(cls, 'C'+str(cls_index))
    if cls_list!=[]:
        df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_cls_chinm']='#'.join(cls_list)

100%|██████████| 4318/4318 [00:11<00:00, 364.82it/s]


In [10]:
# .xlsx files are always UTF-8; recent pandas releases no longer accept an `encoding` argument here.
df_office_ming_drop_col.to_excel('../dump/ming_office_title_coding_UCI.xlsx')