### Script purpose: Ming office title coding

1. General principles:
    - The full ontological structure of an office title has four parts: `Classification + Administrative Unit (optional) + Function (optional) + Title`
    - Each part corresponds to a table.
    - Keep `coding_value` and `raw_value` separate.
        - `raw_value`: the string as it appears in the original book text.
        - `coding_value`: the revised string that can be successfully coded.
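
To illustrate what a successfully coded string looks like, the sketch below splits a coded value into its tokens and reports any text that was left uncoded. It assumes the prefix letters `C`, `A` and `T` (classification, administrative unit, title) that the coding cells in this notebook emit; the helper name `split_coded` is hypothetical and not part of the original script.

```python
import re

# One token = a prefix letter (C, A or T) followed by a numeric table id.
# Assumption: raw titles never contain these Latin letters followed by digits.
TOKEN = re.compile(r'([CAT])(\d+)')

def split_coded(coded):
    """Return (tokens, leftover) for a coded office title string.

    tokens   -- list of (prefix, id) pairs in order of appearance
    leftover -- the text not covered by any token (empty if fully coded)
    """
    tokens = [(prefix, int(num)) for prefix, num in TOKEN.findall(coded)]
    leftover = TOKEN.sub('', coded)
    return tokens, leftover

print(split_coded('A76T608'))      # fully coded
print(split_coded('儀鸞司T1214'))   # the admin unit is still raw text
```

A non-empty `leftover` marks a row whose `raw_value` may need revising before it can be coded completely.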


In [1]:
%matplotlib inline
import sqlite3
import pandas as pd
import networkx as nx
import xlrd
import matplotlib.pyplot as plt
import math
import warnings
from tqdm import tqdm
import re
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

### `c_office_chn` from UCI.

In [13]:
df_uci_office_ming=pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vSCmhbCk1B-9jjINMhy_VwikM6_Sn7bjdO7b_vaZJkVcYCCYlWVlhYVCFtAs0fPX-UEO62GWxaX1qAS/pub?gid=630627340&single=true&output=tsv',
                                    sep='\t')
df_uci_office_ming=df_uci_office_ming[['c_office_id（Dictionary Ser#)','Institution 1', 'Institution 2', 'Institution 3', 'c_office_chn']].rename(columns={'c_office_id（Dictionary Ser#)':'c_office_id'})
df_uci_office_ming['c_office_chn']=[s.replace('/', '') for s in df_uci_office_ming['c_office_chn']]
df_uci_office_ming.sample(3)

Unnamed: 0,c_office_id,Institution 1,Institution 2,Institution 3,c_office_chn
1916,1834,司法監察機構類 Legislation and Censorship,監察門 Censorate,御史臺 The Censorate (1367-1380),察院經歷
2058,2593,地方官署類 Regional and Local Governance,京府門 Superior Prefectural Governance,應天府 Yingtian Superior Prefecture,京縣主簿
2290,70943,地方官署類 Regional and Local Governance,省官門 Provincial Governance,提刑按察使司 Provincial Surveillance Commission,經歷司經歷


### `c_office_chn` from CBDB uncleaned.

In [12]:
conn = sqlite3.connect('../../SQL/20170424CBDBauUserSqlite.db')
# Filter on c_dy == 19 inside the query so the table is read only once.
df_cbdb_office_ming=pd.read_sql_query("SELECT * FROM OFFICE_CODES WHERE c_dy = 19", conn)

### Coding `c_office_chn`.

In [4]:
df_adm=pd.read_csv('../data_dict/C_OT_ADM.tsv', sep='\t').set_index('c_ot_adm_id')
df_cls=pd.read_csv('../data_dict/C_OT_CLS.tsv', sep='\t').set_index('c_ot_cls_id')
df_tit=pd.read_csv('../data_dict/C_OT_TIT.tsv', sep='\t').set_index('c_ot_tit_id')

In [5]:
df_tit.sample(3)

c_ot_tit_id,c_ot_tit_chinm,c_ot_tit_engnm,c_ot_tit_desc,c_ot_tit_start,c_ot_tit_end
106,漕運使,,,,
1092,前鋒,,,,
387,左監正,,,,


In [6]:
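# NOTE: df_office_ming_drop_col is not defined in the cells shown; judging by the later output it presumably holds the c_office_id and c_office_chn columns of df_uci_office_ming.
df_office_ming_drop_col['c_ot_coding']=df_office_ming_drop_col['c_office_chn']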

In [7]:
# Replace the title at the end of each string (one title per office title string).
# The first title in df_tit that matches the suffix wins; only that suffix is replaced.
for ming_ot_index in tqdm(df_office_ming_drop_col.index):
    ming_ot = df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']
    for tit_index in df_tit.index:
        tit=df_tit.loc[tit_index, 'c_ot_tit_chinm']
        if ming_ot.endswith(tit):
            df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_tit_chinm']=tit
            # Slice off the suffix instead of str.replace, which would also hit earlier occurrences.
            df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']=ming_ot[:len(ming_ot)-len(tit)]+'T'+str(tit_index)
            break
df_office_ming_drop_col.sample(5)

100%|██████████| 4318/4318 [02:17<00:00, 31.46it/s]


Unnamed: 0,c_office_id,c_office_chn,c_ot_coding,c_ot_tit_chinm
2256,70071,右參議,T212,右參議
2628,2813,右參將,T711,右參將
1388,2150,儀鸞司大使,儀鸞司T1214,大使
2090,2600,都稅司副使,都稅司T1018,副使
4176,389,右司副,T739,右司副
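
Because the loop stops at the first title in `df_tit` that matches the end of the string, the result depends on table order: a short title listed early could win over a longer title that also matches. A minimal sketch of a longest-suffix alternative follows. The function name `code_title_suffix` and the miniature lookup dict are hypothetical; the ids are taken from the sample outputs in this notebook.

```python
def code_title_suffix(office, title_ids):
    """Replace the longest title found at the end of `office` with 'T<id>'.

    title_ids maps a title string to its id. Returns (coded, title), or
    (office, None) when no title matches.
    """
    # Try longer titles first so that e.g. 右參將 is preferred over 參將.
    for title in sorted(title_ids, key=len, reverse=True):
        if title and office.endswith(title):
            return office[:-len(title)] + 'T' + str(title_ids[title]), title
    return office, None

# Hypothetical miniature of the title table (ids from the sample outputs).
title_ids = {'參將': 931, '右參將': 711, '大使': 1214, '副使': 1018}

for office in ['右參將', '南路參將', '儀鸞司大使']:
    print(office, code_title_suffix(office, title_ids))
```

For the full table, sorting the titles once outside the function would avoid repeating the sort for every row.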


In [8]:
# Replace admin units (can be multiple units in an office title string).
for ming_ot_index in tqdm(df_office_ming_drop_col.index):
    adm_list=[]
    for adm_index in df_adm.index:
        adm=df_adm.loc[adm_index, 'c_ot_adm_chinm']
        c_ot_coding = df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']
        if adm in c_ot_coding:
            adm_list.append(adm)
            df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding.replace(adm, 'A'+str(adm_index))
    if adm_list!=[]:
        df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_adm_chinm']='#'.join(adm_list)
df_office_ming_drop_col.sample(5)

100%|██████████| 4318/4318 [03:03<00:00, 23.54it/s]


Unnamed: 0,c_office_id,c_office_chn,c_ot_coding,c_ot_tit_chinm,c_ot_adm_chinm
364,989,交阯清吏司員外郎,A76T608,員外郎,交阯清吏司
1470,1982,大都督,T300,大都督,
2803,72024,咸寧伯,T664,咸寧伯,
1429,71544,左都督,T462,左都督,
2600,70718,南路參將,A837T931,參將,南路
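
The nested loop tests every administrative unit against every row, which is why this cell runs for about three minutes. One possible alternative, sketched here under assumptions, compiles all unit names into a single regular expression and substitutes them in one pass. The function name `code_admin_units` and the two-entry lookup dict are hypothetical; the ids come from the sample output above.

```python
import re

def code_admin_units(coded, adm_ids):
    """Replace every admin-unit name found in `coded` with 'A<id>'.

    adm_ids maps a unit name to its id and is assumed to be non-empty.
    Returns (coded, matched_names).
    """
    # Longest names first, so a long unit name is not split by a shorter one.
    names = sorted(adm_ids, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(n) for n in names))
    matched = []

    def to_code(m):
        # Record the raw name, then emit its code.
        matched.append(m.group(0))
        return 'A' + str(adm_ids[m.group(0)])

    return pattern.sub(to_code, coded), matched

# Hypothetical miniature of the admin-unit table (ids from the sample output).
adm_ids = {'交阯清吏司': 76, '南路': 837}
print(code_admin_units('交阯清吏司T608', adm_ids))
print(code_admin_units('南路T931', adm_ids))
```

Joining `matched_names` with `'#'` would give the same kind of value the loop stores in `c_ot_adm_chinm`. In practice the pattern would be compiled once and reused for all rows.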


In [9]:
# Replace classifications (an office title string can contain several).
for ming_ot_index in tqdm(df_office_ming_drop_col.index):
    cls_list=[]
    for cls_index in df_cls.index:
        cls=df_cls.loc[cls_index, 'c_ot_cls_chinm']
        c_ot_coding = df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']
        if cls in c_ot_coding:
            cls_list.append(cls)
            df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_coding']=c_ot_coding.replace(cls, 'C'+str(cls_index))
    if cls_list!=[]:
        df_office_ming_drop_col.loc[ming_ot_index, 'c_ot_cls_chinm']='#'.join(cls_list)

100%|██████████| 4318/4318 [00:11<00:00, 364.82it/s]


In [10]:
# The `encoding` argument is dropped here: recent pandas releases no longer accept it in to_excel.
df_office_ming_drop_col.to_excel('../dump/ming_office_title_coding_UCI.xlsx')