### Workflow: Extracting information by subtracting.

1. Reduce the "noise" as much as possible.
    - Cut the content into sentences.
    - Mark sentences using the kinship dictionary.
    - Drop sentences without kinship words.
2. Sentence compression.
    - Some sentences have an office title or place name before the person name.
    - Extract office titles (marked as `"/no_noc/"`) and places (marked as `"/ns/"`) into separate columns.
    - Compress (subtract) the sentences so that they contain only kinship titles and person names.
3. Kinship information extraction.
    - Run regular expressions on the compressed sentences.
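
The three steps can be sketched end to end on a toy input. This is only an illustration under stated assumptions: the dictionaries below are tiny hypothetical stand-ins for the real files, the input text is constructed, and the final regular expression is a placeholder rather than the pattern used for actual extraction.

```python
import re

# Hypothetical toy dictionaries; the real ones are loaded from files later on.
KIN_WORDS = ["曾祖", "祖", "父"]
PLACE_NAMES = ["宣城"]
OFFICE_TITLES = ["禮部尚書", "縣令"]

def split_sentences(text):
    """Step 1a: cut the content into sentences on full stops."""
    return [s for s in text.split("。") if s]

def has_kinship(sent):
    """Step 1b: keep only sentences that contain a kinship word."""
    return any(k in sent for k in KIN_WORDS)

def compress(sent):
    """Step 2: replace places and office titles with tags, collecting them."""
    places, offices = [], []
    for ns in PLACE_NAMES:
        places += [ns] * sent.count(ns)
        sent = sent.replace(ns, "/ns/")
    for no in OFFICE_TITLES:
        offices += [no] * sent.count(no)
        sent = sent.replace(no, "/no_noc/")
    return sent, places, offices

text = "公宣城人。父某禮部尚書。"  # constructed example, not from the corpus
kept = [s for s in split_sentences(text) if has_kinship(s)]
results = [compress(s) for s in kept]
print(results)

# Step 3 (placeholder pattern): kinship word followed by a name, up to the next tag.
match = re.search(r"(父)([^/]+)", results[0][0])
print(match.groups())
```

The first sentence has no kinship word and is dropped; the second is compressed so that only the kinship title and the name remain in front of the tag.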
    
#### Problems and TODOs:
- [X] - "Include code IDs": 
    - Not needed for the relationship extraction step: office titles and place names may have duplicates, which would complicate the current task.
    - Therefore, no need to put the IDs in dictionaries.
- [X] - Words to be tagged:
    - Office titles and place names used in the Song dynasty.
    - Office titles and kinship names overlap, e.g., 子, 伯. This should be solved by using refined titles.
- [X] - Order effect:
    - e.g., 昭武軍節度: if 節度 is matched first, the result after subtraction is 昭武軍/no/.
    - Solution: order the dictionary by length in characters (longest first), so that 昭武軍節度 is matched first.
- [X] - Tagging more words:
    - [x] Update dictionaries (`no_noc`, `nk`, `vno`).
    - [x] Tag exact positions instead of combining, e.g.:
        - sent_comp: 解州解/no_noc//wsep/累贈/no_noc/諱禹偁/wsep/追封彭城/no_noc/徐氏/wsep/廣陵/no_noc/王氏/wsep/睢陽/no_noc/宋氏/wsep/祖考妣也
        - no_noc: 禮部尚書#郡太君#郡太君#郡太君#縣令
    - [x] Appointment verbs (`APPOINTMENT_TYPE_CODES`).
    - Entry data (`ENTRY_CODES`).
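
The order effect is easy to reproduce. Below is a minimal sketch, assuming a two-entry dictionary; `tag_terms` is a hypothetical helper name, and the `/no_noc/` tag is used in place of the `/no/` tag from the note above.

```python
def tag_terms(sent, terms, tag):
    """Replace each term with `tag`, trying longer terms first."""
    found = []
    for term in sorted(terms, key=len, reverse=True):
        n = sent.count(term)
        if n:
            found += [term] * n
            sent = sent.replace(term, tag)
    return sent, found

titles = ["節度", "昭武軍節度"]  # hypothetical two-entry dictionary

# Matching the short title first leaves a fragment behind.
naive_sent = "昭武軍節度".replace("節度", "/no_noc/")
print(naive_sent)   # 昭武軍/no_noc/

# Longest-first matching consumes the full title.
sorted_sent, sorted_found = tag_terms("昭武軍節度", titles, "/no_noc/")
print(sorted_sent, sorted_found)
```

Sorting inside the helper means the result does not depend on the row order of the dictionary file.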


In [1]:
import pandas as pd
import re
from tqdm import tqdm
import math

### Prepare dictionaries.

In [2]:
song_kin_title=pd.read_excel('../../../NLP/cn_corpus/data_raw/song_kinship_title.xlsx')
song_kin_title.rename(columns={'名称':'name'}, inplace=True)
song_kin_title['pos']='nk'
song_kin_title.sample(3)

Unnamed: 0,name,关系说明,亲属编码,pos
216,外王母,MM,371,nk
72,先父,F,75,nk
246,第八子,S8,199,nk


In [3]:
song_appt_type=pd.read_excel('../../../NLP/cn_corpus/data_raw/song_appt_type.xlsx')
song_appt_type.rename(columns={'c_appt_type_desc_chn':'name'}, inplace=True)
song_appt_type['pos']='vno'
song_appt_type.sample(3)

Unnamed: 0,c_appt_type_code,name,c_appt_type_desc,check,Unnamed: 4,Unnamed: 5,Unnamed: 6,pos
48,55,贈署,,1.0,,,,vno
26,33,簽書,Used When The Official'S Rank Is Lower Than Th...,,,,,vno
46,53,援例,,1.0,,,,vno


In [4]:
song_appt_type[['name', 'pos']].to_csv('../../../NLP/cn_corpus/data_build/cn_traditional_jieba/song_vno_dict.csv', 
                                       index=False, sep='\t')
song_kin_title[['name', 'pos']].to_csv('../../../NLP/cn_corpus/data_build/cn_traditional_jieba/song_nk_dict.csv', 
                                       index=False, sep='\t')

### Sentence compression.

#### Select sentences with kinship information.

In [5]:
# Read kinship dictionary. nk = POS Kinship.
df_nk=pd.read_csv('../data_dict/song_nk_dict.csv', sep='\t', header=None).rename(columns={0:'name', 1:'pos'})
# Read office title. no = office title. noc = office category.
df_no_noc=pd.read_csv('../data_dict/song_no_noc_dict.csv', sep='\t', header=None).rename(columns={0:'name', 1:'pos'})
df_ns=pd.read_csv('../data_dict/song_ns_dict.csv', sep='\t', header=None).rename(columns={0:'name', 1:'pos'})
df_vno=pd.read_csv('../data_dict/song_vno_dict.csv', sep='\t')
df_nz=pd.read_csv('../data_dict/song_nz_dict.csv', sep='\t', header=None).rename(columns={0:'name', 1:'pos'})
df_kin_mismatch=pd.read_csv('../data_dict/kin_mismatch.csv', sep='\t', header=None).rename(columns={0:'name'})

In [6]:
df_nk.sample(3)

Unnamed: 0,name,pos
636,王父,nk
607,從母,nk
574,玄孫,nk


In [7]:
df_kin_mismatch.sample(3)

Unnamed: 0,name
73,考覈
16,壬子
30,老子


In [8]:
df_nz.sample(3)

Unnamed: 0,name,pos
230,武忠公,nz
249,憲節公,nz
380,莊懷公,nz


In [9]:
# All QSW records.
df_qsw_raw=pd.read_excel('../data_raw/quan_song_wen_muzhi.xlsx', sheet_name='墓誌銘墓表壙誌行狀神道碑塔銘墓碑')[['content_id', 'content', 'subject', 'author']].set_index('content_id')
df_qsw_raw['content']=[str(s) for s in df_qsw_raw['content']]
df_qsw_raw.drop(math.nan, inplace=True) # Drop index==NaN
# Sample a subset of about 10% for testing. Commented out after testing.
#df_qsw_raw=df_qsw_raw.sample(500)

In [10]:
# Function for separating sentences.
def sep_mark_sent(string):
    string=string.replace('，', '/wsep/').replace('；', '/wsep/').replace('、', '/wsep/')
    string=string.replace('！', '/wend/').replace('。', '/wend/').replace('？', '/wend/')
    string=string.replace('：', '/wm/')
    return [s for s in string.split("/wend/") if s!='']
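
A quick check of what the splitter produces. The function is repeated here in a compact but equivalent form so that the snippet runs on its own; the first input is a constructed example, the second is a fragment taken from one of the sample rows further down.

```python
def sep_mark_sent(string):
    # Same replacements as the cell above, written as loops.
    for mark in ('，', '；', '、'):
        string = string.replace(mark, '/wsep/')
    for mark in ('！', '。', '？'):
        string = string.replace(mark, '/wend/')
    string = string.replace('：', '/wm/')
    return [s for s in string.split('/wend/') if s != '']

# Full-width punctuation is converted into separators and sentence breaks.
out_fullwidth = sep_mark_sent('娶同郡趙氏，先公十七年卒。生男二人：長宗直。')
print(out_fullwidth)

# ASCII '!' and '?' are not in the replacement list, so no split happens there.
out_ascii = sep_mark_sent('何止此而已耶!嫂陳氏，家番昜')
print(out_ascii)
```

The second call shows why some rows in `sent` still contain `!` or `?` in the middle of a sentence: only the full-width forms are treated as sentence ends.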

In [11]:
# Separate sentences and retain sentences with kinship words.
all_sent_count=0
kin_sent_count=0
df_qsw_refined=pd.DataFrame()
for index in tqdm(df_qsw_raw.index):
    content=df_qsw_raw.loc[index]['content']
    sent_list=sep_mark_sent(content) # Separate sentences.
    all_sent_count+=len(sent_list)
    subject=df_qsw_raw.loc[index]['subject']
    author=df_qsw_raw.loc[index]['author']
    kin_sent_list=[]
    content_id=index
    # Retain sentences with kinship information.
    for sent in sent_list:
        # Remove all kin_mismatch words first.
        sent_temp=sent
        for kin_mismatch in df_kin_mismatch['name']:
            sent_temp=sent_temp.replace(kin_mismatch, '')
        # See if kin_nm still in sentence.
        kin_nm_list=[]
        for kin_nm in df_nk['name']:
            if kin_nm in sent_temp:
                kin_nm_list.append(kin_nm)
        if kin_nm_list!=[]:
            kin_sent_list.append(sent)
    kin_sent_count+=len(kin_sent_list)
    df_qsw_refined=pd.concat([pd.DataFrame(data=[[content_id, subject, author, sent, content] for sent in kin_sent_list],
                                          columns=['content_id', 'subject', 'author', 'sent', 'content']
                                          ), df_qsw_refined], axis=0, ignore_index=True
                            )
print('Kinship sentences / All sentences: ', round(float(kin_sent_count)/all_sent_count, 3))
# Run on entire corpus, 21.2%.

100%|██████████| 4763/4763 [01:48<00:00, 44.06it/s]

Kinship sentences / All sentences:  0.273
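
The reason for removing the mismatch words before looking for kinship words can be shown in isolation. A minimal sketch, assuming a two-word kinship list and the two mismatch entries that appear in the sample above; `find_kin_words` is a hypothetical helper, and the first test sentence is constructed.

```python
def find_kin_words(sent, kin_words, mismatch_words):
    """Return the kinship words left in `sent` after removing known false positives."""
    cleaned = sent
    for word in mismatch_words:
        cleaned = cleaned.replace(word, '')
    return [k for k in kin_words if k in cleaned]

kin_words = ['子', '玄孫']         # small hypothetical subset of the nk dictionary
mismatch_words = ['老子', '壬子']   # two entries from the kin_mismatch sample

hits_false_positive = find_kin_words('讀老子書', kin_words, mismatch_words)
hits_real = find_kin_words('子三人', kin_words, mismatch_words)
print(hits_false_positive)  # []
print(hits_real)            # ['子']
```

Without the mismatch step, 老子 would count as a sentence about a son because it contains 子.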





In [12]:
[s for s in df_qsw_raw['content'] if '祖諱元晏' in s]

['君諱某，字某。其先始平人，在僖宗朝，有官於蜀者。廣明之亂，唐統紊裂，視世濁溺，留避於此。子孫蕃衍，有居於普者。五世祖紹卿，於五代時，以宗族門地雄於一州；高祖諱光偉，佐東川節度；曾祖諱嶠，祖諱元晏，并潛隱不仕。父諱某，少舉進士，以苦學被病，遂不顧舊業，專治養生之術。作詩百章道其事，自號丹珠子。年過八十，無疾而終。子三人，君其長也。君生而穎慧，不憙他技。未冠，求師於成都。是時，任玠溫如、李畋渭卿，皆以道義文章教授諸生，君執業門下，并爲其高第。歸，將試藝於其郡廷，以干薦書，而豪士惡子競以財賂占壓，寒素不得一步進於其下。乃退而嘆曰：‘是等也，我安能與之以力相較耶?’於是收歛退縮，芟去仕意，僻居靜處，討究群策。經深史隱，鉤擿藏詣；馳詞吐論，坐者常屈。閭里訟訴，槩先詣君所平決，以至不復更由官治，而兩講解矣。教諸子事業，悉有端次。慶曆中，其子、今中都外郎如晦，用其法一舉中進士。君曰：‘是吾門戶之大望，自此子爾。’嘉祐初，以子官授大理評事致仕，三遷爲秘書丞，賜緋衣銀魚。嘗即其居，盛創亭宇，榜之曰‘榮恩’；自作記，道其所以獲當世爲人之甚幸者。鄉人景慕之。治平二年春，中都爲晉原宰，君以雙輿就其養。晉原之治，高出一道，君實有所誨助。間則吟詩飲酒，日日不倦。一旦，召中都語之，曰：‘官居之樂，誠樂矣!然而吾之舊廬，近常往來於吾懷也。汝當具吾歸裝，宜無吾留。’十月，促就道，中都遂假檄侍還其家。既至，亟遣去，曰：‘汝速往，無以吾累汝；汝當憂民，慎毋吾憂也。’自是，日召鄉里故舊聚飲，歡嘑歗歌，愈益精健。諸子立左右，忽顧之，曰：‘父母之年，古人謂可以喜懼者，汝等當知之。吾受祿養幾二紀，名復掛朝籍，人能如吾者幾何?此可喜也。然吾春秋已高，汝能無所懼乎?’家人聞之，錯遌皇惑，問：‘何以及此?體中有覺不如平時者何所?’但俛首嘻笑，不答。又數日，食飲漸不進，求就枕，瞑目良久。以纊候其氣，已不屬矣，遂終焉，十一月十二日也。享年七十五。夫人趙氏，同郡之甲族，婉懿有善譽，宗黨模其閨法，四封爲壽光縣君。生男六人，三早夭；次，中都也；次，處晦、用晦，并舉進士，有文行聞其朋流。女五人：適昌元解惟正，都官員外郎景思問，郡人周著，進士景思永。歸思問者，先卒；後繼之以其娣，封永壽縣君。孫男十人，某某，皆嚮習文藝。孫女九人。其一始嫁河南趙仲遘〔一〕。其孤將以三年二月某日，葬君於樂至縣普安鄉之西山，從先塋也。中都與同有場屋之

In [13]:
df_qsw_refined.sample(5)

Unnamed: 0,content_id,subject,author,sent,content
3735,5888776.0,孟珙,劉克莊,配定襄郡夫人彭氏,孟氏之先自絳徙唐，後徙隨之棗陽〔一〕。公諱珙，字璞玉。高大父安，嘗從岳王飛軍。曾大父立，累贈...
29715,5764807.0,丁世雄妻,葉適,初/wsep/少雲外豪華/wsep/中易直/wsep/價傾一縣/wsep/客自天台鴈蕩者多歸...,夫人戴氏，黄巖人，嫁同縣丁世雄。年四十七，慶元六年二月二十五日卒。十二月二十一日，葬從其夫。...
55864,5234528.0,趙仲韠,范祖禹,曾孫女一人/wsep/皆幼,侯諱仲韠，字子儀，魏恭憲王元佐之曾孫，密國公允信之孫，祁國公宗説之子。母張氏，封清河縣君。初...
36267,5655071.0,俞贇,石[CFont]NFDC8[/CFont],娶同郡趙氏/wsep/先公十七年卒/wsep/生男二人/wm/長宗直/wsep/卒於淳熙庚子...,公諱贇，字公憲，俞氏。九世祖避地于台之寧海，因家焉。曾大父仲、大父璋、父璉，皆晦迹不耀。公幼...
42585,5503452.0,王序,朱承,男三/wm/長卿孫/wsep/右宣義郎/wsep/陝西路鑄錢公司幹辦公事,公諱序，字商彦，姓王氏。其先京兆人，六世祖知珏，唐廣明時差知榮州和義縣，因家焉。孫藴舒沈勇有...


In [14]:
drop_index_list=[]
for index in tqdm(df_qsw_refined.index):
    sent=df_qsw_refined.loc[index, 'sent']
    if ('娶' in sent or '取' in sent or '配' in sent) and ('夫人' not in sent and '氏' not in sent):
        drop_index_list.append(index)
for index in tqdm(df_qsw_refined.index):
    sent=df_qsw_refined.loc[index, 'sent']
    if ('歸' in sent or '嫁' in sent or '適' in sent or '許' in sent) and ('女' not in sent and '妹' not in sent and '姑' not in sent and '夫人' not in sent and '姊' not in sent):
        drop_index_list.append(index)
        
df_qsw_refined=df_qsw_refined.drop(drop_index_list)
len(drop_index_list)

100%|██████████| 70915/70915 [00:01<00:00, 42883.79it/s]
100%|██████████| 70915/70915 [00:01<00:00, 43366.37it/s]


13495
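
The two rules above can also be read as a single predicate. The sketch below is an equivalent restatement, not the code that was run; `should_drop` and the constant names are hypothetical.

```python
MARRY_IN = ('娶', '取', '配')
WIFE_MARKERS = ('夫人', '氏')
MARRY_OUT = ('歸', '嫁', '適', '許')
FEMALE_KIN = ('女', '妹', '姑', '夫人', '姊')

def should_drop(sent):
    """True if a marriage verb appears without the kind of person it should refer to."""
    def has(words):
        return any(w in sent for w in words)
    rule_marry_in = has(MARRY_IN) and not has(WIFE_MARKERS)
    rule_marry_out = has(MARRY_OUT) and not has(FEMALE_KIN)
    return rule_marry_in or rule_marry_out

print(should_drop('配定襄郡夫人彭氏'))          # False: 配 with 夫人 and 氏
print(should_drop('長適文林郎'))                # True: 適 without a female kin word
print(should_drop('女五人/wsep/長適文林郎'))    # False: 女 is present
```

Because the original cell appends to `drop_index_list` in two separate loops, a row matching both rules is listed twice. That would be consistent with the numbers shown: the list has 13495 entries, while the row count goes from 70915 to 57733 in the next cells, a difference of 13182.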

#### Subtracting information by order.

In [15]:
# First, subtract fixed appellations (固定称呼, POS: nz).
for index in tqdm(df_qsw_refined.index):
    sent=df_qsw_refined.loc[index]['sent']
    nz_list=[]
    for nz in df_nz['name']:
        if nz in sent:
            nz_list+=re.findall(re.escape(nz), sent) # Escape so the entry is matched literally.
            sent=sent.replace(nz, '/nz/')
    df_qsw_refined.loc[index, 'sent_comp']=sent
    df_qsw_refined.loc[index, 'nz']='#'.join(nz_list)

100%|██████████| 57733/57733 [09:46<00:00, 98.43it/s]


In [16]:
# Second, subtract place names (POS:ns).
for index in tqdm(df_qsw_refined.index):
    sent=df_qsw_refined.loc[index]['sent_comp']
    ns_list=[]
    for ns in df_ns['name']:
        if ns in sent:
            ns_list+=re.findall(re.escape(ns), sent) # Escape so the entry is matched literally.
            sent=sent.replace(ns, '/ns/')
    df_qsw_refined.loc[index, 'sent_comp']=sent
    df_qsw_refined.loc[index, 'ns']='#'.join(ns_list)

100%|██████████| 57733/57733 [11:45<00:00, 81.88it/s]


In [17]:
# Third, subtract office title (POS:no_noc).
for index in tqdm(df_qsw_refined.index):
    sent=df_qsw_refined.loc[index]['sent_comp']
    no_noc_list=[]
    for no_noc in df_no_noc['name']:
        if no_noc in sent:
            no_noc_list+=re.findall(re.escape(no_noc), sent) # Escape so the entry is matched literally.
            sent=sent.replace(no_noc, '/no_noc/')
    df_qsw_refined.loc[index, 'sent_comp']=sent
    df_qsw_refined.loc[index, 'no_noc']='#'.join(no_noc_list)

100%|██████████| 57733/57733 [13:57<00:00, 68.91it/s]
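
The three passes share the same shape, so they can be expressed with one helper applied in order. A minimal sketch, assuming tiny hypothetical dictionaries and a plain dict in place of a DataFrame row; `subtract_terms` and `PASSES` are illustrative names.

```python
def subtract_terms(sent, terms, tag):
    """Replace every dictionary term found in `sent` with `tag`; return both parts."""
    found = []
    for term in terms:
        n = sent.count(term)  # literal, non-overlapping count
        if n:
            found += [term] * n
            sent = sent.replace(term, tag)
    return sent, found

# (tag, terms) in the order used above: nz, then ns, then no_noc.
PASSES = [('nz', []), ('ns', ['潮州']), ('no_noc', ['參軍', '録事'])]

row = {'sent': '潮州録事參軍趙汝腴'}
comp = row['sent']
for tag, terms in PASSES:
    comp, found = subtract_terms(comp, terms, f'/{tag}/')
    row[tag] = '#'.join(found)
row['sent_comp'] = comp
print(row)
```

The extracted terms are listed in dictionary order rather than in sentence order, which matches the `參軍#録事` value in the sample rows shown later.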


In [18]:
df_qsw_refined.sample(3)

Unnamed: 0,content_id,subject,author,sent,content,sent_comp,nz,ns,no_noc
33665,5713832.0,樓鐊,樓鑰,使假之年/wsep/其自見于世者何止此而已耶!嫂陳氏/wsep/家番昜,先光禄有十丈夫子，惟伯兄績谿尉生于紹興二年，仲兄嚴州生于四年，至七年而鑰始生。二兄愛鑰厚，期...,使假之年/wsep/其自見于世者何止此而已耶!嫂陳氏/wsep/家番昜,,,
52816,5267842.0,孫覽,畢仲游,天下稱賢弟兄者/wsep/必曰莘老/wsep/傳師焉,故朝請大夫、寶文閣待制、提舉江寧府崇禧觀、上柱國、華亭縣開國伯、食邑七百户、賜紫金魚袋孫公，...,天下稱賢弟兄者/wsep/必曰莘老/wsep/傳師焉,,,
28059,5784055.0,吳柔勝,曹彦約,勝之諱柔勝/wsep/家本姑蘇/wsep/八世祖徙宣城/wsep/以儒爲業,勝之修撰葬有日，墓當立碑，真希元直院已諾執筆，柴與之秘監又狀其事矣。二公號大手筆，一代端慤不...,勝之諱柔勝/wsep/家本姑蘇/wsep/八世祖徙/ns//wsep/以儒爲業,,宣城,


In [19]:
def vno_mark(vno, sent):
    # Match the appointment verb only when it directly precedes an office or place tag.
    # The lookahead keeps the tag out of the match, so no stripping is needed afterwards
    # (str.strip removes a set of characters, not a suffix).
    vno_re=re.compile(re.escape(vno)+r'(?=/no_noc/|/ns/)')
    vno_list_temp=vno_re.findall(sent)
    if vno_list_temp:
        # Replace just the verb; the following tag is left in place.
        return {'sent_comp':vno_re.sub('/vno/', sent),
                'vno_list':vno_list_temp}
    return None

In [20]:
# Subtract appointing verb (POS:vno).
for index in df_qsw_refined.index:
    sent=df_qsw_refined.loc[index]['sent_comp']
    vno_list=[]
    for vno in df_vno['name']:
        tag_result=vno_mark(vno, sent)
        if tag_result is not None:
            sent=tag_result['sent_comp']
            vno_list+=tag_result['vno_list']
    df_qsw_refined.loc[index, 'sent_comp']=sent
    df_qsw_refined.loc[index, 'vno']='#'.join(vno_list)
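
The verb tagging only fires when the verb sits directly in front of an office or place tag. The sketch below shows that behaviour on constructed sentences; `mark_vno` is a hypothetical stand-alone helper, and 簽書 is one of the entries visible in the appointment-type sample.

```python
import re

def mark_vno(sent, verbs):
    """Tag appointment verbs that are immediately followed by /no_noc/ or /ns/."""
    found = []
    for verb in verbs:
        pattern = re.compile(re.escape(verb) + r'(?=/no_noc/|/ns/)')
        hits = pattern.findall(sent)
        if hits:
            found += hits
            sent = pattern.sub('/vno/', sent)
    return sent, found

verbs = ['簽書']  # one entry from the vno dictionary sample

tagged = mark_vno('父某/wsep/簽書/no_noc/', verbs)
untouched = mark_vno('簽書某事', verbs)
print(tagged)
print(untouched)
```

A verb that is not followed by a tag is left alone, which limits false positives from characters that also occur in ordinary prose.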

In [21]:
df_qsw_refined.fillna('', inplace=True)
df_qsw_refined.sample(5)

Unnamed: 0,content_id,subject,author,sent,content,sent_comp,nz,ns,no_noc,vno
33449,5713694.0,趙師龍,樓鑰,進止詳華/wsep/占對明辯/wsep/壽皇嘉納/wsep/且曰/wm/‘秀王之孫與卿同名/...,公諱師龍，字舜臣，太祖皇帝九世孫。曾大父令蘧，邕州管内觀察使，累贈少師，追封昌國公，謚孝良。...,進止詳華/wsep/占對明辯/wsep/壽皇嘉納/wsep/且曰/wm/‘秀王之孫與卿同名/...,,,,
36166,5657905.0,吴之才妻,趙善括,南昌吴隱君之才之夫人萬氏/wsep/少而賢/wsep/爲婦有孝聲,夷考今昔，爲人婦者，咸知愛己之親，鮮有事人之親，能極其愛，情固然爾。縱强能之，率容敬心悖，始...,/ns/吴隱君之才之夫人萬氏/wsep/少而賢/wsep/爲婦有孝聲,,南昌,,
22595,5888906.0,陳垣妻,劉克莊,女五人/wsep/長適文林郎/wsep/潮州録事參軍趙汝腴/wsep/次適修職郎/wsep/...,故海陽陳令君諱垣之配孺人鄭氏，以紹定元年二月六日卒，年五十一。明年三月丁酉，合葬於令君之墓。...,女五人/wsep/長適/no_noc//wsep//ns//no_noc//no_noc/趙...,,臨安府#潮州,文林郎#修職郎#參軍#録事,
20359,5895930.0,徐桂,徐經孫,余叔父金陵法曹爲之記/wsep/故識與不識/wsep/皆號曰内省居士,族伯父内省居士徐公，諱桂，字億年，居豫章豐城之覺溪，其先則撫之宜黄人也。曾祖諱端仁，祖諱邦義...,余叔父金陵/no_noc/爲之記/wsep/故識與不識/wsep/皆號曰内省居士,,,法曹,
40297,5583171.0,允中,韓元吉,公諱允中/wsep/字子忱/wsep/登政和五年進士第/wsep/積官至左通議大夫/wsep...,上即位之二年，詔資政殿大學士賀公落致仕，提舉萬壽觀，兼侍讀。上親御翰墨，累數十語，其略曰：‘...,公諱允中/wsep/字子忱/wsep/登/ns/五年/no_noc/第/wsep/積官至/n...,,會稽郡#政和,左通議大夫#開國公#食邑#進士#爵,


In [22]:
# The `encoding` argument is omitted: it was removed from `to_excel` in pandas 2.0.
df_qsw_refined.to_excel('../output/qsw_subtract.xlsx')
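
For later steps, the `#`-joined columns need to be turned back into lists. A minimal sketch with a hypothetical helper, `split_tags`; it also accepts non-string cells, on the assumption that empty cells come back as NaN when the spreadsheet is read again.

```python
def split_tags(cell):
    """Turn a '#'-joined cell back into a list; blank or NaN cells give []."""
    if not isinstance(cell, str) or cell == '':
        return []
    return cell.split('#')

titles = split_tags('文林郎#修職郎#參軍#録事')
empty = split_tags('')
missing = split_tags(float('nan'))
print(titles, empty, missing)
```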