## Script purpose ## 
**TTS-CBDB kinship disambiguation**

### Phase I: Exact matching
1. Restrict the kinship type to "子" (son). The corresponding CBDB kinship codes are:
```Python
kin_code_list_son=['180','182','183','184','185','186','191','193','194','195','196','198',
                   '199','202','205','206','207','211','212','213','214','220','221','222',
                   '226','229','231','234','235','307','326','339','436','560','568','572','575',]
```
2. Build a list of pairs from CBDB data: the ego's CBDB ID plus the kin's name (the kin's CBDB ID is known), e.g.,
```Python
[[0, '田照鄰'],
 [1, '安邡'],
 [1, '安邠'],
 [1, '安郊'],
 [1, '安邦'],
 [3, '安扶'],
 [4, '查沖之'],
 [4, '查循之'],
 [6, '柴貽範'],
 [12, '晁子與'],
 [10097, '宋璲']]
```
3. Build the same kind of pairs from TTS data: the ego's CBDB ID (known) plus the kin's name (the kin's CBDB ID is unknown), e.g.,
```Python
[[56812, '富察昌齡'],
 [56812, '富察科占'],
 [56812, '富察查納'],
 [56813, '邵桓'],
 [56814, '刁錄'],
 [56814, '刁鈞'],
 [56814, '刁安仁'],
 [56814, '刁錦'],
 [56815, '邵鐸'],
 [56816, '于廷翼'],
 [10097, '宋璲']]
```
This is very similar to step 2, except that we don't have CBDB IDs for the kin.
4. Find the intersection of the two lists. In the example it is `[10097, '宋璲']`, so we assign `宋璲`'s CBDB ID, i.e., `120940`, to the TTS record.
5. Kinship type "子" resolves 233 of the 7983 TTS pairs, about 3%, which is not bad for exact matching alone.
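
Steps 2 to 4 can be sketched as a dictionary lookup. This is a minimal illustration under stated assumptions, not the notebook's actual implementation: the helper name `match_kin_ids` and the triple layout are hypothetical, and apart from `120940` for `宋璲` (given in step 4) the kin IDs below are made-up placeholders.

```python
# Sketch of Phase I exact matching (hypothetical helper, not the notebook's code).
def match_kin_ids(cbdb_triples, tts_pairs):
    """Map each TTS (ego_id, kin_name) pair to the CBDB kin IDs that share it."""
    lookup = {}
    for ego_id, kin_name, kin_id in cbdb_triples:
        lookup.setdefault((ego_id, kin_name), set()).add(kin_id)
    return {pair: lookup[pair] for pair in tts_pairs if pair in lookup}

# CBDB side: (ego ID, kin name, kin ID). Only 120940 comes from the text;
# the negative kin IDs are placeholders for illustration.
cbdb_triples = [
    (1, '安邡', -1),
    (4, '查沖之', -2),
    (10097, '宋璲', 120940),
]
# TTS side: (ego ID, kin name); the kin ID is what we want to recover.
tts_pairs = [(56813, '邵桓'), (56815, '邵鐸'), (10097, '宋璲')]

matches = match_kin_ids(cbdb_triples, tts_pairs)
print(matches)
```

Keeping a set of candidate IDs per pair makes ambiguous cases visible: a pair that maps to more than one kin ID would need manual review.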

### Phase II: Kinship normalization.

- Basic principle: make maximal use of the kinship information in CBDB when disambiguating TTS records.
- Task: resolve the missing information (i.e., CBDB IDs) in the TTS kinship network (TKN) using information from the CBDB kinship network (CKN).
- Requirement: TKN must be comparable to CKN.
- However, TKN and CKN are not an apples-to-apples comparison,
- because we don't have a normalized, standard method for representing kinship networks.

To obtain a normalized, standard method for representing kinship networks, we need:
- A list of basic kinship relationships. This has been discussed in many studies, including Deng Ke's.
- However, the basic relationships and the representation method are not ontologically neat (normalized), i.e., some of the basic relationships can be expressed in terms of other basic relationships.
    - For example, under Deng Ke's method, if $A$ is $B$'s son, then $B$ is $A$'s father.
    - Two kinds of information are mixed here: $A$ and $B$ are male, and $A$ is the child of $B$. Gender is an attribute of a person; kinship is an attribute of a relationship.

So a better kinship representation should work as follows:
- Gender is an attribute of nodes; the basic kinship relationship is an attribute of ties.
- The basic kinship relationships are only: child, spouse.
- An immediate family is defined as two nodes connected by a "spouse" tie, together with their child nodes.
- Any other kinship relationship is computed from the shortest path between two nodes.
- This is interesting because the "shortest path" method is how we work out kinship in the real world.
- It makes sense because computing along the shortest path requires the least cognitive workload.
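
The representation above can be sketched with a breadth-first search over "child" and "spouse" ties. This is a minimal illustration in plain Python, assuming a toy family; the function name `kin_path` and the node labels `A`, `B`, `C`, `D`, `W` are hypothetical and not part of the notebook.

```python
from collections import deque

def kin_path(children, spouses, start, goal):
    """Return the shortest list of steps ('parent', 'child', 'spouse')
    leading from start to goal, or None if they are not connected."""
    adj = {}
    for parent, child in children:
        adj.setdefault(parent, []).append((child, 'child'))
        adj.setdefault(child, []).append((parent, 'parent'))
    for a, b in spouses:
        adj.setdefault(a, []).append((b, 'spouse'))
        adj.setdefault(b, []).append((a, 'spouse'))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for nxt, label in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [label]))
    return None

# Toy family: A and W are spouses, B is A's child, C and D are B's children.
children = [('A', 'B'), ('B', 'C'), ('B', 'D')]
spouses = [('A', 'W')]

print(kin_path(children, spouses, 'C', 'A'))  # two steps up: a grandparent
print(kin_path(children, spouses, 'C', 'D'))  # up then down: a sibling
```

Combining the step sequence with the gender stored on the target node would then yield a concrete term, e.g., two "parent" steps to a male node correspond to 祖父.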

In [1]:
%matplotlib inline
import sqlite3
import pandas as pd
import networkx as nx
import xlrd
import matplotlib.pyplot as plt
import math
import warnings
from tqdm import tqdm
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

In [2]:
# Read data.
conn = sqlite3.connect('../../SQL/20170424CBDBauUserSqlite.db')
df_kin_data=pd.read_sql_query("SELECT * FROM KIN_DATA", conn)
df_biog_main=pd.read_sql_query("SELECT * FROM BIOG_MAIN", conn).set_index('c_personid')
df_kinship_codes=pd.read_sql_query("SELECT * FROM KINSHIP_CODES", conn).set_index('c_kincode')
df_kin_tts=pd.read_excel('../data_file/110305 Kinship data (TTS_MQ for Deng Ke).xls', sheet_name='DATA')

In [3]:
# Get the top 10 kinship relationships in the TTS dataset.
df_kin_tts.groupby('this is the relationship of the kin to EGO in Chinese').count()[['line_id']].sort_values(by='line_id', ascending=False)[0:10]

this is the relationship of the kin to EGO in Chinese,line_id
子,7983
孫,3925
父,3262
弟,1998
兄,1481
祖父,1443
曾孫,1276
曾祖父,931
祖,658
玄孫,578


## Phase II: Kinship normalization.

### Build and normalize TTS kinship graph.

In [4]:
df_kin_tts['line_id']=['L_'+str(s) for s in df_kin_tts['line_id']]
df_kin_tts['CBDB PersonID of EGO']=['C_'+str(s) for s in df_kin_tts['CBDB PersonID of EGO']]

In [5]:
df_kin_tts.head(30)

Unnamed: 0,line_id,CBDB PersonID of EGO,Name in Chinese of EGO; this is the main subject.,sys_id,This is the kin who is related to EGO.full name of kin,Kin Last Name,Kin FirstName,this is the relationship of the kin to EGO in Chinese,this is the kinship_code for the relationship of the kin to EGO
0,L_1,C_56812,富察傅鼐,1,富察額色泰,富察,額色泰,祖父,62
1,L_2,C_56812,富察傅鼐,1,富察噶爾漢,富察,噶爾漢,父,75
2,L_3,C_56812,富察傅鼐,1,富察昌齡,富察,昌齡,子,180
3,L_4,C_56812,富察傅鼐,1,富察科占,富察,科占,子,180
4,L_5,C_56812,富察傅鼐,1,富察查納,富察,查納,子,180
5,L_6,C_56813,邵洪,2,邵基,邵,基,祖父,62
6,L_7,C_56813,邵洪,2,邵鐸,邵,鐸,父,75
7,L_8,C_56813,邵洪,2,邵桓,邵,桓,子,180
8,L_9,C_56814,刁承祖,3,刁克俊,刁,克俊,曾祖父,48
9,L_10,C_56814,刁承祖,3,刁包,刁,包,祖父,62


In [6]:
g_tts_kin=nx.Graph()
for index in tqdm(df_kin_tts.index):
    edge=df_kin_tts.loc[index][['line_id', 'CBDB PersonID of EGO']]
    edge_attr=df_kin_tts.loc[index]['this is the relationship of the kin to EGO in Chinese']
    g_tts_kin.add_edges_from([edge.tolist()], kin_nm=edge_attr)
    g_tts_kin.node[edge[0]]['name']=df_kin_tts.loc[index, 'This is the kin who is related to EGO.full name of kin']
    g_tts_kin.node[edge[1]]['name']=df_kin_tts.loc[index, 'Name in Chinese of EGO; this is the main subject. ']

100%|██████████| 34758/34758 [00:54<00:00, 638.67it/s]


In [7]:
g_tts_kin.node['C_56814'], g_tts_kin.node['L_10']

({'name': '刁承祖'}, {'name': '刁包'})

In [8]:
list(g_tts_kin.edges('C_56814', data=True))

[('C_56814', 'L_9', {'kin_nm': '曾祖父'}),
 ('C_56814', 'L_10', {'kin_nm': '祖父'}),
 ('C_56814', 'L_11', {'kin_nm': '父'}),
 ('C_56814', 'L_12', {'kin_nm': '子'}),
 ('C_56814', 'L_13', {'kin_nm': '子'}),
 ('C_56814', 'L_14', {'kin_nm': '子'}),
 ('C_56814', 'L_15', {'kin_nm': '子'})]

In [9]:
'''
Prepare to "normalize" the kinship:
    - Edge weight records parental relationship.
    - Gender: 0 = male; 1 = female.
'''
for edge in g_tts_kin.edges:
    kin_nm=g_tts_kin[edge[0]][edge[1]]['kin_nm']
    kin_node=[s for s in edge if s.startswith('L')][0]
    
    if kin_nm=='父':
        g_tts_kin[edge[0]][edge[1]]['weight']=1
        g_tts_kin.node[kin_node]['gender']=0
    elif kin_nm=='祖父':
        g_tts_kin[edge[0]][edge[1]]['weight']=2
        g_tts_kin.node[kin_node]['gender']=0
    elif kin_nm=='曾祖父':
        g_tts_kin[edge[0]][edge[1]]['weight']=3
        g_tts_kin.node[kin_node]['gender']=0
        
    elif kin_nm=='子':
        g_tts_kin[edge[0]][edge[1]]['weight']=-1
        g_tts_kin.node[kin_node]['gender']=0
    elif kin_nm=='孫':
        g_tts_kin[edge[0]][edge[1]]['weight']=-2
        g_tts_kin.node[kin_node]['gender']=0
    elif kin_nm=='曾孫':
        g_tts_kin[edge[0]][edge[1]]['weight']=-3
        g_tts_kin.node[kin_node]['gender']=0
    elif kin_nm=='玄孫':
        g_tts_kin[edge[0]][edge[1]]['weight']=-4
        g_tts_kin.node[kin_node]['gender']=0
        
    elif kin_nm=='弟' or kin_nm=='兄':
        g_tts_kin[edge[0]][edge[1]]['weight']=0
        g_tts_kin.node[kin_node]['gender']=0
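
The mapping applied by the cell above can also be written as a lookup table. The following is a sketch under stated assumptions rather than the notebook's code: the names `GENERATION_OFFSET`, `MALE`, and `normalize` are hypothetical, while the terms and weights are copied from the branches above.

```python
# Generation offset of the kin relative to EGO, as assigned in the cell above.
GENERATION_OFFSET = {
    '父': 1, '祖父': 2, '曾祖父': 3,            # upward
    '子': -1, '孫': -2, '曾孫': -3, '玄孫': -4,  # downward
    '弟': 0, '兄': 0,                           # same generation
}
MALE = 0  # gender coding used in the notebook: 0 = male, 1 = female

def normalize(kin_nm):
    """Return (weight, gender) for a supported term, or None otherwise."""
    if kin_nm in GENERATION_OFFSET:
        return GENERATION_OFFSET[kin_nm], MALE
    return None

print(normalize('祖父'))
print(normalize('祖'))  # not in the table, so the edge stays unweighted
```

Terms that are absent from the table, such as 祖 or 母, receive no weight, which is why only part of the edges end up weighted.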

In [10]:
len([g_tts_kin[edge[0]][edge[1]] for edge in g_tts_kin.edges if 'weight' in g_tts_kin[edge[0]][edge[1]].keys()]), len(g_tts_kin.edges)

(22877, 34758)

In [11]:
# Calculate parental relationships.
for node in tqdm(df_kin_tts['CBDB PersonID of EGO']):
    edge_list=list(g_tts_kin.edges(node, data=True))
    for index1 in range(len(edge_list)-1):
        edge1=edge_list[index1]
        for index2 in range(index1+1, len(edge_list)):
            edge2=edge_list[index2]
            if 'weight' in edge1[2].keys() and 'weight' in edge2[2].keys():
                # The two "if" blocks below are meant for directed graphs.
                # But let's ignore the direction for now.
                if edge1[2]['weight']-edge2[2]['weight']==1:
                    node_f=[s for s in [edge1[0], edge1[1]] if 'L' in s][0]
                    node_c=[s for s in [edge2[0], edge2[1]] if 'L' in s][0]
                    g_tts_kin.add_edge(node_f, node_c, weight=1)
                if edge1[2]['weight']-edge2[2]['weight']==-1:
                    node_c=[s for s in [edge1[0], edge1[1]] if 'L' in s][0]
                    node_f=[s for s in [edge2[0], edge2[1]] if 'L' in s][0]
                    g_tts_kin.add_edge(node_f, node_c, weight=-1)

100%|██████████| 34758/34758 [00:17<00:00, 1980.47it/s]
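
The inference rule in the cell above is: among the kin of one ego, two kin whose generation offsets differ by exactly 1 are treated as parent and child. A minimal sketch in plain Python follows, assuming the offsets have already been assigned; the function name `infer_parent_child` is hypothetical, and the first example reuses the edges of `C_56814` shown earlier.

```python
from itertools import combinations

def infer_parent_child(kin_offsets):
    """kin_offsets maps kin ID -> generation offset relative to the ego.
    Returns (older, younger) pairs whose offsets differ by exactly 1."""
    pairs = []
    for (a, wa), (b, wb) in combinations(kin_offsets.items(), 2):
        if wa - wb == 1:
            pairs.append((a, b))
        elif wb - wa == 1:
            pairs.append((b, a))
    return pairs

# Kin of C_56814: 曾祖父 (3), 祖父 (2), 父 (1), and four 子 (-1).
kin_offsets = {'L_9': 3, 'L_10': 2, 'L_11': 1,
               'L_12': -1, 'L_13': -1, 'L_14': -1, 'L_15': -1}
pairs = infer_parent_child(kin_offsets)
print(pairs)

# Caveat: with two sons and two grandsons the rule links every son
# to every grandson, although each grandson has only one father.
ambiguous = infer_parent_child({'s1': -1, 's2': -1, 'g1': -2, 'g2': -2})
print(len(ambiguous))
```

The father (1) and the sons (-1) are not linked to each other because their offsets differ by 2; each of them is linked to the ego directly. The second example suggests that inferred edges are candidates rather than confirmed relationships whenever a generation contains more than one person.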


In [12]:
# Number of parental relationships added (edges between two kin nodes):
len([s for s in g_tts_kin.edges if 'L' in s[0] and 'L' in s[1]])

31994

In [13]:
# Do a quick check.
g_tts_kin.edges('L_23')
# Looks GOOD!!!!

EdgeDataView([('L_23', 'C_56816'), ('L_23', 'L_20'), ('L_23', 'L_21'), ('L_23', 'L_22')])

In [14]:
num=0
for edge in g_tts_kin.edges(data=True):
    if 'weight' in edge[2].keys():
        if abs(edge[2]['weight'])==1:
            num+=1
print('#parental relationship pairs:', num)

#parental relationship pairs: 43239


### Build and normalize CBDB kinship graph.

In [15]:
# First, let's work on selected kinship relationships.
df_kinship_codes[(df_kinship_codes.c_kinrel_chn=='父')|
                 (df_kinship_codes.c_kinrel_chn=='祖父')|
                 (df_kinship_codes.c_kinrel_chn=='曾祖父')|
                 (df_kinship_codes.c_kinrel_chn=='子')|
                 (df_kinship_codes.c_kinrel_chn=='孫')|
                 (df_kinship_codes.c_kinrel_chn=='曾孫')|
                 (df_kinship_codes.c_kinrel_chn=='玄孫')|
                 (df_kinship_codes.c_kinrel_chn=='弟')|
                 (df_kinship_codes.c_kinrel_chn=='兄')
                ]
# 曾孫 = 255

c_kincode,c_kin_pair1,c_kin_pair2,c_kin_pair_notes,c_kinrel_chn,c_kinrel,c_kinrel_alt,c_pick_sorting,c_upstep,c_dwnstep,c_marstep,c_colstep
62,243,251,,祖父,FF,FF,-4.0,2,0,0,0
75,180,176,,父,F,F Father,-1.0,1,0,0,0
125,126,169,,兄,B+,"B+ Brother, elder",2.0,0,0,0,1
126,125,164,,弟,B-,"B– Brother, younger",3.0,0,0,0,1
180,75,111,,子,S,S Son,5.0,0,1,0,0
243,62,370,,孫,SS,SS Son's son,8.0,0,2,0,0


#### Try conservative approach.

In [16]:
df_kin_cbdb=df_kin_data[(df_kin_data.c_kin_code==62)|
                        (df_kin_data.c_kin_code==75)|
                        (df_kin_data.c_kin_code==125)|
                        (df_kin_data.c_kin_code==126)|
                        (df_kin_data.c_kin_code==180)|
                        (df_kin_data.c_kin_code==243)|
                        (df_kin_data.c_kin_code==255)
                       ]

In [17]:
df_kin_cbdb.sample(3)

Unnamed: 0,tts_sysno,c_personid,c_kin_id,c_kin_code,c_source,c_pages,c_notes,c_autogen_notes,c_created_by,c_created_date,c_modified_by,c_modified_date
42540,42600.0,19360,898,180,0.0,,,,TTS,20070312,,
43676,45260.0,21528,1598,126,7596.0,9736.0,,"Auto-generated from PersonID = 0001598, KinCod...",TTS,20080122,,
220972,219005.0,171815,171809,75,32033.0,,"(Tackett) subjectid=132718,targetid=132712,rel...",,load,20130923,,


In [19]:
len(df_kin_cbdb)

328167

#### Try aggressive approach on various '子'.

In [20]:
kin_code_son_list=[s for s in df_kinship_codes.index 
                   if '子' in str(df_kinship_codes.loc[s, 'c_kinrel_chn']) 
                   and '孫' not in str(df_kinship_codes.loc[s, 'c_kinrel_chn'])
                  ]

In [21]:
df_kin_cbdb=pd.concat([df_kin_data[(df_kin_data.c_kin_code==62)|
                                   (df_kin_data.c_kin_code==75)|
                                   (df_kin_data.c_kin_code==125)|
                                   (df_kin_data.c_kin_code==126)|
                                   (df_kin_data.c_kin_code==180)|
                                   (df_kin_data.c_kin_code==243)|
                                   (df_kin_data.c_kin_code==255)
                                  ], 
                       df_kin_data[df_kin_data.c_kin_code.isin(kin_code_son_list)]
                      ])

In [22]:
len(df_kin_cbdb)

450640

In [23]:
df_biog_main.sample(3)

c_personid,tts_sysno,c_name,c_name_chn,c_index_year,c_female,c_ethnicity_code,c_household_status_code,c_tribe,c_birthyear,c_by_nh_code,...,c_mingzi_proper,c_name_proper,c_surname_rm,c_mingzi_rm,c_name_rm,c_created_by,c_created_date,c_modified_by,c_modified_date,c_self_bio
280265,261846.0,Qiu Ning,丘寧,,0,0.0,0.0,,,0.0,...,,,,,,load,20130923,,,1
294697,275918.0,Xia Shang,夏商,1535.0,0,0.0,0.0,,,0.0,...,,,,,,load,20130923,,,1
136888,129384.0,Xu Shi,徐氏（洪子壽妻）,1278.0,1,0.0,0.0,,,0.0,...,,,,,,load,20131003,,,0


In [24]:
# Find names in the kinship graph. Do parallel computing.
import ipyparallel as ipp
c = ipp.Client()
print(c.ids)
dview = c[:]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]


In [25]:
dview.execute('import pandas as pd')

dview['df_biog_main']=df_biog_main
dview['df_kin_cbdb']=df_kin_cbdb
dview['df_engin']=pd.DataFrame()

In [26]:
@dview.parallel(block=True)
def give_chn(index):
    global df_biog_main, df_kin_cbdb, df_engin
    c_personid=df_kin_cbdb.loc[index, 'c_personid']
    c_kin_id=df_kin_cbdb.loc[index, 'c_kin_id']
    try:
        c_ego_chn=df_biog_main.loc[c_personid, 'c_name_chn']
        c_kin_chn=df_biog_main.loc[c_kin_id, 'c_name_chn']
        df_kin_cbdb.loc[index, 'c_ego_chn']=c_ego_chn
        df_kin_cbdb.loc[index, 'c_kin_chn']=c_kin_chn
        df_engin=df_engin.append([df_kin_cbdb.loc[index]])
    except KeyError:
        # Skip rows whose person ID is missing from BIOG_MAIN.
        pass

In [None]:
t=give_chn.map(df_kin_cbdb.index)
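
Conceptually, `give_chn` joins each kinship row against an ID-to-name table and drops rows whose IDs are missing. A minimal sketch of that join in plain Python follows; the function name `attach_names` and the tuple layout are hypothetical. The first row uses names that appear in the sample output of this notebook, and the second row's kin ID is deliberately left out of the toy name table to show the skip behaviour.

```python
def attach_names(kin_rows, names_by_id):
    """Append ego and kin names to (ego_id, kin_id, kin_code) rows,
    skipping rows whose IDs are missing from the name table."""
    out = []
    for ego_id, kin_id, kin_code in kin_rows:
        if ego_id in names_by_id and kin_id in names_by_id:
            out.append((ego_id, kin_id, kin_code,
                        names_by_id[ego_id], names_by_id[kin_id]))
    return out

# Toy name table; 303907 is intentionally absent.
names_by_id = {198525: '王瑤', 284021: '王鉅', 129039: '胡彥'}
kin_rows = [(198525, 284021, 62), (129039, 303907, 126)]

named_rows = attach_names(kin_rows, names_by_id)
print(named_rows)
```

A dictionary lookup of this kind runs in a single pass over the rows, so it may make the parallel engines unnecessary for a table of this size; this has not been measured here.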

In [41]:
df_kin_cbdb_chn=pd.concat(dview.gather('df_engin').result())

In [42]:
# Save as intermediary file.
df_kin_cbdb_chn.to_pickle('../data_file/intermediary/df_kin_cbdb_chn.pkl.gzip', compression='gzip')

In [43]:
df_kin_cbdb_chn=pd.read_pickle('../data_file/intermediary/df_kin_cbdb_chn.pkl.gzip', compression='gzip')
df_kin_cbdb_chn['c_personid']=['E_'+str(s) for s in df_kin_cbdb_chn['c_personid']]
df_kin_cbdb_chn['c_kin_id']=['K_'+str(s) for s in df_kin_cbdb_chn['c_kin_id']]

In [44]:
df_kin_cbdb_chn.sample(3)

Unnamed: 0,tts_sysno,c_personid,c_kin_id,c_kin_code,c_source,c_pages,c_notes,c_autogen_notes,c_created_by,c_created_date,c_modified_by,c_modified_date,c_ego_chn,c_kin_chn
259613,257772.0,E_198525,K_284021,62,32084.0,第二甲第二十四名,,,load,20130923,,,王瑤,王鉅
134233,132446.0,E_129039,K_303907,126,32070.0,第三甲第八十三名,,,load,20130923,,,胡彥,胡新
367179,365338.0,E_226902,K_207054,255,32091.0,第三甲第三十九名,,,load,20130923,,,劉炖,劉三英


In [45]:
# Build CBDB kinship graph.
g_cbdb_kin=nx.Graph()
for index in tqdm(df_kin_cbdb_chn.index):
    edge=df_kin_cbdb_chn.loc[index][['c_personid', 'c_kin_id']]
    kin_code=df_kin_cbdb_chn.loc[index]['c_kin_code']
    edge_attr=df_kinship_codes.loc[kin_code, 'c_kinrel_chn']
    g_cbdb_kin.add_edges_from([edge.tolist()], kin_nm=edge_attr)
    g_cbdb_kin.node[edge[0]]['name']=df_kin_cbdb_chn.loc[index, 'c_ego_chn']
    g_cbdb_kin.node[edge[1]]['name']=df_kin_cbdb_chn.loc[index, 'c_kin_chn']

100%|██████████| 292171/292171 [08:00<00:00, 608.37it/s]


In [47]:
list(g_cbdb_kin.nodes)[0:10]

['E_2',
 'K_1',
 'E_3',
 'K_3001',
 'K_45938',
 'E_4',
 'K_3002',
 'K_13313',
 'E_5',
 'K_13318']

In [49]:
g_cbdb_kin.node['E_5'], g_cbdb_kin.node['K_204014']

({'name': '查籥'}, {'name': '宋登'})

In [50]:
list(g_cbdb_kin.edges('K_204014', data=True))

[('K_204014', 'E_313226', {'kin_nm': '曾孫; 重孫'}),
 ('K_204014', 'E_313227', {'kin_nm': '孫'})]

In [None]:
''' ============= Conservative =============
Prepare to "normalize" the kinship:
    - Edge weight records parental relationship.
    - Gender: 0 = male; 1 = female.
'''
for edge in list(g_cbdb_kin.edges(data=True)):
    kin_nm=edge[2]['kin_nm']
    kin_node=[s for s in edge[0:2] if s.startswith('K')][0]
    # Upward.
    if kin_nm=='父':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=1
        g_cbdb_kin.node[kin_node]['gender']=0
    elif kin_nm=='祖父':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=2
        g_cbdb_kin.node[kin_node]['gender']=0
    elif kin_nm=='曾祖父':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=3
        g_cbdb_kin.node[kin_node]['gender']=0
    # Downward.
    elif kin_nm=='子':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=-1
        g_cbdb_kin.node[kin_node]['gender']=0
    elif kin_nm=='孫':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=-2
        g_cbdb_kin.node[kin_node]['gender']=0
    elif kin_nm=='曾孫':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=-3
        g_cbdb_kin.node[kin_node]['gender']=0
    elif kin_nm=='玄孫':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=-4
        g_cbdb_kin.node[kin_node]['gender']=0
    # Peer.
    elif kin_nm=='弟' or kin_nm=='兄':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=0
        g_cbdb_kin.node[kin_node]['gender']=0

In [51]:
''' ============= Aggressive =============
Prepare to "normalize" the kinship:
    - Edge weight records parental relationship.
    - Gender: 0 = male; 1 = female.
'''
for edge in list(g_cbdb_kin.edges(data=True)):
    kin_nm=edge[2]['kin_nm']
    kin_node=[s for s in edge[0:2] if s.startswith('K')][0]
    # Upward.
    if kin_nm=='父':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=1
        g_cbdb_kin.node[kin_node]['gender']=0
    elif kin_nm=='祖父':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=2
        g_cbdb_kin.node[kin_node]['gender']=0
    elif kin_nm=='曾祖父':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=3
        g_cbdb_kin.node[kin_node]['gender']=0
    # Downward.
    elif '子' in kin_nm and '孫' not in kin_nm:
        g_cbdb_kin[edge[0]][edge[1]]['weight']=-1
        g_cbdb_kin.node[kin_node]['gender']=0
    elif kin_nm=='孫':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=-2
        g_cbdb_kin.node[kin_node]['gender']=0
    elif '曾孫' in kin_nm:
        g_cbdb_kin[edge[0]][edge[1]]['weight']=-3
        g_cbdb_kin.node[kin_node]['gender']=0
    elif '玄孫' in kin_nm:
        g_cbdb_kin[edge[0]][edge[1]]['weight']=-4
        g_cbdb_kin.node[kin_node]['gender']=0
    # Peer.
    elif kin_nm=='弟' or kin_nm=='兄':
        g_cbdb_kin[edge[0]][edge[1]]['weight']=0
        g_cbdb_kin.node[kin_node]['gender']=0
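
The aggressive rules differ from the conservative ones only in using substring tests for the downward terms. The sketch below mirrors the branch order of the cell above as a pure function; the name `offset_aggressive` is hypothetical, and the labels 長子 and 孫子 are illustrative inputs rather than confirmed `c_kinrel_chn` values.

```python
def offset_aggressive(kin_nm):
    """Generation offset under the aggressive rules, or None if unsupported."""
    upward = {'父': 1, '祖父': 2, '曾祖父': 3}
    if kin_nm in upward:
        return upward[kin_nm]
    if '子' in kin_nm and '孫' not in kin_nm:
        return -1   # any label containing 子 but not 孫
    if kin_nm == '孫':
        return -2
    if '曾孫' in kin_nm:
        return -3   # also covers compound labels such as '曾孫; 重孫'
    if '玄孫' in kin_nm:
        return -4
    if kin_nm in ('弟', '兄'):
        return 0
    return None

for label in ['子', '長子', '曾孫; 重孫', '孫子']:
    print(label, offset_aggressive(label))
```

The label `曾孫; 重孫` seen in the edge output above would be missed by an exact comparison with `曾孫`, which is the kind of case the aggressive rules are meant to capture.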

In [52]:
df_kin_cbdb_chn.sample(3)

Unnamed: 0,tts_sysno,c_personid,c_kin_id,c_kin_code,c_source,c_pages,c_notes,c_autogen_notes,c_created_by,c_created_date,c_modified_by,c_modified_date,c_ego_chn,c_kin_chn
211467,209571.0,E_164423,K_143755,75,32039.0,(XJ)Kaiyuan170,YP NewEpitaphID=6062,,load,20130923,,,范庭金,范安及
207867,206283.0,E_161169,K_142083,75,32038.0,Qianfu 15,YP NewEpitaphID=3409,,load,20130923,,,牛孟十,牛延宗
353328,351487.0,E_213087,K_206085,125,32079.0,第二甲第七十名,,,load,20130923,,,李三省,李三才


In [53]:
# Original kinship relationships.
len(g_cbdb_kin.edges)

291904

In [54]:
# Now let's do some calculation on CBDB kinship relationships to infer more parental relationships.

for node in tqdm(df_kin_cbdb_chn['c_personid']):
    edge_list=list(g_cbdb_kin.edges(node, data=True))
    for index1 in range(len(edge_list)-1):
        edge1=edge_list[index1]
        for index2 in range(index1+1, len(edge_list)):
            edge2=edge_list[index2]
            if 'weight' in edge1[2].keys() and 'weight' in edge2[2].keys():
                # The two "if" blocks below are meant for directed graphs.
                # But let's ignore the direction for now.
                if edge1[2]['weight']-edge2[2]['weight']==1:
                    node_f=[s for s in [edge1[0], edge1[1]] if 'K' in s][0]
                    node_c=[s for s in [edge2[0], edge2[1]] if 'K' in s][0]
                    g_cbdb_kin.add_edge(node_f, node_c, weight=1)
                if edge1[2]['weight']-edge2[2]['weight']==-1:
                    node_c=[s for s in [edge1[0], edge1[1]] if 'K' in s][0]
                    node_f=[s for s in [edge2[0], edge2[1]] if 'K' in s][0]
                    g_cbdb_kin.add_edge(node_f, node_c, weight=1)

100%|██████████| 292171/292171 [00:23<00:00, 12191.28it/s]


In [55]:
len(g_cbdb_kin.edges)
# 93,739 parental relationships inferred.
# 142,646 parental relationships inferred (434,550 - 291,904 in this run).

434550

In [56]:
df_kin_tts.set_index('line_id', inplace=True)

### Matching TTS and CBDB.

In [57]:
# Matching criteria: Ego CBDB ID AND Kin name.

g_cbdb_node_list=[s.strip('E_').strip('K_') for s in list(g_cbdb_kin.nodes())]
for edge in tqdm(list(g_tts_kin.edges(data=True))):
    if 'C' in str(edge):
        cbdb_ego_id=[s for s in edge[0:2] if s.startswith('C')][0].strip('C_')
        line_id=[s for s in edge[0:2] if s.startswith('L')][0]
        line_id_chn=g_tts_kin.node[line_id]['name']
        kin_cbdb_id_list=[]
        if 'E_'+cbdb_ego_id in g_cbdb_kin.nodes():
            kin_cbdb_id_list+=[s.strip('E_').strip('K_') for s in list(g_cbdb_kin.neighbors('E_'+cbdb_ego_id))]
        if 'K_'+cbdb_ego_id in g_cbdb_kin.nodes():
            kin_cbdb_id_list+=[s.strip('E_').strip('K_') for s in list(g_cbdb_kin.neighbors('K_'+cbdb_ego_id))]
        for kin_cbdb_id in set(kin_cbdb_id_list):
            c_name_chn=df_biog_main.loc[int(kin_cbdb_id)]['c_name_chn']
            if c_name_chn==line_id_chn:
                df_kin_tts.loc[line_id, 'kin_cbdb_id']=int(kin_cbdb_id)
                df_kin_tts.loc[line_id, 'kin_cbdb_chn']=c_name_chn

100%|██████████| 66752/66752 [00:17<00:00, 3925.82it/s]
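
A note on the prefix handling in the cell above: `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix. It gives the intended result here only because the IDs consist of digits. A small sketch of a prefix-safe alternative follows; the helper name `strip_prefix` and the string `E_EXAMPLE_E` are hypothetical.

```python
def strip_prefix(node_id):
    """Return the part after the first underscore, e.g. 'E_5' -> '5'."""
    return node_id.partition('_')[2]

# With purely numeric IDs both approaches agree.
print('K_204014'.strip('E_').strip('K_'), strip_prefix('K_204014'))

# With non-numeric content, strip() also eats matching characters at the ends.
print('E_EXAMPLE_E'.strip('E_'), strip_prefix('E_EXAMPLE_E'))
```

The expression `s.split('_')[1]`, which the notebook uses further below, is equally safe for IDs that contain a single underscore.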


In [58]:
# Matching criteria: People's name AND Kinship relationship.

# Construct a list of TTS parental relationships to be matched.
tts_parental_edge_list=[s for s in list(g_tts_kin.edges) if 'C' not in str(s)]

# Get a collection of parental relationships from CBDB. Use as base for matching.
cbdb_parental_edge_list=[s for s in g_cbdb_kin.edges 
                         # .get() avoids a KeyError on edges that were never given a weight.
                         if g_cbdb_kin.edges[s].get('weight') in (1, -1)]
# Drop duplicates. Order is kept, i.e., (2,3) and (3,2) are treated as different pairs.
cbdb_parental_edge_list=set([(s[0].split('_')[1], s[1].split('_')[1]) for s in cbdb_parental_edge_list])

In [None]:
# Construct a dataframe for faster matching.
# Collect the rows in a list and build the dataframe once:
# appending to a dataframe inside the loop copies it on every iteration.
cbdb_parental_rows=[]
for cbdb_parental_edge in tqdm(cbdb_parental_edge_list):
    cbdb_nm1=df_biog_main.loc[int(cbdb_parental_edge[0]), 'c_name_chn']
    cbdb_nm2=df_biog_main.loc[int(cbdb_parental_edge[1]), 'c_name_chn']
    cbdb_parental_rows.append([cbdb_parental_edge[0], cbdb_parental_edge[1], cbdb_nm1, cbdb_nm2])
df_cbdb_parental_edge=pd.DataFrame(cbdb_parental_rows, columns=['cbdb_id1', 'cbdb_id2', 'cbdb_nm1', 'cbdb_nm2'])

 61%|██████    | 152555/249752 [1:37:19<1:02:00, 26.12it/s]

In [None]:
# Save as intermediary file to save time.
df_cbdb_parental_edge.to_pickle('../data_file/intermediary/df_cbdb_parental_edge.pkl.gzip', compression='gzip')

In [None]:
for tts_parental_edge in tqdm(tts_parental_edge_list):
    tts_nm1=g_tts_kin.node[tts_parental_edge[0]]['name']
    tts_nm2=g_tts_kin.node[tts_parental_edge[1]]['name']
    df_temp=df_cbdb_parental_edge[(df_cbdb_parental_edge.cbdb_nm1==tts_nm1)&(df_cbdb_parental_edge.cbdb_nm2==tts_nm2)]
    if not df_temp.empty:
        df_kin_tts.loc[tts_parental_edge[0], 'kin_cbdb_chn']=tts_nm1
        df_kin_tts.loc[tts_parental_edge[1], 'kin_cbdb_chn']=tts_nm2
        df_kin_tts.loc[tts_parental_edge[0], 'kin_cbdb_id']=str(df_temp['cbdb_id1'].values)
        df_kin_tts.loc[tts_parental_edge[1], 'kin_cbdb_id']=str(df_temp['cbdb_id2'].values)
    df_temp=df_cbdb_parental_edge[(df_cbdb_parental_edge.cbdb_nm1==tts_nm2)&(df_cbdb_parental_edge.cbdb_nm2==tts_nm1)]
    if not df_temp.empty:
        df_kin_tts.loc[tts_parental_edge[0], 'kin_cbdb_chn']=tts_nm1
        df_kin_tts.loc[tts_parental_edge[1], 'kin_cbdb_chn']=tts_nm2
        df_kin_tts.loc[tts_parental_edge[0], 'kin_cbdb_id']=str(df_temp['cbdb_id2'].values)
        df_kin_tts.loc[tts_parental_edge[1], 'kin_cbdb_id']=str(df_temp['cbdb_id1'].values)
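
The name-pair matching above can also be sketched with a dictionary keyed by name pairs, which avoids filtering the whole dataframe for every TTS edge. This is an illustration under stated assumptions, not the notebook's code: the helper names `build_name_index` and `match_name_pair` are hypothetical, and the CBDB IDs `9001` to `9004` are placeholders attached to names taken from the TTS sample.

```python
def build_name_index(cbdb_edges, names_by_id):
    """Index CBDB parental edges by their (name1, name2) pair."""
    index = {}
    for id1, id2 in cbdb_edges:
        key = (names_by_id[id1], names_by_id[id2])
        index.setdefault(key, []).append((id1, id2))
    return index

def match_name_pair(index, nm1, nm2):
    """Return candidate (id_for_nm1, id_for_nm2) tuples, trying both orders."""
    hits = list(index.get((nm1, nm2), []))
    hits += [(b, a) for a, b in index.get((nm2, nm1), [])]
    return hits

# Placeholder IDs; two distinct people share the name 邵鐸 in this toy table.
names_by_id = {'9001': '邵基', '9002': '邵鐸', '9003': '邵鐸', '9004': '邵桓'}
cbdb_edges = [('9001', '9002'), ('9004', '9003')]
index = build_name_index(cbdb_edges, names_by_id)

print(match_name_pair(index, '邵基', '邵鐸'))
print(match_name_pair(index, '邵鐸', '邵桓'))
```

Returning a list of candidates keeps ambiguity explicit: more than one hit for a name pair means that the names alone do not identify the persons.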

In [None]:
df_kin_tts.to_excel('../data_file/intermediary/df_kin_tts.xlsx', encoding='utf8')

## Phase I: Exact Matching.

### First let's try various types of sons.

In [84]:
##### Construct Father-Son pair from CBDB.
kin_code_list_son=[s for s in df_kinship_codes.index if '子' in str(df_kinship_codes.loc[s]['c_kinrel_chn'])]

df_kin_data_son=df_kin_data[df_kin_data['c_kin_code'].isin(kin_code_list_son)]

father_son_list_cbdb=[]
for index in tqdm(df_kin_data_son.index):
    try:
        father_c_personid=df_kin_data_son.loc[index, 'c_personid']
        son_name=df_biog_main.loc[df_kin_data_son.loc[index, 'c_kin_id'], 'c_name_chn']
        father_son_list_cbdb.append([father_c_personid, son_name])
    except:
        father_c_personid=df_kin_data_son.loc[index, 'c_personid']
        father_son_list_cbdb.append([father_c_personid, 'son_name_NULL'])

##### Construct Father-Son pair from TTS.
df_kin_son_tts=df_kin_tts.loc[[index for index in df_kin_tts.index if '子' in df_kin_tts.loc[index]['this is the relationship of the kin to EGO in Chinese']]]
father_son_list_tts=[]
for index in tqdm(df_kin_son_tts.index):
    try:
        father_c_personid=df_kin_son_tts.loc[index, 'CBDB PersonID of EGO']
        son_name=df_kin_son_tts.loc[index, 'This is the kin who is related to EGO.full name of kin']
        father_son_list_tts.append([father_c_personid, son_name])
    except:
        father_c_personid=df_kin_son_tts.loc[index, 'CBDB PersonID of EGO']
        father_son_list_tts.append([father_c_personid, 'son_name_NULL'])

##### Get the number of intersections, i.e., number of CBDB IDs to be assigned to TTS.
len(set([str(s) for s in father_son_list_cbdb]).intersection([str(s) for s in father_son_list_tts]))

100%|██████████| 122492/122492 [00:07<00:00, 15506.81it/s]
100%|██████████| 9163/9163 [00:00<00:00, 23521.31it/s]


251

### Let's build the process in bulk.

In [6]:
print('Kinship_recognized'+'\t'+'#Records_recognized')
for kr_str in df_kin_tts.groupby('this is the relationship of the kin to EGO in Chinese').count()[['line_id']].sort_values(by='line_id', ascending=False)[0:20].index:
    # KR = kinship to be recognized.
    print(kr_str+'\t', end='')
    ##### Construct KR pair from CBDB.
    kin_code_list_kr=[s for s in df_kinship_codes.index if kr_str in str(df_kinship_codes.loc[s]['c_kinrel_chn'])]
    df_kin_data_kr=df_kin_data[df_kin_data['c_kin_code'].isin(kin_code_list_kr)]
    kr_list_cbdb=[]
    for index in df_kin_data_kr.index:
        try:
            c_personid=df_kin_data_kr.loc[index, 'c_personid']
            kr_name=df_biog_main.loc[df_kin_data_kr.loc[index, 'c_kin_id'], 'c_name_chn']
            kr_list_cbdb.append([c_personid, kr_name])
        except:
            c_personid=df_kin_data_kr.loc[index, 'c_personid']
            kr_list_cbdb.append([c_personid, 'kr_name_NULL'])

    ##### Construct KR pair from TTS.
    df_kin_kr_tts=df_kin_tts.loc[[index for index in df_kin_tts.index if kr_str in df_kin_tts.loc[index]['this is the relationship of the kin to EGO in Chinese']]]
    kr_list_tts=[]
    for index in df_kin_kr_tts.index:
        try:
            c_personid=df_kin_kr_tts.loc[index, 'CBDB PersonID of EGO']
            kr_name=df_kin_kr_tts.loc[index, 'This is the kin who is related to EGO.full name of kin']
            kr_list_tts.append([c_personid, kr_name])
        except:
            c_personid=df_kin_kr_tts.loc[index, 'CBDB PersonID of EGO']
            kr_list_tts.append([c_personid, 'kr_name_NULL'])

    ##### Get the number of intersections, i.e., number of CBDB IDs to be assigned to TTS.
    print(len(set([str(s) for s in kr_list_cbdb]).intersection([str(s) for s in kr_list_tts])))

Kinship_recognized	#Records_recognized
子	251
孫	32
父	530
弟	111
兄	79
祖父	147
曾孫	2
曾祖父	0
祖	302
玄孫	1
曾祖	116
姪	3
祖先	0
高祖父	0
兄弟	1
後代	0
婿	15
母	4
叔	1
長子	1


## Draft.

Draft code
```Python
    for cbdb_parental_edge in cbdb_parental_edge_list:
        cbdb_nm1=df_biog_main.loc[int(cbdb_parental_edge[0].split('_')[1]), 'c_name_chn']
        cbdb_nm2=df_biog_main.loc[int(cbdb_parental_edge[1].split('_')[1]), 'c_name_chn']
        if tts_nm1==cbdb_nm1 and tts_nm2==cbdb_nm2:
            df_kin_tts.loc[tts_parental_edge[0], 'kin_cbdb_chn']=cbdb_nm1
            df_kin_tts.loc[tts_parental_edge[1], 'kin_cbdb_chn']=cbdb_nm2
            df_kin_tts.loc[tts_parental_edge[0], 'kin_cbdb_id']=int(cbdb_parental_edge[0].split('_')[1])
            df_kin_tts.loc[tts_parental_edge[1], 'kin_cbdb_id']=int(cbdb_parental_edge[1].split('_')[1])
        if tts_nm1==cbdb_nm2 and tts_nm2==cbdb_nm1:
            df_kin_tts.loc[tts_parental_edge[0], 'kin_cbdb_chn']=cbdb_nm2
            df_kin_tts.loc[tts_parental_edge[1], 'kin_cbdb_chn']=cbdb_nm1
            df_kin_tts.loc[tts_parental_edge[0], 'kin_cbdb_id']=int(cbdb_parental_edge[1].split('_')[1])
            df_kin_tts.loc[tts_parental_edge[1], 'kin_cbdb_id']=int(cbdb_parental_edge[0].split('_')[1])
    pass
```

In [None]:
df_kinship_codes.to_excel('dump/df_kinship_codes.xlsx', encoding='utf8')

In [None]:
len([s for s in df_tobe_done['this is the relationship of the kin to EGO in Chinese'] if '子' in s])