*Script purpose: TTS-CBDB Kinship disambiguation.*

### First phase: Exact matching
1. Restrict the kinship type to "子" (son). The corresponding kinship codes:
```Python
kin_code_list_son=['180','182','183','184','185','186','191','193','194','195','196','198',
                   '199','202','205','206','207','211','212','213','214','220','221','222',
                   '226','229','231','234','235','307','326','339','436','560','568','572','575',]
```
2. Construct a list of pairs from CBDB data: the ego's CBDB ID plus the kin's name (for these kin we also have a CBDB ID), e.g.,
```Python
[[0, '田照鄰'],
 [1, '安邡'],
 [1, '安邠'],
 [1, '安郊'],
 [1, '安邦'],
 [3, '安扶'],
 [4, '查沖之'],
 [4, '查循之'],
 [6, '柴貽範'],
 [12, '晁子與'],
 [10097, '宋璲']]
```
3. Construct the same kind of pairs from TTS data: the ego's CBDB ID (which we have) plus the kin's name (for which we have no CBDB ID), e.g.,
```Python
[[56812, '富察昌齡'],
 [56812, '富察科占'],
 [56812, '富察查納'],
 [56813, '邵桓'],
 [56814, '刁錄'],
 [56814, '刁鈞'],
 [56814, '刁安仁'],
 [56814, '刁錦'],
 [56815, '邵鐸'],
 [56816, '于廷翼'],
 [10097, '宋璲']]
```
This is very similar to step 2, except that we do not have CBDB IDs for the kin.
4. Find the intersection of the two lists. In the example it is `[10097, '宋璲']`, so we assign `宋璲`'s CBDB ID, i.e., `120940`, to the TTS record.
5. For kinship type "子", this resolves 233 of the 7983 TTS pairs (about 3%), which is not bad for a first pass.
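
Steps 2 to 4 amount to a set intersection on `(ego ID, kin name)` pairs. The following is a minimal sketch using a few of the example pairs above. The variable names and the dictionary layout are assumptions for illustration, not the script's actual data structures. Only `宋璲`'s kin ID (`120940`) is given in the example, so the other kin IDs are left as `None` placeholders.

```python
# CBDB side: (ego CBDB ID, kin name) -> kin CBDB ID.
# Only 120940 comes from the example; None marks kin IDs not shown there.
cbdb_pairs = {
    (0, '田照鄰'): None,
    (1, '安邡'): None,
    (4, '查沖之'): None,
    (10097, '宋璲'): 120940,
}

# TTS side: the same kind of pair, but the kin has no CBDB ID yet.
tts_pairs = [
    (56812, '富察昌齡'),
    (56813, '邵桓'),
    (56814, '刁錄'),
    (10097, '宋璲'),
]

# A pair present on both sides lets us copy the kin's CBDB ID over to TTS.
resolved = {pair: cbdb_pairs[pair] for pair in tts_pairs if pair in cbdb_pairs}
print(resolved)
```

Tuples are hashable, so they can be used directly as set members or dictionary keys without first converting each pair to a string.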

### Second phase: Kinship normalization.

- Basic principle: make maximal use of the kinship information in CBDB when disambiguating TTS records.
- Task: fill in the missing information (i.e., CBDB IDs) in the TTS kinship network (TKN) using information from the CBDB kinship network (CKN).
- Requirement: TKN must be comparable to CKN.
- However, TKN and CKN are not directly comparable (not apples to apples),
- because we do not have a normalized, standard method for representing kinship networks.

To obtain a normalized, standard method for representing kinship networks, we need the following:
- A list of basic kinship relationships. Many studies discuss this, including Deng Ke's.
- However, the basic relationships and the representation method are not ontologically neat (normalized): some of the basic relationships can be expressed in terms of other basic relationships.
    - For example, under Deng Ke's method, if A is B's son, then B is A's father.
    - Two kinds of information are mixed here: A and B are both male, and A is the child of B. Gender is a property of a person; kinship is a property of a relationship.

So a better method for representing kinship relationships would be:
- Gender is an attribute of nodes; the basic kinship relationship is an attribute of ties.
- The basic kinship relationships are only: child and spouse.
- An immediate family is defined as two nodes connected by a "spouse" tie, together with their child nodes.
- All other kinship relationships are calculated from the shortest path between two nodes.
- This is interesting because the shortest-path method resembles how we "calculate" kinship in the real world.
- It is also plausible, since a calculation based on the shortest path requires the least cognitive effort.
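
The representation described above can be sketched on a toy family. The people, the `child_ties` and `spouse_ties` lists, and the `kinship_path` helper are all hypothetical names introduced for illustration. The search is a plain breadth-first search from the standard library so the example stays self-contained; a graph library's shortest-path routine could play the same role.

```python
from collections import deque

# Node attribute: gender. Tie attributes: only "child" and "spouse".
gender = {'grandfather': 'M', 'father': 'M', 'mother': 'F', 'ego': 'M', 'brother': 'M'}
child_ties = [('grandfather', 'father'), ('father', 'ego'), ('father', 'brother'),
              ('mother', 'ego'), ('mother', 'brother')]  # (parent, child)
spouse_ties = [('father', 'mother')]

# Expand the two basic ties into labelled steps that can be walked both ways.
steps = {name: [] for name in gender}
for parent, child in child_ties:
    steps[parent].append(('child', child))
    steps[child].append(('parent', parent))
for a, b in spouse_ties:
    steps[a].append(('spouse', b))
    steps[b].append(('spouse', a))

def kinship_path(source, target):
    """Breadth-first search: return the shortest list of steps from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for label, neighbour in steps[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [label]))
    return None  # the two people are not connected

print(kinship_path('ego', 'grandfather'), gender['grandfather'])
print(kinship_path('ego', 'brother'), gender['brother'])
```

In this sketch a derived relationship is the sequence of basic steps plus the gender of the target node: two `parent` steps ending at a male node correspond to "祖父", and a `parent` step followed by a `child` step ending at a male node corresponds to a brother.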

In [1]:
%matplotlib inline
import sqlite3
import pandas as pd
import networkx as nx
import xlrd
import matplotlib.pyplot as plt
import math
import warnings
from tqdm import tqdm
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

In [2]:
# Read data.
conn = sqlite3.connect('../../SQL/20170424CBDBauUserSqlite.db')
df_kin_data=pd.read_sql_query("SELECT * FROM KIN_DATA", conn)
df_biog_main=pd.read_sql_query("SELECT * FROM BIOG_MAIN", conn).set_index('c_personid')
df_kinship_codes=pd.read_sql_query("SELECT * FROM KINSHIP_CODES", conn).set_index('c_kincode')
df_kin_tts=pd.read_excel('../data_file/110305 Kinship data (TTS_MQ for Deng Ke).xls', sheet_name='DATA')

In [3]:
# Get Top 10 kinship relationship in TTS dataset.
df_kin_tts.groupby('this is the relationship of the kin to EGO in Chinese').count()[['line_id']].sort_values(by='line_id', ascending=False)[0:10]

this is the relationship of the kin to EGO in Chinese,line_id
子,7983
孫,3925
父,3262
弟,1998
兄,1481
祖父,1443
曾孫,1276
曾祖父,931
祖,658
玄孫,578


### Kinship normalization.

In [4]:
df_kin_tts['line_id']=['L_'+str(s) for s in df_kin_tts['line_id']]
df_kin_tts['CBDB PersonID of EGO']=['C_'+str(s) for s in df_kin_tts['CBDB PersonID of EGO']]

In [81]:
df_kin_tts.head(30)

Unnamed: 0,line_id,CBDB PersonID of EGO,Name in Chinese of EGO; this is the main subject.,sys_id,This is the kin who is related to EGO.full name of kin,Kin Last Name,Kin FirstName,this is the relationship of the kin to EGO in Chinese,this is the kinship_code for the relationship of the kin to EGO
0,L_1,C_56812,富察傅鼐,1,富察額色泰,富察,額色泰,祖父,62
1,L_2,C_56812,富察傅鼐,1,富察噶爾漢,富察,噶爾漢,父,75
2,L_3,C_56812,富察傅鼐,1,富察昌齡,富察,昌齡,子,180
3,L_4,C_56812,富察傅鼐,1,富察科占,富察,科占,子,180
4,L_5,C_56812,富察傅鼐,1,富察查納,富察,查納,子,180
5,L_6,C_56813,邵洪,2,邵基,邵,基,祖父,62
6,L_7,C_56813,邵洪,2,邵鐸,邵,鐸,父,75
7,L_8,C_56813,邵洪,2,邵桓,邵,桓,子,180
8,L_9,C_56814,刁承祖,3,刁克俊,刁,克俊,曾祖父,48
9,L_10,C_56814,刁承祖,3,刁包,刁,包,祖父,62


In [69]:
g_tts_kin=nx.Graph()
for index in tqdm(df_kin_tts.index):
    # Each row links a kin node (line_id) to its ego node (CBDB person ID).
    edge=df_kin_tts.loc[index][['line_id', 'CBDB PersonID of EGO']].tolist()
    edge_attr=df_kin_tts.loc[index]['this is the relationship of the kin to EGO in Chinese']
    g_tts_kin.add_edges_from([edge], kin_nm=edge_attr)
    # Store each person's name as a node attribute.
    g_tts_kin.nodes[edge[0]]['name']=df_kin_tts.loc[index, 'This is the kin who is related to EGO.full name of kin']
    g_tts_kin.nodes[edge[1]]['name']=df_kin_tts.loc[index, 'Name in Chinese of EGO; this is the main subject. ']

100%|██████████| 34758/34758 [01:24<00:00, 412.88it/s]


In [71]:
g_tts_kin.nodes['C_56814'], g_tts_kin.nodes['L_10']

({'name': '刁承祖'}, {'name': '刁包'})

In [76]:
list(g_tts_kin.edges('C_56814', data=True))

[('C_56814', 'L_9', {'kin_nm': '曾祖父', 'weight': 3}),
 ('C_56814', 'L_10', {'kin_nm': '祖父', 'weight': 2}),
 ('C_56814', 'L_11', {'kin_nm': '父', 'weight': 1}),
 ('C_56814', 'L_12', {'kin_nm': '子', 'weight': -1}),
 ('C_56814', 'L_13', {'kin_nm': '子', 'weight': -1}),
 ('C_56814', 'L_14', {'kin_nm': '子', 'weight': -1}),
 ('C_56814', 'L_15', {'kin_nm': '子', 'weight': -1})]

In [73]:
'''
Prepare to "normalize" the kinship:
    - Edge weight records the parental relationship as a generation offset
      of the kin relative to ego (positive = ancestor, negative = descendant).
    - Gender: 0 = male; 1 = female.
'''
# Generation offset for each relationship handled so far; all of these kin are male.
kin_weight={'父': 1, '祖父': 2, '曾祖父': 3,
            '子': -1, '孫': -2, '曾孫': -3, '玄孫': -4,
            '弟': 0, '兄': 0}

for edge in g_tts_kin.edges:
    kin_nm=g_tts_kin[edge[0]][edge[1]]['kin_nm']
    # The kin node is the one whose ID starts with 'L' (line_id).
    kin_node=[s for s in edge if s.startswith('L')][0]

    if kin_nm in kin_weight:
        g_tts_kin[edge[0]][edge[1]]['weight']=kin_weight[kin_nm]
        g_tts_kin.nodes[kin_node]['gender']=0

In [74]:
len([g_tts_kin[edge[0]][edge[1]] for edge in g_tts_kin.edges if 'weight' in g_tts_kin[edge[0]][edge[1]].keys()]), len(g_tts_kin.edges)

(22877, 34758)

In [75]:
for node in tqdm(df_kin_tts['CBDB PersonID of EGO']):
    edge_list=list(g_tts_kin.edges(node, data=True))
    for index1 in range(len(edge_list)-1):
        edge1=edge_list[index1]
        for index2 in range(index1+1, len(edge_list)):
            edge2=edge_list[index2]
            if 'weight' in edge1[2].keys() and 'weight' in edge2[2].keys():
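                # Strictly speaking, the two checks below belong in a directed graph
                # (parent -> child), but let's ignore the direction for now.
                # Caveat: siblings carry weight 0, so a 兄/弟 and 子 pair also differs by 1.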
                if edge1[2]['weight']-edge2[2]['weight']==1:
                    node_f=[s for s in [edge1[0], edge1[1]] if 'L' in s][0]
                    node_c=[s for s in [edge2[0], edge2[1]] if 'L' in s][0]
                    g_tts_kin.add_edge(node_f, node_c, weight=1)
                if edge1[2]['weight']-edge2[2]['weight']==-1:
                    node_c=[s for s in [edge1[0], edge1[1]] if 'L' in s][0]
                    node_f=[s for s in [edge2[0], edge2[1]] if 'L' in s][0]
                    g_tts_kin.add_edge(node_f, node_c, weight=-1)

100%|██████████| 34758/34758 [00:22<00:00, 1561.99it/s]
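
The pairing rule in the loop above can be illustrated on a single ego. This is a minimal sketch, not the notebook's code: it uses some of the kin of `C_56814` and their weights as listed earlier, and the names `kin_offsets` and `parent_child` are assumptions made for the example.

```python
from itertools import combinations

# Some kin of ego C_56814 with their generation offsets (the edge weights).
kin_offsets = {'L_9': 3, 'L_10': 2, 'L_11': 1, 'L_12': -1, 'L_13': -1}

parent_child = []
for (kin_a, w_a), (kin_b, w_b) in combinations(kin_offsets.items(), 2):
    if w_a - w_b == 1:
        # kin_a is exactly one generation above kin_b.
        parent_child.append((kin_a, kin_b))
    elif w_a - w_b == -1:
        # kin_b is exactly one generation above kin_a.
        parent_child.append((kin_b, kin_a))

print(parent_child)
```

Here the great-grandfather is linked to the grandfather and the grandfather to the father, while the father and the sons are two generations apart and are connected only through the ego. The rule is a heuristic: two relatives one generation apart are not necessarily parent and child. For instance, when an ego has several sons, every grandson would be linked to every son.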


In [82]:
# Parental relationship added:
len([s for s in g_tts_kin.edges if 'L' in s[0] and 'L' in s[1]])

31994

In [85]:
# Do a quick check.
g_tts_kin.edges('L_23')
# Looks GOOD!!!!

EdgeDataView([('L_23', 'C_56816'), ('L_23', 'L_20'), ('L_23', 'L_21'), ('L_23', 'L_22')])

In [92]:
num=0
for edge in g_tts_kin.edges(data=True):
    if 'weight' in edge[2].keys():
        if abs(edge[2]['weight'])==1:
            num+=1
print('#parental relationship pairs:', num)

#parental relationship pairs: 43239


#### TODO: Build and normalize CBDB kinship graph.

### Exact matching.

#### First, let's try the various types of sons.

In [84]:
# Construct father-son pairs from CBDB.
# Substring match: '子' also selects codes whose Chinese label merely contains it, e.g. '長子'.
kin_code_list_son=[s for s in df_kinship_codes.index if '子' in str(df_kinship_codes.loc[s]['c_kinrel_chn'])]

df_kin_data_son=df_kin_data[df_kin_data['c_kin_code'].isin(kin_code_list_son)]

father_son_list_cbdb=[]
for index in tqdm(df_kin_data_son.index):
    father_c_personid=df_kin_data_son.loc[index, 'c_personid']
    try:
        son_name=df_biog_main.loc[df_kin_data_son.loc[index, 'c_kin_id'], 'c_name_chn']
    except KeyError:
        # The kin ID has no entry in BIOG_MAIN.
        son_name='son_name_NULL'
    father_son_list_cbdb.append([father_c_personid, son_name])

# Construct father-son pairs from TTS.
df_kin_son_tts=df_kin_tts.loc[[index for index in df_kin_tts.index if '子' in df_kin_tts.loc[index]['this is the relationship of the kin to EGO in Chinese']]]
father_son_list_tts=[]
for index in tqdm(df_kin_son_tts.index):
    father_c_personid=df_kin_son_tts.loc[index, 'CBDB PersonID of EGO']
    son_name=df_kin_son_tts.loc[index, 'This is the kin who is related to EGO.full name of kin']
    father_son_list_tts.append([father_c_personid, son_name])

# Get the number of intersections, i.e., the number of CBDB IDs to be assigned to TTS.
len(set([str(s) for s in father_son_list_cbdb]).intersection([str(s) for s in father_son_list_tts]))

100%|██████████| 122492/122492 [00:07<00:00, 15506.81it/s]
100%|██████████| 9163/9163 [00:00<00:00, 23521.31it/s]


251

#### Let's run the same process in bulk for the 20 most frequent relationships.

In [6]:
print('Kinship_recognized'+'\t'+'#Records_recognized')
# The 20 most frequent relationship labels in TTS.
top_kr=df_kin_tts.groupby('this is the relationship of the kin to EGO in Chinese').count()[['line_id']].sort_values(by='line_id', ascending=False)[0:20].index
for kr_str in top_kr:
    # KR = kinship to be recognized.
    print(kr_str+'\t', end='')
    # Construct KR pairs from CBDB (substring match on the Chinese label).
    kin_code_list_kr=[s for s in df_kinship_codes.index if kr_str in str(df_kinship_codes.loc[s]['c_kinrel_chn'])]
    df_kin_data_kr=df_kin_data[df_kin_data['c_kin_code'].isin(kin_code_list_kr)]
    kr_list_cbdb=[]
    for index in df_kin_data_kr.index:
        c_personid=df_kin_data_kr.loc[index, 'c_personid']
        try:
            kr_name=df_biog_main.loc[df_kin_data_kr.loc[index, 'c_kin_id'], 'c_name_chn']
        except KeyError:
            # The kin ID has no entry in BIOG_MAIN.
            kr_name='kr_name_NULL'
        kr_list_cbdb.append([c_personid, kr_name])

    # Construct KR pairs from TTS (same substring match on the relationship label).
    df_kin_kr_tts=df_kin_tts.loc[[index for index in df_kin_tts.index if kr_str in df_kin_tts.loc[index]['this is the relationship of the kin to EGO in Chinese']]]
    kr_list_tts=[]
    for index in df_kin_kr_tts.index:
        c_personid=df_kin_kr_tts.loc[index, 'CBDB PersonID of EGO']
        kr_name=df_kin_kr_tts.loc[index, 'This is the kin who is related to EGO.full name of kin']
        kr_list_tts.append([c_personid, kr_name])

    # Get the number of intersections, i.e., the number of CBDB IDs to be assigned to TTS.
    print(len(set([str(s) for s in kr_list_cbdb]).intersection([str(s) for s in kr_list_tts])))

Kinship_recognized	#Records_recognized
子	251
孫	32
父	530
弟	111
兄	79
祖父	147
曾孫	2
曾祖父	0
祖	302
玄孫	1
曾祖	116
姪	3
祖先	0
高祖父	0
兄弟	1
後代	0
婿	15
母	4
叔	1
長子	1


### Draft.

In [None]:
df_kinship_codes.to_excel('dump/df_kinship_codes.xlsx')

In [None]:
len([s for s in df_tobe_done['this is the relationship of the kin to EGO in Chinese'] if '子' in s])