# Disambiguating and collocating participants

This notebook follows "1a_Generating Hierarchies of Participants.ipynb" and provides the means of combining participants across multiple chapters. The essential tasks are to 1) disambiguate non-coreferring participants with the same label, and 2) collocate coreferring participants with different labels.

**Content**
1. Disambiguate labels
2. Collocate labels
3. Production and validation
4. Remove hypernyms
5. Export

In [1]:
#Dataset path
PATH = 'datasets/'

import csv, collections, html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tf.app import use

In [2]:
A = use('bhsa', hoist=globals(), mod='etcbc/heads/tf')

rate limit is 60 requests per hour, with 0 left for this hour
To increase the rate,see https://annotation.github.io/text-fabric/Api/Repo/
	connecting to online GitHub repo annotation/app-bhsa ... failed
GitHub says: 403 {"message": "API rate limit exceeded for 212.237.134.12. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)", "documentation_url": "https://developer.github.com/v3/#rate-limiting"}
The offline TF-app may not be the latest
Using TF-app in C:\Users\Ejer/text-fabric-data/annotation/app-bhsa/code:
	rv2.0.0=#7b3b9ffba7ee6dbc76a52b8d76475d17babf0daf (latest? release)
rate limit is 60 requests per hour, with 0 left for this hour
To increase the rate,see https://annotation.github.io/text-fabric/Api/Repo/
	connecting to online GitHub repo etcbc/bhsa ... failed
GitHub says: 403 {"message": "API rate limit exceeded for 212.237.134.12. (But here's the good news: Authenticated requests get a higher rate limit. Ch

In [3]:
from Nodes import GenerateNodes


## 1. Disambiguation

Some actor labels are identical across the chapters. Before they can be automatically collocated, they need to be reviewed in order to remove false positives.

A few functions assist the analysis:

In [4]:
class Collocations():
    
    def __init__(self, csv_file):
        dataframe = pd.read_csv(csv_file)
        dataframe['book'] = dataframe['book'].str.slice(0,3)
        dataframe['chapt'] = dataframe.apply(lambda row: f'{row.book}{row.chapter}', axis=1)
        
        self.data = dataframe
        
    def predictCollocations(self, dataframe):
        
        table = pd.crosstab(index=dataframe.actor, columns=dataframe.chapt)
        
        return table

    def showPredictions(self, dataframe):
        
        self.prediction = self.predictCollocations(dataframe)
        self.prediction['Total'] = self.prediction.sum(axis=1)
        
        print('Predicted collocations')
        display(self.prediction[self.prediction.Total > 1])
        
        print('No collocations')
        display(self.prediction[self.prediction.Total == 1])
        
    def disambiguate(self, disambig_list):
        
        df_updated = self.data.copy() #Work on a copy so the original data stay untouched
        
        for l in disambig_list:
            actor = l[0]
            identifier = l[1]
            
            refs = df_updated[(df_updated.actor == actor) & (df_updated.chapt == identifier)].references.item()
            
            #Add new actor
            new_book = identifier[:3]
            new_chapter = identifier[3:]
            
            new_actor = False
            nr = 2
            while new_actor == False:
                if not f'{actor}#{nr}' in df_updated.actor.values:
                    new_actor = f'{actor}#{nr}'
                else:
                    nr += 1
            
            new_row = {'book':new_book, 'chapter': new_chapter, 'actor':new_actor, 'references':refs, 'chapt':identifier}
            df_updated = pd.concat([df_updated, pd.DataFrame([new_row])], ignore_index=True) #DataFrame.append was removed in pandas 2.0
            
            #Delete existing actor
            row_number = df_updated[(df_updated.actor == actor) & (df_updated.chapt == identifier)].index
            df_updated = df_updated.drop(row_number)
            
            print(f'{actor} ({identifier}) --> {new_actor}')
        
        self.disambiguated = df_updated
        return df_updated
            
    def collocate(self, disambig_list, collocate_dict):
        '''
        Actor = The main actor to which the nodes of the synonym are added
        Synonym = Minor actor from which the nodes are transferred to the actor
        '''
        
        df_updated = self.disambiguate(disambig_list)       
        
        ##1. collocate actors from collocate_dict
        for actor in collocate_dict:
            
            #Walking through each of the synonyms to a particular actor
            for syn in collocate_dict[actor]:
                
                #Renaming every instance of the synonym (it may occur in multiple chapters)
                df_updated.loc[df_updated.actor == syn, 'actor'] = actor
        
        self.collocated = df_updated
        return df_updated
    
    def produceDict(self, dataframe):
        
        new_dict = collections.defaultdict(list)
        
        for row in dataframe.iterrows():
            actor = row[1].actor
            references = row[1].references
            
            new_dict[actor].append(references)
        
        return new_dict
    
    def validateDict(self, result_dict, disambig_list, collocate_dict):
        
        orig_data = self.data
        error_list = []
        
        ##1. Matching result_dict with the original dataset
        
        for actor in result_dict:
                    
            temp_list = [] 
            orig_refs = orig_data[orig_data.actor == actor].references
            for r in orig_refs:
                temp_list.append(r)
        
            for l in result_dict[actor]:
                if l in temp_list:
                    continue
                else:
                    error_list.append((actor, l))
        
        ##2. Explain error_list:
        if error_list:
            n=1
            for l in error_list:
                
                #Find references in orig_data:
                refs = l[1]
                orig_actor = orig_data[orig_data.references == refs].actor.item()
                orig_chapt = orig_data[orig_data.references == refs].chapt.item()
                
                print(f'{n}: {orig_actor} ({orig_chapt}) --> {l[0]}\n')
                n+=1

In [None]:
file_name = f'{PATH}human_references_for_plotting.csv'

data = Collocations(file_name)

#### 1a Predicting collocations

All participant labels are listed with respect to the chapters in which they occur. If they occur in more than one chapter, they need to be reviewed and possibly disambiguated.

In [6]:
data.showPredictions(data.data)

Predicted collocations


actor,Lev17,Lev18,Lev19,Lev20,Lev21,Lev22,Lev23,Lev24,Lev25,Lev26,Total
2ms,0,1,1,1,1,0,1,0,1,0,6
3mp,0,0,0,1,0,0,0,1,0,0,2
<M,0,0,0,1,0,0,0,0,0,1,2
>CH,0,1,0,1,0,0,0,0,0,0,2
>DM,0,1,0,0,0,1,0,1,0,0,3
>HRN,1,0,0,0,1,1,0,1,0,0,4
>HRN BN ->HRN,0,0,0,0,0,1,0,1,0,0,2
>HRN BN ->HRN KL BN JFR>L,1,0,0,0,0,1,0,0,0,0,2
>JC,0,0,1,1,1,1,0,0,1,1,6
>JC >JC,1,0,0,0,0,0,0,1,0,0,2


No collocations


actor,Lev17,Lev18,Lev19,Lev20,Lev21,Lev22,Lev23,Lev24,Lev25,Lev26,Total
2mp_sfx,0,0,0,0,1,0,0,0,0,0,1
3mp#2,0,0,0,1,0,0,0,0,0,0,1
3mp#3,0,0,0,1,0,0,0,0,0,0,1
3ms,0,0,1,0,0,0,0,0,0,0,1
3unknownp,0,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
XRC=/,0,0,1,0,0,0,0,0,0,0,1
ZR,0,0,0,0,0,1,0,0,0,0,1
ZR< ->JC,0,0,0,1,0,0,0,0,0,0,1
ZR</,0,0,0,0,0,1,0,0,0,0,1
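
As a minimal sketch of what the cross-tabulation does, the toy example below counts how often each actor label occurs per chapter and flags the labels found in more than one chapter. The labels `A`, `B`, `C` and the frame `toy` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per (actor label, chapter) pair
toy = pd.DataFrame({
    'actor': ['A', 'A', 'B', 'C', 'C', 'C'],
    'chapt': ['Lev17', 'Lev18', 'Lev17', 'Lev18', 'Lev19', 'Lev20'],
})

# Rows are actor labels, columns are chapters, cells are occurrence counts
table = pd.crosstab(index=toy.actor, columns=toy.chapt)
table['Total'] = table.sum(axis=1)

# Labels occurring in more than one chapter are candidates for review
candidates = table[table.Total > 1].index.tolist()
print(candidates)
```

A label with `Total == 1` occurs in a single chapter only and therefore cannot be a false positive across chapters.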


#### 1b Disambiguate

A function is defined to walk through each participant label that occurs in more than one chapter. One case is visualized for each chapter to assist the decision.

In [7]:
def show(dataframe, n, disambiguate=True):
    '''
    Input: A dataframe and number of participant
    Output: One example of the participant for each relevant chapter
    '''   
    table = data.predictCollocations(dataframe) #The dataframe is cross-tabulated (actor and chapter)
    table['Total'] = table.sum(axis=1)
    
    if disambiguate: #For disambiguation, else all participants
        table = table[table.Total > 1].copy() #Only actors occurring in more than one chapter are included
    
    table['id'] = list(range(len(table))) #The rows are numbered from 0 in order to subset by position.
    
    table = table[table.id == n] #The table is subset according to the number of the actor
    actor = table.index.item()
    chapters = list(set(dataframe.chapter))

    print(f'Participant: {actor}')
    
    #Walking through each chapter
    for ch in chapters:
        #The original dataframe is subset according to chapter and actor
        subset = dataframe[(dataframe.chapter == ch) & (dataframe.actor == actor)]
        
        if not subset.empty:
            ref = int(subset.references.item().split()[0]) #The first reference is selected
            A.pretty(L.u(ref, 'verse')[0], highlights={ref:'gold'})
            print('\n')

Now we can walk through each relevant case and check whether the participants need to be disambiguated:

In [8]:
n=0

In [26]:
show(data.data, n)
n+=1

Participant: GR

[Text-Fabric verse displays omitted: one highlighted example per chapter]

A few participants need disambiguation. They are automatically assigned new labels, derived from the original ones by appending a numbered suffix (e.g. `3mp#4`).

In [27]:
disambiguate = [('3mp','Lev20'),
                ('<M','Lev26'),
                ('>CH','Lev20'),
                ('>JC','Lev19'),('>JC','Lev21'),('>JC','Lev22'),('>JC','Lev25'),('>JC','Lev26'),
                ('>JC >JC','Lev24'),
                ('GR','Lev19'),
                ('KHN','Lev21'),
                ('MN QRB/ <M/ -NPC','Lev18'),
                ('NPC','Lev23')]

disambig_df = data.disambiguate(disambiguate)

3mp (Lev20) --> 3mp#4
<M (Lev26) --> <M#2
>CH (Lev20) --> >CH#3
>JC (Lev19) --> >JC#3
>JC (Lev21) --> >JC#4
>JC (Lev22) --> >JC#5
>JC (Lev25) --> >JC#6
>JC (Lev26) --> >JC#7
>JC >JC (Lev24) --> >JC >JC#3
GR (Lev19) --> GR#3
KHN (Lev21) --> KHN#2
MN QRB/ <M/ -NPC (Lev18) --> MN QRB/ <M/ -NPC#2
NPC (Lev23) --> NPC#2
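
The suffix logic can be illustrated in isolation. The sketch below is a simplified stand-in for the `while` loop in `disambiguate`; the helper name `next_free_label` and the set `existing` are hypothetical.

```python
def next_free_label(label, existing_labels):
    """Return the first label of the form 'label#n' (n >= 2) not yet in use."""
    nr = 2
    while f'{label}#{nr}' in existing_labels:
        nr += 1
    return f'{label}#{nr}'

# Hypothetical set of labels already present in the dataset
existing = {'3mp', '3mp#2', '3mp#3', 'KHN'}

print(next_free_label('3mp', existing))
print(next_free_label('KHN', existing))
```

Because the numbering skips suffixes that are already taken, a label such as `3mp` with existing variants `#2` and `#3` receives `#4`, as in the output above.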


In [28]:
data.showPredictions(disambig_df)

Predicted collocations


actor,Lev17,Lev18,Lev19,Lev20,Lev21,Lev22,Lev23,Lev24,Lev25,Lev26,Total
2ms,0,1,1,1,1,0,1,0,1,0,6
>DM,0,1,0,0,0,1,0,1,0,0,3
>HRN,1,0,0,0,1,1,0,1,0,0,4
>HRN BN ->HRN,0,0,0,0,0,1,0,1,0,0,2
>HRN BN ->HRN KL BN JFR>L,1,0,0,0,0,1,0,0,0,0,2
>JC >JC#2,1,0,0,0,0,1,0,0,0,0,2
>T== ZKR=/,0,1,0,1,0,0,0,0,0,0,2
>X -2ms,0,1,0,0,0,0,0,0,1,0,2
B <MJT/ ->JC,0,0,1,0,0,0,0,1,0,0,2
BN JFR>L,1,1,0,1,0,1,1,1,0,0,6


No collocations


actor,Lev17,Lev18,Lev19,Lev20,Lev21,Lev22,Lev23,Lev24,Lev25,Lev26,Total
2mp_sfx,0,0,0,0,1,0,0,0,0,0,1
3mp,0,0,0,0,0,0,0,1,0,0,1
3mp#2,0,0,0,1,0,0,0,0,0,0,1
3mp#3,0,0,0,1,0,0,0,0,0,0,1
3mp#4,0,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
XRC=/,0,0,1,0,0,0,0,0,0,0,1
ZR,0,0,0,0,0,1,0,0,0,0,1
ZR< ->JC,0,0,0,1,0,0,0,0,0,0,1
ZR</,0,0,0,0,0,1,0,0,0,0,1


## 2. Collocate

In order to collocate participants across chapters, we need to describe each participant as accurately as possible. Therefore, the table is exported as an Excel file for further inquiry.

In [29]:
table = data.predictCollocations(disambig_df) #The dataframe is cross-tabulated (actor and chapter)
table.insert(0, 'id', [x for x in range(len(table))]) #The table is given the numbers 0 to len(dataframe)

table.head()

actor,id,Lev17,Lev18,Lev19,Lev20,Lev21,Lev22,Lev23,Lev24,Lev25,Lev26
2mp_sfx,0,0,0,0,0,1,0,0,0,0,0
2ms,1,0,1,1,1,1,0,1,0,1,0
3mp,2,0,0,0,0,0,0,0,1,0,0
3mp#2,3,0,0,0,1,0,0,0,0,0,0
3mp#3,4,0,0,0,1,0,0,0,0,0,0


In [30]:
print(f'Number of participants to explore: {len(table)}')

Number of participants to explore: 191


Export:

In [31]:
table.to_excel('collocate_references_May2020.xlsx')

Explore participants

In [110]:
n = 142

#ID of participant

show(disambig_df, n, disambiguate=False)

Participant: KL/ NPC/

[Text-Fabric verse displays omitted: one highlighted example per chapter]

A dictionary of participants to be collocated is made:

In [37]:
collocate = {'>HRN BN ->HRN KL BN JFR>L':['>L >HRN/ W >L BN/ ->HRN W >L KL/ BN/ JFR>L/'],
             '>HRN':['KHN#2', '>JC#4','MN ZR</ >HRN/'],
             '>HRN BN ->HRN': ['L DWR/ ->HRN BN ->HRN','>JC#5'],
             'BN >HRN': ['BN ->HRN', 'KHN', 'MN >X/ -KHN'],
             'BT -2ms': ['BT KHN','BT >JC KHN'],
             '>JC >JC': ['3ms','>JC#3','>JC#6','NPC#2','KL/ NPC/'],
             '>JC': ['>JC >JC#2','>JC >JC#3','NPC'],
             'BN JFR>L': ['BJT JFR>L','<M >RY','<M#2','>ZRX','GR#3','L KL/ JCB[ ->RY','DWR -BN JFR>L',
                         'L DWR/ -BN JFR>L','GR TWCB'],
             'BFR/ BN/ -<M': ['BFR/ BT/ -<M','BN -GR TWCB'],
             '<M': ['SBJB -GR TWCB','>JC >RY','GWJ','>JB -<M','>JB -HM'],
             'GR': ['GR#2','GWR[','L <QR=/ MCPXT/ GR/'],
             '>X -2ms': ['<MJT -2ms','>T >X/ -2ms','>X ->JC','>X -GR TWCB','R<= -2ms','B <M/ -2ms','B <MJT/ ->JC',
                        'B >X/ ->JC','>T >X/ ->JC','>T BN/ <M/ -2ms'],
             'B <M/ -2ms':['MN <M/ ->JC >JC','MN <M/ -NPC','MN QRB/ <M/ ->JC >JC','MN QRB/ <M/ ->JC','MN QRB/ <M/ -NPC',
                          'MN QRB/ <M/ -NPC#2','<M -HW>','<M -KHN', 'MN QRB/ <M/ -KL','MN QRB/ <M/ -CNJM -'],
             '>L MCPXT/ ->JC': ['>L MCPXT/ ->X -2ms'],
             'BT >B -2msBT >M -2ms': ['BT >B ->JC BT >M ->JC'],
             '>M -2ms': ['>M ->JC'],
             '>T >CH/ >JC/': ['>T >CH/ >X/ ->JC','>CH <MJT -2ms','<RWH >CH >X -2ms'],
             'DWDH -2ms': ['>T== DWDH/ ->JC'],
             '>CH >B -2ms': ['>T== >CH/ >B/ ->JC'],
             '>B -2ms': ['>B ->JC'],
             '>CH BN -2ms': ['>T== KLH/ ->JC'],
             '>M/ ->JC W >B/ ->JC': ['>B ->JC >M ->JC', '<RWH/ >B/ -2msW <RWH/ >M/ -2ms','L >B/ -KHN W L >M/ -KHN'],
             'C>R >B -2ms': ['C>R >M -2ms','<RWH/ >XWT/ >B/ -2ms','<RWH/ >XWT/ >M/ -2ms',
                             '<RWH/ >XWT/ >M/ -2msW >XWT/ >B/ -2ms'],
             '>CH': ['>CH#3'],
             'ZR< ->JC': ['ZR</','ZR</ -KHN','MN ZR</ -2ms'],
             '>T PGR/ -<M': ['L NPC/','B KL/ VM>/ NPC/'],
             '>LMNH GRC XLL': ['>CH#4', '>CH/ ZNH[ W XLL/'],
             'G>L': ['G>L ->X -2ms'],
             'MLK=': ['>L H >LJL/','>L H >WB/','>WB JD<NJ','L H MLK=/','JD<NJ','F<JR='],
             'QNH': ['>JC#2','QNH ->X -2ms'],
             'PNH/ ZQN/': ['MN PNH/ FJBH/'],
             'HM': ['>JC#7'],
             'CPXH': ['>MH -2ms']
            }
collocate_df = data.collocate(disambiguate, collocate)

3mp (Lev20) --> 3mp#4
<M (Lev26) --> <M#2
>CH (Lev20) --> >CH#3
>JC (Lev19) --> >JC#3
>JC (Lev21) --> >JC#4
>JC (Lev22) --> >JC#5
>JC (Lev25) --> >JC#6
>JC (Lev26) --> >JC#7
>JC >JC (Lev24) --> >JC >JC#3
GR (Lev19) --> GR#3
KHN (Lev21) --> KHN#2
MN QRB/ <M/ -NPC (Lev18) --> MN QRB/ <M/ -NPC#2
NPC (Lev23) --> NPC#2
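
The renaming step can be sketched without pandas by inverting the dictionary so that each synonym points to its main label. This is an illustration only, not the method used in `collocate`; `collocate_toy`, `synonym_to_main` and `labels` are hypothetical names, and the excerpt contains just a few entries in the same shape as the dictionary above.

```python
# Hypothetical excerpt in the same shape as the collocate dictionary:
# main label -> list of labels to be merged into it
collocate_toy = {
    '>HRN': ['KHN#2', '>JC#4'],
    'GR': ['GR#2'],
}

# Invert the dictionary so every synonym points to its main label
synonym_to_main = {}
for main, synonyms in collocate_toy.items():
    for syn in synonyms:
        if syn in synonym_to_main:
            raise ValueError(f'{syn} is assigned to more than one main label')
        synonym_to_main[syn] = main

# Labels without an entry in the mapping are kept unchanged
labels = ['KHN#2', 'GR', '>JC#4', 'JHWH', 'GR#2']
renamed = [synonym_to_main.get(label, label) for label in labels]
print(renamed)
```

Inverting the dictionary has the side benefit of exposing a synonym that was accidentally listed under two main labels. A single pass like this one does not resolve chains in which a main label is itself listed as a synonym of another label.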


In [38]:
data.showPredictions(collocate_df)

Predicted collocations


actor,Lev17,Lev18,Lev19,Lev20,Lev21,Lev22,Lev23,Lev24,Lev25,Lev26,Total
2ms,0,1,1,1,1,0,1,0,1,0,6
<M,0,1,0,2,0,0,0,0,1,2,6
>B -2ms,0,1,0,1,0,0,0,0,0,0,2
>CH,0,1,0,1,0,0,0,0,0,0,2
>CH >B -2ms,0,1,0,1,0,0,0,0,0,0,2
>CH BN -2ms,0,1,0,1,0,0,0,0,0,0,2
>DM,0,1,0,0,0,1,0,1,0,0,3
>HRN,1,0,0,0,3,2,0,1,0,0,7
>HRN BN ->HRN,0,0,0,0,0,3,0,1,0,0,4
>HRN BN ->HRN KL BN JFR>L,1,0,0,0,1,1,0,0,0,0,3


No collocations


actor,Lev17,Lev18,Lev19,Lev20,Lev21,Lev22,Lev23,Lev24,Lev25,Lev26,Total
2mp_sfx,0,0,0,0,1,0,0,0,0,0,1
3mp,0,0,0,0,0,0,0,1,0,0,1
3mp#2,0,0,0,1,0,0,0,0,0,0,1
3mp#3,0,0,0,1,0,0,0,0,0,0,1
3mp#4,0,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
W,0,0,0,1,0,0,0,0,0,0,1
W >T KL/,0,0,0,1,0,0,0,0,0,0,1
W L <BD/ -2msW L >MH/ -2msW L FKJR/ -2msW L TWCB/ -2ms,0,0,0,0,0,0,0,0,1,0,1
XRC=/,0,0,1,0,0,0,0,0,0,0,1


## 3. Production and validation

#### 3a. Production of participants and references

In [39]:
result_dict = data.produceDict(collocate_df)
result_dict

defaultdict(list,
            {'JHWH': ['943175 943176 943178 943187 943188 943189 943207 1317274 943228 943234 1317293 943246 943279 943295 63229 943302 943310 943311 943322 943360',
              '943401 943402 943404 943410 943411 943412 943422 943423 63436 63440 943439 943440 943441 63451 63454 943452 943453 943460 943461 943574 943575 943606 943607 943613 63724 63727 63804 943676 943677 943678 943573',
              '943680 943681 943683 943692 943693 943694 943695 63853 943702 943703 943704 943713 943714 943715 943720 943736 1317513 943775 943776 943777 63983 943796 943797 943820 943821 943843 943844 943867 943868 64086 943915 1317553 943964 943965 943966 943993 943994 64273 64276 944012 944013 944023 944024 944025 944034 944035 944061 944062 944063 944073 944074 944075 944077 64384 64388 944086 944087 1317527 943819 944033 943791 943795',
              '944089 944090 944092 944112 944113 64439 944117 64461 64467 944142 944143 64497 944147 944163 64541 944167 944176 944177 944178

#### 3b. Manual validation of all corrections made

In [40]:
data.validateDict(result_dict, disambiguate, collocate)

1: >L >HRN/ W >L BN/ ->HRN W >L KL/ BN/ JFR>L/ (Lev21) --> >HRN BN ->HRN KL BN JFR>L

2: MN ZR</ >HRN/ (Lev22) --> >HRN

3: >JC (Lev21) --> >HRN

4: KHN (Lev21) --> >HRN

5: BJT JFR>L (Lev17) --> BN JFR>L

6: <M >RY (Lev20) --> BN JFR>L

7: DWR -BN JFR>L (Lev23) --> BN JFR>L

8: >ZRX (Lev23) --> BN JFR>L

9: L DWR/ -BN JFR>L (Lev24) --> BN JFR>L

10: GR TWCB (Lev25) --> BN JFR>L

11: L KL/ JCB[ ->RY (Lev25) --> BN JFR>L

12: <M (Lev26) --> BN JFR>L

13: GR (Lev19) --> BN JFR>L

14: KL/ NPC/ (Lev17) --> >JC >JC

15: 3ms (Lev19) --> >JC >JC

16: >JC (Lev19) --> >JC >JC

17: >JC (Lev25) --> >JC >JC

18: NPC (Lev23) --> >JC >JC

19: KHN (Lev17) --> BN >HRN

20: KHN (Lev19) --> BN >HRN

21: MN >X/ -KHN (Lev21) --> BN >HRN

22: BN ->HRN (Lev22) --> BN >HRN

23: KHN (Lev23) --> BN >HRN

24: F<JR= (Lev17) --> MLK=

25: L H MLK=/ (Lev18) --> MLK=

26: JD<NJ (Lev19) --> MLK=

27: >L H >LJL/ (Lev19) --> MLK=

28: >L H >WB/ (Lev19) --> MLK=

29: >WB JD<NJ (Lev20) --> MLK=

30: >JC >JC#2 (Lev17) --
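
The idea behind the validation report can be sketched as follows: every group of references is looked up under its original label and under its final label, and each difference is printed for manual review. The dictionaries `original` and `relabelled` are hypothetical toy data keyed by the reference string; this is a simplified illustration, not the `validateDict` method itself.

```python
# Hypothetical toy data: reference string -> label, before and after relabelling
original = {'10 11': 'KHN', '20': 'GR', '30 31': 'JHWH'}
relabelled = {'10 11': '>HRN', '20': 'BN JFR>L', '30 31': 'JHWH'}

# Collect every reference group whose label has changed
changes = [(refs, old, relabelled[refs])
           for refs, old in original.items()
           if relabelled[refs] != old]

# Numbered report of all changes, for manual inspection
for n, (refs, old, new) in enumerate(changes, start=1):
    print(f'{n}: {old} --> {new}')
```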

## 4. Remove hypernyms

At this stage, some hypernyms need to be removed. For instance, "Aaron and his sons" refers to a group of participants ("Aaron" and "Aaron's sons"), and the picture would be blurred if these participants appeared both as a group and as individuals. Other hypernyms are less clear-cut. For instance, "the sons of Israel" logically includes a vast number of individuals, but it is uncertain exactly who belongs to this group. Therefore, only hypernyms in the form of a list (e.g. "Aaron and his sons") are removed, including demonstratives referring to a list of participants (e.g. "the two of them"). The references to these groups are distributed to the individuals forming the list (see below), so no information is lost.

First, the references in result_dict are contracted to single lists:

In [41]:
for actor in result_dict:
    temp_list = []
    for l in result_dict[actor]:
        
        temp_list += [int(ref) for ref in l.split(' ')]
        
    result_dict[actor] = temp_list
    
#result_dict
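
A variant of this flattening step is sketched below on hypothetical toy data (`toy_result`). It builds a new dictionary instead of overwriting the input, so the cell can be rerun without failing on values that have already been converted to integers.

```python
from collections import defaultdict

# Hypothetical excerpt: each actor maps to one space-separated string per chapter
toy_result = defaultdict(list, {
    'JHWH': ['943175 943176', '943401 63436'],
    'GR': ['944089'],
})

# str.split() without arguments also tolerates repeated or trailing spaces
flat = {
    actor: [int(ref) for chunk in chunks for ref in chunk.split()]
    for actor, chunks in toy_result.items()
}
print(flat)
```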

The following script identifies all possible hypernyms. However, synonyms may also be included, so the resulting list needs manual inspection:

In [42]:
coref = set()

for actor, refs in result_dict.items(): #Looping through each reference to each actor
    for r in refs:
        
        for actor2, refs2 in result_dict.items():
            if r in refs2 and actor != actor2: #Checking whether the reference occurs with another actor
                coref.add(actor2)
                
coref

{'2ms',
 '3mp#2',
 '3mp#3',
 '3mp#4',
 '3unknownp',
 '<BD -2ms>MH -2ms',
 '<RWH/ -<RWH -2ms',
 '<RWH/ >CH/ W BT/ ->CH',
 '>B -2ms',
 '>CH',
 '>CH >B -2ms',
 '>CH >M ->CH',
 '>CH BHMH',
 '>CH BN -2ms',
 '>CH#2',
 '>HRN',
 '>HRN BN ->HRN',
 '>HRN BN ->HRN KL BN JFR>L',
 '>JC',
 '>JC >JC',
 '>L KL/ C>R/ BFR/ ->JC >JC',
 '>M -2ms',
 '>M/ ->JC W >B/ ->JC',
 '>T >CH/ >JC/',
 '>T BT/ BN/ ->CH W >T BT/ BT/ ->CH',
 '>T== ZKR=/',
 '>X -2ms',
 'B H >JC/ H HW> W B MCPXT/ ->JC',
 'BJN/ -JHWH W BJN/ BN/ JFR>L/',
 'BJT JFR>L GR',
 'BN ->X -2ms',
 'BN >CH',
 'BN >HRN',
 'BN JFR>L',
 'BN JFR>LJ >JC JFR>LJ',
 'BT >B -2msBT >M -2ms',
 'C>R >B -2ms',
 'C>RH',
 'CNJM -',
 'CNJM -#2',
 'CNJM -#3',
 'CNJM -#4',
 'CPXH',
 'DWD ->X -2ms',
 'DWD ->X -2msBN DWD ->X -2ms',
 'DWDH -2ms',
 'GR',
 'HW>BN ->X -2ms',
 'JHWH',
 'KL',
 'L H <NJ/ W L H GR/',
 'MWT[',
 'N>P N>P',
 'W',
 'W >T KL/',
 'W L <BD/ -2msW L >MH/ -2msW L FKJR/ -2msW L TWCB/ -2ms'}

In [43]:
print(f'Number of participants to inspect: {len(coref)}')

Number of participants to inspect: 56
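
The same set of candidates can be found with an inverted index from references to actors, which avoids comparing every actor with every other actor. The sketch below uses hypothetical toy data (`toy_refs`) and is an alternative formulation, not the script used above.

```python
from collections import defaultdict

# Hypothetical toy data: actor -> list of reference node numbers
toy_refs = {
    'Aaron': [1, 2, 3],
    'Aaron and sons': [3, 4],
    'sons': [4, 5],
    'JHWH': [9],
}

# Inverted index: reference -> set of actors using it
actors_by_ref = defaultdict(set)
for actor, refs in toy_refs.items():
    for r in refs:
        actors_by_ref[r].add(actor)

# Every actor that shares at least one reference with another actor
shared = set()
for actors in actors_by_ref.values():
    if len(actors) > 1:
        shared |= actors

print(sorted(shared))
```

As in the script above, the result contains both the group and its members, so the list still has to be inspected manually to decide which labels are hypernyms.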


Manual inspection:

In [17]:
def inspect(coref=coref, result_dict=result_dict, n=0):
    coref = list(coref)
    refs = result_dict[coref[n]]

    print(f'{n}: Coreferent: {coref[n]}')
    
    for actor, ref in result_dict.items():
        for r in ref:
            if r == refs[0] and actor != coref[n]:
                print(f'\nOther referents to this reference: {actor}, {r}')
    A.pretty(L.u(refs[0], 'verse')[0], highlights={refs[0]:'gold'})

In [32]:
n=11

In [58]:
inspect(n=n)
n+=1

35: Coreferent: BT >B -2msBT >M -2ms

In [44]:
hypernyms = ['CNJM -',
             '>L KL/ C>R/ BFR/ ->JC >JC',
             '>HRN BN ->HRN KL BN JFR>L',
             '3mp#2',
             'W L <BD/ -2msW L >MH/ -2msW L FKJR/ -2msW L TWCB/ -2ms',
             'BN JFR>LJ >JC JFR>LJ',
             'C>RH',
             '>JC',
             '>HRN BN ->HRN',
             'W >T KL/',
             'BJN/ -JHWH W BJN/ BN/ JFR>L/',
             'N>P N>P',
             'L H <NJ/ W L H GR/',
             'W',
             '>M/ ->JC W >B/ ->JC',
             'MWT[',
             '>CH BHMH',
             'DWD ->X -2msBN DWD ->X -2ms',
             '3unknownp',
             'CNJM -#2',
             'CNJM -#3',
             '3mp#3',             
             'HW>BN ->X -2ms',
             'B H >JC/ H HW> W B MCPXT/ ->JC',
             '<BD -2ms>MH -2ms',
             'CNJM -#4',
             'BJT JFR>L GR',
             '3mp#4',
            ]

This procedure reduces the number of participants drastically, perhaps too drastically. A participant such as <NJ "poor" is removed, because it only occurs in a compound phrase with GR "sojourner". One could say that the participant is so infrequent that it is irrelevant for the analysis of more frequent participants.

In [45]:
print(f'Number of participants removed: {len(hypernyms)}')

Number of participants removed: 28


Before removal, however, the references to these hypernyms need to be distributed to their hyponyms so that no references are lost. First, we create a dictionary with all hypernyms (keys) and their respective hyponyms (values):

In [46]:
hyponyms = collections.defaultdict(set)

for actor1 in hypernyms:

    refs = result_dict[actor1]

    for actor2, ref in result_dict.items():
        for r in ref:
            if r == refs[0] and actor2 != actor1:
                hyponyms[actor1].add(actor2)
                
#hyponyms

In [47]:
hyponyms

defaultdict(set,
            {'CNJM -': {'>CH >B -2ms', '>JC'},
             '>L KL/ C>R/ BFR/ ->JC >JC': {'<RWH/ -<RWH -2ms',
              '>B -2ms',
              '>CH >B -2ms',
              '>CH BN -2ms',
              '>M -2ms',
              '>M/ ->JC W >B/ ->JC',
              '>T >CH/ >JC/',
              'BT >B -2msBT >M -2ms',
              'C>R >B -2ms',
              'DWDH -2ms'},
             '>HRN BN ->HRN KL BN JFR>L': {'>HRN', 'BN JFR>L'},
             '3mp#2': {'>JC', 'DWDH -2ms'},
             'W L <BD/ -2msW L >MH/ -2msW L FKJR/ -2msW L TWCB/ -2ms': {'2ms',
              '<BD -2ms>MH -2ms',
              'CPXH',
              'GR'},
             'BN JFR>LJ >JC JFR>LJ': {'BN >CH'},
             'C>RH': {'<RWH/ >CH/ W BT/ ->CH',
              '>CH',
              '>T BT/ BN/ ->CH W >T BT/ BT/ ->CH'},
             '>JC': {'>JC >JC', 'GR'},
             '>HRN BN ->HRN': {'>HRN', 'BN >HRN'},
             'W >T KL/': {'>JC', 'KL'},
             'BJN/ -JHWH W BJN/ BN/ JFR>

Specify additional part-whole relationships:

In [48]:
hyponyms['>HRN BN ->HRN KL BN JFR>L'].add('BN >HRN')

Now, all references can be transferred from the hypernym to each of the hyponyms:

In [49]:
for actor1 in hyponyms:
    hyper_refs = result_dict[actor1] #Getting hypernym references
    
    for actor2 in hyponyms[actor1]: #Looping over each of the hyponyms
        result_dict[actor2] += hyper_refs

Some references may occur more than once for each actor in result_dict, so we remove duplicates before export:

In [50]:
final_dict = collections.defaultdict()

for actor in result_dict:
    final_dict[actor] = list(set(result_dict[actor]))
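
The transfer and clean-up steps can be illustrated together on hypothetical toy data (`toy_refs`, `toy_hyponyms`). This is a sketch under the assumption that each hypernym's own references should be copied unchanged to all of its hyponyms.

```python
# Hypothetical toy data: references per actor, and hypernym -> hyponyms
toy_refs = {
    'Aaron and sons': [3, 4],
    'Aaron': [1, 3],
    'sons': [4, 5],
}
toy_hyponyms = {'Aaron and sons': {'Aaron', 'sons'}}

# Copy the hypernym's references to each hyponym
merged = {actor: list(refs) for actor, refs in toy_refs.items()}
for hypernym, members in toy_hyponyms.items():
    for member in members:
        merged[member].extend(toy_refs[hypernym])

# Drop duplicates; sorting keeps the output reproducible
final = {actor: sorted(set(refs)) for actor, refs in merged.items()}
print(final)
```

Reading the hypernym references from the untouched input means that the result of this sketch does not depend on the order in which the hypernyms are processed.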

In [51]:
final_dict

defaultdict(None,
            {'JHWH': [944128,
              946176,
              946179,
              946182,
              946184,
              944142,
              944143,
              944147,
              944163,
              944167,
              946219,
              944176,
              944177,
              944178,
              946231,
              944185,
              944186,
              944187,
              946374,
              946377,
              946378,
              946379,
              946397,
              946398,
              946399,
              946405,
              946406,
              63724,
              63727,
              946417,
              944371,
              944372,
              944382,
              944383,
              944389,
              944392,
              944398,
              944399,
              946446,
              944404,
              944405,
              944406,
              946455,
              944408,
        

## 5. Export

The resulting dictionary can now be exported:

In [52]:
file = f'{PATH}participants_FINAL.csv'

with open(file, 'w', newline='') as f:
    writer = csv.writer(f, lineterminator='\n') #Quotes labels if they contain commas
    writer.writerow(['participant', 'refs'])
    for actor in final_dict:
        if actor not in hypernyms: #Actors listed in the hypernyms list are ignored
            references = ' '.join(str(e) for e in final_dict[actor])
            writer.writerow([actor, references])
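
To check that an export of this shape can be read back, the following sketch writes hypothetical toy data (`toy_final`, `toy_hypernyms`) to an in-memory buffer with the `csv` module and restores the integer lists. The column names follow the header written above; everything else is an assumption for illustration.

```python
import csv
import io

# Hypothetical final dictionary and hypernym list
toy_final = {'JHWH': [943175, 943176], 'Aaron and sons': [3, 4], 'GR': [944089]}
toy_hypernyms = ['Aaron and sons']

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['participant', 'refs'])
for actor, refs in toy_final.items():
    if actor not in toy_hypernyms:
        writer.writerow([actor, ' '.join(str(r) for r in refs)])

# Read back and restore the integer lists
buffer.seek(0)
restored = {row['participant']: [int(r) for r in row['refs'].split()]
            for row in csv.DictReader(buffer)}
print(restored)
```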