# Exploring the dataset of participant references in Lev 17-26

*Christian Højgaard Jensen, chj@dbi.edu*

This notebook contains an exploration of the mapped dataset created in [1_Preparing dataset.ipynb](https://github.com/ch-jensen/Semantic-mapping-of-participants/blob/master/1_Preparing%20dataset.ipynb). The exploration involves several tests of the internal consistency of the dataset and concludes with a list of rows to be manually checked and, if necessary, corrected.

## 1. Firing up

Import packages, including [text-fabric](https://dans-labs.github.io/text-fabric/) and the [ETCBC database](https://github.com/ETCBC/bhsa):

In [None]:
import os, sys, collections
import pprint as pp
import pandas as pd
import matplotlib.pyplot as plt
from colour import Color

In [None]:
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa

BHSA = f'etcbc/bhsa/tf/c'
TF = Fabric(modules = BHSA)

In [None]:
api = TF.load('''
    typ function 
    sp lex
''')
api.loadLog()
api.makeAvailableIn(globals())

In [None]:
B = Bhsa(api, 'search', version="c")

Importing the dataset to be explored:

In [None]:
file = 'Datasets/Lev17toLev26.mapped.csv'

data = pd.read_csv(file).fillna('...') #empty cells are replaced with '...'
data[:10]

In [None]:
len(data)

## 2. Preparations

The dataset needs to be modified to speed up the processing of data. The dataset represents the layering of participant references in the Hebrew text, which means that some rows are compounds of other rows (called constituents). As shown below, the reference in row 5 spans a long range of slots, including slots that belong to other references - or, rather, constituents of that reference - in rows 6-11.

We want to test the consistency between compounds and their reference, and for convenience we want to add another column to the dataset comprising the row number of any compound and its respective constituents.
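
As a minimal sketch of this idea, consider a few invented rows (the slot numbers and the helper name `find_constituents` are hypothetical and not taken from the dataset). A row is grouped with a compound when the two share at least one slot:

```python
# Toy rows: each string mimics the 'slots' column (space-separated slot numbers).
toy_slots = [
    "10 11 12 13",  # row 0: compound spanning four slots
    "11",           # row 1: constituent inside row 0
    "13",           # row 2: constituent inside row 0
    "20 21",        # row 3: unrelated reference
]

def find_constituents(rows, start, window=30):
    """Return the indices of rows (from `start` onwards) that share a slot with rows[start]."""
    base = set(rows[start].split())
    stop = min(start + window, len(rows))  # never look past the end of the data
    return [i for i in range(start, stop) if base & set(rows[i].split())]

print(find_constituents(toy_slots, 0))
print(find_constituents(toy_slots, 3))
```

In this toy example row 0 is grouped with rows 1 and 2, while row 3 stands alone.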

In [None]:
data.iloc[5:12]

Two functions are defined to simplify the processing of the data in the dataset:

In [None]:
def wordList(row, dataframe):
    '''
    Input: row and dataframe
    Output: list of words in the column 'slots'
    '''
    slots = dataframe.iloc[row].slots
    word_list = slots.split()
    int_word_list = [int(w) for w in word_list]
    
    return int_word_list

def compoundReferences(row):
    '''
    Input: Row number
    Output: List of rows for compound and embedded references    
    '''
    max_iteration = 30 #A max iteration is set to 30, as no compound reference exceeds 30 rows.
    
    word_list = wordList(row, data)
    
    row_string = ''
    
    #The iteration has two restrictions: it cannot go beyond the max iteration (row + 30 rows) or the length of the dataset
    line=row
    while line < row+max_iteration and line < len(data): #Looping over the next 30 lines relative to the actual row.
        new_word_list = wordList(line, data)
        intersection = set(word_list).intersection(new_word_list) #Finding the intersection between the two rows.
        if intersection:        
            row_string += f'{line} ' #row number added to row_string
        line+=1
    
    return row_string.rstrip(' ')

#compoundReferences(5)

A string of row numbers of the compound and its constituents (if any) is added to a compound_list comprising all row numbers of the dataset. The compound_list is then added to the dataset as an additional column:

In [None]:
compound_list = []
for r in data.iterrows():
    row = r[0]
    compound_list.append(compoundReferences(row))
        
data['compound'] = compound_list

In [None]:
data[:6]

Now we can look up how many compound references the dataset contains by finding all rows with a compound longer than one element:

In [None]:
compound_length_list = []
for r in data.iterrows():
    row = r[0]
    
    compound_length = len(data['compound'][row].split())
    if compound_length > 1:
        compound_length_list.append(compound_length)
        
len(compound_length_list)

We can also find the number of compounds that are not themselves constituents of a compound. Here we need to check whether the words of the compound intersect with a range of words further up in the data set. We select a max iteration of 30, which means we will check for intersection up to 30 lines further up the data set:

In [None]:
def motherReferences(dataframe):
    '''
    Input: Dataframe
    Output: A list of all mother references
    '''
    max_iteration = 30
    mother_list = []
    
    for r in dataframe.iterrows():
        row = r[0]
        mother = True
        
        compound_length = len(data['compound'][row].split())
        if compound_length > 1:                 #If length is more than 1, it means that the reference of this
                                                #row is a compound of other references further down in the data set.
        
            #The iteration has two restrictions: It may not surpass the max iteration (row - 30) or the start of the dataset.
            line = row-1
            while line > row-max_iteration and line >= 0:
                word_list_row = wordList(row, dataframe)
                word_list_line = wordList(line, dataframe)
                    
                intersection = set(word_list_row).intersection(word_list_line)
                
                if len(intersection) > 0: #If the two rows intersect in terms of shared slots, the compound is itself a constituent.
                    mother = False
                line -= 1
            
            if mother == True:
                mother_list.append(row)
    return mother_list
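
The logic can be illustrated on invented rows. The sketch below is not the function above: the helper name `top_level_compounds` and the toy slots are hypothetical, and the slots are parsed only once instead of inside the loop. A row counts as a top-level compound when it overlaps with a later row but with no earlier row inside the window:

```python
toy_slots = [
    "1 2 3 4",  # row 0: top-level compound
    "2 3",      # row 1: compound, but embedded in row 0
    "3",        # row 2: simple constituent
    "8 9",      # row 3: top-level compound
    "9",        # row 4: simple constituent
]

def top_level_compounds(rows, window=30):
    """Return indices of compounds that are not constituents of an earlier row."""
    slot_sets = [set(r.split()) for r in rows]  # parse every row once
    result = []
    for i, slots in enumerate(slot_sets):
        later = slot_sets[i + 1 : i + window]                # rows below, within the window
        earlier = slot_sets[max(0, i - window + 1) : i]      # rows above, within the window
        has_constituent = any(slots & other for other in later)
        is_embedded = any(slots & other for other in earlier)
        if has_constituent and not is_embedded:
            result.append(i)
    return result

print(top_level_compounds(toy_slots))
```

Only rows 0 and 3 qualify here: row 1 has a constituent of its own but is itself embedded in row 0.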

Now we can calculate the number of compound references (excluding all compound references that are themselves constituents of other compound references):

In [None]:
len(motherReferences(data))

Finally, we want to export the updated dataframe to a CSV file:

In [None]:
file = 'Datasets/Lev17toLev26_mapped_updated.csv'
data.to_csv(file, index=False)

## 3. Quality check of internal consistency

The compound references form a major part of the dataset. It appears that the compounds generally include the references of their respective constituents in their own reference. However, this is not always the case. It is therefore worthwhile to investigate whether we can identify a pattern in how references are distributed among compounds and constituents, and whether we can identify inconsistencies in this respect.

### 3.1. Checking the whole data set for internal consistency

#### 3.1.1. Exploring compounds and their respective constituents

We check whether compound referents correspond to the sum of their respective sub referents.

First, we import the updated dataframe:

In [None]:
file = 'Datasets/Lev17toLev26_mapped_updated.csv'
data = pd.read_csv(file)

In the example below we see that the clause atom 528165 contains a compound reference spanning word 63013 to 63025. In the rows below the sub elements of this compound reference are listed. We want to check if the compound references correspond to their respective embedded referents.

In [None]:
data[data.clause_atom == 528165]

We want to check the internal consistency between compounds and their constituents. We do this by running through all compounds (including those that are themselves constituents) and checking whether the actor reference of each constituent is included in the actor reference of the compound. If any constituent is not included, the compound is treated as a mismatch. Only compounds whose actor reference includes the actor references of all their constituents are treated as correct:

In [None]:
error_list = []
n=0

for row in data.iterrows(): #Looping over all rows of the dataset
    row = row[0]
    compound = data['compound'][row].split() #A list of rows of the compound and its constituents
    compound_actor = str(data.iloc[row].actor) #The actor of this row
    
    if len(compound) > 1: #We exclude all non-compound rows (length 1 means no constituent)
        error = False

        for c in compound[1:]: #We skip the first element of the compound_list because that is the compound itself
            constituent_actor = str(data.iloc[int(c)].actor)

            if constituent_actor not in compound_actor:
                error = True

        if error == True:
            error_list.append(row)
        else:
            n+=1
        
print(f'Number of mismatches: {len(error_list)}')
print(f'Number of correct cases: {n}')
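
Note that `constituent_actor not in compound_actor` is a substring test on two strings. Whether this ever matters for the actual actor labels has not been checked here. The sketch below uses labels invented purely for illustration and a hypothetical helper, `contains_tokens`, to show how a token-based comparison would differ from the substring test:

```python
def contains_tokens(compound_actor, constituent_actor):
    """True if the constituent's tokens occur as a contiguous run among the compound's tokens."""
    comp = compound_actor.split()
    cons = constituent_actor.split()
    if not cons:
        return False
    # Slide a window of len(cons) tokens over the compound's tokens.
    return any(comp[i:i + len(cons)] == cons for i in range(len(comp) - len(cons) + 1))

# Invented labels: 'BN' is a substring of 'BNH' but not a token of 'BNH BJT'.
print("BN" in "BNH BJT")                       # substring test
print(contains_tokens("BNH BJT", "BN"))        # token test
print(contains_tokens("ZBX BN MCH", "BN MCH")) # multi-token constituent
```

The substring test accepts the first pair while the token test rejects it. If such near-matches exist in the data, they would be counted as correct cases by the check above.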

In [None]:
#error_list

Let's inspect the lengths of the compound references in the error_list. We do this by creating a list of the length of each compound and count the frequencies of each length:

In [None]:
count_compound_length = collections.Counter([len(data['compound'][case].split()) for case in error_list])

The lengths and their respective frequencies are plotted:

In [None]:
plt.figure(figsize=(8, 6)) #the figure size has to be set before plotting to affect this figure
plot = plt.bar(count_compound_length.keys(), count_compound_length.values())

#x-axis ticks cover every integer from the smallest to the largest compound length
plt.xticks(range(min(count_compound_length), max(count_compound_length)+1), fontsize=16)
plt.yticks(fontsize=16)
plt.xlabel('compound length', fontsize=16)
plt.show()

Compounds with a length of two dominate our data. A length of two means that the compound contains one constituent besides itself. Compounds with many constituents (greater length) will sometimes contain other compounds with a smaller set of constituents. Therefore, if we address the compounds with a length of two, we will also be addressing some of the issues pertaining to the longer compounds. In other words, if we can exclude short compounds, we may by implication also be able to exclude longer compounds, because the inconsistency observed in a longer compound may be caused by an inconsistency in one of its embedded compounds.

We therefore make a sublist of all compounds of the length two:

In [None]:
compound_length_2 = [case for case in error_list if len(data['compound'][case].split()) == 2]
#compound_length_2

We want to inspect the constituents of the compounds with a length of two in terms of reference type (suffix or lexical) and function. We therefore create a dictionary in which we store this information for each constituent:

In [None]:
ref_func_dict = {}

for case in compound_length_2:
    
    #The row number of the constituent is found by getting the compound list and taking the second row:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    constituent_reference = data.iloc[row_constituent].reference
    if constituent_reference != 'sfx':
        constituent_reference = 'lexeme'
    constituent_function = data.iloc[row_constituent].func
    ref_func_dict[row_constituent] = [constituent_reference, constituent_function]
    
#ref_func_dict

The dictionary is transformed into a dataframe in which each constituent is a column. Reference type (row 0) and function (row 1) are then cross-tabulated to obtain the frequency of each combination:

In [None]:
table = pd.DataFrame.from_dict(ref_func_dict)
ref_func = pd.crosstab(index=table.loc[0],columns=table.loc[1])
ref_func
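
For readers who want to see what the cross-tabulation counts, the same tally can be sketched with the standard library alone. The dictionary below is invented and only mimics the shape of `ref_func_dict` (constituent row mapped to reference type and function):

```python
from collections import Counter

# Invented stand-in for ref_func_dict: {constituent row: [reference type, function]}
toy_ref_func = {
    6: ["sfx", "Subj"],
    9: ["sfx", "-gentf"],
    14: ["lexeme", "-gentf"],
    20: ["sfx", "-gentf"],
}

# Each (reference type, function) pair is one cell of the cross-tabulation.
cell_counts = Counter((ref, func) for ref, func in toy_ref_func.values())

for (ref, func), count in sorted(cell_counts.items()):
    print(ref, func, count)
```

Combinations that never occur simply have a count of zero, which corresponds to a zero cell in the cross-tabulated dataframe.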

In [None]:
plot = ref_func.plot(kind='bar', stacked=True)
plot.legend(fontsize = 16)
plt.xlabel('reference type and function', fontsize=16)
plt.xticks(rotation='horizontal', fontsize=16)
plt.ylabel('frequency', fontsize=16)
plt.yticks(fontsize=16)
plt.rcParams["figure.figsize"] = [8,6]
plt.show()

It becomes clear that there is a difference between the reference types (suffix or lexeme) in terms of which functions they have. Lexical references never function as objects.

#### 3.1.2. Inspecting compounds with suffixes

As for object suffixes, it makes sense that the suffix reference is not included in the compound reference, since the compound (probably a verb) and the suffix usually refer to two different referents - unless, of course, the object refers to the same referent as the verb (a reflexive/reciprocal use). In the example below, the subject ("I" = the Lord) has given "it" (= the blood). The suffix clearly refers to another referent.

In [None]:
B.pretty(528202, highlights = {688474:'gold'})

Let's take a close look at the object suffixes:

In [None]:
for case in compound_length_2:
    #The row number of the constituent is found by getting the compound list and taking the second row:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference == 'sfx' and data.iloc[row_constituent].func == 'Obj1':
        display(data.iloc[case:row_constituent+1])

Generally, the object suffixes refer to actors distinct from the actors of the compound. There is no indication that the compound inadvertently includes parts of the suffix reference. The only apparent exception is in row 65, in which the suffix referent is 'ZBX BN JFR>L' while the compound refers to 'BN JFR>L'. A closer look at the data set, however, reveals that the suffix refers to a distinct actor already referred to in row 58.

In short, compounds with object suffixes reveal a consistent pattern and we can exclude these from our list of inconsistencies. We add the excluded suffixes to a separate list that we can use for cross-checking:

In [None]:
exclude_cases = []
for case in compound_length_2:
    #The row number of the constituent is found by getting the compound list and taking the second element:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference == 'sfx' and data.iloc[row_constituent].func == 'Obj1' and case not in exclude_cases:
        exclude_cases.append(case)
        
print(f'Number of excluded cases: {len(exclude_cases)}')

In [None]:
#exclude_cases

The remaining suffixes all have the function '-gentf'. Let's take a look at this group:

In [None]:
for case in compound_length_2:
    #The row number of the constituent is found by getting the compound list and taking the second row:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference == 'sfx' and data.iloc[row_constituent].func == '-gentf':
        display(data.iloc[case:row_constituent+1])

In most of these cases the compound reference is empty, and we can exclude those cases.

In [None]:
n=0
for case in compound_length_2:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference == 'sfx' and data.iloc[case].actor == '...' and case not in exclude_cases:
        exclude_cases.append(case)
        n+=1
        
print(f'Excluded cases: {(n)}')
print(f'Total number of excluded cases: {len(exclude_cases)}')

Let's look at the remaining suffixes:

In [None]:
n=0
for case in compound_length_2:
    #The row number of the constituent is found by getting the compound list and taking the second row:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference == 'sfx' and data.iloc[row_constituent].func == '-gentf':
        if data.iloc[case].actor != '...':
            display(data.iloc[case:row_constituent+1])
            n+=1
print(n)

The picture becomes more blurred when looking at these suffixes. The first case, row 262, has a suffix with the reference 'BN JSR>L' while its compound has '>LHJM BN'. Most likely, the compound was intended to be '>LHJM BN JSR>L' but the final word is missing, as the longer suffix referent suggests. In other words, in this case the compound was probably intended to include the suffix referent.

Another example points in the opposite direction. Row 415 has a compound referring to '3sf"SHE"' while its suffix refers to '2sm"YOUSgmas"'. The word is "your bride", which clearly refers to two different referents. Moreover, the referent '3sf"SHE"' occurs elsewhere in the chapter and clearly has a role of its own. This contrasts with the previous example, in which '>LHJM' does not have a role of its own in its chapter.

Therefore, the rule seems to be: If a referent has a role on its own in a chapter, it will also be distinguished in compound references. If a referent never occurs as an independent referent, the compound will always include the suffix - or at least, that would be the intention.
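
A minimal sketch of this rule on invented rows may help. The actor labels and the helper name `classify_actors` are hypothetical, and the sketch leaves out any filtering by reference type or function. An actor counts as dependent in a row when that row has at least one constituent:

```python
from collections import defaultdict

# Invented rows; 'compound' mimics the column added in section 2.
toy_rows = [
    {"chapter": 19, "actor": "BRIDE", "compound": "0 1"},  # carries a suffix
    {"chapter": 19, "actor": "YOU", "compound": "1"},
    {"chapter": 19, "actor": "BRIDE", "compound": "2"},    # occurs on its own
    {"chapter": 19, "actor": "GOD", "compound": "3 4"},    # carries a suffix
    {"chapter": 19, "actor": "YOU", "compound": "4"},
]

def classify_actors(rows):
    """Map (chapter, actor) to the set of statuses observed for that actor."""
    statuses = defaultdict(set)
    for r in rows:
        has_constituent = len(r["compound"].split()) > 1
        statuses[(r["chapter"], r["actor"])].add(
            "dependent" if has_constituent else "independent"
        )
    return dict(statuses)

statuses = classify_actors(toy_rows)
# Actors that never occur without a constituent in their chapter.
only_dependent = sorted(k for k, v in statuses.items() if v == {"dependent"})
print(only_dependent)
```

In the toy data only 'GOD' never occurs on its own, so under the rule its compound reference would be expected to include the suffix referent.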

We therefore need to look at the references for each chapter to sort out those referents that are compounds and never occur independently. For this group of references, we would be able to automatically correct the data set by appending the suffix reference to the compound reference.

A dependent reference can exist in different functions, including verbs, predicate complements, subjects and objects. The difference between independent and dependent references is that whenever a dependent reference occurs as a substantive, it will need the suffix.

In [None]:
actor_dict = {}

for case in compound_length_2:
    #The row number of the constituent is found by getting the compound list and taking the second element:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference == 'sfx' and data.iloc[row_constituent].func == '-gentf' and data.iloc[case].actor != '...':
        chapter = data.iloc[case].chapter
        compound_actor = data.iloc[case].actor
        dict_key = f'{chapter}: {compound_actor}'
        
        #All rows in which the compound reference occurs in the present chapter.
        chapter_rows = data[(data.actor == compound_actor) & (data.chapter==chapter)
                            & (data.reference != 'sfx') & (data.func.isin(['VbPred']) == False)].index
        ref_type_list = []
        
        #Looping over the subset. Whenever the compound is longer than one (= has a constituent) it is marked as dependent.
        #If the length is exactly one (no constituent), it is marked as independent.
        for r in chapter_rows:
            if len(data['compound'][r].split()) > 1:
                ref_type = 'dependent'
            else:
                ref_type = 'independent'
            ref_type_list.append(ref_type)
        actor_dict[dict_key] = [ref_type_list, chapter_rows]
#actor_dict

As becomes evident in the actor_dict, some referents only occur in conjunction with a suffix and never independently. Let's inspect this group more carefully:

In [None]:
n=0
for actor in actor_dict:
    actor_set = set(actor_dict[actor][0])
    chapter_rows = actor_dict[actor][1]
    if len(actor_set) == 1 and 'dependent' in actor_set: #The actor_set must only contain dependent referents
        for row in chapter_rows:
            compound_references = data['compound'][row].split()
            display(data.iloc[int(compound_references[0]):int(compound_references[-1])+1])
            n+=1
print(n)

Generally, the cases support the hypothesis that compound references that never occur independently include the references of their respective constituents in their compound reference. A closer inspection of each of these actor references reveals that they all intend to include the suffix, either by using a reference also used in the suffix (e.g. 'BN') or by providing number, gender and person information corresponding to the embedded suffix (e.g. '3sm"HE"').

A few examples go counter to this notion:
* QDC BN JFR>L (Lev 22)
* TBW>H FDH (Lev 25)
* '>X >X 2sm"YOUSgmas"' (Lev 25)
* JD >X 2sm"YOUSgmas" (Lev 25)
* G>LH BJT MWCB <JR XWMH (Lev 25)
* MMKR BJT MWCB <JR XWMH (Lev 25)
* MMKR BJT MWCB <JR XWMH (Lev 25)

Inspecting each of these instances in detail reveals that they might in fact be exceptions. Read in their context, the references seem to be wrong which might explain the inconsistency. We will need to correct this list manually.

Let's take a look at the actors also occurring independently:

In [None]:
n=0
for actor in actor_dict:
    actor_set = set(actor_dict[actor][0])
    chapter_rows = actor_dict[actor][1]
    if 'independent' in actor_set:
        for row in chapter_rows:
            print(actor)
            compound_references = data['compound'][row].split()
            display(data.iloc[int(compound_references[0]):int(compound_references[-1])+1])
        n+=1
n

When inspecting these cases, it becomes clear that these compounds with suffixes exist independently and apparently do not intend to include the suffix in the compound reference. There is only one exception to this observation, namely the actor '>X BN "YOUSgmas"', which occurs in Lev 25. The reference looks like a reference dependent on the suffix, but in fact it occurs independently twice (25:41, 54), or at least independently of the suffix "your son". Perhaps this reference has to be changed manually so that it better reflects the actor's status as independent.

In [None]:
B.shbLink(529158)
B.pretty(529158)

Other than that, the pattern is consistent, and we can exclude these cases from our list of mismatches. We do this by adding the cases to our exclude_cases list:

In [None]:
n=0
for actor in actor_dict:
    actor_set = set(actor_dict[actor][0])
    chapter_row = actor_dict[actor][1]
    if 'independent' in actor_set:
        for row in chapter_row:
            if row not in exclude_cases and row in compound_length_2:
                exclude_cases.append(row)
                n+=1
            
print(f'Excluded cases: {(n)}')
print(f'Total number of excluded cases: {len(exclude_cases)}')

#### 3.1.3. Inspecting lexical compounds

The next step is to inspect the remaining two-length compounds, namely those compounds with a lexical constituent in contrast to a suffix constituent.

In [None]:
for case in compound_length_2:
    #The row number of the constituent is found by getting the compound list and taking the second row:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference != 'sfx':
        display(data.iloc[case:row_constituent+1])

Most of the cases concern an empty compound or an empty constituent. It makes good sense not to include empty constituents in a compound or to include a constituent in an empty compound. For that reason, we can reasonably regard the empty cases as consistent, and we will add those to our exclude_cases list:

In [None]:
n=0
for case in compound_length_2:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference != 'sfx' and case not in exclude_cases:
        if data.iloc[case].actor == '...' or data.iloc[row_constituent].actor == '...':
            exclude_cases.append(case)
            n+=1    
        
print(f'Excluded cases: {(n)}')
print(f'Total number of excluded cases: {len(exclude_cases)}')

Let's take a look at the remaining cases:

In [None]:
n=0
for case in compound_length_2:
    #The row number of the constituent is found by getting the compound list and taking the second row:
    row_compound = data['compound'][case].split()
    row_constituent = int(row_compound[1])
    if data.iloc[row_constituent].reference != 'sfx' and case not in exclude_cases:
        display(data.iloc[case:row_constituent+1])
        highlight_word = data['slots'][row_constituent]
        B.pretty(data['clause_atom'][case], highlights = {int(highlight_word): 'gold'})
        n+=1
n

When inspecting the results above, it becomes clear that the inconsistency is caused either by errors in the participant tracking or by more complicated issues, e.g. a double object where one referent is referred to in two ways. In short, I need to correct these cases manually in view of their larger context.

#### 3.1.4. Updating error_list

We have now looked at compound references consisting of two rows (one compound and one constituent). How many cases have we excluded so far?

In [None]:
print(len(exclude_cases))

Some of these compounds are themselves constituents of longer compounds. Therefore, by excluding those cases, longer compounds may be affected as well and could be excluded.
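
How an exclusion can propagate to a longer compound is sketched below on invented rows (the actor labels and the helper name `remaining_mismatches` are hypothetical). A mismatching constituent is ignored once its row has been judged consistent and put on the exclusion list:

```python
# Invented rows: row 0 is a compound containing rows 1 and 2; row 1 contains row 2.
toy = {
    0: {"actor": "A C", "compound": [0, 1, 2]},
    1: {"actor": "B", "compound": [1, 2]},
    2: {"actor": "C", "compound": [2]},
}

def remaining_mismatches(rows, candidates, excluded=()):
    """Return candidate rows that still mismatch after ignoring excluded rows."""
    remaining = []
    for row in candidates:
        if row in excluded:
            continue  # the row itself has been judged consistent
        compound_actor = rows[row]["actor"]
        for c in rows[row]["compound"][1:]:  # skip the compound itself
            if rows[c]["actor"] not in compound_actor and c not in excluded:
                remaining.append(row)
                break
    return remaining

print(remaining_mismatches(toy, [0, 1]))
print(remaining_mismatches(toy, [0, 1], excluded={1}))
```

Without exclusions both rows 0 and 1 are mismatches. Once row 1 is excluded, row 0 is cleared as well, because its only mismatching constituent was row 1.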

In [None]:
def updateErrorList(compound_list, exclude_cases=()): #An immutable default avoids sharing one list between calls
    error_list = []
    for row in compound_list:
        compound = data['compound'][row].split() #A list of rows of the compound and its constituents
        compound_actor = str(data.iloc[row].actor)

        error = False

        if row not in exclude_cases:

            for c in compound[1:]: #We skip the first element of the compound_list because that is the compound itself
                constituent_actor = str(data.iloc[int(c)].actor)

                if constituent_actor not in compound_actor and int(c) not in exclude_cases: #Ignoring compounds whose
                     error = True                                                           #constituents are stored in
                                                                                            #the exclude cases          
        if error == True:
            error_list.append(row)
    
    return error_list

updated_error_list = updateErrorList(error_list, exclude_cases)

print(f'Original count of mismatches: {len(error_list)}')
print(f'Cases to be excluded: {len(exclude_cases)}')
print(f'Number of mismatches after cross-checking: {len(updated_error_list)}')

Finally, we have identified 296 mismatches for manual correction:

In [None]:
#updated_error_list

#### 3.1.5. Statistics of mismatches

Which chapters have the largest proportions of mismatches?

In [None]:
error_dict = collections.defaultdict(int)
for case in updated_error_list:
    chapter = data['chapter'][case]
    error_dict[chapter] += 1
    
#error_dict

In [None]:
new_dict = {}

for chapter in error_dict:
    where = T.nodeFromSection(('Leviticus', int(chapter),))
    length = len(data[data.chapter == chapter])
    new_dict[chapter] = [error_dict[chapter], length]
    
#new_dict

In [None]:
table = pd.DataFrame(new_dict).T
table.columns = ['mismatches','total_words']
#table

In [None]:
prop_table = table.mismatches/table.total_words*100

In [None]:
plot = prop_table.plot(kind='bar', color='darkblue')
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.ylabel('%', fontsize=16)
plt.show()
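
The proportion plotted above is the number of mismatches in a chapter divided by the number of dataset rows for that chapter. A minimal sketch with invented counts (the numbers below are not results from the dataset):

```python
from collections import Counter

# Invented data: the chapter of every dataset row, and the chapter of every mismatch.
row_chapters = [17] * 4 + [18] * 5 + [25] * 10
mismatch_chapters = [18, 18, 25]

rows_per_chapter = Counter(row_chapters)
mismatches_per_chapter = Counter(mismatch_chapters)

# Percentage of rows in each chapter that are mismatches (0 if a chapter has none).
proportions = {
    chapter: 100 * mismatches_per_chapter[chapter] / total
    for chapter, total in rows_per_chapter.items()
}
print(proportions)
```

Unlike the dictionary built from `error_dict`, this sketch also lists chapters without any mismatch, which can make the bar chart easier to compare across all chapters.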

Chapters 18 and 25 show the highest proportion of mismatches. This fact justifies a close-reading of these chapters for further observing the dataset and its representation of the participant references.

### 3.2. Consistency between predicate and subject references

Before close-reading selected chapters of the dataset, we will make one additional objective check for internal consistency. Here, we will check whether the explicit subject and the predicate phrase in a clause always refer to the same actor.

First, a subset is created from the column 'func'. Only references occurring as subjects are included. Second, a new subset is created in which only the predicates are included for the subjects in the first subset:

In [None]:
subjects = data[data.func == 'Subj'] #Creates a subset consisting of all subject phrases in the data set.

#Using the subject subset to create a new subset consisting of all predicate phrases.
verbs = data[(data.clause_atom.isin(subjects.clause_atom)) & (data.func=='VbPred')]

The two subsets are compared. If the predicate and the explicit subject do not contain the same actor, the clause atom is added to an error_list:

In [None]:
error_list = []

for cl in verbs.clause_atom:
    predicate = data[(data.clause_atom == cl) & (data.func == 'VbPred')].actor.item()
    subject = data[(data.clause_atom == cl) & (data.func == 'Subj')].actor.item()
    
    if predicate != subject:
        print(cl, predicate, subject)
        error_list.append(cl)
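
The loop above relies on `.item()`, which expects exactly one matching row per clause atom and raises a `ValueError` otherwise. As a hedged alternative, the comparison can be sketched with sets of actors per clause atom. The rows and the helper name `subject_predicate_mismatches` below are invented for illustration:

```python
toy_rows = [
    {"clause_atom": 1, "func": "Subj", "actor": "A"},
    {"clause_atom": 1, "func": "VbPred", "actor": "A"},
    {"clause_atom": 2, "func": "Subj", "actor": "B"},
    {"clause_atom": 2, "func": "VbPred", "actor": "C"},
    {"clause_atom": 3, "func": "VbPred", "actor": "D"},  # no explicit subject
]

def subject_predicate_mismatches(rows):
    """Return clause atoms whose subject and predicate actors differ."""
    subjects, predicates = {}, {}
    for r in rows:
        if r["func"] == "Subj":
            subjects.setdefault(r["clause_atom"], set()).add(r["actor"])
        elif r["func"] == "VbPred":
            predicates.setdefault(r["clause_atom"], set()).add(r["actor"])
    # Only clause atoms that have both an explicit subject and a predicate are compared.
    shared = subjects.keys() & predicates.keys()
    return sorted(cl for cl in shared if subjects[cl] != predicates[cl])

print(subject_predicate_mismatches(toy_rows))
```

Clause atom 3 is skipped because it has no explicit subject, and only clause atom 2 is reported.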

There are 6 instances of clauses with an explicit subject and for which the subject and the predicate do not correspond.

In the first case, Lev 19:34, the explicit subject "the foreigner" has been annotated with the actor "you (sg)", perhaps because of the reference to this participant in the preceding verse. This interpretation, however, is not right and should be corrected.

In [None]:
B.shbLink(error_list[0])
B.pretty(error_list[0])

# data[(data.chapter == 19) & (data.verse.between(33,34))]

In Lev 20:12 the subject "the two of them" is used to refer to both the man and his daughter-in-law. In the data set the actor of the subject has been annotated as "the two of the man". This is an exceptional case which is not easy to compute. However, in the similar cases in the preceding and subsequent verses, the subject and the verb have both been given the same reference to "the two of the man".

The data set could be manually corrected with a reference to both actors. 

In [None]:
B.shbLink(error_list[1])
B.pretty(error_list[1])

# data[(data.chapter == 20) & (data.verse.between(11,13))]

In the third case, Lev 21:4, the subject of the predicate is "he", deduced from the implicit subject in Lev 21:1. The subject in the clause is "baal" (husband?). In fact, the data set does not correspond to the database here, because "baal" is labeled as Predicative Adjunct. The actor "baal" occurs only once in Lev 21, so it is strange that the actor has not been marked as empty due to its uniqueness, as is normal procedure. This could be corrected.

In [None]:
B.shbLink(error_list[2])
B.pretty(error_list[2])

data[(data.chapter == 21) & (data.verse == 4)]

In the fourth case, Lev 25:30, the subject of the predicate is "house". In the data set, however, the actor of the predicate is not "house" but "man" or "anyone", referring to the subject in the preceding verse.

In [None]:
B.shbLink(error_list[3])
B.pretty(error_list[3])

# data[(data.chapter == 25) & (data.verse.between(29,30))]

As for the fifth case, Lev 26:8, there is an inconsistency between the predicate and subject in this clause. As the table below shows, the verb of the clause atom in question apparently refers back to the subject of the preceding clause, in which "five of you" is the subject. In the second clause, the subject is now "hundred of you", taking another subset of "you" as the subject.

In [None]:
data[(data.chapter == 26) & (data.verse == 8)]

In the sixth and final case, Lev 26:13, the subject is marked as empty, because this subject only occurs once in the discourse. What is being labeled as subject in this clause should probably be interpreted as a predicate complement (cf. the query below, especially 1 Kings 2:27) because the entity describes the subject "you" as servants.

In [None]:
query = '''
clause typ=InfC
  phrase function=Pred
    =: word lex=MN
    word lex=HJH[
  phrase function=PreC
'''
results = B.search(query)
B.show(results)

To conclude this second check for internal consistency, then, six cases of inconsistencies were identified as to the linking of the predicate and the subject which we would normally expect to refer to the same actor. The apparent reasons can be briefly summarized:

* Lev 19:34; 25:30; 26:8: The actor of the subject refers to the subject of the preceding verse.
* Lev 20:12: The subject refers only to one participant instead of a conjunction of two participants.
* Lev 21:4: The subject is probably not the actual subject of the clause but rather a predicative adjunct.
* Lev 26:13: The subject has no actor because it occurs only once.

## 4. Close-reading of selected chapters

Previously, it was determined that chapters 18 and 25 contain the highest proportions of mismatches in terms of compound references and constituents. This fact justifies a close-reading of these two chapters.

First a function is defined to apply unique colors to the actors in order to distinguish them:

In [None]:
def colorMap(dataframe, compound=True):
    '''
    The function creates a colormap of all actor references in a dataframe. Each actor reference is assigned a color
    and mapped with the word nodes in columns first_word and last_word.
    Input: dataframe
    Output: dictionary with word node and color.
    '''

    '''
    This loop runs through each line of the dataframe and creates a list of words
    in the range between first_word and last_word. Afterwards, each word is stored in a dictionary
    with their respective actor reference.
    '''
    actor_dict = {}
    for row in dataframe.iterrows():
        word_slots = dataframe['slots'][row[0]].split()
        actor = dataframe['actor'][row[0]]

        for w in word_slots:
            
            #If compound is set to True, word nodes won't be overwritten and compound phrases have only one actor reference
            if compound == True:
                if w not in actor_dict and actor != '...': #Stipulating that word nodes cannot be overwritten.
                    actor_dict[w] = actor
            else:
                if actor != '...':
                    actor_dict[w] = actor
                    
    '''
    The loop creates a range of colors corresponding to the number of different actors in the dictionary created
    in the preceding loop. Finally, each actor is stored in a dictionary with its unique color.
    '''
    
    red = Color('red')
    blue = Color('blue')
    color_range = list(red.range_to(blue, len(set(actor_dict.values()))))
    
    color_dict = {}
    n=0
    for actor in set(actor_dict.values()):
        color_dict[actor] = color_range[n]
        n+=1
    '''
    The final loop goes through each word node of the actor_dict created above. Each word node is then stored in a new
    dictionary with the color corresponding to its respective actor.
    '''
    actor_color_dict = {}
    for w in actor_dict:
        actor = actor_dict[w]
        color = color_dict[actor]
        actor_color_dict[int(w)] = color
    
    return actor_color_dict
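
The effect of the `compound` switch can be sketched without the `colour` package. The example below swaps in a plain linear interpolation from red to blue in RGB, so the intermediate shades will differ from those produced by `Color.range_to`. The helper names `actor_palette` and `slot_colors` and the toy rows are hypothetical:

```python
def actor_palette(actors):
    """Assign each distinct actor a hex color on a straight red-to-blue line in RGB."""
    unique = sorted(set(actors))
    palette = {}
    for i, actor in enumerate(unique):
        t = i / (len(unique) - 1) if len(unique) > 1 else 0.0
        palette[actor] = "#{:02x}{:02x}{:02x}".format(round(255 * (1 - t)), 0, round(255 * t))
    return palette

def slot_colors(rows, keep_first=True):
    """Map each slot to the color of its actor; rows are (slots, actor) pairs."""
    slot_actor = {}
    for slots, actor in rows:
        if actor == "...":  # empty actors get no color
            continue
        for s in slots.split():
            if keep_first and s in slot_actor:
                continue  # the first (outermost) reference wins
            slot_actor[s] = actor
    palette = actor_palette(slot_actor.values())
    return {int(s): palette[a] for s, a in slot_actor.items()}

toy = [("10 11 12", "A"), ("11", "B"), ("20", "...")]
print(slot_colors(toy, keep_first=True))
print(slot_colors(toy, keep_first=False))
```

With `keep_first=True` the whole compound keeps the color of its first reference. With `keep_first=False` the embedded constituent in slot 11 is recolored with its own actor.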

In [None]:
def displayText(dataframe, compound=True):
    '''
    Input: dataframe
    Output: Pretty display of verses with corresponding colorcodes from colorMap()
    '''
    
    clause_atom_list = set(dataframe.clause_atom)
    verse_list = set([L.u(cl, 'verse')[0] for cl in clause_atom_list])
    
    for v in range(min(verse_list), max(verse_list)+1):
        B.pretty(v, highlights = colorMap(dataframe, compound))

In [None]:
displayText(data[data.chapter == 18])