# Correcting Relative phrases and Predicate phrases

This notebook corrects the Volition and Affectedness dataset. During the annotation of those features, some Relative (`Rela`) phrases were annotated when the subject or object was not explicit. Unfortunately, those phrases are not marked as referring to either the subject or the object, so they cannot easily be qualified. I therefore need to make some corrections to avoid the use of `Rela` phrases.

**Update:** Secondly, some annotations were attached to the predicate even in sentences with an explicit subject. These cases need to be corrected to stay consistent when mapping to other data.

Both types of corrections are stored in [Lev17-26.Volition_Affectedness_all_cor_4.csv](datasets/Lev17-26.Volition_Affectedness_all_cor_4.csv).

In [2]:
#Dataset path
PATH = 'datasets/'

import pandas as pd
import numpy as np

from tf.app import use

In [3]:
A = use('bhsa', hoist=globals(), mod='ch-jensen/participants/actor/tf')

	connecting to online GitHub repo annotation/app-bhsa ... connected
Using TF-app in C:\Users\Ejer/text-fabric-data/annotation/app-bhsa/code:
	rv1.2=#5fdf1778d51d938bfe80b37b415e36618e50190c (latest release)
	connecting to online GitHub repo etcbc/bhsa ... connected
Using data in C:\Users\Ejer/text-fabric-data/etcbc/bhsa/tf/c:
	rv1.6=#bac4a9f5a2bbdede96ba6caea45e762fe88f88c5 (latest release)
	connecting to online GitHub repo etcbc/phono ... connected
Using data in C:\Users\Ejer/text-fabric-data/etcbc/phono/tf/c:
	r1.2=#1ac68e976ee4a7f23eb6bb4c6f401a033d0ec169 (latest release)
	connecting to online GitHub repo etcbc/parallels ... connected
Using data in C:\Users\Ejer/text-fabric-data/etcbc/parallels/tf/c:
	r1.2=#395dfe2cb69c261862fab9f0289e594a52121d5c (latest release)
	connecting to online GitHub repo ch-jensen/participants ... connected
Using data in C:\Users\Ejer/text-fabric-data/ch-jensen/participants/actor/tf/c:
	r1.5=#1c17398f92c0836c06de5e1798687c3fa18133cf (latest release)

### Extract `Rela` phrases

In [46]:
query = '''
book book=Leviticus
  chapter chapter=17|18|19|20|21|22|23|24|25|26
     phrase function=Rela
'''

Rela = A.search(query)
Rela = [r[2] for r in Rela]

  1.23s 109 results


In [47]:
Rela

[688358,
 688363,
 688368,
 688391,
 688419,
 688432,
 688435,
 688451,
 688454,
 688461,
 688493,
 688501,
 688504,
 688507,
 688535,
 688573,
 688580,
 688603,
 688758,
 688782,
 688789,
 688804,
 688808,
 688814,
 688822,
 688876,
 689069,
 689075,
 689190,
 689212,
 689233,
 689236,
 689283,
 689291,
 689323,
 689334,
 689337,
 689345,
 689357,
 689369,
 689382,
 689397,
 689408,
 689422,
 689443,
 689470,
 689481,
 689501,
 689512,
 689536,
 689552,
 689555,
 689600,
 689605,
 689608,
 689684,
 689687,
 689753,
 689763,
 689770,
 689777,
 689828,
 689838,
 689842,
 689867,
 689872,
 689878,
 689881,
 689884,
 689889,
 690008,
 690032,
 690036,
 690043,
 690169,
 690189,
 690211,
 690253,
 690431,
 690440,
 690497,
 690506,
 690793,
 690838,
 690843,
 691040,
 691066,
 691084,
 691117,
 691119,
 691123,
 691132,
 691207,
 691248,
 691264,
 691268,
 691275,
 691282,
 691284,
 691389,
 691541,
 691795,
 691832,
 691839,
 691881,
 691894,
 691899,
 691973,
 691985]

### Comparing with participant-tracking data

We check whether each Rela phrase (or any of its sub-components, that is, its sub-phrases and words) is annotated with participant data:

In [48]:
phrases = []

for ph in Rela:
    sub_ph = L.d(ph, 'phrase_atom')
    word = L.d(ph, 'word')
    
    if F.actor.v(ph):
        phrases.append(ph)
        
    for s in sub_ph:
        if F.actor.v(s):
            phrases.append(s)
            
    for w in word:
        if F.actor.v(w) or F.prs_actor.v(w):
            phrases.append(w)
            
print(f'Number of Rela phrases annotated with participant data: {len(phrases)}')

Number of Rela phrases annotated with participant data: 0


None of the Rela phrases are annotated with participant data, so we can safely correct the Affectedness-Volition dataset for this particular issue.

### Correcting dataset

The first step is to isolate those Rela phrases that are actually annotated:

In [49]:
#A dictionary of columns to be imported as integers.
int_cols = {col:'Int64' for col in ['clause','Act_phr','Und1_phr','Und2_phr']}

data = pd.read_csv(f'{PATH}Lev17-26.Volition_Affectedness_all_cor_3.csv', dtype=int_cols)

data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,clause,lex,Act_phr,Act_vol,Act_aff,Und1_phr,Und1_vol,Und1_aff,Und2_phr,Und2_vol,Und2_aff,comment
0,0,0,439650,DBR[,688348,y,n,688349.0,y,y,,,,
1,1,1,439651,>MR[,688350,y,n,,,,,,,
2,2,2,439652,DBR[,688351,y,n,688352.0,y,y,,,,
3,3,3,439653,>MR[,688354,y,n,688355.0,y,y,,,,
4,4,4,439655,YWH[,688360,y,n,688358.0,n,n,,,,


We check whether each Rela phrase occurs in any of the three columns containing phrase nodes:

In [50]:
Rela1 = []

for ph in Rela:
    if ph in list(data.Act_phr) or ph in list(data.Und1_phr) or ph in list(data.Und2_phr):
        Rela1.append(ph)
        
print(f'Number of Rela-phrases to correct: {len(Rela1)}')

Number of Rela-phrases to correct: 73
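
The check above rebuilds three Python lists for every phrase in `Rela`. A minimal sketch of the same test with a single set lookup follows. The stand-in columns are copied from the first rows shown by `data.head()` rather than read from the CSV, and `annotated_nodes` is a hypothetical helper name, not part of the notebook.

```python
# Stand-in columns taken from the first five rows of data.head();
# None marks an empty cell.
act_phr = [688348, 688350, 688351, 688354, 688360]
und1_phr = [688349, None, 688352, 688355, 688358]
und2_phr = [None, None, None, None, None]

# First three Rela nodes from the query result above.
rela = [688358, 688363, 688368]


def annotated_nodes(*columns):
    """Collect every non-empty phrase node from the given columns into one set."""
    return {node for column in columns for node in column if node is not None}


annotated = annotated_nodes(act_phr, und1_phr, und2_phr)

# Keep only the Rela phrases that occur somewhere in the annotation columns.
rela_to_correct = [ph for ph in rela if ph in annotated]
print(rela_to_correct)
```

With the real dataset, the three columns would presumably be passed with their missing values dropped first; the set is then built once instead of once per phrase.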


Secondly, we walk through the cases one by one and correct each of them. The procedure is simple:

* If Rela is subject: The reference is changed to the predicate of the clause
* If Rela is object: The reference is retained (or linked to the existing object of the clause if coreferring)

The corrections are inserted into a new dataset ("Lev17-26.Volition_Affectedness_all_cor_4.csv").
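
The two rules above can be sketched as a small function. This is only an illustration of the decision: in the notebook the corrections are made by hand while reviewing each clause, the role of the relative is judged by the annotator, and the function name, its parameters and the node numbers below are all hypothetical.

```python
def corrected_reference(rela_phrase, role, predicate_phrase, existing_object=None):
    """Return the phrase node that should be referenced instead of a Rela phrase.

    role is 'subject' or 'object'; it is supplied by the annotator, not computed.
    existing_object is the explicit object phrase of the clause, if the
    relative corefers with one.
    """
    if role == 'subject':
        # Subject relatives are re-pointed to the predicate of the clause.
        return predicate_phrase
    if role == 'object':
        # Object relatives are retained, unless they corefer with an
        # explicit object, in which case that object is referenced.
        return existing_object if existing_object is not None else rela_phrase
    raise ValueError(f'unknown role: {role!r}')


# Hypothetical node numbers, for illustration only.
print(corrected_reference(101, 'subject', predicate_phrase=100))
print(corrected_reference(101, 'object', predicate_phrase=100))
print(corrected_reference(101, 'object', predicate_phrase=100, existing_object=102))
```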

In [51]:
def show(ph):
    print(ph)
    print(n)
    A.pretty(L.u(ph, 'clause')[0], highlights={ph:'gold'})

In [52]:
n=0

In [126]:
show(Rela1[n])
n+=1

IndexError: list index out of range
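
The `IndexError` above simply means that the cell was run once more after the last case had been shown. A minimal sketch of a guarded stepping helper is given below; `step` is a hypothetical name, and the list is a short stand-in for `Rela1`.

```python
def step(items, position):
    """Return (item, next_position); item is None once the list is exhausted."""
    if position >= len(items):
        return None, position
    return items[position], position + 1


# Stand-in for Rela1, using the first three Rela nodes from the query output.
cases = [688358, 688363, 688368]
n = 0
seen = []

while True:
    ph, n = step(cases, n)
    if ph is None:
        break  # every case has been reviewed; no IndexError is raised
    seen.append(ph)  # in the notebook this would be show(ph)

print(seen, n)
```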

### Checking corrections

A small number of (object) relative phrases have been retained:

In [141]:
#A dictionary of columns to be imported as integers.
int_cols = {col:'Int64' for col in ['clause','Act_phr','Und1_phr','Und2_phr']}

data = pd.read_csv(f'{PATH}Lev17-26.Volition_Affectedness_all_cor_4.csv', dtype=int_cols)

data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,clause,lex,Act_phr,Act_vol,Act_aff,Und1_phr,Und1_vol,Und1_aff,Und2_phr,Und2_vol,Und2_aff,comment
0,0,0,439650,DBR[,688348,y,n,688349.0,y,y,,,,
1,1,1,439651,>MR[,688350,y,n,,,,,,,
2,2,2,439652,DBR[,688351,y,n,688352.0,y,y,,,,
3,3,3,439653,>MR[,688354,y,n,688355.0,y,y,,,,
4,4,4,439655,YWH[,688360,y,n,688358.0,n,n,,,,


In [142]:
Rela2 = []

for ph in Rela:
    if ph in list(data.Act_phr) or ph in list(data.Und1_phr) or ph in list(data.Und2_phr):
        Rela2.append(ph)
        
print(f'Number of Rela-phrases to correct: {len(Rela2)}')

Number of Rela-phrases to correct: 18


In [145]:
n=0

In [163]:
show(Rela2[n])
n+=1

691985
17


## 2. Correcting Predicate phrases

Some annotations have been added to the Predicate phrase, even if a subject phrase exists. These cases will be corrected here:

In [164]:
#A dictionary of columns to be imported as integers.
int_cols = {col:'Int64' for col in ['clause','Act_phr','Und1_phr','Und2_phr']}

data = pd.read_csv(f'{PATH}Lev17-26.Volition_Affectedness_all_cor_4.csv', dtype=int_cols)

data.head(35)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,clause,lex,Act_phr,Act_vol,Act_aff,Und1_phr,Und1_vol,Und1_aff,Und2_phr,Und2_vol,Und2_aff,comment
0,0,0,439650,DBR[,688348.0,y,n,688349.0,y,y,,,,
1,1,1,439651,>MR[,688350.0,y,n,,,,,,,
2,2,2,439652,DBR[,688351.0,y,n,688352.0,y,y,,,,
3,3,3,439653,>MR[,688354.0,y,n,688355.0,y,y,,,,
4,4,4,439655,YWH[,688360.0,y,n,688358.0,n,n,,,,
5,5,5,439656,>MR[,688361.0,y,n,,,,,,,
6,6,6,439658,CXV[,688364.0,y,n,688365.0,n,y,,,,
7,7,7,439659,CXV[,688369.0,y,n,,,,,,,
8,8,8,439660,BW>[,688374.0,y,n,688374.0,n,y,688372.0,n,n,
9,9,9,439661,QRB[,688375.0,y,n,688376.0,n,y,688377.0,n,n,


In [165]:
predicates = []

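#Use `_` for the row index so that the counter `n` used by show() is not overwritten.
for _, row in data.iterrows():
    
    phrases = [row.Act_phr, row.Und1_phr, row.Und2_phr]
    
    for ph in phrases:
        #Skip empty cells explicitly: NaN is truthy and pd.NA cannot be used in a truth test.
        if pd.notna(ph):
            if F.function.v(int(ph)) in {'Pred','PreC','PreO','PtcO'}:
                predicates.append(int(ph))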
                
len(predicates)

743

In [166]:
missing_subjects = []

for ph in predicates:
    clause = L.u(ph, 'clause')[0]
    
    for phr in L.d(clause, 'phrase'):
        if F.function.v(phr) == 'Subj':
            missing_subjects.append(phr)
            
len(missing_subjects)

10

In [167]:
missing_subjects

[688473,
 688562,
 689226,
 689242,
 689529,
 690271,
 691323,
 691326,
 691596,
 691694]
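
The two cells above collect the subject phrases only, so the link back to the annotated predicate is lost. A minimal sketch that keeps both together is shown below. The three dictionaries are stand-ins for the Text-Fabric lookups (`F.function.v`, `L.u`, `L.d`) so that the example runs on its own; the node numbers come from the review of clause 439721, but the function labels assigned to them are assumptions made for illustration.

```python
# Stand-in lookup tables replacing F.function.v, L.u and L.d.
# The function labels are assumed, not read from the BHSA data.
FUNCTION = {688561: 'Pred', 688562: 'Subj', 688563: 'Cmpl'}
CLAUSE_OF = {688561: 439721, 688562: 439721, 688563: 439721}
PHRASES_IN = {439721: [688561, 688562, 688563]}

PREDICATE_FUNCTIONS = {'Pred', 'PreC', 'PreO', 'PtcO'}


def explicit_subjects(annotated_phrases):
    """Map each annotated predicate phrase to the explicit subject(s) of its clause."""
    found = {}
    for ph in annotated_phrases:
        if FUNCTION.get(ph) not in PREDICATE_FUNCTIONS:
            continue  # only predicate phrases are of interest here
        clause = CLAUSE_OF[ph]
        subjects = [p for p in PHRASES_IN[clause] if FUNCTION.get(p) == 'Subj']
        if subjects:
            found[ph] = subjects
    return found


# The two annotated phrases of clause 439721 (Act_phr and Und1_phr).
result = explicit_subjects([688561, 688563])
print(result)
```

Because the result is keyed by predicate phrase, a predicate that is annotated in several columns is reported only once.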

In [168]:
def review(phrase, df=data):
    cl = L.u(phrase, 'clause')[0]
    
    print(f'Subject phrase: {phrase}')
    display(df[df.clause == cl])
    A.pretty(cl, highlights={phrase: 'gold'})

#### Correcting phrases

The corrections are made in [Lev17-26.Volition_Affectedness_all_cor_5.csv](datasets/Lev17-26.Volition_Affectedness_all_cor_5.csv).

In [169]:
review(missing_subjects[1])

Subject phrase: 688562


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,clause,lex,Act_phr,Act_vol,Act_aff,Und1_phr,Und1_vol,Und1_aff,Und2_phr,Und2_vol,Und2_aff,comment
57,57,0,439721,DBR[,688561,y,n,688563,y,y,,,,


In [37]:
review(missing_subjects[2])

Subject phrase: 689226


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,clause,lex,Act_phr,Act_vol,Act_aff,Und1_phr,Und1_vol,Und1_aff,Und2_phr,Und2_vol,Und2_aff,comment
230,230,173,439943,DBR[,689225,y,n,689227,y,y,,,,


In [44]:
review(missing_subjects[9])

Subject phrase: 691694


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,clause,lex,Act_phr,Act_vol,Act_aff,Und1_phr,Und1_vol,Und1_aff,Und2_phr,Und2_vol,Und2_aff,comment
859,859,76,440740,HLK[,691693,y,n,691695,n,y,,,,
