# Correct Relative phrases

This notebook corrects the Volition and Affectedness dataset. During the annotation of those features, some Relative (`Rela`) phrases were annotated when the subject or object was not explicit. However, those phrases are not marked as referring to either the subject or the object, so they cannot easily be qualified. I therefore need to make some corrections to avoid the use of `Rela` phrases.

In [27]:
#Dataset path
PATH = 'datasets/'

import pandas as pd
import numpy as np

from tf.app import use

In [4]:
A = use('bhsa', hoist=globals(), mod='ch-jensen/participants/actor/tf')

TF app is up-to-date.
Using annotation/app-bhsa commit 5fdf1778d51d938bfe80b37b415e36618e50190c (=latest)
  in C:\Users\Ejer/text-fabric-data/__apps__/bhsa.
Using etcbc/bhsa/tf - c r1.4 in C:\Users\Ejer/text-fabric-data
Using etcbc/phono/tf - c r1.1 in C:\Users\Ejer/text-fabric-data
Using etcbc/parallels/tf - c r1.1 in C:\Users\Ejer/text-fabric-data
Using ch-jensen/participants/actor/tf - c r?? in C:\Users\Ejer/text-fabric-data


### Extract `Rela` phrases

In [30]:
query = '''
book book=Leviticus
  chapter chapter=17|18|19|20|21|22|23|24|25|26
     phrase function=Rela
'''

Rela = A.search(query)
Rela = [r[2] for r in Rela]

  0.54s 109 results


In [32]:
Rela

[688358,
 688363,
 688368,
 688391,
 688419,
 688432,
 688435,
 688451,
 688454,
 688461,
 688493,
 688501,
 688504,
 688507,
 688535,
 688573,
 688580,
 688603,
 688758,
 688782,
 688789,
 688804,
 688808,
 688814,
 688822,
 688876,
 689069,
 689075,
 689190,
 689212,
 689233,
 689236,
 689283,
 689291,
 689323,
 689334,
 689337,
 689345,
 689357,
 689369,
 689382,
 689397,
 689408,
 689422,
 689443,
 689470,
 689481,
 689501,
 689512,
 689536,
 689552,
 689555,
 689600,
 689605,
 689608,
 689684,
 689687,
 689753,
 689763,
 689770,
 689777,
 689828,
 689838,
 689842,
 689867,
 689872,
 689878,
 689881,
 689884,
 689889,
 690008,
 690032,
 690036,
 690043,
 690169,
 690189,
 690211,
 690253,
 690431,
 690440,
 690497,
 690506,
 690793,
 690838,
 690843,
 691040,
 691066,
 691084,
 691117,
 691119,
 691123,
 691132,
 691207,
 691248,
 691264,
 691268,
 691275,
 691282,
 691284,
 691389,
 691541,
 691795,
 691832,
 691839,
 691881,
 691894,
 691899,
 691973,
 691985]

### Comparing with participant-tracking data

We check whether each `Rela` phrase (or any of its sub-components, i.e. its phrase atoms and words) is annotated with participant data:

In [148]:
phrases = []

for ph in Rela:
    sub_ph = L.d(ph, 'phrase_atom')
    word = L.d(ph, 'word')
    
    if F.actor.v(ph):
        phrases.append(ph)
        
    for s in sub_ph:
        if F.actor.v(s):
            phrases.append(s)
            
    for w in word:
        if F.actor.v(w) or F.prs_actor.v(w):
            phrases.append(w)
            
print(f'Number of Rela phrases annotated with participant data: {len(phrases)}')

Number of Rela phrases annotated with participant data: 0


None of the Rela phrases are annotated with participant data, so we can safely correct the Affectedness-Volition dataset for this particular issue.

### Correcting dataset

The first step is to isolate those Rela phrases that are actually annotated:

In [34]:
#A dictionary of columns to be imported as integers.
int_cols = {col:'Int64' for col in ['clause','Act_phr','Und1_phr','Und2_phr']}

data = pd.read_csv(f'{PATH}Lev17-26.Volition_Affectedness_all_cor_3.csv', dtype=int_cols)

data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,clause,lex,Act_phr,Act_vol,Act_aff,Und1_phr,Und1_vol,Und1_aff,Und2_phr,Und2_vol,Und2_aff,comment
0,0,0,439650,DBR[,688348,y,n,688349.0,y,y,,,,
1,1,1,439651,>MR[,688350,y,n,,,,,,,
2,2,2,439652,DBR[,688351,y,n,688352.0,y,y,,,,
3,3,3,439653,>MR[,688354,y,n,688355.0,y,y,,,,
4,4,4,439655,YWH[,688360,y,n,688358.0,n,n,,,,


We check whether each `Rela` phrase occurs in any of the three columns containing phrase nodes:

In [38]:
Rela1 = []

for ph in Rela:
    if ph in list(data.Act_phr) or ph in list(data.Und1_phr) or ph in list(data.Und2_phr):
        Rela1.append(ph)
        
print(f'Number of Rela-phrases to correct: {len(Rela1)}')

Number of Rela-phrases to correct: 73
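
The membership test above rebuilds three Python lists for every phrase. The following is a minimal sketch of the same check that collects the annotated nodes into one set first. It uses only the standard library; the toy rows, the node list `rela_nodes` and the helper `annotated_nodes` are illustrative assumptions and do not come from the dataset.

```python
# Toy stand-in for the dataset: each row holds the phrase nodes of one clause.
# None marks an empty cell (no phrase annotated in that role).
rows = [
    {'Act_phr': 688348, 'Und1_phr': 688349, 'Und2_phr': None},
    {'Act_phr': 688360, 'Und1_phr': 688358, 'Und2_phr': None},
    {'Act_phr': 688364, 'Und1_phr': None, 'Und2_phr': 688363},
]

# Toy stand-in for the list of Rela phrase nodes returned by the query.
rela_nodes = [688358, 688363, 688368]

PHRASE_COLS = ('Act_phr', 'Und1_phr', 'Und2_phr')

def annotated_nodes(rows, cols=PHRASE_COLS):
    """Collect every non-empty phrase node found in the given columns."""
    return {row[col] for row in rows for col in cols if row[col] is not None}

annotated = annotated_nodes(rows)

# The order of rela_nodes is preserved; each set lookup takes constant time.
rela_to_correct = [ph for ph in rela_nodes if ph in annotated]

print(f'Number of Rela-phrases to correct: {len(rela_to_correct)}')
```

With the pandas DataFrame used in this notebook, the set could presumably be built from `data[col].dropna()` for each of the three columns.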


Second, we walk through each case and correct it. The procedure is simple:

* If Rela is subject: The reference is changed to the predicate of the clause
* If Rela is object: The reference is retained (or linked to the existing object of the clause if coreferring)

The corrections are inserted into a new dataset ("Lev17-26.Volition_Affectedness_all_cor_4.csv").
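
The procedure above can be sketched as a lookup table of decisions applied to the phrase columns. This is only an illustration under stated assumptions: the node numbers are invented, and the names `corrections` and `apply_corrections` are hypothetical. The notebook itself does not show how the corrections were entered.

```python
PHRASE_COLS = ('Act_phr', 'Und1_phr', 'Und2_phr')

# Hypothetical decisions recorded while reviewing each case:
# Rela node -> replacement node.
# Rela phrases that are objects and are retained simply get no entry.
corrections = {
    1001: 1002,  # Rela as subject -> predicate phrase of the same clause
    1010: 1012,  # Rela as object, coreferring with an existing object phrase
}

# Invented toy rows; None marks an empty cell.
rows = [
    {'clause': 1, 'Act_phr': 1001, 'Und1_phr': 1003, 'Und2_phr': None},
    {'clause': 2, 'Act_phr': 1011, 'Und1_phr': 1010, 'Und2_phr': None},
    {'clause': 3, 'Act_phr': 1021, 'Und1_phr': 1020, 'Und2_phr': None},
]

def apply_corrections(rows, corrections, cols=PHRASE_COLS):
    """Return corrected copies of the rows; the input rows are left unchanged."""
    corrected = []
    for row in rows:
        new_row = dict(row)
        for col in cols:
            # dict.get falls back to the current value when no correction exists.
            new_row[col] = corrections.get(row[col], row[col])
        corrected.append(new_row)
    return corrected

corrected_rows = apply_corrections(rows, corrections)
```

Keeping the decisions in a separate mapping leaves the original rows untouched, so the corrected data can be written to a new file and compared with the previous version.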

In [52]:
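def show(ph):
    # Print the phrase node and the current position in the list (global n),
    # then display the enclosing clause with the phrase highlighted.
    print(ph)
    print(n)
    A.pretty(L.u(ph, 'clause')[0], highlights={ph:'gold'})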

In [176]:
n=0

In [250]:
show(Rela1[n])
n+=1

IndexError: list index out of range
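
The `IndexError` simply means that the counter has moved past the last case. One way to step through the cases without hitting it is an iterator that reports the end instead of raising; the sketch below assumes a hypothetical helper `step_through` and a toy node list, and in the notebook the `display` callback would wrap `show`.

```python
def step_through(nodes, display=print):
    """Yield each node once, calling display(position, node) first."""
    for position, node in enumerate(nodes):
        display(position, node)
        yield node

# Toy node list; the callback here only prints position and node.
cases = step_through([688358, 688363, 688368],
                     display=lambda i, ph: print(i, ph))

# Re-run this line (one notebook cell) to advance; None signals the end.
current = next(cases, None)
```

Because `next(cases, None)` returns `None` once the list is exhausted, re-running the cell after the last case does not raise an error.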

### Checking corrections

A small number of (object) relative phrases have been retained:

In [251]:
#A dictionary of columns to be imported as integers.
int_cols = {col:'Int64' for col in ['clause','Act_phr','Und1_phr','Und2_phr']}

data = pd.read_csv(f'{PATH}Lev17-26.Volition_Affectedness_all_cor_4.csv', dtype=int_cols)

data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,clause,lex,Act_phr,Act_vol,Act_aff,Und1_phr,Und1_vol,Und1_aff,Und2_phr,Und2_vol,Und2_aff,comment
0,0,0,439650,DBR[,688348,y,n,688349.0,y,y,,,,
1,1,1,439651,>MR[,688350,y,n,,,,,,,
2,2,2,439652,DBR[,688351,y,n,688352.0,y,y,,,,
3,3,3,439653,>MR[,688354,y,n,688355.0,y,y,,,,
4,4,4,439655,YWH[,688360,y,n,688358.0,n,n,,,,


In [252]:
Rela2 = []

for ph in Rela:
    if ph in list(data.Act_phr) or ph in list(data.Und1_phr) or ph in list(data.Und2_phr):
        Rela2.append(ph)
        
print(f'Number of Rela-phrases retained: {len(Rela2)}')

Number of Rela-phrases retained: 18


In [253]:
n=0

In [271]:
show(Rela2[n])
n+=1

691985
17
