# BHSA-2-RRG

*Christian Canu Højgaard ([cch@dbi.edu](mailto:cch@dbi.edu))*

This notebook transforms syntactic trees generated from the ETCBC database of the Hebrew Bible (BHSA) into Role and Reference Grammar (RRG)-like tree structures suitable for semantic parsing. The transformed trees are used as input to the growing database of RRG trees which is part of the Düsseldorf University [TreeGraSP](https://rrg-bh.phil.hhu.de/editor/index) research project.

The purpose is to bring the syntactic trees as close as possible to RRG-like tree structures so that the need for manual annotation is reduced.

**Content of this notebook:**
1. Import packages and BHSA data
2. Format tree layout
3. Create trees
4. Postprocessing
5. Testing
6. Export

## 1. Import packages and data

In [1]:
import re, collections
from utils import structure, layout
import random

from tf.fabric import Fabric
from tf.app import use

The trees and phono modules are loaded together with the BHSA corpus itself:

In [2]:
TF = Fabric(locations=['~/text-fabric-data/etcbc/bhsa/tf',
                      '~/text-fabric-data/etcbc/trees/tf',
                      '~/text-fabric-data/etcbc/phono/tf'],
            modules='2017')

api = TF.load('''tree
              rela code label number
              language gloss
              typ function
              vs vt gn nu ps sp pdp
              prs_gn prs_nu prs_ps
              lex pfm g_pfm vbe g_vbe prs g_prs trailer_utf8''')

This is Text-Fabric 8.5.2
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

120 features found and 0 ignored
  0.00s loading features ...
   |     0.00s Dataset without structure sections in otext:no structure functions in the T-API
  8.44s All features loaded/computed - for details use loadLog()


In [3]:
A = use('bhsa', api=api, hoist=globals())

## 2. Format tree layout

The BHSA syntactic trees need some processing to make them suitable as RRG-trees:
1. Phonetic representation of the text
2. Further segmentation of the words so that all referring morphemes are represented as constituents (including inflectional suffixes carrying person, number and/or gender information). There are four possible constituents for each word: prefix, lexical root, suffix, pronominal suffix. The "lexical root" is not the dictionary form but the core of the word after stripping possible affixes. In some cases Ø (i.e., zero marking) is used to designate a referent not marked grammatically, e.g., the 3ms *qatal* form, which has no suffix but nevertheless refers to a 3ms participant.
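
The four-constituent segmentation described in item 2 can be sketched independently of the corpus. The helper below is a hypothetical illustration (`Segments` and `split_word` are not part of this notebook): it assumes the affix lengths are already known and simply slices the phonetic string, inserting Ø when a suffix slot is required but empty.

```python
from typing import NamedTuple

class Segments(NamedTuple):
    prefix: str      # e.g. the preformative of a prefix-conjugation verb
    root: str        # core of the word after stripping affixes
    suffix: str      # inflectional ending, or 'Ø' for zero marking
    pronominal: str  # pronominal suffix

def split_word(word, prefix_len=0, suffix_len=0, pron_len=0, zero_suffix=False):
    """Slice a phonetic word into its four possible constituents (hypothetical helper)."""
    pron_start = len(word) - pron_len
    suffix_start = pron_start - suffix_len
    suffix = word[suffix_start:pron_start]
    if zero_suffix and not suffix:
        suffix = 'Ø'  # referent not marked grammatically
    return Segments(word[:prefix_len], word[prefix_len:suffix_start], suffix, word[pron_start:])

# A 3ms qatal form without ending: the suffix slot is filled with Ø
print(split_word('dibber', zero_suffix=True))
# A prefix-conjugation form with a two-character preformative
print(split_word('yiṣmāḥ', prefix_len=2))
```

The first call yields an empty prefix, the root `dibber` and the suffix `Ø`; the second separates the preformative `yi` from the rest of the word.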

The procedure is to format one word at a time and then feed the formatted word into the tree. The formatting is carried out by the following functions.

In [4]:
def stripSigns(w):
    '''The function takes a word string, strips all characters except the accepted letters, and returns the length of the result'''
    
    #Complex vowels (e.g. holem waw) are rendered as one letter in the phonetic representation. Therefore,
    #complex vowels are stripped to one letter first:
    complex_vowels = re.sub(r'OW|IJ|;J|W\.', 'O', w) #'W.' is escaped so that the dot is matched literally
    
    #Remove final shewa (and final H), because it is silent (= absent) in the phonetic representation
    if complex_vowels:
        if complex_vowels[-1] in {':','H'}:
            complex_vowels = complex_vowels[:-1]

    letters = '[>BGDHWZXVJKkLMmNnS<pPYyQRCF#CFTA@E;:IOU]+' #accepted letters (Hebrew consonants and vowels)
    return len("".join(re.findall(letters, complex_vowels)))

def mapSP(sp):
    '''Function to map ETCBC POS-tags to tree tags'''
    mapping = {'verb':'v',
              'subs':'n',
              'prep':'p',
              'adjv':'aj',
              'intj':'itj'}
    
    return mapping[sp]

def segmentWord(w):
    '''This function segments word strings into morphological constituents. The function returns a segmented version
    of the word with parentheses representing nodes or leaves.'''

    string = ''
    
    word = re.sub('[ˈˌ]', '', F.phono.v(w)) #Removing accent markers because they are not accounted for in the lexical rep.
    word = word.replace(' . f', '').replace(' . s','') #Removing sof-passoq.
    
    #Identifying affixes:
    preformative = stripSigns(F.g_pfm.v(w))
    verbal_ending = stripSigns(F.g_vbe.v(w))
    pronominal_sfx = stripSigns(F.g_prs.v(w))
    
    #For some reason, a pro-sfx. may not have a 'g_prs' annotation. In those cases, 'prs' is used:
    if not pronominal_sfx and F.g_prs.v(w): 
        pronominal_sfx = len(F.prs.v(w))
    
    if F.trailer_utf8.v(w) not in {'-','־'}: 
        trailer = 0 #If trailer is a white space, ignore. 
    else:
        trailer = len(F.trailer_utf8.v(w)) #Length of trailer
    
    #1. Verbs in the suffix conjugation are handled independently:
    if F.vt.v(w) in {'perf','impv','ptca','ptcp'} and F.pdp.v(w) == 'verb':
        if not verbal_ending:
            verbal_ending = 1 #length of morpheme
            if pronominal_sfx or trailer:
                word = f'{word[:-pronominal_sfx-trailer]}Ø{word[-pronominal_sfx-trailer:]}'#Ø inserted between root and pro_sfx
            else:
                word = f'{word}Ø' #Ø inserted after verbal root
        
        root = word[:-verbal_ending-pronominal_sfx-trailer]
        string += f'({mapSP(F.sp.v(w))} {root}-)' #Write root
        string += f'(ARG {word[len(root):len(root)+verbal_ending]}' #Write verbal ending
    
    #2. Verbs in the prefix conjugation
    elif F.vt.v(w) in {'impf','wayq'}:
        if pronominal_sfx or trailer:
            root = word[preformative:-pronominal_sfx-trailer] #Get root
        else:
            root = word[preformative:]
            
        string += f'(ARG {word[:preformative]}-)' #Write preformative
        string += f'({mapSP(F.sp.v(w))} {root}' #Write root
    
    #3. Non-verbs, presumably with pronominal suffixes
    else: 
        string += f'({mapSP(F.sp.v(w))} {word[:-pronominal_sfx-trailer]}'
        
    if pronominal_sfx: #Finally, adding pronominal suffix
        string += f'-)(PRO {word[-pronominal_sfx-trailer:]})' #Write pronominal suffix
    else: #If no pro-sfx, a closing bracket and a possible trailer is written
        if trailer:
            string += f'{word[-trailer]})'
        else:
            string += ')'
        
    return string
    
segmentWord(1796)

'(v hᵊyôṯ-)(PRO ām)'
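
The slicing in `segmentWord` depends on the affix lengths returned by `stripSigns`. The following is a minimal standalone sketch of that length computation, runnable without the corpus. The function name `affix_length` and the sample strings are illustrative assumptions in the ETCBC transliteration style (they are not read from the database), and the sketch treats `W.` as a literal two-character sequence.

```python
import re

# Accepted letters (Hebrew consonants and vowels in ETCBC transliteration)
ACCEPTED = re.compile('[>BGDHWZXVJKkLMmNnS<pPYyQRCF#TA@E;:IOU]+')

def affix_length(affix):
    """Return the number of phonetically relevant letters in a transliterated affix."""
    # Complex vowels count as one letter
    affix = re.sub(r'OW|IJ|;J|W\.', 'O', affix)
    # A final shewa or H is dropped
    if affix and affix[-1] in {':', 'H'}:
        affix = affix[:-1]
    return len(''.join(ACCEPTED.findall(affix)))

# Illustrative inputs (assumed, not taken from the corpus)
for sample in ['!JI!', '[W.', '+@M', '[T:', '']:
    print(repr(sample), affix_length(sample))
```

Markers such as `!`, `[` and `+` fall outside the accepted letters and therefore do not count towards the length.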

## 3. Create trees

### 3a Select sentences to annotate

The sentences are picked semi-randomly. A few sentences are manually selected; the rest are drawn at random from a shuffled list of all sentences:

In [5]:
manual_selection = [1173012,
                   1217205,
                   1217206,
                   1217207,
                   1217208]

In [6]:
all_sentences = list(F.otype.s('sentence'))
print(f'Number of sentences in corpus: {len(all_sentences)}')

#Shuffle sentences
random.Random(4).shuffle(all_sentences)

Number of sentences in corpus: 63711


In [7]:
#First export: ~1000 sentences
first_sentences = all_sentences[:1000]
first_sentences += manual_selection
first_sentences = list(set(first_sentences))
first_sentences.sort()
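
The selection procedure can be tried out on stand-in data. This is a sketch only: `select_sentences` is a hypothetical helper, and the node numbers are made up rather than taken from the corpus. It shows that a seeded `random.Random` instance makes the sample reproducible and that the manual selection is merged in without duplicates.

```python
import random

def select_sentences(all_nodes, manual, k, seed=4):
    """Shuffle reproducibly, take the first k nodes, add the manual picks, and sort."""
    pool = list(all_nodes)
    random.Random(seed).shuffle(pool)  # same seed, same order on every run
    return sorted(set(pool[:k]) | set(manual))

# Made-up node numbers standing in for sentence nodes
nodes = range(1000, 1100)
manual = [1005, 1099]

selection = select_sentences(nodes, manual, k=10)
print(len(selection), selection)
```

Because the manual picks may or may not already be among the shuffled nodes, the final number of sentences can be slightly larger than `k`.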

### 3b Get raw tree

The raw trees are retrieved for formatting:

In [14]:
sentenceNode = 1172308
firstSlot = L.d(sentenceNode, 'word')[0]
rawTree = F.tree.v(sentenceNode)

### 3c Filling phrases

The phrase function is appended to each phrase type label (e.g., ```NP``` becomes ```NP-Subj```). The word numbers in the leaves are used to look up the phrase of each word:

In [15]:
numPattern = re.compile('[0-9]+')

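def fillPhrases(tree, start, phraseRep):
    '''Append a phrase representation (e.g., the phrase function) to the type label of each phrase in the tree.'''
    
    phrases = set() #Phrases that have already been labelled
    
    for n in numPattern.findall(tree):
        
        ph = L.u(int(n)+start, 'phrase')[0] #Phrase containing the word
        if ph not in phrases:
            #Find the last occurrence of the phrase type label before the word number and extend it
            mother = tree.rfind(F.typ.v(ph), 0, tree.index(n))
            tree = tree[:mother] + f'{F.typ.v(ph)}-{phraseRep(ph)}' + tree[len(F.typ.v(ph))+mother:]
            phrases.add(ph)

    return tree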

ph_functions = fillPhrases(rawTree, firstSlot, F.function.v)

In [16]:
#Phrase-formatted tree
ph_functions

'(S(C(CP-Conj(cj 0))(NP-Subj(U(n 1))(U(U(n 2))(U(dt 3)(n 4))))(AdvP-Time(n 5))(VP-Pred(vb 6))))'
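
To follow the label replacement step by step without loading the corpus, the lookup can be replaced by a small table. This is a sketch under assumptions: the raw tree is reconstructed from the outputs shown in this notebook, the phrase ids in `PHRASE_OF` are made up, and `fill_phrases` mirrors the logic of `fillPhrases` rather than replacing it.

```python
import re

NUM = re.compile(r'[0-9]+')

# Stand-in for the corpus lookups: word number -> (phrase id, phrase type, phrase function).
# The phrase ids are invented; types and functions follow the output shown above.
PHRASE_OF = {
    0: ('ph0', 'CP', 'Conj'),
    1: ('ph1', 'NP', 'Subj'),
    2: ('ph1', 'NP', 'Subj'),
    3: ('ph1', 'NP', 'Subj'),
    4: ('ph1', 'NP', 'Subj'),
    5: ('ph2', 'AdvP', 'Time'),
    6: ('ph3', 'VP', 'Pred'),
}

def fill_phrases(tree, phrase_of):
    """Extend each phrase type label once with its function."""
    seen = set()
    for n in NUM.findall(tree):
        ph, typ, function = phrase_of[int(n)]
        if ph in seen:
            continue
        # Last occurrence of the type label before the word number
        mother = tree.rfind(typ, 0, tree.index(n))
        tree = tree[:mother] + f'{typ}-{function}' + tree[mother + len(typ):]
        seen.add(ph)
    return tree

raw = '(S(C(CP(cj 0))(NP(U(n 1))(U(U(n 2))(U(dt 3)(n 4))))(AdvP(n 5))(VP(vb 6))))'
labelled = fill_phrases(raw, PHRASE_OF)
print(labelled)
```

Words 1 to 4 belong to the same phrase, so the `NP` label is extended only once, when word 1 is reached.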

### 3d Filling the leaves

The next step is to fill the leaves. The word segmentation functions are already given above. Now the functions are called, but only in specific cases: words are segmented only if they have a pronominal suffix (```prs```) or are finite verbs or participles. The copula *HJH* is never segmented into morphemes as it is treated as an auxiliary in RRG.

In [12]:
#Identifying all cases of the copula HJH
HJH = '''
clause
  phrase function=PreC
  phrase
   word lex=HJH[
'''

HJH = A.search(HJH)
exclude_HJH = [r[3] for r in HJH]

  1.32s 2196 results


In [None]:
#Accepted verbal tenses for word segmentation
tenses = {'wayq','perf','impf','impv','ptca','ptcp'}

Formatting:

In [17]:
numPattern = re.compile('[0-9]+')

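def fillWords(tree, start, wordRep):
    '''Replace each word number in the tree with the representation returned by wordRep.'''
    def numReplace(match):
        return wordRep(int(match.group(0)) + start)
    return numPattern.sub(numReplace, tree)

def phonoGloss(n):
    '''Return the segmented word if it has a pronominal suffix or is a verb in one of the accepted tenses
    (except the instances of HJH collected in exclude_HJH); otherwise return the plain phonetic representation.'''

    if F.prs.v(n) not in {'none','N/A','n/a','nan','absent'} or (F.vt.v(n) in tenses and F.pdp.v(n) == 'verb' and n not in exclude_HJH):
        return segmentWord(n)
    else:
        if F.trailer_utf8.v(n) == '':
            return f'{F.phono.v(n)}-'
        return F.phono.v(n)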

fillWords(rawTree, firstSlot, phonoGloss).replace(' . f', '').replace(' . s','')

'(S(C(CP(cj wᵊ-))(NP(U(n ḵol-))(U(U(n ʕˌēśev))(U(dt ha-)(n śśāḏˌeh))))(AdvP(n ṭˈerem))(VP(vb (ARG yi-)(v ṣmāḥ)))))'
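
The replacement of leaf numbers relies on `re.sub` with a function as the replacement argument. A minimal sketch of that mechanism, assuming a shortened tree and hypothetical slot numbers (100 to 102) in place of real corpus slots:

```python
import re

NUM = re.compile(r'[0-9]+')

def fill_words(tree, start, word_rep):
    """Replace every relative word number with word_rep(number + start)."""
    return NUM.sub(lambda m: word_rep(int(m.group(0)) + start), tree)

# Hypothetical slots 100..102 with phonetic forms taken from the output above
WORDS = {100: 'wᵊ-', 101: 'ḵol-', 102: 'ʕˌēśev'}

tree = '(S(C(CP(cj 0))(NP(U(n 1))(U(n 2)))))'  # shortened tree for illustration
filled = fill_words(tree, 100, WORDS.get)
print(filled)
```

Each match is converted to an absolute slot by adding `start`, so the same raw tree format works for every sentence regardless of where it begins in the corpus.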

### 3e Multiline display

In [18]:
print(layout(structure(ph_functions), firstSlot, phonoGloss, withLevel=True))

 1  S
 2    C
 3      CP-Conj
 4        cj wᵊ-
 3      NP-Subj
 4        U
 5          n ḵol-
 4        U
 5          U
 6            n ʕˌēśev
 5          U
 6            dt ha-
 6            n śśāḏˌeh
 3      AdvP-Time
 4        n ṭˈerem
 3      VP-Pred
 4        vb (ARG yi-)(v ṣmāḥ)


In [19]:
print(layout(structure(rawTree), firstSlot, phonoGloss, withLevel=True))

 1  S
 2    C
 3      CP
 4        cj wᵊ-
 3      NP
 4        U
 5          n ḵol-
 4        U
 5          U
 6            n ʕˌēśev
 5          U
 6            dt ha-
 6            n śśāḏˌeh
 3      AdvP
 4        n ṭˈerem
 3      VP
 4        vb (ARG yi-)(v ṣmāḥ)
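
The `structure` and `layout` helpers come from the local `utils` module and are not shown in this notebook. As a rough idea of how such an indented display can be produced, here is a simplified sketch. It is an assumption-laden stand-in, not the `utils` implementation: `indent_tree` works directly on a bracketed string whose leaves are still numbers, and it would split leaves that themselves contain parentheses.

```python
def indent_tree(tree):
    """Print-ready indented view of a bracketed tree, one node per line with its level."""
    lines = []
    depth = 0
    label = ''
    for ch in tree:
        if ch in '()':
            text = label.strip()
            if text:
                indent = '  ' * (depth - 1)
                lines.append(f'{depth:>2}  {indent}{text}')
            label = ''
            depth += 1 if ch == '(' else -1
        else:
            label += ch
    return '\n'.join(lines)

print(indent_tree('(S(C(CP(cj 0))(VP(vb 1))))'))
```

Every opening bracket starts a new node one level deeper, and every closing bracket returns to the parent level.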


## 4. Postprocessing

Some sentences end with the sof-passoq, which is transliterated as ``` . f``` or ``` . s```. These accent marks are removed.

In [20]:
for s in first_sentences:
    rawTree = F.tree.v(s)
    firstSlot = L.d(s, 'word')[0]
    with_accent = fillWords(fillPhrases(rawTree, firstSlot, F.function.v), firstSlot, phonoGloss)
    no_accent = with_accent.replace(' . f', '').replace(' . s', '')
    
    if with_accent != no_accent:
        print(s, no_accent)

1172298 (S(C(NP-PreC(U(n yˌôm))(U(dt ha-)(aj ššiššˈî)))))
1172868 (S(C(CP-Conj(cj û-))(PP-Loca(pp mi-)(av ššˌām))(VP-PreO(vb (v hᵉfîṣ-)(ARG Ø-)(PRO ām)))(PrNP-Subj(n-pr [yᵊhwˈāh]))(PP-Cmpl(pp ʕal-)(U(n pᵊnˌê))(U(U(n kol-))(U(dt hā-)(n ʔˈāreṣ))))))
1176716 (S(C(PrNP-Frnt(n-pr gˌāḏ)))(Cresu(Ccoor(NP-Subj(n gᵊḏˈûḏ))(VP-PreO(vb (ARG yᵊ-)(v ḡûḏenn-)(PRO û))))(Ccoor(CP-Conj(cj wᵊ-))(PPrP-Subj(pr-ps hˌû))(VP-Pred(vb (ARG yā-)(v ḡuḏ)))(NP-Objc(n ʕāqˈēv)))))
1177396 (S(C(CP-Conj(cj wᵊ-))(NegP-Nega(ng lˌō))(VP-Pred(vb (v šāmaʕ-)(ARG Ø)))(PP-Cmpl(pp (p ʔᵃlē-)(PRO hem)))(Cadju(CP-Conj(pp ka-)(cj ʔᵃšˌer))(VP-Pred(vb (v dibber-)(ARG Ø)))(PrNP-Subj(n-pr [yᵊhwˈāh])))))
1182480 (S(C(Cadju(CP-Conj(pp ka-)(cj ʔᵃšˌer))(VP-Pred(vb (v dibber-)(ARG Ø)))(PrNP-Subj(n-pr [yᵊhwˌāh]))(PP-Cmpl(pp ʔel-)(n-pr mōšˈeh)))(AdvP-Modi(av kˌēn))(VP-Pred(vb (v ʕāś-)(ARG û)))(NP-Subj(U(n bᵊnˌê))(U(n-pr yiśrāʔˈēl)))))
1182666 (S(C(DPrP-Subj(pr-dem zˈeh))(NP-PreC(U(n qorbˌan))(U(n-pr gamlîʔˌēl))(U(n ben-))(U(n-pr pᵊḏāhṣˈûr))))

The sentences can successfully be cleaned by removing these punctuation marks.

## 5. Test word segmentation

The following code allows a visual check of the word segmentations for validation:

In [None]:
check_words = collections.defaultdict(list)
for s in first_sentences:
    
    for w in L.d(s, 'word'):
        if F.prs.v(w) not in {'none','N/A','n/a','nan','absent'} or (F.vt.v(w) in tenses and F.pdp.v(w) == 'verb'):
            check_words[F.phono.v(w)].append(w)
            
len(check_words)

For simplicity, we will only check one instance of each representation:

In [None]:
check_list = [check_words[n][0] for n in check_words]
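
The grouping step can be illustrated on stand-in data. In this sketch, `first_instances` is a hypothetical helper and the (node, representation) pairs are made up; it shows how a `defaultdict` collects all nodes per representation and how only the first node of each group is kept.

```python
import collections

def first_instances(words):
    """Given (node, representation) pairs, return the first node for each representation."""
    groups = collections.defaultdict(list)
    for node, rep in words:
        groups[rep].append(node)
    # Dictionaries keep insertion order, so the result follows first appearance
    return [nodes[0] for nodes in groups.values()]

# Made-up nodes and representations
sample = [(1, 'a'), (2, 'b'), (3, 'a'), (4, 'c'), (5, 'b')]
print(first_instances(sample))  # [1, 2, 4]
```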

In [None]:
def show(n):
    print(n, segmentWord(n))
    A.pretty(n)

In [None]:
n=0

In [None]:
print(n)
show(check_list[n])
n+=1

## 6. Export trees

Having settled on a layout, we can now export the trees as a text file.

First, however, we want to filter out those sentences which have been found corrupted in the sanity test (cf. [Sanity_tests.ipynb](Sanity_tests.ipynb)):

In [None]:
corrupted_sentences = [1176197,
 1178797,
 1182766,
 1184652,
 1185036,
 1187217,
 1187686,
 1187914,
 1188003,
 1190665,
 1192881,
 1194836,
 1201271,
 1202352,
 1203512,
 1205405,
 1208029,
 1208691,
 1208830,
 1209662,
 1213205,
 1216647,
 1217208,
 1217763,
 1219985,
 1221948,
 1222908,
 1225430,
 1229092,
 1230004,
 1230106,
 1232963,
 1233972,
 1234606,
 1235787]

In [None]:
export_sentences = [s for s in first_sentences if s not in corrupted_sentences]
len(export_sentences)

In [None]:
with open('bhsa_trees_5.txt', 'w', encoding='utf-8') as f:
    for s in first_sentences:
        if s not in corrupted_sentences:
            bo, ch, ve = T.sectionFromNode(s)
            rawTree = F.tree.v(s)
            firstSlot = L.d(s, 'word')[0]
            tree = fillWords(fillPhrases(rawTree, firstSlot, F.function.v), firstSlot, phonoGloss)
            tree = tree.replace(' . f', '').replace(' . s', '') #Remove sof-passoq
            f.write(f'{bo[:3]}_{ch}:{ve}_{s}\t{tree}\n')
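
Each exported line consists of a reference label, a tab, and the tree. The sketch below isolates that format so it can be checked without the corpus. `export_line` is a hypothetical helper, and the book, chapter, verse and node values are placeholders rather than real corpus references.

```python
def export_line(book, chapter, verse, node, tree):
    """Build one line of the export file: '<Boo>_<chapter>:<verse>_<node>', a tab, and the tree."""
    return f'{book[:3]}_{chapter}:{verse}_{node}\t{tree}\n'

# Placeholder values for illustration only
line = export_line('Genesis', 1, 1, 123, '(S(C(VP-Pred(vb 0))))')
print(repr(line))
```

Including the sentence node in the label keeps every line traceable to the database even when several sentences share the same verse.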