# Sanity tests

The tool that generates the BHSA trees does not handle sentences with embedded clauses well: embedded clauses appear to be moved to the end of the tree representation.

Therefore, the sanity tests provided here check whether the leaves (nodes) of each syntactic tree are in canonical (ascending) order, and whether the sequence of leaves has gaps.

In [2]:
import random
import re

from utils import structure, layout

from tf.app import use

In [4]:
A = use('bhsa', version=2017, hoist=globals(), mod='etcbc/lingo/trees/tf')

### Example

Zechariah 4:10 contains a sentence in which an embedded personal pronoun phrase (`PPrP`, leaf 4) has been moved to the end of the tree, after leaves 5 to 9:

In [8]:
passage = ('Zechariah',4,10)
sentenceNode = L.d(T.nodeFromSection(passage), 'sentence')[3]
print(sentenceNode)
firstSlot = L.d(sentenceNode, 'word')[0]
rawTree = F.tree.v(sentenceNode)

print(f'Passage: {passage}\nFirst slot: {firstSlot}\n{rawTree}')

1217208
Passage: ('Zechariah', 4, 10)
First slot: 306071
(S(C(NP(U(n 0))(U(pr-dem 1))))(Cresu(NP(U(n 2))(U(n-pr 3))(Cattr(VP(vb 5))(PP(pp 6)(U(n 7))(U(dt 8)(n 9)))))(PPrP(pr-ps 4))))
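
Reading the leaf numbers off this tree makes the displacement explicit. The following is a minimal sketch that reuses the digit pattern of the sanity test on the tree string printed above; the variable names are illustrative only, and it assumes that the labels in the tree contain no digits.

```python
import re

# Tree string copied from the output above (Zechariah 4:10, sentence node 1217208)
raw_tree = (
    '(S(C(NP(U(n 0))(U(pr-dem 1))))(Cresu(NP(U(n 2))(U(n-pr 3))'
    '(Cattr(VP(vb 5))(PP(pp 6)(U(n 7))(U(dt 8)(n 9)))))(PPrP(pr-ps 4))))'
)

# Leaf numbers in the order in which they occur in the tree
leaves = [int(n) for n in re.findall(r'[0-9]+', raw_tree)]

print(leaves)                    # [0, 1, 2, 3, 5, 6, 7, 8, 9, 4]
print(leaves == sorted(leaves))  # False: leaf 4 comes after leaf 9
```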


In [9]:
A.pretty(1217208)

## Sanity test

We test for two factors:
1. Are the nodes of the tree ordered?
2. Is there a gap in the sequence of nodes?
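
The two questions can be illustrated on toy leaf sequences. This is only a sketch: the helper names `is_ordered` and `has_gap` are hypothetical and do not occur in the notebook, and the lists are made-up examples rather than BHSA data. Both helpers assume a non-empty list of leaf numbers.

```python
def is_ordered(leaves):
    """True if the leaf numbers occur in strictly ascending order."""
    return all(a < b for a, b in zip(leaves, leaves[1:]))


def has_gap(leaves):
    """True if some number between the smallest and largest leaf is missing."""
    return len(set(leaves)) != max(leaves) - min(leaves) + 1


# Made-up leaf sequences, one per situation
examples = {
    'ordered, no gap': [0, 1, 2, 3, 4],
    'displaced leaf': [0, 1, 3, 4, 2],
    'missing leaf': [0, 1, 3, 4],
}

for label, leaves in examples.items():
    print(f'{label}: ordered={is_ordered(leaves)}, gap={has_gap(leaves)}')
```

Keeping the two checks separate shows which of the two problems a tree has; the sanity test itself only needs to know whether either one occurs.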

In [10]:
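# Collect the leaf numbers of every sentence tree, in tree order
tree_dict = {}

# Assumes that the only digits in a tree string are leaf numbers
numPattern = re.compile(r'[0-9]+')
for s in F.otype.s('sentence'):
    rawTree = F.tree.v(s)
    node_list = [int(n) for n in numPattern.findall(rawTree)]
    tree_dict[s] = node_list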

In [12]:
# Sanity checks
mismatches = set()

for s, node_list in tree_dict.items():

    first_node = node_list[0]
    last_node = node_list[-1]

    # 1. Are the nodes in order?
    if node_list != sorted(node_list):
        mismatches.add(s)

    # 2. Is there a gap in the tree? An ordered tree without gaps has
    #    exactly last_node - first_node + 1 leaves.
    elif (last_node - first_node + 1) != len(node_list):
        mismatches.add(s)

print(f'Number of trees in wrong order or with gaps: {len(mismatches)}')

Number of trees in wrong order or with gaps: 1999


The mismatches include every sentence that is interrupted by one or more embedded sentences, as the following query shows:

In [11]:
query = '''
s:sentence
  PreGap:sentence_atom
  Last:sentence_atom
  :=

Gap:sentence_atom
PreGap <: Gap
Gap < Last

s || Gap
'''

results = A.search(query)

  6.65s 775 results


In [21]:
print(f'Number of sentence gaps found among the mismatches: {len([r[0] for r in results if r[0] in mismatches])}')

Number of sentence gaps found among the mismatches: 775
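
The count above is taken over result tuples, and a sentence with several gaps might appear in more than one tuple. A complementary check on distinct sentences could look like the sketch below. The data are hypothetical toy values standing in for `results` and `mismatches`; the tuple layout (sentence first) follows the order of the query, and the other names are illustrative only.

```python
# Hypothetical stand-ins: (s, PreGap, Last, Gap) tuples and mismatching sentences
results = [(101, 11, 13, 12), (101, 14, 16, 15), (205, 21, 23, 22)]
mismatches = {101, 205, 309}

# Distinct sentences that have at least one gap
gap_sentences = {r[0] for r in results}

# Gap sentences that the sanity test caught, and mismatches without a gap
explained = gap_sentences & mismatches
unexplained = mismatches - gap_sentences

print(f'Distinct gap sentences: {len(gap_sentences)}')
print(f'Gap sentences among the mismatches: {len(explained)}')
print(f'Mismatches without a sentence gap: {len(unexplained)}')
```

On the real data, the last figure would show how many mismatches have a cause other than an embedded sentence.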


#### How many trees in the sample of 1000 sentences are affected?

In [14]:
manual_selection = [1173012,
                   1217205,
                   1217206,
                   1217207,
                   1217208]

In [15]:
all_sentences = list(F.otype.s('sentence'))
print(f'Number of sentences in corpus: {len(all_sentences)}')

# Shuffle sentences with a fixed seed, so that the sample is reproducible
random.Random(4).shuffle(all_sentences)

Number of sentences in corpus: 63711


In [16]:
#First export: ~1000 sentences
first_sentences = all_sentences[:1000]
first_sentences += manual_selection
first_sentences = list(set(first_sentences))
first_sentences.sort()
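
The sampling steps can be sketched on toy data to show why the sample is reproducible and why its size is only approximately fixed. The helper `draw_sample` and the node numbers are hypothetical and not part of the notebook; the sketch assumes only the standard library `random.Random(seed).shuffle`.

```python
import random

# Hypothetical stand-ins for the sentence nodes and the manual selection
sentences = list(range(100, 120))
manual = [100, 105]


def draw_sample(nodes, extra, size, seed):
    """Shuffle with a fixed seed, take the first `size` nodes, add `extra`."""
    shuffled = list(nodes)                 # copy, so the input is not reordered
    random.Random(seed).shuffle(shuffled)  # same seed, same order on every run
    return sorted(set(shuffled[:size] + extra))


sample_a = draw_sample(sentences, manual, size=5, seed=4)
sample_b = draw_sample(sentences, manual, size=5, seed=4)

print(sample_a == sample_b)  # the same seed yields the same sample
print(len(sample_a))         # 5 to 7, depending on overlap with `manual`
```

Deduplicating with `set` is what makes the final size approximate: manually selected nodes that were already drawn are not counted twice.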

In [17]:
affected_sentences = [s for s in first_sentences if s in mismatches]
len(affected_sentences)

35

In [19]:
affected_sentences

[1176197,
 1178797,
 1182766,
 1184652,
 1185036,
 1187217,
 1187686,
 1187914,
 1188003,
 1190665,
 1192881,
 1194836,
 1201271,
 1202352,
 1203512,
 1205405,
 1208029,
 1208691,
 1208830,
 1209662,
 1213205,
 1216647,
 1217208,
 1217763,
 1219985,
 1221948,
 1222908,
 1225430,
 1229092,
 1230004,
 1230106,
 1232963,
 1233972,
 1234606,
 1235787]