# Export BHSA Data for Processing

Use Text-Fabric to export a time-expression (timex) dataset and a verb dataset.

## Selections for Time Expressions
* **Select phrases with the function of `Time`.**
    * Exclude phrases that occur in clauses with multiple time-function phrases.
    <br><br>
* **Features of the phrase**
    * lexeme/count of prepositions, substantives, and quantifiers (cardinal numbers).
    * part of speech patterns within the phrase
    <br><br>
* **Features of the enclosing clause**
    * domain
    * lex and tense of verb if present
    * position of phrase in relation to the verb 
    * position of phrase in relation to the clause
    * main vs. subordinate
    
## Selections for Verbs

To compare the time-phrase dataset with the broader tendencies of verbs in the Hebrew Bible (HB), a second export contains:
* **Features of verbs in the HB**
    * verb must be in a predicate phrase
    * tense and lexeme
    <br><br>
* **Features of the enclosing clause**
    * domain
    * position of verb's phrase in relation to the clause
    * main vs. subordinate

<hr>

In [43]:
from tf.fabric import Fabric

In [44]:
import sys
import csv

In [45]:
# put custom functions in path
if __name__ == '__main__' and __package__ is None:
    sys.path.append('../')

In [56]:
TF = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c')

This is Text-Fabric 3.0.9
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored


In [57]:
api = TF.load('''
                 book chapter verse
                 function kind
                 pdp typ rela ls
                 domain vt lex
                 number
              ''')

api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.01s B book                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.01s B chapter              from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.01s B verse                from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.17s B function             from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.05s B kind                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.21s B pdp                  from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.46s B typ                  from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.39s B rela                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.17s B ls                   from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.03s B domain               from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.21s B vt                   from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.23s B lex                  from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.54s B number            

In [58]:
# custom functions to identify weqatal verbs and substantives
from my_functions.verbs import is_weqt 
from my_functions.phrases import is_subs

## Collect Time Phrases

Collect every phrase with the function `Time`, skipping any whose clause contains more than one time phrase.

In [59]:
timexes = []

for phrase in F.function.s('Time'):
    
    # skip phrases with >1 TP in its clause
    clause = L.u(phrase, otype='clause')[0]
    phrase_functs = [F.function.v(ph) for ph in L.d(clause, otype='phrase')]
    if phrase_functs.count('Time') > 1:
        continue
        
    # append the relevant data
    timexes.append((clause, phrase))

print(len(timexes), 'time expressions ready for processing...')

3826 time expressions ready for processing...
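
The filter above can be illustrated without Text-Fabric. The following is a minimal sketch, assuming the phrase data is available as plain `(clause, phrase, function)` tuples; the node numbers and the `phrases` list are invented for illustration and are not taken from the BHSA.

```python
from collections import Counter

# Hypothetical stand-in data: (clause id, phrase id, phrase function).
phrases = [
    (1, 10, 'Time'), (1, 11, 'Pred'),
    (2, 20, 'Time'), (2, 21, 'Time'), (2, 22, 'Pred'),
    (3, 30, 'Subj'), (3, 31, 'Time'),
]

# Count the Time phrases in each clause.
time_counts = Counter(cl for cl, ph, fn in phrases if fn == 'Time')

# Keep a Time phrase only when it is the sole Time phrase in its clause.
kept = [(cl, ph) for cl, ph, fn in phrases
        if fn == 'Time' and time_counts[cl] == 1]

print(kept)
```

Clause 2 holds two time phrases, so both are dropped; clauses 1 and 3 each contribute one `(clause, phrase)` pair, mirroring the shape of `timexes`.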


## Process Time Phrases

Retrieve the phrase/clause level information.

In [60]:
# header for csv export; it also fixes the column order of each row
header = ['book', 'chapter', 'verse', 'clause.n', 'phrase.n', 
          'phrase.trans', 'preps', 'num.preps', 'subs',
          'num.subs', 'quants', 'num.quants', 'pdp.pattern',
          'domain', 'verb.lex', 'verb.tense', 'position.at.vb',
          'position.at.cl', 'cl.dependency']

# put data here for csv export
rows = []

# get row data per time phrase
for clause_n, phrase_n in timexes:

    book, chapter, verse = T.sectionFromNode(clause_n)
        
        
    # // phrase data //
    
    # transcription
    ph_words = L.d(phrase_n, otype='word')
    ph_trans = T.text(ph_words, fmt='lex-trans-plain').strip()
    
    # substantives
    subs = [F.lex.v(w) for w in ph_words if is_subs(w)] # w/ custom funct. is_subs()
    num_subs = len(subs)
    subs_txt = '|'.join(subs) or 'ø'
    
    # prepositions
    preps = [F.lex.v(w) for w in ph_words if F.pdp.v(w) == 'prep']
    num_preps = len(preps)
    preps_txt = '|'.join(preps) or 'ø'
    
    # quantities (card == "cardinal number")
    quants = [F.lex.v(w) for w in ph_words if F.ls.v(w) == 'card']
    num_quants = len(quants)
    quants_txt = '|'.join(quants) or 'ø'
    
    # phrase dep. part of speech pattern
    pdp_pattern = '-'.join(F.pdp.v(w) for w in L.d(phrase_n, otype='word'))
    
    
    # // clause data // 
    
    domain = F.domain.v(clause_n)
        
    # get verb lex and its tense
    pred_functs = {'Pred', 'PreO', 'PreS', 'PtcO'}
    verb = [word for phrase in L.d(clause_n, otype='phrase')
               for word in L.d(phrase, otype='word')
               if F.pdp.v(word) == 'verb' 
               and F.function.v(phrase) in pred_functs
           ]
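    if verb:
        verb_n = verb[0]
        verb_lex = F.lex.v(verb_n)
        verb_tense = F.vt.v(verb_n) if not is_weqt(verb_n) else 'weqt' # relabel weqatal
        # time-phrase position relative to the verb (1 = before, 2 = after);
        # compare word slots, since a phrase node number always exceeds a word slot
        pos_at_vb = 1 if ph_words[0] < verb_n else 2
    else:
        # no verb in the clause: nothing to measure the position against
        verb_lex = 'ø'
        verb_tense = 'ø'
        pos_at_vb = 'ø'
        
    # time-phrase position within the clause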
    pos_at_cl = F.number.v(phrase_n)
    
    # clause dependency
    cl_dependency = 'I' if F.rela.v(clause_n) == 'NA' else 'D'
    
    
    # // ship it //
    
    # package the row and append it
    row = [book, chapter, verse, clause_n, phrase_n,
           ph_trans, preps_txt, num_preps, subs_txt,
           num_subs, quants_txt, num_quants, pdp_pattern,
           domain, verb_lex, verb_tense, pos_at_vb,
           pos_at_cl, cl_dependency]
           
    rows.append(row)
    
print(len(rows), 'rows ready for export...')
print()
print('sample: ', rows[100])

3826 rows ready for export...

sample:  ['Genesis', 19, 34, 429263, 656759, '>MC', 'ø', 0, 'ø', 0, 'ø', 0, 'advb', 'Q', 'CKB[', 'perf', 2, 3, 'I']
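
The phrase-level columns all follow one pattern: collect the matching lexemes, count them, and join them with `|`, falling back to `ø` when nothing matches. The sketch below shows that pattern on plain dictionaries, under stated assumptions: the three `words` are an illustrative stand-in for a prepositional time phrase, and the helper `summarize` is hypothetical and not part of the notebook.

```python
def summarize(values):
    """Return ('a|b|c', count); 'ø' marks an empty list."""
    return '|'.join(values) or 'ø', len(values)

# Hypothetical stand-in for the words of one phrase (not read from the BHSA).
words = [
    {'lex': 'B', 'pdp': 'prep', 'ls': 'none'},
    {'lex': 'H', 'pdp': 'art', 'ls': 'none'},
    {'lex': 'JWM/', 'pdp': 'subs', 'ls': 'none'},
]

preps_txt, num_preps = summarize([w['lex'] for w in words if w['pdp'] == 'prep'])
quants_txt, num_quants = summarize([w['lex'] for w in words if w['ls'] == 'card'])
pdp_pattern = '-'.join(w['pdp'] for w in words)

print(preps_txt, num_preps, quants_txt, num_quants, pdp_pattern)
```

The `or 'ø'` fallback works because joining an empty list yields an empty string, which is falsy in Python.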


## Export Timex Data to CSV

In [61]:
# Export the .csv
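# newline='' is what the csv module expects; utf-8 keeps the 'ø' placeholder intact
with open('time_phrases.csv', 'w', newline='', encoding='utf-8') as outfile: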
    
    writer = csv.writer(outfile)
    
    writer.writerow(header)
    writer.writerows(rows)
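
A quick way to sanity-check an export like this one is to read it back with `csv.DictReader`. The sketch below uses an in-memory buffer instead of the real file, and a shortened header with a single row built from values in the sample above; it is an illustration only, not part of the original notebook.

```python
import csv
import io

# Shortened stand-in for the real header and rows.
header = ['book', 'chapter', 'verse', 'verb.tense']
rows = [['Genesis', 19, 34, 'perf']]

# Write to an in-memory buffer the same way the export writes to a file.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

# Read the data back as dictionaries keyed by column name.
buffer.seek(0)
records = list(csv.DictReader(buffer))

print(records[0])
```

Every value comes back as a string, so numeric columns such as `chapter` and `num.preps` need converting before any arithmetic.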

## Collect and Process Verb Data


In [62]:
header = ['book', 'chapter', 'verse', 'clause.n', 
          'verb.n', 'lexeme', 'tense', 'domain', 
          'position.at.cl', 'cl.dependency']

rows = []

# gather verb data
for verb in F.pdp.s('verb'):

    book, chapter, verse = T.sectionFromNode(verb)
    
    # skip verbs without a predicate phrase function
    phrase = L.u(verb, otype='phrase')[0]
    pred_functs = {'Pred', 'PreO', 'PreS', 'PtcO'}
    if F.function.v(phrase) not in pred_functs:
        continue
        
    # verb level data
    tense = F.vt.v(verb)
    lex = F.lex.v(verb)
    
    # clause level data
    clause_n = L.u(verb, otype='clause')[0]
    domain = F.domain.v(clause_n)
    pos_at_cl = F.number.v(phrase)
    cl_dependency = 'I' if F.rela.v(clause_n) == 'NA' else 'D'

    # package and save
    row = [book, chapter, verse, clause_n, verb, 
           lex, tense, domain, pos_at_cl, 
           cl_dependency]
    
    rows.append(row)
    
print(len(rows), 'rows ready for export...')
print()
print('sample: ', rows[124])

63522 rows ready for export...

sample:  ['Genesis', 2, 15, 427730, 953, 'NWX[', 'wayq', 'N', 2, 'I']


## Export Verb Data to CSV

In [63]:
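# newline='' is what the csv module expects; utf-8 matches the timex export
with open('verbs.csv', 'w', newline='', encoding='utf-8') as outfile: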
    
    writer = csv.writer(outfile)
    
    writer.writerow(header)
    writer.writerows(rows)
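
The two exports are meant to be compared, for example by setting the tense distribution of clauses with a time phrase against that of all predicate verbs. The following is a rough sketch of such a comparison; the helper `tense_proportions` and the two short tense lists are hypothetical and do not reflect real counts.

```python
from collections import Counter

def tense_proportions(tenses):
    """Map each tense label to its share of the list."""
    counts = Counter(tenses)
    total = sum(counts.values())
    return {tense: n / total for tense, n in counts.items()}

# Hypothetical tense columns; real values would come from the two CSV files.
timex_tenses = ['perf', 'perf', 'impf', 'wayq']
verb_tenses = ['wayq', 'wayq', 'perf', 'impf']

timex_props = tense_proportions(timex_tenses)
verb_props = tense_proportions(verb_tenses)

# Print the two shares side by side for every tense seen in either dataset.
for tense in sorted(set(timex_props) | set(verb_props)):
    print(tense, timex_props.get(tense, 0.0), verb_props.get(tense, 0.0))
```

Two details of the exports matter for a real comparison: the time-phrase export uses `ø` for clauses without a verb, and it relabels weqatal verbs as `weqt`, while the verb export keeps the raw `vt` value. The label sets may therefore need aligning first.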