# Export BHSA Data for Processing

Use Text-Fabric to export a time-expression (timex) dataset and a verb dataset.

In all cases, the data is only selected from Biblical Hebrew.

## 1. Dataset for Time Expressions
* **Select phrases with the function of `Time`.**
    * Exclude phrases that occur in clauses with multiple time-function phrases.
    <br><br>
* **Features of the phrase**
    * lexeme/count of prepositions, substantives, and quantifiers (cardinal numbers).
    * part of speech patterns within the phrase
    <br><br>
* **Features of the enclosing clause**
    * domain
    * lex and tense of verb if present
    * position of phrase in relation to the verb 
    * position of phrase in relation to the clause
    * main vs. subordinate
    
## 2. Dataset for Verbs

In order to compare the time phrase dataset with the broader tendencies of verbs in the Hebrew Bible, another export contains:
* **Features of verbs in the HB**
    * verb must be in a predicate phrase
    * tense and lexeme
    <br><br>
* **Features of the enclosing clause**
    * domain
    * position of verb's phrase in relation to the clause
    * main vs. subordinate
    
## 3. Dataset for All Clauses
* **A simple export where each row represents a clause, has a T/F value for the presence of a time phrase**.
    * Include book, chapter, and verse.

<hr>

In [1]:
from tf.fabric import Fabric

In [2]:
from os import sys, path
import csv

In [3]:
# put custom functions in path
if __name__ == '__main__' and __package__ is None:
    sys.path.append('../')

In [4]:
TF = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c')

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored


In [5]:
api = TF.load('''
                 book chapter verse
                 function kind
                 pdp sp typ rela ls
                 domain vt lex
                 number language
              ''')

api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.01s B book                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.01s B chapter              from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.01s B verse                from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.08s B function             from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.03s B kind                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.19s B pdp                  from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.17s B sp                   from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.24s B typ                  from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.23s B rela                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.14s B ls                   from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.02s B domain               from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.13s B vt                   from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.15s B lex               

In [6]:
# custom function to id weqetal verbs and for subject isolation
from my_functions.verbs import is_weqt 
from my_functions.phrases import is_subs

## 1. Dataset for Time Phrases

The time phrase cannot occur inside a clause with more than 1 time phrase.

In [7]:
timexes = []

for phrase in F.function.s('Time'):
    
    # skip non-hebrew phrases
    language = F.language.v(L.d(phrase, otype='word')[0])
    if language != 'hbo':
        continue
    
    # skip phrases with >1 TP in its clause
    clause = L.u(phrase, otype='clause')[0]
    phrase_functs = [F.function.v(ph) for ph in L.d(clause, otype='phrase')]
    if phrase_functs.count('Time') > 1:
        continue
        
    # append the relevant data
    timexes.append((clause, phrase))

print(len(timexes), 'time expressions ready for proccessing...')

3712 time expressions ready for proccessing...


### Process Time Phrases

Retrieve the phrase/clause level information.

In [8]:
# header for csv export & and how data should be assembled
header = ['book', 'chapter', 'verse', 'clause.n', 'phrase.n', 
          'phrase.trans', 'preps', 'num.preps', 'subs',
          'num.subs', 'quants', 'num.quants', 'advbs', 
          'num.advbs','pdp.pattern', 'sp.pattern', 'domain', 
          'verb.lex', 'verb.tense', 'position.at.vb', 
          'position.at.cl', 'cl.dependency']

# put data here for csv export
rows = []

# get row data per time phrase
for clause_n, phrase_n in timexes:

    book, chapter, verse = T.sectionFromNode(clause_n)
        
        
    # // phrase data //
    
    # transcription
    ph_words = L.d(phrase_n, otype='word')
    ph_trans = T.text(ph_words, fmt='lex-trans-plain').strip()
    
    # substantives
    subs = [F.lex.v(w) for w in ph_words if is_subs(w)] # w/ custom funct. is_subs()
    num_subs = len(subs)
    subs_txt = '|'.join(subs) or 'ø'
    
    # prepositions
    preps = [F.lex.v(w) for w in ph_words if F.sp.v(w) == 'prep']
    num_preps = len(preps)
    preps_txt = '|'.join(preps) or 'ø'
    
    # quantities (card == "cardinal number")
    quants = [F.lex.v(w) for w in ph_words if F.ls.v(w) == 'card']
    num_quants = len(quants)
    quants_txt = '|'.join(quants) or 'ø'
    
    # adverbs
    advbs = [F.lex.v(w) for w in ph_words if F.sp.v(w) == 'advb']
    num_advbs = len(advbs)
    advbs_txt = '|'.join(advbs) or 'ø'
    
    # part of speech patterns
    pdp_pattern = '-'.join(F.pdp.v(w) for w in L.d(phrase_n, otype='word'))
    sp_pattern = '-'.join(F.sp.v(w) for w in L.d(phrase_n, otype='word'))
    
    # // clause data // 
    
    domain = F.domain.v(clause_n)
        
    # get verb lex and its tense
    pred_functs = {'Pred', 'PreO', 'PreS', 'PtcO'}
    verb = [word for phrase in L.d(clause_n, otype='phrase')
               for word in L.d(phrase, otype='word')
               if F.pdp.v(word) == 'verb' 
               and F.function.v(phrase) in pred_functs
           ]
    if verb:
        verb_n = verb[0]
        verb_lex = F.lex.v(verb_n)
        verb_tense = F.vt.v(verb_n) if not is_weqt(verb_n) else 'weqt' # + hacked weqetal
    else:
        verb_lex = 'ø'
        verb_tense = 'ø'
        
    # time-phrase position to verb
    pos_at_vb = 1 if phrase_n < verb_n else 2
    pos_at_cl = F.number.v(phrase_n)
    
    # clause dependency
    cl_dependency = 'I' if F.rela.v(clause_n) == 'NA' else 'D'
    
    
    # // ship it //
    
    # package the row and append it
    row = [book, chapter, verse, clause_n, phrase_n,
           ph_trans, preps_txt, num_preps, subs_txt,
           num_subs, quants_txt, num_quants, advbs_txt, num_advbs,
           pdp_pattern, sp_pattern, domain, verb_lex, 
           verb_tense, pos_at_vb, pos_at_cl, cl_dependency]
           
    rows.append(row)
    
print(len(rows), 'rows ready for export...')
print()
print('sample: ', rows[100])

3712 rows ready for export...

sample:  ['Genesis', 19, 34, 429263, 656759, '>MC', 'ø', 0, 'ø', 0, 'ø', 0, 'ø', 0, 'advb', 'subs', 'Q', 'CKB[', 'perf', 2, 3, 'I']


### Export Timex Data to CSV

In [9]:
# Export the .csv
with open('time_phrases.csv', 'w') as outfile:
    
    writer = csv.writer(outfile)
    
    writer.writerow(header)
    writer.writerows(rows)

## 2. Dataset for Verbs


In [10]:
header = ['book', 'chapter', 'verse', 'clause.n', 
          'verb.n', 'lexeme', 'tense', 'domain', 
          'position.at.cl', 'cl.dependency']

rows = []

# gather verb data
for verb in F.pdp.s('verb'):

    book, chapter, verse = T.sectionFromNode(verb)
    clause_n = L.u(verb, otype='clause')[0] # clause node for filtering (below)

    # skip non-Hebrew verbs
    if F.language.v(verb) != 'hbo':
        continue

    # skip verbs in a clause with a time phrase
    clause_functs = set(F.function.v(phrase) for phrase in L.d(clause_n, otype='phrase'))
    if 'Time' in clause_functs:
        continue
    
    # skip verbs without a predicate phrase function
    phrase = L.u(verb, otype='phrase')[0]
    pred_functs = {'Pred', 'PreO', 'PreS', 'PtcO'}
    if F.function.v(phrase) not in pred_functs:
        continue
        
    # verb level data
    tense = F.vt.v(verb_n) if not is_weqt(verb_n) else 'weqt' # + hacked weqetal
    lex = F.lex.v(verb)
    
    # clause level data
    domain = F.domain.v(clause_n)
    pos_at_cl = F.number.v(phrase)
    cl_dependency = 'I' if F.rela.v(clause_n) == 'NA' else 'D'

    # package and save
    row = [book, chapter, verse, clause_n, verb, 
           lex, tense, domain, pos_at_cl, 
           cl_dependency]
    
    rows.append(row)
    
print(len(rows), 'rows ready for export...')
print()
print('sample: ', rows[124])

59726 rows ready for export...

sample:  ['Genesis', 2, 16, 427735, 977, '>KL[', 'perf', 'Q', 3, 'I']


### Export Tense Data to CSV

In [11]:
with open('verbs.csv', 'w') as outfile:
    
    writer = csv.writer(outfile)
    
    writer.writerow(header)
    writer.writerows(rows)

## 3. Dataset for All Clauses

In [12]:
header = ['book', 'chapter', 'verse', 'has.timex']
rows = []

for clause in F.otype.s('clause'):
    
    # skip non-hebrew
    language = F.language.v(L.d(clause, otype='word')[0])
    if language != 'hbo':
        continue
    
    book, chapter, verse = T.sectionFromNode(clause)
    
    has_timex = 1 if 'Time' in set(F.function.v(ph) for ph in L.d(clause, otype='phrase')) else 2
    
    row = [book, chapter, verse, has_timex]
    
    rows.append(row)
    
print(len(rows),'rows ready for export...')
print()
print('sample: ', rows[0])

86808 rows ready for export...

sample:  ['Genesis', 1, 1, 1]


### Write to CSV

In [13]:
with open('all_clauses.csv', 'w') as outfile:
    
    writer = csv.writer(outfile)
    
    writer.writerow(header)
    writer.writerows(rows)