# Export BHSA Data for Processing

Use Text-Fabric to export a time-expression (timex) dataset and a verb dataset.

In all cases, the data is only selected from Biblical Hebrew.

## 1. Dataset for Time Expressions
* Select phrases with the function of Time.
    * Exclude phrases that occur in clauses with multiple time-function phrases.
* Features of the phrase
    * lexeme/count of prepositions, substantives, and quantifiers (cardinal numbers).
    * part of speech patterns within the phrase
* Features of the enclosing clause
    * domain
    * lex and tense of verb if present
    * position of phrase in relation to the verb 
    * position of phrase in relation to the clause
    * main vs. subordinate
    
### 1.1. Classification of Time Expressions
* See [time_phrase_analysis.ipynb]('../data_analysis/time_phrase_analysis.ipynb') for the rationale behind the classifications performed in this section.
* a spreadsheet with prepositions and adverbs/substantives is exported for manual annotation of their semantic content; the annotations for prepositions + adverbs/substantives are used to establish a unique classification of the time phrase.
* Text-Fabric Search is used to identify phrases that contain the desired preposition adverb/substantive combinations. TF Search is ideal since it can easily represent the phrase internal structure of the patterns while also ignoring adnominal elements that intervene. The search templates also allow for the method of classification to be easily scrutinized and adjusted. 
* Finally, the template-processed phrases are combined with the manual annotations. A single tag is produced and added to the timex dataset for qualifying phrases: `timex.group`. 
    
    
## 2. Dataset for Verbs

In order to compare the time phrase dataset with the broader tendencies of verbs in the Hebrew Bible, another export contains:
* **Features of verbs in the HB**
    * verb must be in a predicate phrase
    * tense and lexeme
    <br><br>
* **Features of the enclosing clause**
    * domain
    * position of verb's phrase in relation to the clause
    * main vs. subordinate
    
## 3. Dataset for All Clauses
* **A simple export where each row represents a clause, has a T/F value for the presence of a time phrase**.
    * Include book, chapter, and verse.

<hr>

In [1]:
from tf.fabric import Fabric

In [2]:
import os
from pprint import pprint
import csv, collections

In [3]:
# put custom functions in path
if __name__ == '__main__' and __package__ is None:
    os.sys.path.append('../')

In [4]:
TF = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c')

This is Text-Fabric 3.1.1
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

114 features found and 0 ignored


In [5]:
api = TF.load('''
                 book chapter verse
                 function kind
                 pdp sp typ rela ls
                 gloss
                 domain vt lex
                 number language
              ''')

api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.01s B book                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.01s B chapter              from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.01s B verse                from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.09s B function             from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.05s B kind                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.18s B pdp                  from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.26s B sp                   from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.30s B typ                  from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.24s B rela                 from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.15s B ls                   from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.01s B gloss                from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.03s B domain               from /Users/Cody/github/etcbc/bhsa/tf/c
   |     0.26s B vt                

In [6]:
# custom function to id weqetal verbs and for subject isolation
from my_functions.verbs import is_weqt 
from my_functions.phrases import is_subs

## 1. Dataset for Time Phrases

The time phrase cannot occur inside a clause with more than 1 time phrase.

In [7]:
timexes = []

for phrase in F.function.s('Time'):
    
    # skip non-hebrew phrases
    language = F.language.v(L.d(phrase, otype='word')[0])
    if language != 'hbo':
        continue
    
    # skip phrases with >1 TP in its clause
    clause = L.u(phrase, otype='clause')[0]
    phrase_functs = [F.function.v(ph) for ph in L.d(clause, otype='phrase')]
    if phrase_functs.count('Time') > 1:
        continue
        
    # append the relevant data
    timexes.append((clause, phrase))

print(len(timexes), 'time expressions ready for proccessing...')

3712 time expressions ready for proccessing...


### Process Time Phrases

Retrieve the phrase/clause level information.

In [8]:
# header for csv export & and how data should be assembled
header = ['book', 'chapter', 'verse', 'clause.n', 'phrase.n', 
          'phrase.trans', 'preps', 'num.preps', 'subs',
          'num.subs', 'quants', 'num.quants', 'advbs', 
          'num.advbs','pdp.pattern', 'domain', 
          'verb.lex', 'verb.tense', 'position.at.vb', 
          'position.at.cl', 'cl.dependency']

# put data here for csv export
rows = []

# get row data per time phrase
for clause_n, phrase_n in timexes:

    book, chapter, verse = T.sectionFromNode(clause_n)
        
        
    # // phrase data //
    
    # transcription
    ph_words = L.d(phrase_n, otype='word')
    ph_trans = T.text(ph_words, fmt='lex-trans-plain').strip()
    
    # substantives
    subs = [F.lex.v(w) for w in ph_words if is_subs(w)] # w/ custom funct. is_subs()
    num_subs = len(subs)
    subs_txt = '|'.join(subs) or 'ø'
    
    # prepositions
    preps = [F.lex.v(w) for w in ph_words if F.pdp.v(w) == 'prep']
    num_preps = len(preps)
    preps_txt = '|'.join(preps) or 'ø'
    
    # quantities (card == "cardinal number")
    quants = [F.lex.v(w) for w in ph_words if F.ls.v(w) == 'card']
    num_quants = len(quants)
    quants_txt = '|'.join(quants) or 'ø'
    
    # adverbs
    advbs = [F.lex.v(w) for w in ph_words if F.pdp.v(w) == 'advb']
    num_advbs = len(advbs)
    advbs_txt = '|'.join(advbs) or 'ø'
    
    # part of speech patterns
    pdp_pattern = '-'.join(F.pdp.v(w) for w in L.d(phrase_n, otype='word'))
    
    # // clause data // 
    
    domain = F.domain.v(clause_n)
        
    # get verb lex and its tense
    pred_functs = {'Pred', 'PreO', 'PreS', 'PtcO'}
    verb = [word for phrase in L.d(clause_n, otype='phrase')
               for word in L.d(phrase, otype='word')
               if F.pdp.v(word) == 'verb' 
               and F.function.v(phrase) in pred_functs
           ]
    if verb:
        verb_n = verb[0]
        verb_lex = F.lex.v(verb_n)
        verb_tense = F.vt.v(verb_n) if not is_weqt(verb_n) else 'weqt' # + hacked weqetal
    else:
        verb_lex = 'ø'
        verb_tense = 'ø'
        
    # time-phrase position to verb
    pos_at_vb = 1 if phrase_n < verb_n else 2
    pos_at_cl = F.number.v(phrase_n)
    
    # clause dependency
    cl_dependency = 'I' if F.rela.v(clause_n) == 'NA' else 'D'
    
    
    # // ship it //
    
    # package the row and append it
    row = [book, chapter, verse, clause_n, phrase_n,
           ph_trans, preps_txt, num_preps, subs_txt,
           num_subs, quants_txt, num_quants, advbs_txt, num_advbs,
           pdp_pattern, domain, verb_lex, verb_tense, 
           pos_at_vb, pos_at_cl, cl_dependency]
           
    rows.append(row)
    
print(len(rows), 'rows ready for export...')
print()
print('sample: ', rows[100])

3712 rows ready for export...

sample:  ['Genesis', 19, 34, 429263, 656759, '>MC', 'ø', 0, 'ø', 0, 'ø', 0, '>MC/', 1, 'advb', 'Q', 'CKB[', 'perf', 2, 3, 'I']


## Classifying Time Phrases

### A. Export Spreadsheets for Manual Annotations

Use the lexemes for prepositions/substantives|adverbs in the dataset.

Put common lexemes at the top of the spreadsheets. Thus, make a count first and sort them.

In [9]:
prep_lexemes = collections.Counter()
subs_lexemes = collections.Counter()

# dicts to inspect the texts these items are coming from
prep_samples = collections.defaultdict(list)
subs_samples = collections.defaultdict(list)

# indices for desired elements
prep_i = header.index('preps') 
subs_i = header.index('subs')
advb_i = header.index('advbs')

# get data from the spreadsheet above
for phrase_dat in rows:
    
    if phrase_dat[prep_i] != 'ø':
        prep_lexemes[phrase_dat[prep_i]] += 1
        prep_samples[phrase_dat[prep_i]].append('{} {}:{}'.format(phrase_dat[0], phrase_dat[1], phrase_dat[2]))
        
    if phrase_dat[subs_i] != 'ø':
        subs_lexemes[phrase_dat[subs_i]] += 1
        subs_samples[phrase_dat[subs_i]].append('{} {}:{}'.format(phrase_dat[0], phrase_dat[1], phrase_dat[2]))
        
    if phrase_dat[advb_i] != 'ø':
        subs_lexemes[phrase_dat[advb_i]] += 1
        subs_samples[phrase_dat[advb_i]].append('{} {}:{}'.format(phrase_dat[0], phrase_dat[1], phrase_dat[2]))

# export 4 spreadsheets
# 2 for semantic annotations
# 2 for example passages for each

# prepositions
prep_file = 'manual_annotations/prep_annotations'

if not os.path.exists(prep_file+'.tsv'): # prevent overwrite
    
    print('writing', prep_file, '...')
    
    with open(prep_file+'.tsv', 'w') as outfile:

        writer = csv.writer(outfile, delimiter='\t')

        sort_preps = sorted(prep_lexemes.items(), key=lambda k: k[-1], reverse=True)

        writer.writerow(['lexeme', 'annotation'])
        writer.writerows([prep[0], ''] for prep in sort_preps)

    # example passages file
    with open(prep_file+'_examples.txt', 'w') as outfile:
                        
        for prep, count in sort_preps:
            
            outfile.write(prep+'\n')
            outfile.write('; '.join(prep_samples[prep]))
            outfile.write('\n')
            outfile.write('-'*100)
            outfile.write('\n'*2)
            
else:
    print('prep annotations file present...doing nothing.')
            
# substantives

subs_file = 'manual_annotations/subs_annotations'

if not os.path.exists(subs_file+'.tsv'): # prevent overwrite

    print('writing', subs_file, '...')
    
    with open(subs_file+'.tsv', 'w') as outfile:

        writer = csv.writer(outfile, delimiter='\t')

        sort_subs = sorted(subs_lexemes.items(), key=lambda k: k[-1], reverse=True)

        writer.writerow(['lexeme', 'annotation'])
        writer.writerows([subs[0], ''] for subs in sort_subs)
    
    # example passages file
    with open(subs_file+'_examples.txt', 'w') as outfile:
                        
        for subs, count in sort_subs:
            
            outfile.write(subs+'\n')
            outfile.write('; '.join(subs_samples[subs]))
            outfile.write('\n')
            outfile.write('-'*100)
            outfile.write('\n'*2)

else:
    print('subs annotations file present...doing nothing.')

prep annotations file present...doing nothing.
subs annotations file present...doing nothing.


**The manual annotations are complete.** See the [manual annotations readme](manual_annotations/readme.md) for a report on the results.

### B. Identify Time Phrase Patterns 

Manual annotations are extracted and applied to the row. Instead of looking for ordered elements, a set of tags will be generated simply by the presence of an annotated element in a row. The tags are unordered and there are no repeated elements.

#### Import Annotations

In [10]:
# import manual preposition annotations
with open(prep_file + '.tsv', 'r') as infile:
    
    reader = csv.reader(infile, delimiter='\t')
    
    next(reader) # skip header
    
    prep_classes = dict(row for row in reader if row[1]) # take only annotated rows, ignore blank
    
    print(len(prep_classes), 'prep. annotations ready...')
    
# import manual substantive annotations
with open(subs_file + '.tsv', 'r') as infile:
    
    reader = csv.reader(infile, delimiter='\t')
    
    next(reader) # skip header
    
    subs_classes = dict(row for row in reader if row[1]) # ignore blank
    
    print(len(subs_classes), 'subs. annotations ready...')

9 prep. annotations ready...
88 subs. annotations ready...


In [11]:
prep_classes['B'] # example

'internal'

#### Modify Data Rows

A special case is made for the classes `demonstrative_near` and `demonstrative_far`. Some demonstratives were not included in the adverbs and are also added.

In [12]:
# print all demonstrative pronouns for sorting into near and far pronouns

demon_pronouns = set((F.lex.v(w), F.gloss.v(L.u(w, otype='lex')[0])) for w in F.pdp.s('prde')
                        if F.language.v(w) == 'hbo'
                    )

pprint(demon_pronouns)

{('>L===', 'these'),
 ('>LH', 'these'),
 ('HJ>', 'she'),
 ('HM', 'they'),
 ('HMH', 'they'),
 ('HNH=', 'they'),
 ('HW>', 'he'),
 ('LZ', 'this there'),
 ('LZH', 'this there'),
 ('LZW', 'this there'),
 ('Z>T', 'this'),
 ('ZH', 'this'),
 ('ZH=', 'this'),
 ('ZW=', 'this')}


In [13]:
demon_near = {'>L===', '>LH', 'LZ', 
             'LZH', 'LZW', 'Z>T', 
              'ZH', 'ZH=', 'ZW='}

demon_far = {'HJ>', 'HM', 'HMH', 
             'HNH=', 'HW>'}

demon_classes = {}

for demon in demon_near | demon_far:
    
    if demon in demon_near:
        demon_classes[demon] = 'demonstrative_near'
    else:
        demon_classes[demon] = 'demonstrative_far'

Look for annotation matches and add to the dataset in the form of a new column. Add a null value for timexes without a match.

In [14]:
header += ['timex.group'] # add new data column to dataset header

tagged = 0 # keep count of matches

for row in rows:
    
    # grab data for comparison
    prep = row[prep_i]
    subs = row[subs_i]
    advb = row[advb_i]
    
    # extract any demonstratives from the phrase and classify them
    phrase_i = header.index('phrase.n')
    demons = [F.lex(w) for w in L.d(phrase_i, otype='word')
                 if F.pdp.v(w) == 'prde'
             ]
    if len(demons) > 1 or not demons: # skip plural demonstratives, the same is done for other constructions
        demon = ''
    else:
        demon = demon_classes[demons[0]]
    
    # classify prepositions, substantives, and adverbs
    prep_tag = prep_classes.get(prep, '')
    subs_tag = subs_classes.get(subs, '')
    advb_tag = subs_classes.get(advb, '')
    
    tag_elements = [prep_tag] + sorted({subs_tag, advb_tag, demon})
    tag_elements = [el for el in tag_elements if el] # clean out empty tags

    # create the tag
    group_tag = '.'.join(tag_elements) or 'ø'
    
    if group_tag != 'ø':
        tagged += 1
    
    # add the tag to the dataset
    row.append(group_tag)
    
print(tagged, 'rows tagged...')

3248 rows tagged...


### Export Timex Data to CSV

In [15]:
# Export the .csv
with open('time_phrases.csv', 'w') as outfile:
    
    writer = csv.writer(outfile)
    
    writer.writerow(header)
    writer.writerows(rows)
    
print('done!')

done!


## 2. Dataset for Verbs


In [16]:
header = ['book', 'chapter', 'verse', 'clause.n', 
          'verb.n', 'lexeme', 'tense', 'domain', 
          'position.at.cl', 'cl.dependency']

rows = []

# gather verb data by clause
for clause in F.otype.s('clause'):

    book, chapter, verse = T.sectionFromNode(clause)
    
    # skip clauses with a time phrase
    clause_functs = set(F.function.v(phrase) for phrase in L.d(clause, otype='phrase'))
    if 'Time' in clause_functs:
        continue
    
    # skip non-Hebrew clauses
    if F.language.v(L.d(clause, otype='word')[0]) != 'hbo':
        continue
    
    # extract verb data
    pred_functs = {'Pred', 'PreO', 'PreS', 'PtcO'}
    # isolate verb from predicate phrase:
    verb = [word for phrase in L.d(clause, otype='phrase')
               for word in L.d(phrase, otype='word')
               if F.pdp.v(word) == 'verb' 
               and F.function.v(phrase) in pred_functs
           ]
    
    if verb: # get tense/lexeme
        verb_n = verb[0]
        verb_lex = F.lex.v(verb_n)
        verb_tense = F.vt.v(verb_n) if not is_weqt(verb_n) else 'weqt' # + hacked weqetal

    else: # handle verbless clauses
        verb_n = 0
        verb_lex = 'ø'
        verb_tense = 'ø'
    
    # clause level data
    domain = F.domain.v(clause)
    pos_at_cl = F.number.v(phrase)
    cl_dependency = 'I' if F.rela.v(clause) == 'NA' else 'D'

    # package and save
    row = [book, chapter, verse, clause, verb_n, 
           verb_lex, verb_tense, domain, pos_at_cl, 
           cl_dependency]
    
    rows.append(row)
    
print(len(rows), 'rows ready for export...')
print()
print('sample: ', rows[10])

82980 rows ready for export...

sample:  ['Genesis', 1, 5, 427564, 69, 'QR>[', 'perf', 'N', 2, 'I']


### Export Tense Data to CSV

In [17]:
with open('verbs.csv', 'w') as outfile:
    
    writer = csv.writer(outfile)
    
    writer.writerow(header)
    writer.writerows(rows)
    
print('done!')

done!


## 3. Dataset for All Clauses

In [18]:
header = ['book', 'chapter', 'verse', 'has.timex']
rows = []

for clause in F.otype.s('clause'):
    
    # skip non-hebrew
    language = F.language.v(L.d(clause, otype='word')[0])
    if language != 'hbo':
        continue
    
    book, chapter, verse = T.sectionFromNode(clause)
    
    has_timex = 1 if 'Time' in set(F.function.v(ph) for ph in L.d(clause, otype='phrase')) else 2
    
    row = [book, chapter, verse, has_timex]
    
    rows.append(row)
    
print(len(rows),'rows ready for export...')
print()
print('sample: ', rows[0])

86808 rows ready for export...

sample:  ['Genesis', 1, 1, 1]


### Write to CSV

In [19]:
with open('all_clauses.csv', 'w') as outfile:
    
    writer = csv.writer(outfile)
    
    writer.writerow(header)
    writer.writerows(rows)
    
print('done!')

done!
