# Time Markers Alongside Lexemes

Previous notebooks found that certain time markers tend to co-occur with particular tenses, and found more limited tendencies for select prepositions (namely ל).

Examining time markers that begin with ל shows that the lexeme also plays an important role. The time marker לעולם, for instance, is yiqtol-dominant. But ל can also occur with a lexeme such as יום, and that time marker appears in narrative contexts.

How strongly do verb tenses correlate with particular lexemes in the Hebrew Bible (HB)?

Notebook 3 finds that 73% of all time markers begin with a preposition. This notebook therefore focuses on the objects of prepositions, since these are the simplest target to start with.

In [14]:
# load modules
import pickle, collections
import pandas as pd
from IPython.display import display, HTML

# load TF module and data
from tf.fabric import Fabric
TF = Fabric(modules='hebrew/etcbc4c', silent=True)
api = TF.load('''book chapter verse
                 pdp vt lex typ rela
              ''')

api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.01s B book                 from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B chapter              from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B verse                from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.11s B pdp                  from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.11s B vt                   from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.11s B lex                  from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.20s B typ                  from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.19s B rela                 from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s Feature overview: 103 for nodes; 5 for edges; 1 configs; 7 computed
  5.14s All features loaded/computed - for details use loadLog()


In [9]:
# import time markers data
tm_data_file = 'data/time_markers.pickle'

# load data
with open(tm_data_file, 'rb') as infile:
    tm_data = pickle.load(infile)

print('data available: ', ', '.join(tm_data.keys()))

data available:  markers, top_markers, stats_rows, preposition_cl_lists


In [10]:
# assign the data
markers = tm_data['markers']
top_markers = tm_data['top_markers']
stats_rows = tm_data['stats_rows']

In [11]:
print('data per time marker:')
markers['L  JWM'].keys()

data per time marker:


dict_keys(['count', 'clauses', 'tense_cl_lists', 'tense_counts', 'tense_percents', 'example_phrase'])

## How to isolate the object of the preposition in the dataset?


In [13]:
genesis_1 = T.nodeFromSection(('Genesis', 1))

prep_phrases = [phrase for phrase in L.d(genesis_1, otype='phrase')
                   if F.typ.v(phrase) == 'PP'
               ]

len(prep_phrases)

80

In [27]:
# Look within phrase at pdp
for word in L.d(prep_phrases[0], otype='word'):
    
    text = T.text([word])
    pdp = F.pdp.v(word)
    
    print(word)
    print(text)
    print(pdp)

1
בְּ
prep
2
רֵאשִׁ֖ית 
subs


Unfortunately, the ETCBC dataset does not explicitly mark the relationship between a preposition and its object. This is a shortcoming, in my opinion.

Instead, we use the feature pdp ("phrase dependent part of speech") and look for a word with a pdp value of "subs" (substantive) within a prepositional phrase. In the Genesis 1 example above, word 2 (רֵאשִׁ֖ית) is the substantive that follows the preposition בְּ, so it is taken as the object.
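
The pdp heuristic described above can be sketched without Text-Fabric, using plain Python objects in place of word nodes. This is only an illustration: the `Word` tuple, the function name `object_of_preposition`, and the second example phrase are hypothetical, and the lexeme strings are illustrative stand-ins rather than values read from the dataset.

```python
from typing import NamedTuple

class Word(NamedTuple):
    node: int  # stand-in for a word node number
    lex: str   # lexeme (illustrative value)
    pdp: str   # phrase dependent part of speech

def object_of_preposition(words):
    """Return the only substantive in a prepositional phrase.

    Returns None when the phrase has no substantive or more than one,
    since the heuristic cannot pick an object in those cases.
    """
    substantives = [w for w in words if w.pdp == 'subs']
    return substantives[0] if len(substantives) == 1 else None

# first phrase of Genesis 1:1: preposition (node 1) + substantive (node 2)
first_phrase = [Word(1, 'B', 'prep'), Word(2, 'R>CJT/', 'subs')]

# hypothetical phrase with two substantives, which the heuristic rejects
two_subs = [Word(10, 'B', 'prep'), Word(11, 'JWM/', 'subs'), Word(12, 'CNH/', 'subs')]

print(object_of_preposition(first_phrase))
print(object_of_preposition(two_subs))
```

Returning `None` for ambiguous phrases mirrors the choice made in the counting step, where phrases with several substantives are skipped rather than guessed at.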

## Make the Counts

In [42]:
# map clause nodes to list and make a count
lexeme_cl_lists = collections.defaultdict(lambda: collections.defaultdict(list))
lexeme_tense_counts = collections.defaultdict(lambda: collections.Counter())

# skip prepositional phrases with more than one substantive
# and time markers with more than one prepositional phrase,
# but keep a count of each in the variables below
plus_prep = 0         # prep. phrases with multiple substantives
plus_prep_phrase = 0  # time markers with multiple prep. phrases

# iterate and count
for time_marker, marker_data in markers.items():
    
    # get phrase node
    time_phrase = marker_data['example_phrase']
    
    # get the prepositional phrase atoms inside the time phrase
    prep_phrase = [phrase_atom for phrase_atom in L.d(time_phrase, otype='phrase_atom')
                      if F.typ.v(phrase_atom) == 'PP']
    
    # skip but count markers with multiple prep. phrases
    if len(prep_phrase) > 1:
        plus_prep_phrase += 1
        continue
        
    # skip time markers without prep. phrases
    elif not prep_phrase:
        continue
    
    # isolate the prep. phrase
    prep_phrase = prep_phrase[0]
    
    # get the object of the preposition
    obj_prep = [word for word in L.d(prep_phrase, otype='word')
                   if F.pdp.v(word) == 'subs']
    
    # skip but count prep. phrases with multiple substantives
    if len(obj_prep) > 1:
        plus_prep += 1
        continue
    elif not obj_prep:
        continue
    
    # isolate obj of preposition
    obj_prep = obj_prep[0]
    
    # count the lex and its clauses
    lex = F.lex.v(obj_prep)
    
    # count by tense
    for tense, cl_list in marker_data['tense_cl_lists'].items():
    
        # map clause nodes
        lexeme_cl_lists[lex][tense].extend(cl_list)
        
        # count clauses
        lexeme_tense_counts[lex][tense] += len(cl_list)
        lexeme_tense_counts[lex]['total'] += len(cl_list)
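
The effect of the aggregation above is that clause counts from different time markers are merged whenever the markers share the same object lexeme. A minimal sketch with toy data may make this easier to see. The marker names, clause node numbers, and the `toy_` variable names here are all made up for illustration and do not come from the pickled data.

```python
import collections

# hypothetical input: marker -> (object lexeme, {tense: [clause nodes]})
toy_markers = {
    'B JWM': ('JWM/', {'wayq': [101, 102], 'perf': [103]}),
    'L JWM': ('JWM/', {'impf': [104]}),
    'L <WLM': ('<WLM/', {'impf': [105, 106]}),
}

# lexeme -> Counter of tenses; missing lexemes start with an empty Counter
toy_counts = collections.defaultdict(collections.Counter)

for marker, (lex, tense_cl_lists) in toy_markers.items():
    for tense, cl_list in tense_cl_lists.items():
        toy_counts[lex][tense] += len(cl_list)    # per-tense count
        toy_counts[lex]['total'] += len(cl_list)  # running total for the lexeme

for lex, counts in toy_counts.items():
    print(lex, dict(counts))
```

Both toy markers ending in JWM contribute to the single `JWM/` entry, which is why the lexeme totals in this notebook can exceed the count of any one time marker.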

### Display Data

In [43]:
# order and create tables
lex_count_table_order = [(tense_totals['total'], marker, tense_totals) 
                          for marker, tense_totals in lexeme_tense_counts.items()]
lex_count_table_order = sorted(lex_count_table_order, reverse=True)


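# get header data
# NB: tenses are read from the most frequent lexeme only;
# a tense that lexeme never takes would be missing from the table
all_tenses = [key for key in sorted(lex_count_table_order[0][2].keys())
                 if key != 'total']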
table_header = ['marker', 'total'] + all_tenses


# make table rows
lex_count_table_rows = []

for total, marker, tense_totals in lex_count_table_order:
    
    tense_counts = ['{}% ({})'.format(round((tense_totals[tense]/total)*100,1), tense_totals[tense])
                       for tense in all_tenses ]
    
    row = [marker, total] + tense_counts
    
    lex_count_table_rows.append(row)
    

# display table
lex_count_table = pd.DataFrame(lex_count_table_rows, columns=table_header)
lex_count_table

| | marker | total | impf | impv | infa | infc | perf | ptca | ptcp | wayq | weqt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JWM/ | 566 | 25.8% (146) | 1.9% (11) | 0.2% (1) | 2.3% (13) | 25.1% (142) | 3.4% (19) | 0.2% (1) | 29.2% (165) | 12.0% (68) |
| 1 | <WLM/ | 151 | 51.0% (77) | 3.3% (5) | 0.0% (0) | 9.3% (14) | 13.9% (21) | 4.0% (6) | 1.3% (2) | 3.3% (5) | 13.9% (21) |
| 2 | BQR=/ | 105 | 22.9% (24) | 14.3% (15) | 1.0% (1) | 2.9% (3) | 2.9% (3) | 3.8% (4) | 0.0% (0) | 45.7% (48) | 6.7% (7) |
| 3 | <T/ | 99 | 29.3% (29) | 1.0% (1) | 0.0% (0) | 3.0% (3) | 27.3% (27) | 7.1% (7) | 0.0% (0) | 27.3% (27) | 5.1% (5) |
| 4 | <RB/ | 72 | 37.5% (27) | 0.0% (0) | 0.0% (0) | 2.8% (2) | 4.2% (3) | 2.8% (2) | 0.0% (0) | 18.1% (13) | 34.7% (25) |
| 5 | LJLH/ | 35 | 14.3% (5) | 0.0% (0) | 0.0% (0) | 2.9% (1) | 17.1% (6) | 8.6% (3) | 0.0% (0) | 51.4% (18) | 5.7% (2) |
| 6 | CNH/ | 31 | 32.3% (10) | 6.5% (2) | 0.0% (0) | 12.9% (4) | 12.9% (4) | 0.0% (0) | 0.0% (0) | 22.6% (7) | 12.9% (4) |
| 7 | NYX/ | 25 | 84.0% (21) | 0.0% (0) | 0.0% (0) | 0.0% (0) | 16.0% (4) | 0.0% (0) | 0.0% (0) | 0.0% (0) | 0.0% (0) |
| 8 | XDC=/ | 19 | 15.8% (3) | 5.3% (1) | 0.0% (0) | 15.8% (3) | 21.1% (4) | 10.5% (2) | 0.0% (0) | 26.3% (5) | 5.3% (1) |
| 9 | MXRT/ | 19 | 0.0% (0) | 0.0% (0) | 0.0% (0) | 5.3% (1) | 0.0% (0) | 0.0% (0) | 0.0% (0) | 94.7% (18) | 0.0% (0) |
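
One possible way to read the table is to ask, for each lexeme, which tense is most frequent and what share of the clauses it covers. The sketch below does this for three rows, with the counts copied from the table above. The helper name `dominant_tense` and the dictionary `table_counts` are hypothetical and are not part of the notebook's data structures.

```python
# tense counts copied from three rows of the table above
table_counts = {
    'JWM/':  {'impf': 146, 'impv': 11, 'infa': 1, 'infc': 13, 'perf': 142,
              'ptca': 19, 'ptcp': 1, 'wayq': 165, 'weqt': 68},
    'NYX/':  {'impf': 21, 'impv': 0, 'infa': 0, 'infc': 0, 'perf': 4,
              'ptca': 0, 'ptcp': 0, 'wayq': 0, 'weqt': 0},
    'MXRT/': {'impf': 0, 'impv': 0, 'infa': 0, 'infc': 1, 'perf': 0,
              'ptca': 0, 'ptcp': 0, 'wayq': 18, 'weqt': 0},
}

def dominant_tense(counts):
    """Return (tense, percent) for the most frequent tense of one lexeme."""
    total = sum(counts.values())
    tense, n = max(counts.items(), key=lambda item: item[1])
    return tense, round(n / total * 100, 1)

for lex, counts in table_counts.items():
    tense, share = dominant_tense(counts)
    print('{}: {} ({}%)'.format(lex, tense, share))
```

On these rows the sketch separates a lexeme such as MXRT/, where wayq covers nearly every clause, from JWM/, where the leading tense holds under a third of the clauses. The small totals for NYX/ and MXRT/ mean their high shares should be read with caution.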
