# Time Markers with Marked Tenses

This notebook examines time markers that show a marked tendency to prefer a certain tense. For the purpose of this analysis, a time marker has a "marked tendency" when a single tense accounts for at least 50% of its occurrences. The analysis primarily examines the top 50 time markers.

The analysis uses the data initially gathered in `1_exploration.ipynb`.
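
As a minimal sketch of the definition above, the helper below picks the tense with the largest share and reports it only if that share reaches the threshold. The function name `marked_tense` is hypothetical and not part of this notebook. The first two dictionaries copy percentages from the `>Z` and `<D H <RB` rows of the tables later in this notebook; the third is invented to show a marker with no marked tense. Like `predominant_counts` below, the sketch treats a share of exactly 50% as meeting the threshold.

```python
def marked_tense(tense_percents, threshold=50.0):
    """Return the tense with the highest share if it meets the threshold, else None."""
    if not tense_percents:
        return None
    # pick the (tense, percentage) pair with the largest percentage
    tense, perc = max(tense_percents.items(), key=lambda item: item[1])
    return tense if perc >= threshold else None


# shares copied from the tables in this notebook
marker_az = {'impf': 68.4, 'perf': 29.8, 'ptca': 1.8}                     # >Z
marker_ad_h_rb = {'impf': 29.3, 'perf': 2.4, 'wayq': 14.6, 'weqt': 53.7}  # <D H <RB

# invented shares, for illustration only
no_preference = {'impf': 40.0, 'perf': 35.0, 'wayq': 25.0}

for name, percents in [('>Z', marker_az),
                       ('<D H <RB', marker_ad_h_rb),
                       ('(invented)', no_preference)]:
    print(name, '->', marked_tense(percents))
```

Using `max` with a key on the percentage avoids accidentally ranking tenses by name when more than one tense clears a lower threshold such as 40%.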

In [1]:
import pickle, collections
import pandas as pd
from pprint import pprint
from tf.fabric import Fabric
from IPython.display import display, HTML

TF = Fabric(modules='hebrew/etcbc4c', silent=True)
api = TF.load('''book chapter verse
                 pdp vt domain lex
              ''')

api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.01s B book                 from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B chapter              from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B verse                from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.12s B pdp                  from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.12s B vt                   from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.02s B domain               from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.12s B lex                  from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s Feature overview: 103 for nodes; 5 for edges; 1 configs; 7 computed
  4.54s All features loaded/computed - for details use loadLog()


In [2]:
# import custom function for weqatal detection
from functions.verbs import is_weqt

In [3]:
# import time markers data
tm_data_file = 'data/time_markers.pickle'

# load data
with open(tm_data_file, 'rb') as infile:
    tm_data = pickle.load(infile)

print('data available: ', ', '.join(tm_data.keys()))

data available:  markers, top_markers, stats_rows


In [4]:
# assign the data
markers = tm_data['markers']
top_markers = tm_data['top_markers']
stats_rows = tm_data['stats_rows']

In [5]:
def predominant_counts(marker_list, marker_data, threshold=50.):
    '''
    Return a dict mapping each tense to the time markers in which
    that tense has a share >= a threshold percentage.
    Require a list of markers (ETCBC transcription),
        and a marker_data dictionary with tense percentages.
    '''
    
    tense_predominance = collections.defaultdict(list)
    
    for marker in marker_list:
        
        # the marker string is the first element of each entry
        marker = marker[0]
        
        # look for tenses that meet the threshold; if multiple, take highest
        # (sort on the percentage, not on the tense name)
        at_threshold = sorted(((tense, perc) for tense, perc in marker_data[marker]['tense_percents'].items()
                                 if perc >= threshold), 
                              key=lambda tense_perc: tense_perc[1], reverse=True)
        
        tense = at_threshold[0][0] if at_threshold else None
        
        if tense:
            tense_predominance[tense].append(marker)
            
    return tense_predominance


def predominant_table(tense, markers_with_predominance):
    
    '''
    Return a data frame containing the stats rows of all time markers
    in which the supplied tense is predominant.
    Require a tense string and a dict of tense -> markers,
        as returned by predominant_counts.
    Relies on the global stats_rows for row data and column headers.
    '''
    
    marker_rows = [stats_rows[marker] for marker in markers_with_predominance[tense]]
    
    # build the table as a pd dataframe
    marker_table = pd.DataFrame(marker_rows, columns=stats_rows['header'])
    
    return marker_table

# Examine Tense Predominance Among Markers

In [6]:
# markers with >=50% of a tense
markers_predominant_50 = predominant_counts(top_markers, markers).items()

# sort tenses by number of 50% markers, greatest to least
markers_predominant_50 = collections.OrderedDict(sorted(markers_predominant_50,
                                                         key=lambda k: len(k[1]),
                                                         reverse=True))
# markers with >=40% of a tense
markers_predominant_40 = predominant_counts(top_markers, markers, threshold=40.).items()

# sort tenses by number of 40% markers, greatest to least
markers_predominant_40 = collections.OrderedDict(sorted(markers_predominant_40,
                                                         key=lambda k: len(k[1]),
                                                         reverse=True))

### How common is tense predominance compared to the total sample size (i.e. out of 50)?

In [7]:
number_sampled = len(top_markers) # evaluates to 50

# >=50%: total markers across all tenses
num_predominant_50 = sum(len(marker_list) for marker_list in markers_predominant_50.values())
perc_predominant_50 = round((num_predominant_50/number_sampled)*100)

# >=40%: total markers across all tenses
num_predominant_40 = sum(len(marker_list) for marker_list in markers_predominant_40.values())
perc_predominant_40 = round((num_predominant_40/number_sampled)*100)

print(f'{num_predominant_50}/{number_sampled} ({perc_predominant_50}%) time markers are 50% predominant')
print(f'{num_predominant_40}/{number_sampled} ({perc_predominant_40}%) time markers are 40% dominant')

25/50 (50%) time markers are 50% predominant
39/50 (78%) time markers are 40% dominant


These figures support the thesis that time markers can influence tense choice: half of the sampled markers occur with a single tense at least 50% of the time. Further analysis is needed, though, to determine which other factors around a time marker are influential.

# Display Data for Time Markers with Predominant Tense

In [8]:
# a number to display w/ heading
number = 0

# display markers in order from largest counts to smallest
for tense in markers_predominant_50:

    # format header
    number += 1
    heading = f'<h2>{number}.&nbsp;&nbsp;{tense}</h2>'
    
    # get the table
    pred_table = predominant_table(tense, markers_predominant_50)
    
    
    # display 
    
    # display header
    if number == 1:
        print()
        display(HTML('<span style="text-align: center"><h1>Markers with Tense Predominance</h1></span>'))
        print()
        
    # display tables
    display(HTML(heading))
    display(pred_table)
    display(HTML('<hr>'))







## 1. impf

|   | marker | occur | impf | impv | infa | infc | perf | ptca | ptcp | wayq | weqt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<TH` | 78 | 50.0% (39) | 9.0% (7) | 0% (0) | 0% (0) | 37.2% (29) | 1.3% (1) | 2.6% (2) | 0% (0) | 0% (0) |
| 1 | `L <WLM` | 76 | 60.5% (46) | 3.9% (3) | 0% (0) | 7.9% (6) | 11.8% (9) | 3.9% (3) | 2.6% (2) | 2.6% (2) | 6.6% (5) |
| 2 | `>Z` | 57 | 68.4% (39) | 0% (0) | 0% (0) | 0% (0) | 29.8% (17) | 1.8% (1) | 0% (0) | 0% (0) | 0% (0) |
| 3 | `>XR` | 31 | 67.7% (21) | 0% (0) | 0% (0) | 0% (0) | 32.3% (10) | 0% (0) | 0% (0) | 0% (0) | 0% (0) |
| 4 | `MXR` | 28 | 64.3% (18) | 17.9% (5) | 0% (0) | 0% (0) | 0% (0) | 7.1% (2) | 3.6% (1) | 0% (0) | 7.1% (2) |
| 5 | `L NYX` | 24 | 83.3% (20) | 0% (0) | 0% (0) | 0% (0) | 16.7% (4) | 0% (0) | 0% (0) | 0% (0) | 0% (0) |
| 6 | `<D MTJ` | 19 | 78.9% (15) | 0% (0) | 0% (0) | 0% (0) | 10.5% (2) | 10.5% (2) | 0% (0) | 0% (0) | 0% (0) |
| 7 | `CCT JMJM` | 14 | 71.4% (10) | 0% (0) | 0% (0) | 0% (0) | 21.4% (3) | 0% (0) | 0% (0) | 7.1% (1) | 0% (0) |
| 8 | `<D >NH` | 12 | 75.0% (9) | 0% (0) | 0% (0) | 0% (0) | 16.7% (2) | 8.3% (1) | 0% (0) | 0% (0) | 0% (0) |
| 9 | `B JWM H CMJNJ` | 12 | 66.7% (8) | 0% (0) | 0% (0) | 0% (0) | 8.3% (1) | 0% (0) | 0% (0) | 16.7% (2) | 8.3% (1) |


## 2. wayq

|   | marker | occur | impf | impv | infa | infc | perf | ptca | ptcp | wayq | weqt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `B BQR` | 76 | 13.2% (10) | 10.5% (8) | 1.3% (1) | 3.9% (3) | 2.6% (2) | 3.9% (3) | 0% (0) | 55.3% (42) | 9.2% (7) |
| 1 | `<D H JWM H ZH` | 58 | 5.2% (3) | 0% (0) | 0% (0) | 1.7% (1) | 29.3% (17) | 3.4% (2) | 0% (0) | 60.3% (35) | 0% (0) |
| 2 | `LJLH` | 33 | 6.1% (2) | 3.0% (1) | 0% (0) | 3.0% (1) | 15.2% (5) | 12.1% (4) | 0% (0) | 54.5% (18) | 6.1% (2) |
| 3 | `M MXRT` | 18 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 100.0% (18) | 0% (0) |
| 4 | `B <T H HW>` | 18 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 33.3% (6) | 5.6% (1) | 0% (0) | 61.1% (11) | 0% (0) |
| 5 | `JMJM RBJM` | 16 | 31.2% (5) | 0% (0) | 0% (0) | 0% (0) | 6.2% (1) | 6.2% (1) | 0% (0) | 56.2% (9) | 0% (0) |
| 6 | `B LJLH H HW>` | 15 | 0% (0) | 0% (0) | 0% (0) | 0% (0) | 13.3% (2) | 6.7% (1) | 0% (0) | 80.0% (12) | 0% (0) |
| 7 | `CLCT JMJM` | 10 | 0% (0) | 10.0% (1) | 0% (0) | 10.0% (1) | 20.0% (2) | 0% (0) | 0% (0) | 50.0% (5) | 10.0% (1) |


## 3. perf

|   | marker | occur | impf | impv | infa | infc | perf | ptca | ptcp | wayq | weqt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `H JWM H ZH` | 26 | 26.9% (7) | 3.8% (1) | 0% (0) | 0% (0) | 53.8% (14) | 3.8% (1) | 0% (0) | 3.8% (1) | 7.7% (2) |
| 1 | `>XRJW` | 20 | 5.0% (1) | 0% (0) | 0% (0) | 0% (0) | 75.0% (15) | 0% (0) | 0% (0) | 20.0% (4) | 0% (0) |
| 2 | `B JMJM H HM` | 18 | 22.2% (4) | 0% (0) | 0% (0) | 0% (0) | 50.0% (9) | 11.1% (2) | 0% (0) | 16.7% (3) | 0% (0) |
| 3 | `B JMJW` | 12 | 41.7% (5) | 0% (0) | 0% (0) | 0% (0) | 58.3% (7) | 0% (0) | 0% (0) | 0% (0) | 0% (0) |


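## 4. weqt

|   | marker | occur | impf | impv | infa | infc | perf | ptca | ptcp | wayq | weqt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<D H <RB` | 41 | 29.3% (12) | 0% (0) | 0% (0) | 0% (0) | 2.4% (1) | 0% (0) | 0% (0) | 14.6% (6) | 53.7% (22) |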


# Tense Predominance Averages

How strong is the predominance, on average, for each tense?

In [9]:
# display averages
print()

display(HTML('<span style="text-align:center"><h2>Predominance Averages</h2></span>'))
for tense, marker_list in markers_predominant_50.items():
    
    # get totals
    total_markers = len(marker_list)
    percents = [markers[marker]['tense_percents'][tense] for marker in marker_list]
    sum_percents = sum(percents)
    
    # get average
    average = sum_percents / total_markers
    
    show_average = f'<span style="font-size: 16pt">{tense} - {round(average, 1)}%</span>'

    display(HTML(show_average))
display(HTML('<hr>'))
print()







Interestingly, the average predominance per tense follows the same order as the number of tense-predominant time markers per tense. Those counts are as follows:

1. impf - 12
2. wayq - 8
3. perf - 4
4. weqt - 1

In this initial sampling, the yiqtol is predominant with the most time markers (12) and is also the most strongly predominant (average 70%). The wayyiqtol follows with 8 predominant time markers and a similarly high average predominance (65%).
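
The wayyiqtol, qatal, and weqatal averages can be re-derived by hand from the tables above. The sketch below hard-codes the predominant-tense percentages shown in those tables; the variable names are illustrative and not part of the notebook's data structures. The yiqtol is omitted because the table above shows only 10 of its 12 markers.

```python
# predominant-tense percentages copied from the tables above
shown_percents = {
    'wayq': [55.3, 60.3, 54.5, 100.0, 61.1, 56.2, 80.0, 50.0],
    'perf': [53.8, 75.0, 50.0, 58.3],
    'weqt': [53.7],
}

# unweighted mean per tense: every marker counts equally,
# regardless of how many times it occurs
averages = {tense: sum(percents) / len(percents)
            for tense, percents in shown_percents.items()}

for tense, average in averages.items():
    print(f'{tense} - {round(average, 1)}%')
```

Note that this is an unweighted mean, as in the cell above: a marker with 10 occurrences carries the same weight as one with 76.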

## Inquiry: Do the counts per tense merely reflect tense frequencies in the HB?

What, exactly, are the most common tenses in the Hebrew Bible (HB)? Might the discrepancies seen here simply arise from differences in how often each verb tense occurs?

In [10]:
total_HB_verbs = 0
HB_tense_counts = collections.Counter()

# count all verbs in HB
for word in F.otype.s('word'):
    
    # skip non-verbs
    if F.pdp.v(word) != 'verb':
        continue
    
    tense = 'weqtl' if is_weqt(word) else F.vt.v(word)
    total_HB_verbs += 1
    
    HB_tense_counts[tense] += 1
    
# present verbs/percentages
for tense, count in HB_tense_counts.most_common():
    
    print(f'{tense} - {count} ({round((count/total_HB_verbs)*100, 1)})')

impf - 16099 (23.3)
perf - 15217 (22.0)
wayq - 14974 (21.7)
infc - 6555 (9.5)
weqtl - 5909 (8.6)
ptca - 4985 (7.2)
impv - 4307 (6.2)
ptcp - 676 (1.0)
infa - 300 (0.4)


The top three tenses are fairly even in overall occurrence (impf 23.3%, perf 22.0%, wayq 21.7%). Because the qatal is slightly more prevalent overall than the wayyiqtol, overall frequency alone does not explain why the wayyiqtol is predominant with more time markers (8) than the qatal (4). This makes the results above even more relevant.
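
To make the comparison concrete, the sketch below sets each tense's share of the 25 tense-predominant markers beside its share of all HB verbs, using only the counts and percentages reported in this notebook. The dictionary names are illustrative. The overall count labelled `weqtl` above is matched to `weqt` here on the assumption that both refer to the weqatal.

```python
# number of time markers in which each tense is predominant (from the list above)
predominant_marker_counts = {'impf': 12, 'wayq': 8, 'perf': 4, 'weqt': 1}

# share of all HB verbs per tense, in percent (from the output above)
hb_verb_percents = {'impf': 23.3, 'perf': 22.0, 'wayq': 21.7, 'weqt': 8.6}

total = sum(predominant_marker_counts.values())

marker_shares = {}
for tense, count in predominant_marker_counts.items():
    marker_shares[tense] = 100 * count / total
    print(f'{tense}: {marker_shares[tense]:.1f}% of predominant markers '
          f'vs. {hb_verb_percents[tense]}% of HB verbs')
```

On these figures the wayyiqtol accounts for 32% of the predominant markers against 21.7% of verbs, while the qatal accounts for 16% against 22.0%. This is only a descriptive comparison on a sample of 50 markers, not a significance test.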