# `שכב` valency

In [25]:
from datetime import datetime
last_modified = datetime.now()
print('Notebook last modified on {}'.format(last_modified))

Notebook last modified on 2016-12-30 23:39:02.631000


## Methodology
The goal is to inventory and categorise the satellites surrounding the verb שכב ("to lie") in Biblical Hebrew in order to determine which elements give rise to which senses of שכב. Valency tracks the interaction between semantics and syntax.<br>
<br>
Dyk et al. suggest that few "watertight" methods exist to separate obligatory complements from non-obligatory adjuncts (see [Dyk, Glanz, Oosting, "Analysing Valence Patterns,"](https://shebanq.ancient-data.org/shebanq/static/docs/methods/2014_Dyk_jnsl.pdf) 4-5). They apply a "distributional method" as follows:

* "Collect all occurrences of a verb with the complete patterns of elements occurring in the data."
* "Sort these by pattern."
* "Analyse the differences between the various patterns, observing what relation the separate sentence constituents have to the verb." *(Dyk et al., 6)*
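The first two steps of the distributional method can be sketched on toy data. This is a minimal illustration, not the procedure of Dyk et al.: the clause identifiers and function lists are hypothetical, and using a sorted tuple as the pattern key is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical toy data: clause id -> phrase functions found in that clause.
occurrences = {
    'clause_a': ['Pred', 'Cmpl'],
    'clause_b': ['Subj', 'Pred', 'Cmpl'],
    'clause_c': ['Cmpl', 'Pred'],
    'clause_d': ['Pred'],
}

def group_by_pattern(occurrences):
    """Collect every occurrence and file it under its pattern (steps 1 and 2)."""
    groups = defaultdict(list)
    for clause, functions in occurrences.items():
        pattern = tuple(sorted(functions))  # assumption: element order is ignored
        groups[pattern].append(clause)
    return dict(groups)

groups = group_by_pattern(occurrences)

# Step 3 (the analysis itself) starts from this grouped view,
# here listed with the most frequent pattern first.
for pattern, clauses in sorted(groups.items(), key=lambda kv: -len(kv[1])):
    print(len(clauses), ' + '.join(pattern), clauses)
```

Grouping by pattern first makes it easy to compare clauses that differ in only one constituent, which is where the third step looks for meaning differences.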

Which elements to use? Dyk et al. use:
* "predicate (Pred), subject (Subj), object (Objc), complement (Cmpl), adjunct (Adju)." (7)

The valence corrections notebook ([here](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html)) contains good information on procedure. Certain functions are considered "core," i.e., many of the functions above.<br><br>
Let's begin by applying the simplest measures first. We'll keep track of how many examples of the predicate we've accounted for as we work from simpler to more complex patterns. 
<br><br>
**Here are the objectives:**<br>
1. Inventory the relevant phrase functions in each clause; organise the occurrences by the resulting patterns.
    * This part establishes the valency type of the verb. Is it transitive or intransitive? Monovalent, divalent, or trivalent? Are there examples of valence expansion or valence reduction?
2. Further subdivide the general patterns with semantic/lexical distinctions
    * Roorda & Dyk's valency correction notebook (above) further sub-categorises functions based on semantic distinctions such as location, time, instrumentality, and a few more. The database as-is does not contain these distinctions. But some may be inferred from the [features](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html): 
        * **`nametype`**
        * **`gloss`** (used in conjunction with a resource like [WordNet](http://www.nltk.org/howto/wordnet.html) or [FrameNet](http://www.nltk.org/howto/framenet.html) through the NLTK package)
        * **`uvf`** (for ה locative markers).
    * Perhaps also different prepositions might give rise to different senses?
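The second objective could be approached along the following lines. This is only a sketch on hypothetical toy data: the word records, the `'prep'` value for `pdp`, and the `'H'` value for `uvf` are assumptions for illustration and should be checked against the feature documentation before being applied to the corpus.

```python
# Hypothetical toy data: each complement phrase is a list of word records with
# the `lex`, `pdp`, and `uvf` values one would read from the database.
# The values are illustrative assumptions, not corpus data.
complements = {
    'phrase_1': [{'lex': '<M', 'pdp': 'prep', 'uvf': 'absent'},
                 {'lex': '>B/', 'pdp': 'subs', 'uvf': 'absent'}],
    'phrase_2': [{'lex': 'B', 'pdp': 'prep', 'uvf': 'absent'},
                 {'lex': 'MQWM/', 'pdp': 'subs', 'uvf': 'absent'}],
    'phrase_3': [{'lex': '>RY/', 'pdp': 'subs', 'uvf': 'H'}],
    'phrase_4': [{'lex': 'CM', 'pdp': 'advb', 'uvf': 'absent'}],
}

def classify_complement(words):
    """Return a coarse label: leading preposition, locative ending, or 'other'."""
    first = words[0]
    if first['pdp'] == 'prep':
        return 'prep:' + first['lex']
    if any(word['uvf'] == 'H' for word in words):
        return 'locative_he'
    return 'other'

labels = {phrase: classify_complement(words) for phrase, words in complements.items()}
for phrase, label in labels.items():
    print(phrase, label)
```

Labels of this kind could then be attached to the `complement` element of a pattern, so that complements with different prepositions are counted separately.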

Procedural question: should the order of elements matter? For the time being, let's keep things simple by ignoring the order of elements. This is something that can be analysed secondarily. Or we can go back if necessary. 
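To make the choice concrete, here is a small sketch of what ignoring order means in practice. The clauses are hypothetical; the point is only that a sorted key merges patterns that an ordered key keeps apart.

```python
from collections import Counter

# Hypothetical clauses, each listed in its order of occurrence.
clauses = [
    ['subject', 'predicate', 'complement'],
    ['predicate', 'subject', 'complement'],
    ['predicate', 'complement'],
]

# Order-insensitive: sort the elements before using them as a key.
unordered = Counter(tuple(sorted(clause)) for clause in clauses)

# Order-sensitive: keep the elements as they occur.
ordered = Counter(tuple(clause) for clause in clauses)

print(len(unordered), 'patterns when order is ignored')
print(len(ordered), 'patterns when order is kept')
```

Keeping the ordered lists around costs little, so the order-sensitive view stays available if it turns out to matter later.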

In [2]:
import collections as col
from tf.fabric import Fabric

TF = Fabric(modules='Hebrew/etcbc4c')
print()
api = TF.load("""otype
                 book chapter verse
                 function pdp vs
                 lex g_cons g_cons_utf8
                 nametype gloss uvf
                """)

api.makeAvailableIn(globals()) # so we don't have to say api.F.feature.v() but only F.feature.v()

This is Text-Fabric 2.0.0
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
106 features found and 0 ignored

  0.00s loading features ...
   |     0.05s B otype                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B book                 from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B chapter              from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B verse                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.18s B g_cons               from /Users/Cody/github/text-

In [3]:
# collect all clauses that contain the target verb CKB with a phrase function of predicate

target = 'CKB['
stem = 'qal' # we are only studying the qal stem for now

# derived from valency corrections notebook (Roorda&Dyk)
# for now we take only functions that have a regular verb 
predicate_functions = {'Pred', 'PreS', 'PreO', 'PreC', 'PtcO', 'PrcS'} 
# complements | adjuncts:
cmpl_adj = {'Cmpl','Adju','Time','Supp', 'Objc','Subj','ModS','NCoS','IntS','PrAd'} 

def find_satellites(target, stem, pred_functions, comp_adj):
    '''
    takes a lexeme and a verbal stem and returns a dict of the form:
    satellites[clause_atom_node] = [function of each phrase in the clause atom
                                    whose function is in pred_functions | comp_adj]
    '''
    satellites = dict()
    for word in F.otype.s('word'):
        lex = F.lex.v(word)
        if lex != target:
            continue
        phrase_node = L.u(word, otype='phrase')[0]
        phrase_func = F.function.v(phrase_node)
        if phrase_func not in pred_functions or F.vs.v(word) != stem:
            continue
        clause_node = L.u(phrase_node, otype = 'clause_atom')[0]
        phrase_functs = [F.function.v(phrase) for phrase in L.d(clause_node, otype='phrase')
                         if F.function.v(phrase) in pred_functions | comp_adj]
        satellites[clause_node] = phrase_functs
    return satellites

ckb_sats = find_satellites(target, stem, predicate_functions, cmpl_adj)

# display the average length of the gathered data
def avg_data(data_dictionary):
    total_datPoints = len(data_dictionary.values())
    total_datLengths = sum(len(datpoint) for datpoint in data_dictionary.values())
    return round(total_datLengths / total_datPoints, 2)

print()
print('instances found: ', len(ckb_sats))
print('avg. # of satellites: ', avg_data(ckb_sats))        
print()


instances found:  195
avg. # of satellites:  2.24



In [4]:
print('Inventory of all registered functions for CKB:')
print(set(function for function_list in ckb_sats.values() for function in function_list))

Inventory of all registered functions for CKB:
{'Time', 'PreC', 'PreS', 'Subj', 'Pred', 'Objc', 'Adju', 'PreO', 'IntS', 'Cmpl'}


Some of the elements in the function codes above are superfluous or unnecessarily connected. For example: `PreS` with a suffixed subject belongs in the same category as `Pred+Subj`. In the `IntS` function, the interrogative is (at this point) superfluous for phrase-level valency function, but the `S` of subject is necessary. Let's simplify those labels. While we're at it, we'll convert the codes into more readable forms and also convert the `Objc` label into a direct object label.<br><br>
We keep a copy of the original dictionary in case the original labels turn out to carry valuable information during the analysis stage.

In [5]:
func_convert = {'Subj' : 'subject',    # multi-word labels are joined into one string and then .split() apart
                'PreC' : 'predicate complement',
                'Adju' : 'adjunct',
                'PreO' : 'predicate direct_object',
                'Pred' : 'predicate',
                'IntS' : 'subject',
                'PreS' : 'predicate subject',
                'Objc' : 'direct_object',
                'Time' : 'adjunct', # cf. Roorda&Dyk notebook
                'Cmpl' : 'complement'
                }

simple_ckb_sats = dict()

for instance, satellites in ckb_sats.items():
    satellites = ' '.join(func_convert[fn] for fn in satellites)
    simple_ckb_sats[instance] = satellites.split()
    
print('\nOLD average num. of satellites per instance:')
print(avg_data(ckb_sats),'\n')
print('New average num. of satellites per instance:')
print(avg_data(simple_ckb_sats))


OLD average num. of satellites per instance:
2.24 

New average num. of satellites per instance:
2.41


The higher average shows that several combined labels (such as `PreS`) have been split into separate satellites.
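A two-clause toy example shows why the average rises. The clause names and the reduced mapping are hypothetical; the join-then-split step mirrors the conversion cell above.

```python
# Reduced, hypothetical mapping in the style of func_convert.
demo_convert = {
    'PreS': 'predicate subject',  # combined label: expands to two satellites
    'Pred': 'predicate',
    'Cmpl': 'complement',
}

def simplify(functions, mapping):
    """Replace each code by its label(s), then split combined labels apart."""
    return ' '.join(mapping[fn] for fn in functions).split()

def average_length(data):
    """Average number of satellites per instance, rounded to two decimals."""
    return round(sum(len(values) for values in data.values()) / len(data), 2)

before = {'clause_x': ['PreS', 'Cmpl'], 'clause_y': ['Pred', 'Cmpl']}
after = {clause: simplify(functions, demo_convert) for clause, functions in before.items()}

print(after['clause_x'])       # ['predicate', 'subject', 'complement']
print(average_length(before))  # 2.0
print(average_length(after))   # 2.5
```

Only `clause_x` grows, because `PreS` is the only combined label in the toy data; the average moves from 2.0 to 2.5 for that reason alone.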

Now we're ready to present some data...<br>
I'll be using some HTML tricks inspired by [Gino Kalkman's notebook](https://github.com/ETCBC/Biblical_Hebrew_Analysis/blob/master/Miscellaneous/AsyndeticClauseFunctions.ipynb)

In [23]:
from IPython.display import display, HTML

def generate_table(fieldnames, data, style=''):
    '''
    displays an HTML table when fed a fieldnames list and a list of dicts with fieldname keys
    optional style argument: an attribute string added to each header cell
    follows similar structure to csv dictwriter args
    '''
    table_row = '<tr>{data}</tr>' # template for table rows
    # assemble table_header
    table_header = ''
    for fieldname in fieldnames:
        table_header += '<th{style}>{header}</th>'.format(header = fieldname, style = style)
    # assemble table_data: one <tr> per data dict
    table_data = ''
    for data_dict in data:
        row = ''
        for fieldname in fieldnames:
            row += '<td>{data}</td>'.format(data = data_dict[fieldname])
        table_data += table_row.format(data = row)
    # complete the code; the data rows are already wrapped in <tr>, so they are appended as-is
    table_code = '<table>' + table_row.format(data = table_header) + table_data + '</table>'
    # display the code
    display(HTML(table_code))

In [24]:
# assemble some stats to display:
ckb_simple_sTats = col.Counter()
for instance, pattern in simple_ckb_sats.items():
    ckb_simple_sTats[ tuple(sorted(pattern)) ] += 1
    
# data for the HTML viewer
fieldnames = ['Total','Pattern']
ckb_simple_data = list( {'Total':total,'Pattern':', '.join(pattern)} \
                       for pattern, total in sorted(ckb_simple_sTats.items(), key = lambda k: -k[1])) 

generate_table(fieldnames, ckb_simple_data, style=' style="text-align:center"')

| Total | Pattern |
|---|---|
| 50 | complement, predicate |
| 43 | complement, predicate, subject |
| 25 | predicate |
| 13 | predicate, subject |
| 12 | adjunct, predicate |
| 9 | complement, complement, predicate, subject |
| 9 | complement, complement, predicate |
| 7 | adjunct, complement, predicate |
| 7 | direct_object, predicate |


Take note that the elements in these patterns are sorted alphabetically, not in the order of occurrence.
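To keep track of how many instances each pattern accounts for, the counts can be accumulated against the 195 instances found earlier. The sketch below hard-codes the rows of the table as they appear in this notebook; it is a bookkeeping aid under that assumption, not part of the analysis itself.

```python
total_instances = 195  # 'instances found' reported by the search cell

# Rows copied from the table above, most frequent pattern first.
pattern_counts = [
    ('complement, predicate', 50),
    ('complement, predicate, subject', 43),
    ('predicate', 25),
    ('predicate, subject', 13),
    ('adjunct, predicate', 12),
    ('complement, complement, predicate, subject', 9),
    ('complement, complement, predicate', 9),
    ('adjunct, complement, predicate', 7),
    ('direct_object, predicate', 7),
]

running = 0
for pattern, count in pattern_counts:
    running += count
    share = running / total_instances
    print('{:>3}  {:>6.1%}  {}'.format(running, share, pattern))

unaccounted = total_instances - running
print('instances not covered by the listed patterns:', unaccounted)
```

The listed rows sum to 175, so 20 of the 195 instances belong to patterns that do not appear in the table as reproduced here.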