# `שכב` valency

In [1]:
from datetime import datetime
last_modified = datetime.now()
print('Notebook last modified on {}'.format(last_modified))

Notebook last modified on 2017-01-02 19:08:44.310778


## Methodology
The goal is to inventory and categorise the various satellites surrounding the verb שכב ("to lie") in Biblical Hebrew in order to determine which elements give rise to which senses of שכב. Valency tracks the interaction between semantics and syntax.<br>
<br>
Dyk et al. suggest that few "watertight" methods exist to separate obligatory complements from non-obligatory adjuncts (see [Dyk, Glanz, Oosting, "Analysing Valence Patterns,"](https://shebanq.ancient-data.org/shebanq/static/docs/methods/2014_Dyk_jnsl.pdf) 4-5). They apply a "distributional method" as follows:

* "Collect all occurrences of a verb with the complete patterns of elements occurring in the data."
* "Sort these by pattern."
* "Analyse the differences between the various patterns, observing what relation the separate sentence constituents have to the verb." *(Dyk et al., 6)*

Which elements to use? Dyk et al. use:
* "predicate (Pred), subject (Subj), object (Objc), complement (Cmpl), adjunct (Adju)." (7)
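
The three steps of the distributional method can be sketched on toy data before we touch the database. This is only an illustration: the clause labels and function lists below are invented, and the variable names (`occurrences`, `by_pattern`, `ranked`) are assumptions, not part of the Dyk et al. procedure.

```python
from collections import defaultdict

# Hypothetical toy data: clause label -> phrase functions in order of occurrence.
# These values are invented for illustration and are not taken from the ETCBC data.
occurrences = {
    'clause_a': ['Pred', 'Cmpl'],
    'clause_b': ['Subj', 'Pred', 'Cmpl'],
    'clause_c': ['Pred', 'Cmpl'],
    'clause_d': ['Pred'],
}

# Steps 1 and 2: collect every occurrence and file it under its pattern
by_pattern = defaultdict(list)
for clause, functions in occurrences.items():
    by_pattern[tuple(functions)].append(clause)

# Step 3 starts from these groups; list the most frequent pattern first
ranked = sorted(by_pattern.items(), key=lambda item: -len(item[1]))
for pattern, clauses in ranked:
    print(len(clauses), pattern, clauses)
```

Keeping the clause labels alongside each pattern means the examples behind any count can be looked up again during manual analysis.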

The valence corrections notebook ([here](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html)) contains good information on procedure. Certain functions are considered "core," i.e., many of the functions above.<br><br>
Let's begin by applying the simplest measures first. We'll keep track of how many examples of the predicate we've accounted for as we work from simpler to more complex patterns. 
<br><br>
**Here are the objectives:**<br>
1. Inventory the relevant phrase functions and organise the instances by them.
    * This part establishes the valency type of the verb. Is it transitive or intransitive? Monovalent, divalent, or trivalent? Are there examples of valence expansion or valence reduction?
2. Further subdivide the general patterns with semantic/lexical distinctions
    * Roorda & Dyk's valency correction notebook (above) further sub-categorises functions based on semantic distinctions such as location, time, instrumentality, and a few more. The database as-is does not contain these distinctions. But some may be inferred from the [features](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html): 
        * **`nametype`**
        * **`gloss`** (used in conjunction with a resource like [WordNet](http://www.nltk.org/howto/wordnet.html) or [FrameNet](http://www.nltk.org/howto/framenet.html) through the NLTK package)
        * **`uvf`** (for ה locative markers).
    * Perhaps also different prepositions might give rise to different senses?

Procedural question: should the order of elements matter? For the time being, let's keep things simple by ignoring the order of elements. This is something that can be analysed secondarily. Or we can go back if necessary. 
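
A minimal sketch of what ignoring order means in practice, assuming patterns are stored as lists of labels. The toy patterns are invented for illustration. Sorting each pattern before counting merges clauses that differ only in constituent order.

```python
from collections import Counter

# Hypothetical toy patterns, listed in order of occurrence within each clause
patterns = [
    ['predicate', 'subject', 'complement'],
    ['subject', 'predicate', 'complement'],
    ['predicate', 'complement'],
]

ordered = Counter(tuple(p) for p in patterns)            # order matters
unordered = Counter(tuple(sorted(p)) for p in patterns)  # order ignored

print(len(ordered))    # 3 distinct keys
print(len(unordered))  # 2 distinct keys
```

If order turns out to matter later, the unsorted tuples can be counted instead without changing anything else.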

In [2]:
import collections as col
from tf.fabric import Fabric

TF = Fabric(modules='Hebrew/etcbc4c')
print()
api = TF.load("""otype
                 book chapter verse
                 function pdp sp vs
                 lex g_cons g_cons_utf8
                 nametype ls gloss uvf
                """)

api.makeAvailableIn(globals()) # so we don't have to say api.F.feature.v() but only F.feature.v()

This is Text-Fabric 2.0.0
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
106 features found and 0 ignored

  0.00s loading features ...
   |     0.04s B otype                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B book                 from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B chapter              from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B verse                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.19s B g_cons               from /Users/Cody/github/text-

In [3]:
# collect all clauses that contain the target verb CKB with a phrase function of predicate

target = 'CKB['
stem = 'qal' # we are only studying the qal stem for now

# derived from valency corrections notebook (Roorda&Dyk)
# for now we take only functions that have a regular verb 
predicate_functions = {'Pred', 'PreS', 'PreO', 'PreC', 'PtcO', 'PrcS'} 
# complements | adjuncts:
cmpl_adj = {'Cmpl','Adju','Time','Supp', 'Objc','Subj','ModS','NCoS','IntS','PrAd'} 

def find_satellites(target, stem, pred_functions, comp_adj):
    '''
    takes a lemma&stem and returns a dict containing:
    results[clause_atom_node] = [phrase_function for pf in relevant_phrase_functions]
    '''
    satellites = dict()
    for word in F.otype.s('word'):
        lex = F.lex.v(word)
        if lex != target:
            continue
        phrase_node = L.u(word, otype='phrase')[0]
        phrase_func = F.function.v(phrase_node)
        if phrase_func not in pred_functions or F.vs.v(word) != stem:
            continue
        clause_node = L.u(phrase_node, otype = 'clause_atom')[0]
        phrase_nodeS = L.d(clause_node, otype = 'phrase')
        phrase_functs = list(F.function.v(phrase) for phrase in phrase_nodeS if F.function.v(phrase)\
                             in pred_functions | comp_adj)
        satellites[clause_node] = phrase_functs
    return satellites

ckb_sats = find_satellites(target, stem, predicate_functions, cmpl_adj)

# display the average length of the gathered data
def avg_data(data_dictionary):
    total_datPoints = len(data_dictionary.values())
    total_datLengths = sum(len(datpoint) for datpoint in data_dictionary.values())
    return round(total_datLengths / total_datPoints, 2)

print()
print('instances found: ', len(ckb_sats))
print('avg. # of satellites: ', avg_data(ckb_sats))        
print()


instances found:  195
avg. # of satellites:  2.24



In [4]:
print('Inventory of all registered functions for CKB:')
print(set(function for function_list in ckb_sats.values() for function in function_list))

Inventory of all registered functions for CKB:
{'Time', 'Cmpl', 'PreC', 'Objc', 'PreO', 'IntS', 'Subj', 'Pred', 'PreS', 'Adju'}


Some of the elements in the function codes above are superfluous or unnecessarily connected. For example: `PreS` with a suffixed subject belongs in the same category as `Pred+Subj`. In the `IntS` function, the interrogative is (at this point) superfluous for phrase-level valency function, but the `S` of subject is necessary. Let's simplify those labels. While we're at it, we'll convert the codes into more readable forms and also convert the `Objc` label into a direct object label.<br><br>
We keep the copy of the original dictionary in case we find that the altered labels contain valuable data later during the analysis stage.

In [5]:
# the new labels will be attached to a string and .split()'ed:
func_convert = {'Subj' : 'subject',    
                'PreC' : 'predicate complement',
                'Adju' : 'adjunct',
                'PreO' : 'predicate direct_object',
                'Pred' : 'predicate',
                'IntS' : 'subject',
                'PreS' : 'predicate subject',
                'Objc' : 'direct_object',
                'Time' : 'adjunct', # cf. Roorda&Dyk notebook
                'Cmpl' : 'complement'
                }

simple_ckb_sats = dict()

for instance, satellites in ckb_sats.items():
    satellites = ' '.join(func_convert[fn] for fn in satellites)
    simple_ckb_sats[instance] = satellites.split()
    
print('\nOLD average num. of satellites per instance:')
print(avg_data(ckb_sats),'\n')
print('New average num. of satellites per instance:')
print(avg_data(simple_ckb_sats))


OLD average num. of satellites per instance:
2.24 

New average num. of satellites per instance:
2.41


^ a higher avg. means we've succeeded in splitting several of the combined satellites.
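
To see why the average rises, here is a small sketch on invented data. The two toy clauses and the names `toy_sats`, `toy_convert`, and `average_length` are assumptions made for illustration; only the conversion values mirror the table in the cell above.

```python
# Hypothetical toy clauses (not from the data), using original ETCBC function codes
toy_sats = {
    'clause_a': ['PreS', 'Cmpl'],          # predicate with suffixed subject
    'clause_b': ['Pred', 'Subj', 'Cmpl'],
}
# A subset of the conversion table used in the cell above
toy_convert = {'PreS': 'predicate subject', 'Pred': 'predicate',
               'Subj': 'subject', 'Cmpl': 'complement'}

def average_length(data):
    """Mean number of labels per clause, rounded to 2 places."""
    return round(sum(len(v) for v in data.values()) / len(data), 2)

# Join the converted labels and split again so combined codes become two labels
toy_simple = {clause: ' '.join(toy_convert[fn] for fn in fns).split()
              for clause, fns in toy_sats.items()}

print(average_length(toy_sats))    # 2.5
print(average_length(toy_simple))  # 3.0
```

After conversion both toy clauses carry the same three labels, which is the effect we want for `PreS` versus `Pred+Subj`.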

Now we're ready to present some data...<br>
I'll be using some HTML tricks inspired by [Gino Kalkman's notebook](https://github.com/ETCBC/Biblical_Hebrew_Analysis/blob/master/Miscellaneous/AsyndeticClauseFunctions.ipynb)

In [6]:
# Let's write some functions for displaying some statistics:

from IPython.display import display, HTML

def generate_table(fieldnames, data, style=''):
    '''
    displays an HTML table when fed a list of fieldnames and a list of tuples in order
    optional style argument is applied to the header cells
    '''
    table_row = '<tr>{}</tr>'  # template for table rows
    # assemble table_header
    table_header = ''
    for fieldname in fieldnames:
        table_header += '<th{style}>{header}</th>'.format(style=style, 
                                                          header=fieldname)
    # assemble table_rows; each data tuple becomes one <tr>
    table_rows = ''
    for data_tuple in data:
        row = ''
        for cell in data_tuple:    # renamed so the data argument is not shadowed
            row += '<td>{cell}</td>'.format(cell=cell)
        table_rows += table_row.format(row)
    # complete the code: header row first, then the rows (already wrapped in <tr>)
    table_code = '<table>' + table_row.format(table_header) + table_rows + '</table>'
    # display the code
    display(HTML(table_code))
    
def percent(amount, total):
    '''
    return a simple percentage
    '''
    return round((amount/total)*100, 2)

In [7]:
# assemble some stats to display:
ckb_simple_sTats = col.Counter()
for instance, pattern in simple_ckb_sats.items():
    ckb_simple_sTats[ tuple(sorted(pattern)) ] += 1
    
# -- data for the HTML viewer -- 

fieldnames = ['Total',' % ', 'Pattern']
all_patts = sum(ckb_simple_sTats.values())
# a list of tuples in fieldname order:
ckb_simple_data = list( (total, percent(total, all_patts), pattern)\
                       for pattern, total in sorted(ckb_simple_sTats.items(), key = lambda k: -k[1])
                      ) 
# display table:
print()
generate_table(fieldnames, ckb_simple_data, style=' style="text-align:center"')
print()




Total,%,Pattern
,,
50.0,25.64,"('complement', 'predicate')"
43.0,22.05,"('complement', 'predicate', 'subject')"
25.0,12.82,"('predicate',)"
13.0,6.67,"('predicate', 'subject')"
12.0,6.15,"('adjunct', 'predicate')"
9.0,4.62,"('complement', 'complement', 'predicate', 'subject')"
9.0,4.62,"('complement', 'complement', 'predicate')"
7.0,3.59,"('adjunct', 'complement', 'predicate')"
7.0,3.59,"('direct_object', 'predicate')"





Take note that the elements in these patterns are sorted alphabetically, not in the order of occurrence.<br><br>

Some first observations:<br>
* ~ **`53.84%`** of the patterns have only
    * 1 complement or
    * 1 adjunct
    
    
* ~ **`19.49%`** with
    * no satellites other than an optional subject
    
    
* ~ **`5.13%`** with
    * 1 direct object
    
The rest of the patterns have complex mixtures of complements, adjuncts, and objects as well as doubles.<br><br>

Perhaps there are some further refinements we can apply to the categories. It's not certain whether the presence of a subject is relevant for the phrase functions. But it's impossible to know until the examples can be manually sorted. Before we move to that stage, we'll add some more information about the lexical and semantic qualities (and subcategories) of the groups observed above.

## Step 2: Lexical and Semantic Categories

We now have some basic groups and information to build further queries upon. This next step entails measuring the lexical and semantic qualities of the שכב satellites. Some features we're looking for:
* locative lexemes - lexemes that imply spatial distinctions, and thus movement
* agentive lexemes - lexemes that imply reception of the action
* instrumental lexemes - lexemes that imply the use of objects/tools in the action

Further subclassifications may break down by preposition use. See, for example, the section of the Roorda&Dyk corrections notebook that deals with ל and כ objects. That notebook already defines some rules for L/K objects:
* "start with either preposition L or K and
* the L or K in question does not carry a pronominal suffix
* should also not be followed by a body part" 
([Roorda&Dyk](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html), "Complements as LK Objects")
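
The three quoted rules can be read as a single yes/no test on a phrase. The sketch below is a hedged illustration only: the word representation (dicts with `lex`, `pdp`, and `suffix` keys), the function name `is_lk_object`, and the two-item body-part list are all assumptions, not the Roorda&Dyk implementation.

```python
# Hypothetical phrase representation: a list of word dicts with
# 'lex' (lexeme), 'pdp' (part of speech) and 'suffix' (pronominal suffix or None).
# The field names and the body-part list are assumptions for illustration.
BODY_PARTS = {'JD/', 'R>C/'}  # illustrative placeholder list, not the notebook's

def is_lk_object(phrase_words):
    """Apply the three quoted rules to one phrase."""
    if not phrase_words:
        return False
    first = phrase_words[0]
    # rule 1: the phrase starts with the preposition L or K
    if first['pdp'] != 'prep' or first['lex'] not in {'L', 'K'}:
        return False
    # rule 2: that preposition carries no pronominal suffix
    if first['suffix'] is not None:
        return False
    # rule 3: the next word, if any, is not a body part
    if len(phrase_words) > 1 and phrase_words[1]['lex'] in BODY_PARTS:
        return False
    return True

example = [{'lex': 'L', 'pdp': 'prep', 'suffix': None},
           {'lex': 'MLK/', 'pdp': 'subs', 'suffix': None}]
print(is_lk_object(example))
```

In the notebook itself the same checks would read their values from Text-Fabric features rather than from hand-built dicts.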

Since these kinds of features are not presently stored in ETCBC4c, we have to use a few tricks to procure them. In some cases, items considered "adjuncts" (i.e., "unnecessary") will need to be reconsidered as complements (key to the meaning). **Locative** and **agentive** lexemes will be identified with the help of the features:
* `nametype`
    * = `topo` (place)
    * = `pers` (person)
    * = `gens` (people)
* `uvf`
    * contains locative ה
* `ls`
    * = `gentilic` (i.e. demonym)
    
I also would like to experiment with deploying Framenet or Wordnet combined with the `gloss` feature as a method of categorising lexemes. This may prove especially useful for **instrumental** terms. 
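
The idea of categorising a lexeme through its gloss can be sketched without any external resource. Everything in the block below is a stand-in: the hypernym chains are hand-written, the category mapping is invented, and `categorise_gloss` is a hypothetical helper. A real run would look the gloss up in a lexical database such as WordNet instead of this dictionary.

```python
# Hypothetical stand-in for a WordNet-style hypernym lookup.
# The chains below are hand-written for illustration only.
TOY_HYPERNYMS = {
    'bed': ['furniture', 'artifact', 'object'],
    'sword': ['weapon', 'instrument', 'artifact', 'object'],
    'father': ['parent', 'relative', 'person'],
}

# Hypothetical mapping from a hypernym to a coarse category
CATEGORY_OF = {'instrument': 'instrumental', 'person': 'agentive'}

def categorise_gloss(gloss):
    """Return the first coarse category found along the hypernym chain, else None."""
    for hypernym in TOY_HYPERNYMS.get(gloss, []):
        if hypernym in CATEGORY_OF:
            return CATEGORY_OF[hypernym]
    return None

for gloss in ('sword', 'father', 'bed'):
    print(gloss, categorise_gloss(gloss))
```

Glosses with several senses would need disambiguation before a lookup like this could be trusted, which is one reason to combine it with the other features.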

In [8]:
# First we need to organise the 4 simple groups we observed above.
# We treat group_4, mixed satellites, as a catch-all for now,
#   so that we can deal with the simpler structures first.

basic_groups = col.defaultdict(set)       # a dict keyed by group, valued by sets of clause nodes
basic_groups_check = col.defaultdict(set) # to double-check our results

# rules for groups based on counts:
         # (adj, compl, d.o.)
group_rules = { (1,0,0) : 'group_1', # 1 adj ø else
                (0,1,0) : 'group_1', # 1 comp ø else
                (0,0,1) : 'group_2', # 1 d.o. ø else
                (0,0,0) : 'group_3', # ø else
               #all else : group_4
              }

for clause_node, pattern in simple_ckb_sats.items():
    adj_count = pattern.count('adjunct')
    comp_count = pattern.count('complement')
    do_count = pattern.count('direct_object')
    count = (adj_count,comp_count,do_count) 
    if count in group_rules:
        group = group_rules[count]
    else:
        group = 'group_4'
    basic_groups[group].add(clause_node)
    basic_groups_check[group].add(tuple(sorted(pattern)))
        
# -- Double Check Our Results -- #
        
total_check = 0
for group, group_nodes in sorted(basic_groups.items()):
    total_check += len(group_nodes)
    print(group+' - ', len(group_nodes), 'members')
    for patt in basic_groups_check[group]:
        print(patt)
    print()
print('Group_all', total_check)

group_1 -  111 members
('complement', 'predicate', 'predicate')
('complement', 'predicate')
('adjunct', 'predicate', 'subject')
('adjunct', 'predicate')
('complement', 'predicate', 'subject')

group_2 -  10 members
('direct_object', 'predicate', 'subject')
('direct_object', 'predicate')

group_3 -  38 members
('predicate',)
('predicate', 'subject')

group_4 -  36 members
('adjunct', 'complement', 'predicate')
('adjunct', 'direct_object', 'predicate')
('complement', 'complement', 'predicate', 'subject')
('direct_object', 'direct_object', 'predicate', 'subject')
('adjunct', 'adjunct', 'complement', 'predicate')
('complement', 'direct_object', 'predicate', 'subject')
('adjunct', 'complement', 'predicate', 'subject')
('complement', 'complement', 'predicate')

Group_all 195


This looks good. We can move on to measuring semantics throughout each group to create more nuanced categories.
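
The total of 195 matches the number of instances, but a matching total alone would not reveal a clause placed in two groups. A stricter check can be sketched as follows. The helper name `check_partition` and the toy data are assumptions for illustration.

```python
def check_partition(groups, all_items):
    """Return True if the groups are disjoint and together cover all_items."""
    seen = set()
    for members in groups.values():
        if seen & members:      # an item already placed in another group
            return False
        seen |= members
    return seen == set(all_items)

# Hypothetical toy data; in the notebook this would be called as
# check_partition(basic_groups, simple_ckb_sats)
toy_groups = {'group_1': {1, 2}, 'group_2': {3}, 'group_3': {4}}
print(check_partition(toy_groups, [1, 2, 3, 4]))     # True
print(check_partition(toy_groups, [1, 2, 3, 4, 5]))  # False
```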

The key to these queries is the **noun**, upon which we'll perform checks for semantic data. The Roorda&Dyk notebook contains a nice example of measuring semantic ideas with a scoring system. I'll keep this in mind as I move ahead...

First, we work with group 1 and inventory the patterns in the complement phrases. The inventory will provide a basis for the semantic work.

In [9]:
# GROUP 1 - SINGLE ADJUNCT OR COMPLEMENTS
# WORD FUNCTION INVENTORIES
# query and inventory word-level, internal phrase functions in group 1

group1_phrases = dict()
group1_stats = col.Counter()

for clause_atom in basic_groups['group_1']:
    phrases = L.d(clause_atom, otype='phrase')
    target_phrase = None
    for phrase in phrases:
        func = F.function.v(phrase)
        simple_form = func_convert[func].split() if func in func_convert else [func]
        target_phrase = phrase if {'adjunct','complement'} & set(simple_form)\
                               else target_phrase
    pattern = tuple(F.pdp.v(word) for word in L.d(target_phrase, otype='word'))
    group1_phrases[clause_atom] = {'phrase':target_phrase, 'pattern':pattern}
    group1_stats[pattern] += 1
    
group1_header = ['Total','Pattern']
display_group1 = list((a,p) for p, a in sorted(group1_stats.items(), key=lambda k: -k[1]))

generate_table(group1_header, display_group1)

Total,Pattern
,
57.0,"('prep', 'subs')"
19.0,"('prep',)"
6.0,"('prep', 'art', 'subs')"
4.0,"('verb',)"
4.0,"('advb',)"
4.0,"('subs',)"
2.0,"('prep', 'subs', 'art', 'subs')"
2.0,"('prep', 'subs', 'prep', 'subs', 'subs')"
2.0,"('prep', 'subs', 'subs')"


The results show the prominent position of the preposition in group 1 constructions.

57 of the results contain only a preposition and a substantive. We'll focus first on this simpler form. Hopefully this will give a basis on which to process the rarer, more complex examples.
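
Narrowing the inventory to one pattern is a simple filter. The sketch below assumes the same shape as `group1_phrases` (clause node mapped to a dict with `phrase` and `pattern` keys). The node numbers are invented and `select_pattern` is a hypothetical helper.

```python
# Hypothetical toy stand-in for group1_phrases:
# clause node -> {'phrase': phrase node, 'pattern': parts of speech}
toy_group1 = {
    101: {'phrase': 11, 'pattern': ('prep', 'subs')},
    102: {'phrase': 12, 'pattern': ('prep',)},
    103: {'phrase': 13, 'pattern': ('prep', 'subs')},
}

def select_pattern(phrases, pattern):
    """Keep only the entries whose part-of-speech pattern equals `pattern`."""
    return {clause: info for clause, info in phrases.items()
            if info['pattern'] == pattern}

prep_subs = select_pattern(toy_group1, ('prep', 'subs'))
print(sorted(prep_subs))  # [101, 103]
```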

### Sense Generator
We're building a sense generator that will return 1 of 3 categories for a given lemma. The categories are:
* person
* place
* thing/object

The machine will use 4 sources to make its decisions: 
1. existing features in the ETCBC
2. lists of categorised lexemes from the [Roorda/Dyk notebook](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html)
3. category lists generated from [openscriptures' BDB lexicon](https://github.com/openscriptures/HebrewLexicon) using part-of-speech tags in BDB
4. category lists generated from Wordnet, using hypernym relations

Sources 3 and 4 are special side-projects that are in progress; see [valency_wordlists](https://github.com/codykingham/textfabric_notebooks/blob/master/valency_wordlists).

Each of the sources will count as 1 or more parameters, that, if fulfilled, will go towards a score for the given lexeme object. The categorisation is based on a simple majority, but scores can also be returned with a "strength", for example, 3/3 or 2/3, depending on how many parameters are met out of how many are applicable. 
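
One way the scoring could work is sketched below. This is an assumption-laden illustration, not the finished sense generator: each parameter is taken to return a category name or `None` when it does not apply, "simple majority" is read as the category with the most votes, and a tie for first place is assumed to yield no category. The function name `categorise` is hypothetical.

```python
from collections import Counter

def categorise(votes):
    """
    votes: one entry per parameter, either a category string or None
    (None means the parameter is not applicable to this lexeme).
    Returns (category, (votes_for_category, applicable_parameters)).
    """
    applicable = [v for v in votes if v is not None]
    if not applicable:
        return None, (0, 0)
    ranked = Counter(applicable).most_common()
    winner, top = ranked[0]
    # assumption: a tie for first place yields no category
    if len(ranked) > 1 and ranked[1][1] == top:
        return None, (top, len(applicable))
    return winner, (top, len(applicable))

print(categorise(['place', 'place', None, 'thing']))  # ('place', (2, 3))
```

The second element of the result corresponds to the "strength" described above, for example 2 out of 3 applicable parameters.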

In [74]:
# ('prep', 'subs'), 57 examples
# LOCATIVITY - how many of these examples have a locative substantive as the obj. of the prep.?

#nametype
    #= topo (place)
    #= pers (person)
    #= gens (people)
#uvf
    #contains locative ה
#ls
    #= gentilic (i.e. demonym)

def test_locative(word_node):
    # SOURCE 1, ETCBC4c
    lex_obj = L.u(word_node, otype = 'lex')[0]
    univalent_final = F.uvf.v(word_node) if F.uvf.v(word_node) == 'H' else None
    name_type = F.nametype.v(lex_obj) if F.nametype.v(lex_obj) == 'topo' else None
    s1_score = len([s for s in (univalent_final, name_type) if s])
    # SOURCE 2, ROORDA/DYK
    lexeme = F.lex.v(word_node)
    locative_lexs = set('''
                            >RY/ >YL/ >XR/ <BR/ <BRH/ <BWR/ <C==/ <JR/ <L=/ 
                            <LJ=/ <LJH/ <LJL/ <MD=/ <MDH/ <MH/ <MQ/ <MQ===/ 
                            <QB/ BJN/ BJT/ CM CMJM/ CMC/ C<R/ DRK/ FDH/ HR/
                            JM/ JRDN/ JRWCLM/ JFR>L/ MDBR/ MW<D/ MWL/ MZBX/ 
                            MYRJM/ MQWM/ MR>CWT/ MSB/ MSBH/ MVH==/ QDM/
                            SBJB/ TJMN/ TXT/ TXWT/ YPWN/
                            '''.strip().split())
    s2_score = len({lexeme}&locative_lexs)
    
    # SOURCE 3, BDB categories list
    # source 4, wordnet categories list