# `שכב` valency

In [1]:
from datetime import datetime
last_modified = datetime.now()
print('Notebook last modified on {}'.format(last_modified.__str__()))

Notebook last modified on 2017-01-15 17:43:14.195322


## Methodology
The goal is to inventory and categorise the satellites surrounding the verb שכב ("to lie") in Biblical Hebrew in order to determine which elements give rise to which senses of שכב. Valency tracks the interaction between semantics and syntax.<br>
<br>
Dyk et al. suggest that few "watertight" methods exist for separating obligatory complements from non-obligatory adjuncts (see [Dyk, Glanz, Oosting, "Analysing Valence Patterns,"](https://shebanq.ancient-data.org/shebanq/static/docs/methods/2014_Dyk_jnsl.pdf) 4-5). They apply a "distributional method" as follows:

* "Collect all occurrences of a verb with the complete patterns of elements occurring in the data."
* "Sort these by pattern."
* "Analyse the differences between the various patterns, observing what relation the separate sentence constituents have to the verb." *(Dyk et al., 6)*
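
The three steps quoted above can be sketched on toy data. This is only an illustration: the references and patterns below are hypothetical placeholders, not data from the ETCBC database.

```python
from collections import defaultdict

# Hypothetical occurrences: (reference label, phrase functions found in the clause).
occurrences = [
    ('ref_a', ['Pred', 'Cmpl']),
    ('ref_b', ['Subj', 'Pred', 'Cmpl']),
    ('ref_c', ['Pred', 'Cmpl']),
    ('ref_d', ['Pred']),
]

# Step 1: collect all occurrences together with their complete patterns.
by_pattern = defaultdict(list)
for ref, functions in occurrences:
    by_pattern[tuple(functions)].append(ref)

# Step 2: sort by pattern, so that identical patterns sit side by side.
sorted_patterns = sorted(by_pattern.items())

# Step 3 is analytical work; here we only print each pattern for manual comparison.
for pattern, refs in sorted_patterns:
    print(pattern, len(refs), refs)
```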

Which elements to use? Dyk et al. use:
* "predicate (Pred), subject (Subj), object (Objc), complement (Cmpl), adjunct (Adju)." (7)

The [Roorda/Dyk valency corrections notebook](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html) informed and inspired many of the queries created in this notebook.<br><br>
Let's begin by applying the simplest measures first. We'll keep track of how many examples of the predicate we've accounted for as we work from simpler to more complex patterns. 
<br><br>
**Here are the objectives:**<br>
1. Inventory the relevant phrase functions in each clause; organise the clauses into groups by these functions.
    * This part establishes the valency type of the verb. Is it transitive or intransitive? Monovalent, divalent, or trivalent? Are there examples of valence expansion or valence reduction?
2. Further subdivide the general patterns with semantic/lexical distinctions
    * Roorda & Dyk's valency correction notebook (above) further sub-categorises functions based on semantic distinctions such as location, time, instrumentality, and a few more. The database as-is does not contain these distinctions. But some may be inferred from the [features](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html): 
        * **`nametype`**
        * **`gloss`** (used in conjunction with a resource like [WordNet](http://www.nltk.org/howto/wordnet.html) or [FrameNet](http://www.nltk.org/howto/framenet.html) through the NLTK package)
        * **`uvf`** (for ה locative markers).
    * Perhaps also different prepositions might give rise to different senses?

Procedural question: should the order of elements matter? For the time being, let's keep things simple by ignoring the order of elements. This is something that can be analysed secondarily. Or we can go back if necessary. 
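
A minimal sketch of what ignoring order means in practice, assuming each clause is represented as a list of simplified function labels (the clauses below are made up for illustration). Sorting each pattern before counting collapses different surface orders into one key, while a second counter keeps the original order available for later analysis.

```python
from collections import Counter

# Hypothetical clauses containing the same elements in different surface orders.
clauses = [
    ['subject', 'predicate', 'complement'],
    ['predicate', 'complement', 'subject'],
    ['predicate', 'complement'],
]

# Order-insensitive inventory: sort the labels so that order no longer matters.
unordered = Counter(tuple(sorted(clause)) for clause in clauses)

# Order-sensitive inventory, kept aside in case order turns out to be relevant.
ordered = Counter(tuple(clause) for clause in clauses)

print(unordered)
print(ordered)
```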

In [2]:
from collections import Counter, defaultdict
from pprint import pprint
from tf.fabric import Fabric

TF = Fabric(modules='Hebrew/etcbc4c')
print()
api = TF.load("""otype
                 book chapter verse
                 function pdp sp vs prs st rela
                 lex g_cons_utf8 g_word_utf8 g_prs_utf8 nu
                 lex_utf8
                 nametype ls gloss uvf
                 freq_lex
                """)

api.makeAvailableIn(globals()) # so we don't have to say api.F.feature.v() but only F.feature.v()

This is Text-Fabric 2.3.0
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
109 features found and 0 ignored

  0.00s loading features ...
   |     0.04s B otype                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B book                 from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.00s B chapter              from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B verse                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.18s B g_cons_utf8          from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.19

In [3]:
# collect all clauses that contain the target verb CKB with a phrase function of predicate

target = 'CKB['
stem = 'qal' # we are only studying the qal stem for now

# derived from valency corrections notebook (Roorda&Dyk)
# for now we take only functions that have a regular verb 
predicate_functions = {'Pred', 'PreS', 'PreO', 'PreC', 'PtcO', 'PrcS'} 
# complements | adjuncts:
cmpl_adj = {'Cmpl','Adju','Loca','Supp', 'Objc','Subj','ModS','NCoS','IntS','PrAd'} 

def find_satellites(target, stem, pred_functions, comp_adj):
    '''
    takes a lemma&stem and returns a dict containing:
    results[clause_atom_node] = [phrase_function for pf in relevant_phrase_functions]
    '''
    satellites = dict()
    for word in F.otype.s('word'):
        lex = F.lex.v(word)
        if lex != target:
            continue
        phrase_node = L.u(word, otype='phrase')[0]
        phrase_func = F.function.v(phrase_node)
        if phrase_func not in pred_functions or F.vs.v(word) != stem:
            continue
        clause_node = L.u(phrase_node, otype = 'clause_atom')[0]
        phrase_nodeS = L.d(clause_node, otype = 'phrase')
        phrase_functs = list(F.function.v(phrase) for phrase in phrase_nodeS if F.function.v(phrase)\
                             in pred_functions | comp_adj)
        satellites[clause_node] = phrase_functs
    return satellites

ckb_sats = find_satellites(target, stem, predicate_functions, cmpl_adj)

# display the average length of the gathered data
def avg_data(data_dictionary):
    total_datPoints = len(data_dictionary.values())
    total_datLengths = sum(len(datpoint) for datpoint in data_dictionary.values())
    return round(total_datLengths / total_datPoints, 2)

print()
print('instances found: ', len(ckb_sats))
print('avg. # of satellites: ', avg_data(ckb_sats))        
print()


instances found:  195
avg. # of satellites:  2.21



In [4]:
print('Inventory of all registered functions for CKB:')
print(set(function for function_list in ckb_sats.values() for function in function_list))

Inventory of all registered functions for CKB:
{'Cmpl', 'Loca', 'Objc', 'Pred', 'PreC', 'PreS', 'Adju', 'PreO', 'Subj', 'IntS'}


Some of the elements in the function codes above are superfluous or unnecessarily connected. For example: `PreS` with a suffixed subject belongs in the same category as `Pred+Subj`. In the `IntS` function, the interrogative is (at this point) superfluous for phrase-level valency function, but the `S` of subject is necessary. Let's simplify those labels. While we're at it, we'll convert the codes into more readable forms and also convert the `Objc` label into a direct object label.<br><br>
We keep the copy of the original dictionary in case we find that the altered labels contain valuable data later during the analysis stage.

In [5]:
# the new labels will be attached to a string and .split()'ed:
func_convert = {'Subj' : 'subject',    
                'PreC' : 'predicate complement',
                'Adju' : 'adjunct',
                'PreO' : 'predicate direct_object',
                'Pred' : 'predicate',
                'IntS' : 'subject',
                'PreS' : 'predicate subject',
                'Objc' : 'direct_object',
                'Cmpl' : 'complement',
                'Loca' : 'complement'
                }

simple_ckb_sats = dict()

for instance, satellites in ckb_sats.items():
    satellites = ' '.join(func_convert[fn] for fn in satellites)
    simple_ckb_sats[instance] = satellites.split()
    
print('\nOLD average num. of satellites per instance:')
print(avg_data(ckb_sats),'\n')
print('New average num. of satellites per instance:')
print(avg_data(simple_ckb_sats))


OLD average num. of satellites per instance:
2.21 

New average num. of satellites per instance:
2.38


^ a higher avg. means we've succeeded in splitting several of the combined satellites.
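
To see why the average rises, here is a toy calculation. The clause ids and the small conversion table are hypothetical stand-ins for `ckb_sats` and `func_convert`; the point is only that a combined code such as `PreS` counts once before the split and twice after it.

```python
# Hypothetical clause ids mapped to ETCBC-style function codes.
toy_sats = {1: ['PreS', 'Cmpl'], 2: ['Pred', 'Subj', 'Cmpl']}

# A cut-down conversion table in the style of func_convert.
toy_convert = {
    'PreS': 'predicate subject',
    'Pred': 'predicate',
    'Subj': 'subject',
    'Cmpl': 'complement',
}

def average_length(data):
    """Average number of labels per clause, rounded to two decimals."""
    return round(sum(len(labels) for labels in data.values()) / len(data), 2)

# Join the converted labels and split again, so 'predicate subject' becomes two labels.
toy_simple = {
    clause: ' '.join(toy_convert[code] for code in codes).split()
    for clause, codes in toy_sats.items()
}

before = average_length(toy_sats)    # (2 + 3) / 2
after = average_length(toy_simple)   # (3 + 3) / 2
print(before, after)
```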

Now we're ready to present some data...<br>
I'll be using some HTML tricks inspired by [Gino Kalkman's notebook](https://github.com/ETCBC/Biblical_Hebrew_Analysis/blob/master/Miscellaneous/AsyndeticClauseFunctions.ipynb)

In [6]:
# Let's write some functions for displaying some statistics:

from IPython.display import display, HTML

def generate_table(fieldnames, data, style=''):
    '''
    returns HTML table when fed a fieldnames list and list of tuples in order
    optional argument to configure text style
    '''
    table_code = '<table>'     # here is where all the code will be assembled
    table_row = '<tr>{}</tr>'  # template for table rows
    # assemble table_header
    table_header = ''
    for fieldname in fieldnames:
        table_header += '<th{style}>{header}</th>'.format(style=style, 
                                                          header=fieldname)
    # assemble table_rows
    table_rows = ''
    for data_tuple in data:
        row = ''
        for cell in data_tuple:    # 'cell' avoids shadowing the 'data' argument
            row += '<td{style}>{data}</td>'.format(style='',
                                                   data=cell)
        table_rows += table_row.format(row)
    # complete the code:
    table_code += table_row.format(table_header)
    table_code += table_rows   # every row is already wrapped in <tr>...</tr>
    table_code += '</table>'
    # display the code
    display(HTML(table_code))
    
def percent(amount, total):
    '''
    return a simple percentage
    '''
    return round((amount/total)*100, 2)

In [7]:
# assemble some stats to display:
ckb_simple_sTats = Counter()
for instance, pattern in simple_ckb_sats.items():
    ckb_simple_sTats[ tuple(sorted(pattern)) ] += 1
    
# -- data for the HTML viewer -- 

fieldnames = ['Total',' % ', 'Pattern']
all_patts = sum(ckb_simple_sTats.values())
# a list of (total, percent, pattern) tuples, most frequent pattern first:
ckb_simple_data = list( (total, percent(total, all_patts), pattern)\
                       for pattern, total in sorted(ckb_simple_sTats.items(), key = lambda k: -k[1])
                      ) 
# display table:
print()
generate_table(fieldnames, ckb_simple_data, style=' style="text-align:center"')
print()




| Total | % | Pattern |
|---|---|---|
| 54 | 27.69 | ('complement', 'predicate') |
| 42 | 21.54 | ('complement', 'predicate', 'subject') |
| 26 | 13.33 | ('predicate',) |
| 16 | 8.21 | ('predicate', 'subject') |
| 11 | 5.64 | ('complement', 'complement', 'predicate', 'subject') |
| 11 | 5.64 | ('complement', 'complement', 'predicate') |
| 8 | 4.1 | ('direct_object', 'predicate') |
| 7 | 3.59 | ('adjunct', 'predicate') |
| 6 | 3.08 | ('adjunct', 'complement', 'predicate') |





Take note that the elements in these patterns are sorted alphabetically, not in the order of occurrence.<br><br>

Some first observations:<br>
* ~ **`54.36%`** of the instances (106 of 195) have only
    * 1 complement or
    * 1 adjunct
    
    
* ~ **`21.54%`** (42 of 195) with
    * no other elements besides an optional subject
    
    
* ~ **`5.64%`** (11 of 195) with
    * 1 direct object
    
The rest of the patterns have complex mixtures of complements, adjuncts, and objects as well as doubles.<br><br>

Perhaps there are some further refinements we can apply to the categories. It's not certain whether the presence of a subject is relevant for the phrase functions. But it's impossible to know until the examples can be manually sorted. Before we move to that stage, we'll add some more information about the lexical and semantic qualities (and subcategories) of the groups observed above.

## Step 2: Lexical and Semantic Categories

We now have some basic groups and information to build further queries upon. This next step entails measuring the lexical and semantic qualities of the שכב satellites. Some features we're looking for:
* locative lexemes - lexemes that imply spatial distinctions, and thus movement
* agentive lexemes - lexemes that imply reception of the action
* instrumental lexemes - lexemes that imply the use of objects/tools in the action

Further subclassifications may break down by preposition use. See, for example, the treatment of ל and כ objects in the Roorda&Dyk corrections notebook. That notebook already defines some rules for L/K objects:
* "start with either preposition L or K and
* the L or K in question does not carry a pronominal suffix
* should also not be followed by a body part" 
([Roorda&Dyk](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html), "Complements as LK Objects")
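
The three quoted rules could be expressed as a single predicate. The sketch below is an assumption-laden illustration: words are plain dictionaries rather than Text-Fabric nodes, `has_suffix` is a hypothetical field, and `BODY_PARTS` holds two illustrative lexemes rather than the list used by Roorda & Dyk.

```python
# Illustrative body-part lexemes only; not the list from the corrections notebook.
BODY_PARTS = {'JD/', 'R>C/'}

def is_lk_object(phrase_words):
    """
    phrase_words: list of dicts with hypothetical keys 'lex' and 'has_suffix'.
    Returns True when the phrase satisfies all three L/K object rules.
    """
    first = phrase_words[0]
    # Rule 1: the phrase starts with the preposition L or K.
    if first['lex'] not in {'L', 'K'}:
        return False
    # Rule 2: that preposition carries no pronominal suffix.
    if first['has_suffix']:
        return False
    # Rule 3: the preposition is not followed by a body part.
    following = phrase_words[1:]
    if following and following[0]['lex'] in BODY_PARTS:
        return False
    return True

example = [{'lex': 'L', 'has_suffix': False}, {'lex': 'MLK/', 'has_suffix': False}]
print(is_lk_object(example))
```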

Since these kinds of features are not presently stored in ETCBC4c, we have to use a few tricks to procure them. In some cases, items considered "adjuncts" (i.e., "unnecessary") will need to be reconsidered as complements (key to the meaning). **Locative** and **agentive** lexemes will be identified with the help of the following features:
* `nametype`
    * = `topo` (place)
    * = `pers` (person)
    * = `gens` (people)
* `uvf`
    * contains locative ה
* `ls`
    * = `gntl` (gentilic, i.e. demonym)
    
I also would like to experiment with deploying Framenet or Wordnet combined with the `gloss` feature as a method of categorising lexemes. This may prove especially useful for **instrumental** terms. 
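
As a rough idea of how hypernym relations might categorise a gloss, the sketch below walks up a hypernym chain until it meets a target concept. The `HYPERNYMS` mapping is a hand-made stand-in, not actual WordNet data, and the function name is hypothetical; a real version would query WordNet through NLTK instead.

```python
# Hand-made stand-in for a hypernym hierarchy; NOT actual WordNet data.
HYPERNYMS = {
    'bed': 'furniture',
    'furniture': 'artifact',
    'sword': 'weapon',
    'weapon': 'instrument',
    'instrument': 'artifact',
    'artifact': 'object',
}

def has_hypernym(gloss, target):
    """Return True if `target` appears anywhere above `gloss` in the toy hierarchy."""
    seen = set()
    current = gloss
    # Climb one level at a time; `seen` guards against accidental cycles.
    while current in HYPERNYMS and current not in seen:
        seen.add(current)
        current = HYPERNYMS[current]
        if current == target:
            return True
    return False

print(has_hypernym('sword', 'instrument'))
print(has_hypernym('bed', 'instrument'))
```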

In [8]:
# First we need to organise the 4 simple groups we observed above.
# We treat group_4, mixed satellites, as a catch-all for now,
#   so that we can deal with the simpler structures first.

basic_groups = defaultdict(set)       # a dict keyed by group, valued by sets of clause nodes
basic_groups_check = defaultdict(set) # to double-check our results

# rules for groups based on counts:
         # (adj, compl, d.o.)
group_rules = { (1,0,0) : 'group_1', # 1 adj ø else
                (0,1,0) : 'group_1', # 1 comp ø else
                (0,0,1) : 'group_2', # 1 d.o. ø else
                (0,0,0) : 'group_3', # ø else
               #all else : group_4
              }

for clause_node, pattern in simple_ckb_sats.items():
    adj_count = pattern.count('adjunct')
    comp_count = pattern.count('complement')
    do_count = pattern.count('direct_object')
    count = (adj_count,comp_count,do_count) 
    if count in group_rules:
        group = group_rules[count]
    else:
        group = 'group_4'
    basic_groups[group].add(clause_node)
    basic_groups_check[group].add(tuple(sorted(pattern)))
        
# -- Double Check Our Results -- #
        
total_check = 0
for group, group_nodes in sorted(basic_groups.items()):
    total_check += len(group_nodes)
    print(group+' - ', len(group_nodes), 'members')
    for patt in basic_groups_check[group]:
        print(patt)
    print()
print('Group_all', total_check)

group_1 -  106 members
('complement', 'predicate', 'predicate')
('complement', 'predicate')
('adjunct', 'predicate')
('complement', 'predicate', 'subject')
('adjunct', 'predicate', 'subject')

group_2 -  11 members
('direct_object', 'predicate', 'subject')
('direct_object', 'predicate')

group_3 -  42 members
('predicate', 'subject')
('predicate',)

group_4 -  36 members
('adjunct', 'complement', 'predicate', 'subject')
('complement', 'complement', 'predicate')
('adjunct', 'complement', 'predicate')
('complement', 'direct_object', 'predicate', 'subject')
('adjunct', 'direct_object', 'predicate')
('direct_object', 'direct_object', 'predicate', 'subject')
('complement', 'complement', 'predicate', 'subject')

Group_all 195


This looks good. We can move on to measuring semantics throughout each group to create more nuanced categories.

The key to these queries is the **noun**, upon which we'll perform checks for semantic data. The Roorda&Dyk notebook contains a nice example of measuring semantic ideas with a scoring system. I'll keep this in mind as I move ahead...

First, we work with group 1 and inventory the patterns in the complement phrases. The inventory will provide a basis for the semantic work.

In [9]:
# GROUP 1 - SINGLE ADJUNCT OR COMPLEMENTS
# WORD FUNCTION INVENTORIES
# query and inventory word-level, internal phrase functions in group 1

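
def getPhraseInventory(group, targetElements):
    '''
    for each clause atom in the group, find the phrase whose simplified function
    is in targetElements; returns (phrase data per clause atom, Counter of word-level patterns)
    '''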
    group_phrases = dict()
    group_stats = Counter()
    for clause_atom in group:
        phrases = L.d(clause_atom, otype='phrase')
        target_phrase = None
        for phrase in phrases:
            func = F.function.v(phrase)
            simple_form = func_convert[func].split() if func in func_convert else func
            target_phrase = phrase if targetElements & set(simple_form)\
                                   else target_phrase
        
        pattern = tuple(F.pdp.v(word) for word in L.d(target_phrase, otype='word'))
        
        # nouns (and adverbs/adjectives) that do not function as a subject
        objNouns = tuple(w for w in L.d(target_phrase, otype='word') 
                           if all([
                                   F.pdp.v(w) in {'subs','prps','prin','adjv','advb'}
                                   or F.prs.v(w) not in {'n/a','absent'},
                                   not {'rec','atr'} & set(F.rela.v(subPhrase) for subPhrase in L.u(w, otype='subphrase')),
                                   F.function.v(L.u(w,otype='phrase')[0]) != 'Subj'  # L.u returns a tuple; take the phrase node
                                   ])
                        ) #/tuple

        
        group_phrases[clause_atom] = {'phrase':target_phrase, 'objNouns':objNouns, 'pattern':pattern}
        group_stats[pattern] += 1
        
    return group_phrases, group_stats

group1 = getPhraseInventory(basic_groups['group_1'], {'complement','adjunct'})

group1_phrases = group1[0]
    
group1_header = ['Total','Pattern']
display_group1 = list((a,p) for p, a in sorted(group1[1].items(), key=lambda k: -k[1]))

generate_table(group1_header, display_group1)

| Total | Pattern |
|---|---|
| 58 | ('prep', 'subs') |
| 21 | ('prep',) |
| 5 | ('subs',) |
| 3 | ('advb',) |
| 3 | ('prep', 'art', 'subs') |
| 2 | ('verb',) |
| 2 | ('prep', 'adjv', 'subs') |
| 2 | ('prep', 'subs', 'prep', 'subs', 'subs') |
| 2 | ('prep', 'subs', 'art', 'adjv') |


The results show the prominent position of the preposition in group 1 constructions.

58 of the results contain only a preposition and a substantive. We'll focus first on this simpler form. Hopefully this will give a basis on which to process the rarer, more complex examples.

### Sense Generator
We use a sense generator that will return 1 of 3 categories for a given lemma. The categories are:
* person
* place
* thing/object

The "machine" will use 4 sources to make its decisions: 
1. existing features in the ETCBC
2. lists of categorised lexemes from the [Roorda/Dyk notebook](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html)
3. category lists generated from [openscriptures' BDB lexicon](https://github.com/openscriptures/HebrewLexicon) using part-of-speech tags in BDB
4. category lists generated from Wordnet, using hypernym relations

Sources 3 and 4 are special projects created for these kinds of queries; see [valency_wordlists](https://github.com/codykingham/textfabric_notebooks/blob/master/valency_wordlists).

Each of the sources will count as 1 or more parameters, that, if fulfilled, will go towards a score for the given lexeme object. The categorisation is based on a simple majority, but scores can also be returned with a "strength", for example, 3/3 or 2/3, depending on how many parameters are met out of how many are applicable. 
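
One possible reading of this scoring scheme, sketched with plain numbers. The function name, the input shape, and the handling of ties (returning `'?'`) are assumptions made for illustration; they are not taken from the notebook's own functions.

```python
def categorise(votes):
    """
    votes: dict mapping a category name to (parameters met, parameters applicable).
    Returns (category, strength), where strength is a string such as '2/3'.
    Ties and all-zero scores are left undecided as ('?', None); this is an assumption.
    """
    best, (met, applicable) = max(votes.items(), key=lambda kv: kv[1][0])
    tied = [cat for cat, (m, _) in votes.items() if m == met]
    if met == 0 or len(tied) > 1:
        return '?', None
    return best, '{}/{}'.format(met, applicable)

# Two of three applicable 'person' parameters are met, none for 'place'.
print(categorise({'person': (2, 3), 'place': (0, 4)}))
```

Returning the strength alongside the category keeps weak decisions visible, so that borderline lexemes can be reviewed by hand.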

In [170]:
import json

with open('../valency_wordlists/bdbCategories.json') as bdbFile:
    bdbCategories = json.load(bdbFile)
    
with open('../valency_wordlists/wordnetCategories.json') as wnFile:
    wnCategories = json.load(wnFile)

with open('../valency_wordlists/RoordaDykCategories.json') as rdFile:
    rdCategories = json.load(rdFile)
    
def testLocation(word_node, info=False):
    lexeme = F.lex.v(word_node)
    # SOURCE 1, ETCBC4c
    lex_obj = L.u(word_node, otype = 'lex')[0]
    univalent_final = 1 if F.uvf.v(word_node) == 'H' else 0
    name_type = 1 if F.nametype.v(lex_obj) == 'topo' else 0
    s1_score = sum((univalent_final, name_type))
    # SOURCE 2, ROORDA/DYK
    s2_score = 1 if lexeme in rdCategories and rdCategories[lexeme]['category'] == 'place' else 0
    # SOURCE 3, BDB categories list
    s3_score = 1 if lexeme in bdbCategories and bdbCategories[lexeme]['category'] == 'place' else 0
    # source 4, wordnet categories list
    s4_score = 1 if lexeme in wnCategories and wnCategories[lexeme]['cat'] == 'place' else 0
    if not info: return sum((s1_score, s2_score, s3_score, s4_score))
    else: return { 's1':s1_score, 's2':s2_score,'s3':s3_score, 's4':s4_score}
    
def testAgency(wordNode, info=False):
    # SOURCE 1, ETCBC4c
    lexN = L.u(wordNode, otype = 'lex')[0]
    lex = F.lex.v(lexN)
    nametype = 1 if F.nametype.v(lexN) == 'pers' else 0
    gentilic = 1 if F.ls.v(wordNode) == 'gntl' else 0
    pronoun = 1 if F.sp.v(wordNode) == 'prps' or F.pdp.v(wordNode) == 'prps' else 0
    s1Score = sum((nametype,gentilic,pronoun))
    # SOURCE 2, Roorda/Dyk
    s2Score = 1 if lex in rdCategories and rdCategories[lex]['category'] == 'agent' else 0
    # Source 3 - BDB categories
    s3Score = 1 if lex in bdbCategories and bdbCategories[lex]['category'] == 'agent' else 0
    # Source 4 - WN categories
    s4Score = 1 if lex in wnCategories and wnCategories[lex]['cat'] == 'agent' else 0
    if not info: return sum((s1Score, s2Score, s3Score, s4Score))
    else: return { 's1':s1Score, 's2':s2Score,'s3':s3Score, 's4':s4Score}
    
def abstractObject(wordNode, info=False):
    lex = F.lex.v(wordNode)
    # Source 1, BDB categories
    s1score = 1 if lex in bdbCategories and bdbCategories[lex]['subcategory'] == 'abstract' else 0
    s2score = 1 if lex in wnCategories and wnCategories[lex]['subcategory'] == 'abstract' else 0
    if not info: return(sum((s1score, s2score)))
    else: return {'BDB': s1score, 'WordNet':s2score}

### export spreadsheet for processing these basic patterns

The spreadsheet contains sections that are pre-organised based on the estimations above. The rows are ordered from the simplest parameters to the more obscure ones:

1. for every phrase in group 1, gather the number of prepositions, preposition lexeme, number of substantives, and category of substantives (agency, location, or ?). 

2. sort the groups into sub-groups by the following categories:
    * sort by number of prepositions from least to greatest
        * these are then sorted by the number of substantives
            * these are sub-sorted again by the lexeme of the preposition
                * these are once again subsorted by the category of lexeme
                
The resulting spreadsheet is cleanly categorised in ascending order from the simplest structures to the most complex.
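
The nested ordering described above amounts to a sort on a compound key. The sketch below uses made-up rows with hypothetical field names, and it sorts the last two levels alphabetically for brevity, whereas the notebook's own export orders them by frequency.

```python
# Hypothetical rows; the field names are illustrative, not the CSV fieldnames.
rows = [
    {'prepCt': 1, 'nounCt': 1, 'prep': 'B', 'nounType': 'locale'},
    {'prepCt': 0, 'nounCt': 1, 'prep': '', 'nounType': '?'},
    {'prepCt': 1, 'nounCt': 1, 'prep': '<M', 'nounType': 'agent'},
    {'prepCt': 1, 'nounCt': 1, 'prep': 'B', 'nounType': 'agent'},
    {'prepCt': 1, 'nounCt': 0, 'prep': 'B', 'nounType': ''},
]

# Tuples compare element by element, which gives the nested ordering:
# preposition count, then noun count, then preposition lexeme, then noun category.
ordered_rows = sorted(
    rows,
    key=lambda r: (r['prepCt'], r['nounCt'], r['prep'], r['nounType']),
)

for row in ordered_rows:
    print(row)
```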

## to fix:

* √ include the `Loca` tag! 
* √ exclude the `Time` tag
* √ `PreC` phrases are omitting nouns and adjectives resulting in false positives for the 0,0 category.
    * PreC followed by an adverbial expression that communicates location should be included with the 1,0 group.
* See clauseN `547260`, 2 Samuel 13.8
    * substantives which function as subjects should NOT be included in the noun inventory
    * we are looking for nouns that are acted upon or with
* include suffixes as noun phrases
    * 1,0 is essentially illegitimate
    * find: `prep + suffix`
* prepositions within time phrases should not be counted
    * see Eccl 2.23
* words in construct should only be counted once
    * see Ruth 3.7
    * ft. those words that share the same phrase should be excluded
    * see also `st` (state) feature; select 'absolute'
    * check for waw conjunction
* words that also count as the verb should not be counted as a substantive
    * see Prov 23.34; how to deal with this tricky example?
    * should the phrase governed by שכב be reinterpreted as a clause atom?

In [159]:
import csv

def testCat(noun):
    '''
    test whether a given noun registers
    for agency or location lexeme types
    '''
    agency = testAgency(noun)
    loca = testLocation(noun)
    if agency or loca:
        return 'agent' if agency > loca else 'locale'
    else:
        return '?'

def writeGroups(phrasedict):
    '''
    assemble the groups in order of 
     1) number of prepositions (least to greatest)
     2) number of substantives (least to greatest)
    build the row data for the CSV analysis file
    '''
    subgroups = defaultdict(lambda: defaultdict(list))
    for clause, phraseDat in phrasedict.items():
        
        phrase = phraseDat['phrase']
        pattern = phraseDat['pattern']
        prepCount = phraseDat['pattern'].count('prep')
        nounCount = len(phraseDat['objNouns'])
        verb = tuple(F.g_word_utf8.v(v) for v in L.d(clause,otype='word') if F.pdp.v(v) == 'verb')
        prepWords = tuple(w for w in L.d(phrase,otype='word') if F.pdp.v(w) == 'prep')
        nounWords = phraseDat['objNouns']
        
        lexWords = tuple(L.u(w,otype='lex')[0] for w in L.d(clause,otype='word'))
        infrequents = tuple( str((F.lex.v(l),F.gloss.v(l))) for l in lexWords if F.freq_lex.v(l) < 30)
        verse = L.u(phrase, otype='verse')[0]
        
        subgroups[prepCount][nounCount].append(
                                            {
                                                'prepCt':prepCount,
                                                'nounCt':nounCount,
                                                'ref' : T.sectionFromNode(phrase),
                                                'clauseNode':clause,
                                                'phraseNode': phrase,
                                                'verse': T.text(L.d(verse, otype='word')),
                                                'phraseType':F.function.v(phrase),
                                                'clause': T.text(L.d(clause,otype='word')),
                                                'phrase': T.text(L.d(phrase,otype='word')),
                                                'rareTerms': '\n'.join(infrequents), 
                                                'verb' : ' '.join(verb),
                                                'prepositions': ' '.join(F.lex_utf8.v(p) for p in prepWords), 
                                                'nouns': ' '.join(F.g_word_utf8.v(n) for n in nounWords),
                                                'nounTypes':' '.join(testCat(n) for n in nounWords),
                                                'notes':' ',
                                            })
    return subgroups


def countParameter(listOfClauseDicts, ParameterKey):
    '''
    counts the preposition or nountypes amongst all of the target phrases
    the return values are used to order the CSV data
    '''
    inventory = Counter()
    for data in listOfClauseDicts:
        inventory[data[ParameterKey]] += 1
    return inventory
    
    
group1Data = writeGroups(group1_phrases)

fieldnames = ['ref','phraseNode','clauseNode','verse','rareTerms','phraseType','clause',
              'verb','nouns','phrase','prepCt','nounCt','prepositions','nounTypes','notes']

def writeAnalysis(group, filename, fieldnames):
    '''
    write the pre-assembled groups into the csv analysis file
    '''
    with open(filename, 'w', newline='', encoding='utf-8') as file:  # newline='' as the csv module recommends; utf-8 for the Hebrew text
        writer = csv.DictWriter(file, fieldnames = fieldnames)
        writer.writeheader()
        for prepCount, nounCounts in sorted(group.items()): # first, sort by number of prepositions
            for nounCount, clauseDat in sorted(nounCounts.items()): # second, subsort by number of substantives
                # sorted() is stable, so the lower-priority key (word category) is applied first
                wordCatCount= countParameter(clauseDat,'nounTypes') # fourth, subsort by frequency of word category
                Catsort = sorted([(wordCatCount[clause['nounTypes']], clause) for clause in clauseDat], key = lambda k: -k[0])
                prepLexCount = countParameter(clauseDat,'prepositions') # third, subsort by frequency of a preposition lexeme
                prepsort = sorted([(prepLexCount[data[1]['prepositions']],data) for data in Catsort], key = lambda k: -k[0])
                finalData = tuple(data[1][1] for data in prepsort)   
                writer.writerows(finalData)
                
writeAnalysis(group1Data,'group1.csv', fieldnames)

## Remaking the Groups

In light of the gaps and findings in group 1, it is best to re-create the basic groups with finer distinctions. The sub-categories found in group 1 can also be used to categorise the groups further.

The groups will be created based on the following set of rules:

* preposition
    * עם / את
        * object == pl אב ('fathers') ?
            * **death, figurative idiom**
        * object == suffix ?
            * **sexual intercourse**
        * object == agentive noun ?
            * locative satellite in between ? 
                * **literal action, with/among entity** 
            * else: 
                * **sexual intercourse**
        * else
            * **literal action, with/among entity**

    * על / ל / ב / אצל / בין, et al.
        * object == abstract object ? 
            * **literal action, manner of lying**
        * object == locative or unknown ?
            * **literal action, spatial direction of act**

* ø preposition
    * noun/adj/adv == locative ?
        * 'sex' category in clause atom? 
            * **sexual intercourse + adjunctive object**
        * else:
            * **literal action, spatial direction of act**
    * ø noun/adj/adv
        * +verbal suffix ?
            * **sexual intercourse** 
        * else
            * **literal action, connotes sleep**
        
        
These rules are applied on a per-phrase basis to create phrase categories. Categories can be compounded together to create compound categories. I expect these distinctions to break down once multiple complement/object/adjunct phrases are combined. Once that happens, we'll adjust the rules accordingly. 

In [160]:
def isObjectNoun(w):
    '''
    True for nouns (and adverbs/adjectives) that can serve as the object of a satellite phrase
    '''
    return all([F.pdp.v(w) in {'subs','prps','prin','adjv','advb','nmpr'},
                not {'rec','atr'} & set(F.rela.v(subPhrase) for subPhrase in L.u(w, otype='subphrase')),
                F.function.v(L.u(w,otype='phrase')[0]) != 'Subj',  # L.u returns a tuple; take the phrase node
                F.rela.v(L.u(w, otype = 'phrase_atom')[0]) not in {'Appo'},
               ])

def analyseSatellites(clauseAtom):
    satelliteFunctions = {'Adju','PreO','Objc','Cmpl','Loca'}
    satellites = (ph for ph in L.d(clauseAtom, otype='phrase') if F.function.v(ph) in satelliteFunctions)
    satData = defaultdict(dict)
    
    for sat in satellites:
        words = L.d(sat, otype='word')
        prepositions = tuple(w for w in words if F.pdp.v(w) == 'prep')
        objectNouns = tuple(w for w in words if isObjectNoun(w))
        categories = []
        prepAndObjs = []
        
        # apply the rules following the logic in the above notes
        if prepositions:
                # get suffixed objects from suffixed prepositions or keep preposition objects
            objectNouns = tuple(F.g_prs_utf8.v(prep) for prep in prepositions) \
                          if not objectNouns else objectNouns
            prepAndObj = tuple(zip(prepositions, objectNouns)) # map prepositions to their objs
            
            # Calculate the categories for prepositions:    
            for prep, pObj in prepAndObj:
                if F.lex.v(prep) in {'<M','>T','>T=='}:
                    if F.lex.v(pObj) == '>B/' and F.nu.v(pObj) == 'pl': 
                        category = 'death'
                    elif isinstance(pObj, str) or testAgency(pObj): # suffix obj or agentive obj.
                        
                        # determine whether an intervening prepositional phrase occurs
                        # such a phrase can block the sexual meaning as seen in texts like:
                        # Job 20.11 and 2 Sam 11.3
                        verb = next(w for w in L.d(clauseAtom, otype='word') if F.pdp.v(w) == 'verb')
                        first, second = sorted((verb,prep)) # verb can come before or after
                        # now calculate all the intervening words and include if one is a preposition
                        # but it cannot be a 'with' preposition
                        interveningToVerb = set(F.pdp.v(n) for n in range(first+1, second)
                                               if F.lex.v(n) not in {'<M','>T','>T=='})
                        if any(['prep' in interveningToVerb,
                                'literal.prepSpatial' in {cat for sat in satData[clauseAtom]
                                                          for cat in satData[clauseAtom][sat]['categories']}
                               ]):
                            category = 'literal.withEntity'
                        else:
                            category = 'sex'
                    else:
                        category = 'literal.withEntity'
                else:  # '<L', 'B', 'L', '>YL/', 'BJN/', et al.
                    if abstractObject(pObj) or F.lex.v(prep) == 'K':
                        category = 'literal.manner'
                    else:
                        category = 'literal.prepSpatial'
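                categories.append(category)
            # record each (preposition, object) pair once, after the loop;
            # extending inside the loop duplicated the pairs for multi-preposition phrases
            prepAndObjs.extend(prepAndObj)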
        elif objectNouns:
            for obj in objectNouns:
                if testLocation(obj):
                    category = 'literal.spatialObj'
                else:
                    if any(['sex' in categories,
                            'sex' in {cat for sat in satData[clauseAtom]
                                      for cat in satData[clauseAtom][sat]['categories']}]):
                        category = 'sex.object'
                    else:
                        category = 'literal.unknown'
                categories.append(category)
        else:
            verb = tuple(w for w in words if F.function.v(L.u(w, otype='phrase')[0]) == 'PreO')
            if verb and F.prs.v(verb[0]):
                category = 'sex'
            else:
                category = 'literal.general'
            categories.append(category)
        
        satData[clauseAtom][sat] = {'prepObjcs': prepAndObjs,
                                    'function' : F.function.v(sat),
                                    'categories' : categories,
                                    'objects' : objectNouns
                                   }
    if not satData:
        satData[clauseAtom] = {}
    return satData        

In [161]:
# print out some samples:

for ca in list(basic_groups['group_4'])[:2]:
    print(T.sectionFromNode(ca))
    print(T.text(L.d(ca, otype='word')))
    pprint(analyseSatellites(ca))
    print()

('Deuteronomy', 31, 16)
הִנְּךָ֥ שֹׁכֵ֖ב עִם־אֲבֹתֶ֑יךָ 
defaultdict(<class 'dict'>,
            {536065: {668165: {'categories': ['death'],
                               'function': 'Cmpl',
                               'objects': (111042,),
                               'prepObjcs': [(111041, 111042)]}}})

('2_Samuel', 11, 9)
וַיִּשְׁכַּ֣ב אוּרִיָּ֗ה פֶּ֚תַח בֵּ֣ית הַמֶּ֔לֶךְ אֵ֖ת כָּל־עַבְדֵ֣י אֲדֹנָ֑יו 
defaultdict(<class 'dict'>,
            {546951: {700620: {'categories': ['literal.unknown'],
                               'function': 'Cmpl',
                               'objects': (165892,),
                               'prepObjcs': []},
                      700621: {'categories': ['literal.withEntity'],
                               'function': 'Adju',
                               'objects': (165897,),
                               'prepObjcs': [(165896, 165897)]}}})



## Write the New Groups to CSV

1. apply the new function to all clauses in the ckb_sats dictionary
2. organise the groups based on 3 tiers of subcategorisation: 
   * A. the most prevalent satellite counts within all clauses
   * B. the most prevalent satellite categories and category combinations within A
   * C. the most prevalent prepositional phrases within B
3. gather reference information, plain text, labels, etc. for the csv doc
4. write to the csv file
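
The three-tier ordering in step 2 can be sketched on toy data before applying it to the real clauses. This is only an illustration under stated assumptions: the clause names `c1` to `c5`, the helper names `group_by_prevalence` and `tiered_order`, and the satellite tuples are all hypothetical, and ties are left in their original order.

```python
from collections import defaultdict

def group_by_prevalence(items, key):
    """Group items by key(item) and return the groups, largest first."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    # sorted() is stable, so equally sized groups keep their first-seen order
    return sorted(groups.values(), key=len, reverse=True)

# Toy data (hypothetical): each clause maps to its (category, preposition) satellites
toy_clauses = {
    'c1': [('death', '<M')],
    'c2': [('sex', '<M')],
    'c3': [('sex', '>T')],
    'c4': [('sex', '<M')],
    'c5': [('literal.prepSpatial', 'B'), ('literal.withEntity', '<M')],
}

def tiered_order(clauses):
    """Order clauses by tier A (satellite count), B (categories), C (prepositions)."""
    ordered = []
    tier_a = group_by_prevalence(list(clauses), lambda c: len(clauses[c]))
    for count_group in tier_a:
        tier_b = group_by_prevalence(
            count_group, lambda c: tuple(cat for cat, _ in clauses[c]))
        for cat_group in tier_b:
            tier_c = group_by_prevalence(
                cat_group, lambda c: tuple(prep for _, prep in clauses[c]))
            for prep_group in tier_c:
                ordered.extend(prep_group)
    return ordered

write_order = tiered_order(toy_clauses)
print(write_order)
```

Here the single-satellite clauses come first because they are the most common, the `sex` category precedes `death` within them, and the two `<M` clauses precede the lone `>T` clause.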

In [162]:
# 1. Analyse satellites
ckbSatellites = {}
for clauseAtom in ckb_sats:
    ckbSatellites.update(analyseSatellites(clauseAtom))

In [163]:
# 2. arrange the group orderings
# A.
satCounts = defaultdict(list)
for clauseAtom, satellites in ckbSatellites.items():
    satCount = len(satellites)
    satCounts[satCount].append(clauseAtom)
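# NB: the first element of each tuple is the number of clauses in the group
# (its prevalence), not the satellite count itself
satOrdered = sorted(((len(group), group) for group in satCounts.values()), reverse = True)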

# B.
catOrdereds = list()
for satCount, clauseAtoms in satOrdered:
    catCounts = defaultdict(list)
    for ca in clauseAtoms:
        cats = []
        for satellite, satInfo in ckbSatellites[ca].items():
            cats.append(' '.join(satInfo['categories']))
        # separate satellites explicitly so that adjacent labels do not run together
        catCounts[' | '.join(cats)].append(ca)
    catOrdered = sorted(((satCount, len(group), group) for group in catCounts.values()), reverse = True)
    catOrdereds.extend(catOrdered)

# C.
prepOrdereds = list()
for satCount, catCount, clauseAtoms in catOrdereds:
    prepCounts = defaultdict(list)
    for ca in clauseAtoms:
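        preps = []
        for satellite, satInfo in ckbSatellites[ca].items():
            preps.append(' '.join(F.lex_utf8.v(po[0]) if isinstance(po[0], int) else po[0]
                                  for po in satInfo['prepObjcs']))
        # separate satellites explicitly so that adjacent prepositions do not run together
        prepCounts[' | '.join(preps)].append(ca)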
    prepOrdered = sorted(((satCount, catCount, len(group), group) 
                         for group in prepCounts.values()), reverse = True)
    prepOrdereds.extend(prepOrdered)
    
writeOrder = prepOrdereds

In [164]:
# 3.-4. Gather data and write to csv

fieldnames = ['reference','clauseAtom','Verse','Rare Terms','Clause']
satelliteFields = ['SATELLITE: ','prepCount','ObjCount','Function','Category','(Preposition) + Objects']
satelliteCount = max(satCounts)
for satCount in range(0, satelliteCount):
    fieldnames.extend(satelliteFields)
    
# newline='' is what the csv module expects; utf-8 keeps the Hebrew text intact
with open('CKB_valency_groups.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(fieldnames)
    
    for satCount, catCount, prepCount, clauseAtoms in writeOrder:
        for ca in clauseAtoms:
            reference = T.sectionFromNode(ca)
            verse = T.text(L.d(L.u(ca, otype='verse')[0], otype='word'))
            lexWords = tuple(L.u(w,otype='lex')[0] for w in L.d(ca,otype='word'))
            rareTerms = '\n '.join(str((F.lex.v(l),F.gloss.v(l))) for l in lexWords if F.freq_lex.v(l) < 30)
            clause = T.text(L.d(ca, otype='word'))
            
            row = list((reference,
                        ca,
                        verse,
                        rareTerms,
                        clause))
            
            for satellite, satDat in ckbSatellites[ca].items():
                
                
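                # show preposition + object pairs if present, else bare objects, else 'none';
                # suffix objects are stored as strings, so fall back to the string itself
                if satDat['prepObjcs']:
                    objects = ' '.join(F.g_word_utf8.v(w) or w
                                       for prepObj in satDat['prepObjcs'] for w in prepObj)
                elif satDat['objects']:
                    objects = ' '.join(F.g_word_utf8.v(w) for w in satDat['objects'])
                else:
                    objects = 'none'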
                
                satColumn = list((satellite,
                                  len(satDat['prepObjcs']),
                                  len(satDat['objects']),
                                  F.function.v(satellite),
                                  ' '.join(satDat['categories']),
                                  objects
                                  ))
                row.extend(satColumn)
            
            writer.writerow(row)