<h1 style='text-align: center; font-size:300%'>`שכב` Valency</h1>
<p style ='text-align: center;'><strong>What factors determine how the verb שכב ('to lie') is translated?</strong></p>

In [47]:
from datetime import datetime
print('Notebook last modified on {}'.format(datetime.now()))

Notebook last modified on 2017-01-15 22:42:43.568100


# Methodology
The goal is to inventory and categorise the various satellites surrounding the verb שכב ("to lie") in biblical Hebrew in order to determine which elements give rise to which meanings of שכב. Valency tracks the interaction between semantics and syntax.

Dyk et al. observe that few "watertight" methods exist for separating obligatory complements from non-obligatory adjuncts (see [Dyk, Glanz, Oosting, "Analysing Valence Patterns,"](https://shebanq.ancient-data.org/shebanq/static/docs/methods/2014_Dyk_jnsl.pdf) 4-5). They apply a "distributional method" as follows:

* "Collect all occurrences of a verb with the complete patterns of elements occurring in the data."
* "Sort these by pattern."
* "Analyse the differences between the various patterns, observing what relation the separate sentence constituents have to the verb." *(Dyk et al., 6)*
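
The first two steps can be sketched on toy data as follows. This is only a minimal illustration and not the implementation used by Dyk et al.; the `occurrences` list, its references, and the helper `sort_by_pattern` are invented for the example.

```python
from collections import defaultdict

# Hypothetical toy data: (reference, pattern of constituents around the verb)
occurrences = [
    ('ref1', ('Pred', 'Cmpl')),
    ('ref2', ('Pred',)),
    ('ref3', ('Pred', 'Cmpl')),
    ('ref4', ('Pred', 'Loca')),
]

def sort_by_pattern(occurrences):
    """Group references under their constituent pattern, most frequent first."""
    groups = defaultdict(list)
    for reference, pattern in occurrences:
        groups[pattern].append(reference)
    # sorted() is stable, so equally frequent patterns keep their first-seen order
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)

patterns = sort_by_pattern(occurrences)
for pattern, references in patterns:
    print(pattern, references)
```

The third step, analysing the differences between patterns, remains a manual task; the grouping only prepares the data for it.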

In this notebook, adjuncts and complements are treated more or less alike, since some of these functions are mislabelled in etcbc4c.

The [Roorda/Dyk valency corrections notebook](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html) informed and inspired many of the queries created in this notebook.

The [CKB_valency_rough.ipynb](https://github.com/codykingham/textfabric_notebooks/tree/master/שכב%20valency/preliminary/CKB_valency_rough.ipynb) is the predecessor to this one. There, satellite groups are organised based on rough, basic parameters like the presence or absence of a preposition. The rough spreadsheets exported by that notebook form the background for the rules used in this notebook. 

## Procedures

### 1. Gather clauses in which  שכב  serves the predicate role.
### 2. Process the clause's satellites based on a checklist of rules; these include checking for various idioms, prepositions, and noun types.
### 3. Export the clauses and their satellites to a segmented, organised spreadsheet for further analysis.
### 4. See the final results [here](https://github.com/codykingham/textfabric_notebooks/tree/master/שכב%20valency/CKB_valency_groups.csv)

In [48]:
# load and import necessary modules

import json, csv
from collections import defaultdict
from pprint import pprint
from tf.fabric import Fabric

TF = Fabric(modules='Hebrew/etcbc4c')
api = TF.load("""otype
                 book chapter verse
                 function pdp sp vs prs st rela
                 lex g_cons_utf8 g_word_utf8 g_prs_utf8 nu
                 lex_utf8
                 nametype ls gloss uvf
                 freq_lex
                """)
api.makeAvailableIn(globals())

This is Text-Fabric 2.3.0
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
109 features found and 0 ignored
  0.00s loading features ...
   |     0.04s B otype                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B book                 from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B chapter              from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.01s B verse                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.20s B g_cons_utf8          from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.34s

## 1. Gather clauses in which  שכב  serves the predicate role.

In [49]:
target = 'CKB['
stem = 'qal'
predicateFunctions = {'Pred', 'PreS', 'PreO', 'PreC', 'PtcO', 'PrcS'} 

ckbClauses = set(L.u(w, otype='clause_atom')[0] for w in F.otype.s('word') 
                     if F.lex.v(w) == target 
                     and F.vs.v(w) == stem
                     and F.function.v(L.u(w, otype='phrase')[0]) in predicateFunctions
                )
len(ckbClauses)

195

## 2. Process the clause's satellites based on a checklist of rules; these include checking for various idioms, prepositions, and noun types.

We need a number of resources to accomplish this.

**First**, we use sense generators that score a given lexeme against 1 of 3 categories. The categories are:
* person (agent)
* place
* abstract object

The functions draw on 4 sources to make their decisions: 
1. existing features in the ETCBC
2. lists of categorised lexemes from the [Roorda/Dyk notebook](https://shebanq.ancient-data.org/shebanq/static/docs/tools/valence/corr_enrich.html)
3. category lists generated from [openscriptures' BDB lexicon](https://github.com/openscriptures/HebrewLexicon) using part-of-speech tags in BDB
4. category lists generated from WordNet using hypernym relations

Sources 3 and 4 are special projects created for these kinds of queries; see [valency_wordlists](https://github.com/codykingham/textfabric_notebooks/blob/master/valency_wordlists).

Each source contributes one or more tests. Every test that a lexeme passes adds 1 to its score for the category in question, so a higher score means that more sources agree.
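
The scoring idea can be sketched in isolation. The snippet below is a simplified illustration under stated assumptions: the `source_lists` dictionary and its flat `lexeme -> category` layout are invented for the example and do not mirror the structure of the real JSON files, and `category_score` is a hypothetical helper.

```python
# Hypothetical category lists standing in for sources 2-4.
# The flat layout and the entries are assumptions made for this sketch.
source_lists = {
    'roorda_dyk': {'BJT/': 'place', '>JC/': 'agent'},
    'bdb':        {'BJT/': 'place'},
    'wordnet':    {'BJT/': 'place', '>JC/': 'agent'},
}

def category_score(lexeme, category, sources=source_lists):
    """Count how many sources assign `category` to `lexeme`."""
    return sum(1 for lexicon in sources.values()
               if lexicon.get(lexeme) == category)

place_score = category_score('BJT/', 'place')   # all three toy sources agree
agent_score = category_score('>JC/', 'agent')   # two of the three agree
no_score = category_score('>JC/', 'place')      # no source agrees
```

Because a score of 0 is falsy, the result can be used directly as a condition, which is how the notebook uses calls such as `if testLocation(obj):`.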

### Sense Generators: 

*TO-DO: combine the 3 generators into 1 simplified function (15.01.17)*

In [50]:
with open('../valency_wordlists/bdbCategories.json') as bdbFile:
    bdbCategories = json.load(bdbFile)
with open('../valency_wordlists/wordnetCategories.json') as wnFile:
    wnCategories = json.load(wnFile)
with open('../valency_wordlists/RoordaDykCategories.json') as rdFile:
    rdCategories = json.load(rdFile)    

def testLocation(word_node, info=False):
    lexeme = F.lex.v(word_node)
    # SOURCE 1, ETCBC4c
    lex_obj = L.u(word_node, otype = 'lex')[0]
    univalent_final = 1 if F.uvf.v(word_node) == 'H' else 0
    name_type = 1 if F.nametype.v(lex_obj) == 'topo' else 0
    s1_score = sum((univalent_final, name_type))
    # SOURCE 2, ROORDA/DYK
    s2_score = 1 if lexeme in rdCategories and rdCategories[lexeme]['category'] == 'place' else 0
    # SOURCE 3, BDB categories list
    s3_score = 1 if lexeme in bdbCategories and bdbCategories[lexeme]['category'] == 'place' else 0
    # source 4, wordnet categories list
    s4_score = 1 if lexeme in wnCategories and wnCategories[lexeme]['cat'] == 'place' else 0
    if not info: return sum((s1_score, s2_score, s3_score, s4_score))
    else: return { 's1':s1_score, 's2':s2_score,'s3':s3_score, 's4':s4_score}
    
def testAgency(wordNode, info=False):
    # SOURCE 1, ETCBC4c
    lexN = L.u(wordNode, otype = 'lex')[0]
    lex = F.lex.v(lexN)
    nametype = 1 if F.nametype.v(lexN) == 'pers' else 0
    gentilic = 1 if F.ls.v(wordNode) == 'gntl' else 0
    pronoun = 1 if F.sp.v(wordNode) == 'prps' or F.pdp.v(wordNode) == 'prps' else 0
    s1Score = sum((nametype,gentilic,pronoun))
    # SOURCE 2, Roorda/Dyk
    s2Score = 1 if lex in rdCategories and rdCategories[lex]['category'] == 'agent' else 0
    # Source 3 - BDB categories
    s3Score = 1 if lex in bdbCategories and bdbCategories[lex]['category'] == 'agent' else 0
    # Source 4 - WN categories
    s4Score = 1 if lex in wnCategories and wnCategories[lex]['cat'] == 'agent' else 0
    if not info: return sum((s1Score, s2Score, s3Score, s4Score))
    else: return { 's1':s1Score, 's2':s2Score,'s3':s3Score, 's4':s4Score}
    
def abstractObject(wordNode, info=False):
    lex = F.lex.v(wordNode)
    # Source 1, BDB categories
    s1score = 1 if lex in bdbCategories and bdbCategories[lex]['subcategory'] == 'abstract' else 0
    # Source 2, WordNet categories
    s2score = 1 if lex in wnCategories and wnCategories[lex]['subcategory'] == 'abstract' else 0
    if not info: return(sum((s1score, s2score)))
    else: return {'BDB': s1score, 'WordNet':s2score}

### Rules for processing satellites (phrases):

**Second**, the groups of satellite types are created based on the following set of rules. Each rule results in a given sense which, based on the syntax, represents the most likely interpretation of the verb שכב in a given passage.

* preposition
    * עם / את
        * object == pl אב ('fathers') ?
            * **death, figurative idiom**
        * object == suffix ?
            * **sexual intercourse**
        * object == agentive noun ?
            * locative satellite in between ? 
                * **literal action, with/among entity** 
            * else: 
                * **sexual intercourse**
        * else
            * **literal action, with/among entity**

    * על / ל / ב / אצל / בין, et al.
        * object == abstract object ? 
            * **literal action, manner of lying**
        * object == locative or unknown ?
            * **literal action, spatial direction of act**

* ø preposition
    * noun/adj/adv == locative ?
        * 'sex' category in clause atom? 
            * **sexual intercourse + adjunctive object**
        * else:
            * **literal action, spatial direction of act**
    * ø noun/adj/adv
        * +verbal suffix ?
            * **sexual intercourse** 
        * else
            * **literal action, connotes sleep**
        
        
These rules are applied on a per-phrase basis to create phrase categories. Categories can be compounded together to create compound categories.
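
As a rough sketch, the rule tree can be written as a single pure function. This is a simplification for illustration only: the string tags passed as `obj` ('fathers', 'suffix', 'agent', 'abstract', 'locative') and the boolean parameters are hypothetical stand-ins for the tests that the notebook runs against the database, and the function name `classify_satellite` is invented. The tree does not cover a non-locative noun without a preposition; the sketch labels that case `literal.unknown`.

```python
WITH_PREPS = {'<M', '>T', '>T=='}  # the 'with' prepositions (עם / את)

def classify_satellite(prep=None, obj=None, locative_between=False,
                       sex_in_clause=False, verbal_suffix=False):
    """Return a sense label for one satellite, following the rule tree."""
    if prep in WITH_PREPS:
        if obj == 'fathers':
            return 'death'                 # figurative idiom
        if obj == 'suffix':
            return 'sex'
        if obj == 'agent' and not locative_between:
            return 'sex'
        return 'literal.withEntity'        # blocked by a locative, or non-agent
    if prep is not None:                   # any other preposition
        return 'literal.manner' if obj == 'abstract' else 'literal.prepSpatial'
    if obj == 'locative':                  # no preposition, locative word
        return 'sex.object' if sex_in_clause else 'literal.spatialObj'
    if obj is None:                        # no preposition, no object word
        return 'sex' if verbal_suffix else 'literal.general'
    return 'literal.unknown'               # case not covered by the tree

labels = [
    classify_satellite('<M', 'fathers'),
    classify_satellite('>T', 'agent', locative_between=True),
    classify_satellite('B', 'abstract'),
    classify_satellite(),
]
```

A clause with several satellites would call the function once per satellite and combine the resulting labels into a compound category.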

In [51]:
def isObjectNoun(w):
    '''
    returns whether a given word is an "object noun"
    the term is used loosely to refer to words that:
    1. function as objects of prepositions, or
    2. direct the force of the verb in some way
    See for example the particle שם ('there')
    '''
    return all([F.pdp.v(w) in {'subs','prps','prin','adjv','advb','nmpr'},
                not {'rec','atr'} & set(F.rela.v(subPhrase) for subPhrase in L.u(w, otype='subphrase')),
                F.function.v(L.u(w, otype='phrase')[0]) != 'Subj', # L.u returns a tuple; take the phrase node
                F.rela.v(L.u(w, otype = 'phrase_atom')[0]) not in {'Appo'},
               ])

def analyseSatellites(clauseAtom):
    '''
    applies the rules for the semantic interpretation of the verb CKB
    in a certain clause atom, based on the presence of certain satellite types;
    in some cases, the presence of additional satellites condition the
    meaning of another satellite
    
    The rules follow closely those outlined in the markdown cell above, although
    the resulting tags are more nuanced.
    
    The function returns nested dictionaries that contain satellites as keys, each with
    their own dictionaries that contain additional keys and data on the satellite. These 
    are: preposition + their object, function of the phrase (satellite), categories determined
    by the function, and object "nouns" within the clause
    '''
    satelliteFunctions = {'Adju','PreO','Objc','Cmpl','Loca'}
    satellites = (ph for ph in L.d(clauseAtom, otype='phrase') if F.function.v(ph) in satelliteFunctions)
    satData = defaultdict(dict)
    
    for sat in satellites:
        words = L.d(sat, otype='word')
        prepositions = tuple(w for w in words if F.pdp.v(w) == 'prep')
        objectNouns = tuple(w for w in words if isObjectNoun(w))
        categories = []
        prepAndObjs = []
        
        # apply the rules following the logic in the above notes
        if prepositions:
                # get suffixed objects from suffixed prepositions or keep preposition objects
            objectNouns = tuple(F.g_prs_utf8.v(prep) for prep in prepositions) \
                          if not objectNouns else objectNouns
            prepAndObj = tuple(zip(prepositions, objectNouns)) # map prepositions to their objs
            
            # Calculate the categories for prepositions:    
            for prep, pObj in prepAndObj:
                if F.lex.v(prep) in {'<M','>T','>T=='}:
                    if F.lex.v(pObj) == '>B/' and F.nu.v(pObj) == 'pl': 
                        category = 'death'
                    elif type(pObj) == str or testAgency(pObj): # suffix obj or agentive obj.
                        
                        # determine whether an intervening prepositional phrase occurs
                        # such a phrase can block the sexual meaning as seen in texts like:
                        # Job 20.11 and 2 Sam 11.3
                        verb = next(w for w in L.d(clauseAtom, otype='word') if F.pdp.v(w) == 'verb')
                        first, second = sorted((verb,prep)) # verb can come before or after
                        # now calculate all the intervening words and include if one is a preposition
                        # but it cannot be a 'with' preposition
                        interveningToVerb = set(F.pdp.v(n) for n in range(first+1, second)
                                               if F.lex.v(n) not in {'<M','>T','>T=='})
                        if any(['prep' in interveningToVerb,
                                'literal.prepSpatial' in {cat for sat in satData[clauseAtom]
                                                          for cat in satData[clauseAtom][sat]['categories']}
                               ]):
                            category = 'literal.withEntity'
                        else:
                            category = 'sex'
                    else:
                        category = 'literal.withEntity'
                else: #'<L','B','L','>YL/','BJN/', et. al
                    if abstractObject(pObj) or F.lex.v(prep) == 'K':
                        category = 'literal.manner'
                    else:
                        category = 'literal.prepSpatial'
                categories.append(category)
            # extend once, after the loop, so that the pairs are not duplicated
            prepAndObjs.extend(prepAndObj)
        elif objectNouns:
            for obj in objectNouns:
                if testLocation(obj):
                    category = 'literal.spatialObj'
                else:
                    if any(['sex' in categories,
                            'sex' in {cat for sat in satData[clauseAtom]
                                      for cat in satData[clauseAtom][sat]['categories']}]):
                        category = 'sex.object'
                    else:
                        category = 'literal.unknown'
                categories.append(category)
        else:
            verb = tuple(w for w in words if F.function.v(L.u(w, otype='phrase')[0]) == 'PreO')
            if verb and F.prs.v(verb[0]):
                category = 'sex'
            else:
                category = 'literal.general'
            categories.append(category)
        
        satData[clauseAtom][sat] = {'prepObjcs': prepAndObjs,
                                    'function' : F.function.v(sat),
                                    'categories' : categories,
                                    'objects' : objectNouns
                                   }
    if not satData:
        satData[clauseAtom] = {}
    return satData        

### Analyse the CKB clauses



In [52]:
ckbSatellites = {}
for clauseAtom in ckbClauses:
    ckbSatellites.update(analyseSatellites(clauseAtom))

### sample

In [53]:
# see an example of the analyser output here:
pprint(list(ckbSatellites.items())[:2])

[(536065,
  {668165: {'categories': ['death'],
            'function': 'Cmpl',
            'objects': (111042,),
            'prepObjcs': [(111041, 111042)]}}),
 (551428,
  {713724: {'categories': ['death'],
            'function': 'Cmpl',
            'objects': (188497,),
            'prepObjcs': [(188496, 188497)]}})]
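
To make the nested structure easier to read, the sample can be flattened into one row per satellite. The sketch below copies the two entries printed above into a literal dictionary so that it runs without the database; `summarise` is a hypothetical helper and not part of the notebook.

```python
# The two sample entries from the output above, as a plain dictionary:
# clause atom node -> phrase node -> data on that satellite
sample = {
    536065: {668165: {'categories': ['death'],
                      'function': 'Cmpl',
                      'objects': (111042,),
                      'prepObjcs': [(111041, 111042)]}},
    551428: {713724: {'categories': ['death'],
                      'function': 'Cmpl',
                      'objects': (188497,),
                      'prepObjcs': [(188496, 188497)]}},
}

def summarise(satellites_by_clause):
    """Flatten to (clause atom, phrase, function, categories) rows."""
    return [(clause_atom, phrase, data['function'], ' '.join(data['categories']))
            for clause_atom, satellites in satellites_by_clause.items()
            for phrase, data in satellites.items()]

rows = summarise(sample)
for row in rows:
    print(row)
```

Each pair in `prepObjcs` holds the word node of a preposition followed by the word node of its object.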


## 3. Export the clauses and their satellites to a segmented, organised spreadsheet for further analysis.
1. organise the groups based on 3 tiers of subcategorisation: 
   * A. the most prevalent satellite counts within all clauses
   * B. the most prevalent satellite categories and category combinations within A
   * C. the most prevalent prepositional phrases within B
2. gather reference information, plain text, labels, etc.; write to the csv file
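
The three tiers amount to repeated grouping, each time ordering the groups from largest to smallest. A minimal sketch on invented data follows; the `clauses` dictionary, the English preposition labels, and the helpers `group_by` and `tiered_order` are assumptions made for the example, not the notebook's own code.

```python
from collections import defaultdict

# Hypothetical clauses: id -> one (category, preposition) pair per satellite
clauses = {
    1: [('sex', 'with')],
    2: [],
    3: [('death', 'with')],
    4: [('death', 'with')],
    5: [('literal.prepSpatial', 'on'), ('sex', 'with')],
}

def group_by(ids, key):
    """Split `ids` into groups that share a key, largest group first."""
    groups = defaultdict(list)
    for i in ids:
        groups[key(i)].append(i)
    return sorted(groups.values(), key=len, reverse=True)

def tiered_order(clauses):
    """Order ids by satellite count (A), categories (B), prepositions (C)."""
    ordered = []
    for by_count in group_by(clauses, lambda i: len(clauses[i])):
        for by_cat in group_by(by_count, lambda i: tuple(c for c, _ in clauses[i])):
            for by_prep in group_by(by_cat, lambda i: tuple(p for _, p in clauses[i])):
                ordered.extend(by_prep)
    return ordered

order = tiered_order(clauses)
print(order)
```

Tuples are used as grouping keys here so that the labels of neighbouring satellites cannot run together into one ambiguous string.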

### organise the groups based on 3 tiers of subcategorisation:

In [54]:
# A.
satCounts = defaultdict(list)
for clauseAtom, satellites in ckbSatellites.items():
    satCount = len(satellites)
    satCounts[satCount].append(clauseAtom)
# note: the first member of each tuple is the number of clause atoms in the group
satOrdered = sorted(((len(group),group) for count,group in satCounts.items()), reverse = True)

# B.
catOrdereds = list()
for satCount, clauseAtoms in satOrdered:
    catCounts = defaultdict(list)
    for ca in clauseAtoms:
        cats = ''
        for satellite, satData in ckbSatellites[ca].items():
            cats += ' '.join(satData['categories'])
        catCounts[cats].append(ca)
    catOrdered = sorted(((satCount, len(group), group) for group in catCounts.values()), reverse = True)
    catOrdereds.extend(catOrdered)

# C.
prepOrdereds = list()
for satCount, catCount, clauseAtoms in catOrdereds:
    prepCounts = defaultdict(list)
    for ca in clauseAtoms:
        preps = ''
        for satellite, satData in ckbSatellites[ca].items():
            preps += ' '.join(F.lex_utf8.v(po[0]) if type(po[0]) == int else po[0]
                              for po in satData['prepObjcs'])
        prepCounts[preps].append(ca)
    prepOrdered = sorted(((satCount, catCount, len(group), group) 
                         for group in prepCounts.values()), reverse = True)
    prepOrdereds.extend(prepOrdered)
    
writeOrder = prepOrdereds

### gather reference information, plain text, labels, etc.; write to the csv file

In [62]:
fieldnames = ['reference','clauseAtom','Verse','Rare Terms','Clause','SatCount']
satelliteFields = ['SATELLITE: ','prepCount','ObjCount','Function','Category','(Preposition) + Objects']
satelliteCount = max(satCounts)
for satCount in range(0, satelliteCount):
    fieldnames.extend(satelliteFields)

with open('CKB_valency_groups.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(fieldnames)
    
    for satCount, catCount, prepCount, clauseAtoms in writeOrder:
        for ca in clauseAtoms:
            reference = T.sectionFromNode(ca)
            verse = T.text(L.d(L.u(ca, otype='verse')[0], otype='word'))
            lexWords = tuple(L.u(w,otype='lex')[0] for w in L.d(ca,otype='word'))
            rareTerms = '\n '.join(str((F.lex.v(l),F.gloss.v(l))) for l in lexWords if F.freq_lex.v(l) < 30)
            clause = T.text(L.d(ca, otype='word'))
            
            row = list((reference,
                        ca,
                        verse,
                        rareTerms,
                        clause,
                        len(ckbSatellites[ca])
                      ))
            
            for satellite, satDat in ckbSatellites[ca].items():
                
                
                objects = ' '.join(F.g_word_utf8.v(w) or w 
                                   for prepObj in satDat['prepObjcs']
                                   for w in prepObj) \
                                   if satDat['prepObjcs'] \
                                   else ' '.join(F.g_word_utf8.v(w) for w in satDat['objects']) \
                                   if satDat['objects']\
                                   else 'none'

                satColumn = list((satellite,
                                  len(satDat['prepObjcs']),
                                  len(satDat['objects']),
                                  F.function.v(satellite),
                                  ' '.join(satDat['categories']),
                                  objects
                                  ))
                row.extend(satColumn)
                
            # fill in empty columns
            for column in range(len(fieldnames) - len(row)): 
                row.append('')
            
            writer.writerow(row)
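
The padding step matters because clauses have different numbers of satellites, so rows differ in length. The following sketch isolates that step and writes to an in-memory buffer instead of a file; the field names, the row values, and the helper `write_padded` are invented for the example.

```python
import csv
import io

def write_padded(rows, fieldnames):
    """Write rows of varying length, padding short rows with empty cells."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(fieldnames)
    for row in rows:
        padding = [''] * (len(fieldnames) - len(row))
        writer.writerow(list(row) + padding)
    return buffer.getvalue()

# Hypothetical header and rows: one clause without satellites, one with two
fieldnames = ['reference', 'SatCount', 'Category 1', 'Category 2']
rows = [['ref A', 0], ['ref B', 2, 'death', 'sex']]

text = write_padded(rows, fieldnames)
parsed = list(csv.reader(io.StringIO(text, newline='')))
```

After padding, every row has as many cells as the header, which keeps the columns aligned when the file is opened in a spreadsheet.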

### Browse the final results at:

[CKB_valency_groups.csv](https://github.com/codykingham/textfabric_notebooks/tree/master/שכב%20valency/CKB_valency_groups.csv)