# Context Selection Discovery


## Assumptions
In this notebook, I explore how relevant various syntactic contexts are for capturing semantic information. I begin with a few assumptions:

1. Semantic meaning is inextricably linked with syntactic structure and vice versa. Here I depend on construction grammar theory similar to Goldberg [(1995)](https://books.google.nl/books/about/Constructions.html?id=HzmGM0qCKtIC&redir_esc=y), but also informed by Talstra's theory on syntax and semantics [(1982)](/bibliography.txt).

2. Following from 1, the constituents within a noun's clause play a key role in distinguishing a noun's meaning. This is a principle that all lexicographers acknowledge when they survey a word's use in context.

3. Nouns that are similar in meaning will exhibit similar collocational preferences with respect to their constituents. This principle is also known as the distributional principle, attributed to Firth: "you shall know a word by the company it keeps" (e.g. [1962](/bibliography.txt)). 
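
As a minimal illustration of assumption 3, the sketch below represents each noun as a bag of context tags and compares nouns by cosine similarity. The function name `cosine`, the tag format, and the counts are all hypothetical: they are invented for illustration and are neither BHSA data nor part of this notebook's helper module.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Toy context counts, invented for illustration only (not BHSA data).
bread = Counter({'Objc->eat': 3, 'Objc->give': 1})
meat = Counter({'Objc->eat': 2, 'Objc->give': 1})
wine = Counter({'Objc->drink': 3, 'Objc->give': 1})

sim_bread_meat = cosine(bread, meat)
sim_bread_wine = cosine(bread, wine)
```

Under these toy counts, the two nouns that share the "eat" context score higher than the pair that shares only "give", which is the behaviour the distributional principle predicts.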

## Candidate Constituents

I hypothesize that the following constituency relations can be especially important for determining semantic meaning:

1. object(noun) -> verb
2. subject(noun) -> verb
3. noun -> coordinate(noun)
4. apposition(noun) -> noun
5. verbing like a <- (noun)
6. subject(noun) -> is a {adjective}(noun)

To test these 6 candidates, I will run some queries below in BHSA using TF Search and some templates. I will manually inspect the clauses to see if there are any other considerations which should be made when constructing the selection parameters for notebook 3.

I will especially be testing groups of similar words identified through the [word2vec experiment](2. word2vec Experiment.ipynb).

Note that throughout this notebook I examine a larger extract of the results, but to save space I limit the displayed output to 5 results after examination. To see all of the results from which I draw my conclusions, increase the `limit=5` argument in every `show_results` call.

In [1]:
# First, I load the necessary modules, data, and helper functions.
import collections
from tf.fabric import Fabric
from functions.helpers import show_results, filter_results

# load BHSA data into TF
TF = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c', silent=True)
api = TF.load('''
                book chapter verse
                function sp pdp mother
                rela
                g_lex_utf8 trailer_utf8
              ''', silent=True)
api.makeAvailableIn(globals()) # globalize TF methods

### object -> verb
The noun is within an object phrase.

In [2]:
obj_vrb = '''
clause
    phrase function=Objc
        word {wPar}
    phrase function=Pred|PreS
        word pdp=verb {vPar}
'''

Bread and wine, as suggested by a previous iteration of the word2vec experiment.

In [43]:
dinner = 'lex=LXM/|JJN/'

S.search(obj_vrb.format(wPar=dinner, vPar=''))
dinner_results = sorted(filter_results(S.fetch(), levels=['word']))

show_results(dinner_results, limit=5, highlight=[1, 3])

211 results

results cut off at 5


In the expanded results, there is some crossover between לחם and יין, such as the hiphil of יצא, but the two words consistently prefer different verb types: לחם prefers the verb אכל, whereas יין prefers verbs such as שתה or שקה. Although these two words appeared together in a previous iteration of the word2vec experiment, they are different: food is eaten, drink is drunk. To test the hypothesis, I now apply the same search to see whether the nouns that collect around אכל are similar in meaning, as I expect them to be.

In [44]:
S.search(obj_vrb.format(wPar='sp=subs', vPar='lex=>KL['))
food_results = sorted(filter_results(S.fetch(), levels={'word'}))

show_results(food_results, limit=5, highlight=[1, 3])

308 results

results cut off at 5


And with drinking...

In [45]:
S.search(obj_vrb.format(wPar='sp=subs', vPar='lex=CTH['))
drink_results = sorted(filter_results(S.fetch(), levels={'word'}))
         
show_results(drink_results, limit=5, highlight=[1, 3])

85 results

results cut off at 5


And now I explore how productive the general pattern might be...

In [47]:
anynoun = 'sp=subs pdp=subs'

S.search(obj_vrb.format(wPar=anynoun, vPar=''))
ovb_results = sorted(filter_results(S.fetch(), levels=['word']))

show_results(ovb_results, limit=5, highlight=[1, 3], random=True)

15696 results

results cut off at 5


Looking through the expanded results, it appears that the assumption is good. However, a few important caveats are necessary.

* Polysemy might complicate some results. For instance, בשׂר occurs quite frequently, which of course is expected since it can mean something like "meat". But there are other uses of this noun that do not relate to meat that is eaten. 

* Figures of speech, especially in discursive portions, could also introduce some spurious correlations. For instance, in Gen 47:22, Jacob's family eats the חֹק of Pharaoh. I.e. it is their apportionment from him. Another example is the עפר which the serpent will eat in the Gen 3:14 curse. 
    * A side example here is the use of the construct to create uniquely defined nouns. For instance, in Deut 32:14 the term דם־ענב is used, which is of course quite different than דם when used elsewhere! Another is the use of מימי רגליהם in Isa 36:16, a euphemism for urine.

Excluding discursive clauses could mitigate some of this potential bias. Also, further refining of the word relations might be achieved later on after the initial vector analysis.

**Important observation**: Upon entering the verb שקה ("give drink") in a previous version of the above search, I found an important feature which should be added to the context selection function: the stem of the verb can be a crucial distinction of meaning. Most instances of שקה are in the hiphil. The same distinction can be seen in the use of the hiphil of יצא for "to bring out," as it is used with bread and wine.

Thus: **The verb stem should be attached to the verb lexeme during the selection process.**
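
A minimal sketch of what attaching the stem to the lexeme could look like during counting. The names `verb_tag` and `count_contexts`, the `lexeme.stem` tag format, and the toy triples are hypothetical. In BHSA the stem would presumably be read from a verbal-stem feature such as `vs`, which this notebook does not load, so treat that as an assumption.

```python
from collections import Counter, defaultdict

def verb_tag(lex, stem):
    """Join a verb lexeme and its stem into one context tag, e.g. 'JY>[.hif'."""
    return f'{lex}.{stem}'

def count_contexts(triples):
    """Count stem-tagged verb contexts per noun.

    triples: iterable of (noun_lexeme, verb_lexeme, verb_stem).
    """
    counts = defaultdict(Counter)
    for noun, verb, stem in triples:
        counts[noun][verb_tag(verb, stem)] += 1
    return counts

# Toy triples, invented for illustration only (not BHSA search results).
triples = [
    ('LXM/', 'JY>[', 'hif'),
    ('JJN/', 'JY>[', 'hif'),
    ('LXM/', 'JY>[', 'qal'),
]
counts = count_contexts(triples)
```

With this tagging, the hiphil and qal uses of the same lexeme are counted as two separate contexts rather than being merged into one.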

Based on my visual inspection here, the context of `object(noun) -> verb` should be weighted heavily.

### subject -> verb

Do certain subjects prefer certain verbs? My expectation is that this is a less important consideration. I also expect that certain verb roots should be excluded, most especially the verb היה, which probably contributes very little to disambiguation in these cases. For these searches, I exclude matches that have היה as the primary verb.
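
One way such an exclusion could be applied to fetched results is sketched below. The function `exclude_verbs` and its parameters are hypothetical. `lex_of` stands in for a node-to-lexeme lookup (in this notebook that role would presumably be played by `F.lex.v`), and `verb_index=4` assumes the result tuple order (clause, phrase, word, phrase, word) produced by the template that follows.

```python
def exclude_verbs(results, lex_of, verb_index=4, stop_lexemes=frozenset({'HJH['})):
    """Drop result tuples whose verb node has a lexeme in stop_lexemes.

    results: iterable of node tuples, e.g. (clause, phrase, word, phrase, word)
    lex_of: callable mapping a node to its lexeme string
    verb_index: position of the verb word within each tuple (an assumption)
    """
    return [r for r in results if lex_of(r[verb_index]) not in stop_lexemes]

# Toy node numbers and lexemes, invented for illustration only.
toy_lex = {10: 'HJH[', 11: '>MR['}
toy_results = [(1, 2, 3, 4, 10), (5, 6, 7, 8, 11)]

kept = exclude_verbs(toy_results, toy_lex.get)
```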

In [32]:
sbj_vrb = '''
clause
    phrase function=Subj
        word {wPar}
    phrase function=Pred|PreS
        word pdp=verb {vPar}
'''

With two synonyms for "chief," as found by word2vec...

In [33]:
chiefs = 'lex=RB==/|FR/'

S.search(sbj_vrb.format(wPar=chiefs, vPar=''))
chiefs_results = sorted(S.fetch())

show_results(chiefs_results, limit=5, highlight=[2, 4])

154 results

results cut off at 5


There are many different verbs that occur with these nouns. But there are some encouraging tendencies:

* Verbs that would be associated with human beings are frequently used: ראה, אמר, קום, פקד, ספר, זכר, קרב, שמע. 
* Another nice find is that many of these verbs also hint at their subject's status, such as קום, ספר, זכר, פקד.

Now with son and brother, also identified by word2vec.

In [34]:
son_brother = 'lex=BN/|>X/'

S.search(sbj_vrb.format(wPar=son_brother, vPar=''))
sb_results = sorted(S.fetch())

show_results(sb_results, limit=5, highlight=[2, 4])

1153 results

results cut off at 5


And now the general pattern...

In [50]:
S.search(sbj_vrb.format(wPar=anynoun, vPar=''))
sbv_results = sorted(S.fetch())

show_results(sbv_results, limit=5, highlight=[1, 3], random=True)

15499 results

results cut off at 5


It's a bit harder to see any prevalent verbs amongst these groups. It's obvious that many verbs are shared with the "chiefs", such as אמר. Verbs of movement like בוא occur a lot, it seems.

**suggestion**: It may be a good idea to weigh this relationship less heavily than the object -> verb relationship. So many different kinds of verbs are likely to be used with these nouns, and there is likely to be heavy overlap amongst the living nouns.

### Coordinates

Words which occur in coordination are probably related somehow. The word2vec experiment picked up on this by grouping animals and temple items together that are included in biblical lists.

For this search it is better to do a loop with TF methods.

In [35]:
coord_results = []

for sp in F.otype.s('subphrase'):

    if F.rela.v(sp) != 'par':
        continue
        
    sp_pos = set(F.sp.v(w) for w in L.d(sp, otype='word'))
    co_pos = set(F.sp.v(w) for w in L.d(E.mother.f(sp)[0], otype='word'))
    
    if 'subs' in sp_pos and 'subs' in co_pos:
        
        result = (L.u(sp, otype='clause')[0], sp, E.mother.f(sp)[0])
        coord_results.append(result)
        
show_results(coord_results, limit=5, highlight=[1, 2], random=True)

7598 results

results cut off at 5


These results are quite encouraging. I see a lot of similar meanings reflected in the parallel relationships. This feature should indeed be considered.
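
If this feature is included, each coordinated pair could be registered in both directions, so that each noun counts as a context of the other. The sketch below assumes the pairs have already been reduced to noun lexemes. The function name `coordinate_contexts`, the `coord->` tag prefix, and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict

def coordinate_contexts(pairs):
    """Count coordination contexts symmetrically.

    pairs: iterable of (noun_lexeme_a, noun_lexeme_b) found in coordination.
    """
    counts = defaultdict(Counter)
    for a, b in pairs:
        counts[a][f'coord->{b}'] += 1
        counts[b][f'coord->{a}'] += 1
    return counts

# Toy pairs, invented for illustration only (not BHSA search results).
pairs = [('KSP/', 'ZHB/'), ('ZHB/', 'KSP/'), ('KSP/', 'NXCT/')]
coord_counts = coordinate_contexts(pairs)
```

Counting in both directions is a design choice: it treats coordination as an unordered relation, which may or may not be appropriate if the order of items in a list turns out to carry meaning.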

### apposition(noun) -> noun

In [51]:
appo_noun = '''

clause
    phrase_atom
        word sp=subs pdp=subs
    <: phrase_atom rela=Appo
        word sp=subs pdp=subs
        
'''

In [59]:
S.search(appo_noun)

app_results = sorted(filter_results(S.fetch(), levels=['word']))

show_results(app_results, limit=5, highlight=[1, 3], random=True)

1828 results

results cut off at 5


This appears to be a fairly productive pattern, but it is better suited to proper nouns, a group I am considering excluding from the initial tests. These patterns might be included, with caution.

### Verbing Like a Noun

This is a pattern which tells about the affect that a given verb has ([Veale 2012](https://www.academia.edu/2884622/Teaching_WordNet_to_Sing_like_an_Angel_and_Cry_like_a_Baby_Learning_Affective_Stereotypical_Behaviors_from_the_Web)). But if a given verb lexeme occurs in this pattern with different nouns, it could suggest that the nouns filling the slot are similar in some way.

In [10]:
verbing_noun = '''

clause
    phrase function=Pred
        word pdp=verb
    <: phrase typ=PP function=Adju
        word lex=K
        <: word pdp=subs

'''

In [11]:
S.search(verbing_noun)

vn_results = sorted(r for r in S.fetch()
                       if F.lex.v(r[2]) not in {'HJH[', '<FH[', 'BW>[', 'JY>[', 'QR>['} # stop words
                       and F.function.v(r[3]) != 'Time'
                   )

show_results(vn_results, limit=5, highlight=[2, 5])

70 results

results cut off at 5


How many of these noun matches share the same verb?

In [12]:
# check all the matches against each other for verb identity but with different noun matches

for r1 in vn_results[:5]: # all results cut off!
    
    r1_verb = F.lex.v(r1[2]) 
    r1_noun = F.lex.v(r1[5])
    
    for r2 in vn_results:
        
        r2_verb = F.lex.v(r2[2])
        r2_noun = F.lex.v(r2[5])
        
        if all([r1_verb == r2_verb, # is same verb
                r1 != r2, # not the same result
                r1_noun != r2_noun # not the same noun lexeme
               ]):
            
            show_results((r1, r2), highlight=[5])

2 results

2 results



Surprisingly, there are relatively few of these results. Also, because the function of the כ phrase cannot be discerned from the pattern, a lot of spurious results plague it. For instance, it is not possible to separate a valid result that compares "crouching" to a לביא or an אריה (e.g. Num 24:9 et al.) from one that makes a comparison unrelated to the verbal affect (e.g. כדבר in "he did according to the word..."). As seen above, the application of a few stop words prevents some of these problems from seeping in. But even with the stop words, there are certain results that do not match.

In all, there are 70 matches at stake. They should either be weighted lightly or not included in the set.
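
The question of which nouns share a verb can also be answered by grouping all matches by verb lexeme, rather than comparing results pairwise. The sketch below assumes each match has been reduced to a (verb lexeme, noun lexeme) pair. The function name `nouns_by_verb` and the toy pairs are hypothetical.

```python
from collections import defaultdict

def nouns_by_verb(pairs):
    """Group noun lexemes by verb lexeme, keeping verbs with 2+ distinct nouns.

    pairs: iterable of (verb_lexeme, noun_lexeme).
    """
    groups = defaultdict(set)
    for verb, noun in pairs:
        groups[verb].add(noun)
    return {verb: sorted(nouns) for verb, nouns in groups.items() if len(nouns) > 1}

# Toy pairs, invented for illustration only (not BHSA search results).
pairs = [
    ('KR<[', '>RJ/'),
    ('KR<[', 'LBJ>/'),
    ('KR<[', '>RJ/'),   # repeated noun: counted once per verb
    ('NPL[', '>BN/'),   # verb with a single noun: dropped
]
shared = nouns_by_verb(pairs)
```

Grouping visits each match once and reports each verb a single time, whereas the nested loop reports every qualifying pair separately.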

### Noun is a {adj} Noun

In [26]:
noun_noun = '''

clause
    phrase function=Subj
        word sp=subs
        
    phrase function=PreC
        =: word sp=subs
        subphrase rela=atr
'''

In [28]:
S.search(noun_noun)
nn_results = sorted(filter_results([r for r in S.fetch() if F.typ.v(r[3]) == 'NP'],
                    levels=['word'])
                   )

show_results(nn_results, limit=5, highlight=[1, 3])

65 results

results cut off at 5


These patterns yield a number of promising results, but they are also plagued by a large number of metaphorical uses. For instance, Jer 9:7 compares the לשון with חץ "arrows." On the other hand, relations such as קמה "height" and אמות "cubits" (Ex 27:18) or קלעי "curtains" and שש "linen" (Ex 38:16) are very valuable. Perhaps these patterns can be processed later on.

## Summary

The three most productive patterns, it seems, are the `object(noun) -> verb`, `noun -> coordinate(noun)`, and `subject(noun) -> verb` relations.

The `apposition` pattern may be usable, but I am cautious about it. It seems better suited to proper nouns. I can experiment with what effect it has on the results.

As for the two others, `verbing like a noun` and `noun is a {adj} noun`, while they produce promising results, there are also many low-quality results mixed in. I will thus initially exclude them from the context selection process.

In terms of weighting for the three productive patterns: it is best to wait and test several results. Also, what needs to be considered with the weighting is whether normal cooccurrences will be counted or not. For instance, should I choose to augment the syntactic cooccurrences with those that occur within a certain window of words, I may want to give the windowed words a light weight and the syntactic paths a heavier one. 
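
A minimal sketch of how such a weighting could be applied when building a noun's context vector. The function name `combine_counts`, the source labels, and the weight values are placeholders chosen for illustration, not tested settings. The counts are toy values, not BHSA data.

```python
from collections import Counter

def combine_counts(counts_by_source, weights):
    """Merge context counts from several sources into one weighted vector.

    counts_by_source: dict mapping a source label to a Counter of contexts
    weights: dict mapping a source label to its weight (default 1.0)
    """
    combined = Counter()
    for source, counts in counts_by_source.items():
        weight = weights.get(source, 1.0)
        for context, n in counts.items():
            combined[context] += weight * n
    return combined

# Toy counts for a single noun, invented for illustration only.
counts_by_source = {
    'object->verb': Counter({'>KL[': 2}),
    'window': Counter({'>KL[': 1, 'BJT/': 4}),
}
weights = {'object->verb': 1.0, 'window': 0.25}  # placeholder weights

vector = combine_counts(counts_by_source, weights)
```

Keeping the weights in a separate dictionary makes it straightforward to rerun the analysis under several weightings and compare the results.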