# Context Selection Discovery


## Assumptions
In this notebook, I explore how relevant various contexts are for capturing semantic information. I begin with a few assumptions:

1. Semantic meaning is inextricably linked with syntactic structure and vice versa. Here I depend on construction grammar theory similar to Goldberg [(1995)](https://books.google.nl/books/about/Constructions.html?id=HzmGM0qCKtIC&redir_esc=y), but also informed by Talstra's theory on syntax and semantics [(1982)](/bibliography.txt).

2. Following from 1, the constituents within a noun's clause play a key role in distinguishing a noun's meaning. This is a principle that all lexicographers acknowledge when they survey a word's use in context.

3. Nouns that are similar in meaning will exhibit similar collocational preferences with respect to their constituents. This principle is also known as the distributional principle, attributed to Firth: "you shall know a word by the company it keeps" (e.g. [1962](/bibliography.txt)). 
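As a rough illustration of assumption 3, the sketch below compares nouns by the verbs they occur with. The counts are invented toy numbers rather than BHSA data, and `cosine` is a helper written only for this example, so the result shows the idea and nothing more.

```python
import math
from collections import Counter

# Toy verb-collocate counts per noun (invented for illustration, not BHSA data).
collocates = {
    'bread': Counter({'eat': 8, 'bring_out': 2}),
    'meat': Counter({'eat': 6, 'give': 1}),
    'wine': Counter({'drink': 7, 'bring_out': 2}),
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

bread_meat = cosine(collocates['bread'], collocates['meat'])
bread_wine = cosine(collocates['bread'], collocates['wine'])

# Nouns sharing collocates should score higher than nouns that do not.
print(bread_meat > bread_wine)
```

With these toy counts, the two nouns that share the verb "eat" come out far closer to each other than either is to "wine".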

## Candidate Constituents

I hypothesize that the following constituency relations can be especially important for determining semantic meaning:

1. object(noun) -> verb
2. subject(noun) -> verb
3. noun -> coordinate(noun)
4. verbing like a <- (noun)
5. subject(noun) -> was a {adjective}(noun)

To test these 5 candidates, I will run some queries below in BHSA using TF Search and some templates. I will manually inspect the clauses to see if there are any other considerations which should be made when constructing the selection parameters for notebook 3.

I will especially be testing groups of similar words identified through the [word2vec experiment](2. word2vec Experiment.ipynb).

In [9]:
# First, I load the necessary modules, data, and helper functions.
import collections
from tf.fabric import Fabric
from functions.helpers import show_results, filter_results

# load BHSA data into TF
TF = Fabric(locations='~/github/etcbc/bhsa/tf', modules='c', silent=True)
api = TF.load('''
                book chapter verse
                function sp pdp
                g_lex_utf8 trailer_utf8
              ''', silent=True)
api.makeAvailableIn(globals()) # globalize TF methods

### object -> verb
The noun is within the object phrase of a clause.

In [10]:
obj_vrb = '''
clause
    phrase function=Objc
        word {wPar}
    phrase function=Pred|PreS
        word pdp=verb {vPar}
'''
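The template leaves two slots open, `{wPar}` for the noun and `{vPar}` for the verb, which `str.format` fills in before the search runs. A minimal sketch of that step, using a local copy of the template so the block runs on its own (the name `template` is mine):

```python
# Local copy of the template so this block runs on its own.
template = '''
clause
    phrase function=Objc
        word {wPar}
    phrase function=Pred|PreS
        word pdp=verb {vPar}
'''

# Constrain only the noun slot; the verb slot stays open.
any_verb = template.format(wPar='lex=LXM/|JJN/', vPar='')

# Constrain only the verb slot; any substantive may fill the noun slot.
any_noun = template.format(wPar='sp=subs', vPar='lex=>KL[')

print(any_verb)
print(any_noun)
```

Passing an empty string for a slot leaves that word unconstrained beyond what the template already states.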

Bread and wine, as suggested by a previous iteration of the word2vec experiment.

In [16]:
dinner = 'lex=LXM/|JJN/'

S.search(obj_vrb.format(wPar=dinner, vPar=''))
dinner_results = sorted(S.fetch())

show_results(dinner_results[:5], highlight=[2, 4])


In the expanded results, while there is some crossover between לחם and יין, such as the hiphil of יצא, the words consistently prefer two different verb types. לחם prefers the verb אכל whereas יין prefers verbs such as שתה or שקה. While these two words appeared together in a previous iteration of the word2vec experiment, they are different. Food is eaten. Drink is drunk. To test the hypothesis now, I apply the same search to see if the nouns that collect around אכל are similar in meaning, as I expect them to be.
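The kind of tally behind this observation can be sketched as follows. The (noun, verb) pairs are invented stand-ins for search results, written with ETCBC transliterated lexemes; real counts would have to come from the query itself.

```python
from collections import Counter, defaultdict

# Invented (noun lexeme, verb lexeme) pairs standing in for search results.
pairs = [
    ('LXM/', '>KL['), ('LXM/', '>KL['), ('LXM/', 'JY>['),
    ('JJN/', 'CTH['), ('JJN/', 'CQH['), ('JJN/', 'CTH['), ('JJN/', 'JY>['),
]

# Count, for each noun, how often each verb governs it as object.
verbs_by_noun = defaultdict(Counter)
for noun, verb in pairs:
    verbs_by_noun[noun][verb] += 1

for noun, counts in verbs_by_noun.items():
    print(noun, counts.most_common(2))
```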

In [26]:
S.search(obj_vrb.format(wPar='sp=subs', vPar='lex=>KL['))
food_results = sorted(filter_results(S.fetch(), levels={'word'}))

show_results(food_results, limit=5, highlight=[1, 3])


results cut off at 5


And with drinking...

In [25]:
S.search(obj_vrb.format(wPar='sp=subs', vPar='lex=CTH['))
drink_results = sorted(filter_results(S.fetch(), levels={'word'}))
         
show_results(drink_results, limit=5, highlight=[1, 3])


results cut off at 5


Looking through the expanded results, the assumption appears to hold. However, a few important caveats are necessary.

* Polysemy might complicate some results. For instance, בשׂר occurs quite frequently, which of course is expected since it can mean something like "meat". But there are other uses of this noun that do not relate to meat that is eaten. 

* Figures of speech, especially in discursive portions, could also introduce some spurious correlations. For instance, in Gen 47:22 the Egyptian priests eat the חֹק of Pharaoh, i.e. their apportionment from him. Another example is the עפר which the serpent will eat in the Gen 3:14 curse.
    * A side example here is the use of the construct to create uniquely defined nouns. For instance, in Deut 32:14 the term דם־ענב is used, which is of course quite different from דם when used elsewhere! Another is מימי רגליהם in Isa 36:12, a euphemism for urine.

Excluding discursive clauses could mitigate some of this potential bias. Also, further refining of the word relations might be achieved later on after the initial vector analysis.

**important observation**: Upon entering the verb שקה ("give drink") in a previous version of the above search, I found an important feature which should be added to the context selection function. The stem of the verb can be a crucial distinction of meaning. Most instances of שקה are in the hiphil. The same distinction can be seen in the hiphil of יצא, "to bring out," as it is used with bread and wine.

Thus: **The verb stem should be attached to the verb lexeme during the selection process.**
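A minimal sketch of what attaching the stem could look like. The helper `verb_key` and the observations are hypothetical. I assume the stem label would come from a feature such as BHSA's `vs`, which is not among the features loaded in this notebook.

```python
from collections import Counter

def verb_key(lexeme, stem):
    """Join a verb lexeme and its stem into one context label."""
    return f'{lexeme}.{stem}'

# Invented (lexeme, stem) observations; stem labels are BHSA-style abbreviations.
observations = [('JY>[', 'qal'), ('JY>[', 'hif'), ('JY>[', 'hif'), ('CQH[', 'hif')]

# Lexeme alone merges "go out" and "bring out" into a single context.
by_lexeme = Counter(lex for lex, _ in observations)

# Lexeme plus stem keeps them apart.
by_lexeme_and_stem = Counter(verb_key(lex, stem) for lex, stem in observations)

print(by_lexeme)
print(by_lexeme_and_stem)
```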

Based on my visual inspection here, the context of `object(noun) -> verb` should be weighted heavily.

### subject -> verb

Do certain subjects prefer certain verbs? My expectation is that this is a less important consideration. I also expect that certain verb roots should be excluded, most especially the verb היה, which probably contributes very little to disambiguation in these cases. For these searches, I exclude matches that have היה as the primary verb.
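One way such an exclusion could be applied is as a post-filter on the matches. The sketch below uses invented (subject, verb) pairs and a hypothetical `EXCLUDED_VERBS` set; it does not reproduce what the notebook's helper functions do.

```python
# Invented (subject lexeme, verb lexeme) pairs standing in for search results.
pairs = [('FR/', '>MR['), ('FR/', 'HJH['), ('RB==/', 'QWM['), ('RB==/', 'HJH[')]

# Verb lexemes to drop; HJH[ is היה in ETCBC transliteration.
EXCLUDED_VERBS = {'HJH['}

kept = [(noun, verb) for noun, verb in pairs if verb not in EXCLUDED_VERBS]
print(kept)
```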

In [28]:
sbj_vrb = '''
clause
    phrase function=Subj
        word {wPar}
    phrase function=Pred|PreS
        word pdp=verb {vPar}
'''

With two synonyms for "chief," as found by word2vec...

In [30]:
chiefs = 'lex=RB==/|FR/'

S.search(sbj_vrb.format(wPar=chiefs, vPar=''))
chiefs_results = sorted(S.fetch())

show_results(chiefs_results, limit=5, highlight=[2, 4])


results cut off at 5


There are many different verbs that occur with these nouns. But there are some encouraging tendencies:

* Verbs that would be associated with human beings are frequently used: ראה, אמר, קום, פקד, ספר, זכר, קרב, שמע. 
* Another nice find is that many of these verbs also hint at their subject's status, such as קום, ספר, זכר, פקד.

Now with son and brother, also identified by word2vec.

In [35]:
son_brother = 'lex=BN/|>X/'

S.search(sbj_vrb.format(wPar=son_brother, vPar=''))
sb_results = sorted(S.fetch())

show_results(sb_results, limit=5, highlight=[2, 4])


results cut off at 5


It's a bit harder to see any prevalent verbs amongst these groups. Many verbs are clearly shared with the "chiefs", such as אמר. Verbs of movement such as בוא also seem to occur often.

**suggestion**: It may be a good idea to weigh this relationship less heavily than the object -> verb relationship. So many different kinds of verbs are likely to be used with these nouns, and there is likely to be heavy overlap amongst nouns that denote living beings.
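The weighting could be sketched like this when building a noun's context counts. The weights, the relation labels, and the observations are all placeholders chosen for illustration, not tuned or measured values.

```python
from collections import Counter

# Hypothetical relation weights; the values are placeholders, not tuned.
WEIGHTS = {'object->verb': 1.0, 'subject->verb': 0.5}

# Invented (noun, relation, verb) observations for a single noun.
observations = [
    ('BN/', 'subject->verb', '>MR['),
    ('BN/', 'subject->verb', 'BW>['),
    ('BN/', 'object->verb', 'JLD['),
]

# Each context adds its relation's weight instead of a flat count of 1.
vector = Counter()
for noun, relation, verb in observations:
    vector[f'{relation}:{verb}'] += WEIGHTS[relation]

print(vector)
```

Keeping the relation in the context label means the same verb contributes separately as a subject context and as an object context.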