# 4. Semantic Space Construction

In this notebook, I apply the new `heads` edge feature to extract the head nouns of each phrase and record the verbs, subjects, objects, and coordinate nouns they co-occur with. Each type of relationship is assigned a weight, and the weighted co-occurrence counts are collected in a matrix. I then apply an association measure to the counts.

In [35]:
import collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tf.fabric import Fabric

TF = Fabric(locations='~/github', modules=['etcbc/bhsa/tf/c', 'semantics/tf/c'])
api = TF.load('''
                book chapter verse
                function lex vs language
                pdp
                heads
              ''')
api.makeAvailableIn(globals())

This is Text-Fabric 3.2.2
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

116 features found and 0 ignored
  0.00s loading features ...
   |     0.01s B book                 from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.01s B chapter              from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.01s B verse                from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.10s B function             from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.19s B lex                  from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.16s B vs                   from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.13s B language             from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.13s B pdp                  from /Users/cody/github/etcbc/bhsa/tf/c
   |     0.06s B heads                from /Users/cody/github/semantic

## Gather and Count Noun Relations

Now I will gather nouns from the Hebrew Bible and count syntactic co-occurrences.

In [149]:
# configure weights
path_weights = {'Subj': {'Pred': 1,
                         'Objc': .5
                        },
                'Objc': {
                         'Pred': 1,
                         'Subj': .5
                        },
                'coor': 1
               }

In [150]:
cooccurrences = collections.defaultdict(collections.Counter) # noun counts here

# Subj/Objc Counts
for phrase in F.otype.s('phrase'):
    
    # skip non-Hebrew sections
    language = F.language.v(L.d(phrase, 'word')[0]) 
    if language != 'Hebrew':
        continue
    
    # skip non subject/object phrases
    function = F.function.v(phrase)
    if function not in {'Subj', 'Objc'}:
        continue
        
    # get head nouns
    nouns = set(F.lex.v(w) for w in E.heads.f(phrase)) # count lexemes only once
    if not nouns:
        continue
        
    # gather contextual data
    clause = L.u(phrase, 'clause')[0]
    good_paths = path_weights[function]
    paths = [ph for ph in L.d(clause, 'phrase')
                if F.function.v(ph) in good_paths
            ]
    
    # make the counts
    for path in paths:
        
        pfunct = F.function.v(path)
        weight = good_paths[pfunct]
        
        # count for verb
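        if pfunct == 'Pred':
            verbs = [w for w in L.d(path, 'word') if F.pdp.v(w) == 'verb']
            if not verbs:
                continue # no verb in this predicate phrase; nothing to count
            verb = verbs[0]
            verb_lex = F.lex.v(verb)
            verb_stem = F.vs.v(verb)
            verb_basis = verb_lex + '.' + verb_stem # basis element: lexeme.stem
            for noun in nouns:
                cooccurrences[noun][verb_basis] += weight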
            
        # count for subj/obj
        else:
            conouns = E.heads.f(path)
            cnoun_bases = set(F.lex.v(w) for w in conouns)
            counts = {basis: weight for basis in cnoun_bases}
            for noun in nouns:
                cooccurrences[noun].update(counts)
                
    # count coordinates (other head nouns within the same phrase)
    for noun in nouns:
        for cnoun in nouns:
            if cnoun != noun:
                cooccurrences[noun][cnoun] += path_weights['coor']
            
cooccurrences = pd.DataFrame(cooccurrences).fillna(0)                
                
print(len(cooccurrences.columns), 'nouns')
print(len(cooccurrences.index), 'cooccurrences')

3023 nouns
3847 cooccurrences
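
To make the shape of the resulting matrix concrete, here is a minimal sketch that builds the same kind of nested counter by hand and converts it to a DataFrame. The lexemes and counts are invented for illustration and are not drawn from the BHSA data; only the data structure mirrors the cell above.

```python
import collections
import pandas as pd

# Toy counts with invented lexemes (hypothetical, not BHSA data).
toy = collections.defaultdict(collections.Counter)

toy['king']['rule.qal'] += 1        # noun co-occurring with a verb (weight 1)
toy['king'].update({'city': 0.5})   # noun co-occurring with a co-argument (weight .5)
toy['city']['build.qal'] += 1
toy['city'].update({'king': 0.5})

# Outer keys (target nouns) become columns; inner keys (co-occurring
# items) become the index. Unseen pairs are filled with 0.
toy_matrix = pd.DataFrame(toy).fillna(0)

print(len(toy_matrix.columns), 'nouns')
print(len(toy_matrix.index), 'cooccurrences')
```

Each column is therefore one noun's co-occurrence vector, which is why the real matrix has one column per noun and one row per co-occurring item.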


## Apply Association Measure

The association measure applied to the raw counts is the log-likelihood ratio ($G^2$).
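
The notebook names the measure but does not show how it is computed, so the following is only a sketch under stated assumptions. It assumes the common 2x2 contingency formulation of $G^2$, in which each cell of the count matrix is compared against its row total, column total, and the grand total. The function name `log_likelihood` and the toy table are hypothetical.

```python
import numpy as np
import pandas as pd

def log_likelihood(counts: pd.DataFrame) -> pd.DataFrame:
    """Sketch: G2 score for every cell of a co-occurrence count matrix.

    Assumes the standard 2x2 contingency formulation; not necessarily
    the exact procedure used in this notebook.
    """
    k = counts.to_numpy(dtype=float)
    n = k.sum()                          # grand total
    row = k.sum(axis=1, keepdims=True)   # row totals, shape (r, 1)
    col = k.sum(axis=0, keepdims=True)   # column totals, shape (1, c)

    # Observed 2x2 table for each cell (i, j)
    o11 = k
    o12 = row - k
    o21 = col - k
    o22 = n - row - col + k

    # Expected values under independence
    e11 = row * col / n
    e12 = row * (n - col) / n
    e21 = (n - row) * col / n
    e22 = (n - row) * (n - col) / n

    def term(o, e):
        # o * ln(o / e), using the convention 0 * ln(0) = 0
        with np.errstate(divide='ignore', invalid='ignore'):
            t = o * np.log(o / e)
        return np.where(o > 0, t, 0.0)

    g2 = 2 * (term(o11, e11) + term(o12, e12) + term(o21, e21) + term(o22, e22))
    return pd.DataFrame(g2, index=counts.index, columns=counts.columns)

# Toy table (invented values): rows are co-occurring items, columns are nouns.
toy_counts = pd.DataFrame([[10, 0], [0, 10]], index=['a', 'b'], columns=['x', 'y'])
toy_g2 = log_likelihood(toy_counts)
```

Note that $G^2$ computed this way is unsigned: it is large both when a pair occurs more often and when it occurs less often than expected. Whether the scores should be signed or thresholded is a separate design choice that this sketch leaves open.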

## Calculate Similarities