# First Analysis of the Experiment Data

In [17]:
import numpy as np
import pandas as pd
import collections, os, sys, random, time, pickle, dill, copy, re
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments2 import Experiment
from project_code.semspace import SemSpace

bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/verb_semantics/project_code/lingo/heads/tf/c',
                 '~/github/verb_semantics/project_code/sdbh']

TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex lex_utf8 vs language
                pdp freq_lex gloss domain ls
                mother rela typ sp st code txt instruction
                heads prep_obj 
                prs prs_gn prs_nu prs_ps
                sem_domain sem_domain_code
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='', version='c')

This is Text-Fabric 4.4.1
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored


**Documentation:** [BHSA](https://etcbc.github.io/bhsa) [Feature docs](https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html) [BHSA API](https://dans-labs.github.io/text-fabric/Api/Bhsa/) [Text-Fabric API 4.4.1](https://dans-labs.github.io/text-fabric/Api/General/) [Search Reference](https://dans-labs.github.io/text-fabric/Api/General/#search-templates)

In [18]:
print('Loading cached experiments...')
with open('/Users/cody/Documents/experiments.dill', 'rb') as infile:
    experiments = dill.load(infile)
    print(f'{len(experiments)} experiments loaded.')

Loading cached experiments...
50 experiments loaded.


## Experiments and their Shapes

In [3]:
for shape, exp in sorted((experiments[exp].data.shape, exp) for exp in experiments):
    print(f'{exp}:\t{shape}')

vi_subj_animacy:	(2, 180)
vi_objc_pa:	(2, 714)
vi_adj+_pa:	(2, 734)
vi_cmpl_pa:	(2, 734)
vi_coad_pa:	(2, 734)
vf_obj_pa:	(3, 694)
vd_domain_simple:	(3, 704)
vf_cmpl_pa:	(4, 725)
vi_allarg_pa:	(4, 786)
vi_objc_animacy:	(5, 173)
vf_adju_pa:	(7, 733)
vf_obj_animacy:	(8, 127)
vf_coad_pa:	(8, 734)
vg_tense:	(8, 734)
vi_cmpl_animacy:	(39, 174)
vf_argAll_pa:	(43, 703)
vi_adj+_animacy:	(46, 108)
vi_coad_animacy:	(51, 241)
vd_domain_embed:	(73, 646)
vf_cmpl_animacy:	(88, 158)
vi_allarg_animacy:	(92, 370)
vf_adju_animacy:	(96, 78)
vf_coad_animacy:	(200, 192)
vi_subj_domain:	(247, 231)
vd_par_lex:	(305, 365)
vf_argAll_animacy:	(378, 207)
vi_objc_domain:	(448, 245)
vf_obj_domain:	(584, 213)
vi_cmpl_domain:	(1033, 223)
vf_cmpl_domain:	(1128, 207)
vi_adj+_domain:	(1219, 217)
vf_adju_domain:	(1575, 180)
vi_coad_domain:	(1735, 386)
vi_subj_lex:	(1959, 290)
vi_objc_lex:	(2251, 305)
vi_allarg_domain:	(2902, 527)
vf_obj_lex:	(3012, 274)
vf_coad_domain:	(3055, 301)
vi_adj+_lex:	(3478, 295)
vf_cmpl_lex:	(4

## Deriving a Verb Specimen Set

Which verbs will make up this study? That depends on which verbs the dataset accounts for. The presence/absence (pa) experiments count the most basic features, namely the simple presence or absence of a given verbal argument: objects, complements, adjuncts. I will first try to derive a specimen set by taking the intersection of the verbs found in all the pa experiments.

In [10]:
specimens = set(experiments['vi_objc_pa'].data.columns)
print(f'Starting with specimens count of {len(specimens)}')

for exp_name, experiment in experiments.items():
    
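    # skip the starting experiment (already in `specimens`) and any non-pa experiment;
    # note: '.*_pa' is not anchored at the end, so it also matches 'vd_par_lex'
    if exp_name == 'vi_objc_pa' or not re.match('.*_pa', exp_name):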
        continue
        
    exp_lexemes = set(experiment.data.columns)
    specimens = exp_lexemes & specimens
    
print(f'Complete with specimen count of {len(specimens)}')

Starting with specimens count of 714
Complete with specimen count of 297
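
The filter above selects experiments with `re.match('.*_pa', exp_name)`. Because that pattern is not anchored at the end of the name, it also accepts names such as `vd_par_lex`. The sketch below compares it with an `endswith('_pa')` test and shows how the intersection can differ. The specimen labels and the plain sets standing in for `experiment.data.columns` are invented for illustration; only the three experiment names come from the listing above.

```python
import re

# Hypothetical stand-ins: experiment name -> set of specimen labels.
# In the notebook these sets would come from experiment.data.columns.
toy_experiments = {
    'vi_objc_pa': {'a.qal', 'b.qal', 'c.hif'},
    'vf_obj_pa': {'a.qal', 'c.hif', 'd.piel'},
    'vd_par_lex': {'a.qal'},
}

# Unanchored pattern, as in the cell above: matches '_pa' anywhere in the name.
loose = [name for name in toy_experiments if re.match('.*_pa', name)]

# Stricter alternative: only names that actually end in '_pa'.
strict = [name for name in toy_experiments if name.endswith('_pa')]


def intersect_specimens(experiments, names):
    """Return the specimens shared by every experiment listed in `names`."""
    return set.intersection(*(experiments[name] for name in names))


loose_specimens = intersect_specimens(toy_experiments, loose)
strict_specimens = intersect_specimens(toy_experiments, strict)

print(loose, sorted(loose_specimens))
print(strict, sorted(strict_specimens))
```

With the toy data, the unanchored pattern pulls `vd_par_lex` into the intersection and shrinks the result. Whether that matters for the real data depends on which specimens `vd_par_lex` contains.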


### Count Lexemes without Stems

Every specimen label is a lexeme with its verbal stem appended after a period. How many plain lexemes are in the specimen set?

In [47]:
plain_lexs = set(lex.split('.')[0] for lex in specimens)
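# map each lexeme string to its first lex node in a single pass over the lex nodes
first_lex_node = {}
for l in F.otype.s('lex'):
    first_lex_node.setdefault(F.lex.v(l), l)
lex_nodes = [(first_lex_node[lex],) for lex in plain_lexs]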

print(f'Number of plain lexemes accounted for: {len(plain_lexs)}\n')
B.show(lex_nodes)

Number of plain lexemes accounted for: 243




*[Display output of `B.show(lex_nodes)`: Passage 1 to Passage 100]*


**120 more results skipped** because we show a maximum of 100 results at a time

We have a good and diverse dataset here. Let's see how the specimens are distributed across verbal stems.

### Stem Counts

In [46]:
stem_counts = collections.Counter(lex.split('.')[1] for lex in specimens)

stem_counts.most_common()

[('qal', 167),
 ('hif', 59),
 ('piel', 41),
 ('nif', 24),
 ('hit', 4),
 ('hof', 1),
 ('hsht', 1)]
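
Since 297 specimens reduce to 243 plain lexemes, some lexemes must occur with more than one stem. A minimal sketch of how those lexemes could be listed follows. The specimen labels are invented examples in the `lexeme.stem` form, and `rpartition` is used here instead of the notebook's `split('.')` so that the stem is always taken from after the last period; each label is assumed to contain a period.

```python
import collections

# Hypothetical specimen labels in the 'lexeme.stem' form used in this notebook.
toy_specimens = {'NTN[.qal', 'NTN[.nif', 'BW>[.qal', 'BW>[.hif', 'DBR[.piel'}

# Collect the set of stems attested for each plain lexeme.
stems_by_lex = collections.defaultdict(set)
for specimen in toy_specimens:
    # rpartition splits on the last '.', leaving the stem as the final part
    lex, _, stem = specimen.rpartition('.')
    stems_by_lex[lex].add(stem)

# Keep only the lexemes that occur with more than one stem.
multi_stem = {
    lex: sorted(stems)
    for lex, stems in stems_by_lex.items()
    if len(stems) > 1
}

print(f'{len(stems_by_lex)} plain lexemes, {len(multi_stem)} with more than one stem')
```

Running the same loop over `specimens` would show which lexemes account for the gap between the two totals.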

## Coverage of Specimens throughout the Experiments

The specimen set contains 297 verbs (lexeme plus stem), built on 243 plain lexemes. The experiments differ in how many specimens they include, because each experiment has its own requirements. We intend to use as much data as is available to inform the verb clustering. But are there verbs whose coverage across the 50 experiments is especially poor?

Below we count, for each specimen, the number of experiments it occurs in. We then tally how many specimens share each count, so every pair in the output reads (number of experiments, number of specimens).

In [64]:
spread_count = collections.Counter()
accounted_exps = set()

for lex in specimens:
    count = 0
    for exp_name, exp in experiments.items():
        if lex in exp.data.columns:
            count += 1
            accounted_exps.add(exp_name)
    spread_count[count] += 1
    
spread_count.most_common()

[(50, 26),
 (48, 14),
 (17, 14),
 (44, 13),
 (18, 13),
 (25, 13),
 (19, 12),
 (41, 12),
 (21, 11),
 (40, 11),
 (42, 10),
 (49, 10),
 (32, 10),
 (29, 9),
 (37, 8),
 (23, 8),
 (39, 8),
 (34, 8),
 (33, 7),
 (46, 7),
 (36, 7),
 (22, 7),
 (31, 6),
 (30, 6),
 (24, 6),
 (47, 5),
 (27, 5),
 (26, 5),
 (43, 5),
 (38, 4),
 (20, 4),
 (45, 4),
 (35, 3),
 (16, 2),
 (15, 2),
 (28, 2)]

What is the smallest number of experiments that any specimen appears in, and how many specimens have that count?

In [65]:
min(spread_count.most_common())

(15, 2)

In [66]:
sum([amount for count, amount in spread_count.most_common() if count<20])

43

In [68]:
43/297 # share of all specimens that occur in fewer than 20 experiments

0.1447811447811448

The least-covered specimens appear in 15 experiments, and there are only 2 of them. 43 of the 297 verb specimens (14%) appear in fewer than 20 experiments. These amounts are acceptable. A verb without an attested object, for instance, will not appear in 3 of 4 object experiments, but it *will* occur in the presence/absence object experiment. So it is to be expected that some verbs are less well represented across the experiments.
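
The counts above say how many specimens are thinly covered but not which ones. A minimal sketch of how they could be listed by name follows. The experiment names, specimen labels, and threshold are toy values, and plain sets stand in for the `exp.data.columns` membership test used in the notebook.

```python
# Hypothetical coverage data: experiment name -> set of specimens it contains.
toy_experiments = {
    'exp_a': {'x.qal', 'y.qal', 'z.hif'},
    'exp_b': {'x.qal', 'y.qal'},
    'exp_c': {'x.qal'},
}
toy_specimens = {'x.qal', 'y.qal', 'z.hif'}

# Number of experiments each specimen occurs in.
coverage = {
    specimen: sum(specimen in members for members in toy_experiments.values())
    for specimen in toy_specimens
}

# Toy threshold; the notebook uses 20 with its 50 experiments.
threshold = 2
low_coverage = sorted(s for s, n in coverage.items() if n < threshold)

print(low_coverage)
```

Keeping the per-specimen dictionary, rather than only the tally, makes it possible to inspect individual verbs before deciding whether their coverage is a problem.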