# Data Validation and Discovery

The purpose of this notebook is to examine the shape and contents of the datasets generated using the parameters stored in [project_code/parameters.py](project_code/parameters.py). Specifically, I am looking for consistency in the data, and I am checking that the data to be analyzed indeed represents what I expect it to. Selecting linguistic data inevitably involves a number of assumptions about the shape of the data. When the selected data does not match those expectations, the analysis built on it will be incorrect. I obviously want to avoid that!

In the notebook, I load all of the experiments defined in the parameters module. For each dataset, I analyze the contents and distribution of features. At the same time, I look at the matched clauses behind the data for cases that are unexpected or surprising.

### Data Preparation

In [1]:
import numpy as np
import pandas as pd
import collections, os, sys, random, time, pickle, dill, copy, re
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments2 import Experiment
from project_code.semspace import SemSpace

bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/verb_semantics/project_code/lingo/heads/tf/c',
                 '~/github/verb_semantics/project_code/sdbh']

TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex vs language
                pdp freq_lex gloss domain ls
                mother rela typ sp st code txt instruction
                heads prep_obj
                prs prs_gn prs_nu prs_ps
                sem_domain sem_domain_code
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='', version='c')

This is Text-Fabric 4.4.1
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="{provenance of this corpus}">BHSA</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="{CORPUS} feature documentation">Feature docs</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/Bhsa/" title="BHSA API documentation">BHSA API</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/" title="text-fabric-api">Text-Fabric API 4.4.1</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/#search-templates" title="Search Templates Introduction and Reference">Search Reference</a>

There are two kinds of experiment parameters: inventories and frames. Inventories count each feature that co-occurs with a verb individually. Frames count all features within the verb's clause as a single unit, or frame. I will load both types into a single dictionary.
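The difference between the two counting schemes can be illustrated with a minimal sketch on toy data. The clause list, the function labels, and the `|`-joined, sorted frame string are assumptions made for illustration only; the actual tokenization is done by the `Experiment` class and may differ.

```python
from collections import Counter

# Hypothetical toy data: each clause is (verb, [argument features in that clause]).
clauses = [
    ("NTN[.qal", ["Objc", "Cmpl"]),
    ("NTN[.qal", ["Objc"]),
    ("BW>[.qal", ["Cmpl"]),
]

inventory = Counter()  # one count per (single feature, verb) pair
frames = Counter()     # one count per (whole-clause frame, verb) pair

for verb, features in clauses:
    # inventory: every feature is counted on its own
    for feature in features:
        inventory[(feature, verb)] += 1
    # frame: all features of the clause form one unit (assumed: sorted, joined by "|")
    frame = "|".join(sorted(features))
    frames[(frame, verb)] += 1

print(inventory)
print(frames)
```

In the inventory, the two `Objc` clauses of the first verb collapse into one cell; in the frames, `Cmpl|Objc` and `Objc` remain distinct bases.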

In [2]:
success = []

## Build or Load Experiment Data

In [3]:
run_new, cached_data = True, False

if run_new:
    
    good_runs = []
    
    from project_code.parameters import * # import all definitions and functions from the parameters module
    
    experiments = {}

    for label, exp_type in params.items():    
        print(f'\nprocessing {label} experiments...')
        for name, experiment_params in exp_type.items():

            if name in success:
                continue

            print(f'\tbuilding {name}...')
            is_frame = False if label =='inventory' else True
            min_obs = 10 if name != 'vd_par_lex' else 1
            experiments[name] = Experiment(experiment_params, tf=tf_api, frame=is_frame, min_observation=min_obs)  
            good_runs.append(name)
            print('\t\tfinished')

    print('finished processing experiments...')
    print(f'\t{len(experiments)} experiments loaded.')
    
    print('dumping experiment into cache...')
    with open('/Users/cody/Documents/experiments.dill', 'wb') as outfile:
        dill.dump(experiments, outfile)
    print('\tDone!')
    
else:
    with open('/Users/cody/Documents/experiments.dill', 'rb') as infile:
        experiments = dill.load(infile)
        print(f'{len(experiments)} experiments loaded.')

Preparing frame validation data for vf_argAll_pa...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_allarg_lex...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_allarg_sd...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_allarg_an...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_pa...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_lex...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_sd...
	preparing good mother set...
	

## Analysis

### Shapes, Bases * Targets

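Here the experiments are listed with the shape of their data, i.e. (number of bases, number of targets). The bases are the units or tests of each experiment, e.g. domain, lexeme, presence/absence (pa). The list is sorted by shape, so experiments with the fewest bases come first: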

In [11]:
for shape, exp in sorted((experiments[exp].data.shape, exp) for exp in experiments):
    print(f'{exp}:\t{shape}')

vi_subj_animacy:	(2, 180)
vi_objc_pa:	(2, 714)
vi_adj+_pa:	(2, 734)
vi_cmpl_pa:	(2, 734)
vi_coad_pa:	(2, 734)
vf_obj_pa:	(3, 694)
vd_domain_simple:	(3, 704)
vf_cmpl_pa:	(4, 725)
vi_allarg_pa:	(4, 786)
vi_objc_animacy:	(5, 173)
vf_adju_pa:	(7, 733)
vf_obj_animacy:	(8, 127)
vf_coad_pa:	(8, 734)
vg_tense:	(8, 734)
vi_cmpl_animacy:	(39, 174)
vf_argAll_pa:	(43, 703)
vi_adj+_animacy:	(46, 108)
vi_coad_animacy:	(51, 241)
vd_domain_embed:	(73, 646)
vf_cmpl_animacy:	(88, 158)
vi_allarg_animacy:	(92, 370)
vf_adju_animacy:	(96, 78)
vf_coad_animacy:	(200, 192)
vi_subj_domain:	(247, 231)
vd_par_lex:	(305, 365)
vf_argAll_animacy:	(378, 207)
vi_objc_domain:	(448, 245)
vf_obj_domain:	(584, 213)
vi_cmpl_domain:	(1033, 223)
vf_cmpl_domain:	(1128, 207)
vi_adj+_domain:	(1219, 217)
vf_adju_domain:	(1575, 180)
vi_coad_domain:	(1735, 386)
vi_subj_lex:	(1959, 290)
vi_objc_lex:	(2251, 305)
vi_allarg_domain:	(2902, 527)
vf_obj_lex:	(3012, 274)
vf_coad_domain:	(3055, 301)
vi_adj+_lex:	(3478, 295)
vf_cmpl_lex:	(4

In [6]:
print(experiments['vi_objc_animacy'].data.index.unique())
#print(experiments['vi_o_domain'].data.index.unique())

Index(['>CR_animate', 'C_animate', 'KJ_animate', 'KJ_inanimate', 'animate',
       'inanimate'],
      dtype='object')


In [7]:
#len(experiments['vf_argAll_pa'].basis2result['Adju|Adju|Adju|Adju|Cmpl'])

In [8]:
#B.prettySetup(features={'rela'})

#B.show(experiments['vf_argAll_pa'].basis2result['Adju|Adju|Adju|Adju|Cmpl'][:10], withNodes=True)

Complement domain elements are more sophisticated, since these combine domain tags with preposition lexemes:

## Frame Seal-Check

Checking to make sure that there are no missing elements within frame tests.



In [8]:
frame_exps = {'vf_argAll_pa','vf_argAll_lex', 'vf_argAll_domain', 'vf_argAll_animacy',
              'vf_obj_pa', 'vf_obj_lex', 'vf_obj_domain', 'vf_obj_animacy',
              'vf_cmpl_pa', 'vf_cmpl_lex', 'vf_cmpl_domain', 'vf_cmpl_animacy', 
              'vf_adju_pa', 'vf_adju_lex', 'vf_adju_domain', 'vf_adju_animacy'}

problems = collections.defaultdict(list)

for exp in frame_exps:
    
    samples = [(basis, match) for basis in experiments[exp].basis2result for match in experiments[exp].basis2result[basis]]

    for basis, sample in samples:
        
        clauses = sorted(n for n in sample if F.otype.v(n) == 'clause')
        target_clause = next(cl for cl in clauses if next((m for m in E.mother.f(cl)),0) not in clauses)
        
        # get expected relas based on experiment name
        exp2functs = {
                        'cmpl': {'Cmpl'},
                        'adj': {'Adju', 'PrAd', 'Time', 'Loca'},
                        'obj': {'Objc', 'PreO', 'PtcO'},
                        'arg': {'Cmpl', 'Adju', 'PrAd', 'Time', 'Loca', 'Objc', 'PreO', 'PtcO'}
                    }
        exp_key = next(key for key in exp2functs if re.search(key, exp))
        expected_functs = exp2functs[exp_key]
        
        # check to make sure all matching phrase functions are accounted for in the frame result
        for phrase in L.d(target_clause, 'phrase'):
            if F.function.v(phrase) in expected_functs and phrase not in sample and target_clause not in {514162}:
                problems[exp].append((basis, sample, f'missing {phrase} a {F.otype.v(phrase)} with {F.function.v(phrase)}'))
                
        # check all daughter relations
        for d_cl in E.mother.t(target_clause):
            if F.rela.v(d_cl) in expected_functs and d_cl not in sample:
                problems[exp].append((basis, sample, f'missing {d_cl} a {F.otype.v(d_cl)} with {F.rela.v(d_cl)}'))
                
len(problems)

0
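The logic of the seal-check can be illustrated without Text-Fabric. In the sketch below, the dictionaries `phrase_function` and `clause_phrases` and the helper `missing_from_frame` are hypothetical stand-ins for the `F.function.v` and `L.d` lookups used above, and the node numbers are made up.

```python
# Hypothetical stand-ins for Text-Fabric lookups (not real corpus nodes).
phrase_function = {101: "Pred", 102: "Objc", 103: "Cmpl", 104: "Time"}  # phrase -> function
clause_phrases = {1: [101, 102, 103, 104]}                              # clause -> phrases


def missing_from_frame(clause, sample, expected_functs):
    """Return phrases of `clause` whose function is expected but which are absent from `sample`."""
    return [
        phrase
        for phrase in clause_phrases[clause]
        if phrase_function[phrase] in expected_functs and phrase not in sample
    ]


expected = {"Objc", "Cmpl"}

# A sealed frame: every Objc/Cmpl phrase of the clause is part of the result.
sealed = missing_from_frame(1, {1, 101, 102, 103, 104}, expected)

# A leaky frame: the Cmpl phrase (103) was not captured by the result.
leaky = missing_from_frame(1, {1, 101, 102}, expected)

print(sealed, leaky)
```

An empty list for every sample corresponds to the `0` problems reported by the check above.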

## Filter Effects on Lexical Distribution

I have made a number of delimitations on how data is selected, the full scope of which can be seen in parameters.py. One important example of these delimitations is on the object frames, which exclude target clauses with relative particles. This is because a relative particle in the ETCBC primarily serves in its role as a connector to the mother clause, but the database does not specify what role the relative particle plays within its immediately enclosing clause. Often these particles serve as objects of the verb. But because the ETCBC does not disambiguate its clause-internal role, these cases must be excluded.

How do these kinds of exclusions affect verb lexeme distributions? Are there verbs whose distribution becomes significantly under-represented due to the selection restrictions? This might be the case, for example, if a verb lexeme is interconnected with an excluded construction.

In this test, I iterate through the presence/absence experiments. I make comparisons between a verb's overall occurrence ratio in the raw template search versus the experiments.
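The comparison boils down to turning counts into proportions and subtracting the baseline proportion from the experiment proportion. The following is a minimal sketch with invented counts and a hypothetical `ratios` helper, using only the standard library. One assumption differs from the pandas version used in the cells below: a lexeme that is missing from the experiment is treated here as a proportion of 0, whereas `Series.subtract` without a `fill_value` yields `NaN` for such entries.

```python
from collections import Counter


def ratios(counts):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {lex: n / total for lex, n in counts.items()}


# Invented counts: lexeme "C" is removed entirely by the experiment's filters.
base = Counter({"A": 50, "B": 30, "C": 20})
sample = Counter({"A": 30, "B": 10})

base_r, sample_r = ratios(base), ratios(sample)

# Positive values: over-represented in the experiment; negative: under-represented.
diff = {lex: sample_r.get(lex, 0.0) - base_r[lex] for lex in base_r}

for lex, d in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lex}\t{d:+.3f}")
```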

In [9]:
# standard target clause requirements
pred_target = '''

c1:clause
    p1:phrase

    /with/
    clause typ#Ptcp
        p:phrase function={pred_funct}
            -heads> word pdp=verb language=Hebrew
        p = p1
    /or/
    clause typ=Ptcp
        p:phrase function={ptcp_funct}
            -heads> word pdp=verb language=Hebrew
        p = p1
    /-/

        target:word pdp=verb
    
{basis}

lex freq_lex>9
   lexword:word 
   lexword = target
'''

all_preds = 'Pred|PreO|PreS|PtcO' # all predicate phrase functions
all_ptcp = 'PreC|PtcO'

baseline = B.search(pred_target.format(basis='', pred_funct=all_preds, ptcp_funct=all_ptcp))
base_lexs = collections.Counter(f'{F.lex.v(r[2])}.{F.vs.v(r[2])}' for r in baseline)
base_lexs = pd.Series(base_lexs)
base_lex_ratio = base_lexs / base_lexs.sum()

print(f'number of base lexemes (+stems): {len(base_lexs)}')

base_lexs.sort_values(ascending=False).head()

65323 results
number of base lexemes (+stems): 1749


>MR[.qal    5273
HJH[.qal    3533
<FH[.qal    2446
BW>[.qal    1969
NTN[.qal    1910
dtype: int64

In [10]:
for exp in experiments:
    
    # skip non frame/presence absence experiments
    if not re.search('vf.*_pa', exp):
        continue

    # get lex sums/ratios for experiment
    lex_sums = experiments[exp].data.sum()
    lex_ratio = lex_sums / lex_sums.sum()
    
    # make comparisons between base and experiment
    base_dif = lex_ratio.subtract(base_lex_ratio)
    
    #base_lex_ratio.combine(lex_ratio, lambda s1, s2: s1-s2) # absolute differences
    
    # print biggest differences
    print(f'{exp} differences from base:')
    print('\tPLUS:')
    print(base_dif.sort_values(ascending=False).head(20))
    print('\tMINUS:')
    print(base_dif.sort_values().head(20))
    print('\n', '-'*30, '\n')

vf_argAll_pa differences from base:
	PLUS:
>MR[.qal    0.008794
HJH[.qal    0.004967
LQX[.qal    0.001538
R>H[.qal    0.001506
CM<[.qal    0.001384
BW>[.qal    0.001372
HLK[.qal    0.001281
CWB[.qal    0.001245
QR>[.qal    0.000910
>KL[.qal    0.000853
QWM[.qal    0.000852
JD<[.qal    0.000718
FJM[.qal    0.000710
NF>[.qal    0.000678
MWT[.qal    0.000656
BW>[.hif    0.000611
CWB[.hif    0.000572
<NH[.qal    0.000566
JR>[.qal    0.000522
<LH[.qal    0.000513
dtype: float64
	MINUS:
YWH[.piel   -0.002399
<FH[.qal    -0.001568
DBR[.piel   -0.000813
BXR[.qal    -0.000704
MY>[.nif    -0.000685
CB<[.nif    -0.000663
JTR[.nif    -0.000478
C>R[.nif    -0.000428
NG<[.qal    -0.000401
XV>[.qal    -0.000373
<FH[.nif    -0.000327
KTB[.qal    -0.000309
GWR[.qal    -0.000294
DBR[.qal    -0.000284
XV>[.hif    -0.000282
QR>[.nif    -0.000248
JCB[.qal    -0.000237
XPY[.qal    -0.000182
GLH[.hif    -0.000179
BNH[.qal    -0.000164
dtype: float64

 ------------------------------ 

vf_obj_pa differences fr

One of the effects that can be seen is that stems other than qal tend to receive a slightly smaller representation in the experiment samples. In the object presence/absence experiment, one case of potential interest is the selection's negative effect on niphal representation. Let's see why that could be the case...

In [11]:
nifal_find = pred_target.format(basis='''

w3:word lex=MY>[ vs=nif

w3 = target

''', pred_funct=all_preds, ptcp_funct=all_ptcp)

nifal_find = B.search(nifal_find)

B.show(nifal_find[5:15])

134 results



##### Passage 1



##### Passage 2



##### Passage 3



##### Passage 4



##### Passage 5



##### Passage 6



##### Passage 7



##### Passage 8



##### Passage 9



##### Passage 10



##### Passage 11


The basic survey above gives a tip. The exclusion of relative particles may negatively affect niphal representation due to constructions such as אשר יִמָצֵא. Let's see if the niphal accounts for a higher proportion of these constructions than qal.

In [12]:
without_rela = pred_target.format(basis='''
c2:clause
/without/
    phrase function=Rela
/-/
c1 = c2
''', pred_funct=all_preds, ptcp_funct=all_ptcp)

with_rela = pred_target.format(basis='''
    phrase function=Rela
''', pred_funct=all_preds, ptcp_funct=all_ptcp)

def rela_vs_noRela(relaPat, noRelaPat):
    
    vs_count = collections.defaultdict(lambda: collections.Counter())
    
    for r in B.search(noRelaPat):
        vs_count['øRela'][F.vs.v(r[2])] += 1
    for r in B.search(relaPat):
        vs_count['Rela'][F.vs.v(r[2])] += 1

    rela_count = pd.Series(vs_count['Rela'])
    no_rela_count = pd.Series(vs_count['øRela'])

    rela_prop = rela_count / rela_count.sum()
    no_rela_prop = no_rela_count / no_rela_count.sum()

    print('\nrela ratios:')
    print(rela_prop.sort_values(ascending=False))

    print('\nø rela ratios:')
    print(no_rela_prop.sort_values(ascending=False))
    
print('Relative clause verb stem proportional representations:\n')
rela_vs_noRela(with_rela, without_rela)

Relative clause verb stem proportional representations:

60636 results
4687 results

rela ratios:
qal     0.652230
piel    0.121613
hif     0.109878
nif     0.087903
hit     0.011521
hof     0.008748
pual    0.007041
hsht    0.000853
pasq    0.000213
dtype: float64

ø rela ratios:
qal     0.700178
hif     0.134607
piel    0.089204
nif     0.051257
hit     0.012072
hof     0.004898
pual    0.004717
hsht    0.002721
hotp    0.000132
tif     0.000066
nit     0.000049
pasq    0.000033
etpa    0.000033
poal    0.000016
htpo    0.000016
dtype: float64
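The two tables can be compared directly by subtracting the proportions stem by stem. The sketch below copies the printed values for the six stems with the largest shares and reports the difference in percentage points; the dictionary names are illustrative and not part of the notebook's code.

```python
# Proportions copied from the two printed tables above (six most frequent stems).
rela = {"qal": 0.652230, "piel": 0.121613, "hif": 0.109878,
        "nif": 0.087903, "hof": 0.008748, "pual": 0.007041}
no_rela = {"qal": 0.700178, "piel": 0.089204, "hif": 0.134607,
           "nif": 0.051257, "hof": 0.004898, "pual": 0.004717}

# Difference in percentage points; positive means the stem has a larger
# share in clauses with a relative particle than in clauses without one.
point_diff = {stem: 100 * (rela[stem] - no_rela[stem]) for stem in rela}

for stem, diff in sorted(point_diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stem}\t{diff:+.2f}")
```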


We see a marginal increase in the proportions of passive-type verb stems in relative clauses: nif +3.7, hof +0.4, and pual +0.2 percentage points. The Piel also rises, by about 3.2 points, which is in itself interesting. These are minor increases, but so are the differences between the base and experiment distributions. It is at least valid to say that an exclusion of relative particles will slightly decrease the representation of nifal and increase that of the qal (which accounts for 70% of verbs in øRela clauses versus 65% in Rela clauses). This simple search also did not take into account lexical collocation preferences for certain constructions. How does the root מצא in the nifal compare in its use of the relative?

In [13]:
nifal_rela = pred_target.format(basis='''
    
    w3:word lex=MY>[

    p2:phrase function=Rela

w3 = target
p2 < p1
''', pred_funct=all_preds, ptcp_funct=all_ptcp)

nifal_no_rela = pred_target.format(basis='''

c2:clause
/without/
    phrase function=Rela
/-/
    
    w3:word lex=MY>[

c2 = c1
w3 = target
''', pred_funct=all_preds, ptcp_funct=all_ptcp)

print('Clauses with מצא; verb stem representations with and without relative particles:\n')
rela_vs_noRela(nifal_rela, nifal_no_rela)

Clauses with מצא; verb stem representations with and without relative particles:

364 results
79 results

rela ratios:
nif    0.683544
qal    0.316456
dtype: float64

ø rela ratios:
qal    0.760989
nif    0.219780
hif    0.019231
dtype: float64


Here we get the confirmation. Among clauses with מצא, the nifal accounts for 68% of verb forms alongside the relative particle but only 22% without it, a difference of 46 percentage points (!). This explains the decrease in this verb's overall representation in the experiment sample.

## Checking for Doubled Results

This loop makes sure that each result is only counted once per experiment.

In [14]:
problems = collections.defaultdict(lambda: collections.Counter())

for exp in experiments:
    
    # skip contextual searches
    if exp in ['vd_con_window', 'vd_con_clause', 'vd_con_chain']:
        continue
    
    samples = [(basis, match) for basis in experiments[exp].basis2result for match in experiments[exp].basis2result[basis]]
    covered = set()
    for basis, sample in samples:
        if tuple(sample) in covered:
            problems[exp][tuple(sample)] += 1
        else:
            covered.add(tuple(sample))
            
len(problems)

0

## Evaluation Process

Evaluating datasets for accuracy is extremely important. This is especially the case since I am using two custom datasets in my research: heads and semantic domains. The first dataset is designed by me, but I am aware of some edge cases that are not always selected properly. I have attempted to exclude any of these edges in my experiment parameters. The semantic domains dataset has been converted from UBS's Semantic Dictionary of Biblical Hebrew via XML representation (courtesy of Renier de Blois). There are two versions: "domain" and "domain2." "domain" is quite experimental, as I attempt to map the SDBH categories to three custom categories: animate, inanimate, and events (an SDBH category). This mapping does not always work well, due to contextual features or lack of available data (the SDBH is not complete). It is also important to check data from this source carefully to ensure that the converted data accurately reflects its source.

How to strategically evaluate the datasets? 

For one thing, the total number of observations in the datasets is very high:

In [15]:
for total, exp in sorted((experiments[exp].data.sum().sum(), exp) for exp in experiments):
    print(f'{exp}:\t{total}')

vd_par_lex:	723.0
vf_adju_animacy:	2491.0
vi_adj+_animacy:	3712.0
vf_obj_animacy:	6576.0
vf_cmpl_animacy:	7992.0
vf_adju_domain:	8089.0
vi_objc_animacy:	8771.0
vf_coad_animacy:	9415.0
vi_cmpl_animacy:	9894.0
vi_adj+_domain:	10384.0
vf_argAll_animacy:	10532.0
vf_cmpl_domain:	10695.0
vi_subj_animacy:	10941.0
vf_obj_domain:	11961.0
vf_adju_lex:	12543.0
vi_cmpl_domain:	12962.0
vi_subj_domain:	13761.0
vi_coad_animacy:	14528.0
vi_objc_domain:	14883.0
vi_adj+_lex:	14947.0
vf_cmpl_lex:	15587.0
vf_obj_lex:	16654.0
vi_subj_lex:	16719.0
vf_coad_domain:	16934.0
vi_cmpl_lex:	18108.0
vi_objc_lex:	19759.0
vf_argAll_domain:	20418.0
vi_allarg_animacy:	24591.0
vi_coad_domain:	25137.0
vf_coad_lex:	26376.0
vf_argAll_lex:	32077.0
vi_coad_lex:	35176.0
vi_allarg_domain:	41752.0
vd_domain_embed:	49864.0
vf_obj_pa:	54983.0
vf_argAll_pa:	56815.0
vi_allarg_lex:	57126.0
vi_objc_pa:	57928.0
vf_cmpl_pa:	59667.0
vd_domain_simple:	59732.0
vf_coad_pa:	61748.0
vf_adju_pa:	61860.0
vg_tense:	62037.0
vi_adj+_pa:	62037.0
v

One possibility is to make a small script that will guide me through a manual review process on N-random samples. The sample size could realistically only be 50-100 per relevant experiment. Though this is a relatively small number compared to the total number of observations, it would at least help me find any glaring mistakes that are reproduced frequently. Some experiments probably do not need rigorous review such as the discourse spaces (vd_con_window, vd_con_clause) as these spaces are more or less straightforward.

I will build a script below. It does the following:

1. Takes in a .dill file (.pickle with lambda support) of samples that have already been randomized, each keyed by the experiment name. 
2. As a sample is reviewed, it is placed in a reviewed set. There is a simple option to rank the sample as "good" or with a note for questionable instances.
3. When a sample is reviewed, it is presented in context with B.show
4. The reviewer should be able to save and quit so that the review can be resumed later.
5. Once all samples are exhausted, the reviewer is finished.

In [4]:
experiments.keys()

dict_keys(['vi_subj_lex', 'vi_subj_domain', 'vi_subj_animacy', 'vi_objc_pa', 'vi_objc_lex', 'vi_objc_domain', 'vi_objc_animacy', 'vi_cmpl_pa', 'vi_cmpl_lex', 'vi_cmpl_domain', 'vi_cmpl_animacy', 'vi_adj+_pa', 'vi_adj+_lex', 'vi_adj+_domain', 'vi_adj+_animacy', 'vi_coad_pa', 'vi_coad_lex', 'vi_coad_domain', 'vi_coad_animacy', 'vi_allarg_pa', 'vi_allarg_lex', 'vi_allarg_domain', 'vi_allarg_animacy', 'vd_par_lex', 'vd_con_window', 'vd_con_clause', 'vd_con_chain', 'vd_domain_simple', 'vd_domain_embed', 'vg_tense', 'vf_argAll_pa', 'vf_argAll_lex', 'vf_argAll_domain', 'vf_argAll_animacy', 'vf_obj_pa', 'vf_obj_lex', 'vf_obj_domain', 'vf_obj_animacy', 'vf_cmpl_pa', 'vf_cmpl_lex', 'vf_cmpl_domain', 'vf_cmpl_animacy', 'vf_adju_pa', 'vf_adju_lex', 'vf_adju_domain', 'vf_adju_animacy', 'vf_coad_pa', 'vf_coad_lex', 'vf_coad_domain', 'vf_coad_animacy'])

In [17]:
# review_data['to_review'][experiment_name][basis] = list(Nsamples)
# review_data['reviewed'][experiment_name][basis][sample] = note

# build randomized samples

review_file = '/Users/cody/github/verb_semantics/project_code/datareview/viArgAllReview.dill'

review_data = {'to_review': collections.defaultdict(lambda: collections.defaultdict(list)),
               'reviewed': collections.defaultdict(lambda: collections.defaultdict(lambda: collections.defaultdict(dict)))}

to_review = {'vi_allarg_pa', 'vi_allarg_lex', 'vi_allarg_domain', 'vi_allarg_animacy'}

for exp_name, experiment in experiments.items():
    
    if exp_name not in to_review:
        continue
    
    sample_size = 50
    exp_bases = list(experiment.basis2result.keys())
    picked_samples = list()
    
    # assemble randomly picked samples
    while len(picked_samples) < sample_size:
        
        basis = random.choice(exp_bases)
        result = random.choice(experiment.basis2result[basis])
        sample = (basis, result)
    
        # ensure no repeat selections
        while sample in picked_samples:
            basis = random.choice(exp_bases)
            result = random.choice(experiment.basis2result[basis])
            sample = (basis, result)
    
        picked_samples.append(sample) # pick it
        
    for basis, result in sorted(picked_samples):        
        review_data['to_review'][exp_name][basis].append(result)
        
with open(review_file, 'wb') as outfile:
    dill.dump(review_data, outfile)
    
print('review file exported...')

review file exported...
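The sampling loop above keeps drawing until it has 50 distinct samples, which would never terminate for an experiment with fewer than 50 distinct (basis, result) pairs. A possible alternative is sketched below; the helper `pick_samples` and the toy `basis2result` dictionary are hypothetical. Note the design difference: the loop above first picks a basis at random and then a result, which favors results from rare bases, whereas this sketch samples uniformly over all pairs.

```python
import random


def pick_samples(basis2result, sample_size, seed=None):
    """Draw up to `sample_size` distinct (basis, result) pairs without replacement."""
    rng = random.Random(seed)
    # flatten the mapping into (basis, result) pairs; tuple() makes results hashable
    pairs = [(basis, tuple(result))
             for basis, results in basis2result.items()
             for result in results]
    pairs = list(dict.fromkeys(pairs))  # drop duplicate pairs, keep first-seen order
    # never ask for more samples than exist
    return rng.sample(pairs, min(sample_size, len(pairs)))


# Toy stand-in for experiment.basis2result (node numbers are made up).
toy = {"animate": [(1, 2), (3, 4)], "inanimate": [(5, 6)]}

picked = pick_samples(toy, sample_size=50, seed=0)
print(len(picked))
```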


In [18]:
# review_data['reviewed'][experiment_name][basis][sample] = note

def review(pickle_file, backup=''):
    '''
    A simple reviewer function that reviews
    random samples of my experiment data.
    '''
    
    with open(pickle_file, 'rb') as infile:
        review_data = dill.load(infile)
    
    to_review = review_data['to_review']
    reviewed = review_data['reviewed']
    
    previous = tuple()
    new_to_review = copy.deepcopy(to_review)
    completed = [] # strings of completed reviews
    
    for i, experiment_name in enumerate(to_review):      
        
        print(f'reviewing {experiment_name}')
        time.sleep(.5)
        clear_output()
        
        start_i = len([sample for basis in reviewed[experiment_name]
                          for sample in reviewed[experiment_name][basis]])
        basis_i = 1 + start_i
        len_bases = len([sample for basis in to_review[experiment_name]
                            for sample in to_review[experiment_name][basis]]) + start_i
        
        for basis_name, samples in to_review[experiment_name].items():
            for sample in samples:
                
                name = f'{experiment_name}/{basis_name}/{basis_i}of{len_bases}'
                
                print('1 for good; 2 for notes; 3 for get last; q for quit\n')
                B.prettyTuple(sample, withNodes=True, seqNumber=name)
                print(f'Exp. {i+1}/{len(to_review)}\t{name}\n')

                while True:                    
                    choice = input()
                    
                    if choice == '1':
                        reviewed[experiment_name][basis_name][name] = {'review': 'good', 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                        
                    elif choice == '2':
                        note = input('input note:')
                        reviewed[experiment_name][basis_name][name] = {'review': note, 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                        
                    elif choice == '3':
                        print('\n', previous)
                        
                    elif choice == 'q':
                        print('quitting...')
                        print(f'\ncompleted:{completed}')
                        with open(pickle_file, 'wb') as outfile:
                            save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                            dill.dump(save_data, outfile)
                        with open(f'/Users/cody/Documents/{backup}.dill', 'wb') as outfile:
                            dill.dump(save_data, outfile)
                        return(f'data successfully saved...')
                        
                    else:
                        print('input invalid...')
                    
                # save constantly
                with open(pickle_file, 'wb') as outfile:
                    save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                    dill.dump(save_data, outfile)
                with open(f'/Users/cody/Documents/{backup}.dill', 'wb') as outfile:
                    dill.dump(save_data, outfile)
                previous = (name, sample)
                basis_i += 1
                clear_output()
                
        completed.append(experiment_name)
                
    print('**REVIEW COMPLETE**')

### Review

In [19]:
review_file = '/Users/cody/github/verb_semantics/project_code/datareview/viArgAllReview.dill'
backup = 'viArgAllReviewBackup'

In [20]:
B.prettySetup(features={'sem_domain', 'sem_domain_code'})

In [24]:
review(review_file, backup=backup)

**REVIEW COMPLETE**


## Reviewing Problems

In [62]:
review_file1 = '/Users/cody/github/verb_semantics/project_code/datareview/dataReview2.dill'
review_file2 = '/Users/cody/github/verb_semantics/project_code/datareview/dataReview3.dill'
review_file3 = '/Users/cody/github/verb_semantics/project_code/datareview/dataReview4.dill'

completed = dict()

for file in (review_file1, review_file2, review_file3):
    
    with open(file, 'rb') as infile:

        completed.update(dill.load(infile)['reviewed'])

In [73]:
problems = {}

for experiment, bases in completed.items():
    for basis, tags in bases.items():
        for tag, tagdata in tags.items():
            status = tagdata['review']
            if status != 'good':
                problems[tag] = tagdata

len(problems)

46

In [87]:
i = 0 
for tag, data in problems.items():
    i+=1
    print(tag)
    print(f'\t{data["review"]}')
    print(data['result'])
    #B.prettyTuple(data['result'], withNodes=True, seqNumber=i)
    print()

vi_subj_animacy/inanimate/31of50
	bad.woman listed as "object reference"?
(433836, 670388, 30142, 670389, 30143, 1438630, 30142)

vi_subj_animacy/inanimate/35of50
	bad.exclude quantity
(441656, 694413, 74552, 694412, 74551, 1437671, 74552)

vi_subj_animacy/inanimate/40of50
	caution.chariot is inanimate, but it moves!
(462674, 758019, 184988, 758020, 184989, 1437643, 184988)

vi_subj_animacy/inanimate/41of50
	bad.quantity is animate
(464896, 764517, 195403, 764518, 195404, 1437746, 195403)

vi_subj_animacy/inanimate/42of50
	bad.house is figurative for people
(477519, 799567, 251811, 799568, 251812, 1438660, 251811)

vi_subj_animacy/inanimate/44of50
	bad.town figurative for people
(489850, 834267, 308947, 834268, 308950, 1437643, 308947)

vi_subj_animacy/inanimate/45of50
	bad.FIXED
(491898, 839388, 316345, 839387, 316344, 1441676, 316345)

vi_subj_animacy/inanimate/46of50
	bad.frame is animate
(497767, 853839, 337162, 853838, 337160, 1440211, 337162)

vi_subj_animacy/inanimate/50of50
	ba

<br>
<br>
<br>
<br>
<hr>
## To Do

To-do list for data adjustments after the review.



## Done
* Consider an allArg inventory experiment. *Built it.*
* Decide on whether to exclude groups & quantifiers from the animacy experiment. Some groups are animate. Others are not. The issue is that groups are a quite common category. Quantifiers are less so, so excluding them is a simpler matter. Also, all categories in 2.\* range are open to interpretation. Most appear to be inanimate frames. But occasionally these frames are used in animate ways, especially if the word is a participle.
    * **Proposed Solution**: Currently groups are marked as animate. In several cases it appears that a group name functions as a location, but often that is unclear. Is the location a physical place or seen to represent the group itself? This is an exegetical question that really cannot be solved programmatically. For the purpose of this experiment, I have decided to keep the named groups on the assumption that the group name is the primary meaning of these terms, from which a location name is derived. There will be places in the data where the classification is wrong. I should eventually seek to clarify those situations. At this point, however, there is simply not a good solution for excluding group names.
    * **Follow-up**: I have decided to keep groups and have found a solution for at least all of the problems marked during this review: the animater checks to see whether the basis element is contained in a `Loca` (location) function phrase. If it is, and if it has a code for either "reference to person" or "reference to group," the tag "inanimate" is assigned.
* Exclude all quantifiers (sem_domain_code=2.128) from the animacy experiments. Quantifiers are ambiguous with regard to animacy.
* Solved the following tagged issue:
> vi_subj_animacy/inanimate/46of50 <br>
> bad.frame is animate<br>
> (497767, 853839, 337162, 853838, 337160, 1440211, 337162)

  The problem is that a participle functioning as a noun is marked with a frame domain, all of which I have marked as "inanimate." The problem is solved here by adding a further requirement, `sp#verb`, to all animacy noun candidates. This prevents participles from being selected. A second requirement, `sp#adjv`, is also added to prevent cases where an adjective is mislabeled as inanimate (several adjectives have a code of 2.\*). This *does* exclude ~135 words (instances) from the sample, but that is very small compared to the sample size.

* Consider building an experiment that uses mid or high code level domains from SDBH.
    * For now I have decided to table this idea. Some codes are already at their highest level. The amount of work it would take to build a reliable function that could select the next highest code may be too much for the limits of the project.
* Note on the following datapoint: `vf_obj_domain/KJ_Act|People/24of50`, marked: 
    > bad.participle head missed here due to noun phrase <br>
    > (185574, 185576, 1437609, 185578, 185581, 462775, 462776, 758362, 758364, 758365, 758366)<br>
  
  The participle עשה here has a phrase-dependent part of speech of `adjv`. The ETCBC thus interprets it not as a verbal participle but as an adjective describing the head noun מלאכה, i.e. "doer of work / work doer". Thus no action is required on this item.
  
* Add verbum dicendi req for speech objects or fix mismatches somehow. *Done with >MR and DBR based on query.*
* Fix prep_obj on suffixed prepositions followed by a conjunction. See prep_obj on 425706, which is a conjunction rather than a prepositional object. This happens because the prepositional object is a suffix. This happens 217 times in the corpus. 
    * **Applied temporary solution** by adding `conj` in the word selection parameters as an ineligible part of speech. This should have been there anyway.
    * See pattern for other examples:
> phrase typ=PP<br>
> &nbsp;&nbsp;&nbsp;&nbsp;-heads> word prs<br>
> &nbsp;&nbsp;&nbsp;&nbsp;-prep_obj> word pdp=conj<br>
* Consider the role of multiple-headed argument phrases. As it is now, all heads are accepted as part of the frame. But a double object frame is different from a single object frame with coordinated objects. On the other hand, the double object frame is handled by the presence/absence experiment. Thus, the role of the frame lexeme experiment is as a lexically restricted task. So is the mixture of true double objects with coordinated objects problematic?
    * Proposal 1: Change the basis element to the phrase itself; change heads template to /with/ quantifier. In the tokenizer retrieve all heads and construct a single string of all the coordinated elements. Alphabetize the list to disambiguate order.
    * Proposal 2: There is little difference, really, between tokenizing head by head and tokenizing by phrase. In either case, the string reflects the complexity of the plural objects. The only case where a head-by-head approach may cause problems is when two tags with matching lexemes are matched, but one is a doubled object and the other is part of a coordinate chain. Given that the lexemes are there to disambiguate, I doubt whether these cases would truly be problematic. Thus the proposal is to keep the status quo.
    * **Chosen Solution:** Proposal 3: Generate a new frame for every head for every phrase. This allows the frames to be disambiguated from the specificity of combined elements. This option would require the most coding, but would be the most robust.
        * Handle this on the processing end in the experiment class. The templates already capture all of the necessary elements. Thus the question is simply how to combine them on the back end. *[Used a recursive function to stitch together frames]*
* Exclude H article from conjunction tokenizer
* Build an experiment that combines complement and adjunct elements with no function disambiguation: "coad"; many "adjunct" elements are exactly the same as complement elements and vice versa.
    * adjunct, inventory experiment should have no function disambiguation at all
* The animate/inanimate experiment ought to only utilize nouns in the "objects" category, i.e. 1.001[0-9]\*. Event and reference nouns are too ambiguous.
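
The chosen solution for multiple-headed argument phrases in the list above (one frame for every head of every phrase, stitched together recursively) can be sketched roughly as follows. This is only an illustration and not the code in the experiment class: the function name `stitch_frames`, the `(function, heads)` input shape, the `function.lexeme` token format, and the sample lexemes are all assumptions made for the example.

```python
def stitch_frames(slots):
    """Return one frame per combination of heads, taking one head per slot.

    slots: list of (function, [head lexemes]) pairs, one pair per phrase.
    The input shape and the token format are hypothetical.
    """
    if not slots:
        return [[]]  # a single empty frame for the recursion to build on
    (function, heads), rest = slots[0], slots[1:]
    tails = stitch_frames(rest)  # every frame for the remaining phrases
    # Prefix each head of the current phrase to every tail frame
    return [[f"{function}.{head}"] + tail for head in heads for tail in tails]


# Sample clause: an object phrase with two coordinated heads plus a
# complement phrase with a single head (lexemes are illustrative only).
clause = [("Objc", ["LXM", "MJM"]), ("Cmpl", ["BJT"])]
frames = ["|".join(frame) for frame in stitch_frames(clause)]
print(frames)  # ['Objc.LXM|Cmpl.BJT', 'Objc.MJM|Cmpl.BJT']
```

The recursion amounts to a Cartesian product over the head lists, so `itertools.product` would give the same combinations. Each coordinated head ends up in its own frame, which keeps a coordinated object from being confused with a true double object.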

## Notes

### On Animacy

In the first iteration of these experiments, I attempted to map all of the Semantic Dictionary of Biblical Hebrew (SDBH) categories to one of three tags: `animate`, `inanimate`, and `event`. Events are native SDBH categories, while I mapped "objects" and semantic frames to animacy categories. Upon several inspections of the data, it is apparent that events and frames cannot be consistently mapped to animacy categories, because these lexemes are too frequently used in various contexts. For example, because the SDBH lists both participles and adjectives under "events," the animacy mapping fails on the many cases where participles and adjectives stand in for persons. Presumably the contextual senses would indicate animacy in these cases. In Isaiah 54:13 there is a good example of that: Know > Human, where the participle event "Know" is transformed into its human referent. But these categories appear to still be in development.

Animacy can be consistently mapped to object categories (codes 001\*), object references (codes 003001\*), and a handful of frames. I will collect those and modify the semantic domain code accordingly. This means that there will now need to be two collections of semantic domain data where previously there was one set of templates. Domain2 has been using the native SDBH categories, and will need the old selection procedures. I will rename domain2 to simply domain, and I will create a new category specifically for animacy. The revamped experiments will be as follows:

1) semantic domains - utilizes the native SDBH categories<br>
2) animacy - utilizes a custom set of SDBH categories mapped to `animate` or `inanimate` tags.
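
A minimal sketch of how such a custom animacy mapping might look is given here. Only the 001\* and 003001\* code families come from the notes above. The specific sub-codes, their tags, the code string format, and the names `ANIMACY_BY_PREFIX`, `PERSON_OR_GROUP_REFS`, and `animacy_tag` are placeholders invented for illustration. The `in_loca_phrase` argument mirrors the location-phrase rule described in the to-do follow-up on groups.

```python
# Hypothetical prefix-to-tag table. Only the 001* and 003001* families are
# taken from the notes; the sub-codes and their tags are placeholders.
ANIMACY_BY_PREFIX = {
    "001001": "animate",     # placeholder: an animate object sub-domain
    "001002": "inanimate",   # placeholder: an inanimate object sub-domain
    "003001001": "animate",  # placeholder: reference to person
    "003001002": "animate",  # placeholder: reference to group
}

# Placeholder prefixes standing in for the person/group reference codes
PERSON_OR_GROUP_REFS = ("003001001", "003001002")


def animacy_tag(code, in_loca_phrase=False):
    """Return 'animate', 'inanimate', or None for an unmapped code."""
    # A person or group reference inside a location phrase is read as a place
    if in_loca_phrase and code.startswith(PERSON_OR_GROUP_REFS):
        return "inanimate"
    # Try longer prefixes first so that more specific entries win
    for prefix in sorted(ANIMACY_BY_PREFIX, key=len, reverse=True):
        if code.startswith(prefix):
            return ANIMACY_BY_PREFIX[prefix]
    return None  # events and most frames are left untagged


print(animacy_tag("001001005"))                       # animate
print(animacy_tag("003001002", in_loca_phrase=True))  # inanimate
print(animacy_tag("002128"))                          # None
```

Returning `None` for unmapped codes keeps events and ambiguous frames out of the animacy experiment rather than forcing them into one of the two tags.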