# Data Validation and Discovery

The purpose of this notebook is to examine the shape and contents of the datasets generated using the parameters stored in [project_code/parameters.py](project_code/parameters.py). Specifically, I am looking for consistency in the data, and I am checking that the data to be analyzed represents what I expect it to. Selecting linguistic data inevitably involves a number of assumptions about the shape of the data. When the selected data does not match those expectations, the resulting analysis is incorrect. I obviously want to avoid that!

In this notebook, I load all of the experiments defined in the parameters module. For each dataset, I analyze the contents and distribution of features. At the same time, I look at the matched clauses behind the data wherever a case is unexpected or surprising.

### Data Preparation

In [None]:
import numpy as np
import pandas as pd
import collections, os, sys, random, time, pickle, dill, copy, re
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments2 import Experiment
from project_code.semspace import SemSpace

bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/verb_semantics/project_code/lingo/heads/tf/c',
                 '~/github/verb_semantics/project_code/sdbh']

TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex vs language
                pdp freq_lex gloss domain ls
                mother rela typ sp st code txt
                heads prep_obj
                prs prs_gn prs_nu prs_ps
                sem_domain sem_domain_code
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='', version='c')

cached_data = False
from project_code.parameters import * # import all definitions and functions from the parameters module

This is Text-Fabric 4.3.4
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="{provenance of this corpus}">BHSA</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="{CORPUS} feature documentation">Feature docs</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/Bhsa/" title="BHSA API documentation">BHSA API</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/" title="text-fabric-api">Text-Fabric API 4.3.4</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/#search-templates" title="Search Templates Introduction and Reference">Search Reference</a>

Preparing frame validation data for vf_argAll_pa...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_allarg_lex...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_allarg_sd...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_pa...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_lex...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_sd...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_cmpl_pa...
	preparing good mother set...
	pr

There are two kinds of experiment parameters: inventories and frames. Inventories count each feature that co-occurs with a verb individually. Frames count all features within the verb's clause as a single unit, or frame. I will load both types into a single dictionary.
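To illustrate the distinction, here is a minimal sketch with made-up clause data. The `clauses` list and its counts are hypothetical and are not drawn from the corpus or from the `Experiment` class; joining sorted function labels with `|` is an assumption modeled on the frame labels that appear later in this notebook.

```python
import collections

# Hypothetical toy data: each clause has a verb and the phrase functions found with it.
clauses = [
    ('NTN[.qal', ['Objc', 'Cmpl']),
    ('NTN[.qal', ['Objc', 'Cmpl']),
    ('NTN[.qal', ['Objc']),
    ('HLK[.qal', ['Cmpl']),
]

inventory = collections.defaultdict(collections.Counter)  # verb -> feature -> count
frames = collections.defaultdict(collections.Counter)     # verb -> whole frame -> count

for verb, functions in clauses:
    # inventory: every feature is counted on its own
    for function in functions:
        inventory[verb][function] += 1
    # frame: all features in the clause form a single unit (assumed label format)
    frames[verb]['|'.join(sorted(functions))] += 1

print(dict(inventory['NTN[.qal']))
print(dict(frames['NTN[.qal']))
```

In the inventory, the two features of a clause are tallied separately; in the frame count, a clause with both an object and a complement is a different observation from a clause with an object alone.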

In [None]:
# for troubleshooting
success = {'vi_subj_lex','vi_subj_domain','vi_objc_pa','vi_objc_lex','vi_objc_domain','vi_objc_domain2','vi_cmpl_pa','vi_cmpl_lex',
         'vi_cmpl_domain','vi_cmpl_domain2','vi_adj+_pa','vi_adj+_lex','vi_adj+_domain','vi_adj+_domain2',
         'vd_par_lex','vd_con_window','vd_con_clause','vd_con_chain','vd_domain_simple','vd_domain_embed',
         'vg_tense', 'vf_argAll_pa','vf_argAll_lex', 'vf_argAll_domain', 'vf_argAll_domain2',
         'vf_obj_pa', 'vf_obj_lex', 'vf_obj_domain', 'vf_obj_domain2',
          'vf_cmpl_pa', 'vf_cmpl_lex', 'vf_cmpl_domain', 'vf_cmpl_domain2', 
          'vf_adju_pa', 'vf_adju_lex', 'vf_adju_domain', 'vf_adju_domain2'}

In [None]:
experiments = {}

for label, exp_type in params.items():    
    print(f'\nprocessing {label} experiments...')
    for name, experiment_params in exp_type.items():
        
#         if name in success:
#             continue
        
        print(f'\tbuilding {name}...')
        is_frame = label != 'inventory'  # every non-inventory experiment is a frame
        min_obs = 10 if name != 'vd_par_lex' else 1
        experiments[name] = Experiment(experiment_params, tf=tf_api, frame=is_frame, min_observation=min_obs)        
        print('\t\tfinished')
    
print('finished processing experiments...')
print(f'\t{len(experiments)} experiments loaded.')

In [None]:
#experiments['vf_argAll_pa'].basis2result.keys()

### Make Experiments Dump

In [5]:
# with open('/Users/cody/Documents/experiments.dill', 'wb') as outfile:
#     dill.dump(experiments, outfile)

## Analysis

### Shapes, Bases * Targets

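Here the experiments are listed with the shape of their data: the number of basis units (rows; e.g. domain, lexeme, or presence/absence (pa) tags) by the number of target verbs (columns). The output is sorted by shape, with the smallest number of bases first: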

In [6]:
for shape, exp in sorted((experiments[exp].data.shape, exp) for exp in experiments):
    print(f'{exp}:\t{shape}')

vi_objc_pa:	(2, 718)
vi_cmpl_pa:	(2, 734)
vi_subj_domain:	(3, 231)
vf_obj_pa:	(3, 695)
vd_domain_simple:	(3, 704)
vf_cmpl_pa:	(4, 725)
vi_adj+_pa:	(5, 738)
vg_tense:	(8, 734)
vi_objc_domain:	(20, 247)
vf_adju_pa:	(46, 733)
vf_obj_domain:	(66, 203)
vi_cmpl_domain:	(72, 223)
vd_domain_embed:	(73, 646)
vi_adj+_domain:	(110, 217)
vf_argAll_pa:	(156, 704)
vd_par_lex:	(305, 365)
vi_objc_domain2:	(448, 247)
vf_adju_domain:	(524, 172)
vf_cmpl_domain:	(534, 235)
vf_obj_domain2:	(950, 203)
vi_cmpl_domain2:	(1039, 223)
vi_adj+_domain2:	(1224, 217)
vf_argAll_domain:	(1626, 321)
vf_adju_domain2:	(1754, 172)
vi_subj_lex:	(1959, 290)
vi_objc_lex:	(2251, 305)
vf_cmpl_domain2:	(2481, 235)
vf_obj_lex:	(3439, 261)
vi_adj+_lex:	(3485, 295)
vi_cmpl_lex:	(4088, 281)
vf_cmpl_lex:	(4156, 252)
vf_adju_lex:	(4193, 246)
vd_con_window:	(4463, 790)
vf_argAll_domain2:	(5456, 321)
vd_con_clause:	(5477, 900)
vd_con_chain:	(8308, 1218)
vf_argAll_lex:	(13940, 447)


In [7]:
print(experiments['vf_argAll_lex'].data.index.unique())
#print(experiments['vi_o_domain'].data.index.unique())

Index(['Adju.<BD/|Adju.TXT/_N<R/', 'Adju.<BR[|Objc.JFR>L/',
       'Adju.<D=/|Objc.>WRJH/|Objc.ZKRJHW/',
       'Adju.<D>CR_<BR[|Adju.B_RGL/|Adju.K>CR_<FH[',
       'Adju.<D>CR_<BR[|Cmpl.B_DRK/', 'Adju.<D>CR_<BR[|Cmpl.DRK/_MLK/',
       'Adju.<D>CR_>CM[', 'Adju.<D>CR_>SP[', 'Adju.<D>CR_BW>[',
       'Adju.<D>CR_BW>[|Cmpl.L_DBR/',
       ...
       'Time.MN_QY/', 'Time.MN_QYH=/', 'Time.MN_QYT/', 'Time.MN_R>C/',
       'Time.MXR/', 'Time.QDM/', 'Time.TMJD/', 'Time.TMWL/', 'Time.VRM/',
       'Time.XDC=/'],
      dtype='object', length=13940)


In [35]:
#len(experiments['vf_argAll_pa'].basis2result['Adju|Adju|Adju|Adju|Cmpl'])

In [36]:
#B.prettySetup(features={'rela'})

#B.show(experiments['vf_argAll_pa'].basis2result['Adju|Adju|Adju|Adju|Cmpl'][:10], withNodes=True)

Complement domain elements are more sophisticated, since these combine domain tags with preposition lexemes.

## Frame Seal-Check

Checking to make sure that there are no missing elements within frame tests.
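
The logic of the check can be sketched without Text-Fabric. In this simplified illustration, `clause_phrases`, `sample`, and `find_missing` are hypothetical stand-ins for the phrases of a target clause, the nodes captured in a frame result, and the comparison between them; the node numbers are invented.

```python
# Hypothetical stand-in data: phrase node -> phrase function for one target clause.
clause_phrases = {101: 'Pred', 102: 'Objc', 103: 'Cmpl', 104: 'Time'}

# Nodes that the (hypothetical) frame result actually captured.
sample = {100, 101, 102, 103}

# Functions the frame is expected to cover, e.g. for an all-arguments frame.
expected_functs = {'Cmpl', 'Adju', 'PrAd', 'Time', 'Loca', 'Objc', 'PreO', 'PtcO'}

def find_missing(clause_phrases, sample, expected_functs):
    """Return phrases with an expected function that the sample failed to capture."""
    return [
        (phrase, function)
        for phrase, function in sorted(clause_phrases.items())
        if function in expected_functs and phrase not in sample
    ]

missing = find_missing(clause_phrases, sample, expected_functs)
print(missing)
```

A frame is "sealed" when this list is empty for every sample; any entry points to a phrase that belongs in the frame but was not counted.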



In [4]:
frame_exps = {'vf_argAll_pa','vf_argAll_lex', 'vf_argAll_domain', 'vf_argAll_domain2',
              'vf_obj_pa', 'vf_obj_lex', 'vf_obj_domain', 'vf_obj_domain2',
              'vf_cmpl_pa', 'vf_cmpl_lex', 'vf_cmpl_domain', 'vf_cmpl_domain2', 
              'vf_adju_pa', 'vf_adju_lex', 'vf_adju_domain', 'vf_adju_domain2'}

problems = collections.defaultdict(list)

for exp in frame_exps:
    
    samples = [(basis, match) for basis in experiments[exp].basis2result for match in experiments[exp].basis2result[basis]]

    for basis, sample in samples:
        
        clauses = sorted(n for n in sample if F.otype.v(n) == 'clause')
        target_clause = next(cl for cl in clauses if next((m for m in E.mother.f(cl)),0) not in clauses)
        
        # get expected relas based on experiment name
        exp2functs = {
                        'cmpl': {'Cmpl'},
                        'adj': {'Adju', 'PrAd', 'Time', 'Loca'},
                        'obj': {'Objc', 'PreO', 'PtcO'},
                        'arg': {'Cmpl', 'Adju', 'PrAd', 'Time', 'Loca', 'Objc', 'PreO', 'PtcO'}
                    }
        exp_key = next(key for key in exp2functs if re.search(key, exp))
        expected_functs = exp2functs[exp_key]
        
        # check to make sure all matching phrase functions are accounted for in the frame result
        for phrase in L.d(target_clause, 'phrase'):
            if F.function.v(phrase) in expected_functs and phrase not in sample and target_clause not in {514162}:
                problems[exp].append((basis, sample, f'missing {phrase} a {F.otype.v(phrase)} with {F.function.v(phrase)}'))
                
        # check all daughter relations
        for d_cl in E.mother.t(target_clause):
            if F.rela.v(d_cl) in expected_functs and d_cl not in sample:
                problems[exp].append((basis, sample, f'missing {d_cl} a {F.otype.v(d_cl)} with {F.rela.v(d_cl)}'))
                
len(problems)

1

In [5]:
problems.keys()

dict_keys(['vf_adju_pa'])

In [6]:
for prob, prob_list in problems.items():
    print(f'{len(prob_list)}\t{prob}')

2	vf_adju_pa


In [7]:
for prob in problems['vf_adju_pa'][:10]:
    print(prob)

('ø', (3355, 653437, 1437759, 428167), 'missing 653438 a phrase with PrAd')
('ø', (505664, 367964, 1437805, 875191), 'missing 875189 a phrase with PrAd')


In [15]:
# B.prettySetup(features={'sem_domain', 'sem_domain_code'})

# B.show([(505664, 367964, 1437805, 875191)], withNodes=True)

## Filter Effects on Lexical Distribution

I have placed a number of restrictions on how data is selected; their full scope can be seen in parameters.py. One important example concerns the object frames, which exclude target clauses with relative particles. In the ETCBC database, a relative particle primarily serves as a connector to the mother clause; the database does not specify what role the particle plays within its immediately enclosing clause. Often these particles serve as objects of the verb, but because the ETCBC does not disambiguate the clause-internal role, these cases must be excluded.

How do these kinds of exclusions affect verb lexeme distributions? Are there verbs whose distribution becomes significantly under-represented due to the selection restrictions? This might be the case, for example, if a verb lexeme is interconnected with an excluded construction.

In this test, I iterate through the main experiments. I make comparisons between a verb's overall occurrence ratio in the raw template search versus the experiments.
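
A minimal sketch of this kind of comparison, using invented counts (the numbers assigned to the three lexeme labels are hypothetical). It also shows one pandas detail worth keeping in mind: `Series.subtract` aligns on the index, so a lexeme present in only one of the two series yields `NaN` unless `fill_value` is given.

```python
import pandas as pd

# Hypothetical counts of verb lexemes in the raw search and in one experiment.
base_counts = pd.Series({'>MR[.qal': 50, 'HJH[.qal': 30, 'MY>[.nif': 20})
exp_counts = pd.Series({'>MR[.qal': 30, 'HJH[.qal': 15})

# Convert counts to proportions so the two samples are comparable.
base_ratio = base_counts / base_counts.sum()
exp_ratio = exp_counts / exp_counts.sum()

# Positive values: over-represented in the experiment; negative: under-represented.
# Without fill_value, 'MY>[.nif' would come out as NaN because it is absent from exp_ratio.
diff = exp_ratio.subtract(base_ratio, fill_value=0)

print(diff.sort_values())
```

Whether a lexeme that drops out of an experiment entirely should count as a difference of zero minus its base ratio, or be left as missing, is a design choice; the sketch takes the first option.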

In [56]:
# standard target clause requirements
pred_target = '''

c1:clause
    p1:phrase

    /with/
    clause typ#Ptcp
        p:phrase function={pred_funct}
            -heads> word pdp=verb language=Hebrew
        p = p1
    /or/
    clause typ=Ptcp
        p:phrase function={ptcp_funct}
            -heads> word pdp=verb language=Hebrew
        p = p1
    /-/

        target:word pdp=verb
    
{basis}

lex freq_lex>9
   lexword:word 
   lexword = target
'''

all_preds = 'Pred|PreO|PreS|PtcO' # all predicate phrase functions
all_ptcp = 'PreC|PtcO'

baseline = B.search(pred_target.format(basis='', pred_funct=all_preds, ptcp_funct=all_ptcp))
base_lexs = collections.Counter(f'{F.lex.v(r[2])}.{F.vs.v(r[2])}' for r in baseline)
base_lexs = pd.Series(base_lexs)
base_lex_ratio = base_lexs / base_lexs.sum()

print(f'number of base lexemes: {len(base_lexs)}')

base_lexs.sort_values(ascending=False).head()

65323 results
number of base lexemes: 1749


>MR[.qal    5273
HJH[.qal    3533
<FH[.qal    2446
BW>[.qal    1969
NTN[.qal    1910
dtype: int64

In [70]:
for exp in experiments:
    
    # skip non frame/presence absence experiments
    if not re.search('vf.*_pa', exp):
        continue

    # get lex sums/ratios for experiment
    lex_sums = experiments[exp].data.sum()
    lex_ratio = lex_sums / lex_sums.sum()
    
    # make comparisons between base and experiment
    base_dif = lex_ratio.subtract(base_lex_ratio)
    
    #base_lex_ratio.combine(lex_ratio, lambda s1, s2: s1-s2) # absolute differences
    
    # print biggest differences
    print(f'{exp} differences from base:')
    print('\tPLUS:')
    print(base_dif.sort_values(ascending=False).head(20))
    print('\tMINUS:')
    print(base_dif.sort_values().head(20))
    print('\n', '-'*30, '\n')

vf_argAll_pa differences from base:
	PLUS:
>MR[.qal    0.008794
HJH[.qal    0.004967
LQX[.qal    0.001538
R>H[.qal    0.001506
CM<[.qal    0.001384
BW>[.qal    0.001372
HLK[.qal    0.001281
CWB[.qal    0.001245
QR>[.qal    0.000910
>KL[.qal    0.000853
QWM[.qal    0.000852
JD<[.qal    0.000718
FJM[.qal    0.000710
NF>[.qal    0.000678
MWT[.qal    0.000656
BW>[.hif    0.000611
CWB[.hif    0.000572
<NH[.qal    0.000566
JR>[.qal    0.000522
<LH[.qal    0.000513
dtype: float64
	MINUS:
YWH[.piel   -0.002399
<FH[.qal    -0.001568
DBR[.piel   -0.000813
BXR[.qal    -0.000704
MY>[.nif    -0.000685
CB<[.nif    -0.000663
JTR[.nif    -0.000478
C>R[.nif    -0.000428
NG<[.qal    -0.000401
XV>[.qal    -0.000373
<FH[.nif    -0.000327
KTB[.qal    -0.000309
GWR[.qal    -0.000294
DBR[.qal    -0.000284
XV>[.hif    -0.000282
QR>[.nif    -0.000248
JCB[.qal    -0.000237
XPY[.qal    -0.000182
GLH[.hif    -0.000179
BNH[.qal    -0.000164
dtype: float64

 ------------------------------ 

vf_obj_pa differences fr

One effect that can be seen is that stems other than the qal tend to receive slightly smaller representation in the experiment samples. In the object presence/absence experiment, one case of potential interest is the selection's negative effect on nifal representations. Let's see why that could be the case...

In [74]:
nifal_find = pred_target.format(basis='''

w3:word lex=MY>[ vs=nif

w3 = target

''', pred_funct=all_preds, ptcp_funct=all_ptcp)

nifal_find = B.search(nifal_find)

B.show(nifal_find[5:15])

134 results




The basic survey above offers a hint. The exclusion of relative particles may reduce nifal representation because of constructions such as אשר יִמָצֵא. Let's see whether the nifal accounts for a higher proportion of these constructions than the qal.

In [80]:
without_rela = pred_target.format(basis='''
c2:clause
/without/
    phrase function=Rela
/-/
c1 = c2
''', pred_funct=all_preds, ptcp_funct=all_ptcp)

with_rela = pred_target.format(basis='''
    phrase function=Rela
''', pred_funct=all_preds, ptcp_funct=all_ptcp)

def rela_vs_noRela(relaPat, noRelaPat):
    
    vs_count = collections.defaultdict(collections.Counter)
    
    for r in B.search(noRelaPat):
        vs_count['øRela'][F.vs.v(r[2])] += 1
    for r in B.search(relaPat):
        vs_count['Rela'][F.vs.v(r[2])] += 1

    rela_count = pd.Series(vs_count['Rela'])
    no_rela_count = pd.Series(vs_count['øRela'])

    rela_prop = rela_count / rela_count.sum()
    no_rela_prop = no_rela_count / no_rela_count.sum()

    print('\nrela ratios:')
    print(rela_prop.sort_values(ascending=False))

    print('\nø rela ratios:')
    print(no_rela_prop.sort_values(ascending=False))
    
print('Relative clause verb stem proportional representations:\n')
rela_vs_noRela(with_rela, without_rela)

Relative clause verb stem proportional representations:

60636 results
4687 results

rela ratios:
qal     0.652230
piel    0.121613
hif     0.109878
nif     0.087903
hit     0.011521
hof     0.008748
pual    0.007041
hsht    0.000853
pasq    0.000213
dtype: float64

ø rela ratios:
qal     0.700178
hif     0.134607
piel    0.089204
nif     0.051257
hit     0.012072
hof     0.004898
pual    0.004717
hsht    0.002721
hotp    0.000132
tif     0.000066
nit     0.000049
pasq    0.000033
etpa    0.000033
poal    0.000016
htpo    0.000016
dtype: float64


We see a modest increase in the proportions of the passive-type verb stems in Rela clauses: nif +3.7 percentage points, hof +0.4, pual +0.2 (the piel also rises, by +3.2 points, which is in itself interesting). These are minor increases, but so are the differences between the base and experiment distributions. It is at least valid to say that excluding relative particles will slightly decrease the representation of the nifal and increase that of the qal (which accounts for 70% of øRela clauses versus 65% of Rela clauses). This simple search also did not take into account lexical collocation preferences for certain constructions. How does the root מצא in the nifal compare in its use of the relative?
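
The stem-by-stem differences can be recomputed directly from the two printed distributions. The snippet below copies the proportions from the output above (only the stems attested in both samples) and subtracts them; the variable names are my own, and the results are expressed in percentage points.

```python
# Proportions copied from the output above (stems attested in both samples).
rela = {'qal': 0.652230, 'piel': 0.121613, 'hif': 0.109878, 'nif': 0.087903,
        'hit': 0.011521, 'hof': 0.008748, 'pual': 0.007041}
no_rela = {'qal': 0.700178, 'hif': 0.134607, 'piel': 0.089204, 'nif': 0.051257,
           'hit': 0.012072, 'hof': 0.004898, 'pual': 0.004717}

# Difference in percentage points: Rela minus øRela.
diff_points = {stem: round((rela[stem] - no_rela[stem]) * 100, 1) for stem in rela}

# Print from the largest gain to the largest loss.
for stem, points in sorted(diff_points.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{stem}\t{points:+.1f}')
```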

In [85]:
nifal_rela = pred_target.format(basis='''
    
    w3:word lex=MY>[

    p2:phrase function=Rela

w3 = target
p2 < p1
''', pred_funct=all_preds, ptcp_funct=all_ptcp)

nifal_no_rela = pred_target.format(basis='''

c2:clause
/without/
    phrase function=Rela
/-/
    
    w3:word lex=MY>[

c2 = c1
w3 = target
''', pred_funct=all_preds, ptcp_funct=all_ptcp)

print('Clauses with מצא; verb stem representations with and without relative particles:\n')
rela_vs_noRela(nifal_rela, nifal_no_rela)

Clauses with מצא; verb stem representations with and without relative particles:

364 results
79 results

rela ratios:
nif    0.683544
qal    0.316456
dtype: float64

ø rela ratios:
qal    0.760989
nif    0.219780
hif    0.019231
dtype: float64


Here we get the confirmation. The nifal of מצא has a much higher representation alongside the relative particle: 68% of מצא clauses with a relative particle versus 22% without, a difference of 46 percentage points (!). This explains the decrease in this verb's overall representation in the experiment sample.

## Evaluation Process

Evaluating datasets for accuracy is extremely important. This is especially the case since I am using two custom datasets in my research: heads and semantic domains. The first dataset is designed by me, but I am aware of some mistakes. The latter dataset is not mine but has been converted from an XML representation. It is important to check data from this source carefully to ensure that the converted data accurately reflects its source.

How can the datasets be evaluated strategically?

For one thing, the total number of observations in the datasets is very high:

In [7]:
for total, exp in sorted((experiments[exp].data.sum().sum(), exp) for exp in experiments):
    print(f'{exp}:\t{total}')

vd_par_lex:	723.0
vf_adju_domain:	4480.0
vf_adju_domain2:	4480.0
vf_adju_lex:	7673.0
vi_adj+_domain:	10406.0
vi_adj+_domain2:	10406.0
vf_obj_domain:	10855.0
vf_obj_domain2:	10855.0
vf_cmpl_domain:	10911.0
vf_cmpl_domain2:	10911.0
vi_cmpl_domain:	13049.0
vi_cmpl_domain2:	13049.0
vi_subj_domain:	13838.0
vf_obj_lex:	14896.0
vi_adj+_lex:	14959.0
vi_objc_domain:	15013.0
vi_objc_domain2:	15013.0
vf_cmpl_lex:	15914.0
vi_subj_lex:	16720.0
vf_argAll_domain:	16764.0
vf_argAll_domain2:	16764.0
vi_cmpl_lex:	18189.0
vi_objc_lex:	19806.0
vf_argAll_lex:	26201.0
vd_domain_embed:	49866.0
vf_obj_pa:	59095.0
vf_argAll_pa:	59412.0
vd_domain_simple:	59734.0
vf_adju_pa:	62039.0
vf_cmpl_pa:	62039.0
vg_tense:	62039.0
vi_cmpl_pa:	62039.0
vi_adj+_pa:	63153.0
vi_objc_pa:	67628.0
vd_con_window:	70947.0
vd_con_clause:	114736.0
vd_con_chain:	927080.0


One possibility is to make a small script that will guide me through a manual review of N random samples. The sample size could realistically only be 50-100 per relevant experiment. Though this is a relatively small number compared to the total number of observations, it would at least help me find any glaring mistakes that are reproduced frequently. Some experiments, such as the discourse spaces (vd_con_window, vd_con_clause), probably do not need rigorous review, as these spaces are more or less straightforward.

I will build a script below. It does the following:

1. Takes in a .dill file (.pickle with lambda support) of samples that have already been randomized, each keyed by the experiment name. 
2. As a sample is reviewed, it is placed in a reviewed set. Each sample can be marked as "good" or given a note if it is questionable.
3. When a sample is reviewed, it is presented in context with B.show.
4. The reviewer should be able to save and quit so that the review can be resumed later.
5. Once all samples are exhausted, the reviewer is finished.
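
Steps 1, 2, and 4 can be sketched in isolation. This is a simplified illustration, not the reviewer used in this notebook: `draw_samples`, `mark_reviewed`, and the toy `basis2result` dictionary are hypothetical names, and the standard-library `pickle` stands in for `dill` because the sketch only stores plain dictionaries.

```python
import os
import pickle
import random
import tempfile

def draw_samples(basis2result, sample_size, seed=12321):
    """Flatten (basis, result) pairs, shuffle reproducibly, keep the first N."""
    pairs = [(basis, result) for basis, results in basis2result.items()
             for result in results]
    random.Random(seed).shuffle(pairs)  # local generator: global state is untouched
    to_review = {}
    for basis, result in pairs[:sample_size]:
        to_review.setdefault(basis, []).append(result)
    return to_review

def mark_reviewed(state, basis, result, note='good'):
    """Move one sample from 'to_review' to 'reviewed' together with its note."""
    state['to_review'][basis].remove(result)
    state['reviewed'].setdefault(basis, {})[result] = note

# Hypothetical toy results: basis label -> tuples of node numbers.
basis2result = {'Objc': [(1, 2), (3, 4), (5, 6)], 'ø': [(7, 8), (9, 10)]}
state = {'to_review': draw_samples(basis2result, sample_size=3), 'reviewed': {}}

# Review one sample, then save so the session can be resumed later.
basis = next(iter(state['to_review']))
result = state['to_review'][basis][0]
mark_reviewed(state, basis, result, note='good')

path = os.path.join(tempfile.mkdtemp(), 'review_state.pickle')
with open(path, 'wb') as outfile:
    pickle.dump(state, outfile)
with open(path, 'rb') as infile:
    resumed = pickle.load(infile)

remaining = sum(len(samples) for samples in resumed['to_review'].values())
print(remaining, 'samples left to review')
```

Fixing the seed keeps the drawn sample stable across runs, so a review that is interrupted and resumed still covers the same set of results.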

In [4]:
completed = ['vi_s_lex', 'vi_s_domain', 'vi_o_pa', 'vi_o_lex', 'vi_o_domain', 'vi_cmpl_pa', 'vi_cmpl_lex', 
             'vi_cmpl_domain', 'vi_adj+_pa', 'vi_adj+_lex', 'vi_adj+_domain', 'vd_par_lex', 'vd_con_window', 
             'vd_domain_simple', 'vd_domain_embed', 'vg_tense']

In [5]:
# # review_data['to_review'][experiment_name][basis] = list(Nsamples)
# # review_data['reviewed'][experiment_name][basis][sample] = note

# build randomized samples

# review_data = {'to_review': collections.defaultdict(lambda: collections.defaultdict(list)),
#                'reviewed': collections.defaultdict(lambda: collections.defaultdict(lambda: collections.defaultdict(dict)))
#               }

# for exp_name, experiment in experiments.items():
    
#     if exp_name in {'vd_con_chain', 'vd_con_clause'} | set(completed):
#         continue
    
#     random_bases = list((basis, result) for basis in experiment.basis2result
#                             for result in experiment.basis2result[basis])
#     random.seed(12321)
#     random.shuffle(random_bases)
    
#     sample_size = 50 if not exp_name.endswith('domain') else 75
    
#     for basis, result in random_bases[:sample_size]:        
#         review_data['to_review'][exp_name][basis].append(result)
        
# review_file = '/Users/cody/github/verb_semantics/project_code/datareview/random_samples_2.dill'
# with open(review_file, 'wb') as outfile:
#     dill.dump(review_data, outfile)

In [34]:
# fix vd_con_window
# vd_con_window has a tuple of word nodes within the results tuple
# B.show crashes since it expects only a list of single tuples
# I fix it here and export a repaired version

# with open(review_file, 'rb') as infile:
#     rev_data = dill.load(infile)
    
# rev_data_fix = copy.deepcopy(rev_data)    

# for lex, samples in rev_data['to_review']['vd_con_window'].items():
#     new_samples = list()
#     for sample in samples:
#         first_part = list(sample[:-1])
#         second_part = list(sample[-1])
#         new = tuple(first_part + second_part)
#         new_samples.append(new)
#     rev_data_fix['to_review']['vd_con_window'][lex] = new_samples

# export repaired dataset

# with open(review_file, 'wb') as outfile:
#     dill.dump(rev_data_fix, outfile)

In [6]:
# review_data['reviewed'][experiment_name][basis][sample] = note

def review(pickle_file, backup=''):
    '''
    A simple reviewer function that reviews
    random samples of my experiment data.
    '''
    
    with open(pickle_file, 'rb') as infile:
        review_data = dill.load(infile)
    
    to_review = review_data['to_review']
    reviewed = review_data['reviewed']
    
    previous = tuple()
    new_to_review = copy.deepcopy(to_review)
    completed = [] # strings of completed reviews
    
    for i, experiment_name in enumerate(to_review):      
        
        print(f'reviewing {experiment_name}')
        time.sleep(.5)
        clear_output()
        
        start_i = len([sample for basis in reviewed[experiment_name]
                          for sample in reviewed[experiment_name][basis]])
        basis_i = 1 + start_i
        len_bases = len([sample for basis in to_review[experiment_name]
                            for sample in to_review[experiment_name][basis]]) + start_i
        
        for basis_name, samples in to_review[experiment_name].items():
            for sample in samples:
                
                name = f'{experiment_name}/{basis_name}/{basis_i}of{len_bases}'
                
                print('1 for good; 2 for notes; 3 for get last; q for quit\n')
                B.show((sample,), withNodes=True)

                while True:
                    #print(F.domain.v(sample[0]))
                    choice = input(f'{i+1}/{len(to_review)}\t{name}\n')
                    
                    if choice == '1':
                        reviewed[experiment_name][basis_name][name] = {'review': 'good', 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                        
                    elif choice == '2':
                        note = input('input note:')
                        reviewed[experiment_name][basis_name][name] = {'review': note, 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                        
                    elif choice == '3':
                        print('\n', previous)
                        
                    elif choice == 'q':
                        print('quitting...')
                        print(f'\ncompleted:{completed}')
                        with open(pickle_file, 'wb') as outfile:
                            save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                            dill.dump(save_data, outfile)
                        if backup:  # only write a backup copy when a name is given
                            with open(f'/Users/cody/Documents/{backup}.dill', 'wb') as outfile:
                                dill.dump(save_data, outfile)
                        return 'data successfully saved...'
                        
                    else:
                        print('input invalid...')
                    
                # save constantly
                with open(pickle_file, 'wb') as outfile:
                    save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                    dill.dump(save_data, outfile)
                with open('/Users/cody/Documents/BACKUP_data_review.dill', 'wb') as outfile:
                    dill.dump(save_data, outfile)
                previous = (name, sample)
                basis_i += 1
                clear_output()
                
        completed.append(experiment_name)
                
    print('**REVIEW COMPLETE**')

### Review

In [13]:
review_file = '/Users/cody/github/verb_semantics/project_code/datareview/random_samples.dill'

<hr>

### TO-FIX:

These are instances that I mistakenly marked:

> should be 'good'. ('vi_o_pa/object/34of50', (429031, 656123, 8140, 429032, 1437759, 8140))