# Data Validation and Discovery

The purpose of this notebook is to examine the shape and contents of the datasets generated using the parameters stored in [project_code/parameters.py](project_code/parameters.py). Specifically, I am looking for consistency in the data, and I am checking that the data to be analyzed indeed represents what I expect it to. Selecting linguistic data inevitably involves a number of assumptions about the shape of that data. When the selected data does not match those expectations, the resulting analysis is incorrect. I obviously want to avoid that!

In the notebook, I load all of the experiments defined in the parameters module. For each dataset, I analyze the contents and distribution of features. At the same time, I look at the matched clauses behind the data wherever a case is unexpected or surprising.

### Data Preparation

In [1]:
import numpy as np
import pandas as pd
import collections, os, sys, random, time, pickle, dill, copy, re
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments2 import Experiment
from project_code.semspace import SemSpace

bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/verb_semantics/project_code/lingo/heads/tf/c',
                 '~/github/verb_semantics/project_code/sdbh']

TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex vs language
                pdp freq_lex gloss domain ls
                mother rela typ sp st code txt
                heads prep_obj
                prs prs_gn prs_nu prs_ps
                sem_domain sem_domain_code
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='', version='c')

cached_data = False
from project_code.parameters import * # import all definitions and functions from the parameters module

This is Text-Fabric 4.3.4
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="{provenance of this corpus}">BHSA</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="{CORPUS} feature documentation">Feature docs</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/Bhsa/" title="BHSA API documentation">BHSA API</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/" title="text-fabric-api">Text-Fabric API 4.3.4</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/#search-templates" title="Search Templates Introduction and Reference">Search Reference</a>

Preparing frame validation data for vf_argAll_pa...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_allarg_lex...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_allarg_sd...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_pa...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_lex...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_obj_sd...
	preparing good mother set...
	preparing good daughter set...
	√ Frame validation data prep complete.
Preparing frame validation data for vf_cmpl_pa...
	preparing good mother set...
	pr

There are two kinds of experiment parameters: inventories and frames. Inventories count each feature that co-occurs with a verb individually. Frames count all features within the verb's clause as a single unit, or frame. I will load both types into a single dictionary.
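
To make the distinction concrete, here is a minimal sketch on toy data. The clauses, lexemes, and the helper names `count_inventory` and `count_frames` are hypothetical illustrations, not the code in `project_code.experiments2`; only the `Function.lexeme|Function.lexeme` label style is borrowed from the frame labels that appear in this notebook's output.

```python
from collections import Counter

# Hypothetical toy clauses: a verb paired with the (function, lexeme) features of its clause.
toy_clauses = [
    ('give', [('Objc', 'bread'), ('Cmpl', 'to_man')]),
    ('give', [('Objc', 'water')]),
    ('go',   [('Cmpl', 'to_city')]),
]

def count_inventory(clauses):
    """Inventory: count every co-occurring feature with its verb individually."""
    counts = Counter()
    for verb, features in clauses:
        for function, lexeme in features:
            counts[(f'{function}.{lexeme}', verb)] += 1
    return counts

def count_frames(clauses):
    """Frame: count all features of the verb's clause as one joined unit."""
    counts = Counter()
    for verb, features in clauses:
        frame = '|'.join(sorted(f'{function}.{lexeme}' for function, lexeme in features))
        counts[(frame, verb)] += 1
    return counts

inventory = count_inventory(toy_clauses)
frames = count_frames(toy_clauses)
```

In this sketch the first clause contributes two separate inventory observations (`Objc.bread` and `Cmpl.to_man`) but only one frame observation (`Cmpl.to_man|Objc.bread`).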

In [2]:
# for troubleshooting
success = {'vi_subj_lex','vi_subj_domain','vi_objc_pa','vi_objc_lex','vi_objc_domain','vi_objc_domain2','vi_cmpl_pa','vi_cmpl_lex',
         'vi_cmpl_domain','vi_cmpl_domain2','vi_adj+_pa','vi_adj+_lex','vi_adj+_domain','vi_adj+_domain2',
         'vd_par_lex','vd_con_window','vd_con_clause','vd_con_chain','vd_domain_simple','vd_domain_embed',
         'vg_tense', 'vf_argAll_pa','vf_argAll_lex', 'vf_argAll_domain', 'vf_argAll_domain2',
         'vf_obj_pa', 'vf_obj_lex', 'vf_obj_domain', 'vf_obj_domain2',
          'vf_cmpl_pa', 'vf_cmpl_lex', 'vf_cmpl_domain', 'vf_cmpl_domain2', 
          'vf_adju_pa', 'vf_adju_lex', 'vf_adju_domain', 'vf_adju_domain2'}

In [3]:
experiments = {}

for label, exp_type in params.items():    
    print(f'\nprocessing {label} experiments...')
    for name, experiment_params in exp_type.items():
        
#         if name in success:
#             continue
        
        print(f'\tbuilding {name}...')
        is_frame = label != 'inventory'
        min_obs = 10 if name != 'vd_par_lex' else 1
        experiments[name] = Experiment(experiment_params, tf=tf_api, frame=is_frame, min_observation=min_obs)        
        print('\t\tfinished')
    
print('finished processing experiments...')
print(f'\t{len(experiments)} experiments loaded.')


processing inventory experiments...
	building vi_subj_lex...
		finished
	building vi_subj_domain...
		finished
	building vi_objc_pa...
		finished
	building vi_objc_lex...
		finished
	building vi_objc_domain...
		finished
	building vi_objc_domain2...
		finished
	building vi_cmpl_pa...
		finished
	building vi_cmpl_lex...
		finished
	building vi_cmpl_domain...
		finished
	building vi_cmpl_domain2...
		finished
	building vi_adj+_pa...
		finished
	building vi_adj+_lex...
		finished
	building vi_adj+_domain...
		finished
	building vi_adj+_domain2...
		finished
	building vd_par_lex...
		finished
	building vd_con_window...
		finished
	building vd_con_clause...
		finished
	building vd_con_chain...
		finished
	building vd_domain_simple...
		finished
	building vd_domain_embed...
		finished
	building vg_tense...
		finished

processing frame experiments...
	building vf_argAll_pa...
		finished
	building vf_argAll_lex...
		finished
	building vf_argAll_domain...
		finished
	building vf_argAll_domain2

In [11]:
#experiments['vf_argAll_pa'].basis2result.keys()

### Make Experiments Dump

In [8]:
with open('/Users/cody/Documents/experiments.dill', 'wb') as outfile:
    dill.dump(experiments, outfile)

KeyboardInterrupt: 

## Analysis

### Shapes, Bases * Targets

Here the experiments are sorted by the shape of their data, i.e. (number of bases, number of targets). The basis units differ per test: domain, lexeme, or presence/absence (pa). We sort by shape below:

In [4]:
for shape, exp in sorted((experiments[exp].data.shape, exp) for exp in experiments):
    print(f'{exp}:\t{shape}')

vi_objc_pa:	(2, 718)
vi_cmpl_pa:	(2, 734)
vi_subj_domain:	(3, 231)
vf_obj_pa:	(3, 697)
vd_domain_simple:	(3, 704)
vf_cmpl_pa:	(4, 734)
vi_adj+_pa:	(5, 738)
vg_tense:	(8, 734)
vi_objc_domain:	(20, 247)
vf_adju_pa:	(46, 734)
vf_obj_domain:	(67, 213)
vi_cmpl_domain:	(72, 223)
vd_domain_embed:	(73, 646)
vi_adj+_domain:	(110, 217)
vf_argAll_pa:	(156, 706)
vd_par_lex:	(305, 365)
vi_objc_domain2:	(448, 247)
vf_adju_domain:	(527, 175)
vf_cmpl_domain:	(534, 235)
vf_obj_domain2:	(972, 213)
vi_cmpl_domain2:	(1039, 223)
vi_adj+_domain2:	(1224, 217)
vf_argAll_domain:	(1626, 321)
vf_adju_domain2:	(1770, 175)
vi_subj_lex:	(1959, 290)
vi_objc_lex:	(2251, 305)
vf_cmpl_domain2:	(2481, 235)
vi_adj+_lex:	(3485, 295)
vf_obj_lex:	(3587, 269)
vi_cmpl_lex:	(4088, 281)
vf_adju_lex:	(4227, 248)
vf_cmpl_lex:	(4423, 263)
vd_con_window:	(4463, 790)
vf_argAll_domain2:	(5456, 321)
vd_con_clause:	(5477, 900)
vd_con_chain:	(8308, 1218)
vf_argAll_lex:	(13941, 447)
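
The ordering above follows from how Python compares tuples. A minimal sketch, using four shapes copied from the output above and reading each shape as (bases, targets), which is my reading of the section heading; the `cells` dictionary is a hypothetical extra that shows how large each matrix is.

```python
# A few (bases, targets) shapes copied from the output above.
shapes = {
    'vf_argAll_lex': (13941, 447),
    'vi_subj_domain': (3, 231),
    'vi_cmpl_pa': (2, 734),
    'vi_objc_pa': (2, 718),
}

# Tuples compare element by element: the basis count decides first,
# and the target count only breaks ties between equal basis counts.
ordered = sorted((shape, name) for name, shape in shapes.items())

# Number of cells in each matrix (bases * targets).
cells = {name: rows * cols for name, (rows, cols) in shapes.items()}

for shape, name in ordered:
    print(f'{name}:\t{shape}\t{cells[name]} cells')
```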


In [5]:
print(experiments['vf_argAll_lex'].data.index.unique())
#print(experiments['vi_o_domain'].data.index.unique())

Index(['Adju.<BD/|Adju.TXT/_N<R/', 'Adju.<BR[|Objc.JFR>L/',
       'Adju.<D=/|Objc.>WRJH/|Objc.ZKRJHW/',
       'Adju.<D>CR_<BR[|Adju.B_RGL/|Adju.K>CR_<FH[',
       'Adju.<D>CR_<BR[|Cmpl.B_DRK/', 'Adju.<D>CR_<BR[|Cmpl.DRK/_MLK/',
       'Adju.<D>CR_>CM[', 'Adju.<D>CR_>SP[', 'Adju.<D>CR_BW>[',
       'Adju.<D>CR_BW>[|Cmpl.L_DBR/',
       ...
       'Time.MN_QY/', 'Time.MN_QYH=/', 'Time.MN_QYT/', 'Time.MN_R>C/',
       'Time.MXR/', 'Time.QDM/', 'Time.TMJD/', 'Time.TMWL/', 'Time.VRM/',
       'Time.XDC=/'],
      dtype='object', length=13941)


In [35]:
#len(experiments['vf_argAll_pa'].basis2result['Adju|Adju|Adju|Adju|Cmpl'])

In [36]:
#B.prettySetup(features={'rela'})

#B.show(experiments['vf_argAll_pa'].basis2result['Adju|Adju|Adju|Adju|Cmpl'][:10], withNodes=True)

Complement domain elements are more sophisticated, since they combine domain tags with preposition lexemes.

## Frame Seal-Check

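This check makes sure that no elements are missing from the frame results: every phrase or daughter clause in the target clause that carries an expected function must also appear in the matched sample.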



In [6]:
frame_exps = {'vf_argAll_pa','vf_argAll_lex', 'vf_argAll_domain', 'vf_argAll_domain2',
              'vf_obj_pa', 'vf_obj_lex', 'vf_obj_domain', 'vf_obj_domain2',
              'vf_cmpl_pa', 'vf_cmpl_lex', 'vf_cmpl_domain', 'vf_cmpl_domain2', 
              'vf_adju_pa', 'vf_adju_lex', 'vf_adju_domain', 'vf_adju_domain2'}

problems = collections.defaultdict(list)

for exp in frame_exps:
    
    samples = [(basis, match) for basis in experiments[exp].basis2result for match in experiments[exp].basis2result[basis]]

    for basis, sample in samples:
        
        clauses = sorted(n for n in sample if F.otype.v(n) == 'clause')
        target_clause = next(cl for cl in clauses if next((m for m in E.mother.f(cl)),0) not in clauses)
        
        # get expected relas based on experiment name
        exp2functs = {
                        'cmpl': {'Cmpl'},
                        'adj': {'Adju', 'PrAd', 'Time', 'Loca'},
                        'obj': {'Objc', 'PreO', 'PtcO'},
                        'arg': {'Cmpl', 'Adju', 'PrAd', 'Time', 'Loca', 'Objc', 'PreO', 'PtcO'}
                    }
        exp_key = next(key for key in exp2functs if re.search(key, exp))
        expected_functs = exp2functs[exp_key]
        
        # check to make sure all matching phrase functions are accounted for in the frame result
        for phrase in L.d(target_clause, 'phrase'):
            if F.function.v(phrase) in expected_functs and phrase not in sample:
                problems[exp].append((basis, sample, f'missing {phrase} a {F.otype.v(phrase)} with {F.function.v(phrase)}'))
                
        # check all daughter relations
        for d_cl in E.mother.t(target_clause):
            if F.rela.v(d_cl) in expected_functs and d_cl not in sample:
                problems[exp].append((basis, sample, f'missing {d_cl} a {F.otype.v(d_cl)} with {F.rela.v(d_cl)}'))
                
len(problems)

6
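
Stripped of the Text-Fabric lookups, the check is a set-membership test. A minimal sketch, assuming a clause can be represented as a plain mapping from phrase node to function; the node numbers and the helper `find_missing` are hypothetical, while the function sets are the ones used in the cell above.

```python
def find_missing(clause_phrases, sample, expected_functions):
    """Return phrase nodes whose function is expected but which the sample lacks."""
    return sorted(
        phrase for phrase, function in clause_phrases.items()
        if function in expected_functions and phrase not in sample
    )

# Hypothetical node ids and functions for the phrases of one clause.
clause_phrases = {101: 'Pred', 102: 'Objc', 103: 'Objc', 104: 'Time'}
sample = {100, 101, 102}  # nodes captured by the frame result

object_functions = {'Objc', 'PreO', 'PtcO'}
all_functions = {'Cmpl', 'Adju', 'PrAd', 'Time', 'Loca', 'Objc', 'PreO', 'PtcO'}

missing_objects = find_missing(clause_phrases, sample, object_functions)
missing_args = find_missing(clause_phrases, sample, all_functions)
```

Here the second object phrase (103) is reported for an object frame, and the time phrase (104) is additionally reported for an all-arguments frame.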

In [7]:
problems.keys()

dict_keys(['vf_argAll_lex', 'vf_argAll_domain2', 'vf_argAll_pa', 'vf_obj_pa', 'vf_adju_pa', 'vf_argAll_domain'])

In [18]:
for prob, prob_list in problems.items():
    print(f'{len(prob_list)}\t{prob}')

2	vf_argAll_lex
1	vf_argAll_domain2
116	vf_argAll_pa
11	vf_obj_pa
2	vf_adju_pa
1	vf_argAll_domain


In [25]:
for prob in problems['vf_obj_pa'][:10]:
    print(prob)

('Objc', (332969, 496650, 851010, 1437655), 'missing 851011 a phrase with Objc')
('Objc', (478650, 803059, 1437837, 258358), 'missing 803062 a phrase with Objc')
('Objc', (446632, 1437753, 102304, 709463), 'missing 709464 a phrase with Objc')
('Objc', (788417, 473498, 233649, 1437710), 'missing 788419 a phrase with Objc')
('Objc', (678466, 1437671, 436453, 43767), 'missing 678467 a phrase with Objc')
('Objc', (160919, 744103, 1437834, 458063), 'missing 744104 a phrase with Objc')
('Objc', (1437609, 493674, 843595, 322252), 'missing 843596 a phrase with Objc')
('Objc', (792416, 1437754, 239667, 474975), 'missing 792417 a phrase with Objc')
('Objc', (837016, 312993, 1441730, 490916), 'missing 837017 a phrase with Objc')
('Objc', (219256, 1441872, 778491, 469717), 'missing 778492 a phrase with Objc')
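
Since the problem tuples store their details in a message string, a small parser helps with triage. The sketch below uses three entries copied from the output above; the helper `summarize` and the adjacency heuristic are hypothetical additions. A missing node numbered right next to a sampled node hints at a second phrase of the same function in the clause, but this is only a rough indicator.

```python
import re
from collections import Counter

# Three entries copied from the printed output above.
problem_list = [
    ('Objc', (332969, 496650, 851010, 1437655), 'missing 851011 a phrase with Objc'),
    ('Objc', (478650, 803059, 1437837, 258358), 'missing 803062 a phrase with Objc'),
    ('Objc', (446632, 1437753, 102304, 709463), 'missing 709464 a phrase with Objc'),
]

MESSAGE = re.compile(r'missing (\d+) a (\w+) with (\S+)')

def summarize(problem_list):
    """Tally (basis, object type, function) and count missing nodes adjacent to a sampled node."""
    tally = Counter()
    adjacent = 0
    for basis, sample, message in problem_list:
        node, otype, function = MESSAGE.match(message).groups()
        tally[(basis, otype, function)] += 1
        # rough heuristic: the missing node directly neighbors a node in the sample
        if any(abs(int(node) - n) == 1 for n in sample):
            adjacent += 1
    return tally, adjacent

tally, adjacent = summarize(problem_list)
```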


In [26]:
B.show([(332969, 496650, 851010, 1437655)], withNodes=True)


##### Passage 1



##### Passage 2


## Evaluation Process

Evaluating datasets for accuracy is extremely important. This is especially the case since I am using two custom datasets in my research: heads and semantic domains. The first dataset is designed by me, but I am aware of some mistakes. The latter dataset is not mine but has been converted from an XML representation. It is important to check data from this source carefully to ensure that the converted data accurately reflects its source.

How can the datasets be evaluated strategically?

For one thing, the total number of observations across the datasets is very high:

**TODO: Compare these counts against total occurrences in HB**

In [7]:
for total, exp in sorted((experiments[exp].data.sum().sum(), exp) for exp in experiments):
    print(f'{exp}:\t{total}')

vd_par_lex:	723.0
vf_adju_domain:	4480.0
vf_adju_domain2:	4480.0
vf_adju_lex:	7673.0
vi_adj+_domain:	10406.0
vi_adj+_domain2:	10406.0
vf_obj_domain:	10855.0
vf_obj_domain2:	10855.0
vf_cmpl_domain:	10911.0
vf_cmpl_domain2:	10911.0
vi_cmpl_domain:	13049.0
vi_cmpl_domain2:	13049.0
vi_subj_domain:	13838.0
vf_obj_lex:	14896.0
vi_adj+_lex:	14959.0
vi_objc_domain:	15013.0
vi_objc_domain2:	15013.0
vf_cmpl_lex:	15914.0
vi_subj_lex:	16720.0
vf_argAll_domain:	16764.0
vf_argAll_domain2:	16764.0
vi_cmpl_lex:	18189.0
vi_objc_lex:	19806.0
vf_argAll_lex:	26201.0
vd_domain_embed:	49866.0
vf_obj_pa:	59095.0
vf_argAll_pa:	59412.0
vd_domain_simple:	59734.0
vf_adju_pa:	62039.0
vf_cmpl_pa:	62039.0
vg_tense:	62039.0
vi_cmpl_pa:	62039.0
vi_adj+_pa:	63153.0
vi_objc_pa:	67628.0
vd_con_window:	70947.0
vd_con_clause:	114736.0
vd_con_chain:	927080.0
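
In the totals above, every `_domain` experiment has the same total as its `_domain2` counterpart. A minimal sketch of turning that observation into an automatic check, assuming (this is not stated above) that the two variants are meant to label the same observations at different granularity. The totals are copied from the output; `mismatched_pairs` is a hypothetical helper.

```python
# Totals copied from the output above.
totals = {
    'vf_adju_domain': 4480.0, 'vf_adju_domain2': 4480.0,
    'vi_adj+_domain': 10406.0, 'vi_adj+_domain2': 10406.0,
    'vf_obj_domain': 10855.0, 'vf_obj_domain2': 10855.0,
    'vf_cmpl_domain': 10911.0, 'vf_cmpl_domain2': 10911.0,
}

def mismatched_pairs(totals):
    """Return (name, total, total2) for each domain/domain2 pair whose totals differ."""
    return sorted(
        (name, totals[name], totals[name + '2'])
        for name in totals
        if name.endswith('_domain')
        and name + '2' in totals
        and totals[name] != totals[name + '2']
    )

print(mismatched_pairs(totals))
```

An empty result means the pairs agree; any entry points at an experiment where the two domain variants selected different observations.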


One possibility is to make a small script that will guide me through a manual review process on N random samples. The sample size could realistically only be 50-100 per relevant experiment. Though this is a relatively small number compared to the total number of observations, it would at least help me find any glaring mistakes that are reproduced frequently. Some experiments, such as the discourse spaces (vd_con_window, vd_con_clause), probably do not need rigorous review, as these spaces are more or less straightforward.

I will build a script below. It does the following:

1. Takes in a .dill file (a pickle with lambda support) of samples that have already been randomized, keyed by experiment name.
2. As each sample is reviewed, it is moved to a reviewed set. The reviewer can either mark the sample as "good" or attach a note to questionable instances.
3. Each sample under review is presented in context with B.show.
4. The reviewer should be able to save and quit so that the review can be resumed later.
5. Once all samples are exhausted, the review is finished.
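
The sampling and bookkeeping behind these steps can be sketched on toy data as follows. This is a simplified illustration, not the reviewer itself: it uses plain dictionaries instead of nested `defaultdict`s, keys reviewed entries by the sample tuple rather than by a name string, and leaves out display and saving. The helper names and the toy results are hypothetical; the seed value is the one used in this notebook.

```python
import random

def draw_samples(basis2result, sample_size, seed=12321):
    """Flatten (basis, result) pairs, shuffle them reproducibly, and group the first N by basis."""
    pairs = [(basis, result)
             for basis in sorted(basis2result)
             for result in basis2result[basis]]
    random.Random(seed).shuffle(pairs)  # local generator, so the global random state is untouched
    to_review = {}
    for basis, result in pairs[:sample_size]:
        to_review.setdefault(basis, []).append(result)
    return to_review

def mark_reviewed(to_review, reviewed, basis, result, note='good'):
    """Move one sample from the to-review pool into the reviewed record."""
    to_review[basis].remove(result)
    reviewed.setdefault(basis, {})[result] = note

# Hypothetical basis -> matched node tuples.
toy_results = {
    'Objc': [(1, 2), (3, 4), (5, 6)],
    'Cmpl': [(7, 8), (9, 10)],
}

to_review = draw_samples(toy_results, sample_size=3)
reviewed = {}

basis = next(iter(to_review))
result = to_review[basis][0]
mark_reviewed(to_review, reviewed, basis, result)
```

Because the pool shrinks as samples are marked, saving both dictionaries at any point is enough to resume the review later.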

In [4]:
completed = ['vi_s_lex', 'vi_s_domain', 'vi_o_pa', 'vi_o_lex', 'vi_o_domain', 'vi_cmpl_pa', 'vi_cmpl_lex', 
             'vi_cmpl_domain', 'vi_adj+_pa', 'vi_adj+_lex', 'vi_adj+_domain', 'vd_par_lex', 'vd_con_window', 
             'vd_domain_simple', 'vd_domain_embed', 'vg_tense']

In [5]:
# # review_data['to_review'][experiment_name][basis] = list(Nsamples)
# # review_data['reviewed'][experiment_name][basis][sample] = note

# build randomized samples

# review_data = {'to_review': collections.defaultdict(lambda: collections.defaultdict(list)),
#                'reviewed': collections.defaultdict(lambda: collections.defaultdict(lambda: collections.defaultdict(dict)))
#               }

# for exp_name, experiment in experiments.items():
    
#     if exp_name in {'vd_con_chain', 'vd_con_clause'} | set(completed):
#         continue
    
#     random_bases = list((basis, result) for basis in experiment.basis2result
#                             for result in experiment.basis2result[basis])
#     random.seed(12321)
#     random.shuffle(random_bases)
    
#     sample_size = 50 if not exp_name.endswith('domain') else 75
    
#     for basis, result in random_bases[:sample_size]:        
#         review_data['to_review'][exp_name][basis].append(result)
        
# review_file = '/Users/cody/github/verb_semantics/project_code/datareview/random_samples_2.dill'
# with open(review_file, 'wb') as outfile:
#     dill.dump(review_data, outfile)

In [34]:
# fix vd_con_window
# vd_con_window has a tuple of word nodes within the results tuple
# B.show crashes since it expects only a list of single tuples
# I fix it here and export a repaired version

# with open(review_file, 'rb') as infile:
#     rev_data = dill.load(infile)
    
# rev_data_fix = copy.deepcopy(rev_data)    

# for lex, samples in rev_data['to_review']['vd_con_window'].items():
#     new_samples = list()
#     for sample in samples:
#         first_part = list(sample[:-1])
#         second_part = list(sample[-1])
#         new = tuple(first_part + second_part)
#         new_samples.append(new)
#     rev_data_fix['to_review']['vd_con_window'][lex] = new_samples

# export repaired dataset

# with open(review_file, 'wb') as outfile:
#     dill.dump(rev_data_fix, outfile)

In [6]:
# review_data['reviewed'][experiment_name][basis][sample] = note

def review(pickle_file, backup=''):
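    '''
    A simple reviewer function that steps through
    random samples of my experiment data.

    pickle_file: path to the .dill file holding the 'to_review' and 'reviewed' data
    backup: file name (without extension) of the backup written on quit
    '''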
    
    with open(pickle_file, 'rb') as infile:
        review_data = dill.load(infile)
    
    to_review = review_data['to_review']
    reviewed = review_data['reviewed']
    
    previous = tuple()
    new_to_review = copy.deepcopy(to_review)
    completed = [] # strings of completed reviews
    
    for i, experiment_name in enumerate(to_review):      
        
        print(f'reviewing {experiment_name}')
        time.sleep(.5)
        clear_output()
        
        start_i = len([sample for basis in reviewed[experiment_name]
                          for sample in reviewed[experiment_name][basis]])
        basis_i = 1 + start_i
        len_bases = len([sample for basis in to_review[experiment_name]
                            for sample in to_review[experiment_name][basis]]) + start_i
        
        for basis_name, samples in to_review[experiment_name].items():
            for sample in samples:
                
                name = f'{experiment_name}/{basis_name}/{basis_i}of{len_bases}'
                
                print('1 for good; 2 for notes; 3 for get last; q for quit\n')
                B.show((sample,), withNodes=True)

                while True:
                    #print(F.domain.v(sample[0]))
                    choice = input(f'{i+1}/{len(to_review)}\t{name}\n')
                    
                    if choice == '1':
                        reviewed[experiment_name][basis_name][name] = {'review': 'good', 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                        
                    elif choice == '2':
                        note = input('input note:')
                        reviewed[experiment_name][basis_name][name] = {'review': note, 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                        
                    elif choice == '3':
                        print('\n', previous)
                        
                    elif choice == 'q':
                        print('quitting...')
                        print(f'\ncompleted:{completed}')
                        with open(pickle_file, 'wb') as outfile:
                            save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                            dill.dump(save_data, outfile)
                        if backup:  # skip the named backup when no name was given
                            with open(f'/Users/cody/Documents/{backup}.dill', 'wb') as outfile:
                                dill.dump(save_data, outfile)
                        return 'data successfully saved...'
                        
                    else:
                        print('input invalid...')
                    
                # save constantly
                with open(pickle_file, 'wb') as outfile:
                    save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                    dill.dump(save_data, outfile)
                with open('/Users/cody/Documents/BACKUP_data_review.dill', 'wb') as outfile:
                    dill.dump(save_data, outfile)
                previous = (name, sample)
                basis_i += 1
                clear_output()
                
        completed.append(experiment_name)
                
    print('**REVIEW COMPLETE**')

### Review

In [13]:
review_file = '/Users/cody/github/verb_semantics/project_code/datareview/random_samples.dill'

<hr>

### TO-FIX:

These are instances that I mistakenly marked:

> should be 'good'. ('vi_o_pa/object/34of50', (429031, 656123, 8140, 429032, 1437759, 8140))
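
A mis-marked entry like this one can be corrected directly in the `reviewed` structure, which the reviewer fills as `reviewed[experiment_name][basis_name][name]`. The sketch below is a hedged illustration: the helpers `split_review_name` and `correct_review` are hypothetical, and the mistaken note in the toy data is made up. Because lexeme bases can themselves contain `/`, the name is split once from the left and once from the right rather than on every slash.

```python
def split_review_name(name):
    """Split 'experiment/basis/NofM' into its parts, allowing slashes inside the basis."""
    experiment, rest = name.split('/', 1)
    basis, counter = rest.rsplit('/', 1)
    return experiment, basis, counter

def correct_review(reviewed, name, new_review='good'):
    """Overwrite the review of one entry and return the updated entry."""
    experiment, basis, _ = split_review_name(name)
    entry = reviewed[experiment][basis][name]
    entry['review'] = new_review
    return entry

name = 'vi_o_pa/object/34of50'
result = (429031, 656123, 8140, 429032, 1437759, 8140)

# Toy stand-in for the loaded review data; the note text is hypothetical.
reviewed = {
    'vi_o_pa': {'object': {name: {'review': 'hypothetical mistaken note', 'result': result}}}
}

entry = correct_review(reviewed, name)
```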