# Data Validation and Discovery

The purpose of this notebook is to examine the shape and contents of the datasets generated using the parameters stored in [project_code/parameters.py](project_code/parameters.py). Specifically, I am looking for consistency in the data, and I am checking that the data to be analyzed indeed represents what I expect it to. Selecting linguistic data inevitably involves a number of assumptions about the shape of the data. When the selected data does not match those expectations, the resulting analysis is incorrect. I obviously want to avoid that!

In the notebook, I load all of the experiments defined in the parameters module. For each dataset, I analyze the contents and distribution of features. At the same time, I look at the matched clauses behind the data in cases that are unexpected or surprising.

### Data Preparation

In [64]:
import numpy as np
import pandas as pd
import collections, os, sys, random, time, pickle, dill, copy
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments2 import Experiment
from project_code.semspace import SemSpace

bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/verb_semantics/project_code/lingo/heads/tf/c',
                 '~/github/verb_semantics/project_code/sdbh']

TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex vs language
                pdp freq_lex gloss domain ls
                mother rela typ sp st code txt
                heads prep_obj 
                sem_domain sem_domain_code
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='', version='c')
B.prettySetup(features={'sem_domain', 'sem_domain_code'})

from project_code.parameters import * # import all definitions and functions from the parameters module

This is Text-Fabric 4.2.0
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored


**Documentation:** <a target="_blank" href="https://etcbc.github.io/bhsa" title="{provenance of this corpus}">BHSA</a> <a target="_blank" href="https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html" title="{CORPUS} feature documentation">Feature docs</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/Bhsa/" title="BHSA API documentation">BHSA API</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/" title="text-fabric-api">Text-Fabric API 4.2.0</a> <a target="_blank" href="https://dans-labs.github.io/text-fabric/Api/General/#search-templates" title="Search Templates Introduction and Reference">Search Reference</a>

There are two kinds of experiment parameters: inventories and frames. Inventories count each feature that co-occurs with a verb individually. Frames count all features within the verb's clause as a single unit, or frame. I will load both types into a single dictionary.
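To make the distinction concrete, here is a minimal sketch on toy data. The clause list, the verb, and the choice to sort features before joining them are all assumptions for illustration; this is not the counting code in `project_code.experiments2`.

```python
from collections import Counter

# Toy clauses (hypothetical): each pairs a verb with the features found in its clause.
clauses = [
    ('give', ['Objc', 'Cmpl']),
    ('give', ['Objc', 'Cmpl']),
    ('give', ['Objc']),
]

inventory = Counter()  # one count per (verb, single feature)
frames = Counter()     # one count per (verb, whole-clause frame)

for verb, features in clauses:
    for feature in features:
        inventory[(verb, feature)] += 1
    # sorting is an assumption that makes the frame independent of word order
    frames[(verb, '|'.join(sorted(features)))] += 1

print(inventory)
print(frames)
```

The inventory keeps `Objc` and `Cmpl` as separate basis elements, while the frame treats `Cmpl|Objc` as one element that is distinct from a bare `Objc`.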

In [2]:
experiments = {}

indent(0, reset=True)

for label, exp_type in params.items():    
    info(f'processing {label} experiments...')
    for name, experiment_params in exp_type.items():
        print(f'\tbuilding {name}...')
        is_frame = label != 'inventory'
        min_obs = 10 if name != 'vd_par_lex' else 1
        experiments[name] = Experiment(experiment_params, tf=tf_api, frame=is_frame, min_observation=min_obs)
        print('\t\tfinished')
    
info('finished processing experiments...')
info(f'\t{len(experiments)} experiments loaded.')

  0.00s processing inventory experiments...
	building vi_s_lex...
		finished
	building vi_s_domain...
		finished
	building vi_o_pa...
		finished
	building vi_o_lex...
		finished
	building vi_o_domain...
		finished
	building vi_cmpl_pa...
		finished
	building vi_cmpl_lex...
		finished
	building vi_cmpl_domain...
		finished
	building vi_adj+_pa...
		finished
	building vi_adj+_lex...
		finished
	building vi_adj+_domain...
		finished
	building vd_par_lex...
		finished
	building vd_con_window...
		finished
	building vd_con_clause...
		finished
	building vd_con_chain...
		finished
	building vd_domain_simple...
		finished
	building vd_domain_embed...
		finished
	building vg_tense...
		finished
   |     3.58s processing frame experiments...
	building vf_argAll_pa...
		finished
	building vf_argAll_lex...
		finished
	building vf_argAll_domain...
		finished
	building vf_cmpl_lex...
		finished
	building vf_cmpl_domain...
		finished
	building vf_adj+_lex...
		finished
	building vf_adj+_domain...
		

## Analysis

### Shapes, Bases * Targets

The experiments differ in their basis units or tests, i.e. domain, lexeme, presence/absence (pa). Each shape is (number of bases, number of targets). Below, the experiments are sorted by shape, smallest number of bases first:

In [4]:
for shape, exp in sorted((experiments[exp].data.shape, exp) for exp in experiments):
    print(f'{exp}:\t{shape}')

vi_s_domain:	(2, 216)
vi_o_domain:	(2, 223)
vi_o_pa:	(2, 671)
vi_cmpl_pa:	(2, 682)
vd_domain_simple:	(3, 661)
vi_adj+_pa:	(5, 685)
vg_tense:	(7, 682)
vi_cmpl_domain:	(39, 204)
vi_adj+_domain:	(50, 198)
vd_domain_embed:	(73, 609)
vf_cmpl_domain:	(183, 190)
vd_par_lex:	(305, 365)
vf_adj+_domain:	(426, 174)
vf_argAll_pa:	(1104, 610)
vf_argAll_domain:	(1437, 381)
vi_s_lex:	(1899, 273)
vi_o_lex:	(2367, 275)
vi_adj+_lex:	(2823, 271)
vi_cmpl_lex:	(3817, 261)
vf_adj+_lex:	(3841, 232)
vf_cmpl_lex:	(4170, 247)
vd_con_window:	(4232, 723)
vd_con_clause:	(5283, 845)
vd_con_chain:	(8281, 1182)
vf_argAll_lex:	(13705, 456)


It's not surprising that the smallest basis counts are in the domain and presence/absence categories; the latter simply check for the presence of a given feature. For objects and complements, the values should be either 'ø' or the label of the function itself ('object', 'Cmpl'):

In [5]:
print(experiments['vi_o_pa'].data.index.unique())
print(experiments['vi_cmpl_pa'].data.index.unique())
print(experiments['vi_adj+_pa'].data.index.unique()) # note that adjuncts + are more diverse

Index(['object', 'ø'], dtype='object')
Index(['Cmpl', 'ø'], dtype='object')
Index(['Adju', 'Loca', 'PrAd', 'Time', 'ø'], dtype='object')
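
A check like this can also be written as an explicit comparison against labels decided on beforehand. The following is a minimal sketch: the `expected` sets are hypothetical, and the observed labels are copied from the output above rather than read from `experiments[name].data.index`.

```python
def unexpected_labels(labels, allowed):
    """Return the labels that are not in the allowed set, sorted."""
    return sorted(set(labels) - set(allowed))

# Label sets copied from the output above; in the notebook these would
# come from experiments[name].data.index.
observed = {
    'vi_o_pa': ['object', 'ø'],
    'vi_cmpl_pa': ['Cmpl', 'ø'],
    'vi_adj+_pa': ['Adju', 'Loca', 'PrAd', 'Time', 'ø'],
}

# Hypothetical expectations, written down before looking at the data.
expected = {
    'vi_o_pa': {'object', 'ø'},
    'vi_cmpl_pa': {'Cmpl', 'ø'},
    'vi_adj+_pa': {'Adju', 'Loca', 'Time', 'ø'},
}

for name, labels in observed.items():
    extra = unexpected_labels(labels, expected[name])
    print(name, 'OK' if not extra else f'unexpected: {extra}')
```

With these made-up expectations, `PrAd` is reported for `vi_adj+_pa`, which is the kind of label that would then be worth inspecting in context.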


Domain spaces also have fewer features:

In [6]:
print(experiments['vi_s_domain'].data.index.unique())
print(experiments['vi_o_domain'].data.index.unique())

Index(['animate', 'inanimate'], dtype='object')
Index(['animate', 'inanimate'], dtype='object')


Complement domain elements are more complex, since they combine domain tags with preposition lexemes:

In [7]:
print(experiments['vi_cmpl_domain'].data.index.unique(), '\n')
print(experiments['vi_adj+_domain'].data.index.unique())

Index(['<D_animate', '<D_inanimate', '<L_animate', '<L_inanimate',
       '<M_animate', '<M_inanimate', '>L_animate', '>L_inanimate',
       '>T==_animate', '>T==_inanimate', '>T_animate', '>T_inanimate',
       '>XR/_animate', '>XR/_inanimate', '>YL/_animate', '>YL/_inanimate',
       'B<D/_animate', 'B<D/_inanimate', 'BJN/_animate', 'BJN/_inanimate',
       'BMW_inanimate', 'B_animate', 'B_inanimate', 'DRK/_animate',
       'DRK/_inanimate', 'K_animate', 'K_inanimate', 'LM<N_inanimate',
       'L_animate', 'L_inanimate', 'MN_animate', 'MN_inanimate',
       'MWL/_animate', 'NGD/_animate', 'NGD/_inanimate', 'TXT/_animate',
       'TXT/_inanimate', 'animate', 'inanimate'],
      dtype='object') 

Index(['<D_animate', '<D_inanimate', '<L_animate', '<L_inanimate',
       '<M_animate', '<M_inanimate', '>L_animate', '>L_inanimate',
       '>T==_animate', '>T==_inanimate', '>T_inanimate', '>XR/_animate',
       '>XR/_inanimate', '>YL/_animate', '>YL/_inanimate', 'B<D/_animate',
       'B<D/
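
Since these labels join a preposition lexeme and a domain tag with an underscore, they can be taken apart mechanically. The sketch below assumes that the domain tag never contains an underscore and that it is always the last component; `split_basis` is a hypothetical helper, and the labels are copied from the output above.

```python
DOMAINS = {'animate', 'inanimate'}

def split_basis(label):
    """Split 'PREP_domain' into (prep, domain); prep is None when absent."""
    prep, sep, domain = label.rpartition('_')
    return (prep if sep else None), domain

for label in ['<D_animate', '>T==_inanimate', 'B<D/_animate', 'inanimate']:
    prep, domain = split_basis(label)
    # fail loudly if a label carries a tag outside the two expected domains
    assert domain in DOMAINS, f'unexpected domain tag in {label!r}'
    print(f'{label!r:20} prep={prep!r:8} domain={domain}')
```

Splitting on the last underscore keeps the check valid even if a preposition lexeme itself were to contain one.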

Let's take a look at some of these domain categories to ensure domain is functioning properly.

In [8]:
#B.show(experiments['vi_cmpl_domain'].basis2clause['<D_animate'][:10])

To be considered: I have included "Names of Groups" under the "animate" category. However, these terms can also function as inanimate entities when they refer more to locations. I should consider excluding these domains altogether due to the ambiguity.

In [9]:
#B.show(experiments['vi_s_domain'].basis2clause['animate'][1000:1010])

In [10]:
print(experiments['vf_cmpl_domain'].data.index.unique())

Index(['Cmpl.<D_animate', 'Cmpl.<D_animate|Cmpl.L_inanimate|Cmpl.MN_inanimate',
       'Cmpl.<D_inanimate', 'Cmpl.<D_inanimate|Cmpl.<D_inanimate',
       'Cmpl.<D_inanimate|Cmpl.<D_inanimate|Cmpl.<D_inanimate',
       'Cmpl.<D_inanimate|Cmpl.<D_inanimate|Cmpl.>L_inanimate|Cmpl.MN_inanimate',
       'Cmpl.<D_inanimate|Cmpl.<D_inanimate|Cmpl.MN_inanimate',
       'Cmpl.<D_inanimate|Cmpl.>XR/_animate|Cmpl.>XR/_inanimate',
       'Cmpl.<D_inanimate|Cmpl.B_inanimate',
       'Cmpl.<D_inanimate|Cmpl.MN_inanimate',
       ...
       'Cmpl.TXT/_animate', 'Cmpl.TXT/_inanimate', 'Cmpl.animate',
       'Cmpl.inanimate', 'Cmpl.inanimate|Cmpl.inanimate',
       'Cmpl.inanimate|Cmpl.inanimate|Cmpl.inanimate', 'PreC.inanimate',
       'PreO.inanimate', 'Pred.inanimate', 'PtcO.inanimate'],
      dtype='object', length=183)
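
Frame labels appear to join their elements with `|`, and each element appears to consist of a phrase function and an argument separated by the first `.`. Under that assumption, a small tally makes it easy to see which functions occur in the frames at all, including ones such as `PreC` or `Pred` that may deserve a closer look. `frame_elements` is a hypothetical helper and the labels are copied from the output above.

```python
from collections import Counter

def frame_elements(frame):
    """Split a frame label into (function, argument) pairs."""
    pairs = []
    for element in frame.split('|'):
        function, _, argument = element.partition('.')
        pairs.append((function, argument))
    return pairs

# Frame labels copied from the output above.
frames = [
    'Cmpl.<D_animate|Cmpl.L_inanimate|Cmpl.MN_inanimate',
    'Cmpl.<D_inanimate|Cmpl.B_inanimate',
    'PreC.inanimate',
]

sizes = Counter(len(frame_elements(f)) for f in frames)
functions = Counter(fn for f in frames for fn, _ in frame_elements(f))
print(sizes)      # how many frames have 1, 2, 3... elements
print(functions)  # how often each phrase function occurs
```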


In [11]:
#B.show(experiments['vf_cmpl_domain'].basis2clause['Cmpl.TXT/_animate'][:40], withNodes=True)

## Evaluation Process

Evaluating datasets for accuracy is extremely important. This is especially the case since I am using two custom datasets in my research: heads and semantic domains. The first dataset was designed by me, and I am aware that it contains some mistakes. The latter dataset is not mine but has been converted from an XML representation. It is important to check data from this source carefully to ensure that the converted data accurately reflects its source.

How to strategically evaluate the datasets? 

For one thing, the total number of observations across the datasets is very high:

**TODO: Compare these counts against total occurrences in HB**

In [12]:
for total, exp in sorted((experiments[exp].data.sum().sum(), exp) for exp in experiments):
    print(f'{exp}:\t{total}')

vd_par_lex:	723.0
vf_adj+_domain:	8067.0
vi_adj+_domain:	9511.0
vf_cmpl_domain:	10538.0
vf_adj+_lex:	11151.0
vi_cmpl_domain:	11597.0
vi_s_domain:	13060.0
vi_o_domain:	13136.0
vi_adj+_lex:	13773.0
vf_cmpl_lex:	14684.0
vi_s_lex:	15714.0
vi_cmpl_lex:	16272.0
vi_o_lex:	17055.0
vf_argAll_domain:	23652.0
vf_argAll_lex:	29955.0
vf_argAll_pa:	43334.0
vd_domain_embed:	45665.0
vi_o_pa:	54089.0
vd_domain_simple:	54658.0
vg_tense:	56494.0
vi_cmpl_pa:	56494.0
vi_adj+_pa:	57530.0
vd_con_window:	62856.0
vd_con_clause:	105556.0
vd_con_chain:	895144.0


One possibility is to make a small script that will guide me through a manual review process on N random samples. The sample size could realistically only be 50-100 per relevant experiment. Though this is a relatively small number compared to the total number of observations, it would at least help me find any glaring mistakes that are reproduced frequently. Some experiments, such as the discourse spaces (vd_con_window, vd_con_clause), probably do not need rigorous review, as these spaces are more or less straightforward.
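
How frequent a mistake has to be before a sample of this size is likely to reveal it can be estimated with a short calculation. The sketch below assumes that samples are drawn independently and that an error affects a fixed share of all observations; the error rates are illustrative, not measured.

```python
def detection_probability(error_rate, sample_size):
    """Chance that a random sample contains at least one erroneous item."""
    return 1 - (1 - error_rate) ** sample_size

for n in (50, 75, 100):
    for rate in (0.01, 0.05, 0.10):
        p = detection_probability(rate, n)
        print(f'n={n:3d}  error rate={rate:.0%}  P(at least one hit)={p:.2f}')
```

Under these assumptions, an error affecting 5% of observations shows up at least once in a sample of 50 with a probability above 0.9, whereas an error affecting only 1% would be missed more often than not.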

I will build a script below. It does the following:

1. Takes in a .dill file (.pickle with lambda support) of samples that have already been randomized, each keyed by the experiment name. 
2. As a sample is reviewed, it is placed in a reviewed set. There is a simple option to mark the sample as "good" or to attach a note for questionable instances.
3. When a sample is reviewed, it is presented in context with B.show.
4. The reviewer should be able to save and quit so that the review can be resumed later.
5. Once all samples are exhausted, the reviewer is finished.
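
On the first point: the standard `pickle` module cannot serialize a `defaultdict` whose factory is a lambda, which is the reason for using dill. A possible alternative, sketched here on toy data, is to convert the nested structure to plain dicts before saving. `to_plain` is a hypothetical helper, not part of the project code.

```python
import pickle
from collections import defaultdict

def to_plain(obj):
    """Recursively convert (default)dicts into plain dicts."""
    if isinstance(obj, dict):
        return {key: to_plain(value) for key, value in obj.items()}
    return obj

nested = defaultdict(lambda: defaultdict(list))
nested['vi_o_pa']['object'].append((1, 2, 3))  # toy sample, not real nodes

try:
    pickle.dumps(nested)
    pickled_directly = True
except (pickle.PicklingError, AttributeError, TypeError):
    pickled_directly = False  # the lambda factory cannot be pickled

restored = pickle.loads(pickle.dumps(to_plain(nested)))
print(pickled_directly, restored)
```

The trade-off is that plain dicts no longer create missing keys on access, so code that writes into the restored structure would need `setdefault` or explicit key checks.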

In [79]:
# # review_data['to_review'][experiment_name][basis] = list(Nsamples)
# # review_data['reviewed'][experiment_name][basis][sample] = note

# build randomized samples

# review_data = {'to_review': collections.defaultdict(lambda: collections.defaultdict(list)),
#                'reviewed': collections.defaultdict(lambda: collections.defaultdict(lambda: collections.defaultdict(dict)))
#               }

# for exp_name, experiment in experiments.items():
    
#     if exp_name in {'vd_con_chain', 'vd_con_clause'}:
#         continue
    
#     random_bases = list((basis, result) for basis in experiment.basis2result
#                             for result in experiment.basis2result[basis])
#     random.seed(12321)
#     random.shuffle(random_bases)
    
#     sample_size = 50 if not exp_name.endswith('domain') else 75
    
#     for basis, result in random_bases[:sample_size]:        
#         review_data['to_review'][exp_name][basis].append(result)
        
# review_file = '/Users/cody/github/verb_semantics/project_code/datareview/random_samples.dill'
# with open(review_file, 'wb') as outfile:
#     dill.dump(review_data, outfile)
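
The sampling step in the cell above can be tried in isolation on toy data. In this sketch, `basis2result` is a hypothetical stand-in for `experiment.basis2result`, and a local `random.Random` instance replaces the global `random.seed` call so that the seed does not affect other code.

```python
import random
from collections import defaultdict

# Hypothetical stand-in for experiment.basis2result: basis -> list of results.
basis2result = {
    'animate': [(1, 10), (2, 20), (3, 30)],
    'inanimate': [(4, 40), (5, 50)],
}

def draw_samples(basis2result, sample_size, seed=12321):
    """Return {basis: [results]} for a reproducible random sample."""
    pairs = [(basis, result) for basis, results in basis2result.items()
             for result in results]
    rng = random.Random(seed)  # local generator; global state is untouched
    rng.shuffle(pairs)
    samples = defaultdict(list)
    for basis, result in pairs[:sample_size]:
        samples[basis].append(result)
    return dict(samples)

first = draw_samples(basis2result, sample_size=3)
second = draw_samples(basis2result, sample_size=3)
print(first)
```

Because the shuffle runs over (basis, result) pairs, the sample size applies to the experiment as a whole, and frequent bases receive proportionally more samples than rare ones.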

In [41]:
len(review_data['to_review']['vi_s_domain']['inanimate']) + len(review_data['to_review']['vi_s_domain']['animate'])

75

In [88]:
# review_data['reviewed'][experiment_name][basis][sample] = note

def review(pickle_file):
    '''
    A simple reviewer function that reviews
    random samples of my experiment data.
    '''
    
    with open(pickle_file, 'rb') as infile:
        review_data = dill.load(infile)
    
    to_review = review_data['to_review']
    reviewed = review_data['reviewed']
    
    previous = tuple()
    new_to_review = copy.deepcopy(to_review)
    
    for i, experiment_name in enumerate(to_review):      
        
        print(f'current: {experiment_name}')
        time.sleep(1)
        clear_output()
        
        start_i = len([sample for basis in reviewed[experiment_name]
                          for sample in reviewed[experiment_name][basis]])
        basis_i = 1 + start_i
        len_bases = len([sample for basis in to_review[experiment_name]
                            for sample in to_review[experiment_name][basis]]) + start_i
        
        for basis_name, samples in to_review[experiment_name].items():
            for sample in samples:
                
                name = f'{experiment_name}/{basis_name}/{basis_i}of{len_bases}'
                
                print('1 for good; 2 for notes; 3 for get last; q for quit\n')
                B.show((sample,), withNodes=True)

                while True:
                    choice = input(f'{i+1}/{len(to_review)}\t{name}\n')
                    
                    if choice == '1':
                        reviewed[experiment_name][basis_name][name] = {'review': 'good', 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                        
                    elif choice == '2':
                        note = input('input note:')
                        reviewed[experiment_name][basis_name][name] = {'review': note, 'result': sample}
                        new_to_review[experiment_name][basis_name].remove(sample)
                        break
                        
                    elif choice == '3':
                        print('\n', previous)
                        
                    elif choice == 'q':
                        print('quitting...')
                        with open(pickle_file, 'wb') as outfile:
                            save_data = {'to_review': new_to_review, 'reviewed': reviewed}
                            dill.dump(save_data, outfile)
                        return 'data successfully saved...'
                        
                    else:
                        print('input invalid...')
                    
                previous = (name, sample)
                basis_i += 1
                clear_output()
                
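    # all samples are reviewed: persist the final state so it is not lost
    with open(pickle_file, 'wb') as outfile:
        dill.dump({'to_review': new_to_review, 'reviewed': reviewed}, outfile)
    print('**REVIEW COMPLETE**')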

### Review

In [85]:
review_file = '/Users/cody/github/verb_semantics/project_code/datareview/random_samples.dill'

In [87]:
review(review_file)


##### Verse 1



##### Verse 2


1 for good; 2 for notes; 3 for get last; q for quit
quitting...


'data successfully saved...'

<hr>

### TO-FIX:

These are instances that I mistakenly marked:

> should be 'good'. ('vi_o_pa/object/34of50', (429031, 656123, 8140, 429032, 1437759, 8140))
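
A correction like this could be applied directly to the saved review data. The following is a minimal sketch on a toy copy of the `reviewed` structure: `parse_name` and `amend` are hypothetical helpers, and the original note text is invented. Since basis names such as `>XR/_animate` contain a slash themselves, the name is split at the first and the last slash only, which assumes that experiment names never contain one.

```python
def parse_name(name):
    """Split 'experiment/basis/iofN' into its three parts."""
    experiment, _, rest = name.partition('/')
    basis, _, position = rest.rpartition('/')
    return experiment, basis, position

def amend(reviewed, name, new_review):
    """Overwrite the 'review' field of an already reviewed sample."""
    experiment, basis, _ = parse_name(name)
    reviewed[experiment][basis][name]['review'] = new_review

# Toy structure mirroring review_data['reviewed']; the note text is invented.
reviewed = {'vi_o_pa': {'object': {
    'vi_o_pa/object/34of50': {
        'review': 'a mistaken note',
        'result': (429031, 656123, 8140, 429032, 1437759, 8140),
    },
}}}

amend(reviewed, 'vi_o_pa/object/34of50', 'good')
print(reviewed['vi_o_pa']['object']['vi_o_pa/object/34of50']['review'])
print(parse_name('vi_cmpl_domain/>XR/_animate/3of75'))
```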