# Data Validation and Discovery

The purpose of this notebook is to examine the shape and contents of the datasets generated with the parameters stored in [project_code/parameters.py](project_code/parameters.py). Specifically, I check the data for consistency and confirm that the data to be analyzed represents what I expect it to. Selecting linguistic data inevitably involves a number of assumptions about the shape of that data. When the selected data does not match those assumptions, the analysis built on it will be incorrect.

In this notebook, I load all of the experiments defined in the parameters module. For each dataset, I analyze the contents and distribution of features. Alongside this, I look at the matched clauses behind the data to find cases that are unexpected or surprising.
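As a rough illustration of what checking the contents and distribution of features can involve, here is a minimal sketch on toy data. The `dataset` mapping, the `summarize` helper, and the `min_observations` threshold are all hypothetical and are not part of the project code; the real experiments may store their counts differently.

```python
from collections import Counter

# Hypothetical toy dataset (an assumption for illustration only):
# each verb maps to a Counter of the features observed with it.
dataset = {
    'verb_a': Counter({'feat_1': 12, 'feat_3': 3}),
    'verb_b': Counter({'feat_3': 1}),
    'verb_c': Counter({'feat_1': 5, 'feat_2': 7}),
}

def summarize(dataset, min_observations=5):
    """Report basic size and distribution facts about a verb/feature dataset."""
    # Total observations per verb, used to flag thinly attested verbs.
    verb_totals = {verb: sum(counts.values()) for verb, counts in dataset.items()}

    # Total observations per feature across all verbs.
    feature_totals = Counter()
    for counts in dataset.values():
        feature_totals.update(counts)

    return {
        'n_verbs': len(dataset),
        'n_features': len(feature_totals),
        'feature_totals': feature_totals,
        'sparse_verbs': sorted(
            verb for verb, total in verb_totals.items() if total < min_observations
        ),
    }

summary = summarize(dataset)
print(summary['n_verbs'], summary['n_features'], summary['sparse_verbs'])
```

Verbs flagged as sparse, or features with unexpectedly high or low totals, are the kind of cases worth tracing back to their matched clauses.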

### Data Preparation

In [1]:
import numpy as np
import pandas as pd
import collections, os, sys, random
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments2 import Experiment
from project_code.semspace import SemSpace

bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/verb_semantics/project_code/lingo/heads/tf/c',
                 '~/github/verb_semantics/project_code/sdbh']

TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex vs language
                pdp freq_lex gloss domain ls
                mother rela typ sp st code
                heads prep_obj 
                sem_domain sem_domain_code
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='', version='c')

from project_code.parameters import * # import all definitions and functions from the parameters module

This is Text-Fabric 4.2.0
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored


**Documentation:** [BHSA](https://etcbc.github.io/bhsa) [Feature docs](https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html) [BHSA API](https://dans-labs.github.io/text-fabric/Api/Bhsa/) [Text-Fabric API 4.2.0](https://dans-labs.github.io/text-fabric/Api/General/) [Search Reference](https://dans-labs.github.io/text-fabric/Api/General/#search-templates)

There are two kinds of experiment parameters: inventories and frames. Inventories count each feature that co-occurs with a verb individually. Frames count all of the features within the verb's clause together as a single unit, or frame.
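The difference between the two counting strategies can be sketched on toy data. This is only an illustration of the idea, not the logic of the `Experiment` class; the `clauses` list, the feature labels, and both helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each entry is (verb, features found in the verb's clause).
clauses = [
    ('give', ('Subj', 'Objc', 'Cmpl')),
    ('give', ('Subj', 'Objc')),
    ('go', ('Subj', 'Cmpl')),
    ('give', ('Subj', 'Objc', 'Cmpl')),
]

def count_inventory(clauses):
    """Inventory-style counting: every feature is tallied on its own."""
    counts = Counter()
    for verb, features in clauses:
        for feature in features:
            counts[(verb, feature)] += 1
    return counts

def count_frames(clauses):
    """Frame-style counting: the clause's features are tallied as one unit."""
    counts = Counter()
    for verb, features in clauses:
        counts[(verb, '|'.join(features))] += 1
    return counts

inventory_counts = count_inventory(clauses)
frame_counts = count_frames(clauses)

print(inventory_counts[('give', 'Objc')])
print(frame_counts[('give', 'Subj|Objc|Cmpl')])
```

In the inventory counts, an object with `give` is recorded once per clause regardless of what else appears in that clause. In the frame counts, the same clause contributes only to the one combined pattern it instantiates.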

In [2]:
inventories = {}
frames = {}

for label, experiments in params.items():
    
    print(f'processing {label} experiments...')
    
    for name, experiment_params in experiments.items():
        
        print(f'\t building {name}...')
        
        # every experiment group other than 'inventory' is stored as a frame
        is_frame = label != 'inventory'
        store_dict = frames if is_frame else inventories
        store_dict[name] = Experiment(experiment_params, tf=tf_api, frame=is_frame)
        
        print('\t\t finished')
        
print('finished processing experiments...')
print(f'\t{len(inventories)+len(frames)} experiments loaded.')

processing inventory experiments...
	 building vi_s_lex...
		 finished
	 building vi_s_domain...
		 finished
	 building vi_o_pa...
		 finished
	 building vi_o_lex...
		 finished
	 building vi_o_domain...
		 finished
	 building vi_cmpl_pa...
		 finished
	 building vi_cmpl_lex...
		 finished
	 building vi_cmpl_domain...
		 finished
	 building vi_adj+_pa...
		 finished
	 building vi_adj+_lex...
		 finished
	 building vi_adj+_domain...
		 finished
	 building vd_par_lex...
		 finished
	 building vd_con_window...
		 finished
	 building vd_con_clause...
		 finished
	 building vd_con_chain...
		 finished
processing frame experiments...
	 building vf_argAll_pa...
		 finished
	 building vf_argAll_lex...
		 finished
	 building vf_argAll_domain...
		 finished
	 building vf_cmpl_lex...
		 finished
	 building vf_cmpl_domain...
		 finished
	 building vf_adj+_lex...
		 finished
	 building vf_adj+_domain...
		 finished
finished processing experiments...
	22 experiments loaded.
