# Data Validation and Discovery

The purpose of this notebook is to examine the shape and contents of the datasets generated from the parameters stored in [project_code/parameters.py](project_code/parameters.py). Specifically, I check that the data are consistent and that the data to be analyzed indeed represent what I expect them to. Selecting linguistic data inevitably involves a number of assumptions about the shape of the data. When the selected data do not match those expectations, the resulting analysis is incorrect.

In this notebook, I load all of the experiments defined in the parameters module. For each dataset, I analyze the contents and distribution of features. At the same time, I look at the matched clauses behind any results that are unexpected or surprising.
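
As a rough illustration of the kind of per-dataset check meant here, the sketch below summarizes the size and sparsity of a count matrix. The toy matrix, the feature labels, and the `summarize` helper are hypothetical and are not part of `project_code`. The layout (features as rows, verbs as columns) is an assumption based on how the datasets are summed per verb later in this notebook.

```python
import pandas as pd

# Hypothetical toy co-occurrence matrix (assumption: rows = features, columns = verbs).
toy = pd.DataFrame(
    {'NTN[.qal': [3.0, 0.0, 1.0], 'HLK[.qal': [0.0, 2.0, 0.0]},
    index=['feat_a', 'feat_b', 'feat_c'],
)

def summarize(data):
    """Return basic size and sparsity figures for a count matrix."""
    n_features, n_verbs = data.shape
    return {
        'n_features': n_features,
        'n_verbs': n_verbs,
        # total number of observations across all cells
        'total_observations': float(data.values.sum()),
        # share of cells that contain no observation at all
        'share_zero_cells': float((data == 0).values.mean()),
    }

summary = summarize(toy)
print(summary)
```

A high share of zero cells or an unexpectedly small number of verbs would be a first hint that the selection parameters do not capture what was intended.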

## Data Preparation

In [1]:
import numpy as np
import pandas as pd
import collections, os, sys, random
from IPython.display import display
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from tf.fabric import Fabric
from tf.extra.bhsa import Bhsa
from project_code.experiments2 import Experiment
from project_code.semspace import SemSpace

bhsa_data_paths=['~/github/etcbc/bhsa/tf/c',
                 '~/github/verb_semantics/project_code/lingo/heads/tf/c',
                 '~/github/verb_semantics/project_code/sdbh']

TF = Fabric(bhsa_data_paths)
tf_api = TF.load('''
                function lex vs language
                pdp freq_lex gloss domain ls
                mother rela typ sp st code
                heads prep_obj 
                sem_domain sem_domain_code
              ''', silent=True)

tf_api.makeAvailableIn(globals())
B = Bhsa(api=tf_api, name='', version='c')

from project_code.parameters import * # import all definitions and functions from the parameters module

This is Text-Fabric 4.2.0
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

118 features found and 0 ignored


**Documentation:** [BHSA](https://etcbc.github.io/bhsa), [Feature docs](https://etcbc.github.io/bhsa/features/hebrew/c/0_home.html), [BHSA API](https://dans-labs.github.io/text-fabric/Api/Bhsa/), [Text-Fabric API 4.2.0](https://dans-labs.github.io/text-fabric/Api/General/), [Search Reference](https://dans-labs.github.io/text-fabric/Api/General/#search-templates)

There are two kinds of experiment parameters: inventories and frames. Inventories count each feature that co-occurs with a verb individually. Frames count all of the features within the verb's clause together as a single unit, or frame.
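
The difference between the two counting schemes can be sketched with a few toy clauses. Everything in this example is hypothetical: the clause records, the feature labels, and the two helper functions are illustrations only, and the actual `Experiment` class may build its counts differently.

```python
from collections import Counter

# Hypothetical toy clauses: each has a verb and the features found in its clause.
clauses = [
    {'verb': 'NTN[.qal', 'features': ['Objc', 'Cmpl']},
    {'verb': 'NTN[.qal', 'features': ['Objc']},
    {'verb': 'HLK[.qal', 'features': ['Cmpl']},
]

def count_inventory(clauses):
    """Inventory style: count every feature separately for its verb."""
    counts = Counter()
    for clause in clauses:
        for feature in clause['features']:
            counts[(clause['verb'], feature)] += 1
    return counts

def count_frames(clauses):
    """Frame style: count the clause's whole feature set as one unit."""
    counts = Counter()
    for clause in clauses:
        # sort so that the same set of features always yields the same frame label
        frame = '|'.join(sorted(clause['features']))
        counts[(clause['verb'], frame)] += 1
    return counts

inventory = count_inventory(clauses)
frames = count_frames(clauses)
print(inventory)
print(frames)
```

In the inventory counts, the first clause contributes one observation to each of two features. In the frame counts, the same clause contributes a single observation to one combined frame.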

In [2]:
experiments = {}


for label, exp_type in params.items():    
    print(f'processing {label} experiments...')
    for name, experiment_params in exp_type.items():
        print(f'\tbuilding {name}...')
        is_frame = label != 'inventory'  # every non-inventory group is counted as frames
        min_obs = 10 if name != 'vd_par_lex' else 1
        experiments[name] = Experiment(experiment_params, tf=tf_api, frame=is_frame, min_observation=min_obs)
        print('\t\tfinished')
    
print('finished processing experiments...')
print(f'\t{len(experiments)} experiments loaded.')

processing inventory experiments...
	building vi_s_lex...
		finished
	building vi_s_domain...
		finished
	building vi_o_pa...
		finished
	building vi_o_lex...
		finished
	building vi_o_domain...
		finished
	building vi_cmpl_pa...
		finished
	building vi_cmpl_lex...
		finished
	building vi_cmpl_domain...
		finished
	building vi_adj+_pa...
		finished
	building vi_adj+_lex...
		finished
	building vi_adj+_domain...
		finished
	building vd_par_lex...
		finished
	building vd_con_window...
		finished
	building vd_con_clause...
		finished
	building vd_con_chain...
		finished
processing frame experiments...
	building vf_argAll_pa...
		finished
	building vf_argAll_lex...
		finished
	building vf_argAll_domain...
		finished
	building vf_cmpl_lex...
		finished
	building vf_cmpl_domain...
		finished
	building vf_adj+_lex...
		finished
	building vf_adj+_domain...
		finished
finished processing experiments...
	22 experiments loaded.


## Analysis

In [3]:
# show the 15 verbs with the most observations in each experiment
for i, (name, exp) in enumerate(experiments.items(), start=1):
    print(i, name)
    print(exp.data.sum().sort_values(ascending=False).head(15))
    print('\n')

1 vi_s_lex
>MR[.qal     2027.0
HJH[.qal     1763.0
BW>[.qal      797.0
<FH[.qal      450.0
DBR[.piel     373.0
JY>[.qal      351.0
NTN[.qal      340.0
R>H[.qal      276.0
CM<[.qal      266.0
<LH[.qal      265.0
HLK[.qal      263.0
YWH[.piel     253.0
LQX[.qal      241.0
CLX[.qal      232.0
QWM[.qal      208.0
dtype: float64


2 vi_s_domain
>MR[.qal     1948.0
HJH[.qal     1411.0
BW>[.qal      658.0
<FH[.qal      381.0
DBR[.piel     357.0
NTN[.qal      314.0
JY>[.qal      286.0
YWH[.piel     250.0
CM<[.qal      246.0
R>H[.qal      234.0
LQX[.qal      231.0
CLX[.qal      224.0
<LH[.qal      219.0
HLK[.qal      218.0
QWM[.qal      182.0
dtype: float64


3 vi_o_pa
>MR[.qal     5097.0
HJH[.qal     3375.0
<FH[.qal     1927.0
BW>[.qal     1691.0
NTN[.qal     1635.0
HLK[.qal     1182.0
R>H[.qal     1007.0
CM<[.qal      935.0
LQX[.qal      896.0
DBR[.piel     873.0
JD<[.qal      673.0
>KL[.qal      647.0
CWB[.qal      634.0
JY>[.qal      624.0
QR>[.qal      601.0
dtype: float64


4 vi_o_lex
NTN