# Demonstrate a way to do cross validation with DeepDive

Note: this is not k-fold cross validation.

Cross validation here is essentially repeated random resampling: each run draws a fresh random training/test split, so the same sentence can land in the test set of more than one run.

We create the app once, then repeat `deepdive initdb` and `deepdive run`, accumulating stats from each run.

We'll recreate one of our previous apps here and loop over it to collect our stats.

## If we want true k-fold cross validation
It would be a little more difficult to do K-fold cross validation because DeepDive creates the test and training splits itself randomly.
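
For contrast, here is a minimal sketch of what k-fold assignment looks like outside DeepDive: the items are shuffled once and dealt into k disjoint folds, so every item is tested exactly once. The function names and the sentence ids are hypothetical, and nothing here is wired into DeepDive, which would also need a way to control which labels are held out.

```python
import random

def assign_folds(item_ids, k, seed=0):
    """Shuffle the ids once and deal them into k disjoint folds."""
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)
    # Fold j takes every k-th id starting at offset j,
    # so fold sizes differ by at most one.
    return [ids[j::k] for j in range(k)]

def kfold_splits(item_ids, k, seed=0):
    """Yield (train_ids, test_ids) pairs; each id is in exactly one test set."""
    folds = assign_folds(item_ids, k, seed)
    for j, test_ids in enumerate(folds):
        train_ids = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train_ids, test_ids

sentence_ids = range(20)  # hypothetical ids, for illustration only
for train_ids, test_ids in kfold_splits(sentence_ids, k=4):
    print(len(train_ids), len(test_ids))
```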

Note that the randomness affects both which items are selected for the training and test sets and how many items end up in each set.

For example, when we specify 75% training and 25% test in the app's .conf file, DeepDive might instead take 73% and 27%.
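
One way such drift in the split sizes can arise is if each labeled item is held out independently with the configured probability, rather than by taking an exact count. That mechanism is an assumption made for illustration, not something confirmed here about DeepDive's internals. The sketch below simulates it; the item count and function name are hypothetical.

```python
import random

def random_holdout(n_items, holdout_fraction, rng):
    """Hold out each item independently with probability holdout_fraction."""
    return [i for i in range(n_items) if rng.random() < holdout_fraction]

rng = random.Random(42)
n_items = 2000  # hypothetical number of labeled sentences
for run in range(5):
    held_out = random_holdout(n_items, 0.25, rng)
    # The realized test fraction hovers around 0.25 but is rarely exact.
    print("run {}: test fraction = {:.3f}".format(run, len(held_out) / n_items))
```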

## Start Server

In [26]:
!pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start # deepdive

pg_ctl: another server might be running; trying to start server anyway
server starting


## Use the same app creation method and one of the previous experiment sets
The app creation is the same as in ./11_2_aud_per_vs_others...

In [27]:
import os
import shutil
import errno
import glob

dd_sent_sources = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences'
dd_app_dir = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app'
templates = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/templates_deepdive_app_bagofwords/'
conf_matrix_r_src = '/Users/ccarey/Documents/Projects/NAMI/rdoc/scripts/report_dd_confusion_matrix.R' # creates tsv and pdf reports.

In [28]:
def create_dd_app(template_dir, app_name, input_raw, input_annotated):
    '''Sets up a DeepDive app from a template directory and populates its input data.'''
    try:
        shutil.copytree(template_dir, app_name)
    except OSError as err:
        print("Error copying {} to {}: {}".format(template_dir, app_name, err))
    # create / overwrite unique postgres db name for this app:
    with open(os.path.join(app_name, 'db.url'), 'w') as f:
        f.write('postgresql://localhost/{}\n'.format(app_name))
    try:
        shutil.copyfile(input_raw, os.path.join(app_name, 'input', 'raw_sentences'))
    except OSError as err:
        print("Error copying raw sentences: {}".format(err))
    try:
        shutil.copyfile(input_annotated, os.path.join(app_name, 'input', 'annotated_sentences'))
    except OSError as err:
        print("Error copying annotated sentences: {}".format(err))

def my_dd_table_sql_str():
    '''Returns the SQL query used to populate cc_all_predictions.tsv
    (or a similar table) in the deepdive app.
    
    That tsv file is in turn used by our confusion matrix R script to
    generate stats and plots from our deepdive apps.
    '''
    cmd = ('SELECT a.has_term, r.terms, a.sentence_id, expectation '
           'FROM '
           '_annotated_sentences_has_term_inference as a JOIN '
           '_raw_sentences as r ON '
           'a.sentence_id = r.sentence_id '
           'ORDER BY a.sentence_id')
    return cmd

### Create a deepdive app named according to its input data.

In [29]:
# all apps must be at depth = 1, in a deepdive_app home directory, in our case 'deepdive_app/'
%cd {dd_app_dir}

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app


Make sure we know our input data before trying to create the app.

In [30]:
sentence_collections = !ls -d {dd_sent_sources}/Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aud_1_1000
sentence_input = sentence_collections[0]
app_name = 'AP_2_1000_vs_AR_1_1000_pdx_AP_1_1000_cval'
app_name

'AP_2_1000_vs_AR_1_1000_pdx_AP_1_1000_cval'

In [31]:
print('Creating deepdive_app {} at : {}'.format(app_name, os.getcwd()))
create_dd_app(template_dir=templates, 
              app_name=app_name,
              input_raw=os.path.join(sentence_input, 'raw_sentences'),
              input_annotated=os.path.join(sentence_input, 'annotated_sentences'))

Creating deepdive_app AP_2_1000_vs_AR_1_1000_pdx_AP_1_1000_cval at : /Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app


## Loop through the app 10 times, collecting stats

In [32]:
cmd = my_dd_table_sql_str()  # same query string as defined above, instead of a second hard-coded copy

%cd {dd_app_dir}
%cd {app_name}

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app
/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/AP_2_1000_vs_AR_1_1000_pdx_AP_1_1000_cval


In [33]:
# TODO: we could make this more pythonic...

# In each loop, the R script generates new stats and confusion matrix files;
# we accumulate just the test portion each time.
# The '' passed as the last argument to the R script appears to be an alternate title for plots (unconfirmed).
def run_deepdive(run_stats_fname):
    !deepdive initdb 1> /dev/null 2>&1
    !deepdive run 1> /dev/null 2>&1
    !deepdive sql eval '{cmd}' format=tsv > cc_all_predictions.tsv # yes, "{cmd}" must be single or double quoted
    !Rscript {conf_matrix_r_src} cc_all_predictions.tsv "{run_stats_fname}" '' 1> /dev/null 2>&1

for i in range(10):
    run_stats_fname = app_name + str(i)
    run_deepdive(run_stats_fname)
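
As one possible take on the TODO above, the same four steps could be driven with `subprocess` instead of `!` shell escapes. This is a sketch, not a tested replacement: `build_commands`, `run_once`, and the `runner` parameter are hypothetical names, and it assumes the `deepdive` and `Rscript` executables are on the PATH and that the current directory is the app directory.

```python
import subprocess

def build_commands(sql, stats_name, r_script, predictions='cc_all_predictions.tsv'):
    """Return the four steps of one run as argument lists (no shell quoting needed)."""
    return [
        ['deepdive', 'initdb'],
        ['deepdive', 'run'],
        ['deepdive', 'sql', 'eval', sql, 'format=tsv'],
        ['Rscript', r_script, predictions, stats_name, ''],
    ]

def run_once(sql, stats_name, r_script,
             predictions='cc_all_predictions.tsv', runner=subprocess.run):
    """Run initdb, run, export predictions, then the R report; raise if a step fails."""
    initdb, run, export, report = build_commands(sql, stats_name, r_script, predictions)
    quiet = dict(stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    runner(initdb, check=True, **quiet)
    runner(run, check=True, **quiet)
    # Redirect the query output into the predictions file, like `> file` in the shell.
    with open(predictions, 'w') as out:
        runner(export, check=True, stdout=out)
    runner(report, check=True, **quiet)

# Usage, mirroring the loop above (requires deepdive and Rscript to be installed):
# for i in range(10):
#     run_once(cmd, app_name + str(i), conf_matrix_r_src)
```

With `check=True`, a failing step raises `subprocess.CalledProcessError` instead of being silently discarded along with its output.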

In [34]:
pattern = '*cval*' + 'test_only*' + 'confmatr.tsv'
!cat {pattern} > 'cv_conf_matrix.tsv'

In [35]:
pattern = '*cval*' + 'test_only*' + 'stats.tsv'
!cat {pattern} > 'cv_stats.tsv'
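
The two `cat` cells could also be written in Python with the `glob` module imported earlier. The helper below is a sketch with a hypothetical name, `concat_files`. It opens the output in write mode, so re-running it replaces the file rather than adding duplicate rows.

```python
import glob

def concat_files(pattern, out_path):
    """Concatenate every file matching pattern into out_path, in sorted name order."""
    # Skip the output file itself in case the pattern happens to match it.
    paths = sorted(p for p in glob.glob(pattern) if p != out_path)
    with open(out_path, 'w') as out:
        for path in paths:
            with open(path) as src:
                out.write(src.read())
    return paths

# Usage, mirroring the two cells above:
# concat_files('*cval*test_only*confmatr.tsv', 'cv_conf_matrix.tsv')
# concat_files('*cval*test_only*stats.tsv', 'cv_stats.tsv')
```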

In [36]:
!grep '"Specificity\|"Accuracy\|Sensitivity"' cv_stats.tsv | sort -s -d -k 1,1 > cc_cross_validation.tsv

#cat cv_stats_matrix.tsv  | grep '"Specificity\|"Accuracy\|Sensitivity"' | sort -s -d -k 1,1 > cc_cross_validation.tsv

## Generate the plot of the cross validation statistics.
Run the following code in R in this deepdive app.