# Generate DeepDive input data for each of several experiments

## Steps in creating our DeepDive app

**This notebook is mainly concerned with step 1.2 below.**

1) Set up input data:
  - 1.1 Set up input files of sentences with labels.
    - These serve as source data, or a parts list, for each specific app.
    - Raw, with labels.
    - NLP-processed or annotated for the topic at hand, with labels.
      - This is where bag of words is performed.
  - 1.2 Set up app-specific combinations of the raw or annotated sentences.
    - These will later be copied into our DeepDive app's input folder.

2) Edit master templates as necessary:
  - edit input.sh if necessary (we haven't needed to yet.)

3) Copy and modify deepdive templates and input data to create an app.
    - cc_setup_deepdive template_source_dir topic app_name num_training num_test
    - mkdir
    - copy template files
    - assign app url
    - copy input data files (the sentence or abstract combinations above)

4) Run deepdive and our reporting scripts for our app.
    - cc_run_and_stats_on_deepdive
    - deepdive initdb
    - deepdive run
    - sql extract confusion matrix based stats
    - R graph stats
    - sql report top terms

## App-specific input data creation

The goal is to combine sentences from various topic-label sets into a single input file per app.

Experiment names follow this pattern:

    Positive_training__vs__Negative_training__pdx__topredictset

Thus:

    Auditory_1000__vs__Arousal_1000__pdx__un_Aro_156

This means 1000 Auditory Perception abstracts labeled true, 1000 Arousal abstracts labeled false, and a separate Arousal set of 156 used as unknowns to predict.
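
The naming pattern above can be split mechanically. The following is a minimal sketch, assuming names always contain the `__vs__` and `__pdx__` separators exactly once each; the helper name `parse_experiment_name` and the dictionary keys are hypothetical and not part of the original workflow.

```python
import re

# Hypothetical helper: splits an experiment name of the form
# Positive__vs__Negative__pdx__ToPredict into its three parts.
def parse_experiment_name(name):
    match = re.fullmatch(r"(.+?)__vs__(.+?)__pdx__(.+)", name)
    if match is None:
        raise ValueError(f"not an experiment name: {name!r}")
    positive, negative, to_predict = match.groups()
    return {"positive": positive, "negative": negative, "to_predict": to_predict}

parts = parse_experiment_name("Auditory_1000__vs__Arousal_1000__pdx__un_Aro_156")
print(parts["positive"], parts["negative"], parts["to_predict"])
```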

### Recall: our pieces are as follows
Each piece exists in (true, false, nulled) and (annotated, raw) versions.

The sentences (actually abstracts) were already parsed and labeled in the notebook `10_3_design_deepdive_wrapper.ipynb`.

In [9]:
%cd '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences'

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences


In [10]:
!find ./all_sentences -type f -name "annotated*false" | parallel wc -l {} | sort -k2

     154 ./all_sentences/annotated_sentences_arousal_156_false
    1000 ./all_sentences/annotated_sentences_arousal_1_1000_false
    1000 ./all_sentences/annotated_sentences_arousal_2_1000_false
     146 ./all_sentences/annotated_sentences_auditory_146_false
    1000 ./all_sentences/annotated_sentences_auditory_1_1000_false
    1000 ./all_sentences/annotated_sentences_auditory_2_1000_false
    1000 ./all_sentences/annotated_sentences_disease_1_1000_false
    1000 ./all_sentences/annotated_sentences_psyc_1_1000_false


Note that the arousal_156 set was actually trimmed to 154 lines; the file name still says 156.
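
The same per-file line counts can be obtained without `find`, `parallel` and `wc`. This is a rough Python sketch; the function name `count_lines` is hypothetical. One assumption to keep in mind: iterating over a file also counts a final line that lacks a trailing newline, whereas `wc -l` counts newline characters only.

```python
from pathlib import Path

# Hypothetical helper: returns {file name: line count} for every file
# in `directory` whose name matches the glob `pattern`.
def count_lines(directory, pattern):
    counts = {}
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            with path.open("rb") as handle:
                counts[path.name] = sum(1 for _ in handle)
    return counts

# Example call matching the shell command above (run from task_data_sentences):
# count_lines("./all_sentences", "annotated*false")
```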

## Create input data for Question 1: With a small 'truth' set and a larger 'false' set, test the effect of the distance of the 'false' set.

Given a limited set of positives and a large set of negatives, how is performance dictated by the distance?

Here distance from our positive training set is proxied by topic, ordered 'Disease >> Psyc >> Arousal' from farthest to nearest.

    - Auditory_146__vs__Arousal_1_1000__pdx__un_Aro_154
      - and is the 'non-group' correctly predicted.
    - Auditory_146__vs__Arousal_1_1000__pdx__un_Aud_1_1000
      - and is the 'group' correctly predicted.
    - Auditory_146__vs__Disease_1000__pdx__un_Aud_1_1000
    - Auditory_146__vs__Psyc_1000__pdx__un_Aud_1_1000

In [28]:
%mkdir Auditory_146__vs__Arousal_1_1000__pdx__un_Aro_154
%mkdir Auditory_146__vs__Arousal_1_1000__pdx__un_Aud_1_1000
%mkdir Auditory_146__vs__Psyc_1000__pdx__un_Aud_1_1000
%mkdir Auditory_146__vs__Disease_1000__pdx__un_Aud_1_1000

mkdir: Auditory_146__vs__Arousal_1_1000__pdx__un_Aro_154: File exists
mkdir: Auditory_146__vs__Arousal_1_1000__pdx__un_Aud_1_1000: File exists
mkdir: Auditory_146__vs__Psyc_1000__pdx__un_Aud_1_1000: File exists
mkdir: Auditory_146__vs__Disease_1000__pdx__un_Aud_1_1000: File exists
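
The "File exists" messages above are harmless; they only show that the cell was re-run. A sketch of an idempotent alternative, assuming it is acceptable to create the directories from Python rather than with the `%mkdir` magic (the function name `ensure_dirs` is hypothetical):

```python
import os

# Directory names as created in the cell above.
EXPERIMENT_DIRS = [
    "Auditory_146__vs__Arousal_1_1000__pdx__un_Aro_154",
    "Auditory_146__vs__Arousal_1_1000__pdx__un_Aud_1_1000",
    "Auditory_146__vs__Psyc_1000__pdx__un_Aud_1_1000",
    "Auditory_146__vs__Disease_1000__pdx__un_Aud_1_1000",
]

# Hypothetical helper: creates each directory under `base`,
# silently skipping any that already exist.
def ensure_dirs(names, base="."):
    for name in names:
        os.makedirs(os.path.join(base, name), exist_ok=True)
```

Calling `ensure_dirs(EXPERIMENT_DIRS)` from the `task_data_sentences` directory can be repeated safely.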


In [31]:
# Auditory_146__vs__Arousal_1_1000__pdx__un_Aro_154

raw_sources = ('./all_sentences/raw_sentences_auditory_146_true '
               './all_sentences/raw_sentences_arousal_1_1000_false '
               './all_sentences/raw_sentences_arousal_156_nulled '
               './all_sentences/raw_sentences_auditory_146_nulled '
               './all_sentences/raw_sentences_arousal_1_1000_nulled')
!cat {raw_sources} > 'Auditory_146__vs__Arousal_1_1000__pdx__un_Aro_154/raw_sentences'

annotated_sources = ('./all_sentences/annotated_sentences_auditory_146_true '
               './all_sentences/annotated_sentences_arousal_1_1000_false '
               './all_sentences/annotated_sentences_arousal_156_nulled '
               './all_sentences/annotated_sentences_auditory_146_nulled '
               './all_sentences/annotated_sentences_arousal_1_1000_nulled')
#annotated_sources
!cat {annotated_sources} > 'Auditory_146__vs__Arousal_1_1000__pdx__un_Aro_154/annotated_sentences'

In [32]:
# Auditory_146__vs__Arousal_1_1000__pdx__un_Aud_1_1000
raw_sources = ('./all_sentences/raw_sentences_auditory_146_true '
               './all_sentences/raw_sentences_arousal_1_1000_false '
               './all_sentences/raw_sentences_auditory_1_1000_nulled '
               './all_sentences/raw_sentences_auditory_146_nulled '
               './all_sentences/raw_sentences_arousal_1_1000_nulled')
!cat {raw_sources} > 'Auditory_146__vs__Arousal_1_1000__pdx__un_Aud_1_1000/raw_sentences'

annotated_sources = ('./all_sentences/annotated_sentences_auditory_146_true '
               './all_sentences/annotated_sentences_arousal_1_1000_false '
               './all_sentences/annotated_sentences_auditory_1_1000_nulled '
               './all_sentences/annotated_sentences_auditory_146_nulled '
               './all_sentences/annotated_sentences_arousal_1_1000_nulled')
!cat {annotated_sources} > 'Auditory_146__vs__Arousal_1_1000__pdx__un_Aud_1_1000/annotated_sentences'

In [33]:
# Auditory_146__vs__Psyc_1000__pdx__un_Aud_1_1000
raw_sources = ('./all_sentences/raw_sentences_auditory_146_true '
               './all_sentences/raw_sentences_psyc_1_1000_false '
               './all_sentences/raw_sentences_auditory_1_1000_nulled '
               './all_sentences/raw_sentences_auditory_146_nulled '
               './all_sentences/raw_sentences_psyc_1_1000_nulled')
!cat {raw_sources} > 'Auditory_146__vs__Psyc_1000__pdx__un_Aud_1_1000/raw_sentences'

annotated_sources = ('./all_sentences/annotated_sentences_auditory_146_true '
               './all_sentences/annotated_sentences_psyc_1_1000_false '
               './all_sentences/annotated_sentences_auditory_1_1000_nulled '
               './all_sentences/annotated_sentences_auditory_146_nulled '
               './all_sentences/annotated_sentences_psyc_1_1000_nulled')
!cat {annotated_sources} > 'Auditory_146__vs__Psyc_1000__pdx__un_Aud_1_1000/annotated_sentences'

In [34]:
# Auditory_146__vs__Disease_1000__pdx__un_Aud_1_1000
raw_sources = ('./all_sentences/raw_sentences_auditory_146_true '
               './all_sentences/raw_sentences_disease_1_1000_false '
               './all_sentences/raw_sentences_auditory_1_1000_nulled '
               './all_sentences/raw_sentences_auditory_146_nulled '
               './all_sentences/raw_sentences_disease_1_1000_nulled')
!cat {raw_sources} > 'Auditory_146__vs__Disease_1000__pdx__un_Aud_1_1000/raw_sentences'

annotated_sources = ('./all_sentences/annotated_sentences_auditory_146_true '
               './all_sentences/annotated_sentences_disease_1_1000_false '
               './all_sentences/annotated_sentences_auditory_1_1000_nulled '
               './all_sentences/annotated_sentences_auditory_146_nulled '
               './all_sentences/annotated_sentences_disease_1_1000_nulled')
!cat {annotated_sources} > 'Auditory_146__vs__Disease_1000__pdx__un_Aud_1_1000/annotated_sentences'
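
The four cells above follow one pattern: positive set as true, negative set as false, the set to predict as nulled, then the nulled copies of the positive and negative sets. Below is a minimal sketch of that pattern as a reusable function. The names `source_files` and `build_inputs` are hypothetical, and the sketch assumes the `{kind}_sentences_{set}_{label}` file naming seen in `./all_sentences`.

```python
import shutil
from pathlib import Path

# Hypothetical helper: lists the five source files in the same order
# as the raw_sources / annotated_sources strings in the cells above.
def source_files(kind, positive, negative, unknown, src_dir="./all_sentences"):
    prefix = f"{src_dir}/{kind}_sentences_"
    return [
        f"{prefix}{positive}_true",
        f"{prefix}{negative}_false",
        f"{prefix}{unknown}_nulled",
        f"{prefix}{positive}_nulled",
        f"{prefix}{negative}_nulled",
    ]

# Hypothetical helper: concatenates the sources into
# app_dir/raw_sentences and app_dir/annotated_sentences, like `cat a b c > out`.
def build_inputs(app_dir, positive, negative, unknown, src_dir="./all_sentences"):
    app_path = Path(app_dir)
    app_path.mkdir(parents=True, exist_ok=True)
    for kind in ("raw", "annotated"):
        with open(app_path / f"{kind}_sentences", "wb") as out:
            for name in source_files(kind, positive, negative, unknown, src_dir):
                with open(name, "rb") as src:
                    shutil.copyfileobj(src, out)
```

Under these assumptions, the first cell above would correspond to `build_inputs('Auditory_146__vs__Arousal_1_1000__pdx__un_Aro_154', 'auditory_146', 'arousal_1_1000', 'arousal_156')`.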

## Create input data for Question 2: Do we get substantially better with a larger 'truth' set and a large 'false' set?

Given a large set of positives and a large set of negatives, how is performance dictated by the distance?

Here distance from our positive training set is proxied by topic, ordered 'Disease >> Psyc >> Arousal' from farthest to nearest.

    - Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aro_154
      - and is the 'non-group' correctly predicted.
    - Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aud_1_1000
      - and is the 'group' correctly predicted.
    - Auditory_2_1000__vs__Disease_1000__pdx__un_Aud_1_1000
    - Auditory_2_1000__vs__Psyc_1000__pdx__un_Aud_1_1000

In [30]:
%mkdir Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aro_154
%mkdir Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aud_1_1000
%mkdir Auditory_2_1000__vs__Disease_1000__pdx__un_Aud_1_1000
%mkdir Auditory_2_1000__vs__Psyc_1000__pdx__un_Aud_1_1000

mkdir: Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aud_1_1000: File exists
mkdir: Auditory_2_1000__vs__Disease_1000__pdx__un_Aud_1_1000: File exists
mkdir: Auditory_2_1000__vs__Psyc_1000__pdx__un_Aud_1_1000: File exists


In [35]:
# Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aro_154

raw_sources = ('./all_sentences/raw_sentences_auditory_2_1000_true '
               './all_sentences/raw_sentences_arousal_1_1000_false '
               './all_sentences/raw_sentences_arousal_156_nulled '
               './all_sentences/raw_sentences_auditory_2_1000_nulled '
               './all_sentences/raw_sentences_arousal_1_1000_nulled')
!cat {raw_sources} > 'Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aro_154/raw_sentences'

annotated_sources = ('./all_sentences/annotated_sentences_auditory_2_1000_true '
               './all_sentences/annotated_sentences_arousal_1_1000_false '
               './all_sentences/annotated_sentences_arousal_156_nulled '
               './all_sentences/annotated_sentences_auditory_2_1000_nulled '
               './all_sentences/annotated_sentences_arousal_1_1000_nulled')
#annotated_sources
!cat {annotated_sources} > 'Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aro_154/annotated_sentences'

In [36]:
# Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aud_1_1000
raw_sources = ('./all_sentences/raw_sentences_auditory_2_1000_true '
               './all_sentences/raw_sentences_arousal_1_1000_false '
               './all_sentences/raw_sentences_auditory_1_1000_nulled '
               './all_sentences/raw_sentences_auditory_2_1000_nulled '
               './all_sentences/raw_sentences_arousal_1_1000_nulled')
!cat {raw_sources} > 'Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aud_1_1000/raw_sentences'

annotated_sources = ('./all_sentences/annotated_sentences_auditory_2_1000_true '
               './all_sentences/annotated_sentences_arousal_1_1000_false '
               './all_sentences/annotated_sentences_auditory_1_1000_nulled '
               './all_sentences/annotated_sentences_auditory_2_1000_nulled '
               './all_sentences/annotated_sentences_arousal_1_1000_nulled')
!cat {annotated_sources} > 'Auditory_2_1000__vs__Arousal_1_1000__pdx__un_Aud_1_1000/annotated_sentences'

In [37]:
# Auditory_2_1000__vs__Psyc_1000__pdx__un_Aud_1_1000
raw_sources = ('./all_sentences/raw_sentences_auditory_2_1000_true '
               './all_sentences/raw_sentences_psyc_1_1000_false '
               './all_sentences/raw_sentences_auditory_1_1000_nulled '
               './all_sentences/raw_sentences_auditory_2_1000_nulled '
               './all_sentences/raw_sentences_psyc_1_1000_nulled')
!cat {raw_sources} > 'Auditory_2_1000__vs__Psyc_1000__pdx__un_Aud_1_1000/raw_sentences'

annotated_sources = ('./all_sentences/annotated_sentences_auditory_2_1000_true '
               './all_sentences/annotated_sentences_psyc_1_1000_false '
               './all_sentences/annotated_sentences_auditory_1_1000_nulled '
               './all_sentences/annotated_sentences_auditory_2_1000_nulled '
               './all_sentences/annotated_sentences_psyc_1_1000_nulled')
!cat {annotated_sources} > 'Auditory_2_1000__vs__Psyc_1000__pdx__un_Aud_1_1000/annotated_sentences'

In [38]:
# Auditory_2_1000__vs__Disease_1000__pdx__un_Aud_1_1000
raw_sources = ('./all_sentences/raw_sentences_auditory_2_1000_true '
               './all_sentences/raw_sentences_disease_1_1000_false '
               './all_sentences/raw_sentences_auditory_1_1000_nulled '
               './all_sentences/raw_sentences_auditory_2_1000_nulled '
               './all_sentences/raw_sentences_disease_1_1000_nulled')
!cat {raw_sources} > 'Auditory_2_1000__vs__Disease_1000__pdx__un_Aud_1_1000/raw_sentences'

annotated_sources = ('./all_sentences/annotated_sentences_auditory_2_1000_true '
               './all_sentences/annotated_sentences_disease_1_1000_false '
               './all_sentences/annotated_sentences_auditory_1_1000_nulled '
               './all_sentences/annotated_sentences_auditory_2_1000_nulled '
               './all_sentences/annotated_sentences_disease_1_1000_nulled')
!cat {annotated_sources} > 'Auditory_2_1000__vs__Disease_1000__pdx__un_Aud_1_1000/annotated_sentences'
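
After concatenation, a quick sanity check is to compare the line count of each combined file with the sum of its sources. The sketch below assumes every source file ends with a newline; if one does not, `cat` joins its last line to the next file's first line and the two counts will differ, which is itself worth knowing. The function names are hypothetical.

```python
# Hypothetical helper: counts lines in one file.
def count_file_lines(path):
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)

# Hypothetical helper: returns (expected, actual), where expected is the
# total line count of the sources and actual is that of the combined file.
def check_combined(combined_path, source_paths):
    expected = sum(count_file_lines(path) for path in source_paths)
    actual = count_file_lines(combined_path)
    return expected, actual
```

For example, `check_combined('Auditory_2_1000__vs__Disease_1000__pdx__un_Aud_1_1000/raw_sentences', raw_sources.split())` should return two equal numbers when run right after the last cell.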