# Generate labeled input source data for deepdive apps (with additional sketches of pipeline strategy)
The goal is to support multiple rounds of annotation-based training and prediction using DeepDive, with minimal programming between rounds.

## General approach to creating our simple deepdive apps based on bag of words.

This notebook is mainly concerned with step 1.1 listed below.

1) Setup input data:
  - 1.1 setup input files of sentences with labels
    - these serve as source data or a parts list for each specific app.
    - raw with labels
    - NLP processed or annotated for topic at hand with labels.
      - This is where bag of words is performed.
  - 1.2 setup app specific combinations of the raw or annotated sentences.
    - these will later be copied into our deepdive apps input folder.

2) Edit master templates as necessary:
  - edit input.sh if necessary (we haven't needed to yet.)

3) Copy and modify deepdive templates and input data to create an app.
    - cc_setup_deepdive template_source_dir topic app_name num_training num_test
    - mkdir
    - copy template files
    - assign app url
    - copy input data files (the sentence or abstract combinations above)

4) Run deepdive and our reporting scripts for our app.
    - cc_run_and_stats_on_deepdive
    - deepdive initdb
    - deepdive run
    - sql extract confusion matrix based stats
    - R graph stats
    - sql report top terms
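
The confusion-matrix statistics in step 4 reduce to a few ratios. A minimal sketch, assuming the counts of true/false positives and negatives have already been extracted from the database; the function name and the example counts are illustrative and are not taken from our reporting scripts:

```python
def confusion_stats(tp, fp, fn, tn):
    """Return precision, recall, F1 and accuracy from confusion-matrix counts."""
    # float() keeps the divisions exact on Python 2 as well as Python 3
    precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
    recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fp + fn + tn
    accuracy = float(tp + tn) / total if total else 0.0
    return {'precision': precision, 'recall': recall,
            'f1': f1, 'accuracy': accuracy}

# made-up counts, for illustration only
stats = confusion_stats(tp=80, fp=20, fn=10, tn=90)
```

Guarding each denominator avoids a ZeroDivisionError when a small subsample happens to contain no predicted or no actual positives.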

### General extensions of deepdive app creation
Steps 3 and 4 can conceivably be looped:
 - to get validation statistics.
 - to test results at various levels of subsampling of input data.

Testing other variations of NLP generated data would require editing step 1 or the deepdive .conf from step 3.
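
The subsampling idea could look roughly like the following. It is only a sketch: `subsample_lines`, the toy records, and the chosen sizes are hypothetical, and the actual app setup and run (steps 3 and 4) are left as a comment.

```python
import random

def subsample_lines(lines, sizes, seed=0):
    """Yield (size, sample) pairs; each sample is drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so that rounds are repeatable
    for size in sizes:
        if size > len(lines):
            continue  # cannot draw more records than we have
        yield size, rng.sample(lines, size)

# toy records standing in for lines of a labeled sentence file
records = ['pmid_%d\tsome abstract\tt\t\\N' % i for i in range(100)]
for size, sample in subsample_lines(records, sizes=[25, 50, 100, 200]):
    # here one would write `sample` to the app's input folder,
    # then set up and run the app (steps 3 and 4)
    print(size, len(sample))
```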

In [None]:
#-------------------------------------------------------------------!
#
# For training data, we have the known 'standards' ('t' or 'f').
#
# But we also pseudoscore the training data to postgres nulls ('\N').
#
# The null pseudoscores are in order to allow us to predict on
# our training set because deepdive does not otherwise allow us to 
# easily view how the training set scored with the exception of the
# portion of the training set 'heldout' from training.
#
#-------------------------------------------------------------------!

# auditory = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/batch_05_AP_pmids')
# unknown_prob_auditory = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/AP00_1000_ids')
# disease = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/diss_1000_ids')
# psyc = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/psyc_1000_ids') # closer than disease, assume non-overlapping
# arousal = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/AR00_1000_ids') # closer than psyc, assume non-overlapping

## Write files for use as input sources for a simple deepdive app based on 't' 'f' labels
### Raw sentences (are abstracts extracted from our medic database).

## start server (required here for medic)

In [None]:
#bash
!pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start # deepdive and medic

In [None]:
# bowp.append_raw_sentences('raw_sentences_auditory_perception', auditory, 't')
# bowp.append_raw_sentences('raw_sentences_auditory_perception_nulled', auditory, '\N') # pseudoscoring nulls
# bowp.append_raw_sentences('raw_sentences_disease', disease, 'f')
# bowp.append_raw_sentences('raw_sentences_disease_nulled', disease, '\N') # pseudoscoring nulls
# bowp.append_raw_sentences('raw_sentences_psyc', psyc, 'f')
# bowp.append_raw_sentences('raw_sentences_psyc_nulled', psyc, '\N') # pseudoscoring nulls
# bowp.append_raw_sentences('raw_sentences_arousal', arousal, 'f')
# bowp.append_raw_sentences('raw_sentences_arousal_nulled', arousal, '\N') # pseudoscoring nulls

# bowp.append_raw_sentences('raw_sentences_unknown_auditory_perception', unknown_prob_auditory, '\N') # prediction set

### Annotated sentences (are abstracts from our medic database, processed as bags of words).

In [None]:
# compute intensive
# bowp.append_annotated_sentences('annotated_sentences_auditory_perception', 
#                                 bowp.get_scored_abstract_bow(auditory, 't'))
# bowp.append_annotated_sentences('annotated_sentences_auditory_perception_nulled', 
#                                 bowp.get_scored_abstract_bow(auditory, '\N')) # pseudoscoring nulls
# bowp.append_annotated_sentences('annotated_sentences_disease_false', 
#                                 bowp.get_scored_abstract_bow(disease, 'f'))
# bowp.append_annotated_sentences('annotated_sentences_disease_nulled', 
#                                 bowp.get_scored_abstract_bow(disease, '\N')) # pseudoscoring nulls
# bowp.append_annotated_sentences('annotated_sentences_psyc_false', 
#                                 bowp.get_scored_abstract_bow(psyc, 'f'))
# bowp.append_annotated_sentences('annotated_sentences_psyc_nulled', 
#                                 bowp.get_scored_abstract_bow(psyc, '\N')) # pseudoscoring nulls
# bowp.append_annotated_sentences('annotated_sentences_arousal_false', 
#                                 bowp.get_scored_abstract_bow(arousal, 'f'))
# bowp.append_annotated_sentences('annotated_sentences_arousal_nulled', 
#                                 bowp.get_scored_abstract_bow(arousal, '\N')) # pseudoscoring nulls

# bowp.append_annotated_sentences('annotated_sentences_unknown_auditory_perception_nulled',
#                                 bowp.get_scored_abstract_bow(unknown_prob_auditory, '\N')) # prediction set

## Add raw and annotated sentences for one of our well annotated 'positive' classes.

In [None]:
# arousal_156 = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/batch_04_AR_pmids') # nearly all of these are relevant
# bowp.append_raw_sentences('raw_sentences_arousal_156', arousal_156, 't')
# bowp.append_raw_sentences('raw_sentences_arousal_156_nulled', arousal_156, '\N') # pseudoscoring nulls
# bowp.append_raw_sentences('raw_sentences_arousal_156_false', arousal_156, 'f') # try it as a negative set too.

# bowp.append_annotated_sentences('annotated_sentences_arousal_156',
#                                 bowp.get_scored_abstract_bow(arousal_156, 't'))
# bowp.append_annotated_sentences('annotated_sentences_arousal_156_nulled',
#                                 bowp.get_scored_abstract_bow(arousal_156, '\N')) # pseudoscoring nulls
# bowp.append_annotated_sentences('annotated_sentences_arousal_156_false',
#                                 bowp.get_scored_abstract_bow(arousal_156, 'f')) # try it as a negative set too.

# For each topic create parts of the input data for deepdive.

For every topic, fetch sentences and save TSV files that include a column filled with a label.

't', 'f' and '\N' versions are saved for each topic file.

That way, we can mix and match them to obtain various deepdive train/test and/or prediction sets.
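
Mixing and matching amounts to concatenating the chosen part files. A minimal sketch, assuming every part file ends with a newline; `combine_files` is a hypothetical helper and not part of our scripts:

```python
import shutil

def combine_files(part_fnames, dest_fname):
    """Concatenate the given TSV part files, in order, into dest_fname."""
    with open(dest_fname, 'wb') as dest:
        for part in part_fnames:
            with open(part, 'rb') as src:
                shutil.copyfileobj(src, dest)  # streams, so large files are fine

# Example (part names as written later in this notebook; output name is made up):
# combine_files(['annotated_sentences_auditory_146_true',
#                'annotated_sentences_disease_1_1000_false',
#                'annotated_sentences_auditory_2_1000_nulled'],
#               'combined_input.tsv')
```

The order of the parts carries no meaning here; positives, negatives and nulled prediction records are told apart only by the label column.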

Note the annotated_sentences all derive from raw_sentences and could be processed by an extractor in deepdive instead.

## Why do we also save Null for datasets that will be used as positives?

For training data, we have the known 'standards' ('t' or 'f').

But we also pseudoscore the training data to postgres nulls ('\N').

The null pseudoscores are in order to allow us to predict on
our training set because deepdive does not otherwise allow us to 
easily view how the training set scored with the exception of the
portion of the training set 'heldout' from training.
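
To make the three variants concrete, here is a sketch of one record written with each label. The four-column layout (pmid, content, label, trailing null) is inferred from the grep output further down; the purpose of the last column is not documented here, and `labeled_rows` and the pmid are made up for illustration.

```python
# raw string: r'\N' is the two characters backslash + N
LABELS = {'true': 't', 'false': 'f', 'nulled': r'\N'}

def labeled_rows(pmid, text):
    """Return {suffix: TSV line} for one record, one line per label."""
    rows = {}
    for suffix, label in LABELS.items():
        # the trailing \N mirrors the extra null column seen in our output files
        rows[suffix] = '\t'.join([pmid, text, label, r'\N'])
    return rows

rows = labeled_rows('12345', 'an abstract')
```

A training run would then take the 'true' (or 'false') row as evidence and, if we also want to see how that same record scores, the 'nulled' row as a query.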

In [1]:
import os
import sys
sys.path.append('/Users/ccarey/Documents/my_scripts/charcar/nlp') # see appendix
import bag_of_words_parsing as bowp

def get_abstracts(pmid_list_file):
    abstracts=!medic --format tsv write --pmid-list {pmid_list_file} 2>/dev/null
    return([a.split('\t', 2) for a in abstracts])

def write_abstracts(pmid_list, name, dest_dir, append=False):
    '''
    Write raw and annotated abstracts into SQL acceptable TSV tables.

    Parameters:
        pmid_list (str): name of file containing pubmed ids.
        name (str): name of subject to be included in output filenames.
        dest_dir (str): directory in which the output files are written.
        append (bool): if False, existing output files are removed first.

    3 different labeled versions of each table are created so
    they can later be combined with others.

    Labeling supports deepdive apps based on the template found in
    'templates_deepdive_app_bagofwords'.

    The annotated abstracts are bags of words processed by NLP.

    The medic database is presumed to already contain records
    for the pmids in pmid_list.
    '''
    # '\\N' is the two-character string \N, which postgres reads as NULL
    labels = {'false': 'f', 'true': 't', 'nulled': '\\N'}
    raw_fname = os.path.join(dest_dir, 'raw_sentences_' + name)
    ann_fname = os.path.join(dest_dir, 'annotated_sentences_' + name)

    abstracts = get_abstracts(pmid_list)

    if not append:
        # start from scratch: remove any output left over from an earlier run
        for k in labels:
            for fname in (raw_fname, ann_fname):
                try:
                    os.remove(fname + '_' + k)
                except OSError:
                    pass  # file did not exist
    # TODO: For efficiency, in bag of words parsing,
    # TODO: could just do parsing once, and then apply the labels.
    for k, v in labels.items():
        bowp.append_raw_sentences(raw_fname + '_' + k, abstracts, v)
        bowp.append_annotated_sentences(ann_fname + '_' + k,
                                bowp.get_scored_abstract_bow(abstracts, v))
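
The TODO above (parse once, then apply the labels) might be approached as follows. This is a sketch only: `parse` stands in for whatever does the expensive bag-of-words work, and `label_parsed` and `toy_parse` are hypothetical; none of them is part of `bag_of_words_parsing`.

```python
def label_parsed(abstracts, parse, labels):
    """Parse each abstract once, then pair the parsed result with every label.

    abstracts : list of (pmid, text) pairs
    parse     : callable doing the expensive NLP step on one text
    labels    : dict mapping a filename suffix to a label value
    Returns a dict mapping suffix -> list of (pmid, parsed, label).
    """
    parsed = [(pmid, parse(text)) for pmid, text in abstracts]  # expensive, done once
    return dict((suffix, [(pmid, bow, label) for pmid, bow in parsed])
                for suffix, label in labels.items())

calls = []
def toy_parse(text):
    calls.append(text)  # record how often the parser runs
    return sorted(set(text.lower().split()))

out = label_parsed([('1', 'b a a'), ('2', 'c')], toy_parse,
                   {'true': 't', 'false': 'f', 'nulled': r'\N'})
```

With three labels this cuts the parsing work to a third, since `toy_parse` runs once per abstract rather than once per abstract and label.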

In [2]:
%mkdir '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences/all_sentences'
%cd  '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences/all_sentences'

mkdir: /Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences/all_sentences: File exists
/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences/all_sentences


### Identify the pubmed id lists that we know by manual annotation are probably all positive for topic.

In [3]:
auditory_146_fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/batch_05_AP_pmids'
arousal_156_fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/batch_04_AR_pmids' # nearly all of these are relevant

### Identify the pubmed id lists that we believe are likely positive for topic due to pubmed search term used.

Note, most of these pubmed id lists were initially generated in 8_0_1_fetch_random_human_abstracts.ipynb.

Where a 2nd set was created for a topic, it was selected so that it does not overlap with the first set.

In [4]:
disease_fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/diss_1000_ids'
psyc_fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/psyc_1000_ids' # closer than disease, assume non-overlapping
arousal_1_fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/AR00_1000_ids' # closer than psyc, assume non-overlapping
auditory_1_fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/AP00_1000_ids'
# 2nd set of 1000 each (note pmid-lists were previously selected so that their pmids do not overlap with the first set.)
arousal_2_fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/AR00_1000_batch2_ids'
auditory_2_fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_pmids/AP00_1000_batch2_ids'

## Write the labeled sentences or processed sentences for each topic
Serves as raw data to be combined into deepdive training, testing and prediction input.

In [35]:
# will take a while, especially on larger sets.

write_abstracts(auditory_146_fname, 'auditory_146', '.', append=False)
write_abstracts(arousal_156_fname, 'arousal_156', '.', append=False)

write_abstracts(disease_fname, 'disease_1_1000', '.', append=False)
write_abstracts(psyc_fname, 'psyc_1_1000', '.', append=False)
write_abstracts(auditory_1_fname, 'auditory_1_1000', '.', append=False)
write_abstracts(arousal_1_fname, 'arousal_1_1000', '.', append=False)

write_abstracts(auditory_2_fname, 'auditory_2_1000', '.', append=False)
write_abstracts(arousal_2_fname, 'arousal_2_1000', '.', append=False)

## Edit resulting files to remove 'empty' abstracts
Records without sentence or abstract data crash our deepdive apps (probably because of our python script in the deepdive user defined function (udf) folder).

It is not clear why these specific pubmed ids came back without abstract text while the others did not.
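
The same check can be sketched in Python. It assumes the layout visible in the grep output below: the content sits in the second tab-separated field and is either empty or `{}` when the abstract is missing. `split_empty_records` is a hypothetical helper.

```python
def split_empty_records(lines):
    """Split TSV lines into (kept, dropped); dropped lines have an
    empty or '{}' second field."""
    kept, dropped = [], []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        content = fields[1] if len(fields) > 1 else ''
        if content.strip() in ('', '{}'):
            dropped.append(line)
        else:
            kept.append(line)
    return kept, dropped

example = ['111\t{"word": 2}\tt\t\\N\n',   # annotated record with content
           '222\t{}\tt\t\\N\n',             # annotated record, empty bag
           '333\t\tf\t\\N\n']               # raw record, empty abstract
kept, dropped = split_empty_records(example)
```

Returning the dropped lines as well makes it easy to confirm that the same pubmed ids are removed from every labeled version of a file.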

In [36]:
!pwd
!grep -n '\t\t\|{}' ./[ar]* # line numbering so we know we aren't throwing other records out of sync across files.
# one of files with the error
# !grep -n '21626350\|25325584' ./* | cut -d ':' -f1,2

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences/all_sentences
./annotated_sentences_arousal_156_false:16:21626350	{}	f	\N
./annotated_sentences_arousal_156_false:39:25325584	{}	f	\N
./annotated_sentences_arousal_156_nulled:16:21626350	{}	\N	\N
./annotated_sentences_arousal_156_nulled:39:25325584	{}	\N	\N
./annotated_sentences_arousal_156_true:16:21626350	{}	t	\N
./annotated_sentences_arousal_156_true:39:25325584	{}	t	\N
./raw_sentences_arousal_156_false:16:21626350		f	\N
./raw_sentences_arousal_156_false:39:25325584		f	\N
./raw_sentences_arousal_156_nulled:16:21626350		\N	\N
./raw_sentences_arousal_156_nulled:39:25325584		\N	\N
./raw_sentences_arousal_156_true:16:21626350		t	\N
./raw_sentences_arousal_156_true:39:25325584		t	\N


In [38]:
# removing the bad lines (and backing up original)
# note: this command assumes GNU sed.
!sed -i.bak '/21626350\t/d;/25325584\t/d' ./[ar]*arousal_156*
# testing success
# !grep -n '\t\t\|{}' ./* # should only be backups
# !rm ./*.bak

## Summary, resulting number of labeled 'sentences' by label type.
- usually abstracts rather than sentences.

In [45]:
!find . -type f -name "*false" | parallel wc -l {} | sort -k2

     154 ./annotated_sentences_arousal_156_false
    1000 ./annotated_sentences_arousal_1_1000_false
    1000 ./annotated_sentences_arousal_2_1000_false
     146 ./annotated_sentences_auditory_146_false
    1000 ./annotated_sentences_auditory_1_1000_false
    1000 ./annotated_sentences_auditory_2_1000_false
    1000 ./annotated_sentences_disease_1_1000_false
    1000 ./annotated_sentences_psyc_1_1000_false
     154 ./raw_sentences_arousal_156_false
    1000 ./raw_sentences_arousal_1_1000_false
    1000 ./raw_sentences_arousal_2_1000_false
     146 ./raw_sentences_auditory_146_false
    1000 ./raw_sentences_auditory_1_1000_false
    1000 ./raw_sentences_auditory_2_1000_false
    1000 ./raw_sentences_disease_1_1000_false
    1000 ./raw_sentences_psyc_1_1000_false


In [46]:
!find . -type f -name "*nulled"| parallel wc -l {} | sort -k2

     154 ./annotated_sentences_arousal_156_nulled
    1000 ./annotated_sentences_arousal_1_1000_nulled
    1000 ./annotated_sentences_arousal_2_1000_nulled
     146 ./annotated_sentences_auditory_146_nulled
    1000 ./annotated_sentences_auditory_1_1000_nulled
    1000 ./annotated_sentences_auditory_2_1000_nulled
    1000 ./annotated_sentences_disease_1_1000_nulled
    1000 ./annotated_sentences_psyc_1_1000_nulled
     154 ./raw_sentences_arousal_156_nulled
    1000 ./raw_sentences_arousal_1_1000_nulled
    1000 ./raw_sentences_arousal_2_1000_nulled
     146 ./raw_sentences_auditory_146_nulled
    1000 ./raw_sentences_auditory_1_1000_nulled
    1000 ./raw_sentences_auditory_2_1000_nulled
    1000 ./raw_sentences_disease_1_1000_nulled
    1000 ./raw_sentences_psyc_1_1000_nulled


In [47]:
!find . -type f ! \( -name "*false" -o -name "*nulled" \) | parallel wc -l {} | sort -k2

       0 ./.DS_Store
     154 ./annotated_sentences_arousal_156_true
    1000 ./annotated_sentences_arousal_1_1000_true
    1000 ./annotated_sentences_arousal_2_1000_true
     146 ./annotated_sentences_auditory_146_true
    1000 ./annotated_sentences_auditory_1_1000_true
    1000 ./annotated_sentences_auditory_2_1000_true
    1000 ./annotated_sentences_disease_1_1000_true
    1000 ./annotated_sentences_psyc_1_1000_true
     154 ./raw_sentences_arousal_156_true
    1000 ./raw_sentences_arousal_1_1000_true
    1000 ./raw_sentences_arousal_2_1000_true
     146 ./raw_sentences_auditory_146_true
    1000 ./raw_sentences_auditory_1_1000_true
    1000 ./raw_sentences_auditory_2_1000_true
    1000 ./raw_sentences_disease_1_1000_true
    1000 ./raw_sentences_psyc_1_1000_true


# Unfinished code from here on.
## These are sketches towards wrapping deepdive app creation.
## See notebook 11 and later for examples of deepdive apps.

In [None]:
def get_n_lines_file(fname):
    n_lines = 0
    with open(fname, 'r') as f:
        n_lines = sum(1 for _ in f)
    return(n_lines)

## Example, apps with different sample sizes
 - for auditory perception as positive (per the annotators' recommendation)
 - disease as negative

In [None]:
all_sentences_dir = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_sentences/all_sentences'

In [None]:
get_n_lines_file(os.path.join(all_sentences_dir, 'annotated_sentences_auditory_perception'))

In [None]:
### Setup each of the sampling sizes

# TODO: unfinished, the upper bound of the sampling sizes was never filled in.
# range(25, max())

## Per single app or sampling regimen copy templates and setup single app

In [None]:
from __future__ import print_function
import os
import shutil

In [None]:
templates='/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/templates_deepdive_app_bagofwords'
dd_app_dir = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/'
# training = '/Users/ccarey/Documents/Projects/NAMI/rdoc/.../some file of training sentences'
# prediction = '/Users/ccarey/Documents/Projects/NAMI/rdoc/.../some file of prediction sentences'

os.chdir(dd_app_dir)

In [None]:
def cc_setup_deepdive(template_source_dir, fname_train, fname_predict, topic='current_topic', app_name='current_app'):
    '''
    Copy the deepdive template into a new app directory and write its db.url.

    template_source_dir : directory containing deepdive.conf, folders, and input/init.sh
    fname_train : file containing data for training, 1 sample per line
    fname_predict : file containing data for prediction, 1 sample per line
    '''
    # name shared by the app directory and its database
    app = topic + '_' + app_name + '__' + fname_train + '__' + fname_predict
    shutil.copytree(template_source_dir, app)
    # db.url goes inside the newly created app directory
    with open(os.path.join(app, 'db.url'), 'w') as f:
        f.write('postgresql://localhost/' + app + '\n')

In [None]:
# TODO: loop over the sampling regimens (loop not yet written)
app_name = 'test_app'
cc_setup_deepdive(template_source_dir=templates,
                  topic='a_topic',
                  app_name=app_name,
                  fname_train='training',
                  fname_predict='prediction')

In [None]:
cc_setup_deepdive(template_source_dir=templates,
                  topic='a_topic',
                  app_name=app_name,
                  fname_train=training,
                  fname_predict=prediction)

# Appendices

## A.1 Medic interaction
The Python medic package is used to store abstract information.

Tables of interest:
- descriptors: Probably MH Medline headings.
- qualifiers: Probably MH additional qualifiers, like major topic.
- some abstracts seem to be generated from the *content* field of the 'sections' table.
- abstract field is entirely copyright.
- databases table gives access out to other databases, not necessarily all. (see pubmed on 25882325 for example.)
- chemicals table holds what a pubmed search shows under 'substances' for a record.

When the content of an abstract is spread across multiple rows of the sections table, use postgres array_agg() (or string_agg()) to gather those rows back into a single abstract.

(Assuming that only the abstract is in the content).
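
What that aggregation does can be illustrated client-side in Python. The sketch assumes rows of (pmid, sequence number, content); these fields are assumptions made for illustration rather than the confirmed medic schema, and `join_sections` is a hypothetical helper.

```python
from itertools import groupby
from operator import itemgetter

def join_sections(rows, sep=' '):
    """Collapse (pmid, seq, content) rows into {pmid: full abstract text}."""
    # sort by pmid, then by sequence number, so sections stay in order
    ordered = sorted(rows, key=itemgetter(0, 1))
    return dict((pmid, sep.join(content for _, _, content in group))
                for pmid, group in groupby(ordered, key=itemgetter(0)))

# made-up rows, deliberately out of order
rows = [(2, 1, 'Only section.'),
        (1, 2, 'Methods text.'),
        (1, 1, 'Background text.')]
abstracts = join_sections(rows)
```

The explicit sort matters: without an ordering, neither this function nor a database aggregate guarantees that the sections come back in reading order.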

## A.2 NLP parsing