In [None]:
# Quick and dirty test Auditory perception whole docs vs. other categories

### Positive corpus from all Auditory abstracts
- 146 documents in batch_05_AP_pmids (most are actually AP)

### Compare Auditory perception to corpus for other topics
Decreasing distance:
- 1000 disease documents
- 1000 arousal documents
- 1000 auditory perception documents
- 1000 psychology documents, psyc_1000_ids
- 156 new arousal documents, batch_04_AR_pmids (most are probably AP, but a few probably are not)

## Setup our deepdive app

deepdive_app/my_app/:
  - db.url # name for this db
  - deepdive.conf # contains extractors, inference rules, specify holdout.
  - input/raw_sentences
  - input/annotated_sentences
  - input/init.sh
  - udf/* # user defined functions used within deepdive.conf
  
Steps to build app:

deepdive initdb:
 - db started with schema
 - runs init.sh to preload deepdive postgres db with raw, annotated sentences

deepdive run:
 - creates run/* directory for each run
 - runs deepdive.conf which holds the deepdive pipeline extractors and rules
   - in particular, the extractors set up the features and rules to be used by deepdive
   
We have some of these items as templates in a template directory.
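
Before running `deepdive initdb`, it can help to confirm that the app directory contains the items listed above. The following is a minimal sketch, not part of DeepDive itself: `EXPECTED_ITEMS` and `missing_items` are hypothetical names, and the list simply mirrors the layout described in this notebook.

```python
import os

# Items this notebook expects inside the app directory (taken from the layout above).
EXPECTED_ITEMS = [
    'db.url',
    'deepdive.conf',
    os.path.join('input', 'raw_sentences'),
    os.path.join('input', 'annotated_sentences'),
    os.path.join('input', 'init.sh'),
    'udf',
]


def missing_items(app_dir):
    """Return the expected items that do not exist under app_dir."""
    return [item for item in EXPECTED_ITEMS
            if not os.path.exists(os.path.join(app_dir, item))]
```

An empty list from `missing_items(app_dir)` would mean the templates and input files are all in place.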

In [42]:
!pg_ctl -D /usr/local/var/postgres -l /usr/local/var/postgres/server.log start # deepdive and medic

pg_ctl: another server might be running; trying to start server anyway
server starting


In [1]:
templates='/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/templates_deepdive_app_bagofwords'

app_dir='/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception'
%mkdir {app_dir}

mkdir: /Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception: File exists


In [2]:
%cd {app_dir}

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception


In [43]:
%cp -r {templates}/* {app_dir}/

mkdir: /Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception: File exists
/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception


In [44]:
# modify the postgres db name
# 
!echo postgresql://localhost/8_0_3_quick_auditory_perception > db.url

## Fill input directory based on document abstracts

In [45]:
%cd '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception/input'

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception/input


### Prepping the Auditory perception positive and negative training sets, and a set of unknowns to test
Training set:
- 146 documents as positive
- 1000 documents as negative

Our unknowns are a mix of positives, likely negatives, and most-likely negatives.

abstracts from other topics

In [46]:
def get_abstracts(pmid_list_file):
    abstracts=!medic --format tsv write --pmid-list {pmid_list_file} 2>/dev/null
    return([a.split('\t', 2) for a in abstracts])

In [47]:
ap_146 = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp/batch_05_AP_pmids')
ap_1000 = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp/AP00_1000_ids')
diss = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp/diss_1000_ids')
# psyc = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp/psyc_1000_ids')
# ar_1000 = get_abstracts('/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp/AR_1000_ids')

### annotated sentences
Each line has four tab-separated columns:

    my_id   sentences   [tf]    \N

 - column 3 is the label: true, false, or null
 - column 4 is a null id for deepdive's use
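
As an illustration of this layout, here is a minimal sketch that builds one such line. `format_annotated_line` is a hypothetical helper, not the function used later in this notebook, and the lemmas in the example are made up; only the column order and the `\N` null marker come from the description above.

```python
def format_annotated_line(doc_id, lemmas, label):
    """Build one tab-separated annotated_sentences line.

    label is 't', 'f', or None (None is written as the Postgres null marker).
    """
    null = '\\N'  # a literal backslash followed by N
    array = '{' + ', '.join(sorted(lemmas)) + '}'  # Postgres array literal
    score = null if label is None else label
    return '\t'.join([str(doc_id), array, score, null])


# Hypothetical bag of words for one document id.
line = format_annotated_line(24386403, {'auditory', 'cortex'}, 't')
print(line.split('\t'))
```

Sorting the lemmas is only there to make the output deterministic; the order inside the array should not matter for a bag of words.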

In [48]:
import io
import codecs
from spacy.en import English
nlp = English(parser=True, tagger=True) # so we can sentence parse

In [49]:
def spacy_lemma_gt_len(text, length=2):
    '''Create bag of unique lemmas, requiring lemma length > length
    
    Note: setting length to 1 may mess up our postgres arrays as we would
    get commas here, unless we were to quote everything.
    '''
    tokens = []
    # text must already be unicode, e.g. "This is a sentence. Here's another...".decode('utf8')
    parsed_data = nlp(text)
    for token in parsed_data:
        if len(token.lemma_) > length:
            tokens.append(token.lemma_.lower())
    return(list(set(tokens)))

# def remove_stop_words():
#     pass

# def spacy_lemma_biwords_gt_len(text, length=3):
#     '''Create bag of unique bi-lemmas, requiring lemma length > length
    
#     We are crudely eliminating any bi-lemmas that have commas in them to save us in loading postgres arrays.
#     '''
#     biwords = []
#     parsed_data = nlp(text)
#     skip_chars = [',', '"', "'"]
#     for i in range(1, len(parsed_data) - 1):
#         skip = False
#         biword = u'{} {}'.format(parsed_data[i].lemma_.lower(), parsed_data[i+1].lemma_.lower())
#         if (parsed_data[i].lemma_ in skip_chars or parsed_data[i+1].lemma_ in skip_chars):
#             skip = True
#         if len(biword) > length and not skip:
#             biwords.append(biword)
#     return(list(set(biwords)))

def get_scored_abstract_bow(abstracts, score):
    '''Return annotated bag of words.
     my_id   sentences   [tf]    \N
     - score (postgres boolean) :  t f \N
     - column 3, true, false, null.
     - column 4 null id for deepdive's use.
     - {{}} is to wrap list as postgres array.
    '''
    results = []
    for a in abstracts:
        # bow = spacy_lemma_gt_len(a[2].decode('utf8'), length=2)
        bow = spacy_lemma_gt_len(a[2].decode('utf-8'), length=2)
        # maybe remove stop words
        bow = u', '.join(bow)
        results.append(u'{}\t{{{}}}\t{}\t{}'.format(a[0], bow, score, '\N'))
    return(results)



In [50]:
def write_raw_sentences(fname, annotations, score=None):
    '''
    Annotations (list of lists) : [[id, title, abstract],...]
    score (postgres boolean) :  t f \N'''
    with codecs.open(fname, 'a', encoding = 'utf-8') as f:
        for a in annotations:
            # \\N is escaped: in a unicode literal a bare \N starts a named-character escape
            f.write(u'{}\t{}\t{}\\N\n'.format(a[0], a[2].decode('utf-8'), score))
                    
def write_annotated_sentences(fname, annotations):
    ''' 
    Annotations (list of strings) : ["id\tbagofwords\tpostgres_boolean\t\N",...]
    '''
    with codecs.open(fname, 'a', encoding = 'utf-8') as f:
        for a in annotations:
            a = a.replace('"', '') # avoid postgres malformed array on unescaped quotes
            f.write(a + '\n')

In [51]:
%cd '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception/input'
%rm ./raw_sentences
write_raw_sentences('raw_sentences', ap_146, 't')
write_raw_sentences('raw_sentences', diss, 'f')
write_raw_sentences('raw_sentences', ap_1000, '\N')

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception/input


In [52]:
ap_146_pos = get_scored_abstract_bow(ap_146, 't')
diss_neg = get_scored_abstract_bow(diss, 'f')
ap_1000_null = get_scored_abstract_bow(ap_1000, '\N')

In [53]:
%rm './annotated_sentences'
write_annotated_sentences('annotated_sentences', ap_146_pos)
write_annotated_sentences('annotated_sentences', diss_neg)
write_annotated_sentences('annotated_sentences', ap_1000_null)

### Add in Null equivalents for the entire training set.
This is so we can see their predictions, whether they are in the holdout fraction or not.
Otherwise we can not see the results of the non-holdout portion.
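
The idea can be sketched in a few lines: every labeled training record is kept, and a second copy is appended with its label replaced by the null marker. `with_null_copies` and the two records below are hypothetical and only illustrate the duplication; the notebook itself does this by calling the write functions a second time with `'\N'`.

```python
def with_null_copies(labeled):
    """Return the labeled records followed by a null-labeled copy of each.

    labeled: list of (doc_id, text, label) tuples, label being 't' or 'f'.
    """
    null = '\\N'  # Postgres null marker: backslash followed by N
    copies = [(doc_id, text, null) for doc_id, text, _ in labeled]
    return list(labeled) + copies


# Made-up records, for illustration only.
training = [(1, 'abstract one', 't'), (2, 'abstract two', 'f')]
rows = with_null_copies(training)
```

Each training id then appears twice in the input, which is why duplicate ids show up when the input files are counted.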

In [54]:
write_raw_sentences('raw_sentences', ap_146, '\N')
write_raw_sentences('raw_sentences', diss, '\N')

ap_146_null = get_scored_abstract_bow(ap_146, '\N')
diss_null = get_scored_abstract_bow(diss, '\N')
write_annotated_sentences('annotated_sentences', ap_146_null)
write_annotated_sentences('annotated_sentences', diss_null)

## Explanation of how the sentences get into deepdive.
input/init.sh is executed when we run deepdive initdb

## init and run deepdive app
From the top level of this deepdive app.

In [27]:
%cd '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception'

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception


## inspect  results
1. inspect - deepdive's calibration graphs showing accuracy, holdout and holdout + unknowns (nulls)
2. extract and inspect expectation vs test values
3. Recall that only the 'holdout' portion of the training data gets an expectation assigned.

### How to get reports out on all the training, not just the holdout?
Just put all the training back in as nulls.

In [62]:
cmd = ('select has_term,sentence_id,id,category,expectation '
       'from  _annotated_sentences_has_term_inference order by random() limit 10')
!deepdive sql "{cmd}"
# output tsv
# !deepdive sql eval "{cmd}" format=tsv

 has_term | sentence_id |  id  | category | expectation 
----------+-------------+------+----------+-------------
          | 24386403    | 1900 |        1 |       0.998
          | 24617559    | 3032 |        1 |           0
 f        | 24902046    |  963 |        1 |           0
          | 24161466    | 1868 |        1 |           1
          | 23510647    | 1712 |        1 |           0
          | 26527069    | 3279 |        1 |           0
 f        | 24286024    |  836 |        1 |           0
          | 23285949    | 1654 |        1 |           0
          | 23179223    | 2735 |        1 |       0.004
 f        | 23063979    |  570 |        1 |           0
(10 rows)



### Running deepdive another time, get different holdouts.
- The holdout fraction in the deepdive.conf file hasn't changed.
- The holdout fraction seems to simply be a rough guide.
- There is nothing in documentation about specifying the random seed or how the random selection is made.

## Review our expected input numbers
Yes, everything checks out. We have duplicates due to the pseudo-null labeling of the training set, and a few more because the 1000 unknown-to-predict set was not cleaned of ids that overlap with the training set.

1146 training records, 146 true, 1000 false.
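
The expected line count follows from the inputs: each of the 1146 training records is written twice (once labeled, once as a null), plus the 1000 unknowns. A minimal sketch of that arithmetic, together with a small helper that mirrors the `sort | uniq -c` checks; `summarize_ids` is a hypothetical name and the ids passed to it in the example are made up.

```python
from collections import Counter


def summarize_ids(ids):
    """Return (total lines, unique ids, ids that occur more than once)."""
    counts = Counter(ids)
    repeated = sum(1 for n in counts.values() if n > 1)
    return len(ids), len(counts), repeated


# Expected total from the counts stated in this notebook.
n_pos, n_neg, n_unknown = 146, 1000, 1000
total_lines = 2 * (n_pos + n_neg) + n_unknown

# Toy example: 'a' and 'c' stand in for training ids written more than once.
toy_summary = summarize_ids(['a', 'a', 'b', 'c', 'c', 'c'])
```

Here `total_lines` works out to 3292, the same as the line count `wc` reports for `input/raw_sentences`.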

In [64]:
!wc input/raw_sentences

    3292  687066 4742740 input/raw_sentences


In [65]:
!cut -f1 input/raw_sentences | sort | uniq | wc

    2134    2134   19206


In [66]:
!cut -f1 input/annotated_sentences | sort | uniq | wc

    2134    2134   19206


In [67]:
!cut -f1 input/raw_sentences | sort | uniq -c | sort | grep -v ' 1 ' | wc

    1146    2292   16044


In [79]:
!cut -f1 input/raw_sentences | sort | uniq -c | sort | grep -v ' 1 '  | sed 's/ *. //' > mult_rec_pmids
!cut -f1 input/raw_sentences | sort | uniq -c | sort | grep -v ' 1 '  | sed 's/ *. //' | wc # number training records

    1146    1146   10314


In [80]:
#!grep -h -f mult_rec_pmids input/raw_sentences input/annotated_sentences | cut -f1,3 | sort | uniq -c | wc

    2292    6876   37818


## pull out clean result sets of the full training and unknown test sets.

In [400]:
fields = 'terms,has_term,sentence_id,expectation,sentence'

!deepdive sql 'DROP TABLE cc_neg_holdout'
!deepdive sql 'DROP TABLE cc_pos_holdout'
!deepdive sql 'DROP TABLE cc_training'

ERROR:  table "cc_neg_holdout" does not exist
DROP TABLE
DROP TABLE


In [396]:
neg_holdout = ('SELECT DISTINCT r.terms,has_term,a.sentence_id,expectation INTO '
               'cc_neg_holdout FROM '
               '_annotated_sentences_has_term_inference as a JOIN '
               '_raw_sentences as r ON '
               'a.sentence_id = r.sentence_id WHERE '
               'NOT a.has_term '
               'ORDER BY a.sentence_id') # if include r.terms would see we pseudonulled all these.
pos_holdout = ('SELECT DISTINCT r.terms,has_term,a.sentence_id,expectation INTO '
               'cc_pos_holdout FROM '
               '_annotated_sentences_has_term_inference as a JOIN '
               '_raw_sentences as r ON '
               'a.sentence_id = r.sentence_id WHERE '
               'a.has_term '
               'ORDER BY a.sentence_id') # if include r.terms would see we pseudonulled all these.

# test = ("SELECT DISTINCT r.terms,a.has_term,a.sentence_id,a.expectation FROM "
#                "_annotated_sentences_has_term_inference as a JOIN "
#                "_raw_sentences as r ON "
#                "a.sentence_id = r.sentence_id JOIN "
#                "cc_neg_holdout as n ON a.sentence_id = n.sentence_id WHERE "
#                "a.has_term IS NULL AND r.terms IS NOT NULL "
#                "ORDER BY a.sentence_id") # 259

pos_neg_input = ("SELECT DISTINCT r.terms,a.has_term,a.sentence_id,a.expectation INTO "
                 "cc_all_input FROM "
               "_annotated_sentences_has_term_inference as a JOIN "
               "_raw_sentences as r ON "
               "a.sentence_id = r.sentence_id LEFT JOIN "
               "cc_neg_holdout as n ON a.sentence_id = n.sentence_id WHERE "
               "a.has_term IS NULL AND r.terms IS NOT NULL "
               "ORDER BY a.sentence_id")

training = ("SELECT DISTINCT a.terms,a.has_term,a.sentence_id,a.expectation INTO "
            "cc_training FROM "
               "cc_all_input as a LEFT JOIN "
               "cc_pos_holdout as p ON "
               "a.sentence_id = p.sentence_id LEFT JOIN "
               "cc_neg_holdout as n ON "
               "a.sentence_id = n.sentence_id WHERE "
               "p.sentence_id IS NULL AND n.sentence_id IS NULL AND a.terms IS NOT NULL")

report = "select * from cc_pos_holdout UNION ALL select cc_pos_holdout" # as p union all select t.terms from cc_training as t"

# report = ("SELECT * "
#                "cc_all_input UNION ALL "
#                "SELECT * FROM cc_pos_holdout UNION ALL "
#                "SELECT * FROM cc_training ")
#                 WHERE "
#                "c.has_term IS NULL AND r.terms IS NOT NULL "
#                "ORDER BY a.sentence_id")

# unk_input = ("SELECT DISTINCT r.terms,has_term,a.sentence_id,category,expectation FROM "
#                "_annotated_sentences_has_term_inference as a JOIN "
#                "_raw_sentences as r ON "
#                "a.sentence_id = r.sentence_id")
#result=!deepdive sql "{neg_holdout}"
#result=!deepdive sql "{pos_holdout}"
#pos_neg_input_results=!deepdive sql "{pos_neg_input}" # should be 1000
###unk_input_results=!deepdive sql "{unk_input}" # should be 1000
#test=!deepdive sql "{test}"
#test = !deepdive sql "{training}"
test = !deepdive sql "{report}"

In [397]:
print(len(test))
test[0:6]


3


['ERROR:  column "cc_pos_holdout" does not exist',
 'LINE 1: select * from cc_pos_holdout UNION ALL select cc_pos_holdout',
 '                                                      ^']

In [40]:
neg_training = 'select has_term,sentence_id,category,expectation from _annotated_sentences_has_term_inference as a  WHERE NOT a.has_term'

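# sentence_id exists in both tables, so qualify it to avoid an ambiguous-column error
select_fields = fields.replace('sentence_id', 'a.sentence_id')
cmd = ('select {} FROM '
       '_annotated_sentences_has_term_inference as a JOIN '
       '_raw_sentences as r ON '
       'a.sentence_id = r.sentence_id'.format(select_fields))
results=!deepdive sql eval "{cmd}" format=tsv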
fields = fields.split(',')

In [None]:
print(fields)
#results[0:10]

## plot our curves by getting predictions from deepdive by sql.
Holdout data has_term is t or f

    Total returned = holdout + null labeled trues + null labeled falses + to be predicteds
    2438 = (259f + 33t) + 146 + 1000 + 1000

As sentences (pubmed ids) are shared between classes:
- remove the holdouts (they are retained in the pseudo-null labeled classes)
- extract correct labels onto the pseudo-null
  - maybe by queries back to _annotated_sentences
- remove any 'to be predicteds' that are also in the pseudo-null
  - because we hadn't cleaned these out prior to building our input files.
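
The cleanup steps above can be sketched in plain Python once the rows are out of the database. This is an illustrative sketch under assumptions: `clean_predictions` is a hypothetical helper, rows are assumed to be `(has_term, sentence_id, expectation)` tuples with `None` for a null label, and the true labels are assumed to be available as a dict built from the original training lists.

```python
def clean_predictions(rows, true_labels):
    """Split inference rows into labeled training rows and unknown rows.

    rows: (has_term, sentence_id, expectation); has_term is 't', 'f', or None.
    true_labels: dict mapping training sentence_id -> 't' or 'f'.
    """
    training, unknown = [], []
    seen = set()
    for has_term, sentence_id, expectation in rows:
        if has_term is not None:
            continue  # holdout row: its pseudo-null twin is kept instead
        if sentence_id in seen:
            continue  # an unknown that duplicates a pseudo-null row
        seen.add(sentence_id)
        if sentence_id in true_labels:
            # put the correct label back onto the pseudo-null row
            training.append((true_labels[sentence_id], sentence_id, expectation))
        else:
            unknown.append((None, sentence_id, expectation))
    return training, unknown


# Made-up rows: one unknown, one holdout, and its pseudo-null twin seen twice.
example_rows = [
    (None, '24386403', 0.998),
    ('f', '24902046', 0.0),
    (None, '24902046', 0.0),
    (None, '24902046', 0.0),
]
training_rows, unknown_rows = clean_predictions(example_rows, {'24902046': 'f'})
```

The sketch keeps the first null row it sees for each id, which assumes that repeated null rows for the same document carry the same expectation.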

In [81]:
# The inference table alone holds every row we need; a second table in FROM
# without a join condition would produce a cross product.
cmd = ('SELECT has_term,sentence_id,id,category,expectation '
       'FROM _annotated_sentences_has_term_inference')
pdrx = !deepdive sql eval "{cmd}" format=tsv

In [82]:
print(len(pdrx))
# Total returned = holdout + null labeled trues + null labeled falses + to be predicteds
# 2438 = (259f + 33t) + 146 + 1000 + 1000

2438


In [5]:
%cd '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception'

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/deepdive_app/8_0_3_quick_auditory_perception


In [6]:
%alias  plot_cal /Users/ccarey/Documents/Projects/NAMI/rdoc/scripts/plot_deepdive_calibration.R

In [19]:
%plot_cal ./run/LATEST/calibration/_annotated_sentences.has_term.tsv custom_stats_plots/test > /dev/null 2>&1

<!---
images are loaded from the root of the notebook rather than the current directory
-->

In [14]:
# side by side
# <tr>
# <td><img src=./tasks/deepdive_app/8_0_3_quick_auditory_perception/custom_stats_plots/test_histogram.png width=200 height=200 /> </td>
# <td><img src=./tasks/deepdive_app/8_0_3_quick_auditory_perception/custom_stats_plots/test_stacked_histogram.png /> </td> 
# </tr>
# or 
# ![my image]./tasks/deepdive_app/8_0_3_quick_auditory_perception/custom_stats_plots/test_stacked_histogram.png
# or (also works for pdf) 
from IPython.display import HTML 
HTML('<iframe src=./tasks/deepdive_app/8_0_3_quick_auditory_perception/custom_stats_plots/test_histogram.png width=350 height=350></iframe>')

# Appendix 1. In R, plot our own curves from deepdive's calibration

DeepDive produces some diagnostics.

- *calibration/....png*. DeepDive's png does not distinguish the holdouts from the predictions on the unknowns.

- *calibration/....tsv*. See deepdive documentation.

tsv columns:

    [bucket_from] [bucket_to] [num_predictions] [num_true] [num_false]

- buckets are the min and max extent of the probability bins. (1.00 = 100% probability document is on topic).
- Column 3 counts predictions from unknowns + holdouts.
- Columns 4 and 5 count only the holdouts.
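
Although this appendix does the plotting in R, the same tsv can be read in Python to get the fraction of holdouts that are true in each bucket. A minimal sketch, assuming the file has no header row and exactly the five tab-separated columns listed above; `holdout_accuracy` is a hypothetical name and the sample rows are made up.

```python
def holdout_accuracy(tsv_text):
    """Return (bucket_from, bucket_to, num_predictions, frac_true) per bucket.

    frac_true is num_true / (num_true + num_false), computed from the
    holdout columns only; it is None when a bucket has no holdouts.
    """
    buckets = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split('\t')
        bucket_from, bucket_to = float(fields[0]), float(fields[1])
        num_predictions = int(fields[2])
        num_true, num_false = int(fields[3]), int(fields[4])
        num_holdout = num_true + num_false
        frac_true = float(num_true) / num_holdout if num_holdout else None
        buckets.append((bucket_from, bucket_to, num_predictions, frac_true))
    return buckets


# Made-up calibration rows, for illustration only.
sample = "0.00\t0.10\t50\t1\t9\n0.90\t1.00\t20\t4\t0\n"
calibration = holdout_accuracy(sample)
```

For a well-calibrated model, `frac_true` in each bucket should fall roughly between `bucket_from` and `bucket_to`.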