# Select and distribute additional Agency articles for phrase annotation

## Goal: annotate another topic (Agency)
- A good number of annotations on this topic.
- Phrase-based annotation, so we stay flexible and can still assess at higher levels.

We probably don't need replication across annotators here; just go for depth. Annotator cross-validation is already well covered for the sleep / arousal / auditory perception terms.

In [1]:
dest_dir = '/Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/20160212_rdoc_project'
%mkdir {dest_dir}
%cd {dest_dir}

mkdir: /Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/20160212_rdoc_project: File exists
/Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/20160212_rdoc_project


### Get a list of all our previous pubmed ids used for agency.
We only exclude from annotation those articles that were previously annotated for agency.

In [2]:
from __future__ import print_function
import glob
import os
import re

# annotated and processed previously (before our most recent 2 batches)
prev_annotated_pdfs_dir = '/Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/all_pdfs_annotated_pmid_names/[0-9][0-9]_AG*.pdf'
pdfs = glob.glob(prev_annotated_pdfs_dir)
pdfs = [os.path.basename(pdf) for pdf in pdfs]
print('{} total PDFS'.format(len(pdfs)))
pdfs

6 total PDFS


['02_AG00_01_NA1_jl_IRRELEVANT.pdf',
 '02_AG00_01_NA1_mk_IRRELEVANT.pdf',
 '02_AG00_05_NA5_cc_IRRELEVANT.pdf',
 '02_AG05_01_23928891_mk_ANNOTATED.pdf',
 '02_AG05_02_23744445_jl_ANNOTATED.pdf',
 '02_AG05_02_23744445_mk_ANNOTATED.pdf']

In [3]:
pattern = '([0-9]{6,9})'
p = re.compile(pattern)
prev_pmids = [p.search(pdf).group() for pdf in pdfs if p.search(pdf)]
print('{} Total existing PMIDS'.format(len(prev_pmids)))
prev_pmids = set(prev_pmids)
print('{} Unique existing PMIDS'.format(len(prev_pmids)))
#print(prev_pmids)

3 Total existing PMIDS
2 Unique existing PMIDS


['23928891', '23744445']
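
The pattern above takes the first run of 6 to 9 digits anywhere in the filename, which works for the names listed here. A slightly stricter sketch follows, assuming the PMID is always its own underscore-delimited field as in the listing above. `extract_pmid` and `PMID_FIELD` are hypothetical names, not part of the original notebook.

```python
import re

# Assumption: the PMID is an underscore-delimited field of 6-9 digits,
# followed either by another field or by the .pdf extension.
PMID_FIELD = re.compile(r'(?:^|_)([0-9]{6,9})(?=_|\.pdf$)')

def extract_pmid(filename):
    """Return the PMID embedded in filename, or None if there is none."""
    match = PMID_FIELD.search(filename)
    return match.group(1) if match else None

names = [
    '02_AG00_01_NA1_jl_IRRELEVANT.pdf',
    '02_AG05_01_23928891_mk_ANNOTATED.pdf',
    '02_AG05_02_23744445_jl_ANNOTATED.pdf',
    '02_AG05_02_23744445_mk_ANNOTATED.pdf',
]
pmids = [extract_pmid(n) for n in names]
unique_pmids = sorted({p for p in pmids if p is not None})
print(unique_pmids)
```

Files without a PMID field (the `NA1` placeholders) simply yield `None` and are filtered out.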

## Get new lists of pmids specific to topic

In [4]:
#from __future__ import print_function
from Bio import Entrez
from subprocess import check_call
from shutil import copy2
#import glob
import time
import imp
import os
url2p = imp.load_source('Url2PubmedPmcPdf', '/Users/ccarey/Documents/Projects/NAMI/rdoc/scripts/Url2PubmedPmcPdf.py')
Entrez.email = "charlieccarey@gmail.com"

def narrow_id_list(found_ids, omit_ids):
    found_but_omit = list(set(found_ids) & set(omit_ids))
    found_and_keep = list(set(found_ids) - set(omit_ids))
    print('Removed this many ids: {}'.format(len(found_but_omit)))
    return(found_and_keep)

def pubmed_central_search_to_pubmed_id(search_string, retmax=20):
    # Report how many PubMed records match the query (possibly useful as a
    # check when a query matches 100s of ids and we don't want to overwhelm the server).
    handle = Entrez.egquery(term=search_string)
    record = Entrez.read(handle)
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            print(row["Count"])
    # Fetch up to retmax ids for those records.
    handle = Entrez.esearch(db="pubmed", retmax=retmax, term=search_string)
    record = Entrez.read(handle)
    pubmed_ids = record["IdList"]
    return(pubmed_ids)

def fetch_pdfs(pubmed_ids, stub_name):
    u = url2p.Url2PubmedPmcPdf(pubmed_ids)
    urls = u.get_urls()
    found = []
    for url in urls:
        if url['url'] is not None:
            # Pass the arguments as a list (no shell), so URLs containing '&' or '?' are not mangled.
            cmd = ['curl', '-L', url['url'], '-o', '{}{}.pdf'.format(stub_name, url['pubmed'])]
            # print(cmd)
            check_call(cmd)
            time.sleep(10)
            found.append(url['pubmed'])
    return(found)

# def copy_pdf_append_initials(initials):
#     pdfs = glob.glob('*.pdf')
#     for i in initials:
#         os.mkdir(i)
#         for p in pdfs:
#             pi = p.replace('.pdf', '_' + i + '.pdf')
#             copy2(p, os.path.join(i, pi))

def search_and_summarize(search_name, query, omit_ids):
    ids = pubmed_central_search_to_pubmed_id(query, retmax=1000000)
    new_ids = narrow_id_list(ids, omit_ids)
    print('{} search of pubmed found {} ids of which {} are new'.format(search_name, len(ids), len(new_ids)))
    return(new_ids)

### We want to send out more articles, up to 60 per person.
180 - 240 total articles desired.
#### Define new search terms, as the old ones from Aurelien's helper were mostly not good.
control helplessness valence depression schizophrenia illusions volition
moral framing experience control choice sense of control

Paradigms:
- Ford Corollary Discharge Paradigm 
- Identification of one's own biological motion
- illusions of will 
- Joy Stick manipulation (decoupling motor and sensory feedback)
- Reality Monitoring 


Common to all searches, written here as a suffix (the code below prepends the same terms as search_filter):

- ' AND ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND "loattrfree full text"[sb] AND ("2011/01/31"[PDAT] : "2016/01/29"[PDAT])'

pubmed => "illusion of will" (phrase not found)

returns:
- ("illusions"[MeSH Terms] OR "illusions"[All Fields] OR "illusion"[All Fields]) AND ("volition"[MeSH Terms] OR "volition"[All Fields] OR "will"[All Fields])

reality monitoring

- "Reality Monitoring"[All Fields]

corollary discharge paradigm

- "corollary discharge"[All Fields] AND ("hallucinations"[MeSH Terms] OR "hallucinations"[All Fields] OR "hallucination"[All Fields])

- "corollary discharge"[All Fields] AND ("schizophrenia"[MeSH Terms] OR "schizophrenia"[All Fields])

- ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND ((Corollary[All Fields] AND "discharge"[All Fields]) OR ("corollary discharge"[All Fields] AND Paradigm[All Fields])) AND "loattrfree full text"[sb]

more general, using volition

- "agency"[All Fields] AND (("volition"[MeSH Terms] OR "volition"[All Fields]) AND ("psychology"[Subheading] OR "psychology"[All Fields] OR "psychology"[MeSH Terms]))

Trying to get more general agency
- "perception"[MeSH Terms] AND agency[All Fields]

In [None]:
# These were from preliminary work by Aurelien etc but mostly no good.
# AG00 = '"agency"[All Fields] AND ("loattrfree full text"[sb] AND "2010/06/24"[PDat] : "2015/06/22"[PDat] AND "humans"[MeSH Terms])'
# AG01 = '(("scalp"[MeSH Terms] OR "scalp"[All Fields]) AND motor[All Fields] AND potentials[All Fields]) AND "agency"[All Fields] AND ("loattrfree full text"[sb] AND "2010/06/24"[PDat] : "2015/06/22"[PDat] AND "humans"[MeSH Terms])'
# AG02 = '(perpetual[All Fields] AND aberration[All Fields] AND ("weights and measures"[MeSH Terms] OR ("weights"[All Fields] AND "measures"[All Fields]) OR "weights and measures"[All Fields] OR "scale"[All Fields])) AND "agency"[All Fields] AND ("loattrfree full text"[sb] AND "2010/06/24"[PDat] : "2015/06/22"[PDat] AND "humans"[MeSH Terms])'
# AG03 = '(("identification (psychology)"[MeSH Terms] OR ("identification"[All Fields] AND "(psychology)"[All Fields]) OR "identification (psychology)"[All Fields] OR "identification"[All Fields]) AND one's[All Fields] AND own[All Fields] AND ("biology"[MeSH Terms] OR "biology"[All Fields] OR "biological"[All Fields]) AND ("motion"[MeSH Terms] OR "motion"[All Fields])) AND "agency"[All Fields] AND ("loattrfree full text"[sb] AND "2010/06/24"[PDat] : "2015/06/22"[PDat] AND "humans"[MeSH Terms])'
# AG04 = '(Joy[All Fields] AND Stick[All Fields] AND manipulation[All Fields]) AND "agency"[All Fields] AND ("loattrfree full text"[sb] AND "2010/06/24"[PDat] : "2015/06/22"[PDat] AND "humans"[MeSH Terms])'
# AG05 = '(("illusions"[MeSH Terms] OR "illusions"[All Fields]) AND ("volition"[MeSH Terms] OR "volition"[All Fields] OR "will"[All Fields])) AND "agency"[All Fields] AND ("loattrfree full text"[sb] AND "2010/06/24"[PDat] : "2015/06/22"[PDat] AND "humans"[MeSH Terms])'
# AG06 = '(Ford[All Fields] AND Corollary[All Fields] AND ("patient discharge"[MeSH Terms] OR ("patient"[All Fields] AND "discharge"[All Fields]) OR "patient discharge"[All Fields] OR "discharge"[All Fields]) AND Paradigm[All Fields]) AND "agency"[All Fields] AND ("loattrfree full text"[sb] AND "2010/06/24"[PDat] : "2015/06/22"[PDat] AND "humans"[MeSH Terms])'
# AG07 = '"reality monitoring"[All Fields] AND "agency"[All Fields] AND ("loattrfree full text"[sb] AND "2010/06/24"[PDat] : "2015/06/22"[PDat] AND "humans"[MeSH Terms])'

In [5]:
search_filter = '("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND "loattrfree full text"[sb] AND ("2011/01/31"[PDAT] : "2016/01/29"[PDAT]) AND '

AG08 = search_filter + '("illusions"[MeSH Terms] OR "illusions"[All Fields] OR "illusion"[All Fields]) AND ("volition"[MeSH Terms] OR "volition"[All Fields] OR "will"[All Fields])'
AG09 = search_filter + '"Reality Monitoring"[All Fields]'
AG10 = search_filter + '"corollary discharge"[All Fields] AND ("hallucinations"[MeSH Terms] OR "hallucinations"[All Fields] OR "hallucination"[All Fields])'
AG11 = search_filter + '"corollary discharge"[All Fields] AND ("schizophrenia"[MeSH Terms] OR "schizophrenia"[All Fields])'
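# NOTE: the top-level OR below is not wrapped in parentheses. PubMed evaluates Boolean
# operators left to right, so the clause after OR may not be restricted by search_filter
# (possibly why a few older, 7-digit PMIDs show up in the results).
AG12 = search_filter + '(Corollary[All Fields] AND "discharge"[All Fields]) OR ("corollary discharge"[All Fields] AND Paradigm[All Fields])'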
AG13 = search_filter + '"agency"[All Fields] AND (("volition"[MeSH Terms] OR "volition"[All Fields]) AND ("psychology"[Subheading] OR "psychology"[All Fields] OR "psychology"[MeSH Terms]))'
AG14 = search_filter + '"perception"[MeSH Terms] AND agency[All Fields]'
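
Because the queries are built by plain string concatenation, a clause with a top-level OR can slip outside the shared filter. A minimal sketch of a safer way to combine them, assuming the filter is stored without its trailing ` AND `. `with_filter`, `demo_filter`, and `demo_clause` are hypothetical names used only for illustration, and the filter is shortened.

```python
def with_filter(base_filter, clause):
    """AND a parenthesized clause onto a parenthesized filter."""
    return '({}) AND ({})'.format(base_filter, clause)

# Hypothetical shortened filter, for illustration only.
demo_filter = '"humans"[MeSH Terms] AND ("2011/01/31"[PDAT] : "2016/01/29"[PDAT])'
demo_clause = ('(Corollary[All Fields] AND "discharge"[All Fields]) OR '
               '("corollary discharge"[All Fields] AND Paradigm[All Fields])')

query = with_filter(demo_filter, demo_clause)
print(query)
```

Wrapping both sides in parentheses keeps every alternative of the clause inside the human / free full text / date restriction, whatever operators the clause contains.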

In [6]:
AG08_ids = search_and_summarize(search_name='AG08', query=AG08, omit_ids=prev_pmids)
AG09_ids = search_and_summarize(search_name='AG09', query=AG09, omit_ids=prev_pmids)
AG10_ids = search_and_summarize(search_name='AG10', query=AG10, omit_ids=prev_pmids)
AG11_ids = search_and_summarize(search_name='AG11', query=AG11, omit_ids=prev_pmids)
AG12_ids = search_and_summarize(search_name='AG12', query=AG12, omit_ids=prev_pmids)
AG13_ids = search_and_summarize(search_name='AG13', query=AG13, omit_ids=prev_pmids)
AG14_ids = search_and_summarize(search_name='AG14', query=AG14, omit_ids=prev_pmids)

27
Removed this many ids: 2
AG08 search of pubmed found 27 ids of which 25 are new
8
Removed this many ids: 0
AG09 search of pubmed found 8 ids of which 8 are new
3
Removed this many ids: 0
AG10 search of pubmed found 3 ids of which 3 are new
13
Removed this many ids: 0
AG11 search of pubmed found 13 ids of which 13 are new
44
Removed this many ids: 0
AG12 search of pubmed found 44 ids of which 44 are new
4
Removed this many ids: 0
AG13 search of pubmed found 4 ids of which 4 are new
86
Removed this many ids: 2
AG14 search of pubmed found 86 ids of which 84 are new


Found 180 new ids, 163 of them unique.

In [7]:
import collections

ids = [AG08_ids, AG09_ids, AG10_ids, AG11_ids, AG12_ids, AG13_ids, AG14_ids]
ids = [pmid for pmids in ids for pmid in pmids]
print('total new pmids found across all searches : {}'.format(len(ids)))

dups = [item for item, count in collections.Counter(ids).items() if count > 1]
print('duplicate pmids found across all searches : {}'.format(dups))

ids = set(ids)
print('unique new pmids found across all searches : {}'.format(len(ids)))

print('Ids that are not 8 digits long : {}'.format([pmid for pmid in ids if len(pmid) != 8]))

total new pmids found across all searches : 181
duplicate pmids found across all searches : ['26156994', '22970130', '23155183', '21890745', '21959054', '21543355', '24786597', '24695696', '22242165', '22079494', '23754836', '21993915', '21972276', '20418446']
unique new pmids found across all searches : 164
Ids that are not 8 digits long : ['3783224', '9167517']


### Found plenty of new articles

- total new pmids found across all but most general searches : 
  - 180
- total unique pmids across :
  - 163
- total retrieved as pdfs :
  - 134
- total unique pdfs by pmid retrieved (removing alphanumerically last ones) :
  - 117 (118 but 1 not downloadable as pdf)

### Attempt to fetch the pdfs for each search term.

In [8]:
%pwd

u'/Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/20160212_rdoc_project'

In [None]:
batch = '06'
fetch_pdfs(AG08_ids, batch + '_AG08_')
fetch_pdfs(AG09_ids, batch + '_AG09_')
fetch_pdfs(AG10_ids, batch + '_AG10_')
fetch_pdfs(AG11_ids, batch + '_AG11_')
fetch_pdfs(AG12_ids, batch + '_AG12_')
fetch_pdfs(AG13_ids, batch + '_AG13_')
fetch_pdfs(AG14_ids, batch + '_AG14_')


I then listed the possible duplicate IDs and removed the later-occurring duplicates, e.g. if an ID occurred in AP01 and AP04, I only kept the AP01 version.
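
The keep-the-first rule can be sketched without touching the filesystem. This is an illustration under assumptions, not the notebook's own code: `split_duplicates` is a hypothetical helper, the filenames are made up, and the PMID is assumed to be the last underscore-delimited field before `.pdf`.

```python
import re
from collections import defaultdict

def split_duplicates(filenames):
    """Group files by PMID; keep the alphabetically first, list the rest."""
    pmid_re = re.compile(r'_([0-9]{7,8})\.pdf$')
    by_pmid = defaultdict(list)
    for name in sorted(filenames):  # sorted, so index 0 is the earliest search
        match = pmid_re.search(name)
        if match:
            by_pmid[match.group(1)].append(name)
    keep = sorted(names[0] for names in by_pmid.values())
    remove = sorted(n for names in by_pmid.values() for n in names[1:])
    return keep, remove

# Hypothetical filenames, for illustration only.
demo = ['06_AG12_11111111.pdf', '06_AG08_11111111.pdf', '06_AG09_22222222.pdf']
keep, remove = split_duplicates(demo)
print(keep)
print(remove)
```

Sorting before grouping matters because directory listings are not guaranteed to come back in alphabetical order.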

In [None]:
all_found = glob.glob( './*.pdf')
print('total pdfs downloaded : {}'.format(len(all_found)))
p = re.compile('[0-9]{7,8}')
pmids_found = [p.search(pdf).group() for pdf in all_found if p.search(pdf)]
print('total unique pdfs downloaded by pmid : {}'.format(len(set(pmids_found))))
dups_found = [(item, count) for item, count in collections.Counter(pmids_found).items() if count > 1]
print('Duplicate pmids downloaded : {}'.format(dups_found))
dups = []
to_remove = []
for pmid_count in dups_found:
    # sort, so that files[1:] really are the alphanumerically later copies
    files = sorted(glob.glob('*' + pmid_count[0] + '.*pdf'))
    dups.append(files)
    to_remove.append(files[1:])
print('duplicate pmids : ')
for d in dups:
    print(d)
print('removing alphanumerically last of duplicates, which are : ')
for r in to_remove:
    r_str = ' '.join(r)
    print(r_str)
    %rm {r_str}

In [None]:
# verify count after removal
all_found = glob.glob( './*.pdf')
p = re.compile('[0-9]{7,8}')
pmids_found = [p.search(pdf).group() for pdf in all_found if p.search(pdf)]
print('total unique pdfs downloaded by pmid : {}'.format(len(set(pmids_found))))
dups_found = [(item, count) for item, count in collections.Counter(pmids_found).items() if count > 1]
print('Duplicate pmids downloaded : {}'.format(dups_found))

## Manually scan all pdfs for broken pdfs
This seems to happen with journals where the full PDF can only be obtained from the journal website.

These were broken and manually refetched:
- 06_AG09_24125858.pdf
- 06_AG10_21972276.pdf
- 06_AG12_19776364.pdf
- 06_AG12_23986562.pdf
- 06_AG13_21920777.pdf
- 06_AG14_22178504.pdf
- 06_AG14_24333539.pdf
- 06_AG14_24555983.pdf
- 06_AG14_26356380.pdf (only a paywall found, and the article is irrelevant, so discarding)
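
Part of the manual scan could be automated with a header check. This is a rough sketch that assumes a "broken" download is typically an HTML page saved under a `.pdf` name; it would not catch truncated or otherwise damaged PDFs. `looks_like_pdf` is a hypothetical helper and the demo files are throwaway examples.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    with open(path, 'rb') as handle:
        return handle.read(5) == b'%PDF-'

# Demo with two throwaway files (hypothetical content).
tmp_dir = tempfile.mkdtemp()
good = os.path.join(tmp_dir, 'good.pdf')
bad = os.path.join(tmp_dir, 'bad.pdf')
with open(good, 'wb') as handle:
    handle.write(b'%PDF-1.4\n...')
with open(bad, 'wb') as handle:
    handle.write(b'<html><body>Access denied</body></html>')

suspect = [p for p in (good, bad) if not looks_like_pdf(p)]
print([os.path.basename(p) for p in suspect])
```

Anything flagged this way still needs a manual look, since the check says nothing about whether the article itself is relevant.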

In [9]:
# Confirm remaining
len(glob.glob( './*.pdf'))

206

In [None]:
%rm 06_AG14_26356380.pdf

In [None]:
# 30 identical for each of 3 annotators
# 30, 29, 28 unique to mk, tc, cc respectively (117 files in total).
import random
random.seed(0)
files = glob.glob('06_AG*.pdf')
idxs = list(range(len(files)))
random.shuffle(idxs)

everyone = [files[idx] for idx in idxs[0:30]]
mk = [files[idx] for idx in idxs[30:60]]
tc = [files[idx] for idx in idxs[60:89]]
cc = [files[idx] for idx in idxs[89:117]]
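
The hand-written slice boundaries above are easy to get out of step with the comment. An alternative sketch deals the non-shared files round-robin, so the per-annotator counts differ by at most one. This is not the split that was actually used: `split_for_annotators` and the demo filenames are hypothetical.

```python
import random

def split_for_annotators(files, annotators, n_shared, seed=0):
    """Shuffle files, share the first n_shared, deal the rest round-robin."""
    rng = random.Random(seed)
    shuffled = sorted(files)  # sort first so the shuffle is reproducible
    rng.shuffle(shuffled)
    shared, rest = shuffled[:n_shared], shuffled[n_shared:]
    unique = {a: rest[i::len(annotators)] for i, a in enumerate(annotators)}
    return shared, unique

# Hypothetical filenames, for illustration only.
demo_files = ['06_AG_{:03d}.pdf'.format(i) for i in range(117)]
shared, unique = split_for_annotators(demo_files, ['mk', 'tc', 'cc'], n_shared=30)
print(len(shared), {a: len(v) for a, v in unique.items()})
```

With 117 files and 30 shared, each of the three annotators gets 29 unique files in this scheme.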

In [None]:
def copy_to_annotator(files, annotator, suffix):
    %mkdir {annotator}
    %mkdir {annotator}/annotated
    %mkdir {annotator}/irrelevant
    for f in files:
        f_dest = f.split('.')[0] + suffix
        %cp {f} ./{annotator}/{f_dest}

copy_to_annotator(everyone, 'mk_', '_mk.pdf')
copy_to_annotator(mk, 'mk_', '_mk.pdf')

copy_to_annotator(everyone, 'tc_', '_tc.pdf')
copy_to_annotator(tc, 'tc_', '_tc.pdf')

copy_to_annotator(everyone, 'cc_', '_cc.pdf')
copy_to_annotator(cc, 'cc_', '_cc.pdf')

### Also add Tara's previously broken pdfs to her folder
- 05_AP04_22786953_tc.pdf
- 05_AP05_22592306_tc.pdf
- 05_AP07_25231612_tc.pdf
- 05_AP09_21945789_tc.pdf

## TODO: Add the new abstracts to our medic database
Insert fails if any are already in db.
Update will overwrite previous records.

In [None]:
print(len(files))
#print(files)
print(len(ids))
print(len(set(ids)))
# print(len(prev_pmids)) # 2
print('----')
prev_pmids = list(prev_pmids)
prev_pmids.extend(ids)
print(len(set(prev_pmids)))

## Add a few additional agency searches proposed by MK
1. "Sense of Agency"

2. "Illusion of Ownership"

3. "Feeling of Myness"

4. "Feeling of Agency"

Do not include ids that were already attempted. (Note: prev_pmids was modified above to include the ids from the AG08-AG14 searches.)

Matt later identified additional terms we are not extracting to pdfs here:

- asomatognosia
- proprioceptive drift

In [10]:
AG15 = search_filter + '"sense of agency"[All Fields]'

AG16 = search_filter + '"illusion of ownership"[All Fields]'
AG17 = search_filter + '"illusion"[All Fields] AND "ownership"[All Fields]'

AG18 = search_filter + '"feeling of myness"[All Fields]'
AG19 = search_filter + '"feeling"[All Fields] AND "myness"[All Fields]'

AG20 = search_filter + '"feeling of agency"[All Fields]'
AG21 = search_filter + '"feeling"[All Fields] AND "agency"[All Fields]'

AG22 = search_filter + '"asomatognosia"[All Fields]'
AG23 = search_filter + '"proprioceptive drift"[All Fields]'

In [None]:
new_ids = {}
all_new_ids = []

AG15_ids = search_and_summarize(search_name='AG15', query=AG15, omit_ids=prev_pmids)
new_ids['AG15_ids'] = AG15_ids
all_new_ids.extend(set(AG15_ids))

AG16_ids = search_and_summarize(search_name='AG16', query=AG16, omit_ids=prev_pmids)
new_ids['AG16_ids'] = narrow_id_list(AG16_ids, all_new_ids)
all_new_ids.extend(set(AG16_ids))

AG17_ids = search_and_summarize(search_name='AG17', query=AG17, omit_ids=prev_pmids)
new_ids['AG17_ids'] = narrow_id_list(AG17_ids, all_new_ids)
all_new_ids.extend(set(AG17_ids))

AG18_ids = search_and_summarize(search_name='AG18', query=AG18, omit_ids=prev_pmids)
new_ids['AG18_ids'] = narrow_id_list(AG18_ids, all_new_ids)
all_new_ids.extend(set(AG18_ids))

AG19_ids = search_and_summarize(search_name='AG19', query=AG19, omit_ids=prev_pmids)
new_ids['AG19_ids'] = narrow_id_list(AG19_ids, all_new_ids)
all_new_ids.extend(set(AG19_ids))

AG20_ids = search_and_summarize(search_name='AG20', query=AG20, omit_ids=prev_pmids)
new_ids['AG20_ids'] = narrow_id_list(AG20_ids, all_new_ids)
all_new_ids.extend(set(AG20_ids))

AG21_ids = search_and_summarize(search_name='AG21', query=AG21, omit_ids=prev_pmids)
new_ids['AG21_ids'] = narrow_id_list(AG21_ids, all_new_ids)
all_new_ids.extend(set(AG21_ids))

# After we were done, Matt suggested the following 2 searches.

other_new_ids = {}
other_all_new_ids = list(all_new_ids)

AG22_ids = search_and_summarize(search_name='AG22', query=AG22, omit_ids=prev_pmids)
other_new_ids['AG22_ids'] = narrow_id_list(AG22_ids, other_all_new_ids)
other_all_new_ids.extend(set(AG22_ids))

AG23_ids = search_and_summarize(search_name='AG23', query=AG23, omit_ids=prev_pmids)
# narrow against other_all_new_ids so that ids already found by AG22 are dropped too
other_new_ids['AG23_ids'] = narrow_id_list(AG23_ids, other_all_new_ids)
other_all_new_ids.extend(set(AG23_ids))
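
The cell above repeats the same narrow-then-extend step for every search. A minimal sketch of that step as a loop, which assigns each PMID to the first search that found it. `assign_to_first_search` and the demo ids are hypothetical; the real searches would supply the id lists.

```python
from collections import OrderedDict

def assign_to_first_search(results, omit_ids):
    """Give each PMID to the first search that found it, skipping omit_ids."""
    seen = set(omit_ids)
    assigned = OrderedDict()
    for name, pmids in results.items():
        fresh = [p for p in pmids if p not in seen]
        fresh = list(OrderedDict.fromkeys(fresh))  # drop repeats, keep order
        assigned[name] = fresh
        seen.update(fresh)
    return assigned

# Hypothetical search results, for illustration only.
demo_results = OrderedDict([
    ('AG15', ['111', '222']),
    ('AG16', ['222', '333']),
    ('AG17', ['333', '444', '999']),
])
assigned = assign_to_first_search(demo_results, omit_ids={'999'})
print(dict(assigned))
```

Keeping one `seen` set avoids having to remember which accumulator list each later search should be narrowed against.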

In [None]:
temp = [AG15_ids, AG16_ids, AG17_ids, AG18_ids, AG19_ids, AG20_ids, AG21_ids]
temp = [pmid for pmids in temp for pmid in pmids]
print('total new pmids found across all searches : {}'.format(len(temp)))
print('total new pmids found across all_new_ids searches : {}'.format(len(all_new_ids)))
print('unique new pmids found across all searches : {}'.format(len(set(all_new_ids))))
dups = [item for item, count in collections.Counter(all_new_ids).items() if count > 1]
print('duplicates across sets : {}'.format(dups))
print([(k, len(v)) for k,v in sorted(new_ids.items())])

In [253]:
%pwd

u'/Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/20160212_rdoc_project'

In [254]:
fetch_pdfs(new_ids['AG15_ids'], batch + '_AG15_')
fetch_pdfs(new_ids['AG17_ids'], batch + '_AG17_')
fetch_pdfs(new_ids['AG21_ids'], batch + '_AG21_')

['26180556',
 '23619067',
 '21140173',
 '25971400',
 '23256901',
 '22737114',
 '21371392']

In [None]:
newest_found_pdfs = {}
newest_found_pdfs['AG15'] = glob.glob('06_AG15_*')
newest_found_pdfs['AG17'] = glob.glob('06_AG17_*')
newest_found_pdfs['AG21'] = glob.glob('06_AG21_*')
print([(k,len(v)) for k, v in newest_found_pdfs.items()])
print(newest_found_pdfs)

[('AG17', 46), ('AG15', 36), ('AG21', 7)]
{'AG17': ['06_AG17_21208964.pdf', '06_AG17_21451023.pdf', '06_AG17_21521765.pdf', '06_AG17_21633503.pdf', '06_AG17_21687453.pdf', '06_AG17_21738756.pdf', '06_AG17_22073126.pdf', '06_AG17_22399451.pdf', '06_AG17_22658684.pdf', '06_AG17_22829891.pdf', '06_AG17_23123685.pdf', '06_AG17_23144814.pdf', '06_AG17_23209824.pdf', '06_AG17_23226375.pdf', '06_AG17_23285026.pdf', '06_AG17_23300992.pdf', '06_AG17_23416066.pdf', '06_AG17_23680793.pdf', '06_AG17_23682688.pdf', '06_AG17_23690859.pdf', '06_AG17_23720537.pdf', '06_AG17_23804622.pdf', '06_AG17_23858436.pdf', '06_AG17_23964067.pdf', '06_AG17_23980141.pdf', '06_AG17_24060991.pdf', '06_AG17_24073701.pdf', '06_AG17_24260454.pdf', '06_AG17_24268410.pdf', '06_AG17_24385970.pdf', '06_AG17_24465671.pdf', '06_AG17_24465698.pdf', '06_AG17_24498012.pdf', '06_AG17_24671172.pdf', '06_AG17_24699795.pdf', '06_AG17_24782721.pdf', '06_AG17_24806404.pdf', '06_AG17_24959128.pdf', '06_AG17_25210738.pdf', '06_AG17_25285620.pdf', '06_AG17_25295527.pdf', '06_AG17_25338780.pdf', '06_AG17_25583608.pdf', '06_AG17_25658822.pdf', '06_AG17_25775041.pdf', '06_AG17_25906330.pdf'], 'AG15': ['06_AG15_21295497.pdf', '06_AG15_21302161.pdf', '06_AG15_22129483.pdf', '06_AG15_22194878.pdf', '06_AG15_22326304.pdf', '06_AG15_22375891.pdf', '06_AG15_22383963.pdf', '06_AG15_22451482.pdf', '06_AG15_22529796.pdf', '06_AG15_22813429.pdf', '06_AG15_22871335.pdf', '06_AG15_23086590.pdf', '06_AG15_23143153.pdf', '06_AG15_23227017.pdf', '06_AG15_23285293.pdf', '06_AG15_23372562.pdf', '06_AG15_23445715.pdf', '06_AG15_23494975.pdf', '06_AG15_23823467.pdf', '06_AG15_23977268.pdf', '06_AG15_24009575.pdf', '06_AG15_24019932.pdf', '06_AG15_24093017.pdf', '06_AG15_24367303.pdf', '06_AG15_24443662.pdf', '06_AG15_24860486.pdf', '06_AG15_24987350.pdf', '06_AG15_25007276.pdf', '06_AG15_25191256.pdf', '06_AG15_25228869.pdf', '06_AG15_25295000.pdf', '06_AG15_25339886.pdf', '06_AG15_25473014.pdf', '06_AG15_25518726.pdf', '06_AG15_25904779.pdf', '06_AG15_26270552.pdf'], 'AG21': 
['06_AG21_21140173.pdf', '06_AG21_21371392.pdf', '06_AG21_22737114.pdf', '06_AG21_23256901.pdf', '06_AG21_23619067.pdf', '06_AG21_25971400.pdf', '06_AG21_26180556.pdf']}

### 1/2 of new pdfs to MK
89 new pdfs; half of that, rounded up, is 45.

These were broken and were downloaded manually:

    06_AG15_21295497
    06_AG15_22813429
    06_AG15_23143153
    06_AG17_24268410
    06_AG17_25583608

In [269]:
random.seed(0)
files = [pdf for pdfs in newest_found_pdfs.values() for pdf in pdfs]
print(len(files))

idxs = list(range(len(files)))
random.shuffle(idxs)
mk = [files[idx] for idx in idxs[0:45]]
tc_maybe = [files[idx] for idx in idxs[45:89]]

89


In [270]:
copy_to_annotator(mk, 'mk_', '_mk.pdf')
# copy_to_annotator(tc_maybe, 'tc_', '_tc.pdf')

In [None]:
print(mk)
print(tc_maybe)

['06_AG15_22129483.pdf', '06_AG17_21687453.pdf', '06_AG17_23690859.pdf', '06_AG21_23256901.pdf', '06_AG15_24367303.pdf', '06_AG21_23619067.pdf', '06_AG15_24019932.pdf', '06_AG15_26270552.pdf', '06_AG15_22375891.pdf', '06_AG15_22529796.pdf', '06_AG17_24782721.pdf', '06_AG17_24060991.pdf', '06_AG15_25191256.pdf', '06_AG17_21451023.pdf', '06_AG17_22399451.pdf', '06_AG17_24385970.pdf', '06_AG15_23143153.pdf', '06_AG15_25007276.pdf', '06_AG17_24498012.pdf', '06_AG17_21738756.pdf', '06_AG17_23209824.pdf', '06_AG15_23227017.pdf', '06_AG17_23226375.pdf', '06_AG17_23123685.pdf', '06_AG17_25775041.pdf', '06_AG15_23445715.pdf', '06_AG17_24073701.pdf', '06_AG17_23300992.pdf', '06_AG17_25338780.pdf', '06_AG17_23416066.pdf', '06_AG15_24093017.pdf', '06_AG21_21371392.pdf', '06_AG17_21633503.pdf', '06_AG15_25339886.pdf', '06_AG15_23372562.pdf', '06_AG15_24987350.pdf', '06_AG17_21521765.pdf', '06_AG17_23680793.pdf', '06_AG17_24465698.pdf', '06_AG21_22737114.pdf', '06_AG17_22829891.pdf', '06_AG15_21302161.pdf', '06_AG17_22658684.pdf', '06_AG15_22383963.pdf', '06_AG15_22871335.pdf']
['06_AG17_23144814.pdf', '06_AG17_25295527.pdf', '06_AG17_23964067.pdf', '06_AG15_22451482.pdf', '06_AG17_24671172.pdf', '06_AG17_25583608.pdf', '06_AG17_23720537.pdf', '06_AG15_25904779.pdf', '06_AG17_21208964.pdf', '06_AG15_23823467.pdf', '06_AG21_21140173.pdf', '06_AG17_23285026.pdf', '06_AG15_22326304.pdf', '06_AG17_24268410.pdf', '06_AG15_25473014.pdf', '06_AG15_22813429.pdf', '06_AG17_24959128.pdf', '06_AG17_24260454.pdf', '06_AG17_22073126.pdf', '06_AG17_24465671.pdf', '06_AG17_25906330.pdf', '06_AG15_23285293.pdf', '06_AG15_22194878.pdf', '06_AG15_25295000.pdf', '06_AG15_23494975.pdf', '06_AG21_26180556.pdf', '06_AG15_24443662.pdf', '06_AG21_25971400.pdf', '06_AG17_23682688.pdf', '06_AG15_25518726.pdf', '06_AG15_23086590.pdf', '06_AG17_23804622.pdf', '06_AG17_25285620.pdf', '06_AG15_24860486.pdf', '06_AG15_21295497.pdf', '06_AG17_25210738.pdf', '06_AG17_23980141.pdf', '06_AG15_23977268.pdf', '06_AG17_24699795.pdf', '06_AG17_25658822.pdf', '06_AG17_23858436.pdf', '06_AG17_24806404.pdf', '06_AG15_24009575.pdf', '06_AG15_25228869.pdf']

## Add ids to medic db for which we have corresponding pdfs
- 206 pmids written to batch_06_AG_pmids and medic db.

In [43]:
%cd /Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/20160212_rdoc_project
all_found = glob.glob('06_AG*.pdf')
all_found.extend(mk)
all_found.extend(tc_maybe)

/Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/20160212_rdoc_project


In [44]:
pattern = '([0-9]{6,9})'
s = re.compile(pattern).search
medic_ids = set([pmid.group(1) for pmid in map(s, all_found) if pmid])

In [47]:
fname = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp/batch_06_AG_pmids'
print('Writing {} putative Agency ids to {}.'.format(len(medic_ids), fname))

with open(fname, 'w') as f:
    f.write('\n'.join(medic_ids))
    f.write('\n')

Writing 206 putative Agency ids to /Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp/batch_06_AG_pmids.


In [48]:
!medic update --pmid-list {fname} 2> /dev/null
!medic --format tsv write --pmid-list {fname} 2> /dev/null | wc -l

     206
