# Goal: Retrieve close and distant articles vs RDoC terms.

We want articles that range from distant to increasingly close to the RDoC terms we are using.

These will serve as 'negative' article examples for a negative training set for DeepDive.

The negative articles will help establish:
- DeepDive **response to distance** of negative training set vs. term of interest articles.
- DeepDive **response to increasing depth** of negative training set.

## Some notes on using MESH in searches
(See appendix for additional resources)

Using only the MESH terms is too strict for major category terms.

The following expanded searches were generated by entering each short search into pubmed. Each is followed by its result count.

(psychology or psychiatry):

- (("psychology"[Subheading] OR "psychology"[All Fields] OR "psychology"[MeSH Terms]) OR ("psychiatry"[MeSH Terms] OR "psychiatry"[All Fields])) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])
- 318188

human AND (psychology or psychiatry):

- (("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND (("psychology"[Subheading] OR "psychology"[All Fields] OR "psychology"[MeSH Terms]) OR ("psychiatry"[MeSH Terms] OR "psychiatry"[All Fields]))) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])
- 249889

human AND disease:
- (("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND ("disease"[MeSH Terms] OR "disease"[All Fields])) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])
- 579517
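
The expanded queries above share two fixed parts: the human clause and the abstract/date filter. As a minimal sketch (the names `HUMAN`, `WINDOW`, `term_clause` and `human_query` are ours, not part of pubmed or Biopython), the 'human AND disease' string could be assembled like this:

```python
# Fixed parts shared by the searches in this notebook.
HUMAN = '("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields])'
WINDOW = '(hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])'

def term_clause(term):
    """Match a term either as a MeSH term or anywhere in the record."""
    return '("{0}"[MeSH Terms] OR "{0}"[All Fields])'.format(term)

def human_query(term):
    """Human AND term, limited to records with abstracts in the date window."""
    return '({} AND {}) AND {}'.format(HUMAN, term_clause(term), WINDOW)

disease_query = human_query('disease')
print(disease_query)
```

This helper does not reproduce the psychology query exactly, because that one also includes a `[Subheading]` clause.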

In [1]:
dest_dir = '/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp'
%mkdir {dest_dir}
%cd {dest_dir}

mkdir: /Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp: File exists
/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp


In [5]:
# Previous pmids already retrieved.
# #from __future__ import print_function
# import glob
# import os
# import re
# prev_annotated_pdfs_dir = '/Users/ccarey/Documents/Projects/NAMI/rdoc/pdfs/all_pdfs_annotated_pmid_names/*.pdf'
# pdfs = glob.glob(prev_annotated_pdfs_dir)
# pdfs = [os.path.basename(pdf) for pdf in pdfs]

In [2]:
from __future__ import print_function
from Bio import Entrez
from subprocess import check_call
from shutil import copy2
Entrez.email = "charlieccarey@gmail.com"

def narrow_id_list(found_ids, omit_ids):
    """Return the ids in found_ids that are not in omit_ids (order is not preserved)."""
    found_but_omit = list(set(found_ids) & set(omit_ids))
    found_and_keep = list(set(found_ids) - set(omit_ids))
    print('Removed this many ids: {}'.format(len(found_but_omit)))
    return found_and_keep

def pubmed_central_search_to_pubmed_id(search_string, retmax=20):
    """Print the pubmed hit count for search_string and return up to retmax pubmed ids.

    Despite the name, the search runs against pubmed (db="pubmed"), not pubmed central.
    """
    # verify how many records match (printed so it can be compared against retmax)
    handle = Entrez.egquery(term=search_string)
    record = Entrez.read(handle)
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            print(row["Count"])
    # fetch the ids for those records
    handle = Entrez.esearch(db="pubmed", retmax=retmax, term=search_string)
    record = Entrez.read(handle)
    pubmed_ids = record["IdList"]
    return pubmed_ids

def search_and_summarize(search_name, query, omit_ids=None):
    """Search pubmed with query and return the ids that are not in omit_ids."""
    ids = pubmed_central_search_to_pubmed_id(query, retmax=1000000)
    if omit_ids is not None:
        new_ids = narrow_id_list(ids, omit_ids)
    else:
        new_ids = ids
    print('{} search of pubmed found {} ids of which {} are new'.format(search_name, len(ids), len(new_ids)))
    return new_ids
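
`search_and_summarize` asks for up to 1000000 ids in a single request. If a single response is ever capped, the same list can be gathered in pages. A hedged sketch: `page_offsets` and `collect_ids` are hypothetical helpers, and `fetch_page` stands for any function that returns one page of ids, for example a wrapper around `Entrez.esearch(db="pubmed", term=..., retstart=..., retmax=...)`. A stand-in fetcher is used here so the block runs without network access.

```python
def page_offsets(total, page_size):
    """Return (retstart, retmax) pairs that together cover `total` records."""
    return [(start, min(page_size, total - start))
            for start in range(0, total, page_size)]

def collect_ids(fetch_page, total, page_size=10000):
    """Call fetch_page(retstart, retmax) for each page and concatenate the ids."""
    ids = []
    for retstart, retmax in page_offsets(total, page_size):
        ids.extend(fetch_page(retstart, retmax))
    return ids

# Stand-in for a real esearch wrapper: serves slices of a made-up id list.
fake_ids = [str(n) for n in range(25)]

def fake_fetch(retstart, retmax):
    return fake_ids[retstart:retstart + retmax]

all_ids = collect_ids(fake_fetch, total=len(fake_ids), page_size=10)
print(len(all_ids))
```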

In [4]:
AR00 = '("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND ("arousal"[All Fields] OR "arousal"[MeSH Terms]) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])'
AP00 = '("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND ("auditory perception"[All Fields] OR "auditory perception"[MeSH Terms]) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])'
psych = '(("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND (("psychology"[Subheading] OR "psychology"[All Fields] OR "psychology"[MeSH Terms]) OR ("psychiatry"[MeSH Terms] OR "psychiatry"[All Fields]))) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])'
disease = '(("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND ("disease"[MeSH Terms] OR "disease"[All Fields])) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])'

In [30]:
AR00_ids = search_and_summarize(search_name='AR00', query=AR00)
omit_ids = None
print('Arousal search omit n ids: 0')
print('Arousal ids: {}.'.format(len(AR00_ids)))

print()
omit_ids = list(AR00_ids)
print('psyc search omit n ids: {}.'.format(len(omit_ids)))
psyc_ids = search_and_summarize(search_name='psych', query=psych, omit_ids=omit_ids)
print('psyc specific ids: {}.'.format(len(psyc_ids)))

print()
omit_ids.extend(psyc_ids)
print('disease search omit n ids: {}.'.format(len(omit_ids)))
diss_ids = search_and_summarize(search_name='disease', query=disease, omit_ids=omit_ids)
print('disease specific ids: {}.'.format(len(diss_ids)))

19033
AR00 search of pubmed found 19033 ids of which 19033 are new
Arousal search omit n ids: 0
Arousal ids: 19033.

psyc search omit n ids: 19033.
249889
Removed this many ids: 12206
psych search of pubmed found 249889 ids of which 237683 are new
psyc specific ids: 237683.

disease search omit n ids: 256716.
579517
Removed this many ids: 38950
disease search of pubmed found 579517 ids of which 540567 are new
disease specific ids: 540567.
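
The cell above builds progressively more distant tiers: each search keeps only the ids not already claimed by a closer tier. A small sketch of that bookkeeping on made-up ids (`disjoint_tiers` is a hypothetical helper, not part of this notebook), followed by a consistency check of the counts printed above:

```python
def disjoint_tiers(tiers):
    """Given (name, ids) pairs ordered closest first, drop from each tier
    any id already present in an earlier tier."""
    seen = set()
    result = []
    for name, ids in tiers:
        new_ids = [i for i in ids if i not in seen]
        seen.update(new_ids)
        result.append((name, new_ids))
    return result

# Made-up ids for illustration only.
toy = [('arousal', ['1', '2', '3']),
       ('psych', ['2', '3', '4', '5']),
       ('disease', ['1', '5', '6'])]
toy_result = disjoint_tiers(toy)
for name, ids in toy_result:
    print(name, ids)

# Consistency check of the counts reported in the output above.
assert 249889 - 12206 == 237683   # psych ids left after removing arousal overlap
assert 19033 + 237683 == 256716   # ids omitted from the disease search
assert 579517 - 38950 == 540567   # disease ids left after removing both
```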


In [5]:
AP00_ids = search_and_summarize(search_name='auditory_perception', query=AP00)
print('auditory perception ids: {}.'.format(len(AP00_ids)))

10385
auditory_perception search of pubmed found 10385 ids of which 10385 are new
auditory perception ids: 10385.


## Save random samples of pmids to lists and corresponding abstracts to medic database

In [8]:
import random

In [None]:
random.seed(1000)
AR00_1000 = random.sample(AR00_ids, 1000)
AP00_1000 = random.sample(AP00_ids, 1000)
psyc_1000 = random.sample(psyc_ids, 1000)
diss_1000 = random.sample(diss_ids, 1000)
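
One caveat on reproducibility: `random.seed` only fixes the draw if the input list arrives in the same order each time. `narrow_id_list` returns `list(set(...))`, whose order is not guaranteed to be stable across interpreters or runs. A sketch of one way around this, sorting before sampling (`reproducible_sample` is a hypothetical helper, and the ids are made up):

```python
import random

def reproducible_sample(ids, k, seed):
    """Sample k ids so that the result depends only on the id set and the seed."""
    rng = random.Random(seed)  # private generator; the global one is untouched
    return rng.sample(sorted(ids), k)

ids = [str(n) for n in range(100)]  # made-up ids
shuffled = list(ids)
random.shuffle(shuffled)            # same ids, different order

a = reproducible_sample(ids, 10, seed=1000)
b = reproducible_sample(shuffled, 10, seed=1000)
print(a == b)  # prints True
```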

In [None]:
# Write each sample as one pmid per line. Text mode ('w') works in Python 2 and 3.
samples = [('./AR00_1000_ids', AR00_1000),
           ('./AP00_1000_ids', AP00_1000),
           ('./psyc_1000_ids', psyc_1000),
           ('./diss_1000_ids', diss_1000)]
for path, ids in samples:
    with open(path, 'w') as f:
        for item in ids:
            f.write(item + '\n')
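
A quick read-back check can confirm that a written list holds the expected number of distinct pmids. This is only a sketch: `read_ids` is a hypothetical helper, and the demonstration uses a temporary file with made-up ids rather than the real lists.

```python
import os
import tempfile

def read_ids(path):
    """Read one id per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Demonstration on a temporary file with made-up ids.
demo_ids = ['111', '222', '333']
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_ids')
with open(demo_path, 'w') as f:
    for item in demo_ids:
        f.write(item + '\n')

loaded = read_ids(demo_path)
print(len(loaded), len(set(loaded)))
```

For the real lists, both numbers should be 1000.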
        

### Later, we desired additional batches of a couple terms
#### More Arousal

In [None]:
random.seed(2000)
AR00_1200_batch2 = random.sample(AR00_ids, 1200) # we'll trim this to unique new Arousal later

In [None]:
with open('./AR00_1200_batch2_ids', 'w') as f:
    for item in AR00_1200_batch2:
        f.write(item + '\n')

#### More Auditory Perception

In [20]:
random.seed(2000)
AP00_1400_batch2 = random.sample(AP00_ids, 1400) # we'll trim this to unique new Auditory Perception later

In [21]:
with open('./AP00_1400_batch2_ids', 'w') as f:
    for item in AP00_1400_batch2:
        f.write(item + '\n')

## Pull abstracts into a database using Python medic package

In [13]:
%cd tasks/task_data_temp/
!medic --format tsv write ALL 2> /dev/null | cut -f1 | wc -l
!medic update --pmid-lists ./AP00_1000_ids 2> /dev/null

/Users/ccarey/Documents/Projects/NAMI/rdoc/tasks/task_data_temp
    3359


In [14]:
!medic --format tsv write ALL 2> /dev/null | cut -f1 | wc -l

    4337


In [None]:
!medic --format tsv write ALL 2> /dev/null | cut -f1 | wc -l
!medic update --pmid-lists ./AR00_1000_ids 2> /dev/null
!medic --format tsv write ALL 2> /dev/null | cut -f1 | wc -l
!medic update --pmid-lists ./AP00_1000_ids 2> /dev/null
!medic --format tsv write ALL 2> /dev/null | cut -f1 | wc -l
!medic update --pmid-lists ./psyc_1000_ids 2> /dev/null
!medic --format tsv write ALL 2> /dev/null | cut -f1 | wc -l
!medic update --pmid-lists ./diss_1000_ids 2> /dev/null
!medic --format tsv write ALL 2> /dev/null | cut -f1 | wc -l

In [None]:
!medic --format tsv write --pmid-lists ./diss_1000_ids 2> /dev/null | cut -f1 | wc -l

### Add to medic database the additional batches

In [22]:
#!medic update --pmid-lists ./AR00_1200_batch2_ids 2> /dev/null
!medic update --pmid-lists ./AP00_1400_batch2_ids 2> /dev/null

### Eliminate overlap between set 1 and the additional batches to get uniquely new sets for the additional batches
Note: the permanent location for the good lists is ../task_data_pmids.

In [None]:
# print(len(ar1))
# print(len(ar2))
# len(set(ar2).intersection(ar1))
# uniq_ar2 = [a for a in ar2 if not a in ar1]
# print(len(uniq_ar2))
# with open('../task_data_pmids/AR00_1000_batch2_ids', 'wb') as f:
#     for item in uniq_ar2[0:1000]:
#         f.write(item + '\n')

In [26]:
def unique_1000(list1, list2):
    """Returns unique list of 1000 elements in list1 that are not in list2.
    
    Presumes at least 1000 items in list1.
    
    Preserves ordering in list1.
    """
    omit = set(list2)  # set membership keeps the filter fast for long lists
    uniq_list1 = [item for item in list1 if item not in omit]
    print('Found {} items unique to list1 from {} total items in list1 and {} total items in list2'.format(len(uniq_list1), len(list1), len(list2)))
    if len(uniq_list1) > 1000:
        uniq_list1 = uniq_list1[0:1000]
    print('Returning {} items from list1'.format(len(uniq_list1)))
    return(uniq_list1)

#### reduce additional arousal to 1000 unique new terms

In [None]:
ar1 = !medic --format tsv write --pmid-lists ../task_data_pmids/AR00_1000_ids 2> /dev/null | cut -f1 
ar2 = !medic --format tsv write --pmid-lists ./AR00_1200_batch2_ids 2> /dev/null | cut -f1

In [None]:
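uniq_ar2 = unique_1000(ar2, ar1)  # otherwise uniq_ar2 is only defined in the commented-out cell above
with open('../task_data_pmids/AR00_1000_batch2_ids', 'w') as f: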
    for item in uniq_ar2[0:1000]:
        f.write(item + '\n')

#### reduce additional auditory perception to 1000 unique new terms

In [23]:
ap1 = !medic --format tsv write --pmid-lists ../task_data_pmids/AP00_1000_ids 2> /dev/null | cut -f1 
ap2 = !medic --format tsv write --pmid-lists ./AP00_1400_batch2_ids 2> /dev/null | cut -f1

In [30]:
uniq_ap2 =  unique_1000(ap2, ap1)

Found 1257 items unique to list1 from 1400 total items in list1 and 1000 total items in list2
Returning 1000 items from list1
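
The overlap seen here (1400 - 1257 = 143 ids shared with the first batch) is about what chance predicts. For two independent random samples of sizes k1 and k2 drawn from the same N ids, the expected overlap is k1 * k2 / N. A quick check using the counts reported earlier (10385 auditory perception ids, 19033 arousal ids); `expected_overlap` is a hypothetical helper:

```python
def expected_overlap(k1, k2, n):
    """Expected number of shared items between two independent random
    samples of sizes k1 and k2 drawn without replacement from n items."""
    return k1 * k2 / float(n)

ap = expected_overlap(1000, 1400, 10385)
ar = expected_overlap(1000, 1200, 19033)
print('auditory perception: {:.1f} expected, 143 observed'.format(ap))
print('arousal: {:.1f} expected'.format(ar))
```

The expected overlaps are roughly 135 and 63, so batches of 1400 and 1200 leave comfortably more than 1000 new ids in each case.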


In [31]:
with open('../task_data_pmids/AP00_1000_batch2_ids', 'w') as f:
    for item in uniq_ap2[0:1000]:
        f.write(item + '\n')

# Appendices: Description of pubmed central resources

## A.0 Portion of pubmed devoted to psychology and psychiatry
Contrast searches of pubmed over the most recent 5 years: human vs. (psychology or psychiatry).

Last 5 years, with abstract:

- human:
  - ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])
  - 2267867
- psychology or psychiatry:
  - (("psychology"[Subheading] OR "psychology"[All Fields] OR "psychology"[MeSH Terms]) OR ("psychiatry"[MeSH Terms] OR "psychiatry"[All Fields])) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])
  - 318188
- human AND (psychology or psychiatry)
  - ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND (("psychology"[Subheading] OR "psychology"[All Fields] OR "psychology"[MeSH Terms]) OR ("psychiatry"[MeSH Terms] OR "psychiatry"[All Fields])) AND (hasabstract[text] AND "2011/01/31"[PDat] : "2016/01/29"[PDat])
  - 249889

Approximately 11% (249889 / 2267867) of the human pubmed literature in this window is human psychology or psychiatry.

## A.1 pubmed resources for programmatic retrieval

There are official services useful for article retrieval from pubmed central.

### pubmed central programmatic access

See the directions at pubmed central on programmatic access:

- http://www.ncbi.nlm.nih.gov/pmc/oai
- http://www.ncbi.nlm.nih.gov/pmc/tools/oa-service/

base urls for retrieval:  
- http://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi
- http://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi
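
As a sketch of how requests against these base urls are formed (no request is sent here): the OA service takes an `id` parameter holding a PMCID, and the OAI service follows the standard OAI-PMH `GetRecord` form. The parameter values below reflect our reading of the directions linked above and should be checked against them; `oa_url` and `oai_url` are hypothetical helpers.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

OA_BASE = 'http://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi'
OAI_BASE = 'http://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi'

def oa_url(pmcid):
    """URL asking the OA service where the files for one PMCID live."""
    return OA_BASE + '?' + urlencode([('id', pmcid)])

def oai_url(pmc_number, metadata_prefix='pmc'):
    """OAI-PMH GetRecord URL for one article (metadata_prefix is an assumption)."""
    params = [('verb', 'GetRecord'),
              ('identifier', 'oai:pubmedcentral.nih.gov:{}'.format(pmc_number)),
              ('metadataPrefix', metadata_prefix)]
    return OAI_BASE + '?' + urlencode(params)

print(oa_url('PMC13900'))
print(oai_url(13900))
```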

### pubmed listings of open access articles available as text or pdf

#### Useful Entrez terms

    "open access"[filter] - finds PMC articles that are in the OA subset
    "has pdf"[filter] - finds all PMC articles that have PDF files (including those not in the OA subset)
    "oa full text xml"[filter] - finds those OA subset articles that have XML

#### ftp service (can get pdf along with article XML, or only XML).
 - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/
 - (directions) http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
 - http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/#Finding_Data
 - Set the FTP client's TCP buffer size to 32 MB (see the ftp directions above)

#### Associating pubmed id to pubmed central id (PMID to PMCID).
- ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
- http://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/
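
A sketch of loading the csv mapping into a dictionary, written for Python 3. It relies on the `PMID` and `PMCID` columns of the file header and skips rows with an empty PMID. `load_pmid_to_pmcid` is a hypothetical helper, and the demonstration writes a small gzipped file holding only the header and first two rows shown in A.3.

```python
import csv
import gzip
import os
import tempfile

def load_pmid_to_pmcid(path):
    """Map PMID -> PMCID from a gzipped PMC-ids csv, skipping rows without a PMID."""
    mapping = {}
    with gzip.open(path, 'rt', newline='') as f:
        for row in csv.DictReader(f):
            if row['PMID']:
                mapping[row['PMID']] = row['PMCID']
    return mapping

# Small demonstration file: header plus the first two rows of PMC-ids.csv.gz.
sample = (
    'Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date\n'
    'Breast Cancer Res,1465-5411,1465-542X,2000,3,1,55,,PMC13900,11250746,,live\n'
    'Breast Cancer Res,1465-5411,1465-542X,2000,3,1,61,,PMC13901,11250747,,live\n'
)
demo_path = os.path.join(tempfile.mkdtemp(), 'PMC-ids-demo.csv.gz')
with gzip.open(demo_path, 'wt') as f:
    f.write(sample)

pmid_to_pmcid = load_pmid_to_pmcid(demo_path)
print(pmid_to_pmcid['11250746'])  # prints PMC13900
```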

## A.2 Description of pubmed vs. pubmed central corpuses and available full text.

These are kept quite up to date, so exact counts will vary.

The point is to show a few variations on our searches as it relates to the pubmed or pubmed central literature in general.

#### pubmed
- ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND "loattrfree full text"[sb]
- n = 2973711


#### pubmed central searches, counts of results.

pubmedcentral search:
- These all have free articles.
- ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields])
- n = 1753712

pubmedcentral search limit to free full text:
- These all have full text (but not necessarily *free* full text?)
- ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND "free full text"[Filter]
- n = 2396446

pubmedcentral search 'open access':
- Distinguished from 'full text xml' because not all are free text xml?
- ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND "open access"[Filter]
- n = 803601

**pubmedcentral open access and full text:** 
- **(DECIDED THIS IS OUR DESIRED SEARCH SPACE)**
- These all have free articles AND the full text is retrievable by xml.
- ("humans"[MeSH Terms] OR "humans"[All Fields] OR "human"[All Fields]) AND "open access"[Filter] AND "oa full text xml"[Filter]
- n = 763976

pubmedcentral open access and full text but **suboptimal** (forgetting to include 'human' singular):
- lose about 1/3 of articles.
- ("humans"[MeSH Terms] OR "humans"[All Fields]) AND "oa full text xml"[Filter]
- n = 504974



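## A.3 Description of pubmed central id list mappings as files.
Note on this conversion list:
- pmc ids are unique (one per row).
- pmids are not: there are fewer distinct pmid values than rows (see the R counts below), so some pmid values repeat across pmc ids (possibly including blank pmids).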

In [1]:
%cd /Users/ccarey/Documents/Projects/NAMI/rdoc
# !pwd
# !mkdir -p ./data/pubmed_central_open_access
%cd ./data/pubmed_central_open_access
# !wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz
!zgrep . PMC-ids.csv.gz | head -3

# # R
# t <- read.csv('PMC-ids.csv.gz')
# > nrow(t)
# [1] 3877841
# > length(unique(t$PMCID))
# [1] 3877841
# > length(unique(t$PMID))
# [1] 3373788

/Users/ccarey/Documents/Projects/NAMI/rdoc
/Users/ccarey/Documents/Projects/NAMI/rdoc/data/pubmed_central_open_access
Journal Title,ISSN,eISSN,Year,Volume,Issue,Page,DOI,PMCID,PMID,Manuscript Id,Release Date
Breast Cancer Res,1465-5411,1465-542X,2000,3,1,55,,PMC13900,11250746,,live
Breast Cancer Res,1465-5411,1465-542X,2000,3,1,61,,PMC13901,11250747,,live


## A.4 Description of full corpuses available by pubmed central ftp

### Maybe more useful (can validate pmid etc.)

Full text extracted either from the xml, or if article only available as pdf, then pmc has extracted the open access text from the pdf.

.nxml format; has pmc and other ids. Abstracts and text include lots of markup. Figures are referenced by href off of a base url?

    <article-id pub-id-type="pmid">22558545</article-id>
    <article-id pub-id-type="pmc">3339583</article-id>
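
A sketch of pulling these ids out of .nxml content with the standard library. The wrapper elements in the sample string are a stripped-down stand-in for a real article, and `article_ids` is a hypothetical helper. Real files may need extra care (for example, sub-articles that carry their own ids).

```python
import xml.etree.ElementTree as ET

def article_ids(root):
    """Return {pub-id-type: id} for every article-id element under root."""
    return {el.get('pub-id-type'): el.text for el in root.iter('article-id')}

# Minimal stand-in for an .nxml document.
sample = """<article><front><article-meta>
<article-id pub-id-type="pmid">22558545</article-id>
<article-id pub-id-type="pmc">3339583</article-id>
</article-meta></front></article>"""

ids = article_ids(ET.fromstring(sample))
print(ids['pmid'], ids['pmc'])
```

For a file on disk, `ET.parse(path).getroot()` would supply the root element in place of `ET.fromstring(sample)`.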

full XML, includes pdf and images if relevant (2-6 Gb):
    - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.A-B.tar.gz
    - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.C-H.tar.gz
    - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.I-N.tar.gz
    - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.O-Z.tar.gz

### Maybe easier for text analysis.

The extracted full text is not as useful because it does not include the metadata (pubmed id etc.) that we could get elsewhere.

full extracted text archive files (2-6Gb) 
    - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.txt.0-9A-B.tar.gz
    - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.txt.C-H.tar.gz
    - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.txt.I-N.tar.gz
    - ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/articles.txt.O-Z.tar.gz

Example article:
- articles.txt.0-9A-B/Breast_Cancer_Res/Breast_Cancer_Res_2000_Dec_17_2\(1\)_15-21.txt 

### ID mappings again, but with useful folder location within online or downloaded corpus.
Column 1 refers to ftp archive folder location.

In [10]:
%cd /Users/ccarey/Documents/Projects/NAMI/rdoc/data/pubmed_central_open_access
# !wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.txt
# !wget ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.pdf.txt
!head -3 file_list.txt
# !zgrep -c . file_list.txt # 1202748
!head -3 file_list.pdf.txt
# !zgrep -c . file_list.pdf.txt # 1119897

/Users/ccarey/Documents/Projects/NAMI/rdoc/data/pubmed_central_open_access
2016-01-20 11:27:32
08/e0/Breast_Cancer_Res_2001_Nov_2_3(1)_55-60.tar.gz	Breast Cancer Res. 2001 Nov 2; 3(1):55-60	PMC13900	PMID:11250746
b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz	Breast Cancer Res. 2001 Nov 9; 3(1):61-65	PMC13901	PMID:11250747
2016-01-20 11:27:32
08/e0/BCR-3-1-055.PMC13900.pdf	Breast Cancer Res. 2001 Nov 2; 3(1):55-60	PMC13900	PMID:11250746
b0/ac/BCR-3-1-061.PMC13901.pdf	Breast Cancer Res. 2001 Nov 9; 3(1):61-65	PMC13901	PMID:11250747
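
A sketch of parsing these listings. Each data line is tab-separated (path, citation, PMCID, PMID) and the first line is a timestamp. `parse_file_list` is a hypothetical helper; joining the path onto ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ is our assumption about how column 1 resolves and should be checked against the ftp directions.

```python
FTP_BASE = 'ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/'  # assumed base for column 1

def parse_file_list(lines):
    """Yield one dict per tab-separated data line, skipping the timestamp header."""
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4:
            continue  # timestamp line or malformed row
        path, citation, pmcid, pmid = fields[:4]
        yield {'url': FTP_BASE + path,
               'citation': citation,
               'pmcid': pmcid,
               'pmid': pmid.replace('PMID:', '')}

# First two lines of file_list.txt as shown above.
sample = [
    '2016-01-20 11:27:32\n',
    '08/e0/Breast_Cancer_Res_2001_Nov_2_3(1)_55-60.tar.gz\t'
    'Breast Cancer Res. 2001 Nov 2; 3(1):55-60\tPMC13900\tPMID:11250746\n',
]
records = list(parse_file_list(sample))
print(records[0]['pmcid'], records[0]['pmid'])
```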


## A.5 MESH
https://www.nlm.nih.gov/bsd/disted/meshtutorial/meshtreestructures/

MESH Subheadings under psychology and psychiatry
https://www.nlm.nih.gov/mesh/2016/mesh_browser/MeSHtree.F.html#link_id
Top-level categories of the MeSH tree Psychiatry and Psychology [F]:

    Behavior and Behavior Mechanisms [F01]  +
    Psychological Phenomena and Processes [F02]  +
    Mental Disorders [F03]  +
    Behavioral Disciplines and Activities [F04]  + 


## A.6 Misc resources and notes

### Programmatic OAI noted to not give MESH nor author affiliation
http://blog.humaneguitarist.org/2012/05/27/awesome-sauce-augmenting-pubmed-centrals-oai-response/

### To get MESH, blogger noted xml from efetch does it.
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=12654674&retmode=xml

Note: when in pubmed, unfolding 'Publication Types, MeSH Terms, Substances' on a record shows the corresponding fields.

XML has
- MeSH Terms : MeshHeadingList
- Substances : ChemicalList

Do we know that extended mesh concepts are included?


## A.7 MESH from python medic database using python medic package

To get MESH terms from medic, write the record with medic --format full and select the MH rows.

Note: the '*' character should denote a major topic.

### Example:
    
    medic write --format full 21050743 | grep '^PMID\|^MH'
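
A sketch of picking major topics out of such output. It assumes MEDLINE-style rows of the form `MH  - Heading/subheading`, with '*' marking the major topic; the sample rows are made up for illustration and are not the real record for 21050743. `mesh_terms` and `major_topics` are hypothetical helpers.

```python
def mesh_terms(lines):
    """Return the text of each MH row (everything after the '-' separator)."""
    terms = []
    for line in lines:
        # Compare the 4-character tag so that other tags starting with MH are ignored.
        if line[:4].strip() == 'MH' and '-' in line:
            terms.append(line.split('-', 1)[1].strip())
    return terms

def major_topics(lines):
    """Keep only the MH rows flagged with '*'."""
    return [term for term in mesh_terms(lines) if '*' in term]

sample = [
    'PMID- 00000000',             # made-up record
    'MH  - Humans',
    'MH  - Arousal/*physiology',
    'MH  - *Auditory Perception',
]
print(major_topics(sample))
```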