# ![](../../docs/img/PubCrawl@0.5x.png)

# Overview

1. Download list of PMIDs and metadata based on search term as described in [search process](#Search-Process).
2. Upload list of PMIDs to NCBI FLink (pubmed_pubmed_refs).
3. Download one-to-one mapping from NCBI FLink (csv: citedBy,primaryArticle).
4. Upload result of #3 to network analysis program of choice (e.g. networkx, cytoscape, R, etc.).
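
Step 4 can be tried with the standard library before reaching for a graph package. The sketch below assumes the FLink download has a header row with the two column names given in step 3; which column holds the cited paper is an assumption to check against your actual file. The sample rows are placeholders, not real FLink output, and `count_links` and `edge_list` are hypothetical helper names.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the two-column layout named in step 3.
# These PMIDs are placeholders, not real FLink output.
SAMPLE_CSV = """citedBy,primaryArticle
111,900
111,901
222,900
111,902
"""

def count_links(csv_text, column="citedBy"):
    """Count how often each PMID appears in one column of the edge list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)

def edge_list(csv_text):
    """Return (citedBy, primaryArticle) pairs, ready for a graph library."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["citedBy"], row["primaryArticle"]) for row in reader]

counts = count_links(SAMPLE_CSV)
for pmid, n in counts.most_common(2):
    print(pmid, n)
```

The pairs returned by `edge_list` can be handed to a graph library, for example networkx's `DiGraph.add_edges_from`.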

## Dependencies
* Python 3.5+
* [biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
* [pandas](http://pandas.pydata.org/pandas-docs/stable/index.html)


# Terms of Service and Use

## Frequency, Timing and Registration of E-utility URL Requests

> In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI. If NCBI blocks an IP address, service will not be restored unless the developers of the software accessing the E-utilities register values of the tool and email parameters with NCBI. 

>*See full text at [the NCBI website](https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen).* 
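
The two limits in that guideline (three requests per second, large jobs off-peak) can be checked in code. Below is a minimal sketch; `Throttle` and `is_off_peak` are hypothetical helper names, not part of Biopython, and `is_off_peak` assumes the datetime passed in is already expressed in US Eastern time. Bio.Entrez is documented to apply the three-per-second limit on its own, so a throttle like this mainly matters if you reach the E-utilities by another route.

```python
import time
from datetime import datetime

class Throttle:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

def is_off_peak(eastern_time):
    """True on weekends, or from 9:00 PM up to 5:00 AM on weekdays."""
    if eastern_time.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return eastern_time.hour >= 21 or eastern_time.hour < 5

throttle = Throttle()
for _ in range(3):
    throttle.wait()  # place the E-utility request after this line

print(is_off_peak(datetime(2017, 3, 20, 22, 30)))  # a Monday evening
```

The clock and sleep functions are passed in so the spacing logic can be exercised without actually waiting.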

# Getting Started

## User Input. Enter YOUR email address.

* TODO: Add support for customizing search terms using [search term filters](#i.-List-Possible-Search-Filter-Terms).
* TODO: Add handles for save file names.

In [1]:
email = "adewole_oyalowo@brown.edu"
searchTerm = "traumatic brain injury"

## Import modules

Import modules as described by the BioPython manual.
* SeqIO is not needed for this workflow, so only Entrez is imported.
* Import pandas for use later.

In [2]:
from Bio import Entrez
import pprint
import pandas as pd

pprint = pprint.PrettyPrinter(indent=4).pprint

try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2
    
Entrez.email = email
Entrez.tool = 'pubcrawl via biopython'

# Pre-Search Process

## i. List Possible Search Filter Terms

First, let's generate a list of the possible search term filters. The BioPython cookbook describes how to format that request.

In [3]:
infoHandle = Entrez.einfo(db="pubmed")
infoRecord = Entrez.read(infoHandle)

for field in infoRecord["DbInfo"]["FieldList"]:
    print("%(Name)s, %(FullName)s, %(Description)s" % field)

ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
TITL, Title, Words in title of publication
WORD, Text Word, Free text associated with publication
MESH, MeSH Terms, Medical Subject Headings assigned to publication
MAJR, MeSH Major Topic, MeSH terms of major importance to publication
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
AFFL, Affiliation, Author's institutional affiliation and address
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
SUBS, Supplementary Concept, CAS chemical name or MEDLINE Substance Name
PDAT, Date - Publication, Date of publication
EDAT, Date - Entrez, Date publication first accessible through Entrez
VOL, Volume, Volume number of publication
PAGE, Pagination, Page number(s) of publication
PTYP, Publication Type, Type of publication (e.g., review)
LANG, Language, Language of publication
ISS, Issue, Issue number of publ
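
The field names listed by EInfo can be attached to a search term in square brackets to restrict where the term is matched, which is one way to approach the TODO about search filters. The sketch below is only string formatting; `build_term` and `combine` are hypothetical helpers, and whether PubMed accepts a particular tag should be confirmed by inspecting `QueryTranslation` in the ESearch result.

```python
def build_term(phrase, field=None):
    """Quote a phrase and optionally restrict it to one search field."""
    quoted = '"{}"'.format(phrase)
    if field is None:
        return quoted
    return "{}[{}]".format(quoted, field)

def combine(terms, operator="AND"):
    """Join already-built terms with a boolean operator."""
    return " {} ".format(operator).join("({})".format(t) for t in terms)

title_only = build_term("traumatic brain injury", "Title")
print(title_only)
print(combine([title_only, build_term("review", "Publication Type")]))
```

The resulting string could be assigned to searchTerm in place of the plain phrase.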

## ii. Use Global Query Counts to See How Often the Search Term Appears in Each Database

Although PubMed is the primary interest, PMC sometimes holds more articles than PubMed. Let's use global query counts to get a general sense of which databases our search term appears in. Once again, this is covered in the BioPython cookbook.

In [4]:
egqHandle = Entrez.egquery(term=searchTerm)
egqRecord = Entrez.read(egqHandle)
for row in egqRecord["eGQueryResult"]:
    print('{}: {}'.format(row["DbName"], row["Count"]))

pubmed: 38694
pmc: 52307
mesh: 2
books: 2841
pubmedhealth: 757
omim: 11
ncbisearch: 41
nuccore: 266
nucgss: 0
nucest: 0
protein: 323
genome: 0
structure: 2
taxonomy: 0
snp: 0
dbvar: 1
gene: 315
sra: 113
biosystems: 0
unigene: 6
cdd: 0
clone: 44763
popset: 0
geoprofiles: 39898
gds: 208
homologene: 0
pccompound: 0
pcsubstance: 0
pcassay: 141
nlmcatalog: 663
probe: 0
gap: 105
proteinclusters: 7
bioproject: 48
biosample: 10
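
To see at a glance where the term is most common, the counts can be sorted. The sketch below hard-codes a few of the values printed above rather than calling NCBI again; `rank_databases` is a hypothetical helper. With a live result you would build the same dictionary from `egqRecord["eGQueryResult"]`, converting each `Count` with `int()` first.

```python
# A handful of (database, count) pairs copied from the output above.
counts = {
    "pubmed": 38694,
    "pmc": 52307,
    "books": 2841,
    "clone": 44763,
    "geoprofiles": 39898,
    "gene": 315,
}

def rank_databases(counts, top=3):
    """Return the `top` databases with the most hits, largest first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]

for name, hits in rank_databases(counts):
    print("{}: {}".format(name, hits))
```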


# Search Process

## 1. Use ESearch to grab the list of UIDs based on the search term. 

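ESearch is the equivalent of entering our search term into PubMed's search bar. For this use case, we set usehistory='y' so that the matching UIDs are stored on NCBI's History server; the WebEnv and QueryKey values returned with the results identify that stored set in later calls. Because retmax=0, no IDs are returned in IdList itself.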

In [13]:
searchHandle = Entrez.esearch(
                                db='pubmed',
                                term=searchTerm,
                                retmax=0,
                                retstart=0,
                                sort='relevance',
                                usehistory='y',
                             )

searchResults = Entrez.read(searchHandle)
searchHandle.close()
pprint(searchResults)

{   'Count': '38694',
    'IdList': [],
    'QueryKey': '1',
    'QueryTranslation': '"brain injuries, traumatic"[MeSH Terms] OR ("brain"[All Fields] AND "injuries"[All Fields] AND "traumatic"[All Fields]) OR "traumatic brain injuries"[All Fields] OR ("traumatic"[All Fields] AND "brain"[All Fields] AND "injury"[All Fields]) OR "traumatic brain injury"[All Fields]',
    'RetMax': '0',
    'RetStart': '0',
    'TranslationSet': [{'From': 'traumatic brain injury', 'To': '"brain injuries, traumatic"[MeSH Terms] OR ("brain"[All Fields] AND "injuries"[All Fields] AND "traumatic"[All Fields]) OR "traumatic brain injuries"[All Fields] OR ("traumatic"[All Fields] AND "brain"[All Fields] AND "injury"[All Fields]) OR "traumatic brain injury"[All Fields]'}],
    'TranslationStack': [{'Term': '"brain injuries, traumatic"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '6173', 'Explode': 'Y'}, {'Term': '"brain"[All Fields]', 'Field': 'All Fields', 'Count': '1330790', 'Explode': 'N'}, {'Term': '"injuri
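
When the search is run with usehistory='y', later calls identify the stored result set by the `WebEnv` and `QueryKey` values in the search result rather than by resending IDs. A small helper can gather those into keyword arguments. `history_params` is a hypothetical name, and the `WebEnv` string in the sample is a placeholder, not a real session token.

```python
def history_params(search_results):
    """Pick out the History server identifiers from an ESearch result."""
    missing = [k for k in ("WebEnv", "QueryKey") if k not in search_results]
    if missing:
        raise KeyError("search result lacks {}; was usehistory='y' set?".format(missing))
    return {
        "WebEnv": search_results["WebEnv"],
        "query_key": search_results["QueryKey"],
    }

# Placeholder result in the same shape as the output above.
sample = {"Count": "38694", "QueryKey": "1", "WebEnv": "PLACEHOLDER_WEBENV"}
params = history_params(sample)
print(params)
# Intended use (needs network access): Entrez.esummary(db='pubmed', **params)
```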

## 2. Use ESummary to pull data from other metafields.

ESummary can be used to quickly grab some metadata about the articles we pull. An example output is shown below.

In [14]:
summaryHandle = Entrez.esummary(
                                db='pubmed',
                                query_key=searchResults['QueryKey'],
                                WebEnv=searchResults['WebEnv'],
                                retmax=1,
                                retstart=0,
                                )

summaryResults = Entrez.read(summaryHandle)
summaryHandle.close()
len(summaryResults)

1

In [10]:
pprint(summaryResults[0])

{   'ArticleIds': {   'doi': '10.1016/j.annemergmed.2014.02.003',
                      'eid': '24635991',
                      'medline': [],
                      'pii': 'S0196-0644(14)00100-0',
                      'pubmed': ['24635991'],
                      'rid': '24635991'},
    'AuthorList': ['Dayan PS', 'Holmes JF', 'Schutzman S', 'Schunk J', 'Lichenstein R', 'Foerster LA', 'Hoyle J Jr', 'Atabaki S', 'Miskin M', 'Wisner D', 'Zuspan S', 'Kuppermann N', 'Traumatic Brain Injury Study Group of the Pediatric Emergency Care Applied Research Network (PECARN).'],
    'DOI': '10.1016/j.annemergmed.2014.02.003',
    'ELocationID': 'doi: 10.1016/j.annemergmed.2014.02.003',
    'EPubDate': '2014 Mar 11',
    'ESSN': '1097-6760',
    'FullJournalName': 'Annals of emergency medicine',
    'HasAbstract': 1,
    'History': {   'accepted': '2014/02/03 00:00',
                   'entrez': '2014/03/19 06:00',
                   'medline': ['2014/09/30 06:00'],
                   'pubmed': ['2
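
Fields such as `ArticleIds` and `AuthorList` are nested, which makes them awkward as CSV cells. One option is to flatten each record before building the dataframe. The sketch below uses a trimmed copy of the record above as a plain dict; `flatten_summary` and its output column names are hypothetical choices.

```python
def flatten_summary(record):
    """Reduce one ESummary record to flat, CSV-friendly values."""
    pubmed_ids = record.get("ArticleIds", {}).get("pubmed", [])
    return {
        "pmid": pubmed_ids[0] if pubmed_ids else "",
        "doi": record.get("DOI", ""),
        "journal": record.get("FullJournalName", ""),
        "authors": "; ".join(record.get("AuthorList", [])),
    }

# Trimmed copy of the example record shown above.
record = {
    "ArticleIds": {"doi": "10.1016/j.annemergmed.2014.02.003", "pubmed": ["24635991"]},
    "AuthorList": ["Dayan PS", "Holmes JF", "Schutzman S"],
    "DOI": "10.1016/j.annemergmed.2014.02.003",
    "FullJournalName": "Annals of emergency medicine",
}
flat = flatten_summary(record)
print(flat)
```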

## 3. Store metadata in pandas dataframe and export to csv

We will store this metadata in a pandas dataframe, and then export this dataframe to a csv file to serve as a lookup file.

In [16]:
metadata = [dict(metainfo) for metainfo in summaryResults]

df = pd.DataFrame(metadata)
df

| | ArticleIds | AuthorList | DOI | ELocationID | EPubDate | ESSN | FullJournalName | HasAbstract | History | ISSN | ... | PmcRefCount | PubDate | PubStatus | PubTypeList | RecordStatus | References | SO | Source | Title | Volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'pubmed': ['24635991'], 'medline': [], 'pii':... | [Dayan PS, Holmes JF, Schutzman S, Schunk J, L... | 10.1016/j.annemergmed.2014.02.003 | doi: 10.1016/j.annemergmed.2014.02.003 | 2014 Mar 11 | 1097-6760 | Annals of emergency medicine | 1 | {'pubmed': ['2014/03/19 06:00'], 'medline': ['... | 0196-0644 | ... | 6 | 2014 Aug | ppublish | [Journal Article] | PubMed - indexed for MEDLINE | [] | 2014 Aug;64(2):153-62 | Ann Emerg Med | Risk of traumatic brain injuries in children y... | 64 |


### Chunk, download, and save.

* Be careful about running this cell more than once. The file is opened in write mode ("w"), so each run overwrites any existing file of the same name.
* TODO: add variable for specifying file name.

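In [17]:
import time

count = int(searchResults['Count'])
batchSize = 5000
maxAttempts = 3  # tries per batch before giving up on it


with open("output/pmid_summaries.csv", "w") as outfile:
    
    writeHeader = True
    for start in range(0,count,batchSize):
        end = min(count, start+batchSize)
        print("Going to download record {} to {}".format(start+1, end))

        summaryResults = None
        for attempt in range(1, maxAttempts+1):
            try:
                summaryHandle = Entrez.esummary(
                                                db='pubmed',
                                                query_key=searchResults['QueryKey'],
                                                WebEnv=searchResults['WebEnv'],
                                                retstart=start,
                                                retmax=batchSize,
                                                )
                summaryResults = Entrez.read(summaryHandle)
                summaryHandle.close()
                break
            except HTTPError as err:
                # Pause and retry; a handle from a failed request must not be read.
                print("Attempt {} of {} failed: {}".format(attempt, maxAttempts, err))
                time.sleep(15)

        if summaryResults is None:
            print("Skipping record {} to {} after repeated errors".format(start+1, end))
            continue

        metadata = [dict(metainfo) for metainfo in summaryResults]
        df = pd.DataFrame(metadata)

        # Write the header only with the first batch that succeeds.
        df.to_csv(outfile, index=False, header=writeHeader)
        writeHeader = False 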

Going to download record 1 to 5000
Going to download record 5001 to 10000
Going to download record 10001 to 15000
Going to download record 15001 to 20000
Going to download record 20001 to 25000
Going to download record 25001 to 30000
Going to download record 30001 to 35000
Going to download record 35001 to 38694
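
The start and end arithmetic in the loop is easy to check in isolation. This sketch reproduces the ranges printed above for a count of 38694 and a batch size of 5000; `batch_ranges` is a hypothetical helper name.

```python
def batch_ranges(count, batch_size):
    """Yield (retstart, first_record, last_record) for each batch."""
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)
        yield start, start + 1, end

ranges = list(batch_ranges(38694, 5000))
for retstart, first, last in ranges:
    print("retstart={} covers records {} to {}".format(retstart, first, last))
```

Note that retstart is zero-based while the printed record numbers are one-based, which is why the first batch starts at retstart 0 but reports record 1.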


## 4. Use EFetch to grab PMID list as text.

EFetch can be used to grab full abstracts (where available) or a list of all the PMIDs for the search term. At the moment, I'm interested only in generating a list of PMIDs. This list can later be uploaded to the [NCBI FLink website](https://www.ncbi.nlm.nih.gov/Structure/flink/flink.cgi). Then, by following the pubmed_pubmed_refs link, I can figure out which papers are commonly cited by our list.

* For example, I can download a one-to-one mapping of 38694 pubmed records to their 10000 most commonly cited pubmed records.

* TODO: Batch size.
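
One way to approach the batch-size TODO is to request the ID list in slices and join the pieces. In the sketch below the network call is passed in as a function so the batching logic can be tried offline. `fetch_uids_in_batches` and `fake_fetch` are hypothetical, and the default batch size of 10000 is an assumption, not a value taken from this notebook.

```python
def fetch_uids_in_batches(fetch, count, batch_size=10000):
    """Collect IDs by calling fetch(retstart, retmax) once per slice."""
    uids = []
    for start in range(0, count, batch_size):
        text = fetch(start, batch_size)
        uids.extend(line.strip() for line in text.splitlines() if line.strip())
    return uids

# Offline stand-in for the real request: pretends the result set is 0..24.
def fake_fetch(retstart, retmax):
    stop = min(25, retstart + retmax)
    return "\n".join(str(i) for i in range(retstart, stop)) + "\n"

uids = fetch_uids_in_batches(fake_fetch, count=25, batch_size=10)
print(len(uids), uids[:3], uids[-1])

# With Biopython, fetch could wrap Entrez.efetch(db='pubmed', rettype='uilist',
# retmode='text', retstart=..., retmax=..., WebEnv=..., query_key=...).read()
```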

## 5. Export UIList to CSV
* No need to chunk here (I think)

In [22]:
fetchHandle = Entrez.efetch(
                                db='pubmed',
                                rettype='uilist',
#                                 retmode='xml',
                                query_key=searchResults['QueryKey'],
                                WebEnv=searchResults['WebEnv'],
                                retmax=searchResults['Count'],
                                retstart=0,
                                )

# fetchResults = Entrez.read(fetchHandle)
fetchResults = fetchHandle.read()
fetchHandle.close()
print(fetchResults)

with open("output/uids_for_flink.csv", "w") as outfile:
    outfile.write(fetchResults)

28319562
28319479
28319470
28316901
28316319
28316248
28315797
28315455
28314984
28314903
28314863
28314862
28314621
28314375
28306235
28303478
28303450
28301983
28301956
28301868
28301824
28301451
28299683
28298577
28298549
28298379
28298170
28298143
28298047
28296529
28296528
28296510
28295837
28295524
28294709
28294464
28294336
28293983
28293979
28293977
28293177
28293148
28292694
28292310
28291783
28291466
28291465
28291464
28291463
28291459
28291455
28291098
28291094
28290939
28290938
28290394
28290385
28290384
28290383
28289791
28289648
28289516
28288867
28288648
28288551
28287909
28287384
28287287
28287196
28286721
28285834
28285745
28285571
28285405
28285252
28284950
28284405
28284404
28283963
28283812
28283595
28282851
28282628
28281095
28279706
28279553
28279125
28278592
28278589
28277415
28276866
28275834
28275224
28275059
28274861
28274814
28273811
28273100
28272771
28272763
28272186
28271640
28271316
28271248
28270466
28269841
28269777
28269642
28269506
28269244
28269177
2
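
Before uploading the file to FLink it may be worth checking that every line is a numeric ID and that the number of IDs matches the Count reported by ESearch. A minimal sketch, using the first few IDs from the output above; `parse_uilist` is a hypothetical helper.

```python
def parse_uilist(text):
    """Split uilist text into IDs, rejecting anything non-numeric."""
    ids = [line.strip() for line in text.splitlines() if line.strip()]
    bad = [value for value in ids if not value.isdigit()]
    if bad:
        raise ValueError("unexpected entries: {}".format(bad[:5]))
    return ids

sample = "28319562\n28319479\n28319470\n28316901\n"
ids = parse_uilist(sample)
print("{} IDs, {} unique".format(len(ids), len(set(ids))))
# With the real result, compare len(ids) against int(searchResults['Count']).
```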

## 6. Grab Abstracts (unfinished)

* TODO: Chunk.

In [None]:
fetchHandle = Entrez.efetch(
                                db='pubmed',
#                                 rettype='medline',
                                retmode='xml',
                                query_key=searchResults['QueryKey'],
                                WebEnv=searchResults['WebEnv'],
                                retmax=searchResults['Count'],
                                retstart=0,
                                )

# fetchResults = Entrez.read(fetchHandle)
fetchResults = fetchHandle.read()
fetchHandle.close()
print(fetchResults)
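
Once the XML is downloaded, titles and abstracts can be pulled out with the standard library. The sketch below parses a tiny hand-written document that follows the PubMed article layout (PubmedArticle, MedlineCitation, Article, Abstract, AbstractText). The PMID, title and abstract text in it are placeholders, `extract_abstracts` is a hypothetical helper, and real records carry many more elements.

```python
import xml.etree.ElementTree as ET

# Hand-written placeholder document, not a real PubMed record.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Placeholder title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

def extract_abstracts(xml_text):
    """Return one dict per article with its PMID, title and abstract."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        parts = citation.findall("Article/Abstract/AbstractText")
        articles.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle"),
            # itertext() also picks up text nested inside inline markup
            "abstract": " ".join("".join(p.itertext()).strip() for p in parts),
        })
    return articles

for item in extract_abstracts(SAMPLE_XML):
    print(item)
```

Articles without an abstract simply yield an empty string here, which matches the "where available" caveat for abstracts.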

# Works Referenced

* [BioPython](https://github.com/biopython/biopython)
    * Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878
    * Tutorial and Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
    
* [NCBI](https://www.ncbi.nlm.nih.gov/Structure/flink/flink.cgi)
    * FLink: Frequency weighted links [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information. 2010. Available from: https://www.ncbi.nlm.nih.gov/Structure/flink/flink.cgi
    
* [Pandas](http://pandas.pydata.org/pandas-docs/stable/index.html)
    * Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010) [(publisher link)](http://conference.scipy.org/proceedings/scipy2010/mckinney.html)