# ![](../../docs/img/PubCrawl@0.5x.png)

# Overview

* This notebook is for keeping track of things I want to try out in BioPython. 

TODO:
1. Grab all abstracts for text mining or NLP.
2. Generate a list of references to work around FLink's 100,000-link maximum.
3. Look at differences between PubMed and PMC.

## Dependencies
* Python 3.5+
* [biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
* [pandas](http://pandas.pydata.org/pandas-docs/stable/index.html)


# Terms of Service and Use

## Frequency, Timing and Registration of E-utility URL Requests

> In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Failure to comply with this policy may result in an IP address being blocked from accessing NCBI. If NCBI blocks an IP address, service will not be restored unless the developers of the software accessing the E-utilities register values of the tool and email parameters with NCBI. 

>*See full text at [the NCBI website](https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen).* 

# Getting Started

## User Input. Enter YOUR email address.

* TODO: Add support for customizing search terms using [search term filters](#1.-list-possible-search-filter-terms).

In [None]:
email = "adewole_oyalowo@brown.edu"
searchTerm = "concussion"
# excludeReviews = False
# excludeSystematicReviews = False
# relDate = 0

## Import modules

In [None]:
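from pprint import PrettyPrinter

import pandas as pd
from Bio import Entrez, SeqIO

# Pretty-printer for nested Entrez results (imported this way so the
# pprint module itself is not shadowed).
pprint = PrettyPrinter(indent=4).pprint

try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2

# NCBI asks clients to identify themselves with an email address and a tool name.
Entrez.email = email
Entrez.tool = 'pubcrawl via biopython'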
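
The usage guidelines quoted earlier recommend at most three requests per second, and `HTTPError` is imported above but not yet used. The following is a minimal sketch of one way to combine the two: a small throttle plus a retry loop for server-side errors. The names `Throttle` and `call_politely` are hypothetical helpers, not part of Biopython, and the retry policy (retry only 5xx responses, doubling the delay) is an assumption rather than an NCBI requirement.

```python
import time
from urllib.error import HTTPError


class Throttle:
    """Space calls so that at most `max_per_second` happen per second."""

    def __init__(self, max_per_second=3.0):
        self.min_interval = 1.0 / max_per_second
        self._last_call = None

    def wait(self):
        """Sleep just long enough to keep the minimum spacing between calls."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


def call_politely(request, throttle, attempts=3, backoff_seconds=1.0):
    """Run `request()` under the throttle, retrying HTTP 5xx errors."""
    for attempt in range(1, attempts + 1):
        throttle.wait()
        try:
            return request()
        except HTTPError as err:
            # Client errors (4xx) and the final failed attempt are re-raised.
            if err.code < 500 or attempt == attempts:
                raise
            time.sleep(backoff_seconds)
            backoff_seconds *= 2  # wait longer before each further retry
```

In this notebook, a call might then look like `call_politely(lambda: Entrez.esearch(db='pubmed', term=searchTerm), Throttle())`. This matters mostly for long loops of requests, such as chunked downloads.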

# Pre-Search Process

## 1. List Possible Search Filter Terms

First, let's generate a list of the possible search term filters.

In [None]:
infoHandle = Entrez.einfo(db="pubmed")
infoRecord = Entrez.read(infoHandle)

for field in infoRecord["DbInfo"]["FieldList"]:
    print("%(Name)s, %(FullName)s, %(Description)s" % field)
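
The field names listed by EInfo can be attached to a search term in square brackets, which is one way the commented-out `excludeReviews` option from the user-input cell might eventually be supported. The sketch below is only an illustration: `build_query` and its parameters are hypothetical, and it assumes the `Title/Abstract` and `Publication Type` field names that PubMed reports.

```python
def build_query(term, field=None, exclude_reviews=False):
    """Compose a PubMed query string from a term and optional filters."""
    # Restrict the term to one field, e.g. concussion[Title/Abstract].
    query = "{}[{}]".format(term, field) if field else term
    if exclude_reviews:
        # Drop records whose publication type is Review.
        query = "({}) NOT Review[Publication Type]".format(query)
    return query


print(build_query("concussion", field="Title/Abstract", exclude_reviews=True))
# (concussion[Title/Abstract]) NOT Review[Publication Type]
```

The resulting string could be passed as `term=` to `Entrez.esearch` in place of the bare `searchTerm`.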

## 2. Use Global Query Counts to return a list of the number of times the search term shows up for each database.

Next, although we are primarily interested in PubMed, PMC sometimes has more articles than PubMed. Let's use global query counts to get a general sense of which databases our search term appears in.

In [None]:
egqHandle = Entrez.egquery(term=searchTerm)
egqRecord = Entrez.read(egqHandle)
for row in egqRecord["eGQueryResult"]:
    print('{}: {}'.format(row["DbName"], row["Count"]))

# Search Process

## 3. Use ESearch to grab the list of UIDs based on the search term. 

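ESearch is the equivalent of entering our search term into PubMed's search bar. For this use case, we set `usehistory='y'` so that the UIDs are cached in a web session on NCBI's history server, and `retmax=0` so that no UIDs are returned in the response itself. The response then carries only the count plus the `QueryKey` and `WebEnv` values that later calls use to refer to the cached set.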

In [None]:
searchHandle = Entrez.esearch(
                                db='pubmed',
                                term=searchTerm,
                                retmax=0,
                                retstart=0,
                                sort='relevance',
                                usehistory='y',
    
                             )

searchResults = Entrez.read(searchHandle)
searchHandle.close()
pprint(searchResults)

## 5. Use EFetch to grab abstracts or PMID list as text.

EFetch can be used to grab full abstracts (where available) or a list of all the PMIDs for the search term. At the moment, I'm interested only in generating a list of PMIDs. This list can later be uploaded to the [NCBI FLink website](https://www.ncbi.nlm.nih.gov/Structure/flink/flink.cgi). Then, by using the `pubmed_pubmed_refs` link, I can figure out which papers are commonly cited by our list.

* For example, I can download a one-to-one mapping of 38694 pubmed records to their 10000 most commonly cited pubmed records.

* TODO: Batch size.
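
Once a mapping of citing records to cited records has been downloaded, finding the most commonly cited papers is a counting problem. Here is a minimal sketch, assuming the mapping has already been loaded as `(citing_pmid, cited_pmid)` pairs; the function name `most_cited` and the sample PMIDs are made up for illustration and do not reflect FLink's actual file layout.

```python
from collections import Counter


def most_cited(pairs, top_n=10):
    """Count how often each cited PMID appears in (citing, cited) pairs."""
    counts = Counter(cited for _citing, cited in pairs)
    return counts.most_common(top_n)


# Made-up pairs: three records cite 900, two cite 800, one cites 700.
pairs = [
    ("1", "900"), ("2", "900"), ("3", "900"),
    ("1", "800"), ("2", "800"),
    ("3", "700"),
]
print(most_cited(pairs, top_n=2))
# [('900', 3), ('800', 2)]
```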

## 5b. Export UIList to CSV
* No need to chunk here (I think)

In [None]:
fetchHandle = Entrez.efetch(
                                db='pubmed',
                                rettype='uilist',
#                                 retmode='xml',
                                query_key=searchResults['QueryKey'],
                                WebEnv=searchResults['WebEnv'],
                                retmax=searchResults['Count'],
                                retstart=0,
                                )

# fetchResults = Entrez.read(fetchHandle)
fetchResults = fetchHandle.read()
fetchHandle.close()
print(fetchResults)

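import os

# Create the output directory first; open() fails if it does not exist.
os.makedirs("output", exist_ok=True)

with open("output/uids_for_flink.csv", "w") as outfile:
    outfile.write(fetchResults)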
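
Before uploading the file, it can be useful to check how many distinct UIDs it actually contains. The sketch below assumes that the `uilist` text has one UID per line, which is what the printed output above suggests. `parse_uilist` is a hypothetical helper and the sample UIDs are made up.

```python
def parse_uilist(text):
    """Return the unique UIDs from newline-separated text, in original order."""
    seen = set()
    uids = []
    for line in text.splitlines():
        uid = line.strip()
        # Skip blank lines and UIDs that were already collected.
        if uid and uid not in seen:
            seen.add(uid)
            uids.append(uid)
    return uids


sample = "101\n102\n\n101\n103\n"  # made-up UIDs with a blank line and a repeat
print(parse_uilist(sample))
# ['101', '102', '103']
```

Comparing `len(parse_uilist(fetchResults))` with `int(searchResults['Count'])` would show whether a single request really returned the whole result set.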

## 5c. Grab Abstracts (unfinished)

* TODO: Chunk.

In [None]:
fetchHandle = Entrez.efetch(
                                db='pubmed',
#                                 rettype='medline',
                                retmode='xml',
                                query_key=searchResults['QueryKey'],
                                WebEnv=searchResults['WebEnv'],
                                retmax=searchResults['Count'],
                                retstart=0,
                                )

# fetchResults = Entrez.read(fetchHandle)
fetchResults = fetchHandle.read()
fetchHandle.close()
print(fetchResults)
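
For the chunking TODO, one option is to walk through the cached result set in fixed-size windows using `retstart` and `retmax`. This is a sketch under stated assumptions: `batch_windows` and `fetch_all` are hypothetical helpers, the batch size of 500 is arbitrary, and the fetching itself is passed in as a callable so the windowing logic can be checked without contacting NCBI.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_all(fetch_batch, total, batch_size=500):
    """Call `fetch_batch(retstart, retmax)` per window and collect the results."""
    return [fetch_batch(start, size)
            for start, size in batch_windows(total, batch_size)]


print(list(batch_windows(1200, 500)))
# [(0, 500), (500, 500), (1000, 200)]
```

Here `fetch_batch` would wrap the `Entrez.efetch` call above, passing its two arguments as `retstart` and `retmax`, and `total` would be `int(searchResults['Count'])`, since the count is returned as a string.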

# Under Development

## 6. Use ELink to generate a list of all the papers that cite the uploaded PMIDs.

* Does not seem to work for very large search queries.

## Generate list of possible linknames

In [None]:
linkHandle = Entrez.elink(
                            dbfrom="pubmed",
                            db="pubmed",
                            query_key=searchResults['QueryKey'],
                            WebEnv=searchResults['WebEnv'],
                            )

linkResults = Entrez.read(linkHandle)
linkHandle.close()

for linksetdb in linkResults[0]["LinkSetDb"]:
    print(linksetdb["DbTo"], linksetdb["LinkName"], len(linksetdb["Link"]))

In [None]:
linkHandle = Entrez.elink(
                            dbfrom="pubmed",
                            db="pubmed",
                            LinkName="pubmed_pubmed_citedin",
                            query_key=searchResults['QueryKey'],
                            WebEnv=searchResults['WebEnv'],
                            cmd='neighbor_score',
                            rettype='xml',

                            )

linkResults = Entrez.read(linkHandle)
linkHandle.close()

In [None]:
pprint(linkResults[0])

In [None]:
import csv

# Pair each linked PMID with the score returned by cmd='neighbor_score'.
pmids_pmids = [(str(link["Id"]), str(link["Score"])) for link in linkResults[0]["LinkSetDb"][0]["Link"]]

print(pmids_pmids)

with open('pmidCitedInPmid.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    # Header row, then one "Id,Score" row per link.
    writer.writerow(['citedInPmid', 'score'])
    writer.writerows(pmids_pmids)

# print(linkResults[0])

## Convert PMCIDs to PMIDs

In [None]:
linkHandle = Entrez.elink(
                            dbfrom="pubmed",
                            db="pmc",
                            LinkName="pubmed_pmc_refs",
                            query_key=searchResults['QueryKey'],
                            WebEnv=searchResults['WebEnv'],
                            cmd='neighbor_history',
                            )

linkResults = Entrez.read(linkHandle)
linkHandle.close()

In [None]:
linkResults

In [None]:
linkHandle = Entrez.elink(
                            dbfrom="pmc",
                            db="pubmed",
                            LinkName="pmc_pubmed",
                            query_key=linkResults[0]['LinkSetDbHistory'][0]['QueryKey'],
                            WebEnv=linkResults[0]['WebEnv'],
#                             cmd='neighbor_score',
#                             usehistory='y',
                            )

linkResults = Entrez.read(linkHandle)
linkHandle.close()

In [None]:
linkResults

In [None]:
[link["Id"] for link in linkResults[0]["LinkSetDb"][0]["Link"]]
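
The list comprehension above reads only the first `LinkSetDb` entry of the first link set, and it raises an `IndexError` when no links come back. A slightly more defensive sketch is shown below; `flatten_links` is a hypothetical helper, and the sample is a plain-dict stand-in shaped like the parsed ELink result, with made-up IDs.

```python
def flatten_links(link_results):
    """Return (link_name, linked_id) pairs from every LinkSetDb of every link set."""
    pairs = []
    for linkset in link_results:
        # A link set without any links has no (or an empty) LinkSetDb entry.
        for linksetdb in linkset.get("LinkSetDb", []):
            for link in linksetdb["Link"]:
                pairs.append((linksetdb["LinkName"], link["Id"]))
    return pairs


sample = [
    {"LinkSetDb": [
        {"DbTo": "pubmed", "LinkName": "pmc_pubmed",
         "Link": [{"Id": "111"}, {"Id": "222"}]},
    ]},
    {"LinkSetDb": []},  # a link set that returned nothing
]
print(flatten_links(sample))
# [('pmc_pubmed', '111'), ('pmc_pubmed', '222')]
```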

# Works Referenced

* [BioPython](https://github.com/biopython/biopython)
    * Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878
    * Tutorial and Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html
    
* [NCBI](https://www.ncbi.nlm.nih.gov/Structure/flink/flink.cgi)
    * FLink: Frequency weighted links [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information. 2010. Available from: https://www.ncbi.nlm.nih.gov/Structure/flink/flink.cgi
    
* [Pandas](http://pandas.pydata.org/pandas-docs/stable/index.html)
    * Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010) [(publisher link)](http://conference.scipy.org/proceedings/scipy2010/mckinney.html)