# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Analyses 8 - Literature Analyses**
Welcome to the eighth Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we write Python code that retrieves data available on the WormBase sites and performs simple analyses with it.

This tutorial covers retrieving literature-related information, similar to what can be obtained from the Textpresso Central website.
Let's get started!

We start by importing the libraries required for the analysis. The literature information itself comes from the Europe PMC API.

In [None]:
import requests
import sys
import json
import urllib3
import xml.dom.minidom
from lxml import etree
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Let us first explore the fields that are available in the Europe PMC API. 

We query the fields endpoint, which lists the search fields that can be used for data extraction with this API, and then print the results.

In [None]:
request = requests.get('https://www.ebi.ac.uk/europepmc/webservices/rest/fields', 
                       headers={ "Content-Type" : "application/json", "Accept" : ""})

# Stop with an HTTPError if the server returned an error status
request.raise_for_status()
# The response is XML; pretty-print it for readability
result = xml.dom.minidom.parseString(request.text)
result = result.toprettyxml()
print(result)
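
The pretty-printed XML can be long. As a minimal sketch, the helper below collects only the text of elements with a given tag. It assumes that each field name sits in a `<term>` element, which should be checked against the output printed above. The name `extract_terms` and the sample document are illustrative and not part of the API.

```python
import xml.dom.minidom

def extract_terms(xml_text, tag='term'):
    """Return the text of every <tag> element found in an XML string."""
    document = xml.dom.minidom.parseString(xml_text)
    terms = []
    for node in document.getElementsByTagName(tag):
        # Skip empty elements, which have no text child
        if node.firstChild is not None and node.firstChild.nodeValue:
            terms.append(node.firstChild.nodeValue.strip())
    return terms

# Hypothetical sample standing in for the API response
sample = '<fields><term>AUTH</term><term>TITLE</term><term/></fields>'
print(extract_terms(sample))
```

With the live response, `extract_terms(request.text)` would return the field names, provided the tag name matches the actual XML.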

If you know the accession ID of a paper, you can download the supplementary material associated with it through the supplementaryFiles endpoint of the API.

We generate the URL required for our query by entering the accession ID of the paper in the id variable.
Then we download the results to our system as a '.zip' file.

In [None]:
id = 'PMC3027648'
request = requests.get('https://www.ebi.ac.uk/europepmc/webservices/rest/' + id + \
                       '/supplementaryFiles?includeInlineImage=true', 
                       headers = {"Content-Type" : "application/zip", "Accept" : ""}, stream=True)

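# Stop on an error status so that an error page is not saved as a ZIP file
request.raise_for_status()

target_path = 'supplementaryFiles.zip'
# Stream the response to disk; the with block closes the file even on error
with open(target_path, 'wb') as handle:
    for chunk in request.iter_content(chunk_size=512):
        if chunk:
            handle.write(chunk)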
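
Before unpacking the download, it can help to confirm that the file really is a ZIP archive and to see what it contains. The sketch below uses Python's standard `zipfile` module. The helper name `list_zip_contents` is hypothetical, and a small in-memory archive stands in for `supplementaryFiles.zip` so that the example runs without a network connection.

```python
import io
import zipfile

def list_zip_contents(zip_source):
    """Return (name, size_in_bytes) pairs for each file in a ZIP archive.

    zip_source may be a file path or a binary file-like object.
    """
    if not zipfile.is_zipfile(zip_source):
        raise ValueError('not a valid ZIP archive')
    with zipfile.ZipFile(zip_source) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]

# Build a small in-memory archive to stand in for the downloaded file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('table_S1.csv', 'gene,value\n')

print(list_zip_contents(buffer))
```

For the file downloaded above, the call would be `list_zip_contents('supplementaryFiles.zip')`. A `ValueError` here would suggest that the server returned something other than an archive.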

It is often useful to search for papers that contain a certain keyword. For this we define a function that queries the entire Europe PMC database for the keyword. You do not need to make any changes to this function.

In [None]:
def searchEuropePMCclient(query, format='XML'):
    base_url = 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?'
    payload = {'query' : query, 'format' : format}
    request = requests.get(base_url, params=payload)
    if request.ok:
        result = xml.dom.minidom.parseString(request.text)
        result = result.toprettyxml()
        print(result)
    else:
        # Report the HTTP status code to help diagnose the failure
        print('Request failed with status code', request.status_code)

Assign the keyword that you want to search for to the keyword variable.

In [None]:
keyword = 'Caenorhabditis elegans'
searchEuropePMCclient(keyword)
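
Printing the full XML is fine for exploration, but a compact summary is often easier to read. The sketch below assumes that the search endpoint is called with `format` set to `json`, and that the response nests the hits under `resultList` and `result`, with `source`, `id` and `title` keys on each hit. Verify these assumptions against a real response before relying on them. The helper name `summarise_hits` and the sample data are hypothetical.

```python
import json

def summarise_hits(response_text):
    """Return (source, id, title) tuples from a JSON search response."""
    data = json.loads(response_text)
    # Assumed layout: {"resultList": {"result": [{...}, ...]}}
    hits = data.get('resultList', {}).get('result', [])
    return [(hit.get('source'), hit.get('id'), hit.get('title')) for hit in hits]

# Hypothetical sample; a real response would carry many more keys
sample = json.dumps({
    'resultList': {
        'result': [
            {'source': 'MED', 'id': '1', 'title': 'Example title A'},
            {'source': 'PMC', 'id': '2'},
        ]
    }
})

for source, paper_id, title in summarise_hits(sample):
    print(source, paper_id, title)
```

Missing keys come back as `None` rather than raising an error, so hits with incomplete metadata do not interrupt the loop.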

Another useful feature of the Europe PMC API is the ability to retrieve the works of a certain author, using either their name or their ORCID iD.

Assign the author's ORCID iD to the author_id variable. The query below uses the AUTHORID field, which expects an ORCID iD; searching by name would require the AUTH field instead.

In [None]:
author_id = '0000-0001-8314-8497'

We first generate the required URL for fetching the papers written by the author and then send the request.

In [None]:
request = requests.get('https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=AUTHORID:' + author_id, 
                       headers={ "Content-Type" : "application/json", "Accept" : ""})

# Stop with an HTTPError if the server returned an error status
request.raise_for_status()
# The response is XML; pretty-print it for readability
result = xml.dom.minidom.parseString(request.text)
result = result.toprettyxml()
print(result)
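
Author names contain spaces and other characters that need URL encoding, so it can be safer to build the query string with a helper than with plain concatenation. The sketch below picks the search field from the shape of the input: it assumes that `AUTHORID` takes an ORCID iD and `AUTH` takes an author name. The helper name `build_author_url` and the name 'Doe J' are illustrative only.

```python
import re
from urllib.parse import urlencode

# An ORCID iD is four groups of four digits; the last character may be X
ORCID_PATTERN = re.compile(r'^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$')
SEARCH_URL = 'https://www.ebi.ac.uk/europepmc/webservices/rest/search'

def build_author_url(author):
    """Build a search URL for an author given an ORCID iD or a name."""
    # Assumption: AUTHORID searches by ORCID iD, AUTH searches by name
    field = 'AUTHORID' if ORCID_PATTERN.match(author) else 'AUTH'
    query = f'{field}:"{author}"'
    # urlencode escapes the colon, quotes and spaces in the query
    return SEARCH_URL + '?' + urlencode({'query': query})

print(build_author_url('0000-0001-8314-8497'))
print(build_author_url('Doe J'))
```

The resulting URL can be passed directly to `requests.get`.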

It is also possible to list the papers that cite a given publication by providing the source of the paper and its external ID, which in most cases is its accession ID.

Assign the source and external ID of the paper to the variables source and external_id.
The source can be one of: AGR, CBA, CTX, ETH, HIR, MED, PAT, PMC, PPR.

In [None]:
source = 'MED'  
external_id = '30206121'

We then generate the required URL for fetching the papers that cite the queried paper and send the request.

In [None]:
request = requests.get('https://www.ebi.ac.uk/europepmc/webservices/rest/' + source + '/' + external_id + \
                       '/citations', 
                       headers={ "Content-Type" : "application/json", "Accept" : ""})

# Stop with an HTTPError if the server returned an error status
request.raise_for_status()
# The response is XML; pretty-print it for readability
result = xml.dom.minidom.parseString(request.text)
result = result.toprettyxml()
print(result)
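
A heavily cited paper may return more citations than fit in one response. The sketch below builds the citations URL with paging and rejects sources that are not in the list given earlier. It assumes that the endpoint accepts `page` and `pageSize` query parameters, which should be confirmed in the Europe PMC documentation. The helper name `build_citations_url` and the default values are hypothetical.

```python
from urllib.parse import urlencode

# Source codes listed earlier in this tutorial
VALID_SOURCES = {'AGR', 'CBA', 'CTX', 'ETH', 'HIR', 'MED', 'PAT', 'PMC', 'PPR'}
BASE_URL = 'https://www.ebi.ac.uk/europepmc/webservices/rest'

def build_citations_url(source, external_id, page=1, page_size=25):
    """Build the citations URL for one page of results."""
    source = source.upper()
    if source not in VALID_SOURCES:
        raise ValueError(f'Unknown source: {source}')
    # Assumption: the endpoint supports page and pageSize parameters
    params = urlencode({'page': page, 'pageSize': page_size})
    return f'{BASE_URL}/{source}/{external_id}/citations?{params}'

print(build_citations_url('MED', '30206121'))
print(build_citations_url('MED', '30206121', page=2))
```

Checking the source before sending the request catches typos locally instead of relying on an error response from the server.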

This concludes the tutorial on replicating Textpresso results using the Europe PMC RESTful API to obtain literature analysis information. The data is up to date, quick to extract, and easy to handle.


This tutorial is also the end of the analysis series. In the next tutorial, we will implement and test some simple utilities for the data.

Acknowledgements:

- Textpresso Central (https://textpressocentral.org/tpc/home)
- EuropePMC API (http://europepmc.org/RestfulWebService)