# Literature Metadata Retrieval

Chemical and biological literature is stored in the repositories of various publishers. Many services, such as CrossRef and PubMed, assign accession numbers to literature across these repositories and aggregate its metadata.

This notebook explores some of the ways to retrieve metadata about a document given one of these services' identifiers.

In [1]:
import requests

#doi_ex = "10.1038/ng1201-365"
doi_ex = "10.1186/2041-1480-5-5"
doi_endpoint = "http://doi.org/api/handles/"
crossref_endpoint = "http://api.crossref.org/works/"

## Digital Object Identifier System

The goal of the [Digital Object Identifier (DOI)](http://www.doi.org/) system is to assign unique identifiers not just to literature, but to all digital documents. Like other cataloging systems such as the ISBN, its identifiers have some organization and modularity.
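
The modularity mentioned above shows up as a prefix and a suffix separated by the first slash: the prefix (always starting with `10.`) identifies the registrant, and the suffix is chosen by that registrant. A minimal sketch of splitting the two follows; `split_doi` is a hypothetical helper name, not part of any library.

```python
def split_doi(doi):
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.partition("/")
    # Reject strings with no slash, an empty suffix, or a prefix
    # that does not start with the "10." directory indicator.
    if not sep or not suffix or not prefix.startswith("10."):
        raise ValueError("not a DOI: {!r}".format(doi))
    return prefix, suffix


print(split_doi("10.1186/2041-1480-5-5"))
```

Both example DOIs in this notebook split cleanly this way, which makes it easy to group records by registrant prefix.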

The DOI website provides a URL resolution service at http://dx.doi.org/. For example, http://dx.doi.org/10.1186/2041-1480-5-5 resolves to the Journal of Biomedical Semantics' landing page for the corresponding article, which reports on the BioHackathon series held in 2011 and 2012.

The website also provides a metadata service at http://doi.org/api/handles/. There is some, albeit sparse, [API documentation](https://www.doi.org/factsheets/DOIProxy.html).

In [2]:
res = requests.get(doi_endpoint + doi_ex).json()
res

{'handle': '10.1186/2041-1480-5-5',
 'responseCode': 1,
 'values': [{'data': {'format': 'string',
    'value': 'http://www.jbiomedsem.com/content/5/1/5'},
   'index': 1,
   'timestamp': '2014-04-07T20:05:11Z',
   'ttl': 86400,
   'type': 'URL'},
  {'data': {'format': 'string', 'value': '20140407165738'},
   'index': 700050,
   'timestamp': '2014-04-07T20:05:11Z',
   'ttl': 86400,
   'type': '700050'},
  {'data': {'format': 'admin',
    'value': {'handle': '0.na/10.1186',
     'index': 200,
     'permissions': '111111110010'}},
   'index': 100,
   'timestamp': '2014-04-07T20:05:11Z',
   'ttl': 86400,
   'type': 'HS_ADMIN'}]}
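
Each entry in `values` carries a `type`, and the entry of type `URL` holds the address the DOI resolves to. Below is a small sketch of pulling it out of a response shaped like the one above; `handle_url` is a hypothetical helper, and the sample dictionary is a trimmed copy of that output.

```python
def handle_url(handle_response):
    """Return the first URL value in a handle record, or None if absent."""
    for value in handle_response.get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None


# Trimmed copy of the response shown above.
sample = {
    "handle": "10.1186/2041-1480-5-5",
    "responseCode": 1,
    "values": [
        {"data": {"format": "string",
                  "value": "http://www.jbiomedsem.com/content/5/1/5"},
         "index": 1, "type": "URL"},
        {"data": {"format": "string", "value": "20140407165738"},
         "index": 700050, "type": "700050"},
    ],
}
print(handle_url(sample))
```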

## CrossRef 

CrossRef is a much more powerful metadata and citation indexing system built on top of the DOI system. Because citation information is generally inconsistent across the centuries of publications that the various repositories serve, data from this service must be handled carefully and checked for completeness.

CrossRef has thorough [API documentation](https://github.com/CrossRef/rest-api-doc/blob/master/rest_api.md) on GitHub.

CrossRef also has a blog at http://labs.crossref.org/ where they showcase new and experimental work.

In [3]:
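res = requests.get(crossref_endpoint + doi_ex).json()
res = res['message']  # the record itself sits under the 'message' key
d = {}
d['publisher'] = res['publisher']
# 'container-title' is a list of journal-name variants; keep the longest one
d['publication'] = max(res['container-title'], key=len)
# first listed author, formatted as "Family, Given"
d['author'] = "{}, {}".format(res['author'][0]['family'], res['author'][0]['given'])
d['publication_date'] = res['published-print']['date-parts'][0]
# 'subject' and 'title' are treated as optional
if 'subject' in res:
    d['subject'] = res['subject'][0]
if 'title' in res:
    d['title'] = res['title'][0]
d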

{'author': 'Katayama, Toshiaki',
 'publication': 'Journal of Biomedical Semantics',
 'publication_date': [2014],
 'publisher': 'Springer Science + Business Media',
 'title': 'BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains'}
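
Not every CrossRef record carries every field used above; `published-print`, for instance, may be missing for an online-only article, in which case the cell above would raise a `KeyError`. The following is a more defensive sketch of the same extraction. The helper name `summarize_work` is hypothetical, the fallback to `published-online` and then `issued` is an assumption about which date fields a record may carry, and the sample record is made up to mirror the output above (including the abbreviated journal name).

```python
def summarize_work(message):
    """Collect a few fields from a CrossRef 'message' dict, skipping absent ones."""
    d = {}
    if message.get("publisher"):
        d["publisher"] = message["publisher"]
    if message.get("container-title"):
        # Keep the longest journal-name variant, as in the cell above.
        d["publication"] = max(message["container-title"], key=len)
    authors = message.get("author") or []
    if authors:
        first = authors[0]
        d["author"] = ", ".join(
            part for part in (first.get("family"), first.get("given")) if part
        )
    # Assumed fallback order for the publication date.
    for key in ("published-print", "published-online", "issued"):
        parts = (message.get(key) or {}).get("date-parts")
        if parts and parts[0] and parts[0][0] is not None:
            d["publication_date"] = parts[0]
            break
    for key in ("subject", "title"):
        if message.get(key):
            d[key] = message[key][0]
    return d


# Illustrative record, not real API output.
sample = {
    "publisher": "Springer Science + Business Media",
    "container-title": ["J Biomed Sem", "Journal of Biomedical Semantics"],
    "author": [{"family": "Katayama", "given": "Toshiaki"}],
    "issued": {"date-parts": [[2014]]},
    "title": ["BioHackathon series in 2011 and 2012: penetration of "
              "ontology and linked data in life science domains"],
}
print(summarize_work(sample))
```

Passing `res['message']` from a live request to `summarize_work` would then yield a dictionary with only the keys the record actually provides.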

## PubMed

PubMed Identifiers (PMIDs) are an alternative accession number given to articles in MEDLINE by the [PubMed](http://www.ncbi.nlm.nih.gov/pubmed) service. Unfortunately, the PubMed API and its [documentation](http://www.ncbi.nlm.nih.gov/home/api.shtml) are not so straightforward.

Fred Trotter wrote a nice guide to using it on his [blog](http://www.fredtrotter.com/2014/11/14/hacking-on-the-pubmed-api/).
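
One route into PubMed is NCBI's E-utilities, whose ESummary service returns document summaries for a list of identifiers. As a hedged sketch, the code below only builds the request URL and reads a title out of an already-decoded JSON payload. The helper names are hypothetical, and the assumed payload layout (a `result` object keyed by PMID) should be checked against the E-utilities documentation before relying on it. The sample payload is a placeholder, not real output.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESummary endpoint.
ESUMMARY_ENDPOINT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(pmid):
    """Build an ESummary request URL for one PubMed ID, asking for JSON."""
    params = {"db": "pubmed", "id": pmid, "retmode": "json"}
    return ESUMMARY_ENDPOINT + "?" + urlencode(params)


def summary_title(payload, pmid):
    """Read the title for `pmid` from a decoded payload (layout is an assumption)."""
    return payload.get("result", {}).get(str(pmid), {}).get("title")


# Placeholder payload illustrating the assumed layout.
fake_payload = {"result": {"uids": ["23942530"],
                           "23942530": {"title": "placeholder title"}}}

print(esummary_url("23942530"))
print(summary_title(fake_payload, "23942530"))
```

The URL from `esummary_url` can be passed to `requests.get(...).json()` in the same way as the other endpoints in this notebook.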

## AltMetric

AltMetric has already solved many of the problems with retrieving, curating, and publishing literature metadata, but access is limited for free users. They have a DOI endpoint at http://api.altmetric.com/v1/doi and a PMID endpoint at http://api.altmetric.com/v1/pmid.

In [4]:
pmid_ex = "23942530"
altmetric_endpoint = "http://api.altmetric.com/v1/pmid/"

In [9]:
res = requests.get(altmetric_endpoint + pmid_ex).json()
print(set(res.keys()))

{'abstract', 'readers', 'is_oa', 'pmid', 'score', 'cohorts', 'type', 'added_on', 'details_url', 'doi', 'abstract_source', 'history', 'readers_count', 'cited_by_tweeters_count', 'issns', 'images', 'cited_by_posts_count', 'journal', 'url', 'altmetric_id', 'scopus_subjects', 'schema', 'subjects', 'last_updated', 'cited_by_accounts_count', 'altmetric_jid', 'published_on', 'context', 'publisher_subjects', 'title'}
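
The same request pattern applies to the DOI endpoint mentioned above. The sketch below builds the URL for either identifier type and trims a response down to a handful of the keys printed above; `altmetric_url` and `pick_fields` are hypothetical helpers, and the sample values are placeholders rather than real AltMetric data.

```python
ALTMETRIC_BASE = "http://api.altmetric.com/v1"


def altmetric_url(id_type, identifier):
    """Build an AltMetric URL for a 'doi' or 'pmid' lookup."""
    if id_type not in ("doi", "pmid"):
        raise ValueError("unsupported identifier type: {!r}".format(id_type))
    return "{}/{}/{}".format(ALTMETRIC_BASE, id_type, identifier)


def pick_fields(response, fields=("title", "journal", "doi", "pmid", "score")):
    """Keep only the requested keys that are present in the response."""
    return {key: response[key] for key in fields if key in response}


print(altmetric_url("doi", "10.1186/2041-1480-5-5"))
print(altmetric_url("pmid", "23942530"))

# Placeholder response; the values are not real AltMetric data.
fake_response = {"title": "placeholder", "pmid": "23942530", "images": {}}
print(pick_fields(fake_response))
```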
