# Building the dataset of research papers

_(Adapted from: Building the "evolution" research papers dataset - [Luís F. Simões](mailto:luis.simoes@vu.nl). Converted to Python 3 and minor changes by Tobias Kuhn, 2015-10-22.)_

*******

The [Entrez](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html) module, a part of the [Biopython](http://biopython.org/) library, will be used to interface with [PubMed](http://www.ncbi.nlm.nih.gov/pubmed).<br>
You can download Biopython from [here](http://biopython.org/wiki/Download).

In this notebook we will be covering several of the steps taken in the [Biopython Tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html), specifically in [Chapter 9  Accessing NCBI’s Entrez databases](http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc109).

In [1]:
from Bio import Entrez

# NCBI requires you to set your email address to make use of NCBI's E-utilities
Entrez.email = "Your.Name.Here@example.org"

The datasets will be saved as serialized Python objects, compressed with bzip2.
Saving/loading them will therefore require the [pickle](http://docs.python.org/3/library/pickle.html) and [bz2](http://docs.python.org/3/library/bz2.html) modules.

In [2]:
import pickle, bz2, os

## EInfo: Obtaining information about the Entrez databases

In [3]:
# accessing extended information about the PubMed database
pubmed = Entrez.read( Entrez.einfo(db="pubmed"), validate=False )[u'DbInfo']

# list of possible search fields for use with ESearch:
search_fields = { f['Name']:f['Description'] for f in pubmed["FieldList"] }

In search_fields, we find 'TIAB' ('Free text associated with Abstract/Title') as a possible search field to use in searches.

In [4]:
search_fields

{'AFFL': "Author's institutional affiliation and address",
 'ALL': 'All terms from all searchable fields',
 'AUCL': 'Author Cluster ID',
 'AUID': 'Author Identifier',
 'AUTH': 'Author(s) of publication',
 'BOOK': 'ID of the book that contains the document',
 'CDAT': 'Date of completion',
 'CNTY': 'Country of publication',
 'COLN': 'Corporate Author of publication',
 'CRDT': 'Date publication first accessible through Entrez',
 'DSO': 'Additional text from the summary',
 'ECNO': 'EC number for enzyme or CAS registry number',
 'ED': "Section's Editor",
 'EDAT': 'Date publication first accessible through Entrez',
 'EID': 'Extended PMID',
 'EPDT': 'Date of Electronic publication',
 'FAUT': 'First Author of publication',
 'FILT': 'Limits the records',
 'FINV': 'Full name of investigator',
 'FULL': 'Full Author Name(s) of publication',
 'GRNT': 'NIH Grant Numbers',
 'INVR': 'Investigator',
 'ISBN': 'ISBN',
 'ISS': 'Issue number of publication',
 'JOUR': 'Journal abbreviation of publication',


## ESearch: Searching the Entrez databases

To have a look at the kind of data we get when searching the database, we'll perform a search for papers authored by Haasdijk:

In [5]:
example_authors = ['Haasdijk E']
example_search = Entrez.read( Entrez.esearch( db="pubmed", term=' AND '.join([a+'[AUTH]' for a in example_authors]) ) )
example_search

{'RetMax': '20', 'IdList': ['24977986', '24901702', '24852945', '24708899', '24252306', '23580075', '23144668', '22174697', '22154920', '21870131', '21760539', '20662596', '20602234', '20386726', '18579581', '18305242', '17913916', '17804640', '17686042', '17183535'], 'TranslationStack': [{'Term': 'Haasdijk E[Author]', 'Field': 'Author', 'Count': '30', 'Explode': 'N'}, 'GROUP'], 'Count': '30', 'TranslationSet': [], 'RetStart': '0', 'QueryTranslation': 'Haasdijk E[Author]'}

Note how the result being produced is not in Python's native string format:

In [6]:
type( example_search['IdList'][0] )

Bio.Entrez.Parser.StringElement

The part of the query's result we are most interested in is accessible through

In [7]:
example_ids = [ int(id) for id in example_search['IdList'] ]
print(example_ids)

[24977986, 24901702, 24852945, 24708899, 24252306, 23580075, 23144668, 22174697, 22154920, 21870131, 21760539, 20662596, 20602234, 20386726, 18579581, 18305242, 17913916, 17804640, 17686042, 17183535]


### PubMed IDs dataset

We will now assemble a dataset comprised of research articles containing the keyword "evolution", in either their titles or abstracts.

In [8]:
search_term = 'air'

In [9]:
Ids_file = 'data/' + search_term + '__Ids.pkl.bz2'

In [10]:
if os.path.exists( Ids_file ):
    Ids = cPickle.load( bz2.BZ2File( Ids_file, 'rb' ) )
else:
    # determine the number of hits for the search term
    search = Entrez.read( Entrez.esearch( db="pubmed", term=search_term+'[TIAB]', retmax=0 ) )
    total = int( search['Count'] )
    
    # `Ids` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Ids_str = []
    retrieve_per_query = 10000
    
    for start in range( 0, total, retrieve_per_query ):
        print('Fetching IDs of results [%d,%d]' % ( start, start+retrieve_per_query ) )
        s = Entrez.read( Entrez.esearch( db="pubmed", term=search_term+'[TIAB]', retstart=start, retmax=retrieve_per_query ) )
        Ids_str.extend( s[ u'IdList' ] )
    
    # convert Ids to integers (and ensure that the conversion is reversible)
    Ids = [ int(id) for id in Ids_str ]
    
    for (id_str, id_int) in zip(Ids_str, Ids):
        if str(id_int) != id_str:
            raise Exception('Conversion of PubMed ID %s from string to integer it not reversible.' % id_str )
    
    # Save list of Ids
    pickle.dump( Ids, bz2.BZ2File( Ids_file, 'wb' ) )
    
total = len( Ids )
print('%d documents contain the search term "%s".' % ( total, search_term ) )

Fetching IDs of results [0,10000]
Fetching IDs of results [10000,20000]
Fetching IDs of results [20000,30000]
Fetching IDs of results [30000,40000]
Fetching IDs of results [40000,50000]
Fetching IDs of results [50000,60000]
Fetching IDs of results [60000,70000]
Fetching IDs of results [70000,80000]
Fetching IDs of results [80000,90000]
Fetching IDs of results [90000,100000]
Fetching IDs of results [100000,110000]
Fetching IDs of results [110000,120000]
Fetching IDs of results [120000,130000]
Fetching IDs of results [130000,140000]
Fetching IDs of results [140000,150000]
Fetching IDs of results [150000,160000]
Fetching IDs of results [160000,170000]
Fetching IDs of results [170000,180000]
Fetching IDs of results [180000,190000]
Fetching IDs of results [190000,200000]
190555 documents contain the search term "air".


Taking a look at what we just retrieved, here are the last 5 elements of the `Ids` list:

In [11]:
Ids[:5]

[26489032, 26489005, 26488732, 26488527, 26488458]

## ESummary: Retrieving summaries from primary IDs

To have a look at the kind of metadata we get from a call to `Entrez.esummary()`, we now fetch the summary of one of Haasdijk's papers (using one of the PubMed IDs we obtained in the previous section:

In [12]:
example_paper = Entrez.read( Entrez.esummary(db="pubmed", id='23144668') )[0]

def print_dict( p ):
    for k,v in p.items():
        print(k)
        print('\t', v)

print_dict(example_paper)

EPubDate
	 2012 Apr 20
PubTypeList
	 ['Journal Article']
AuthorList
	 ['Eiben AE', 'Kernbach S', 'Haasdijk E']
LastAuthor
	 Haasdijk E
Item
	 []
ArticleIds
	 {'eid': '23144668', 'pmc': 'PMC3490067', 'medline': [], 'pii': '71', 'pubmed': ['23144668'], 'pmcid': 'pmc-id: PMC3490067;', 'doi': '10.1007/s12065-012-0071-x', 'rid': '23144668'}
SO
	 2012 Dec;5(4):261-272
FullJournalName
	 Evolutionary intelligence
HasAbstract
	 1
Pages
	 261-272
Title
	 Embodied artificial evolution: Artificial evolutionary systems in the 21st Century.
Issue
	 4
Volume
	 5
History
	 {'epublish': '2012/04/20 00:00', 'accepted': '2012/03/22 00:00', 'medline': ['2012/11/13 06:00'], 'pubmed': ['2012/11/13 06:00'], 'entrez': '2012/11/13 06:00', 'revised': '2012/02/17 00:00', 'received': '2011/11/28 00:00'}
Source
	 Evol Intell
RecordStatus
	 PubMed
PubStatus
	 ppublish+epublish
PubDate
	 2012 Dec
LangList
	 ['English']
ELocationID
	 
ESSN
	 1864-5917
DOI
	 10.1007/s12065-012-0071-x
ISSN
	 1864-5909
NlmUniqueID
	 101

For now, we'll keep just some basic information for each paper: title, list of authors, publication year, and [DOI](https://en.wikipedia.org/wiki/Digital_object_identifier).

In case you are not familiar with the DOI system, know that the paper above can be accessed through the link [http://dx.doi.org/10.1007/s12065-012-0071-x](http://dx.doi.org/10.1007/s12065-012-0071-x) (which is `http://dx.doi.org/` followed by the paper's DOI).

In [13]:
( example_paper['Title'], example_paper['AuthorList'], int(example_paper['PubDate'][:4]), example_paper['DOI'] )

('Embodied artificial evolution: Artificial evolutionary systems in the 21st Century.',
 ['Eiben AE', 'Kernbach S', 'Haasdijk E'],
 2012,
 '10.1007/s12065-012-0071-x')

### Summaries dataset

We are now ready to assemble a dataset containing the summaries of all the paper `Ids` we previously fetched.

To reduce the memory footprint, and to ensure the saved datasets won't depend on Biopython being installed to be properly loaded, values returned by `Entrez.read()` will be converted to their corresponding native Python types. We start by defining a function for helping with the conversion of strings:

In [14]:
Summaries_file = 'data/' + search_term + '__Summaries.pkl.bz2'

In [15]:
if os.path.exists( Summaries_file ):
    Summaries = cPickle.load( bz2.BZ2File( Summaries_file, 'rb' ) )
else:
    # `Summaries` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Summaries = []
    retrieve_per_query = 500
    
    print('Fetching Summaries of results: ')
    for start in range( 0, len(Ids), retrieve_per_query ):
        if (start % 10000 == 0):
            print('')
            print(start, end='')
        else:
            print('.', end='')
        
        # build comma separated string with the ids at indexes [start, start+retrieve_per_query)
        query_ids = ','.join( [ str(id) for id in Ids[ start : start+retrieve_per_query ] ] )
        
        s = Entrez.read( Entrez.esummary( db="pubmed", id=query_ids ) )
        
        # out of the retrieved data, we will keep only a tuple (title, authors, year, DOI), associated with the paper's id.
        # (all values converted to native Python formats)
        f = [
            ( int( p['Id'] ), (
                str( p['Title'] ),
                [ str(a) for a in p['AuthorList'] ],
                int( p['PubDate'][:4] ),                # keeps just the publication year
                str( p.get('DOI', '') )            # papers for which no DOI is available get an empty string in their place
                ) )
            for p in s
            ]
        Summaries.extend( f )
    
    # Save Summaries, as a dictionary indexed by Ids
    Summaries = dict( Summaries )
    
    pickle.dump( Summaries, bz2.BZ2File( Summaries_file, 'wb' ) )

Fetching Summaries of results: 

0...................
10000...................
20000...................
30000...................
40000...................
50000...................
60000...................
70000...................
80000...................
90000...................
100000...................
110000...................
120000...................
130000...................
140000...................
150000...................
160000...................
170000...................
180000...................
190000.

Let us take a look at the first 3 retrieved summaries:

In [16]:
{ id : Summaries[id] for id in Ids[:3] }

{26488732: ('Use of Whole-Genome Sequencing to Link Burkholderia pseudomallei from Air Sampling to Mediastinal Melioidosis, Australia.',
  ['Currie BJ',
   'Price EP',
   'Mayo M',
   'Kaestli M',
   'Theobald V',
   'Harrington I',
   'Harrington G',
   'Sarovich DS'],
  2015,
  '10.3201/eid2111.141802'),
 26489005: ('A simple chemical solution deposition of Co3O4 thin film electrocatalyst for oxygen evolution reaction.',
  ['Jeon HS', 'Jee MS', 'Kim H', 'Ahn SJ', 'Hwang YJ', 'Min BK'],
  2015,
  '10.1021/acsami.5b06189'),
 26489032: ('The Occurrence of Beer Spoilage Lactic Acid Bacteria in Craft Beer Production.',
  ['Garofalo C',
   'Osimani A',
   'Milanović V',
   'Taccari M',
   'Aquilanti L',
   'Clementi F'],
  2015,
  '10.1111/1750-3841.13112')}

## EFetch: Downloading full records from Entrez

`Entrez.efetch()` is the function that will allow us to obtain paper abstracts. Let us start by taking a look at the kind of data it returns when we query PubMed's database.

In [17]:
q = Entrez.read( Entrez.efetch(db="pubmed", id='23144668', retmode="xml") )

`q` is a list, with each member corresponding to a queried id. Because here we only queried for one id, its results are then in `q[0]`.

In [18]:
type(q), len(q)

(Bio.Entrez.Parser.ListElement, 1)

At `q[0]` we find a dictionary containing two keys, the contents of which we print below.

In [19]:
type(q[0]), q[0].keys()

(Bio.Entrez.Parser.DictionaryElement,
 dict_keys(['MedlineCitation', 'PubmedData']))

In [20]:
print_dict( q[0][ 'PubmedData' ] )

PublicationStatus
	 ppublish
History
	 [DictElement({'Year': '2011', 'Month': '11', 'Day': '28'}, attributes={'PubStatus': 'received'}), DictElement({'Year': '2012', 'Month': '2', 'Day': '17'}, attributes={'PubStatus': 'revised'}), DictElement({'Year': '2012', 'Month': '3', 'Day': '22'}, attributes={'PubStatus': 'accepted'}), DictElement({'Year': '2012', 'Month': '4', 'Day': '20'}, attributes={'PubStatus': 'epublish'}), DictElement({'Hour': '6', 'Year': '2012', 'Minute': '0', 'Month': '11', 'Day': '13'}, attributes={'PubStatus': 'entrez'}), DictElement({'Hour': '6', 'Year': '2012', 'Minute': '0', 'Month': '11', 'Day': '13'}, attributes={'PubStatus': 'pubmed'}), DictElement({'Hour': '6', 'Year': '2012', 'Minute': '0', 'Month': '11', 'Day': '13'}, attributes={'PubStatus': 'medline'})]
ArticleIdList
	 [StringElement('10.1007/s12065-012-0071-x', attributes={'IdType': 'doi'}), StringElement('71', attributes={'IdType': 'pii'}), StringElement('23144668', attributes={'IdType': 'pubmed'}), Stri

The key `'MedlineCitation'` maps into another dictionary. In that dictionary, most of the information is contained under the key `'Article'`. To minimize the clutter, below we show the contents of `'MedlineCitation'` excluding its `'Article'` member, and below that we then show the contents of `'Article'`.

In [21]:
print_dict( { k:v for k,v in q[0][ 'MedlineCitation' ].items() if k!='Article' } )

OtherAbstract
	 []
PMID
	 23144668
GeneralNote
	 []
KeywordList
	 []
MedlineJournalInfo
	 {'MedlineTA': 'Evol Intell', 'NlmUniqueID': '101475575', 'ISSNLinking': '1864-5909'}
CitationSubset
	 []
OtherID
	 []
DateCreated
	 {'Year': '2012', 'Month': '11', 'Day': '12'}
SpaceFlightMission
	 []


In [22]:
print_dict( q[0][ 'MedlineCitation' ][ 'Article' ] )

AuthorList
	 ListElement([DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'VU University Amsterdam, Amsterdam, The Netherlands.'}], 'LastName': 'Eiben', 'Initials': 'AE', 'Identifier': [], 'ForeName': 'A E'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [], 'LastName': 'Kernbach', 'Initials': 'S', 'Identifier': [], 'ForeName': 'S'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [], 'LastName': 'Haasdijk', 'Initials': 'E', 'Identifier': [], 'ForeName': 'Evert'}, attributes={'ValidYN': 'Y'})], attributes={'Type': 'authors', 'CompleteYN': 'Y'})
ArticleTitle
	 Embodied artificial evolution: Artificial evolutionary systems in the 21st Century.
Journal
	 {'JournalIssue': DictElement({'Issue': '4', 'Volume': '5', 'PubDate': {'Year': '2012', 'Month': 'Dec'}}, attributes={'CitedMedium': 'Print'}), 'Title': 'Evolutionary intelligence', 'ISOAbbreviation': 'Evol Intell', 'ISSN': StringElement('1864-5909', attributes={'IssnType': 'Print'})}
Articl

A paper's abstract can therefore be accessed with:

In [23]:
{ int(q[0]['MedlineCitation']['PMID']) : str(q[0]['MedlineCitation']['Article']['Abstract']['AbstractText'][0]) }

{23144668: 'Evolution is one of the major omnipresent powers in the universe that has been studied for about two centuries. Recent scientific and technical developments make it possible to make the transition from passively understanding to actively using evolutionary processes. Today this is possible in Evolutionary Computing, where human experimenters can design and manipulate all components of evolutionary processes in digital spaces. We argue that in the near future it will be possible to implement artificial evolutionary processes outside such imaginary spaces and make them physically embodied. In other words, we envision the "Evolution of Things", rather than just the evolution of digital objects, leading to a new field of Embodied Artificial Evolution (EAE). The main objective of this paper is to present a unifying vision in order to aid the development of this high potential research area. To this end, we introduce the notion of EAE, discuss a few examples and applications, and

A paper for which no abstract is available will simply not contain the `'Abstract'` key in its `'Article'` dictionary:

In [24]:
print_dict( Entrez.read( Entrez.efetch(db="pubmed", id='17782550', retmode="xml") )[0]['MedlineCitation']['Article'] )

ArticleTitle
	 EVOLUTION OF LOCOMOTIVES IN AMERICA.
Journal
	 {'JournalIssue': DictElement({'Issue': '3', 'Volume': '1', 'PubDate': {'Year': '1880', 'Month': 'Jul', 'Day': '17'}}, attributes={'CitedMedium': 'Print'}), 'Title': 'Science (New York, N.Y.)', 'ISOAbbreviation': 'Science', 'ISSN': StringElement('0036-8075', attributes={'IssnType': 'Print'})}
ArticleDate
	 []
PublicationTypeList
	 [StringElement('Journal Article', attributes={'UI': 'D016428'})]
Language
	 ['eng']
ELocationID
	 []
Pagination
	 {'MedlinePgn': '35'}


Some of the ids in our dataset refer to books from the [NCBI Bookshelf](http://www.ncbi.nlm.nih.gov/books/), a collection of freely available, downloadable, on-line versions of selected biomedical books. For such ids, `Entrez.efetch()` returns a slightly different structure, where the keys `[u'BookDocument', u'PubmedBookData']` take the place of the `[u'MedlineCitation', u'PubmedData']` keys we saw above.

Here is an example of the data we obtain for the id corresponding to the book [The Social Biology of Microbial Communities](http://www.ncbi.nlm.nih.gov/books/NBK114831/):

In [25]:
r = Entrez.read( Entrez.efetch(db="pubmed", id='24027805', retmode="xml") )

In [26]:
print_dict( r[0][ 'PubmedBookData' ] )

PublicationStatus
	 ppublish
History
	 [DictElement({'Hour': '6', 'Year': '2013', 'Minute': '0', 'Month': '9', 'Day': '13'}, attributes={'PubStatus': 'pubmed'}), DictElement({'Hour': '6', 'Year': '2013', 'Minute': '0', 'Month': '9', 'Day': '13'}, attributes={'PubStatus': 'medline'}), DictElement({'Hour': '6', 'Year': '2013', 'Minute': '0', 'Month': '9', 'Day': '13'}, attributes={'PubStatus': 'entrez'})]
ArticleIdList
	 [StringElement('24027805', attributes={'IdType': 'pubmed'})]


In [27]:
print_dict( r[0][ 'BookDocument' ] )

KeywordList
	 []
AuthorList
	 []
PMID
	 24027805
Book
	 {'AuthorList': [ListElement([DictElement({'AffiliationInfo': [], 'Identifier': [], 'CollectiveName': 'Institute of Medicine (US) Forum on Microbial Threats'}, attributes={'ValidYN': 'Y'})], attributes={'Type': 'authors', 'CompleteYN': 'Y'})], 'BookTitle': StringElement('The Social Biology of Microbial Communities: Workshop Summary', attributes={'book': 'nap13500'}), 'CollectionTitle': StringElement('The National Academies Collection: Reports funded by National Institutes of Health', attributes={'book': 'napcollect'}), 'Isbn': ['9780309264327', '0309264324'], 'PubDate': {'Year': '2012'}, 'ELocationID': [], 'Publisher': {'PublisherName': 'National Academies Press (US)', 'PublisherLocation': 'Washington (DC)'}}
Sections
	 [{'Section': [], 'SectionTitle': StringElement('THE NATIONAL ACADEMIES', attributes={'book': 'nap13500', 'part': 'fm.s1'})}, {'Section': [], 'SectionTitle': StringElement('PLANNING COMMITTEE FOR A WORKSHOP ON THE MI

In a book from the NCBI Bookshelf, its abstract can then be accessed as such:

In [28]:
{ int(r[0]['BookDocument']['PMID']) : str(r[0]['BookDocument']['Abstract']['AbstractText'][0]) }

{24027805: 'On March 6 and 7, 2012, the Institute of Medicine’s (IOM’s) Forum on Microbial Threats hosted a public workshop to explore the emerging science of the “social biology” of microbial communities. Workshop presentations and discussions embraced a wide spectrum of topics, experimental systems, and theoretical perspectives representative of the current, multifaceted exploration of the microbial frontier. Participants discussed ecological, evolutionary, and genetic factors contributing to the assembly, function, and stability of microbial communities; how microbial communities adapt and respond to environmental stimuli; theoretical and experimental approaches to advance this nascent field; and potential applications of knowledge gained from the study of microbial communities for the improvement of human, animal, plant, and ecosystem health and toward a deeper understanding of microbial diversity and evolution.'}

### Abstracts dataset

We can now assemble a dataset mapping paper ids to their abstracts.

In [29]:
Abstracts_file = 'data/' + search_term + '__Abstracts.pkl.bz2'

In [42]:
import http.client
from collections import deque

if os.path.exists( Abstracts_file ):
    Abstracts = pickle.load( bz2.BZ2File( Abstracts_file, 'rb' ) )
else:
    # `Abstracts` will be incrementally assembled, by performing multiple queries,
    # each returning at most `retrieve_per_query` entries.
    Abstracts = deque()
    retrieve_per_query = 500
    
    print('Fetching Abstracts of results: ')
    for start in range( 0, len(Ids), retrieve_per_query ):
        if (start % 10000 == 0):
            print('')
            print(start, end='')
        else:
            print('.', end='')
        
        # build comma separated string with the ids at indexes [start, start+retrieve_per_query)
        query_ids = ','.join( [ str(id) for id in Ids[ start : start+retrieve_per_query ] ] )
        
        # issue requests to the server, until we get the full amount of data we expect
        while True:
            try:
                s = Entrez.read( Entrez.efetch(db="pubmed", id=query_ids, retmode="xml" ) )
            except http.client.IncompleteRead:
                print('r', end='')
                continue
            break
        
        i = 0
        for p in s:
            abstr = ''
            if 'MedlineCitation' in p:
                pmid = p['MedlineCitation']['PMID']
                if 'Abstract' in p['MedlineCitation']['Article']:
                    abstr = p['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
            elif 'BookDocument' in p:
                pmid = p['BookDocument']['PMID']
                if 'Abstract' in p['BookDocument']:
                    abstr = p['BookDocument']['Abstract']['AbstractText'][0]
            else:
                raise Exception('Unrecognized record type, for id %d (keys: %s)' % (Ids[start+i], str(p.keys())) )
            
            Abstracts.append( (int(pmid), str(abstr)) )
            i += 1
    
    # Save Abstracts, as a dictionary indexed by Ids
    Abstracts = dict( Abstracts )
    
    pickle.dump( Abstracts, bz2.BZ2File( Abstracts_file, 'wb' ) )

Taking a look at one paper's abstract:

In [31]:
Abstracts[26488732]

'The frequency with which melioidosis results from inhalation rather than percutaneous inoculation or ingestion is unknown. We recovered Burkholderia pseudomallei from air samples at the residence of a patient with presumptive inhalational melioidosis and used whole-genome sequencing to link the environmental bacteria to B. pseudomallei recovered from the patient.'

## ELink: Searching for related items in NCBI Entrez

To understand how to obtain paper citations with Entrez, we will first assemble a small set of PubMed IDs, and then query for their citations.
To that end, we search here for papers published in the [PLOS Computational Biology](http://www.ploscompbiol.org/) journal (as before, having also the word "air" in either the title or abstract):

In [32]:
CA_search_term = search_term+'[TIAB] AND PLoS computational biology[JOUR]'
CA_ids = Entrez.read( Entrez.esearch( db="pubmed", term=CA_search_term ) )['IdList']
CA_ids

['24853675', '23737742', '23209398', '22383866', '20865052', '19300479']

In [33]:
CA_summ = {
    p['Id'] : ( p['Title'], p['AuthorList'], p['PubDate'][:4], p['FullJournalName'], p.get('DOI', '') )
    for p in Entrez.read( Entrez.esummary(db="pubmed", id=','.join( CA_ids )) )
    }
CA_summ

{'19300479': ('Computational model of the insect pheromone transduction cascade.',
  ['Gu Y', 'Lucas P', 'Rospars JP'],
  '2009',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1000321'),
 '20865052': ('Reverse engineering of oxygen transport in the lung: adaptation to changing demands and resources through space-filling networks.',
  ['Hou C', 'Gheorghiu S', 'Huxley VH', 'Pfeifer P'],
  '2010',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1000902'),
 '22383866': ('A cell-based computational modeling approach for developing site-directed molecular probes.',
  ['Yu JY', 'Zheng N', 'Mane G', 'Min KA', 'Hinestroza JP', 'Zhu H', 'Stringer KA', 'Rosania GR'],
  '2012',
  'PLoS computational biology',
  '10.1371/journal.pcbi.1002378'),
 '23209398': ('Experimental studies and dynamics modeling analysis of the swimming and diving of whirligig beetles (Coleoptera: Gyrinidae).',
  ['Xu Z', 'Lenaghan SC', 'Reese BE', 'Jia X', 'Zhang M'],
  '2012',
  'PLoS computational biology

Because we restricted our search to papers in an open-access journal, you can then follow their DOIs to freely access their PDFs at the journal's website:<br>[10.1371/journal.pcbi.0040023](http://dx.doi.org/10.1371/journal.pcbi.0040023), [10.1371/journal.pcbi.1000948](http://dx.doi.org/10.1371/journal.pcbi.1000948), [10.1371/journal.pcbi.1002236](http://dx.doi.org/10.1371/journal.pcbi.1002236).

We will now issue calls to `Entrez.elink()` using these PubMed IDs, to retrieve the IDs of papers that cite them.
The database from which the IDs will be retrieved is [PubMed Central](http://www.ncbi.nlm.nih.gov/pmc/), a free digital database of full-text scientific literature in the biomedical and life sciences.

You can, for instance, find [archived here](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2951343/), with the PubMed Central ID 2951343, the paper "Critical dynamics in the evolution of stochastic strategies for the iterated prisoner's dilemma", which as we saw above, has the PubMed ID 20949101.

A complete list of the kinds of links you can retrieve with `Entrez.elink()` can be found [here](http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html).

In [34]:
CA_citing = {
    id : Entrez.read( Entrez.elink(
            cmd = "neighbor",               # ELink command mode: "neighbor", returns
                                            #     a set of UIDs in `db` linked to the input UIDs in `dbfrom`.
            dbfrom = "pubmed",              # Database containing the input UIDs: PubMed
            db = "pmc",                     # Database from which to retrieve UIDs: PubMed Central
            LinkName = "pubmed_pmc_refs",   # Name of the Entrez link to retrieve: "pubmed_pmc_refs", gets
                                            #     "Full-text articles in the PubMed Central Database that cite the current articles"
            from_uid = id                   # input UIDs
            ) )
    for id in CA_ids
    }

CA_citing['24853675']

[{'LinkSetDb': [{'Link': [{'Id': '4408149'}, {'Id': '4105914'}], 'LinkName': 'pubmed_pmc_refs', 'DbTo': 'pmc'}], 'LinkSetDbHistory': [], 'ERROR': [], 'IdList': ['24853675'], 'DbFrom': 'pubmed'}]

We have in `CA_citing[paper_id][0]['LinkSetDb'][0]['Link']` the list of papers citing `paper_id`. To get it as just a list of ids, we can do

In [35]:
cits = [ l['Id'] for l in CA_citing['24853675'][0]['LinkSetDb'][0]['Link'] ]
cits

['4408149', '4105914']

However, one more step is needed, as what we have now are PubMed Central IDs, and not PubMed IDs. Their conversion can be achieved through an additional call to `Entrez.elink()`:

In [36]:
cits_pm = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=",".join(cits)) )
cits_pm

[{'LinkSetDb': [{'Link': [{'Id': '25926883'}, {'Id': '25067946'}], 'LinkName': 'pmc_pubmed', 'DbTo': 'pubmed'}], 'LinkSetDbHistory': [], 'ERROR': [], 'IdList': ['4105914', '4408149'], 'DbFrom': 'pmc'}]

In [37]:
ids_map = { pmc_id : link['Id'] for (pmc_id,link) in zip(cits_pm[0]['IdList'], cits_pm[0]['LinkSetDb'][0]['Link']) }
ids_map

{'4105914': '25926883', '4408149': '25067946'}

And to check these papers:

In [38]:
{   p['Id'] : ( p['Title'], p['AuthorList'], p['PubDate'][:4], p['FullJournalName'], p.get('DOI', '') )
    for p in Entrez.read( Entrez.esummary(db="pubmed", id=','.join( ids_map.values() )) )
    }

{'25067946': ('Guiding deployment of resistance in cereals using evolutionary principles.',
  ['Burdon JJ', 'Barrett LG', 'Rebetzke G', 'Thrall PH'],
  '2014',
  'Evolutionary applications',
  '10.1111/eva.12175'),
 '25926883': ('Crop pathogen emergence and evolution in agro-ecological landscapes.',
  ['Papaïx J', 'Burdon JJ', 'Zhan J', 'Thrall PH'],
  '2015',
  'Evolutionary applications',
  '10.1111/eva.12251')}

### Citations dataset

We have now seen all the steps required to assemble a dataset of citations to each of the papers in our dataset.

In [39]:
Citations_file = 'data/' + search_term + '__Citations.pkl.bz2'
Citations = []

At least one server query will be issued per paper in `Ids`. Because NCBI allows for at most 3 queries per second (see [here](http://biopython.org/DIST/docs/api/Bio.Entrez-pysrc.html#_open)), this dataset will take a long time to assemble. Should you need to interrupt it for some reason, or the connection fail at some point, it is safe to just rerun the cell below until all data is collected.

In [None]:
import http.client

if Citations == [] and os.path.exists( Citations_file ):
    Citations = pickle.load( bz2.BZ2File( Citations_file, 'rb' ) )

if len(Citations) < len(Ids):
    
    i = len(Citations)
    checkpoint = len(Ids) / 10 + 1      # save to hard drive at every 10% of Ids fetched
    
    for pm_id in Ids[i:]:               # either starts from index 0, or resumes from where we previously left off
        
        while True:
            try:
                # query for papers archived in PubMed Central that cite the paper with PubMed ID `pm_id`
                c = Entrez.read( Entrez.elink( dbfrom = "pubmed", db="pmc", LinkName = "pubmed_pmc_refs", id=str(pm_id) ) )
                
                c = c[0]['LinkSetDb']
                if len(c) == 0:
                    # no citations found for the current paper
                    c = []
                else:
                    c = [ l['Id'] for l in c[0]['Link'] ]
                    
                    # convert citations from PubMed Central IDs to PubMed IDs
                    p = []
                    retrieve_per_query = 500
                    for start in range( 0, len(c), retrieve_per_query ):
                        query_ids = ','.join( c[start : start+retrieve_per_query] )
                        r = Entrez.read( Entrez.elink( dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed", from_uid=query_ids ) )
                        # select the IDs. If no matching PubMed ID was found, [] is returned instead
                        p.extend( [] if r[0]['LinkSetDb']==[] else [ int(link['Id']) for link in r[0]['LinkSetDb'][0]['Link'] ] )
                    c = p
            
            except http.client.BadStatusLine:
                # Presumably, the server closed the connection before sending a valid response. Retry until we have the data.
                print('r')
                continue
            break
        
        Citations.append( (pm_id, c) )
        if (i % 10000 == 0):
            print('')
            print(i, end='')
        if (i % 100 == 0):
            print('.', end='')
        i += 1
        
        if i % checkpoint == 0:
            print('\tsaving at checkpoint', i)
            pickle.dump( Citations, bz2.BZ2File( Citations_file, 'wb' ) )
    
    print('\n done.')
    
    # Save Citations, as a dictionary indexed by Ids
    Citations = dict( Citations )
    
    pickle.dump( Citations, bz2.BZ2File( Citations_file, 'wb' ) )

...........	saving at checkpoint 38113
..................
40000.............................

To see that we have indeed obtained the data we expected, you can match the ids below (citations to the paper "Critical dynamics in the evolution of stochastic strategies for the iterated prisoner's dilemma"), with the ids listed at the end of last section.

In [None]:
Citations[26488732]

## Where do we go from here?

Running the code above generates multiple local files, containing the datasets we'll be working with. Loading them into memory is a matter of just issuing a call like<br>
``data = pickle.load( bz2.BZ2File( data_file, 'rb' ) )``.

The Entrez module will therefore no longer be needed, unless you wish to extend your data processing with additional information retrieved from PubMed.

Should you be interested in looking at alternative ways to handle the data, have a look at the [sqlite3](http://docs.python.org/3/library/sqlite3.html) module included in Python's standard library, or [Pandas](http://pandas.pydata.org/), the Python Data Analysis Library.