# Bulk metadata download

Contents:
1. Introduction
2. Access bulk metadata using the OAI-PMH
3. Some metadata exploration
4. Appendix: Access bulk metadata using the arXiv API


## 1. Introduction

In this notebook, we download metadata for all arXiv articles in the astro-ph category and save it as `arxiv_metadata_astroph.csv`. This metadata supports downstream tasks, such as identifying whether a particular article belongs to the astro-ph category.

arXiv offers [three ways](https://arxiv.org/help/bulk_data) to harvest bulk metadata. Here we use the preferred one, the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

OAI-PMH documentation:

- https://arxiv.org/help/oa/index
- http://www.openarchives.org/OAI/openarchivesprotocol.html

This notebook can be re-run at any time. If `arxiv_metadata_astroph.csv` does not exist, it is created. If it exists, records retrieved since the last run are appended to it.

## 2. Access bulk metadata using the OAI-PMH

Import dependencies:

In [3]:
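import datetime
import os
import time
import urllib.request  # `import urllib` alone does not guarantee that urllib.request is loaded

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup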

Categories (called sets) we can query:

In [3]:
url = 'http://export.arxiv.org/oai2?verb=ListSets'
results = urllib.request.urlopen(url).read()
soup = BeautifulSoup(results, 'xml')
sets = soup.find_all('setSpec')
for s in sets:
    print(s.text)

cs
econ
eess
math
physics
physics:astro-ph
physics:cond-mat
physics:gr-qc
physics:hep-ex
physics:hep-lat
physics:hep-ph
physics:hep-th
physics:math-ph
physics:nlin
physics:nucl-ex
physics:nucl-th
physics:physics
physics:quant-ph
q-bio
q-fin
stat


We are interested in the `physics:astro-ph` set. 

Submit an exploratory request to access metadata for articles in this set:

In [6]:
url = 'http://export.arxiv.org/oai2?verb=ListRecords&set=physics:astro-ph&metadataPrefix=arXiv'
results = urllib.request.urlopen(url).read()
soup = BeautifulSoup(results, 'xml')
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>
  2019-02-11T17:57:13Z
 </responseDate>
 <request metadataPrefix="arXiv" set="physics:astro-ph" verb="ListRecords">
  http://export.arxiv.org/oai2
 </request>
 <ListRecords>
  <record>
   <header>
    <identifier>
     oai:arXiv.org:0704.0009
    </identifier>
    <datestamp>
     2010-03-18
    </datestamp>
    <setSpec>
     physics:astro-ph
    </setSpec>
   </header>
   <metadata>
    <arXiv xmlns="http://arxiv.org/OAI/arXiv/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd">
     <id>
      0704.0009
     </id>
     <created>
      2007-04-02
     </created>
     <authors>
      <author>
       <keyname>
        Harvey
       </keyn

In [14]:
resumptionToken = soup.find('resumptionToken')
print('Number of article records obtained: ' + str(len(soup.find_all('record'))))
print('Total number of articles: ' + str(resumptionToken['completeListSize']))
print('Resumption token: ' + resumptionToken.string)
print('Date that we made this request: ' + soup.find('responseDate').string)

Number of article records obtained: 1000
Total number of articles: 250099
Resumption token: 3363841|3001
Date that we made this request: 2019-02-11T18:10:41Z


Each request returns metadata for up to 1,000 records. There were 250,099 articles in the set when this notebook was executed. We can access the metadata for the remaining articles by submitting further requests, each carrying the resumption token from the previous response in its URL.
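As a rough sketch of that follow-up step, the snippet below extracts the resumption token from a response and builds the next URL. It uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup so that it runs offline. `SAMPLE_RESPONSE` and `next_request_url` are illustrative names, and the sample XML is a hand-written stand-in shaped like the response above, not real output. In OAI-PMH the resumption token is an exclusive argument, so it is sent with the verb alone, without `set` or `metadataPrefix`.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}
BASE_URL = 'http://export.arxiv.org/oai2'

# Hand-written stand-in for a ListRecords response (records omitted)
SAMPLE_RESPONSE = '''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <responseDate>2019-02-11T18:10:41Z</responseDate>
 <ListRecords>
  <resumptionToken cursor="0" completeListSize="250099">3363841|3001</resumptionToken>
 </ListRecords>
</OAI-PMH>'''


def next_request_url(response_xml):
    """Return the URL for the next page, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find('.//oai:resumptionToken', OAI_NS)
    # A missing or empty token means there is nothing left to fetch
    if token is None or not (token.text or '').strip():
        return None
    # urlencode percent-escapes characters such as '|' in the token
    query = urllib.parse.urlencode({'verb': 'ListRecords',
                                    'resumptionToken': token.text.strip()})
    return BASE_URL + '?' + query


print(next_request_url(SAMPLE_RESPONSE))
```

Treating an empty token the same as a missing one matters because the protocol allows the last page of a list to carry an empty `resumptionToken` element.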

The protocol also asks us to [pause between requests](http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#HTTPResponseFormat); otherwise we may receive a 503 status code. An answer on Academia Stack Exchange recommends a wait period of [20 seconds](https://academia.stackexchange.com/a/38982).
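A fixed pause is the simplest approach. As an alternative, here is a minimal sketch of a fetch helper that retries on a 503 and uses the server's `Retry-After` header when one is present. `fetch_with_retry` and its parameters are hypothetical names, not part of any library; the `opener` and `sleep` arguments exist only so the helper can be exercised without network access. It assumes `Retry-After` is given in seconds and falls back to the 20-second wait otherwise.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, max_retries=5, default_wait=20,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, waiting and retrying whenever the server answers 503."""
    for attempt in range(max_retries + 1):
        try:
            return opener(url).read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 503, or a 503 on the last attempt
            if err.code != 503 or attempt == max_retries:
                raise
            # Prefer the server's Retry-After hint when it is a number of seconds
            hint = err.headers.get('Retry-After') if err.headers is not None else None
            wait = int(hint) if hint and hint.strip().isdigit() else default_wait
            print('Got 503, retrying in ' + str(wait) + ' seconds...')
            sleep(wait)
```

With this helper, a call such as `fetch_with_retry(url)` could stand in for `urllib.request.urlopen(url).read()`.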

Define a function to request metadata for all astro-ph articles:

In [4]:
def request_bulk_metadata(date_of_last_request):  
    rows = []
    resumptionToken = 'placeholder'
    url = 'http://export.arxiv.org/oai2?verb=ListRecords&set=physics:astro-ph&metadataPrefix=arXiv'
    
    # If we have specified the date of the last request, add it to the URL
    if date_of_last_request:
        url += '&from=' + date_of_last_request.strftime('%Y-%m-%d')

    # Continue requesting until we are not given any more resumption tokens
    while resumptionToken is not None:
        # Send request and receive results
        print('Requesting: ' + url)
        results = urllib.request.urlopen(url).read()

        # Parse with Beautiful Soup
        soup = BeautifulSoup(results, 'xml')
        records = soup.find_all('record')
        for record in records:
            # Get header data
            identifier = record.find('identifier')
            datestamp = record.find('datestamp')
            spec = record.find('setSpec')

            # Get metadata
            filename = record.find('id')
            created = record.find('created')
            updated = record.find('updated')
            authors = []
            for author in record.find_all('author'):
                forenames = author.forenames
                keyname = author.keyname
                if forenames and keyname:
                    authors.append(author.forenames.text.strip() + ' ' + author.keyname.text.strip())
            author_str = ', '.join(authors)
            title = record.find('title')
            categories = record.find('categories')
            journal = record.find('journal-ref')
            doi = record.find('doi')
            abstract = record.find('abstract')
            comments = record.find('comment')

            # Save current record as a row in the table
            row = {
                'identifier': getattr(identifier, 'text', None),
                'filename': getattr(filename, 'text', None),
                'spec': getattr(spec, 'text', None),
                'title': getattr(title, 'text', None),
                'datestamp': getattr(datestamp, 'text', None),
                'created': getattr(created, 'text', None),
                'updated': getattr(updated, 'text', None), # may have more than one instance that we're missing
                'authors': author_str,
                'categories': getattr(categories, 'text', None),
                'journal': getattr(journal, 'text', None),
                'doi': getattr(doi, 'text', None),
                'abstract': getattr(abstract, 'text', None),
                'comments': getattr(comments, 'text', None)
            }
            rows.append(row)

        # Get resumption token if provided
        token_tag = soup.find('resumptionToken')
        # The final page of a list may carry an empty resumptionToken element,
        # so an empty token is treated the same as a missing one
        resumptionToken = token_tag.text.strip() if token_tag is not None else None

        # Continue if we have a non-empty resumption token
        if resumptionToken:
            print('Status: ' + str(int(token_tag['cursor']) + 1) + '-' + str(len(rows)) + '/' + str(token_tag['completeListSize']) + '...')
            url = 'http://export.arxiv.org/oai2?verb=ListRecords&resumptionToken=' + resumptionToken
            time.sleep(20) # avoid 503 status
        else:
            # Otherwise, record the date of this last request and end the while loop
            resumptionToken = None
            requestDate = soup.find('responseDate').text

    return rows, requestDate

Execute request:

In [6]:
metadata_filepath = 'arxiv_metadata_astroph.csv'

if os.path.exists(metadata_filepath):
    # If the metadata file exists, load it into a data frame
    existing_metadata_df = pd.read_csv(metadata_filepath, 
                                       dtype={'filename': str,
                                              'filename_parsed': str,
                                              'identifier': str,
                                              'updated': str,
                                              'doi': str}, 
                                       parse_dates=['date_retrieved'])
    # Get the date of the last request
    date_of_last_request = existing_metadata_df['date_retrieved'].max()
    print(metadata_filepath + ' last updated on ' + date_of_last_request.strftime('%Y-%m-%d'))
    print('Updating...')
    # Send a request to access metadata since that date
    records, requestDate = request_bulk_metadata(date_of_last_request + datetime.timedelta(days=1))
    # Create data frame for records to specify additional info
    if len(records) > 0:
        print('Number of new records found: ' + str(len(records)))
        records_df = pd.DataFrame(records)
        records_df['date_retrieved'] = np.full(len(records_df), requestDate)
        # Derive the parsed filename from the new records, not from the existing ones
        records_df['filename_parsed'] = records_df['filename'].str.replace('/', '')
        # Update metadata file
        metadata_df = pd.concat([existing_metadata_df, records_df], axis=0, sort=True, ignore_index=True)
        metadata_df.to_csv(metadata_filepath, index=False)
        print('Metadata has been updated.')
    else: 
        print('No additional records found. Metadata is up to date.')
else:
    # If the metadata file doesn't exist, request all metadata
    print(metadata_filepath + ' is being created...')
    records, requestDate = request_bulk_metadata(None)
    # Load records into data frame
    metadata_df = pd.DataFrame(records)
    # Add a column to specify additional info
    metadata_df['date_retrieved'] = np.full(len(metadata_df), requestDate)
    metadata_df['filename_parsed'] = metadata_df['filename'].str.replace('/', '')
    # Save it to CSV
    metadata_df.to_csv(metadata_filepath, index=False)
    print('Metadata has been saved.')

arxiv_metadata_astroph.csv last updated on 2019-02-12
Updating...
Requesting: http://export.arxiv.org/oai2?verb=ListRecords&set=physics:astro-ph&metadataPrefix=arXiv&from=2019-02-13
No additional records found. Metadata is up to date.


View the metadata:

In [96]:
metadata_df

Unnamed: 0,abstract,authors,categories,comments,created,date_retrieved,datestamp,doi,filename,filename_parsed,identifier,journal,spec,title,updated
0,We discuss the results from the combined IRA...,"['Paul Harvey', 'Bruno Merin', 'Tracy L. Huard...",astro-ph,,4/2/07,2019-02-07 23:29:53,3/18/10,10.1086/518646,704.0009,704.001,oai:arXiv.org:0704.0009,"Astrophys.J.663:1149-1173,2007",physics:astro-ph,"The Spitzer c2d Survey of Large, Nearby, Inste...",
1,Results from spectroscopic observations of t...,"['Nceba Mhlahlo', 'David H. Buckley', 'Vikram ...",astro-ph,,3/31/07,2019-02-07 23:29:53,6/23/09,10.1111/j.1365-2966.2007.11762.x,704.0017,704.002,oai:arXiv.org:0704.0017,"Mon.Not.Roy.Astron.Soc.378:211-220,2007",physics:astro-ph,Spectroscopic Observations of the Intermediate...,
2,"The very nature of the solar chromosphere, i...","['M. A. Loukitcheva', 'S. K. Solanki', 'S. Whi...",astro-ph,,3/31/07,2019-02-07 23:29:53,6/23/09,10.1007/s10509-007-9626-1,704.0023,704.002,oai:arXiv.org:0704.0023,"Astrophys.Space Sci.313:197-200,2008",physics:astro-ph,ALMA as the ideal probe of the solar chromosphere,
3,We present a theoretical framework for plasm...,"['A. A. Schekochihin', 'S. C. Cowley', 'W. Dor...",astro-ph nlin.CD physics.plasm-ph physics.spac...,,3/31/07,2019-02-07 23:29:53,5/13/15,10.1088/0067-0049/182/1/310,704.0044,704.004,oai:arXiv.org:0704.0044,"ApJS 182, 310 (2009)",physics:astro-ph,Astrophysical gyrokinetics: kinetic and fluid ...,5/9/09
4,We report on the analysis of selected single...,"['Alexander Stroeer', 'John Veitch', 'Christia...",gr-qc astro-ph,,3/31/07,2019-02-07 23:29:53,11/26/08,10.1088/0264-9381/24/19/S17,704.0048,704.005,oai:arXiv.org:0704.0048,"Class.Quant.Grav.24:S541-S550,2007",physics:astro-ph,Inference on white dwarf binary systems using ...,4/3/07
5,We derive masses and radii for both componen...,"['T. G. Beatty', 'J. M. Fernandez', 'D. W. Lat...",astro-ph,,3/31/07,2019-02-07 23:29:53,6/23/09,10.1086/518413,704.0059,704.006,oai:arXiv.org:0704.0059,"Astrophys.J.663:573-582,2007",physics:astro-ph,The Mass and Radius of the Unseen M-Dwarf Comp...,4/9/07
6,We show that the globular cluster mass funct...,"['Dean E. McLaughlin', 'S. Michael Fall']",astro-ph,,4/1/07,2019-02-07 23:29:53,11/11/10,10.1086/533485,704.008,704.008,oai:arXiv.org:0704.0080,"Astrophys.J.679:1272-1287,2008",physics:astro-ph,Shaping the Globular Cluster Mass Function by ...,6/11/08
7,We present semi-analytical constraint on the...,['HongSheng Zhao'],astro-ph,,4/2/07,2019-02-07 23:29:53,5/23/07,,704.0094,704.009,oai:arXiv.org:0704.0094,,physics:astro-ph,Timing and Lensing of the Colliding Bullet Clu...,
8,Context. Swift data are revolutionising our ...,"['P. A. Evans', 'A. P. Beardmore', 'K. L. Page...",astro-ph,,4/2/07,2019-02-07 23:29:53,11/13/09,10.1051/0004-6361:20077530,704.0128,704.013,oai:arXiv.org:0704.0128,,physics:astro-ph,An online repository of Swift/XRT light curves...,4/19/07
9,We report the first detection of the 6.2micr...,"['D. Lutz', 'E. Sturm', 'L. J. Tacconi', 'E. V...",astro-ph,,4/2/07,2019-02-07 23:29:53,11/13/09,10.1086/518537,704.0133,704.013,oai:arXiv.org:0704.0133,,physics:astro-ph,PAH emission and star formation in the host of...,


Now we have the metadata for all of the astro-ph articles on arXiv!

## 3. Some metadata exploration

Number of articles in each category:

In [300]:
unique, counts = np.unique(np.concatenate(metadata_df['categories'].str.split()), return_counts=True)
print(np.asarray((unique, counts)).T)

[['acc-phys' '1']
 ['adap-org' '14']
 ['alg-geom' '3']
 ['astro-ph' '105361']
 ['astro-ph.CO' '46302']
 ['astro-ph.EP' '15193']
 ['astro-ph.GA' '34545']
 ['astro-ph.HE' '32271']
 ['astro-ph.IM' '14857']
 ['astro-ph.SR' '37369']
 ['atom-ph' '8']
 ['bayes-an' '3']
 ['chao-dyn' '74']
 ['chem-ph' '2']
 ['comp-gas' '11']
 ['cond-mat' '193']
 ['cond-mat.dis-nn' '29']
 ['cond-mat.mes-hall' '62']
 ['cond-mat.mtrl-sci' '140']
 ['cond-mat.other' '121']
 ['cond-mat.quant-gas' '76']
 ['cond-mat.soft' '78']
 ['cond-mat.stat-mech' '582']
 ['cond-mat.str-el' '57']
 ['cond-mat.supr-con' '183']
 ['cs.AI' '27']
 ['cs.CC' '4']
 ['cs.CE' '45']
 ['cs.CG' '5']
 ['cs.CL' '2']
 ['cs.CR' '1']
 ['cs.CV' '97']
 ['cs.CY' '14']
 ['cs.DB' '37']
 ['cs.DC' '148']
 ['cs.DL' '114']
 ['cs.DM' '3']
 ['cs.DS' '10']
 ['cs.GR' '18']
 ['cs.HC' '19']
 ['cs.IR' '17']
 ['cs.IT' '63']
 ['cs.LG' '87']
 ['cs.MM' '5']
 ['cs.MS' '21']
 ['cs.NA' '18']
 ['cs.NE' '28']
 ['cs.NI' '10']
 ['cs.OH' '9']
 ['cs.OS' '1']
 ['cs.PF' '17']
 ['cs
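The same tally can be sketched with `collections.Counter` from the standard library, which skips records with missing categories and sorts by frequency. The list below is a made-up stand-in for `metadata_df['categories']`, and `count_categories` is an illustrative helper name.

```python
from collections import Counter

# Toy stand-in for metadata_df['categories']; None marks a record without categories
sample_categories = ['astro-ph', 'gr-qc astro-ph', 'astro-ph.GA astro-ph.CO',
                     None, 'astro-ph.GA']


def count_categories(category_strings):
    """Count how often each category appears across space-separated strings."""
    counts = Counter()
    for entry in category_strings:
        # Skip missing values (None or NaN), which are not strings
        if isinstance(entry, str):
            counts.update(entry.split())
    return counts


# Print categories from most to least frequent
for category, n in count_categories(sample_categories).most_common():
    print(category, n)
```

On the real data, `count_categories(metadata_df['categories'])` would take the column directly, since iterating over a pandas Series yields its values.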

## 4. Appendix: Access bulk metadata using the arXiv API

Although the OAI-PMH is the preferred way of accessing bulk metadata, the arXiv API may be useful for easy integration with web services and toolkits. The code below requests metadata for astro-ph articles. It has not been refined as much as the code above, but it can serve as a reference if we decide to use this API in the future.

Note that with this API it is easy to query subclasses of the astro-ph category, whereas with the OAI-PMH this is, as far as we know, not possible.
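A query for a single subclass comes down to a `cat:` search term in the request URL. The sketch below builds such a URL with `urllib.parse.urlencode`, which also takes care of escaping. `build_api_query_url` is an illustrative helper name, and the parameters mirror the ones used in the cell that follows.

```python
import urllib.parse

API_URL = 'http://export.arxiv.org/api/query'


def build_api_query_url(category, start=0, max_results=1000):
    """Build an arXiv API query URL for one category or subclass."""
    params = {
        'search_query': 'cat:' + category,  # restrict the search to this category
        'start': start,                     # zero-based index of the first result
        'max_results': max_results,         # page size
    }
    return API_URL + '?' + urllib.parse.urlencode(params)


# Example: first page of results for the astrophysics-of-galaxies subclass
print(build_api_query_url('astro-ph.ga'))
```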

In [278]:
# Specify categories to search
categories = [
    'astro-ph',     # general astrophysics
    'astro-ph.ga',  # astrophysics of galaxies
    'astro-ph.co',  # cosmology and nongalactic astrophysics
    'astro-ph.ep',  # earth and planetary astrophysics
    'astro-ph.he',  # high energy astrophysical phenomena
    'astro-ph.im',  # instrumentation and methods for astrophysics 
    'astro-ph.sr']  # solar and stellar astrophysics

results_per_page = 1000
rows = []

def get_metadata_from_arXiv_API():
    for category in categories:
        print('Getting metadata for articles within the ' + category + ' category...')
        # Search parameters
        search_query = 'cat:' + category
        startIndex = 0

        # Loop until total results reached
        while True:
            url = 'http://export.arxiv.org/api/query?search_query=' + search_query + '&start=' + str(startIndex) + '&max_results=' + str(results_per_page)
            results = urllib.request.urlopen(url).read()
            soup = BeautifulSoup(results, 'xml')

            if startIndex == 0:
                total_results = int(soup.find('opensearch:totalResults').string)

            # Get all entry tags
            entries = soup.find_all('entry')

            for entry in entries: 
                # Collect authors
                authors = []
                for author in entry.find_all('name'):
                    authors.append(author.string)
                author_str = ', '.join(authors)

                # Get DOI if it exists
                doi = entry.find('arxiv:doi')
                if doi:
                    doi = doi.string

                # Extract links if they exist
                doi_link = None
                pdf_link = None
                links = entry.find_all('link')
                for link in links:
                    link_title = link.get('title')
                    if link_title and link_title == 'doi':
                        doi_link = link['href']
                    elif link_title and link_title == 'pdf':
                        pdf_link = link['href']

                # Get journal if it exists
                journal = entry.find('arxiv:journal_ref')
                if journal:
                    journal = journal.string

                # Get comment if it exists
                comment = entry.find('arxiv:comment')
                if comment:
                    comment = comment.string

                row = {
                    'id': entry.id.string,
                    'updated': entry.updated.string,
                    'published': entry.published.string,
                    'title': entry.title.string,
                    'summary': entry.summary.string,
                    'authors': author_str,
                    'doi': doi,
                    'doi_link': doi_link,
                    'journal': journal,
                    'pdf_link': pdf_link,
                    'category': entry.find('arxiv:primary_category')['term'],
                    'comment': comment
                }
                rows.append(row)

            # Quit looping once this category is exhausted. `rows` accumulates
            # across categories, so compare this category's position, not len(rows).
            fetched = startIndex + len(entries)
            if fetched >= total_results or len(entries) == 0:
                print()
                break
            else:
                print(str(fetched) + '/' + str(total_results) + '...')
                startIndex += results_per_page
                time.sleep(3) # recommended to sleep

In [None]:
start = time.time()
get_metadata_from_arXiv_API()
end = time.time()
print(str(end - start) + ' seconds')

In [None]:
metadata_api_df = pd.DataFrame(rows)
metadata_api_df

Save data frame to CSV:

In [None]:
metadata_api_df.to_csv('arXiv_astroph_metadata_api.csv', index=False, header=True)