## Scraping Arxiv Metadata using Arxiv API

In this tutorial, we'll scrape metadata from Arxiv using the Arxiv API. Before starting, let's take a quick look at Arxiv data and the ways of accessing it.

### About Arxiv

Arxiv is an electronic repository for preprints. It is mostly used in quantitative fields such as computer science and mathematics. Papers on Arxiv are not subject to peer review. Arxiv, which is owned and operated by Cornell University, currently hosts over 1.5 million papers. More information about Arxiv can be found [here](https://arxiv.org/).

### Arxiv Data Access

There are several ways to scrape data from Arxiv, so first determine what kind of data you need. Two types of data can be scraped from Arxiv:

1. Metadata
2. Full-Text

Metadata includes information about papers such as the title, authors, publication and last update dates, abstract, category, and Arxiv ID. Metadata for Arxiv papers is stored and retrieved in Atom format. More information about Atom can be found [here](https://validator.w3.org/feed/docs/atom.html).
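
To make the Atom structure concrete, here is a minimal sketch that reads the fields listed above from a small feed using only the standard library. The `SAMPLE_ATOM` string is hand-written for illustration (it is not a real API response), and `parse_entries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an Atom feed with one entry.
# This is an assumption for illustration, not a real API response.
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <updated>2019-01-02T00:00:00Z</updated>
    <published>2019-01-01T00:00:00Z</published>
    <title>An Example Title</title>
    <summary>An example abstract.</summary>
    <author><name>First Author</name></author>
    <author><name>Second Author</name></author>
    <category term="stat.TH"/>
  </entry>
</feed>"""

# Atom elements live in this XML namespace.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def parse_entries(atom_xml):
    """Return one dict of metadata per <entry> element in the feed."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM),
            "abstract": entry.findtext("atom:summary", default="", namespaces=ATOM),
            "authors": [a.findtext("atom:name", default="", namespaces=ATOM)
                        for a in entry.findall("atom:author", ATOM)],
            "categories": [c.get("term")
                           for c in entry.findall("atom:category", ATOM)],
        })
    return entries


for paper in parse_entries(SAMPLE_ATOM):
    print(paper["title"], paper["authors"], paper["categories"])
```

In practice a feed parsing library hides these details, but seeing the raw elements helps when a field is missing or named differently than expected.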

Full-text data includes the paper contents as well, mostly in TeX/LaTeX format. In this tutorial we'll focus on scraping metadata. There are three ways to access metadata:

1. OAI-PMH: Good for bulk metadata access.
2. Arxiv API: Good for real-time programmatic use.
3. RSS Feeds: Best for accessing daily updates on Arxiv.

### Arxiv API

Scraping metadata using the Arxiv API is pretty straightforward. One first needs to construct a query. A query to the Arxiv API is a single URL. For instance, let's say we'd like to access the first 2 papers whose titles include the word "graph" in the Statistics Theory category, stat.TH. For the available category types and other tips on constructing queries, please refer to the [Arxiv API User Manual](https://arxiv.org/help/api/user-manual#query_details). Note that some categories are cross-listed.

Example Query:

http://export.arxiv.org/api/query?search_query=ti:graph+AND+cat:stat.TH&start=0&max_results=2
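
Rather than typing such URLs by hand, they can be assembled from their parts. The sketch below is one possible way to do it with `urllib.parse.urlencode`; the helper name `build_query_url` is hypothetical, and keeping `:` and `+` unescaped is a choice made here so that the result matches the URL style used in this tutorial.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=None):
    """Assemble a query URL from a search expression and paging parameters."""
    params = {"search_query": search_query, "start": start}
    if max_results is not None:
        params["max_results"] = max_results
    # safe=":+" leaves the field separator and the '+' word separator as they are;
    # spaces in search_query are turned into '+' by urlencode.
    return BASE_URL + "?" + urlencode(params, safe=":+")


print(build_query_url("ti:graph+AND+cat:stat.TH", start=0, max_results=2))
```

Writing the expression with spaces (`"ti:graph AND cat:stat.TH"`) produces the same URL, which can be easier to read in code.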


### Handy Libraries in Python

We'll use `urllib` to query the Arxiv API and `feedparser` to parse the Atom feed it returns. Let's retrieve the feed for the above query and print the metadata for every paper.


In [1]:
# import the required libraries
import urllib.request
import feedparser

In [2]:
def query_arxiv(search_query, start, max_results=-1):
    """Query the Arxiv API and return the parsed Atom feed."""
    # accessing Arxiv API
    base_url = 'http://export.arxiv.org/api/query?'

    # constructing our query; max_results is left out when it is -1
    query = 'search_query=%s&start=%i' % (search_query, start)
    if max_results != -1:
        query += '&max_results=%i' % max_results

    # perform a GET request using the base_url and query,
    # closing the connection once the body has been read
    with urllib.request.urlopen(base_url + query) as response:
        content = response.read()

    # parse the response using feedparser
    feed = feedparser.parse(content)

    return feed

In [15]:
def main():
    search_query = 'ti:graph+AND+cat:stat.TH'
    start = 0
    max_results = 2

    # Querying the Arxiv API
    feed = query_arxiv(search_query, start, max_results) 
    
    # Print the feed information
    print('Feed last updated: %s' % feed.feed.updated)
    print('Total results for this query: %s' % feed.feed.opensearch_totalresults)
    print('Max results for this query: %s\n' % len(feed.entries))
    
    for entry in feed.entries:
        print("Title: ", entry.title)
        print("Authors: ")
        for author in entry.authors:
            print(author.name)
        print("Publication Date: ", entry.published)
        print("Arxiv ID: ", entry.id, "\n")
    

In [16]:
main()

Feed last updated: 2019-03-09T00:00:00-05:00
Total results for this query: 227
Max results for this query: 2

Title:  Bayesian Graph Selection Consistency For Decomposable Graphs
Authors: 
Yabo Niu
Debdeep Pati
Bani Mallick
Publication Date:  2019-01-14T05:24:55Z
Arxiv ID:  http://arxiv.org/abs/1901.04134v1 

Title:  Graph selection with GGMselect
Authors: 
Christophe Giraud
Sylvie Huet
Nicolas Verzelen
Publication Date:  2009-07-03T12:40:37Z
Arxiv ID:  http://arxiv.org/abs/0907.0619v2 
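
The ID printed above is a URL that ends with the paper identifier and a version suffix. If only the identifier is needed, for example to spot several versions of the same paper, it can be split off with a regular expression. This is a minimal sketch: `split_arxiv_id` is a hypothetical helper, and it only handles identifiers of the form shown in the output above (digits, a dot, digits, optional `vN`).

```python
import re

# Matches identifiers such as 1901.04134v1 at the end of an entry URL.
_ID_PATTERN = re.compile(r"abs/(\d{4}\.\d{4,5})(?:v(\d+))?$")


def split_arxiv_id(entry_id):
    """Split an entry URL into (identifier, version); version is None if absent."""
    match = _ID_PATTERN.search(entry_id.strip())
    if match is None:
        raise ValueError("unrecognised Arxiv ID: %r" % entry_id)
    identifier, version = match.groups()
    return identifier, (int(version) if version else None)


print(split_arxiv_id("http://arxiv.org/abs/1901.04134v1"))
print(split_arxiv_id("http://arxiv.org/abs/0907.0619v2"))
```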



### Sample Queries

* "LSTM" in any field (title, abstract, authors, etc.)

http://export.arxiv.org/api/query?search_query=all:LSTM

* "graph" in title, excluding the category stat.TH. With `start=2` we skip the first two results (indexing starts at 0), and with `max_results=4` we ask for 4 papers. Note that Arxiv returns at most 2000 results per query, so it is useful that the order of papers is the same for the same query every time: larger result sets can be fetched in consecutive slices.

http://export.arxiv.org/api/query?search_query=ti:graph+ANDNOT+cat:stat.TH&start=2&max_results=4
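
When a query matches more papers than fit in one response, the `start` and `max_results` values for each request can be computed up front. The generator below is a sketch of that bookkeeping only; `page_windows` and its default page size of 100 are assumptions made for illustration, and no request is sent.

```python
def page_windows(total_results, page_size=100, first=0):
    """Yield (start, max_results) pairs covering results first..total_results-1."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(first, total_results, page_size):
        # The last window may be shorter than page_size.
        yield start, min(page_size, total_results - start)


# 227 is the total reported for the "graph" query earlier in this tutorial.
for start, max_results in page_windows(227, page_size=100):
    print("start=%i max_results=%i" % (start, max_results))
```

Each pair can then be passed to `query_arxiv`, ideally with a short pause between requests so as not to overload the server.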

* Papers by either Larry Wasserman or Daphne Koller. Note that if you don't specify the number of results, 10 are returned by default. Moreover, `%22` (the URL-encoded double quote) is used to group text of more than one word into a single phrase.

http://export.arxiv.org/api/query?search_query=au:%22daphne+koller%22+OR+au:%22larry+wasserman%22
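
The quoting in this last query can be produced with `urllib.parse.quote_plus`, which encodes `"` as `%22` and spaces as `+`. The two helpers below, `field_phrase` and `any_of`, are hypothetical names used only for this sketch.

```python
from urllib.parse import quote_plus


def field_phrase(field, phrase):
    """Return a field-prefixed, quoted phrase, e.g. au:%22daphne+koller%22."""
    return "%s:%s" % (field, quote_plus('"%s"' % phrase))


def any_of(*terms):
    """Join already-encoded terms with the OR operator."""
    return "+OR+".join(terms)


search_query = any_of(field_phrase("au", "daphne koller"),
                      field_phrase("au", "larry wasserman"))
print("http://export.arxiv.org/api/query?search_query=" + search_query)
```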
