There are [two ways](https://arxiv.org/help/bulk_data) to access bulk metadata:
- API: the code below is complete, but this route is not used because it is too slow for bulk harvesting.
    - [User manual](https://arxiv.org/help/api/user-manual)
- OAI-PMH: this is the preferred way to bulk-download or keep an up-to-date copy of arXiv metadata.

In [155]:
import time
import urllib.request  # `import urllib` alone does not guarantee urllib.request is loaded

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from enum import Enum

Get metadata from arXiv API:

In [278]:
# Specify categories to search
categories = [
    'astro-ph',     # general astrophysics
    'astro-ph.ga',  # astrophysics of galaxies
    'astro-ph.co',  # cosmology and nongalactic astrophysics
    'astro-ph.ep',  # earth and planetary astrophysics
    'astro-ph.he',  # high energy astrophysical phenomena
    'astro-ph.im',  # instrumentation and methods for astrophysics 
    'astro-ph.sr']  # solar and stellar astrophysics

results_per_page = 1000
rows = []

def get_metadata_from_arXiv_API():
    for category in categories:
        print('Getting metadata for articles within the ' + category + ' category...')
        # Search parameters
        search_query = 'cat:' + category
        startIndex = 0

        # Loop until total results reached
        while True:
            url = 'http://export.arxiv.org/api/query?search_query=' + search_query + '&start=' + str(startIndex) + '&max_results=' + str(results_per_page)
            results = urllib.request.urlopen(url).read()
            soup = BeautifulSoup(results, 'xml')

            if startIndex == 0:
                total_results = int(soup.find('opensearch:totalResults').string)

            # Get all entry tags
            entries = soup.find_all('entry')

            for entry in entries: 
                # Collect authors
                authors = []
                for author in entry.find_all('name'):
                    authors.append(author.string)
                author_str = ', '.join(authors)

                # Get DOI if it exists
                doi = entry.find('arxiv:doi')
                if doi:
                    doi = doi.string

                # Extract links if they exist
                doi_link = None
                pdf_link = None
                links = entry.find_all('link')
                for link in links:
                    link_title = link.get('title')
                    if link_title and link_title == 'doi':
                        doi_link = link['href']
                    elif link_title and link_title == 'pdf':
                        pdf_link = link['href']

                # Get journal if it exists
                journal = entry.find('arxiv:journal_ref')
                if journal:
                    journal = journal.string

                # Get comment if it exists
                comment = entry.find('arxiv:comment')
                if comment:
                    comment = comment.string

                row = {
                    'id': entry.id.string,
                    'updated': entry.updated.string,
                    'published': entry.published.string,
                    'title': entry.title.string,
                    'summary': entry.summary.string,
                    'authors': author_str,
                    'doi': doi,
                    'doi_link': doi_link,
                    'journal': journal,
                    'pdf_link': pdf_link,
                    'category': entry.find('arxiv:primary_category')['term'],
                    'comment': comment
                }
                rows.append(row)

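            # Advance by the number of entries actually received, so that an
            # empty or partial page is requested again instead of being skipped.
            # Progress is tracked per category; `rows` accumulates across categories.
            startIndex += len(entries)

            # Quit looping once we have all results for this category
            if startIndex >= total_results:
                print()
                break
            else:
                print(str(startIndex) + '/' + str(total_results) + '...')
                time.sleep(3)  # recommended delay between requests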

In [279]:
start = time.time()
get_metadata_from_arXiv_API()
end = time.time()
print(str(end - start) + ' seconds')

Getting metadata for articles within the astro-ph category
1000/105380...
2000/105380...
3000/105380...
4000/105380...
5000/105380...
6000/105380...
7000/105380...
8000/105380...
9000/105380...
10000/105380...
11000/105380...
11000/105380...
12000/105380...
13000/105380...
14000/105380...
15000/105380...
16000/105380...
17000/105380...
18000/105380...
18000/105380...
19000/105380...
20000/105380...
21000/105380...
22000/105380...
23000/105380...
23600/105380...
24600/105380...
25600/105380...
26600/105380...
27600/105380...
28600/105380...
29600/105380...
30600/105380...
31600/105380...
32600/105380...
33600/105380...
34600/105380...
35600/105380...
36600/105380...
37600/105380...
38600/105380...
39600/105380...
40600/105380...
41600/105380...
42600/105380...
43600/105380...
43600/105380...
43600/105380...
44600/105380...
45600/105380...
45600/105380...
45600/105380...
45600/105380...
45600/105380...
45600/105380...
45600/105380...
45600/105380...
45600/105380...
45600/105380...
45600/

KeyboardInterrupt: 

In [131]:
metadata_api_df = pd.DataFrame(rows)
metadata_api_df

Unnamed: 0,authors,category,comment,doi,doi_link,id,journal,pdf_link,published,summary,title,updated
0,"Ramesh Narayan, Bohdan Paczyński, Tsvi Piran",astro-ph,14 pages,10.1086/186493,http://dx.doi.org/10.1086/186493,http://arxiv.org/abs/astro-ph/9204001v1,Astrophys.J. 395 (1992) L83-L86,http://arxiv.org/pdf/astro-ph/9204001v1,1992-04-13T18:20:01Z,It is proposed that gamma-ray bursts are cre...,Gamma-Ray Bursts as the Death Throes of Massiv...,1992-04-13T18:20:01Z
1,"Lawrence Krauss, Martin White",astro-ph,13 pages plus figures (not included),10.1086/171792,http://dx.doi.org/10.1086/171792,http://arxiv.org/abs/astro-ph/9204002v1,"Astrophys.J.397:357,1992",http://arxiv.org/pdf/astro-ph/9204002v1,1992-04-26T17:54:00Z,The four observables associated with gravita...,Gravitational Lensing and the Variability of G,1992-04-26T17:54:00Z
2,J. I. Katz,astro-ph,10 pages (Replaced to provide omitted line.),10.1007/BF00645080,http://dx.doi.org/10.1007/BF00645080,http://arxiv.org/abs/astro-ph/9204003v2,,http://arxiv.org/pdf/astro-ph/9204003v2,1992-04-29T16:36:30Z,The BATSE experiment on GRO has demonstrated...,The Ptolemaic Gamma-Ray Burst Universe,1992-04-30T20:39:38Z
3,"B P Schmidt, R P Kirshner, R G Eastman",astro-ph,21 pages,10.1086/171659,http://dx.doi.org/10.1086/171659,http://arxiv.org/abs/astro-ph/9204004v1,Astrophys.J. 395 (1992) 366-386,http://arxiv.org/pdf/astro-ph/9204004v1,1992-04-30T19:20:04Z,We use the Expanding Photosphere Method to d...,Expanding Photospheres of Type II Supernovae a...,1992-04-30T19:20:04Z
4,"B. J. Carrigan, J. I. Katz",astro-ph,24 pages,10.1086/171906,http://dx.doi.org/10.1086/171906,http://arxiv.org/abs/astro-ph/9204005v1,Astrophys.J. 399 (1992) 100-107,http://arxiv.org/pdf/astro-ph/9204005v1,1992-04-30T19:18:05Z,We have calculated gamma-ray radiative trans...,Radiation Transfer in Gamma-Ray Bursts,1992-04-30T19:18:05Z
5,"D. J. Johnson, M. W. Friedlander, J. I. Katz",astro-ph,29 pages,10.1086/172552,http://dx.doi.org/10.1086/172552,http://arxiv.org/abs/astro-ph/9204006v1,,http://arxiv.org/pdf/astro-ph/9204006v1,1992-04-30T19:30:14Z,Dust is observed to form in nova ejecta. The...,Nova Dust Nucleation: Kinetics and Photodissoc...,1992-04-30T19:30:14Z
6,Valerio Faraoni,astro-ph,12 pages,10.1086/171866,http://dx.doi.org/10.1086/171866,http://arxiv.org/abs/astro-ph/9205001v1,"Astrophys.J.398:425,1992",http://arxiv.org/pdf/astro-ph/9205001v1,1992-05-01T16:41:45Z,We apply Perlick's (1990a) rigorous formulat...,Nonstationary Gravitational Lenses and the Fer...,1992-05-01T16:41:45Z
7,"T. Hanawa, R. Matsumoto, K. Shibata",astro-ph,12 pages,10.1086/186454,http://dx.doi.org/10.1086/186454,http://arxiv.org/abs/astro-ph/9205002v1,Astrophys.J. 393 (1992) L71-L74,http://arxiv.org/pdf/astro-ph/9205002v1,1992-05-02T02:12:56Z,The effect of the magnetic skew on the Parke...,Giant Molecular Cloud Formation through the Pa...,1992-05-02T02:12:56Z
8,J. I. Katz,astro-ph,3 pages,10.1063/1.42698,http://dx.doi.org/10.1063/1.42698,http://arxiv.org/abs/astro-ph/9205003v1,,http://arxiv.org/pdf/astro-ph/9205003v1,1992-05-04T18:57:07Z,I present a model for acceleration of proton...,Particle Acceleration in (by) Accretion Discs,1992-05-04T18:57:07Z
9,"A. Cappi, S. Maurogordato",astro-ph,20 pages,,,http://arxiv.org/abs/astro-ph/9205004v1,"Astron.Astrophys.259:423-434,1992",http://arxiv.org/pdf/astro-ph/9205004v1,1992-05-08T12:22:00Z,We compare the spatial distributions of gala...,The Spatial Distribution of Nearby Galaxy Clus...,1992-05-08T12:22:00Z


Save data frame to CSV:

In [None]:
metadata_api_df.to_csv('arXiv_metadata_api.csv', index=False, header=True)

This approach is too slow. I stopped after 5000 records, once I found that arXiv offers a separate interface (OAI-PMH) for bulk metadata harvesting.

## Harvesting with the Open Archives Initiative:

- https://arxiv.org/help/oa/index
- http://www.openarchives.org/OAI/openarchivesprotocol.html

Differences:
- With the arXiv API, we can query a specific astro-ph subject class, e.g. `astro-ph.GA`. With OAI-PMH we cannot: the finest set available is the whole `physics:astro-ph` archive.
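
Subject-class filtering can still be done on the client side after harvesting, because each OAI record carries a space-separated `categories` field. The following is a minimal sketch, assuming rows shaped like the dictionaries built in the harvesting loop of this notebook; the helper name `filter_by_category` and the sample rows are hypothetical.

```python
def filter_by_category(rows, category):
    """Return rows whose space-separated 'categories' field lists `category`.

    Matching is case-insensitive and on whole category names, so
    'astro-ph' does not match 'astro-ph.GA'.
    """
    wanted = category.lower()
    return [
        row for row in rows
        if wanted in (row.get('categories') or '').lower().split()
    ]


# Hypothetical sample rows, for illustration only
sample_rows = [
    {'filename': 'a', 'categories': 'astro-ph.GA astro-ph.CO'},
    {'filename': 'b', 'categories': 'gr-qc astro-ph'},
    {'filename': 'c', 'categories': None},
]
galaxy_rows = filter_by_category(sample_rows, 'astro-ph.ga')
print([row['filename'] for row in galaxy_rows])
```

Matching on whole names keeps the legacy `astro-ph` class separate from the newer dotted subclasses.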

In [168]:
url = 'http://export.arxiv.org/oai2?verb=ListSets'
results = urllib.request.urlopen(url).read()
soup = BeautifulSoup(results, 'xml')
sets = soup.find_all('setSpec')
for s in sets:
    print(s.text)

cs
econ
eess
math
physics
physics:astro-ph
physics:cond-mat
physics:gr-qc
physics:hep-ex
physics:hep-lat
physics:hep-ph
physics:hep-th
physics:math-ph
physics:nlin
physics:nucl-ex
physics:nucl-th
physics:physics
physics:quant-ph
q-bio
q-fin
stat


We are interested in the `physics:astro-ph` set. 

In [238]:
url = 'http://export.arxiv.org/oai2?verb=ListRecords&set=physics:astro-ph&metadataPrefix=arXiv'
results = urllib.request.urlopen(url).read()
soup = BeautifulSoup(results, 'xml')
resumptionToken = soup.find('resumptionToken')
print('Number of article records obtained: ' + str(len(soup.find_all('record'))))
print('Total number of articles: ' + str(resumptionToken['completeListSize']))
print('Resumption token: ' + resumptionToken.string)
print('Date that we made this request: ' + soup.find('responseDate').string)
print()

print(soup.prettify())

Number of article records obtained: 1000
Total number of articles: 250047
Resumption token: 3350519|1001
Date that we made this request: 2019-02-08T20:08:02Z

<?xml version="1.0" encoding="utf-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
 <responseDate>
  2019-02-08T20:08:02Z
 </responseDate>
 <request metadataPrefix="arXiv" set="physics:astro-ph" verb="ListRecords">
  http://export.arxiv.org/oai2
 </request>
 <ListRecords>
  <record>
   <header>
    <identifier>
     oai:arXiv.org:0704.0009
    </identifier>
    <datestamp>
     2010-03-18
    </datestamp>
    <setSpec>
     physics:astro-ph
    </setSpec>
   </header>
   <metadata>
    <arXiv xmlns="http://arxiv.org/OAI/arXiv/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://arxiv.org/OAI/arXiv/ http://arxiv.org/OAI/arXiv.xsd

Each response returns only 1,000 records, so we need to fetch the rest in batches using the `resumptionToken`. We also need to pause about 20 s between requests; otherwise the server responds with HTTP status `503`.

https://academia.stackexchange.com/questions/38969/getting-a-dump-of-arxiv-metadata
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#HTTPResponseFormat
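
A fixed pause works, but a request can still be refused now and then. The sketch below shows one possible way to retry on HTTP 503, assuming the server may send a `Retry-After` header with a number of seconds (the OAI-PMH specification describes this for 503 responses). The helper name `fetch_with_retry` and its parameters are hypothetical, not part of the original notebook.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, opener=urllib.request.urlopen, max_tries=5,
                     default_wait=20, sleep=time.sleep):
    """Fetch `url`, waiting and retrying whenever the server answers HTTP 503.

    `opener` and `sleep` are injectable so the helper can be exercised
    without network access (an assumption made for this sketch).
    """
    for attempt in range(max_tries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 503, or the last failed attempt
            if err.code != 503 or attempt == max_tries - 1:
                raise
            # Honor Retry-After when it is a plain number of seconds
            retry_after = err.headers.get('Retry-After') if err.headers else None
            if retry_after and retry_after.isdigit():
                wait = int(retry_after)
            else:
                wait = default_wait
            sleep(wait)
```

In the harvesting loop, `results = fetch_with_retry(url)` could stand in for the direct `urllib.request.urlopen(url).read()` call.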

In [261]:
url = 'http://export.arxiv.org/oai2?verb=ListRecords&set=physics:astro-ph&metadataPrefix=arXiv'
resumptionToken = 'placeholder'
responseDate = None
rows = []

while resumptionToken is not None:
    # Send query and receive results
    results = urllib.request.urlopen(url).read()
    
    # Parse with Beautiful Soup
    soup = BeautifulSoup(results, 'xml')
    records = soup.find_all('record')
    for record in records:
        # Get header data
        identifier = record.find('identifier')
        datestamp = record.find('datestamp')
        spec = record.find('setSpec')
        
        # Get metadata
        filename = record.find('id')
        created = record.find('created')
        updated = record.find('updated')
        authors = []
        for author in record.find_all('author'):
            forenames = author.forenames
            keyname = author.keyname
            if forenames and keyname:
                authors.append(author.forenames.text.strip() + ' ' + author.keyname.text.strip())
        author_str = ', '.join(authors)
        title = record.find('title')
        categories = record.find('categories')
        journal = record.find('journal-ref')
        doi = record.find('doi')
        abstract = record.find('abstract')
        comments = record.find('comments')  # the arXiv metadata format names this element <comments>

        # Add record as row
        row = {
            'identifier': getattr(identifier, 'text', None),
            'filename': getattr(filename, 'text', None),
            'spec': getattr(spec, 'text', None),
            'title': getattr(title, 'text', None),
            'datestamp': getattr(datestamp, 'text', None),
            'created': getattr(created, 'text', None),
            'updated': getattr(updated, 'text', None), # may have more than one instance that we're missing
            'authors': authors,
            'categories': getattr(categories, 'text', None),
            'journal': getattr(journal, 'text', None),
            'doi': getattr(doi, 'text', None),
            'abstract': getattr(abstract, 'text', None),
            'comments': getattr(comments, 'text', None)
        }
        rows.append(row)
        
    # Get resumption token if provided
    token_tag = soup.find('resumptionToken')
    resumptionToken = token_tag.text.strip() if token_tag is not None else ''

    # Continue if we have a non-empty resumption token
    # (OAI-PMH marks the final page with an empty <resumptionToken/> element)
    if resumptionToken:
        print(str(len(rows)) + '/' + str(token_tag.get('completeListSize')) + '...')
        url = 'http://export.arxiv.org/oai2?verb=ListRecords&resumptionToken=' + resumptionToken
        time.sleep(20)  # avoid HTTP 503 status
    else:
        # Otherwise, record the date of the last request and end the while loop
        responseDate = soup.find('responseDate').text
        resumptionToken = None

1000/250047...
2000/250047...
3000/250047...
4000/250047...
5000/250047...
6000/250047...
7000/250047...
8000/250047...
9000/250047...
10000/250047...
11000/250047...
12000/250047...
13000/250047...
14000/250047...
15000/250047...
16000/250047...
17000/250047...
18000/250047...
19000/250047...
20000/250047...
21000/250047...
22000/250047...
23000/250047...
24000/250047...
25000/250047...
26000/250047...
27000/250047...
28000/250047...
29000/250047...
30000/250047...
31000/250047...
32000/250047...
33000/250047...
34000/250047...
35000/250047...
36000/250047...
37000/250047...
38000/250047...
39000/250047...
40000/250047...
41000/250047...
42000/250047...
43000/250047...
44000/250047...
45000/250047...
46000/250047...
47000/250047...
48000/250047...
49000/250047...
50000/250047...
51000/250047...
52000/250047...
53000/250047...
54000/250047...
55000/250047...
56000/250047...
57000/250047...
58000/250047...
59000/250047...
60000/250047...
61000/250047...
62000/250047...
63000/250047...
6

Load a data frame with metadata for each astro-ph article: 

In [302]:
identifiers_df = pd.DataFrame(rows)
identifiers_df['date_retrieved'] = np.full(len(identifiers_df), responseDate)
identifiers_df['filename_parsed'] = identifiers_df['filename'].str.replace('/', '')
identifiers_df

KeyError: 'filename'

Save data frame to CSV:

In [263]:
identifiers_df.to_csv('arXiv_metadata_oai.csv', index=False, header=True)

Example of reading the CSV back in with `filename` as a string, so that leading zeros (e.g. `0704.0009`) are preserved:

In [301]:
opened_identifiers_df = pd.read_csv('arXiv_identifiers.csv', dtype={'filename': str})
opened_identifiers_df

Unnamed: 0,abstract,authors,categories,comments,created,datestamp,doi,filename,identifier,journal,spec,title,updated,date_retrieved
0,We discuss the results from the combined IRA...,"['Paul Harvey', 'Bruno Merin', 'Tracy L. Huard...",astro-ph,,2007-04-02,2010-03-18,10.1086/518646,0704.0009,oai:arXiv.org:0704.0009,"Astrophys.J.663:1149-1173,2007",physics:astro-ph,"The Spitzer c2d Survey of Large, Nearby, Inste...",,2019-02-08T23:20:52Z
1,Results from spectroscopic observations of t...,"['Nceba Mhlahlo', 'David H. Buckley', 'Vikram ...",astro-ph,,2007-03-31,2009-06-23,10.1111/j.1365-2966.2007.11762.x,0704.0017,oai:arXiv.org:0704.0017,"Mon.Not.Roy.Astron.Soc.378:211-220,2007",physics:astro-ph,Spectroscopic Observations of the Intermediate...,,2019-02-08T23:20:52Z
2,"The very nature of the solar chromosphere, i...","['M. A. Loukitcheva', 'S. K. Solanki', 'S. Whi...",astro-ph,,2007-03-31,2009-06-23,10.1007/s10509-007-9626-1,0704.0023,oai:arXiv.org:0704.0023,"Astrophys.Space Sci.313:197-200,2008",physics:astro-ph,ALMA as the ideal probe of the solar chromosphere,,2019-02-08T23:20:52Z
3,We present a theoretical framework for plasm...,"['A. A. Schekochihin', 'S. C. Cowley', 'W. Dor...",astro-ph nlin.CD physics.plasm-ph physics.spac...,,2007-03-31,2015-05-13,10.1088/0067-0049/182/1/310,0704.0044,oai:arXiv.org:0704.0044,"ApJS 182, 310 (2009)",physics:astro-ph,Astrophysical gyrokinetics: kinetic and fluid ...,2009-05-09,2019-02-08T23:20:52Z
4,We report on the analysis of selected single...,"['Alexander Stroeer', 'John Veitch', 'Christia...",gr-qc astro-ph,,2007-03-31,2008-11-26,10.1088/0264-9381/24/19/S17,0704.0048,oai:arXiv.org:0704.0048,"Class.Quant.Grav.24:S541-S550,2007",physics:astro-ph,Inference on white dwarf binary systems using ...,2007-04-03,2019-02-08T23:20:52Z
5,We derive masses and radii for both componen...,"['T. G. Beatty', 'J. M. Fernandez', 'D. W. Lat...",astro-ph,,2007-03-31,2009-06-23,10.1086/518413,0704.0059,oai:arXiv.org:0704.0059,"Astrophys.J.663:573-582,2007",physics:astro-ph,The Mass and Radius of the Unseen M-Dwarf Comp...,2007-04-09,2019-02-08T23:20:52Z
6,We show that the globular cluster mass funct...,"['Dean E. McLaughlin', 'S. Michael Fall']",astro-ph,,2007-04-01,2010-11-11,10.1086/533485,0704.0080,oai:arXiv.org:0704.0080,"Astrophys.J.679:1272-1287,2008",physics:astro-ph,Shaping the Globular Cluster Mass Function by ...,2008-06-11,2019-02-08T23:20:52Z
7,We present semi-analytical constraint on the...,['HongSheng Zhao'],astro-ph,,2007-04-02,2007-05-23,,0704.0094,oai:arXiv.org:0704.0094,,physics:astro-ph,Timing and Lensing of the Colliding Bullet Clu...,,2019-02-08T23:20:52Z
8,Context. Swift data are revolutionising our ...,"['P. A. Evans', 'A. P. Beardmore', 'K. L. Page...",astro-ph,,2007-04-02,2009-11-13,10.1051/0004-6361:20077530,0704.0128,oai:arXiv.org:0704.0128,,physics:astro-ph,An online repository of Swift/XRT light curves...,2007-04-19,2019-02-08T23:20:52Z
9,We report the first detection of the 6.2micr...,"['D. Lutz', 'E. Sturm', 'L. J. Tacconi', 'E. V...",astro-ph,,2007-04-02,2009-11-13,10.1086/518537,0704.0133,oai:arXiv.org:0704.0133,,physics:astro-ph,PAH emission and star formation in the host of...,,2019-02-08T23:20:52Z


Now we have all the filenames for all of the astro-ph articles that arXiv specifies in their metadata!
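
To keep this copy up to date later, OAI-PMH supports selective harvesting through a `from` argument, which restricts `ListRecords` to records whose datestamp is on or after a given date. The sketch below only builds such a request URL. The helper name `build_list_records_url` is hypothetical, and it assumes day-level dates (`YYYY-MM-DD`), matching the datestamps seen in the responses above; the date part of `date_retrieved` would be a natural value to pass.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OAI_BASE_URL = 'http://export.arxiv.org/oai2'


def build_list_records_url(set_spec, metadata_prefix='arXiv', from_date=None):
    """Build a ListRecords URL, optionally limited to records changed since `from_date`."""
    params = {
        'verb': 'ListRecords',
        'set': set_spec,
        'metadataPrefix': metadata_prefix,
    }
    if from_date is not None:
        # 'from' is the OAI-PMH selective-harvesting argument
        params['from'] = from_date
    # urlencode percent-encodes reserved characters such as ':' in the set name
    return OAI_BASE_URL + '?' + urlencode(params)


# Example: only records changed since the date of the harvest shown above
url = build_list_records_url('physics:astro-ph', from_date='2019-02-08')
print(url)
```

The same resumption-token loop would then page through the (much smaller) set of changed records.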

## Some analyses

Find out how many articles there are in each category:

In [300]:
unique, counts = np.unique(np.concatenate(opened_identifiers_df['categories'].str.split()), return_counts=True)
print(np.asarray((unique, counts)).T)

[['acc-phys' '1']
 ['adap-org' '14']
 ['alg-geom' '3']
 ['astro-ph' '105361']
 ['astro-ph.CO' '46302']
 ['astro-ph.EP' '15193']
 ['astro-ph.GA' '34545']
 ['astro-ph.HE' '32271']
 ['astro-ph.IM' '14857']
 ['astro-ph.SR' '37369']
 ['atom-ph' '8']
 ['bayes-an' '3']
 ['chao-dyn' '74']
 ['chem-ph' '2']
 ['comp-gas' '11']
 ['cond-mat' '193']
 ['cond-mat.dis-nn' '29']
 ['cond-mat.mes-hall' '62']
 ['cond-mat.mtrl-sci' '140']
 ['cond-mat.other' '121']
 ['cond-mat.quant-gas' '76']
 ['cond-mat.soft' '78']
 ['cond-mat.stat-mech' '582']
 ['cond-mat.str-el' '57']
 ['cond-mat.supr-con' '183']
 ['cs.AI' '27']
 ['cs.CC' '4']
 ['cs.CE' '45']
 ['cs.CG' '5']
 ['cs.CL' '2']
 ['cs.CR' '1']
 ['cs.CV' '97']
 ['cs.CY' '14']
 ['cs.DB' '37']
 ['cs.DC' '148']
 ['cs.DL' '114']
 ['cs.DM' '3']
 ['cs.DS' '10']
 ['cs.GR' '18']
 ['cs.HC' '19']
 ['cs.IR' '17']
 ['cs.IT' '63']
 ['cs.LG' '87']
 ['cs.MM' '5']
 ['cs.MS' '21']
 ['cs.NA' '18']
 ['cs.NE' '28']
 ['cs.NI' '10']
 ['cs.OH' '9']
 ['cs.OS' '1']
 ['cs.PF' '17']
 ['cs