A simple spider to crawl metadata of arxiv papers.
Python 3.5+
Run python3 arxiv_spider.py -h
to see help, or refer to the following table:
argument | description | default |
---|---|---|
-s yymm | start month (contain) | 0704 |
-e yymm | end month (contain) | 0704 |
-m int | maximum number of metadata to be archived | 1073741824 (consider to be a large number like +INF) |
-p int | processes used when downloading to accelerate | 1 |
-r filename | download metadata of a list of arxiv_id specified by a file | |
-d arxiv_id | download the specified arxiv_id |
python arxiv_spider.py -s 1001 -e 1010
download metadata of all papers submitted between Jan 2010 and Oct 2010python arxiv_spider.py -s 1001 -e 1001 -m 1000
download metadata of first 100000 papers (ordered by submit time) submitted in Jan 2010python arxiv_spider.py -d 0710.0010
download metadata of paper whose id is 0710.0010 (that is, the 10th paper in Oct 2007)python arxiv_spider.py -r arxiv_ids.log
download metadata of papers specified by filearxiv_ids.log
The format of arxiv_id can be found here.
The format of arxiv_ids.log
is a file that contains a list of arxiv_ids separated by '\n', like
2010.00005
1105.1432
...
0708.1234
Worth to mention that the program will log all arxiv_id that failed to download (due to network reason, for example) in arxiv_download_error.log
, so run python arxiv_spider.py -r arxiv_download_error.log
will retry all failed metadata downloading attempts.
The metadata will be stored in metadata_yymmddhhmmss.json
.
The output format is identical to the official arXiv Dataset.
The metadata normally contains authors, abstract, title, submit time, doi(if there is), etc. More information on what was recorded can be seen here.
The arxiv_id of all failure attempts will be saved in arxiv_download_error.log
The program uses arXiv Open Archives Initiative (OAI) interface to download data, which has a connection limit. One common reason that cause the failure is the user requests the OAI too often. The server will return 503 or reset the connection if one ip requests more than about 30000 times a day. Thus, though the program supports multiprocessing to accelerate the download procedure, I highly recommend you to download with only one thread.
The code in under MIT License, however, the metadata of arXiv may have other limitations. It is nice to check these sites before using the metadata and obey the arXiv TOU.