# Data acquisition

## Get preprints and categories from arXiv

This snippet retrieves the metadata for the 2000 most recently submitted papers
listed under `hep-lat` on arXiv (including cross-lists), and labels each one by
its primary category.

At the time of use (April 2022), 2000 results were enough to cover every such paper from 2021.

(The actual paper contents need to be downloaded separately. See the file
`arxiv_bulk_download.py`.)
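
The query URL in the cells below is written out by hand. As a minimal sketch, the same URL can be assembled from a dictionary of parameters with the standard library, which also makes it straightforward to request further pages through the arXiv API's `start` offset. The helper name `build_query_url` and its defaults are assumptions made for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlencode

ARXIV_API = 'http://export.arxiv.org/api/query'

def build_query_url(category, start=0, max_results=2000):
    """Return an arXiv API URL for the newest papers listed in `category`.

    Hypothetical helper: `start` is the offset of the first result, so
    calling with start=0, 2000, 4000, ... pages through the listing.
    """
    params = {
        'search_query': f'cat:{category}',
        'sortBy': 'submittedDate',
        'sortOrder': 'descending',
        'start': start,
        'max_results': max_results,
    }
    # Leave ':' unescaped so the result reads like the hand-written URL.
    query = urlencode(params, safe=':')
    return f'{ARXIV_API}?{query}'

print(build_query_url('hep-lat'))
```

If 2000 results ever stop covering a full year, a second call with `start=2000` would fetch the next batch without changing any other parameter.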

In [1]:
from lxml import etree

In [2]:
# Keep only papers whose arXiv ID starts with this two-digit year prefix
year = '21'

arxiv_queryurl = 'http://export.arxiv.org/api/query?search_query=cat:hep-lat&sortBy=submittedDate&sortOrder=descending&max_results=2000'
root = etree.parse(arxiv_queryurl).getroot()

In [3]:
paper_categories = {}

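# The feed's direct children include metadata elements as well as entries;
# only entries contain both an id and a primary_category.
for entry in root.iterchildren():
    arxiv_id, primary = None, None
    for element in entry.iterchildren():
        if element.tag.endswith('id'):
            # Strip the 'http://arxiv.org/abs/' prefix (21 characters) and
            # the version suffix, whether it is 'v1' or 'v12'.
            arxiv_id = element.text[21:].rsplit('v', 1)[0]
        elif element.tag.endswith('primary_category'):
            primary = element.get('term')
    # Record only papers from the requested year with a known category.
    if arxiv_id is not None and arxiv_id.startswith(year) and primary is not None:
        paper_categories[arxiv_id] = primary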
        
print('\n'.join(f'{arxiv_id}\t{category}' for arxiv_id, category in reversed(paper_categories.items())))

2101.00020	hep-ph
2101.00689	hep-lat
2101.00723	hep-lat
2101.01021	hep-ph
2101.01026	hep-th
2101.01028	gr-qc
2101.01150	hep-lat
2101.01498	hep-lat
2101.01808	hep-lat
2101.02124	hep-ph
2101.02224	hep-lat
2101.02227	hep-th
2101.02236	hep-th
2101.02254	hep-lat
2101.02676	gr-qc
2101.02694	hep-lat
2101.02715	hep-ph
2101.03161	hep-lat
2101.03383	hep-lat
2101.03605	hep-lat
2101.03858	hep-ph
2101.03901	nucl-th
2101.03938	hep-lat
2101.04132	hep-th
2101.04196	hep-lat
2101.04483	hep-ph
2101.04642	hep-lat
2101.04762	hep-lat
2101.04933	hep-ph
2101.04942	hep-lat
2101.04978	hep-lat
2101.05243	hep-ph
2101.05289	quant-ph
2101.05295	gr-qc
2101.05320	hep-lat
2101.05528	hep-ph
2101.05755	hep-lat
2101.05813	nucl-th
2101.06074	hep-lat
2101.06144	hep-lat
2101.06439	hep-lat
2101.06953	hep-lat
2101.07213	hep-ph
2101.07230	hep-lat
2101.07243	cond-mat.str-el
2101.07281	hep-ph
2101.07318	hep-th
2101.07850	hep-ph
2101.08103	hep-lat
2101.08176	hep-lat
2101.08240	hep-ph
2101.08241	hep-ph
2101.08340	hep-ph
2101.08341
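
The parsing cell above identifies elements by the end of their tag name. As a sketch of an alternative, the elements can be matched by XML namespace instead, which avoids relying on tag suffixes and string offsets. This version uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies; the function name `primary_categories` and the sample feed are made up for illustration and are not real arXiv data.

```python
import re
import xml.etree.ElementTree as ET

# Namespaces used in arXiv API Atom feeds.
NS = {
    'atom': 'http://www.w3.org/2005/Atom',
    'arxiv': 'http://arxiv.org/schemas/atom',
}

# Hand-written stand-in for an API response; the IDs are invented.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2101.99999v12</id>
    <arxiv:primary_category term="hep-lat"/>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/2012.99999v1</id>
    <arxiv:primary_category term="hep-ph"/>
  </entry>
</feed>"""

def primary_categories(feed_xml, year):
    """Map arXiv ID to primary category for IDs starting with `year`."""
    root = ET.fromstring(feed_xml)
    categories = {}
    for entry in root.findall('atom:entry', NS):
        url = entry.findtext('atom:id', namespaces=NS)
        # Keep the part after '/abs/' and drop a trailing version suffix.
        arxiv_id = re.sub(r'v\d+$', '', url.split('/abs/')[-1])
        primary = entry.find('arxiv:primary_category', NS)
        if arxiv_id.startswith(year) and primary is not None:
            categories[arxiv_id] = primary.get('term')
    return categories

print(primary_categories(SAMPLE_FEED, '21'))  # {'2101.99999': 'hep-lat'}
```

The same `findall`/`find` calls with a namespace map should carry over to `lxml.etree`, whose element API is closely modelled on ElementTree.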

## Get journal name for papers

This uses the [INSPIRE API](https://github.com/inspirehep/rest-api-doc) to look up the
journal in which each paper found above was published or, where no journal is recorded,
the conference whose proceedings it appears in.
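
The fallback from journal to conference can be sketched as a pure function that needs no network access. The record layout here is inferred only from the fields that the cells below read (`publication_info`, `journal_title`, `conference_record`, `$ref`); the function name `publication_label`, the sample records, and the `example.org` URL are assumptions made for illustration, not real INSPIRE data.

```python
def publication_label(record, conference_titles):
    """Return the journal title, a proceedings label, or '' if neither exists.

    `conference_titles` maps a conference record URL to its title.
    """
    # Fall back to one empty entry if the field is missing or empty.
    info = record.get('metadata', {}).get('publication_info') or [{}]
    first = info[0]  # only the first publication record is considered
    journal = first.get('journal_title', '')
    if journal:
        return journal
    if 'conference_record' in first:
        ref = first['conference_record']['$ref']
        return f"Proceedings of {conference_titles[ref]}"
    return ''

# Made-up records containing only the fields used above.
journal_record = {'metadata': {'publication_info': [{'journal_title': 'Phys.Rev.D'}]}}
conference_record = {'metadata': {'publication_info': [
    {'conference_record': {'$ref': 'https://example.org/conferences/1'}}
]}}
unpublished_record = {'metadata': {}}
titles = {'https://example.org/conferences/1': 'Example Lattice Workshop'}

for record in (journal_record, conference_record, unpublished_record):
    print(repr(publication_label(record, titles)))
```

Keeping the decision logic separate from the HTTP requests makes it possible to check the three cases (journal, proceedings, unpublished) without contacting the server.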

In [7]:
import json
import requests

In [40]:
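# Cache of conference titles, keyed by the URL of the INSPIRE conference record
conferences = {}

def get_conference_name(url):
    """Fetch a conference record and return its first listed title."""
    conference_data = json.loads(requests.get(url).text)
    return conference_data['metadata']['titles'][0]['title']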

In [61]:
publication_names = {}
multiple_publication_papers = {}

for arxiv_id in paper_categories:
    # Look the paper up on INSPIRE by its arXiv ID
    data = json.loads(requests.get(f'https://inspirehep.net/api/arxiv/{arxiv_id}').text)
    publication_info = data['metadata'].get('publication_info', [{}])
    if len(publication_info) > 1:
        multiple_publication_papers[arxiv_id] = publication_info
        print(f"Warning: {arxiv_id} has {len(publication_info)} publication records")
    publication = publication_info[0].get('journal_title', '')
    
    if (not publication) and 'conference_record' in publication_info[0]:
        conference_url = publication_info[0]['conference_record']['$ref']
        if conference_url not in conferences:
            conferences[conference_url] = get_conference_name(conference_url)
        publication = f"Proceedings of {conferences[conference_url]}"

    publication_names[arxiv_id] = publication
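
The loop above requests each conference record at most once by storing titles in the `conferences` dictionary. As a sketch of an alternative, `functools.lru_cache` provides the same memoization without an explicit dictionary. The `fetch_title` function here is a hypothetical stand-in for the HTTP request, used so that the example runs offline.

```python
from functools import lru_cache

fetch_log = []  # records every lookup that was not served from the cache

def fetch_title(url):
    """Hypothetical stand-in for the request to a conference record."""
    fetch_log.append(url)
    return f'Title for {url}'

@lru_cache(maxsize=None)
def conference_name(url):
    # The body runs only on the first call for each distinct URL.
    return fetch_title(url)

for url in ['https://example.org/c/1',
            'https://example.org/c/2',
            'https://example.org/c/1']:
    conference_name(url)

print(len(fetch_log))  # 2: the repeated URL was served from the cache
```

The explicit dictionary has the advantage that its contents can be inspected or saved between notebook sessions, so which form is preferable depends on how the cache is used afterwards.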



In [64]:
print('\n'.join(f'{arxiv_id}\t{name}' for arxiv_id, name in reversed(publication_names.items())))

2101.00020	Acta Phys.Polon.B
2101.00689	Phys.Rev.D
2101.00723	Phys.Rev.D
2101.01021	Progr.Phys.
2101.01026	Phys.Rev.D
2101.01028	
2101.01150	Phys.Rev.D
2101.01498	Phys.Rev.D
2101.01808	Int.J.Mod.Phys.A
2101.02124	Phys.Rev.D
2101.02224	Phys.Rev.D
2101.02227	JHEP
2101.02236	
2101.02254	Proceedings of Criticality in QCD and the Hadron Resonance Gas
2101.02676	
2101.02694	JHEP
2101.02715	JHEP
2101.03161	Indian J.Phys.
2101.03383	Phys.Lett.B
2101.03605	Phys.Rev.D
2101.03858	Phys.Rev.D
2101.03901	
2101.03938	Phys.Rev.D
2101.04132	JHEP
2101.04196	Phys.Rev.D
2101.04483	Sci.China Phys.Mech.Astron.
2101.04642	PoS
2101.04762	
2101.04933	Phys.Rev.D
2101.04942	Phys.Rev.D
2101.04978	Phys.Rev.D
2101.05243	Nucl.Phys.B
2101.05289	Phys.Rev.Res.
2101.05295	
2101.05320	
2101.05528	JHEP
2101.05755	Phys.Rev.D
2101.05813	Phys.Rev.D
2101.06074	
2101.06144	Phys.Rev.D
2101.06439	Symmetry
2101.06953	Phys.Rev.D
2101.07213	Phys.Rev.D
2101.07230	Phys.Rev.D
2101.07243	
2101.07281	Phys.Rev.D
2101.07318	
2101.07850	Ph

In [67]:
# Print a list of unique journal names for ease of labelling
print('\n'.join(sorted(set(publication_names.values()))))


AAPPS Bull.
Acta Phys.Polon.B
Acta Phys.Polon.Supp.
Annals Phys.
Chin.J.Phys.
Chin.Phys.C
Chin.Phys.Lett.
Class.Quant.Grav.
Commun.Theor.Phys.
EPJ Web Conf.
Eur.Phys.J.A
Eur.Phys.J.C
Eur.Phys.J.Plus
Eur.Phys.J.ST
Few Body Syst.
Front.Phys.(Beijing)
Indian J.Phys.
Int.J.Mod.Phys.A
Int.J.Mod.Phys.E
J.Phys.A
J.Phys.Conf.Ser.
J.Phys.G
JCAP
JHEP
Mod.Phys.Lett.A
Natl.Sci.Rev.
Nature Commun.
New J.Phys.
Nucl.Part.Phys.Proc.
Nucl.Phys.A
Nucl.Phys.B
PRX Quantum
PTEP
Particles
Phil.Trans.A.Math.Phys.Eng.Sci.
Phil.Trans.Roy.Soc.Lond.A
Phys.Lett.B
Phys.Rept.
Phys.Rev.A
Phys.Rev.B
Phys.Rev.C
Phys.Rev.D
Phys.Rev.E
Phys.Rev.Lett.
Phys.Rev.Res.
PoS
Proceedings of 15th International Symposium on Radiative Corrections: Applications of Quantum Field Theory to Phenomenology AND LoopFest XIX: Workshop on Radiative Corrections for the LHC and Future Colliders
Proceedings of 19th International Conference on Hadron Spectroscopy and Structure
Proceedings of 24th International Symposium on Spin Physics
Proceed