# Load metadata

arXiv.org submitters. (2024). arXiv Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7548853

* Download the data manually and put it in the `data` folder.
* Drop abstracts and comments (to save memory).
* Load all the data into a dataframe.
* Add more general categories (e.g. Physics) and one-hot encode them.
* Save the result as a zipped CSV file.

In [1]:
%%time
import json

import pandas as pd

frames = []
# Size hint for readlines(): -1 reads the whole file, 1024 * 1024 * 10 reads about 10 MB
max_bytes = -1
# The snapshot holds one JSON object per line
with open('data/arxiv-metadata-oai-snapshot.json') as json_file:
    print("Reading file")
    lines = json_file.readlines(max_bytes)
    line_count = len(lines)
    for counter, line in enumerate(lines, start=1):
        data = json.loads(line)
        # Drop abstracts and comments to save memory
        frames.append(pd.json_normalize(data).drop(columns=['abstract', 'comments']))
        if counter % 100000 == 0:
            print(f"Processed {counter} / {line_count} lines")
arxiv_df = pd.concat(frames, ignore_index=True)

Reading file
Processed 100000 / 2412624 lines
Processed 200000 / 2412624 lines
Processed 300000 / 2412624 lines
Processed 400000 / 2412624 lines
Processed 500000 / 2412624 lines
Processed 600000 / 2412624 lines
Processed 700000 / 2412624 lines
Processed 800000 / 2412624 lines
Processed 900000 / 2412624 lines
Processed 1000000 / 2412624 lines
Processed 1100000 / 2412624 lines
Processed 1200000 / 2412624 lines
Processed 1300000 / 2412624 lines
Processed 1400000 / 2412624 lines
Processed 1500000 / 2412624 lines
Processed 1600000 / 2412624 lines
Processed 1700000 / 2412624 lines
Processed 1800000 / 2412624 lines
Processed 1900000 / 2412624 lines
Processed 2000000 / 2412624 lines
Processed 2100000 / 2412624 lines
Processed 2200000 / 2412624 lines
Processed 2300000 / 2412624 lines
Processed 2400000 / 2412624 lines
CPU times: user 16min 33s, sys: 8.59 s, total: 16min 42s
Wall time: 16min 42s
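
Much of the run time above is probably spent creating one small dataframe per line. A minimal sketch of an alternative collects plain dicts and builds the dataframe once. It assumes the records contain no nested dictionaries that need flattening, so that `pd.DataFrame` and `pd.json_normalize` yield the same columns. The helper `load_metadata` and its parameters are hypothetical names, not part of the notebook.

```python
import json

import pandas as pd


def load_metadata(path, drop_fields=("abstract", "comments"), max_lines=None):
    """Read a JSON-lines file into one dataframe, skipping the fields in drop_fields."""
    records = []
    with open(path) as json_file:
        # Iterating over the file object streams it line by line
        for line_number, line in enumerate(json_file):
            if max_lines is not None and line_number >= max_lines:
                break
            record = json.loads(line)
            for field in drop_fields:
                record.pop(field, None)  # tolerate records that lack the field
            records.append(record)
    # Build the dataframe once instead of concatenating millions of frames
    return pd.DataFrame(records)
```

It could then be called as `arxiv_df = load_metadata('data/arxiv-metadata-oai-snapshot.json')`, with `max_lines` set to a small number for quick experiments.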


## Add a created date
Extract the date of the first version from the `versions` column and add `created`, `year`, and `month` columns.

In [2]:
# The first entry in each paper's version list is v1; its timestamp is the creation date
created = [versions[0]['created'] for versions in arxiv_df['versions']]
arxiv_df['created'] = pd.DatetimeIndex(created)
arxiv_df['year'] = arxiv_df['created'].dt.year
arxiv_df['month'] = arxiv_df['created'].dt.month
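
To see the date extraction in isolation, here is a small sketch on toy data. It parses the timestamps with the standard library's `email.utils.parsedate_to_datetime` instead of relying on pandas' format inference. This is a swapped-in technique, not what the cell above does, and the timestamps after the weekday and date are invented for illustration.

```python
from email.utils import parsedate_to_datetime

import pandas as pd

# Toy stand-in for arxiv_df['versions']: one list of version dicts per paper
versions_column = pd.Series([
    [{"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
     {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"}],
    [{"version": "v1", "created": "Sat, 31 Mar 2007 02:26:18 GMT"}],
])

# Parse the first version's timestamp, then hand the timezone-aware
# datetimes to pandas
first_created = [parsedate_to_datetime(versions[0]["created"]) for versions in versions_column]
created = pd.Series(pd.to_datetime(first_created, utc=True))

years = created.dt.year.tolist()
months = created.dt.month.tolist()
print(years, months)
```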

## Category
Assign each paper one or more less specific categories, e.g. "hep-ph" -> "Physics".

See: https://arxiv.org/category_taxonomy

Note that "math.GM" and "physics.gen-ph" are junk categories.
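
Category identifiers have the form `archive.SUBJECT` or, for some older archives, just `archive`. The sketch below shows why comparing the archive part exactly may be safer than testing for substrings. `archive_of` is a hypothetical helper, not part of the notebook.

```python
def archive_of(category):
    """Return the archive part of an arXiv category, e.g. 'cs' for 'cs.CG'."""
    return category.split(".")[0]


samples = ["cs.CG", "physics.gen-ph", "hep-ph"]

# "physics.gen-ph" ends its archive name in "cs", so it contains "cs." as well
substring_hits = [c for c in samples if "cs." in c]
prefix_hits = [c for c in samples if archive_of(c) == "cs"]

print(substring_hits)
print(prefix_hits)
```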

In [3]:
# Map each archive (the part of a category before the first ".") to a general category.
# Matching the archive exactly avoids substring traps: "physics.gen-ph" contains "cs.".
ARCHIVE_TO_GENERAL = {
    **dict.fromkeys(["cs", "cmp-lg"], "Computer Science"),
    "econ": "Economics",
    "eess": "Electrical Engineering and Systems Science",
    **dict.fromkeys(["math", "alg-geom", "dg-ga", "funct-an", "q-alg"], "Mathematics"),
    **dict.fromkeys(
        ["astro-ph", "cond-mat", "gr-qc", "hep-ex", "hep-lat", "hep-ph", "hep-th", "math-ph",
         "nlin", "nucl-ex", "nucl-th", "physics", "quant-ph", "acc-phys", "adap-org", "ao-sci",
         "atom-ph", "bayes-an", "chao-dyn", "chem-ph", "comp-gas", "mtrl-th", "patt-sol",
         "plasm-ph", "solv-int", "supr-con"],  # supr-con (superconductivity) is a physics archive
        "Physics"),
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
}
# General Mathematics and General Physics are bins for papers that are obviously wrong
JUNK_CATEGORIES = {"math.GM", "physics.gen-ph"}

def general_category(category):
    if category in JUNK_CATEGORIES:
        return "General"
    # Unknown archives fall through unchanged
    return ARCHIVE_TO_GENERAL.get(category.split(".")[0], category)

# One de-duplicated list of general categories per publication
gen_categories = pd.Series(
    [sorted({general_category(c) for c in categories.split()}) for categories in arxiv_df['categories']]
)

One-hot encode the general categories.

In [4]:
one_hot = gen_categories.str.join('|').str.get_dummies()
arxiv_df = arxiv_df.join(one_hot)

In [5]:
arxiv_df.head()

| | id | submitter | authors | title | journal-ref | doi | report-no | categories | license | versions | ... | month | Computer Science | Economics | Electrical Engineering and Systems Science | General | Mathematics | Physics | Quantitative Biology | Quantitative Finance | Statistics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0704.0001 | Pavel Nadolsky | C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-... | Calculation of prompt diphoton production cros... | Phys.Rev.D76:013009,2007 | 10.1103/PhysRevD.76.013009 | ANL-HEP-PR-07-12 | hep-ph | | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | ... | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0704.0002 | Louis Theran | Ileana Streinu and Louis Theran | Sparsity-certifying Graph Decompositions | | | | math.CO cs.CG | http://arxiv.org/licenses/nonexclusive-distrib... | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | ... | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0704.0003 | Hongjun Pan | Hongjun Pan | The evolution of the Earth-Moon system based o... | | | | physics.gen-ph | | [{'version': 'v1', 'created': 'Sun, 1 Apr 2007... | ... | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0704.0004 | David Callan | David Callan | A determinant of Stirling cycle numbers counts... | | | | math.CO | | [{'version': 'v1', 'created': 'Sat, 31 Mar 200... | ... | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0704.0005 | Alberto Torchinsky | Wael Abu-Shammala and Alberto Torchinsky | From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a... | Illinois J. Math. 52 (2008) no.2, 681-689 | | | math.CA math.FA | | [{'version': 'v1', 'created': 'Mon, 2 Apr 2007... | ... | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
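
The one-hot encoding in cell `In [4]` can be tried on a toy series. This sketch assumes, as in the notebook, that each entry is a list of general category names. The names `toy_categories` and `toy_one_hot` are illustrative only.

```python
import pandas as pd

# Toy stand-in for gen_categories: one list of general categories per paper
toy_categories = pd.Series([
    ["Physics"],
    ["Computer Science", "Mathematics"],
    ["Mathematics"],
])

# Join each list into a "|"-separated string; get_dummies splits on "|" by default
# and creates one 0/1 column per distinct category
toy_one_hot = toy_categories.str.join("|").str.get_dummies()
print(toy_one_hot)
```

A paper listed under several general categories gets a 1 in each corresponding column, so the columns can be summed to count papers per field.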


## Save as compressed CSV

In [6]:
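import zipfile as zf

# ZipFile stores members uncompressed by default, so request DEFLATE explicitly
with zf.ZipFile('data/arxiv_metadata.csv.zip', 'w', compression=zf.ZIP_DEFLATED) as ziparchive:
    ziparchive.writestr('arxiv_metadata.csv', arxiv_df.to_csv())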
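
As an alternative sketch, pandas can write the zip archive itself and read it back. The example uses a toy dataframe and a temporary directory, and the file names are placeholders. When reading, it may be worth forcing `id` to a string, since identifiers such as `0704.0001` are otherwise parsed as floats and lose their leading zero.

```python
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({"id": ["0704.0001", "0704.0002"], "month": [4, 3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_metadata.csv.zip")
    # Write a zip archive containing a single CSV member
    toy_df.to_csv(path, compression={"method": "zip", "archive_name": "toy_metadata.csv"})
    # index_col=0 restores the index; dtype keeps the leading zero in the ids
    restored = pd.read_csv(path, index_col=0, dtype={"id": str})

print(restored)
```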