# Process snapshot

* Download a snapshot of arXiv [arXiv.org submitters, 2024] manually and put it in the folder `data`.
* Load the snapshot into a dataframe
* Add columns for date and general categories
* Save it in CSV format

## References
arXiv.org submitters. (2024). arXiv Dataset [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/7548853

In [1]:
import pandas as pd
import json
import os
import gzip

## Load the snapshot into a dataframe

In [2]:
DATA_PATH = '../data'

In [3]:
%%time

# path to snapshot of data
snapshot_path = os.path.join(DATA_PATH, 'arxiv-metadata-oai-snapshot.json')

frames = []
max_bytes = -1  # -1 reads the whole file; a positive value, e.g. 1024 * 1024 * 10, stops after roughly that many bytes (useful for a quick test run)
# one json per line
with open(snapshot_path) as json_file:    
    print("Reading file")
    lines = json_file.readlines(max_bytes)
    line_count = len(lines)
    counter = 0
    for line in lines:
        # load semi-structured JSON data into data frame (https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html)
        data = json.loads(line)
        normalized_data = pd.json_normalize(data)
        frames.append(normalized_data)
        # print progress info
        counter += 1
        if counter % 100000 == 0: print(f"Processed {counter} / {line_count} lines")
# put result into data frame
arxiv_df = pd.concat(frames, ignore_index=True)  

Reading file
Processed 100000 / 2412624 lines
Processed 200000 / 2412624 lines
Processed 300000 / 2412624 lines
Processed 400000 / 2412624 lines
Processed 500000 / 2412624 lines
Processed 600000 / 2412624 lines
Processed 700000 / 2412624 lines
Processed 800000 / 2412624 lines
Processed 900000 / 2412624 lines
Processed 1000000 / 2412624 lines
Processed 1100000 / 2412624 lines
Processed 1200000 / 2412624 lines
Processed 1300000 / 2412624 lines
Processed 1400000 / 2412624 lines
Processed 1500000 / 2412624 lines
Processed 1600000 / 2412624 lines
Processed 1700000 / 2412624 lines
Processed 1800000 / 2412624 lines
Processed 1900000 / 2412624 lines
Processed 2000000 / 2412624 lines
Processed 2100000 / 2412624 lines
Processed 2200000 / 2412624 lines
Processed 2300000 / 2412624 lines
Processed 2400000 / 2412624 lines
CPU times: user 7min 42s, sys: 6.82 s, total: 7min 49s
Wall time: 7min 51s
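
The load step can also be written so that the file is consumed one line at a time rather than through `readlines`. The following is a minimal sketch under stated assumptions, not the method used above: `iter_records` is a hypothetical helper, and the two records are made up to stand in for the snapshot file.

```python
import io
import json


def iter_records(lines):
    """Yield one dict per non-empty JSON line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for the snapshot file; not real arXiv entries.
sample = io.StringIO(
    '{"id": "0001", "categories": "cs.LG stat.ML"}\n'
    '{"id": "0002", "categories": "math.GM"}\n'
)

# A file object is iterable line by line, so the raw text is never held in memory all at once.
records = list(iter_records(sample))
print(len(records), records[0]["id"])
```

With the real file, `open(snapshot_path)` would take the place of `sample`. Passing the collected list to `pd.json_normalize` once, instead of once per line, would likely avoid building millions of single-row frames, though this has not been timed here.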


## Add date columns

Extract the date of the first version from the 'versions' column, then add 'year' and 'month' columns.

In [4]:
created = [version[0]['created'] for version in arxiv_df['versions']]
arxiv_df['created'] = pd.DatetimeIndex(created)
# the .dt accessor extracts the date parts column-wise
arxiv_df['year'] = arxiv_df['created'].dt.year
arxiv_df['month'] = arxiv_df['created'].dt.month
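
To illustrate the extraction on a single entry, here is a small standard-library sketch. It assumes that each element of 'versions' is a dict with a 'created' string in RFC 2822 style; the sample values are made up for illustration.

```python
from email.utils import parsedate_to_datetime

# Made-up 'versions' value in the assumed snapshot layout.
versions = [
    {"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"},
    {"version": "v2", "created": "Tue, 24 Jul 2007 20:10:27 GMT"},
]

# The first list element is the first submitted version.
first_created = parsedate_to_datetime(versions[0]["created"])
year, month = first_created.year, first_created.month
print(year, month)
```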

## Add general category columns

Add a column with a less specific category, e.g. arXiv category "physics.optics" -> general category "Physics".

See: https://arxiv.org/category_taxonomy

Note that "math.GM" and "physics.gen-ph" are ragbag categories.

In [5]:
gen_categories = []  # the categories for all entries
for categories in arxiv_df['categories']:
    categories = categories.split()
    entry_categories = []  # the general category or categories for this publication
    for category in categories:
        # Match on prefixes: a substring test such as "cs." in category would also match "physics.*"
        if category.startswith(("cs.", "cmp-lg")): entry_categories.append("Computer Science")
        elif category.startswith("econ."): entry_categories.append("Economics")
        elif category.startswith("eess."): entry_categories.append("Electrical Engineering and Systems Science")
        elif category.startswith(("math.", "alg-geom", "dg-ga", "funct-an", "q-alg")):
            if category == "math.GM": entry_categories.append("General")  # General Mathematics is a bin for papers that are obviously wrong
            else: entry_categories.append("Mathematics")
        elif category == "physics.gen-ph": entry_categories.append("General")  # General Physics is a bin for papers that are obviously wrong
        elif category.startswith(("astro-ph", "cond-mat", "gr-qc", "hep-", "math-ph", "nlin.", "nucl-",
                                  "physics.", "quant-ph", "acc-phys", "adap-org", "ao-sci", "atom-ph",
                                  "bayes-an", "chao-dyn", "chem-ph", "comp-gas", "mtrl-th", "patt-sol",
                                  "plasm-ph", "solv-int",
                                  "supr-con")):  # supr-con (superconductivity) is a legacy physics archive
            entry_categories.append("Physics")
        elif category.startswith("q-bio"): entry_categories.append("Quantitative Biology")
        elif category.startswith("q-fin"): entry_categories.append("Quantitative Finance")
        elif category.startswith("stat."): entry_categories.append("Statistics")
        else: entry_categories.append(category)
    entry_categories = list(set(entry_categories))
    gen_categories.append(entry_categories)
gen_categories = pd.Series(gen_categories)
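
A short sketch of why the way a category is matched matters for this mapping: a substring test for "cs." also matches every "physics.*" category, because "physics." ends in "cs.". The prefix tuple below is a hypothetical subset chosen for illustration, not the full mapping.

```python
category = "physics.optics"

# Substring test: "physics." ends in "cs.", so this is True.
substring_hit = "cs." in category

# Prefix test: only categories that start with "cs." match.
prefix_hit = category.startswith("cs.")

# str.startswith also accepts a tuple of prefixes (hypothetical subset of the mapping).
PHYSICS_PREFIXES = ("astro-ph", "cond-mat", "hep-", "physics.", "quant-ph")
is_physics = category.startswith(PHYSICS_PREFIXES)

print(substring_hit, prefix_hit, is_physics)
```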

One-hot encode the new categories and add them to the dataframe.

In [6]:
one_hot = gen_categories.str.join('|').str.get_dummies()
arxiv_df = arxiv_df.join(one_hot)
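
The encoding step can be seen on a toy example. This sketch assumes pandas is installed and uses made-up category lists; it relies on `'|'` being the default separator of `Series.str.get_dummies`.

```python
import pandas as pd

# Made-up general categories for three publications.
toy_categories = pd.Series([
    ["Physics", "Mathematics"],
    ["Computer Science"],
    ["Physics"],
])

# Join each list into one '|'-separated string, then split it into indicator columns.
toy_one_hot = toy_categories.str.join('|').str.get_dummies()
print(toy_one_hot)
```

Each distinct label becomes one 0/1 column, so a publication with several general categories has a 1 in several columns.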

## Save as CSV

In [7]:
out_path = os.path.join(DATA_PATH, 'arxiv_metadata.csv')
arxiv_df.to_csv(out_path)
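
When the CSV is read back later, the index column and the date column need some care, because `to_csv` writes the index as an unnamed first column and stores dates as text. The following sketch round-trips a made-up two-row frame through an in-memory buffer; the frame and the buffer are assumptions for illustration, and pandas is assumed to be installed.

```python
import io

import pandas as pd

# Made-up frame with the same kind of columns as the saved data.
toy_df = pd.DataFrame({
    "created": pd.to_datetime(["2007-04-02 19:18:42", "2008-01-15 08:00:00"]),
    "year": [2007, 2008],
    "Physics": [1, 0],
})

buffer = io.StringIO()  # stands in for the file on disk
toy_df.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index; parse_dates turns 'created' back into datetimes.
restored = pd.read_csv(buffer, index_col=0, parse_dates=["created"])
print(restored.dtypes)
```

Columns that hold Python lists, such as 'versions', are written as their string representation and would come back as plain strings.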