# Goal

Download article metadata from arXiv. arXiv provides an API with a good user guide: https://info.arxiv.org/help/api/index.html
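
For reference, a query against that API is an HTTP GET with a handful of URL parameters. The sketch below only builds such a URL and never sends it. It assumes the `export.arxiv.org/api/query` endpoint and the `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters as described in the API guide; `build_query_url` is a hypothetical helper and is not part of `arxivscraper`, which the rest of this notebook uses.

```python
from urllib.parse import urlencode

# Assumed endpoint, taken from the arXiv API guide.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category, date_from, date_until, start=0, max_results=100):
    """Build (but do not send) a query URL for one category and date window.

    date_from / date_until are 'YYYY-MM-DD' strings; the API expects
    timestamps in the form YYYYMMDDHHMM (GMT), so we pad with a time of day.
    """
    lo = date_from.replace("-", "") + "0000"
    hi = date_until.replace("-", "") + "2359"
    search = f"cat:{category} AND submittedDate:[{lo} TO {hi}]"
    params = {
        "search_query": search,
        "start": start,              # offset of the first result (for paging)
        "max_results": max_results,  # page size
        "sortBy": "submittedDate",
        "sortOrder": "ascending",
    }
    return ARXIV_API + "?" + urlencode(params)


url = build_query_url("stat.ML", "2024-01-10", "2024-01-11")
print(url)
```

The response to such a request is an Atom XML feed, so any real use would still need a parsing step and a pause between requests.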

A ready-made dataset of ML articles exists, but it has neither author nor date fields, which limits its use for our purposes: https://huggingface.co/datasets/CShorten/ML-ArXiv-Papers

We'll choose a few subcategories from the full category list (https://github.com/Mahdisadjadi/arxivscraper/blob/main/categories_v2.md), as well as a few date ranges.


In [6]:
#pip install arxivscraper

In [4]:
#pip install unidecode

In [1]:
import arxivscraper
import pandas as pd
import time
import pickle
from unidecode import unidecode
from datetime import date, timedelta
import glob
from collections import Counter

In [2]:
import os

In [4]:
os.getcwd()

'/Users/amisheth/sparseBMDS/code/EN_clust'

In [3]:
#pip install -U numpy

In [5]:
#pip uninstall sentence-transformers --yes

# checking that this works...

In [10]:
# Trying out the following: https://github.com/Mahdisadjadi/arxivscraper
# Small test: two days of 'stat' records, filtered on the 'ML' category.
scraper = arxivscraper.Scraper(category = 'stat', date_from = '2024-01-10', date_until = '2024-01-11',
                               filters = {'categories' : ['ML']})
# for _ in range(5):  # Example: Scrape in small batches
#     output = scraper.scrape()
#     time.sleep(10)
output = scraper.scrape()
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
df = pd.DataFrame(output, columns = cols)
df = df.drop_duplicates(subset = ['id'])
# transliterate author names to plain ASCII
df['authors'] = [ [unidecode(x) for x in y] for y in df['authors'] ]
df.shape

fetching up to  1000 records...
fetching is completed in 3.0 seconds.
Total number of records 28


(28, 8)

In [73]:
df

Unnamed: 0,id,title,categories,abstract,doi,created,updated,authors
0,2305.11913,machine learning for phase-resolved reconstruc...,physics.ao-ph cs.ai cs.lg,accurate short-term predictions of phase-resol...,10.1016/j.oceaneng.2023.116059,2023-05-18,2023-10-18,"[svenja ehlers, marco klein, alexander heinlei..."
1,2310.07425,statistical properties of speckle patterns for...,physics.optics cond-mat.stat-mech physics.ao-ph,the statistical properties of speckle patterns...,10.1103/physreva.109.013501,2023-10-11,2024-01-10,"[fernando l. metz, cristian bonatto, sandra d...."
2,2401.03736,"lessons learned: reproducibility, replicabilit...",cs.lg physics.ao-ph,while extensive guidance exists for ensuring t...,,2024-01-08,2024-01-09,"[milton s. gomez, tom beucler]"
3,2401.04125,deepphysinet: bridging deep learning and atmos...,physics.ao-ph cs.ai cs.lg,accurate weather forecasting holds significant...,,2024-01-04,,"[wenyuan li, zili liu, keyan chen, hao chen, s..."
4,2401.04431,sea wave data reconstruction using micro-seism...,physics.ins-det cs.lg physics.ao-ph,sea wave monitoring is key in many application...,10.3389/fmars.2022.798167,2024-01-09,,"[lorenzo iafolla, emiliano fiorenza, massimo c..."


# Let's go

In [5]:
## set parameters here
#first_month = (2017, 12) # inclusive
first_month = (2018, 1) # inclusive
last_month = (2024, 3)   # exclusive
#last_month = (2021, 5)   # exclusive

cat = 'q-bio'
subcat = 'BM'
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')

## build date range tuples, month by month
y, m = first_month
date_tuples = []
while (y, m) < last_month: # also terminates if last_month <= first_month
    start_dt = date(y, m, 1)
    m += 1
    if m > 12:
        m = 1
        y += 1
    date_tuples.append((start_dt, date(y, m, 1)))

## fetch and save per month
for start_dt, end_dt in date_tuples:
    dt = start_dt
    df = pd.DataFrame()
    while dt < end_dt: # day by day until next month
        scraper = arxivscraper.Scraper(category = cat, date_from = str(dt), date_until = str(dt),
                                       filters = {'categories':[subcat]})
        output = scraper.scrape()
        try:
            df = pd.concat([df, pd.DataFrame(output, columns = cols)])
            print(dt, df.shape)
        except Exception: # output could not be turned into a DataFrame; Ctrl-C still propagates
            print('skipping', str(dt))
        time.sleep(10) # respect arXiv rules
        dt = dt + timedelta(days = 1)
    #df = df.drop_duplicates(subset=['id'])
    # one file per scraped month, e.g. arXivScrape_BM_2018_01.pkl
    name = "arXivScrape_" + subcat + "_" + start_dt.strftime("%Y_%m") + ".pkl"
    df.to_pickle(name)

fetching up to  1000 records...
fetching is completed in 3.2 seconds.
Total number of records 0
2018-01-01 (0, 8)
fetching up to  1000 records...
skipping 2018-01-02
fetching up to  1000 records...
fetching is completed in 2.0 seconds.
Total number of records 2
2018-01-03 (2, 8)
fetching up to  1000 records...
fetching is completed in 2.0 seconds.
Total number of records 1
2018-01-04 (3, 8)
fetching up to  1000 records...
skipping 2018-01-05
fetching up to  1000 records...
skipping 2018-01-06
fetching up to  1000 records...
skipping 2018-01-07
fetching up to  1000 records...
fetching is completed in 2.2 seconds.
Total number of records 1
2018-01-08 (4, 8)
fetching up to  1000 records...
fetching is completed in 2.4 seconds.
Total number of records 2
2018-01-09 (6, 8)
fetching up to  1000 records...
fetching is completed in 1.9 seconds.
Total number of records 0
2018-01-10 (6, 8)
fetching up to  1000 records...
fetching is completed in 2.0 seconds.
Total number of records 1
2018-01-11 (

KeyboardInterrupt: 
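
Because the run sleeps 10 seconds per day scraped and can be interrupted, as happened here, it may help to skip months that already have a pickle on disk when restarting. The following is a minimal sketch using only the standard library. `month_ranges`, `pickle_name`, and `pending_months` are hypothetical helpers, and the file-name pattern is assumed to match the one used in the scraping cell.

```python
import os
import tempfile
from datetime import date


def month_ranges(first_month, last_month):
    """Return (start, end) date pairs, one per month, for first_month <= month < last_month."""
    y, m = first_month
    ranges = []
    while (y, m) < tuple(last_month):
        start = date(y, m, 1)
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
        ranges.append((start, date(y, m, 1)))
    return ranges


def pickle_name(subcat, start):
    """Assumed naming pattern: arXivScrape_<subcat>_<YYYY>_<MM>.pkl"""
    return f"arXivScrape_{subcat}_{start.year}_{start.month:02d}.pkl"


def pending_months(ranges, subcat, out_dir="."):
    """Keep only the months whose pickle file does not exist yet in out_dir."""
    return [r for r in ranges
            if not os.path.exists(os.path.join(out_dir, pickle_name(subcat, r[0])))]


# Demo in a throwaway directory: pretend January 2018 was already scraped.
with tempfile.TemporaryDirectory() as tmp:
    ranges = month_ranges((2018, 1), (2018, 4))
    open(os.path.join(tmp, pickle_name("BM", ranges[0][0])), "wb").close()
    todo = pending_months(ranges, "BM", tmp)
    print([str(start) for start, _ in todo])
```

In the fetch loop one would then iterate over `pending_months(date_tuples, subcat)` instead of `date_tuples`. A month whose file was only partially written would still need to be deleted by hand before restarting.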

In [5]:
df

Unnamed: 0,id,title,categories,abstract,doi,created,updated,authors
0,2302.09407,fractional marcus-hush-chidsey-yakopcic curren...,cond-mat.mtrl-sci physics.app-ph,we propose a circuit-level model combining the...,,2023-02-18,2024-01-31,"[georgii paradezhenko, dmitrii prodan, anastas..."
1,2304.03417,gallium arsenide optical phased array photonic...,physics.optics physics.app-ph,a 16-channel optical phased array is fabricate...,10.1364/oe.492556,2023-04-06,,"[michael nickerson, bowen song, jim brookhyser..."
2,2305.10918,predictions and measurements of thermal conduc...,cond-mat.mtrl-sci cond-mat.mes-hall physics.ap...,the lattice thermal conductivity ($\kappa$) of...,10.1103/physrevb.108.184306,2023-05-18,,"[zherui han, zixin xiong, william t. riffe, hu..."
3,2305.13587,single-particle vibrational spectroscopy using...,physics.optics physics.app-ph physics.bio-ph,vibrational spectroscopy is a ubiquitous techn...,10.1038/s41566-023-01264-3,2023-05-22,,"[shui-jing tang, mingjie zhang, jialve sun, ji..."
4,2310.16039,modeling of fluctuations in dynamical optoelec...,quant-ph physics.app-ph physics.optics,we present a full-wave maxwell-density matrix ...,,2023-10-24,2024-01-31,"[johannes popp, johannes stowasser, michael a...."
...,...,...,...,...,...,...,...,...
1,2402.17039,incorporating climate change effects into the ...,physics.soc-ph cs.cy physics.app-ph,the demand-supply balance of electricity syste...,10.1016/j.segan.2020.100403,2024-02-26,2024-02-28,"[inès harang, fabian heymann, laurens p. stoop]"
2,2402.18234,extreme ultraviolet lithography reaches 5 nm r...,physics.optics physics.app-ph,extreme ultraviolet (euv) lithography is the l...,,2024-02-28,,"[iason giannopoulos, iacopo mochi, michaela vo..."
3,2402.18366,estimation of railway vehicle response for tra...,physics.app-ph,"in railway transportation, the evaluation of t...",,2024-02-28,,"[qingjing wang, wenhao ding, qing he, ping wang]"
4,2402.18421,high-speed cmos compatible plasmonic modulator...,physics.optics physics.app-ph,wafer-level testing is an important step for p...,,2024-02-28,,"[maryam sadat amiri naeini, pierre berini]"
