# Accessing *ArXiv*

## Elena Fernández Fernández

Let's use the ArXiv API **again**, this time to run a new query: https://pypi.org/project/arxiv/

# ArXiv API

First, let's import the libraries.

In [1]:
import arxiv
import time
import os
import urllib.parse  # used further down when building the PDF filenames
import urllib.request
from urllib.request import urlopen

Let's create the API client.

In [2]:
# Construct the default API client.
client = arxiv.Client()

And let's search for the 50 most recent ArXiv articles that match the keyword "facebook".

In [3]:
# Search for the 50 most recent articles matching the keyword "facebook".
# To search for an exact phrase of two or more words, wrap it in double
# quotation marks inside the string. For example: '"climate change"'
search = arxiv.Search(
    query='facebook',
    max_results=50,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)
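
The quoting rule for multi-word terms can be wrapped in a small helper. This is only a sketch: `build_query` is a hypothetical name and not part of the `arxiv` package, and the field prefixes (`all:`, `abs:`) and the `AND` operator are assumed from the ArXiv API query syntax.

```python
def build_query(terms, field="all"):
    """Build an ArXiv query string from a list of search terms.

    Multi-word terms are wrapped in double quotation marks so that they
    are searched as a phrase; all terms are combined with AND.
    """
    parts = []
    for term in terms:
        term = term.strip()
        if " " in term:
            term = f'"{term}"'  # phrase search needs double quotes
        parts.append(f"{field}:{term}")
    return " AND ".join(parts)


print(build_query(["facebook"]))
print(build_query(["climate change", "facebook"], field="abs"))
```

The returned string can be passed as the `query` argument of `arxiv.Search`.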

In [4]:
results = client.results(search)

In [5]:
print(results)

<itertools.islice object at 0x000001486FE02390>
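
The printed `itertools.islice` object is a lazy iterator: nothing is fetched until we loop over it, and once it has been consumed it is empty. The sketch below shows that behaviour with a stand-in generator (`fake_results` is hypothetical and does not touch the API), which is why the cells in this notebook call `client.results(search)` again each time they loop.

```python
from itertools import islice


def fake_results():
    """Stand-in for an API result stream (hypothetical, no network access)."""
    for title in ["Paper A", "Paper B", "Paper C"]:
        yield title


results = islice(fake_results(), 2)  # lazily take at most 2 items

first_pass = list(results)   # consumes the iterator
second_pass = list(results)  # nothing is left to read

print(first_pass)
print(second_pass)
```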


OK, now let's have a look at the titles.

In [6]:
for r in client.results(search):
    print(r.title)

Automated Demand Forecasting in small to medium-sized enterprises
Integrating Zero-Shot Classification to Advance Long COVID Literature: A Systematic Social Media-Centered Review
Pirates of Charity: Exploring Donation-based Abuses in Social Media Platforms
ScamChatBot: An End-to-End Analysis of Fake Account Recovery on Social Media via Chatbots
Exploration of the Dynamics of Buy and Sale of Social Media Accounts
ConvMesh: Reimagining Mesh Quality Through Convex Optimization
Use of diverse data sources to control which topics emerge in a science map
Depression detection from Social Media Bangla Text Using Recurrent Neural Networks
CTRAPS: CTAP Client Impersonation and API Confusion on FIDO2
Characterizing the Fragmentation of the Social Media Ecosystem
A Graph Neural Architecture Search Approach for Identifying Bots in Social Media
Detecting Visual Triggers in Cannabis Imagery: A CLIP-Based Multi-Labeling Framework with Local-Global Aggregation
Optimal Transcoding Preset Selection for L

And now, if we compare these titles with the search results on the actual ArXiv website, they look correct: https://arxiv.org/search/cs?query=facebook&searchtype=all&abstracts=show&order=-announced_date_first&size=50
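
Titles are only one of the fields that each result carries. The sketch below formats a one-line summary from the attributes `title`, `published` and `entry_id`. It assumes an `arxiv.Result` exposes these attributes, and it uses a stand-in object with made-up values so that it runs without a network connection; `describe` is a hypothetical helper.

```python
from datetime import datetime
from types import SimpleNamespace


def describe(result):
    """Format one result as 'YYYY-MM-DD | short id | title'."""
    short_id = result.entry_id.split("/")[-1]
    return f"{result.published:%Y-%m-%d} | {short_id} | {result.title}"


# Stand-in object with made-up values, not a real paper
example = SimpleNamespace(
    title="An Example Title",
    published=datetime(2024, 12, 29),
    entry_id="http://arxiv.org/abs/0000.00000v1",
)
print(describe(example))
```

In the notebook, the same function could be applied inside the loop, for example `print(describe(r))` instead of `print(r.title)`.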

Now let's start by saving one of the PDFs from that list (the second one, with ID 2412.18779) to our laptop.

In [7]:
paper = next(arxiv.Client().results(arxiv.Search(id_list = ["2412.18779"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()

'./2412.18779v1.Integrating_Zero_Shot_Classification_to_Advance_Long_COVID_Literature__A_Systematic_Social_Media_Centered_Review.pdf'

In [8]:
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename = "2412.18779.pdf")

'./2412.18779.pdf'

In [9]:
pwd

'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 2\\PDF extraction'

And now we need to create a folder in our directory. Remember that we can also run shell commands such as mkdir directly here in Jupyter Notebooks.

In [10]:
mkdir files_arxiv

A subdirectory or file files_arxiv already exists.
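
The message above appears because the folder was already there. As an alternative to the shell command, the folder can be created from Python in a way that stays silent when it already exists. This is a minimal sketch: `ensure_folder` is a hypothetical helper, and it is demonstrated inside a temporary directory so that nothing is left behind.

```python
import os
import tempfile


def ensure_folder(path):
    """Create the folder if needed and return its absolute path."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return os.path.abspath(path)


# Demonstrated inside a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "files_arxiv")
    first = ensure_folder(target)
    second = ensure_folder(target)  # calling it again is harmless
    created = os.path.isdir(target)

print(created)
```

In the notebook, `ensure_folder("files_arxiv")` would return an absolute path that can be passed as `dirpath` to `download_pdf`, instead of typing the full Windows path by hand.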


In [11]:
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath = "C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 2\\PDF extraction\\files_arxiv",
                    filename = "2412.18779.pdf")

'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 2\\PDF extraction\\files_arxiv\\2412.18779.pdf'

So far I have only used the example code provided in the documentation of the arxiv package (https://pypi.org/project/arxiv/). Now let's go one step further and download a whole batch of articles. According to the documentation, we need the IDs of the papers to do that. Let's extract them!

In [12]:
ids = []

for result in client.results(search):
    ids.append(result.entry_id)

In [13]:
ids

['http://arxiv.org/abs/2412.20420v1',
 'http://arxiv.org/abs/2412.18779v1',
 'http://arxiv.org/abs/2412.15621v1',
 'http://arxiv.org/abs/2412.15072v1',
 'http://arxiv.org/abs/2412.14985v1',
 'http://arxiv.org/abs/2412.08484v1',
 'http://arxiv.org/abs/2412.07550v1',
 'http://arxiv.org/abs/2412.05861v1',
 'http://arxiv.org/abs/2412.02349v1',
 'http://arxiv.org/abs/2411.16826v1',
 'http://arxiv.org/abs/2411.16285v1',
 'http://arxiv.org/abs/2412.08648v1',
 'http://arxiv.org/abs/2411.14613v1',
 'http://arxiv.org/abs/2412.04484v1',
 'http://arxiv.org/abs/2411.12508v1',
 'http://arxiv.org/abs/2411.11426v1',
 'http://arxiv.org/abs/2411.09214v1',
 'http://arxiv.org/abs/2411.06122v1',
 'http://arxiv.org/abs/2411.04752v1',
 'http://arxiv.org/abs/2411.04542v1',
 'http://arxiv.org/abs/2411.05043v1',
 'http://arxiv.org/abs/2410.22716v1',
 'http://arxiv.org/abs/2410.20293v2',
 'http://arxiv.org/abs/2410.17496v1',
 'http://arxiv.org/abs/2410.16977v1',
 'http://arxiv.org/abs/2410.14617v1',
 'http://arx

In [14]:
type(ids)

list

In [15]:
ids[0]

'http://arxiv.org/abs/2412.20420v1'

What we need is just the ID number of the paper. Let's select that.

In [16]:
ids[0].split("/")[-1]

'2412.20420v1'

In [17]:
ids_2 = []

for i in ids:
    ids_2.append(i.split("/")[-1])
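
The same extraction can be written as a reusable function that optionally drops the version suffix (v1, v2, ...), for example to spot the same paper across versions. This is a sketch under assumptions: `short_id` is a hypothetical helper, and it only handles new-style IDs such as the ones in this notebook.

```python
import re

VERSION_SUFFIX = re.compile(r"v\d+$")


def short_id(entry_url, keep_version=True):
    """Return the last path segment of an ArXiv entry URL."""
    paper_id = entry_url.rstrip("/").split("/")[-1]
    if not keep_version:
        paper_id = VERSION_SUFFIX.sub("", paper_id)  # '2410.20293v2' -> '2410.20293'
    return paper_id


urls = [
    "http://arxiv.org/abs/2412.20420v1",
    "http://arxiv.org/abs/2410.20293v2",
]
with_version = [short_id(u) for u in urls]
without_version = [short_id(u, keep_version=False) for u in urls]

print(with_version)
print(without_version)
```

Older ArXiv IDs contain a slash (a subject prefix followed by a number), so taking only the last path segment would cut them short; those would need separate handling.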

In [18]:
ids_2

['2412.20420v1',
 '2412.18779v1',
 '2412.15621v1',
 '2412.15072v1',
 '2412.14985v1',
 '2412.08484v1',
 '2412.07550v1',
 '2412.05861v1',
 '2412.02349v1',
 '2411.16826v1',
 '2411.16285v1',
 '2412.08648v1',
 '2411.14613v1',
 '2412.04484v1',
 '2411.12508v1',
 '2411.11426v1',
 '2411.09214v1',
 '2411.06122v1',
 '2411.04752v1',
 '2411.04542v1',
 '2411.05043v1',
 '2410.22716v1',
 '2410.20293v2',
 '2410.17496v1',
 '2410.16977v1',
 '2410.14617v1',
 '2411.05788v1',
 '2410.06443v1',
 '2410.05401v1',
 '2410.01708v1',
 '2409.18931v1',
 '2409.18393v1',
 '2409.15652v3',
 '2409.13461v1',
 '2409.08405v1',
 '2409.02358v1',
 '2409.01470v1',
 '2408.12753v1',
 '2408.12743v2',
 '2408.12449v2',
 '2408.09725v1',
 '2408.09683v1',
 '2408.09435v1',
 '2408.08964v3',
 '2408.08437v1',
 '2408.08126v1',
 '2408.07322v1',
 '2407.18471v1',
 '2407.16014v1',
 '2407.13549v1']

In [19]:
len(ids_2)

50

And now let's loop over that list to download all the PDFs to our laptop. First, let's create a new folder called "arxiv_pdfs_facebook".

In [23]:
mkdir arxiv_pdfs_facebook

A subdirectory or file arxiv_pdfs_facebook already exists.


# IMPORTANT

REMEMBER to create a new folder for each new query that you run to build your own PDF database. Remember to change the folder name in the filename variable too.

In [24]:
#mkdir arxiv_pdfs_whatever    (remove the # and replace "whatever" with your own query to name the new folder)

And now let's get all the PDFs.

In [25]:
for paper_id in ids_2:
    # Search for the article with the given ID
    search = arxiv.Search(id_list=[paper_id])
    paper = next(client.results(search))

    # Download the PDF into the "arxiv_pdfs_facebook" folder, named after its ID
    paper.download_pdf(dirpath="arxiv_pdfs_facebook", filename=f"{paper_id}.pdf")

    time.sleep(3)  # wait 3 seconds between requests (this is the indication of the ArXiv API)
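
The loop above sends one search per paper. A possible variation, sketched below, asks for several IDs in a single search and skips files that are already on disk, so the cell can be re-run after an interruption. `chunked` and `download_batch` are hypothetical helpers; the sketch assumes that the installed version of the `arxiv` package provides `Result.get_short_id()` and `Result.download_pdf()` as used here.

```python
import os
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_batch(ids, dirpath, batch_size=25, pause=3.0):
    """Download the PDFs for `ids` into `dirpath`, skipping existing files."""
    import arxiv  # third-party; imported here so `chunked` works without it

    client = arxiv.Client()
    os.makedirs(dirpath, exist_ok=True)
    for batch in chunked(ids, batch_size):
        # One API query per batch instead of one query per paper
        search = arxiv.Search(id_list=batch, max_results=len(batch))
        for paper in client.results(search):
            filename = f"{paper.get_short_id()}.pdf"
            if os.path.exists(os.path.join(dirpath, filename)):
                continue  # already downloaded in an earlier run
            paper.download_pdf(dirpath=dirpath, filename=filename)
            time.sleep(pause)  # stay polite between downloads


batches = list(chunked(["a", "b", "c", "d", "e"], 2))
print(batches)
```

With the list built earlier in this notebook, the call would be `download_batch(ids_2, "arxiv_pdfs_facebook")`.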

And voilà! According to the documentation of the arxiv package (https://pypi.org/project/arxiv/1.4.8/), the API returns at most 300,000 results per query: that is a lot!
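
To get a feeling for what such a number means in practice, the sketch below estimates the minimum waiting time when pausing 3 seconds per paper, as the download loop does. `estimated_minutes` is a hypothetical helper, and the estimate ignores the time of the downloads themselves.

```python
def estimated_minutes(n_papers, pause_seconds=3.0):
    """Lower bound on the waiting time in minutes, ignoring download time."""
    return n_papers * pause_seconds / 60


small = estimated_minutes(50)       # the 50 papers in this notebook
large = estimated_minutes(300_000)  # the 300,000 figure quoted above

print(small)
print(large)
```

Fifty papers need at least 2.5 minutes of pauses, while 300,000 papers would need 15,000 minutes, which is more than ten days.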