# Accessing *ArXiv*

## Elena Fernández Fernández

Let's try to web scrape the ArXiv website (disclaimer: I emailed them beforehand to ask whether this was OK and legal, and they said yes!).

ArXiv is one of the most popular Computer Science article repositories out there and a great resource for Text Data Mining Research! Let's try to get the latest 50 articles about Twitter: https://arxiv.org/search/cs?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50

If you right-click on the page and view its source, you will see that the source code of the website is very similar to that of La Gaceta de Madrid. So let's try to reuse that script here!

The first thing you need to do is import the necessary libraries for web scraping:

In [1]:
import urllib.parse  # used later for urllib.parse.quote
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import shutil  # this one is for saving the PDFs to our computer
import os

Then let's open the URL from our laptop

In [2]:
html = urlopen("https://arxiv.org/search/cs?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50")
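
As a side note, the long search URL does not have to be typed by hand. The sketch below rebuilds it from its parameters with urllib.parse.urlencode from the standard library; the parameter names and values are copied from the URL above, and only the variable names (params, search_url) are my own. Changing "query" or "size" then gives a new search without touching the rest of the address.

```python
from urllib.parse import urlencode

# Search parameters copied from the URL used in this notebook.
params = {
    "query": "twitter",
    "searchtype": "all",
    "abstracts": "show",
    "order": "-announced_date_first",
    "size": 50,
}

# urlencode joins the pairs with "&" and escapes any special characters.
search_url = "https://arxiv.org/search/cs?" + urlencode(params)
print(search_url)
```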

And now let's access the text

In [3]:
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <!-- new favicon config and versions by realfavicongenerator.net -->
  <link href="https://static.arxiv.org/static/base/1.0.0a5/images/icons/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
  <link href="https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
  <link href="https://static.arxiv.org/static/base/1.0.0a5/images/icons/site.webmanifest" rel="manifest"/>
  <link color="#b31b1b" href="https://static.arxiv.org/static/base/1.0.0a5/images/icons/safari-pinned-tab.svg" rel="mask-icon"/>
  <link href="https://static.arxiv.org/static/base/1.0.0a5/images/icons/favicon.ico" rel="shortcut icon"/>
  <meta content="#b31b1b" name="msapplicat

The links to the PDFs, which is what we are looking for, are contained in "a" tags (the HTML tag for links), so we need to filter our search to get all the "a" elements.

In [4]:
soup.find_all("a")

[<a class="is-sr-only" href="#main-container">Skip to main content</a>,
 <a class="level-item" href="https://cornell.edu/"><img alt="Cornell University" aria-label="logo" src="https://static.arxiv.org/static/base/1.0.0a5/images/cornell-reduced-white-SMALL.svg" width="200"/></a>,
 <a href="https://info.arxiv.org/about/ourmembers.html">member institutions</a>,
 <a href="https://info.arxiv.org/about/donate.html">Donate</a>,
 <a aria-label="arxiv-logo" class="arxiv" href="https://arxiv.org/">
 <img alt="arxiv logo" aria-label="logo" src="https://static.arxiv.org/static/base/1.0.0a5/images/arxiv-logo-one-color-white.svg" style="width:85px;" width="85"/>
 </a>,
 <a href="https://info.arxiv.org/help">Help</a>,
 <a href="https://arxiv.org/search/advanced">Advanced Search</a>,
 <a href="https://arxiv.org/login">Login</a>,
 <a href="https://github.com/arXiv/arxiv-search/releases">Search v0.5.6 released 2020-02-24</a>,
 <a href="/search/?query=twitter&amp;searchtype=all&amp;abstracts=show&amp;ord

Once we have all the "a" elements, we need to refine our search even more, as the PDF links that we are looking for are stored in the "href" attribute of each "a" tag. We will store them in a list (pdfs).

In [5]:
pdfs = []
for link in soup.find_all('a'):
    pdfs.append(link.get("href"))
pdfs

['#main-container',
 'https://cornell.edu/',
 'https://info.arxiv.org/about/ourmembers.html',
 'https://info.arxiv.org/about/donate.html',
 'https://arxiv.org/',
 'https://info.arxiv.org/help',
 'https://arxiv.org/search/advanced',
 'https://arxiv.org/login',
 'https://github.com/arXiv/arxiv-search/releases',
 '/search/?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50',
 '/search/advanced?terms-0-term=twitter&terms-0-field=all&size=50&order=-announced_date_first',
 '',
 '/search/?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=50',
 '/search/?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=0',
 '/search/?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=50',
 '/search/?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50&start=100',
 '/search/?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size
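
To see what "collect the href of every a tag" means without any network access, here is a small dependency-free sketch. It uses html.parser.HTMLParser from the standard library as a stand-in for BeautifulSoup, and the three-line sample_html is invented for illustration (it is not the ArXiv page). The class name LinkCollector is hypothetical. Note that a tag without an "href" attribute yields None, just as link.get("href") would.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, or None when the tag has no href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; .get returns None if href is absent.
            self.hrefs.append(dict(attrs).get("href"))


# Invented sample, shaped like a fragment of a search results page.
sample_html = """
<a href="https://arxiv.org/pdf/2412.20581">pdf</a>
<a href="/search/advanced">Advanced Search</a>
<a class="button">no link target here</a>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.hrefs)
```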

Now we need to narrow our search even more. What we need are all the strings in which the PDFs are stored. Let's try to get them!

In [6]:
type(pdfs)

list

In [7]:
for i in pdfs:
    if "pdf" in i:
        print(i)

https://arxiv.org/pdf/2412.20581


TypeError: argument of type 'NoneType' is not iterable

Something is not working. What is it? Let's ask Perplexity AI: https://www.perplexity.ai/search/Im-trying-to-U7GFFcNcTb6df18PJ_RpZg

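So: some of the "a" tags on the page have no "href" attribute, so link.get("href") returns None for them, and the test "pdf" in i fails as soon as the loop reaches one of those None values. We could skip those entries, but scraping the search pages remains fragile. Good news: we can use the ArXiv API instead, through the Python library arxiv: https://pypi.org/project/arxiv/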
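
As a side note, the TypeError above can be reproduced and avoided offline. This sketch uses a made-up list shaped like pdfs (the variable names hrefs and pdf_links are mine) and skips entries that are None or empty before running the substring test. Matching on "/pdf/" rather than on "pdf" is an extra assumption on top of the original loop, added to avoid accidental matches elsewhere in a URL.

```python
# A made-up sample shaped like the `pdfs` list above, including a None entry.
hrefs = [
    "#main-container",
    "https://arxiv.org/pdf/2412.20581",
    None,
    "",
    "https://arxiv.org/abs/2412.20581",
]

# Keep only real strings that point to a PDF; `href and ...` skips None and "".
pdf_links = [href for href in hrefs if href and "/pdf/" in href]
print(pdf_links)
```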

# ArXiv API

First let's install the Python library arxiv

In [8]:
pip install arxiv

Note: you may need to restart the kernel to use updated packages.





And now let's import it

In [9]:
import arxiv

Let's also import the time library, which we will use later to pause between requests

In [10]:
import time

Let's build the API client

In [11]:
# Construct the default API client.
client = arxiv.Client()

And let's extract the 50 most recent Twitter ArXiv articles

In [12]:
# Search for the 50 most recent articles matching the keyword "twitter".
search = arxiv.Search(
  query = "twitter",
  max_results = 50,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

In [13]:
results = client.results(search)

In [14]:
print(results)

<itertools.islice object at 0x0000028A0CFBDF30>
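
The printed value shows that results is an iterator (an itertools.islice object in the version used here) rather than a list. The sketch below imitates that behaviour offline with a hypothetical generator, fake_results, standing in for the API call. It illustrates two things: printing the iterator shows only the object, and the items can be consumed only once, which is why it is convenient to call client.results(search) again each time we want to loop.

```python
from itertools import islice


def fake_results():
    """Hypothetical stand-in for an API call: yields titles one at a time."""
    for n in range(1, 1000):
        yield f"paper {n}"


results = islice(fake_results(), 3)  # nothing has been produced yet
print(results)                       # prints the islice object, not the titles

first_pass = list(results)           # consuming the iterator yields the items
second_pass = list(results)          # the iterator is now exhausted
print(first_pass)
print(second_pass)
```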


Ok, printing results only shows an iterator object: the articles are fetched when we loop over it. So now let's first have a look at the titles.

In [15]:
for r in client.results(search):
    print(r.title)

"The Prophet said so!": On Exploring Hadith Presence on Arabic Social Media
Machine Learning for Sentiment Analysis of Imported Food in Trinidad and Tobago
Integrating Zero-Shot Classification to Advance Long COVID Literature: A Systematic Social Media-Centered Review
Dynamics of Collective Information Processing for Risk Encoding in Social Networks during Crises
Evaluating the Performance of Large Language Models in Scientific Claim Detection and Classification
A Herd of Young Mastodonts: the User-Centered Footprints of Newcomers After Twitter Acquisition
Safe Spaces or Toxic Places? Content Moderation and Social Dynamics of Online Eating Disorder Communities
Climate Policy Elites' Twitter Interactions across Nine Countries
BotSim: LLM-Powered Malicious Social Botnet Simulation
Analyzing Toxicity in Open Source Software Communications Using Psycholinguistics and Moral Foundations Theory
MOPO: Multi-Objective Prompt Optimization for Affective Text Generation
SentiQNF: A Novel Approach 

And now, if we compare that with the actual ArXiv website, it looks correct: https://arxiv.org/search/cs?query=twitter&searchtype=all&abstracts=show&order=-announced_date_first&size=50

Now let's begin by saving a single PDF to our laptop, using the paper with ID 2406.12444 as an example.

In [16]:
paper = next(arxiv.Client().results(arxiv.Search(id_list=["2406.12444"])))
# Download the PDF to the PWD with a default filename.
paper.download_pdf()

'./2406.12444v1.Who_Checks_the_Checkers__Exploring_Source_Credibility_in_Twitter_s_Community_Notes.pdf'

In [17]:
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="2406.12444v1.pdf")

'./2406.12444v1.pdf'

In [18]:
pwd

'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 1\\APIs 1. Arxiv'

And now we need to create a folder in our directory. Remember that we can also do that using shell commands here in Jupyter Notebooks.

In [19]:
mkdir files_arxiv

A subdirectory or file files_arxiv already exists.


# IMPORTANT

REMEMBER to create a new folder for your new query when you do the exercise

In [32]:
# mkdir (remove the #) followed by the name you would like for the new folder (I suggest arxiv_pdfs_whatever, where whatever is your query)

And remember to change the name of the folder in the dirpath when you do the exercise. 

In [20]:
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath = 'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 1\\APIs 1. Arxiv\\files_arxiv',
                    filename = "2406.12444v1.pdf")

'C:\\Users\\usuario\\ELENA\\it-training uzh\\it-training uzh\\Python for Digital Humanities\\Day 1\\APIs 1. Arxiv\\files_arxiv\\2406.12444v1.pdf'
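
As an aside, the absolute Windows path above only exists on one machine. A more portable sketch, assuming the folder should live inside the current working directory, builds the path with os.path.join and creates the folder with os.makedirs(..., exist_ok=True), which stays silent when the folder already exists. The helper name ensure_folder is hypothetical; the demonstration runs in a temporary folder so that it leaves nothing behind.

```python
import os
import tempfile


def ensure_folder(name, base=None):
    """Create `name` inside `base` (default: current working directory) and return its path."""
    base = os.getcwd() if base is None else base
    path = os.path.join(base, name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


# Offline demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_folder("files_arxiv", base=tmp)
    second = ensure_folder("files_arxiv", base=tmp)  # calling it again is harmless
    same_path = first == second
    folder_exists = os.path.isdir(first)

print(same_path, folder_exists)

# In the notebook, the returned path could then be passed on, for example:
# dirpath = ensure_folder("files_arxiv")
# paper.download_pdf(dirpath=dirpath, filename="2406.12444v1.pdf")
```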

So far I have just been using the code provided in the documentation of the arxiv library (https://pypi.org/project/arxiv/). Now let's go to the next level and extract a whole batch of articles. Looking at the documentation, it looks like we need to extract the IDs of the papers. Let's do that!

In [21]:
ids = []

for result in client.results(search):
    ids.append(result.entry_id)

In [22]:
ids

['http://arxiv.org/abs/2412.20581v1',
 'http://arxiv.org/abs/2412.19781v1',
 'http://arxiv.org/abs/2412.18779v1',
 'http://arxiv.org/abs/2412.17342v1',
 'http://arxiv.org/abs/2412.16486v1',
 'http://arxiv.org/abs/2412.16383v1',
 'http://arxiv.org/abs/2412.15721v1',
 'http://arxiv.org/abs/2412.15545v1',
 'http://arxiv.org/abs/2412.13420v1',
 'http://arxiv.org/abs/2412.13133v2',
 'http://arxiv.org/abs/2412.12948v1',
 'http://arxiv.org/abs/2412.12731v1',
 'http://arxiv.org/abs/2412.11647v1',
 'http://arxiv.org/abs/2412.15257v1',
 'http://arxiv.org/abs/2412.09578v1',
 'http://arxiv.org/abs/2412.07550v1',
 'http://arxiv.org/abs/2412.06951v1',
 'http://arxiv.org/abs/2412.06179v1',
 'http://arxiv.org/abs/2412.10413v1',
 'http://arxiv.org/abs/2412.05624v1',
 'http://arxiv.org/abs/2412.05176v1',
 'http://arxiv.org/abs/2412.03217v1',
 'http://arxiv.org/abs/2412.02637v1',
 'http://arxiv.org/abs/2412.02148v1',
 'http://arxiv.org/abs/2412.01249v1',
 'http://arxiv.org/abs/2411.19766v1',
 'http://arx

In [23]:
type(ids)

list

In [24]:
ids[0]

'http://arxiv.org/abs/2412.20581v1'

What we need is just the ID number of the paper. Let's select that.

In [25]:
ids[0].split("/")[-1]

'2412.20581v1'
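
The same split can be wrapped in a small function, which makes it easier to reuse and to test. The sketch below assumes new-style identifiers such as 2412.20581v1; older ArXiv identifiers contain a slash (for example hep-th/9901001), so taking only the last path segment would not be enough for them. The helper names short_id and without_version are hypothetical, and removing the version suffix is an optional extra that the notebook itself does not need.

```python
import re


def short_id(entry_id):
    """Return the last path segment of an entry URL, e.g. '2412.20581v1'."""
    return entry_id.rstrip("/").split("/")[-1]


def without_version(paper_id):
    """Drop a trailing version suffix such as 'v1' or 'v12', if present."""
    return re.sub(r"v\d+$", "", paper_id)


entry = "http://arxiv.org/abs/2412.20581v1"
print(short_id(entry))
print(without_version(short_id(entry)))
```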

In [26]:
ids_2 = []

for i in ids:
    ids_2.append(i.split("/")[-1])

In [27]:
ids_2

['2412.20581v1',
 '2412.19781v1',
 '2412.18779v1',
 '2412.17342v1',
 '2412.16486v1',
 '2412.16383v1',
 '2412.15721v1',
 '2412.15545v1',
 '2412.13420v1',
 '2412.13133v2',
 '2412.12948v1',
 '2412.12731v1',
 '2412.11647v1',
 '2412.15257v1',
 '2412.09578v1',
 '2412.07550v1',
 '2412.06951v1',
 '2412.06179v1',
 '2412.10413v1',
 '2412.05624v1',
 '2412.05176v1',
 '2412.03217v1',
 '2412.02637v1',
 '2412.02148v1',
 '2412.01249v1',
 '2411.19766v1',
 '2411.19733v1',
 '2412.02712v1',
 '2411.18817v1',
 '2411.16826v1',
 '2411.16813v2',
 '2411.16031v1',
 '2411.16754v1',
 '2411.15586v1',
 '2411.15462v1',
 '2411.14986v2',
 '2411.14652v1',
 '2411.14230v1',
 '2411.13681v1',
 '2411.11500v1',
 '2411.09214v1',
 '2411.07917v1',
 '2411.06477v1',
 '2411.06295v1',
 '2411.05577v1',
 '2411.05448v2',
 '2411.04862v1',
 '2411.02666v1',
 '2411.02557v1',
 '2411.01852v2']

In [28]:
len(ids_2)

50

And now let's loop over that list to get all the PDFs onto our laptop. First let's create a new folder called "arxiv_pdfs"

In [29]:
mkdir arxiv_pdfs

A subdirectory or file arxiv_pdfs already exists.


# IMPORTANT

REMEMBER to create a new folder for your new query when you do the exercise, and to change the name of the folder in filename in the loop further down.

In [33]:
# mkdir (remove the #) followed by the name you would like for the new folder (I suggest arxiv_pdfs_whatever, where whatever is your query)

And now let's get all the pdfs

In [30]:
for paper_id in ids_2:
    # Search for the article with the given ID
    id_search = arxiv.Search(id_list=[paper_id])
    paper = next(client.results(id_search))

    # Download the PDF into the arxiv_pdfs folder, named after the paper ID
    filename = os.path.join("arxiv_pdfs", urllib.parse.quote(paper_id))
    paper.download_pdf(filename=f"{filename}.pdf")

    time.sleep(3)  # wait 3 seconds between requests, as the ArXiv API asks

And voilà! According to the documentation of the arxiv library (https://pypi.org/project/arxiv/1.4.8/), the API can return up to 300,000 results for a search: that is a lot!
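
If the download loop is interrupted, running it again fetches every paper from scratch. One possible refinement, sketched here under the assumption that each file is named after its ID plus ".pdf" as in the loop above, is to work out first which PDFs are still missing. The helper names pdf_path and pending_ids are hypothetical, and the demonstration runs offline in a temporary folder with two IDs taken from the list above.

```python
import os
import tempfile


def pdf_path(dirpath, paper_id):
    """Path where the PDF for paper_id is expected inside dirpath."""
    return os.path.join(dirpath, f"{paper_id}.pdf")


def pending_ids(paper_ids, dirpath):
    """Return the IDs whose PDF file does not exist yet in dirpath."""
    return [pid for pid in paper_ids if not os.path.exists(pdf_path(dirpath, pid))]


# Offline demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    open(pdf_path(tmp, "2412.20581v1"), "wb").close()  # pretend this one was downloaded
    todo = pending_ids(["2412.20581v1", "2412.19781v1"], tmp)

print(todo)

# In the notebook, the loop could then visit only the missing papers:
# for paper_id in pending_ids(ids_2, "arxiv_pdfs"): ...
```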

# Exercise

Now use this same notebook and do a new search using a different key term. 
* If you would like to search for a two-word term (Mark Zuckerberg, Climate Change, Donald Trump...), use this syntax: '"climate change"' (quotation marks inside quotation marks).
* Remember to **change the names of the folders** when you create new ones for the individual PDF and for the list of PDFs, so that the two queries do not get mixed up in the same folder.
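
The two reminders above can be sketched in a few lines. The helper names phrase_query and folder_name are hypothetical, and the folder naming follows the arxiv_pdfs_whatever suggestion made earlier in this notebook. The first function wraps the phrase in double quotes so that the string itself contains the quotation marks, and the second derives a separate folder name for each query.

```python
def phrase_query(phrase):
    """Wrap a multi-word phrase in double quotes, giving e.g. '"climate change"'."""
    return f'"{phrase}"'


def folder_name(phrase, prefix="arxiv_pdfs"):
    """Build a folder name such as arxiv_pdfs_climate_change for a given query."""
    return f"{prefix}_{phrase.lower().replace(' ', '_')}"


query = phrase_query("climate change")
folder = folder_name("climate change")
print(query)
print(folder)

# In the notebook, these would be used as, for example:
# search = arxiv.Search(query=query, max_results=50, sort_by=arxiv.SortCriterion.SubmittedDate)
```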