# Pulling papers from arXiv

(C) 2025 by [Damir Cavar](https://damir.cavar.me/)

**Version:** 1.0, September 2025

**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-for-ipython).

This is a tutorial related to [L645 Advanced Natural Language Processing](http://damir.cavar.me/l645/) in Fall 2025 at Indiana University. The following tutorial assumes that you are using a recent version of [Python 3.x](https://python.org/).

This code pulls PDFs and some metadata (title, authors, summary, PDF URL) from arXiv and stores the data in subfolders, one per search query.

Install the `arxiv` and `pathvalidate` modules, for example with `pip install arxiv pathvalidate`.

Load the modules:

In [1]:
import arxiv
from urllib.error import HTTPError
import pathvalidate
import os

Create a client:

In [2]:
client = arxiv.Client()

The following function prints the metadata for each result, writes it to a summary text file, and downloads the PDF into the target folder:

In [3]:
def get_results(search, output_path):
    max_title_length = 228  # leaves room for "_summary.txt" within 240 characters
    for result in client.results(search):
        authors = ", ".join(author.name for author in result.authors)
        print(f"Title: {result.title}")
        print(f"Authors: {authors}")
        print(f"Summary: {result.summary[:200]}...")  # Print first 200 chars of summary
        print(f"PDF URL: {result.pdf_url}")
        print("-" * 30)
        # Truncate the sanitized title before adding a suffix, so the extension is never cut off
        base_name = pathvalidate.sanitize_filename(result.title)[:max_title_length]
        summary_path = os.path.join(output_path, f"{base_name}_summary.txt")
        # Write the summary only once, but still try the PDF if it is missing
        if not os.path.exists(summary_path):
            with open(summary_path, "w", encoding="utf-8") as f:
                f.write(f"{result.title}\n")
                f.write(f"Authors: {authors}\n")
                f.write(f"Summary: {result.summary}\n")
                f.write(f"PDF URL: {result.pdf_url}\n")
        pdf_filename = f"{base_name}.pdf"
        if os.path.exists(os.path.join(output_path, pdf_filename)):
            continue  # PDF already downloaded
        try:
            result.download_pdf(dirpath=output_path, filename=pdf_filename)
        except HTTPError as e:
            print(f"Failed to download PDF for {result.title}: {e}")
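
Long titles need care when building file names: cutting a complete file name at a fixed length can also cut off its extension. The following is a minimal sketch of the idea in isolation. The helper name `build_filename` is hypothetical (it is not part of `arxiv` or `pathvalidate`), and the regular expression is only a simple stand-in for `pathvalidate.sanitize_filename`:

```python
import re


def build_filename(title, suffix, max_length=240):
    """Hypothetical helper: file name of at most max_length characters that always ends in suffix."""
    # Stand-in for pathvalidate.sanitize_filename (an assumption for this sketch):
    # drop characters that are invalid in file names on common platforms.
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "", title)
    # Collapse runs of whitespace (titles from the API can contain line breaks).
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Shorten only the title part, so the suffix survives.
    keep = max_length - len(suffix)
    return cleaned[:keep].rstrip() + suffix


print(build_filename("A/B: Test?", ".pdf"))
print(len(build_filename("x" * 500, "_summary.txt")))
```

The design choice here is to reserve space for the suffix first and truncate the title afterwards, so both the summary file and the PDF keep a recognizable extension.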

Limit the number of results:

In [6]:
max_results = 2

Define the search queries. A query can be a plain search term such as "Lung Cancer" or "Alzheimer's", or it can be restricted to a category, for example "cat:cs.LG AND Alzheimer's" for the "cs.LG" category. Each query is paired with the folder that will hold its results:

In [7]:
search_query_paths = [ ("Lung Cancer", "./data/arxiv_papers/LungCancer/"),
                      ("Alzheimer's", "./data/arxiv_papers/Alzheimers/") ]
for query, path in search_query_paths:
    os.makedirs(path, exist_ok=True)
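
The arXiv API query syntax accepts field prefixes such as `cat:` (category), `ti:` (title), and `all:` (all fields), combined with `AND`, `OR`, and `ANDNOT`. The following is a minimal sketch of composing such a string. The helper name `build_query` is hypothetical and introduced here only for illustration; its result could be used as the query string in a (query, path) pair:

```python
def build_query(phrase, category=None):
    """Hypothetical helper: compose an arXiv API query string."""
    # Double quotes ask the API to match multi-word terms as a phrase.
    query = f'all:"{phrase}"'
    if category:
        # Restrict the search to one category, e.g. cs.LG.
        query = f"cat:{category} AND {query}"
    return query


print(build_query("Lung Cancer"))
print(build_query("Alzheimer's", category="cs.LG"))
```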

In [8]:
for search_query, output_path in search_query_paths:
    search = arxiv.Search(
        query=search_query,
        max_results=max_results,  # Limit the number of results
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending
    )
    get_results(search, output_path)

Title: Dynamic Residual Encoding with Slide-Level Contrastive Learning for End-to-End Whole Slide Image Representation
Authors: Jing Jin, Xu Liu, Te Gao, Zhihong Shi, Yixiong Liang, Ruiqing Zheng, Hulin Kuang, Min Zeng, Shichao Kan
Summary: Whole Slide Image (WSI) representation is critical for cancer subtyping,
cancer recognition and mutation prediction.Training an end-to-end WSI
representation model poses significant challenges, as a s...
PDF URL: http://arxiv.org/pdf/2511.05034v1
------------------------------
Title: LG-NuSegHop: A Local-to-Global Self-Supervised Pipeline For Nuclei Instance Segmentation
Authors: Vasileios Magoulianitis, Catherine A. Alexander, Jiaxin Yang, C. -C. Jay Kuo
Summary: Nuclei segmentation is the cornerstone task in histology image reading,
shedding light on the underlying molecular patterns and leading to disease or
cancer diagnosis. Yet, it is a laborious task that...
PDF URL: http://arxiv.org/pdf/2511.04892v1
------------------------------
Title: Early
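
Once the loop has finished, it can be useful to check what was actually stored in the subfolders. The following is a minimal sketch using only the standard library. The helper name `count_downloads` is hypothetical, and it assumes the folder layout and the `_summary.txt` naming used in this notebook:

```python
from pathlib import Path


def count_downloads(root):
    """Hypothetical helper: count PDFs and summary files per subfolder of root."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            pdfs = len(list(folder.glob("*.pdf")))
            summaries = len(list(folder.glob("*_summary.txt")))
            counts[folder.name] = (pdfs, summaries)
    return counts


root = Path("./data/arxiv_papers")
if root.exists():  # nothing to report before the first download
    for name, (pdfs, summaries) in count_downloads(root).items():
        print(f"{name}: {pdfs} PDFs, {summaries} summaries")
```

A folder with more summaries than PDFs suggests that some downloads failed and may be worth retrying.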

**(C) 2025 by [Damir Cavar](http://damir.cavar.me/) <<dcavar@iu.edu>>**