# ArxivClient Example Usage

This notebook demonstrates basic usage of `ArxivClient`: fetching the metadata, extracted text, and PDF of an arXiv paper from its URL.

## 1. Import ArxivClient

In [1]:
from arxiv_paper_summarizer.arxiv import ArxivClient, fetch_papers_by_url, load_paper_as_file_by_url

## 2. Fetch Paper Content

Using the URL: https://arxiv.org/abs/2309.08600

In [2]:
# Define the ArXiv URL
url = "https://arxiv.org/abs/2309.08600"

# Fetch paper content
papers = fetch_papers_by_url(url)
paper = papers[0]

print(f"Successfully fetched paper: {paper.title}")

Successfully fetched paper: Sparse Autoencoders Find Highly Interpretable Features in Language Models


## 3. Display Paper Information

In [3]:
print(f"Title: {paper.title}")
print(f"Authors: {', '.join(paper.authors)}")
print(f"Published: {paper.published}")
print(f"URL: {paper.url}")
print(f"Text length: {len(paper.text):,} characters")

Title: Sparse Autoencoders Find Highly Interpretable Features in Language Models
Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
Published: 2023-09-15T17:56:55+00:00
URL: http://arxiv.org/abs/2309.08600v3
Text length: 49,578 characters
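
The `Published` value printed above looks like an ISO 8601 timestamp. As a minimal sketch, assuming `paper.published` is a plain string in that format (the notebook does not confirm its type), it could be parsed with the standard library for date arithmetic or formatting. The variable names here are illustrative only.

```python
from datetime import datetime, timedelta

# Assumption: the timestamp is an ISO 8601 string, copied from the output above.
published_raw = "2023-09-15T17:56:55+00:00"

# fromisoformat understands the "+00:00" offset and returns an aware datetime.
published_at = datetime.fromisoformat(published_raw)

print(published_at.date())
print(published_at.utcoffset() == timedelta(0))  # True when the offset is UTC
```

If `paper.published` turns out to be a `datetime` already, this step is unnecessary.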


## 4. Preview Paper Content

In [4]:
# Show first 500 characters of the paper text
print("First 500 characters of the paper:")
print("-" * 50)
print(paper.text[:500] + "...")

First 500 characters of the paper:
--------------------------------------------------
SPARSE AUTOENCODERS FIND HIGHLY INTER-
PRETABLE FEATURES IN LANGUAGE MODELS
Hoagy Cunningham∗12, Aidan Ewart∗13, Logan Riggs∗1, Robert Huben, Lee Sharkey4
1EleutherAI, 2MATS, 3Bristol AI Safety Centre, 4Apollo Research
{hoagycunningham, aidanprattewart, logansmith5}@gmail.com
ABSTRACT
One of the roadblocks to a better understanding of neural networks’ internals is
polysemanticity, where neurons appear to activate in multiple, semantically dis-
tinct contexts.
Polysemanticity prevents us from ide...
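
The preview shows words split across lines with a trailing hyphen (`INTER-` / `PRETABLE`, `dis-` / `tinct`), which is typical of text extracted from a PDF. The following is a rough sketch of one way to rejoin them before further processing. `join_hyphenated_breaks` is a hypothetical helper, not part of `arxiv_paper_summarizer`, and the heuristic will also merge genuinely hyphenated compounds that happen to break at a line end.

```python
import re


def join_hyphenated_breaks(text: str) -> str:
    """Join words that were split by a hyphen followed by a line break."""
    # A word character, a hyphen, a newline, then another word character:
    # drop the hyphen and the newline, keep both characters.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


# Sample taken from the extracted text shown above.
sample = (
    "SPARSE AUTOENCODERS FIND HIGHLY INTER-\n"
    "PRETABLE FEATURES IN LANGUAGE MODELS"
)
print(join_hyphenated_breaks(sample))
```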


## 5. Get PDF as File Object

In [5]:
# Get PDF as BytesIO object
pdf_file = load_paper_as_file_by_url(url)

if pdf_file is not None:
    print(f"PDF file size: {len(pdf_file.getvalue()):,} bytes")
    print("PDF successfully loaded as file object")
else:
    print("Failed to load PDF")

PDF file size: 1,820,169 bytes
PDF successfully loaded as file object
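
A non-empty download is not necessarily a PDF; an error page would also have a byte count. As a hedged sketch, the buffer could be checked for the `%PDF-` signature that PDF files start with. `looks_like_pdf` is a hypothetical helper, and the buffers below are placeholder bytes rather than real downloads.

```python
import io


def looks_like_pdf(file_obj: io.BytesIO) -> bool:
    """Return True if the buffer starts with the PDF signature bytes."""
    return file_obj.getvalue()[:5] == b"%PDF-"


# Placeholder buffers standing in for the object returned by the loader.
fake_pdf = io.BytesIO(b"%PDF-1.7\n% placeholder bytes, not a real document\n")
not_pdf = io.BytesIO(b"<html>error page</html>")

print(looks_like_pdf(fake_pdf))
print(looks_like_pdf(not_pdf))
```

In the notebook, the same check could be applied to `pdf_file` before passing it to a PDF parser.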


## 6. Utility Functions Demo

In [6]:
# Create ArxivClient instance for utility functions
client = ArxivClient()

# Extract ArXiv ID from URL
arxiv_id = client.parse_arxiv_id(url)
print(f"Extracted ArXiv ID: {arxiv_id}")

# Test with different URL formats
test_urls = [
    "https://arxiv.org/abs/2309.08600",
    "https://huggingface.co/papers/2309.08600",
    "arxiv.org/abs/2309.08600v1"
]

print("\nTesting URL parsing:")
for test_url in test_urls:
    try:
        extracted_id = client.parse_arxiv_id(test_url)
        print(f"  {test_url} -> {extracted_id}")
    except Exception as e:
        print(f"  {test_url} -> Error: {e}")

Extracted ArXiv ID: 2309.08600

Testing URL parsing:
  https://arxiv.org/abs/2309.08600 -> 2309.08600
  https://huggingface.co/papers/2309.08600 -> 2309.08600
  arxiv.org/abs/2309.08600v1 -> 2309.08600
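
The results above show the version suffix (`v1`) being dropped and the same ID being recovered from different hosts. The notebook does not show how `parse_arxiv_id` is implemented; the following is only an illustrative sketch of one way such parsing could work with a regular expression. `extract_arxiv_id` is a hypothetical function, and it handles only new-style identifiers (`YYMM.NNNNN`), not older ones such as `hep-th/9901001`.

```python
import re

# New-style arXiv identifier: four digits, a dot, four or five digits,
# optionally followed by a version suffix such as "v3".
_ARXIV_ID_PATTERN = re.compile(r"(\d{4}\.\d{4,5})(?:v\d+)?")


def extract_arxiv_id(url: str) -> str:
    """Return the version-less arXiv ID found in a URL, or raise ValueError."""
    match = _ARXIV_ID_PATTERN.search(url)
    if match is None:
        raise ValueError(f"No arXiv ID found in: {url!r}")
    return match.group(1)


for candidate in [
    "https://arxiv.org/abs/2309.08600",
    "https://huggingface.co/papers/2309.08600",
    "arxiv.org/abs/2309.08600v1",
]:
    print(f"{candidate} -> {extract_arxiv_id(candidate)}")
```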


## Summary

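This example demonstrates the core `ArxivClient` functionality:

- **`fetch_papers_by_url()`**: Fetch a paper's metadata and extracted text (returns a list; the example uses the first entry)
- **`load_paper_as_file_by_url()`**: Get the PDF as an in-memory file object
- **`ArxivClient.parse_arxiv_id()`**: Extract the arXiv ID from arXiv and Hugging Face paper URLs, with or without a scheme or version suffix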

`ArxivClient` handles PDF downloading and text extraction, and provides easy access to paper metadata.