feat(fetchers): ArXivFetcher — paper metadata and abstract extraction #57

@chaliy

What

Add an ArXivFetcher that matches arxiv.org/abs/{id} and arxiv.org/pdf/{id} URLs, returning structured paper metadata optimized for research agents.

Why

Research agents and agents working on ML/AI tasks frequently encounter arXiv links. The current DefaultFetcher returns the noisy arXiv HTML page. The arXiv API provides clean, structured metadata including abstracts, author lists, and categorization.

Requirements

  • Match: https://arxiv.org/abs/{id}, https://arxiv.org/pdf/{id}
  • Fetch via arXiv API: http://export.arxiv.org/api/query?id_list={id}
  • Return: title, authors, abstract, categories, published/updated dates, DOI, journal ref
  • For /pdf/ URLs: return metadata + indicate binary content (consistent with core binary handling)
  • Include links to: PDF, HTML (if available via ar5iv.labs.arxiv.org), related papers
  • Format field: "arxiv_paper"
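The matching and lookup requirements above can be sketched as follows. This is only an illustration, not the fetcher's actual interface: `ArxivKind`, `ArxivRef`, `parse_arxiv_url`, `api_query_url`, and `ar5iv_url` are hypothetical names, accepting `http://` and a `www.` prefix is an assumption beyond the two listed patterns, and the ar5iv path layout (`/html/{id}`) should be verified before relying on it.

```rust
/// Which arXiv page the URL pointed at (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum ArxivKind {
    Abstract,
    Pdf,
}

/// Parsed arXiv URL: the page kind plus the paper identifier.
#[derive(Debug, PartialEq)]
pub struct ArxivRef {
    pub kind: ArxivKind,
    pub id: String,
}

/// Extract the paper id from an arxiv.org/abs/{id} or arxiv.org/pdf/{id} URL.
/// Returns None for anything else.
pub fn parse_arxiv_url(url: &str) -> Option<ArxivRef> {
    // Assumption: tolerate http:// and a leading "www." as well as https://.
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    let rest = rest.strip_prefix("www.").unwrap_or(rest);
    let path = rest.strip_prefix("arxiv.org/")?;

    // Drop any query string or fragment.
    let path = path
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");

    let (kind, id) = if let Some(id) = path.strip_prefix("abs/") {
        (ArxivKind::Abstract, id)
    } else if let Some(id) = path.strip_prefix("pdf/") {
        (ArxivKind::Pdf, id)
    } else {
        return None;
    };

    // PDF links sometimes carry a trailing ".pdf"; strip it and any trailing slash.
    let id = id.trim_end_matches('/');
    let id = id.strip_suffix(".pdf").unwrap_or(id);
    if id.is_empty() {
        return None;
    }
    Some(ArxivRef { kind, id: id.to_string() })
}

/// Build the arXiv API query URL given in the requirements.
pub fn api_query_url(id: &str) -> String {
    format!("http://export.arxiv.org/api/query?id_list={}", id)
}

/// Build a candidate ar5iv HTML link (assumed path layout; may not exist for every paper).
pub fn ar5iv_url(id: &str) -> String {
    format!("https://ar5iv.labs.arxiv.org/html/{}", id)
}
```

Keeping the `abs`/`pdf` distinction in the parsed result lets the fetcher return the same metadata for both URL forms while flagging `/pdf/` links as binary content.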

Design Notes

  • arXiv API returns Atom XML — will need XML parsing (consider quick-xml crate)
  • ar5iv.labs.arxiv.org provides HTML versions of papers — could be fetched for full-text
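  • Rate limiting: no API key is required, but arXiv asks for reasonable usage (its API terms of use ask for no more than one request every three seconds), so requests should be throttled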
  • Consider extracting references/citations if available in API response
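For the rate-limiting note, one possible approach is a small minimum-interval throttle owned by the fetcher. The sketch below is standard-library only. `Throttle`, `delay_at`, and `record` are hypothetical names, and the interval itself is an assumption to be checked against arXiv's current API terms.

```rust
use std::time::{Duration, Instant};

/// Minimal minimum-interval throttle (hypothetical helper, not an existing API).
pub struct Throttle {
    min_interval: Duration,
    last: Option<Instant>,
}

impl Throttle {
    pub fn new(min_interval: Duration) -> Self {
        Throttle { min_interval, last: None }
    }

    /// How long the caller should wait before sending a request at `now`.
    /// Zero if no request has been recorded yet or enough time has passed.
    pub fn delay_at(&self, now: Instant) -> Duration {
        match self.last {
            Some(last) => self
                .min_interval
                .saturating_sub(now.saturating_duration_since(last)),
            None => Duration::ZERO,
        }
    }

    /// Record that a request was sent at `at`.
    pub fn record(&mut self, at: Instant) {
        self.last = Some(at);
    }
}
```

A caller would sleep for `delay_at(Instant::now())` before each API request and then call `record`. Passing the timestamp in explicitly keeps the logic deterministic and easy to unit-test.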

Tier

3 — Differentiated capability

Labels

enhancement (New feature or request)
