CLI for extracting arXiv references from Crossref data files, validating them against arXiv (DataCite) records, and aggregating citations by cited work.
This tool processes Crossref snapshot data to identify works that cite arXiv papers. It extracts arXiv IDs from reference metadata using pattern matching, validates them against an arXiv index (built from DataCite records), and outputs citation data organized by cited arXiv work.
Crossref tar.gz → Extract arXiv IDs → Partition by arXiv prefix → Aggregate by cited work → Validate against arXiv index → Output
```shell
cargo build --release
```

The binary will be at `target/release/crossref-citation-extraction`.
Pre-build a finite state transducer (FST) index from arXiv (DataCite) records for faster pipeline runs:
```shell
./target/release/crossref-citation-extraction build-index \
  --input arxiv-datacite-records.jsonl.gz \
  --output arxiv.fst
```

Extract and validate arXiv citations using a pre-built FST index:
```shell
./target/release/crossref-citation-extraction pipeline \
  --input crossref-snapshot.tar.gz \
  --arxiv-fst arxiv.fst \
  --output-dir ./results
```

Alternatively, build the FST on the fly from arXiv records:
```shell
./target/release/crossref-citation-extraction pipeline \
  --input crossref-snapshot.tar.gz \
  --arxiv-records arxiv-records.jsonl.gz \
  --output-dir ./results
```

Filter by provenance to include only publisher-asserted or Crossref-matched references:
```shell
./target/release/crossref-citation-extraction pipeline \
  --input crossref-snapshot.tar.gz \
  --arxiv-fst arxiv.fst \
  --provenance publisher,crossref \
  --output-dir ./results
```

Resume a pipeline from existing partitions (skips the extraction phase):
```shell
./target/release/crossref-citation-extraction pipeline \
  --partitions-dir ./temp/partitions \
  --arxiv-fst arxiv.fst \
  --output-dir ./results
```

| Option | Description |
|---|---|
| `--input` | Crossref snapshot (tar.gz, directory, or JSON file) |
| `--arxiv-fst` | Path to a pre-built arXiv FST index |
| `--arxiv-records` | Path to arXiv records JSONL.gz (builds the FST on the fly) |
| `--output-dir` | Directory for output files (default: current directory) |
| `--outputs` | Output types: `valid`, `failed`, `publisher`, `crossref`, `mined`, or `all` |
| `--provenance` | Filter by provenance: `publisher`, `crossref`, `mined` |
| `--temp-dir` | Directory for intermediate partition files |
| `--partitions-dir` | Use existing partitions (skip the extraction phase) |
| `--batch-size` | Batch size for memory management (default: 5000000) |
| `--checkpoint-interval` | Number of partitions between checkpoints (default: 50) |
| `--keep-intermediates` | Keep partition files after completion |
| `--log-level` | Logging level: `DEBUG`, `INFO`, `WARN`, `ERROR` |
Crossref input (`--input`) accepts tar.gz archives (streamed without extraction), directories containing JSON files, or single JSON files for testing.
arXiv records (`--arxiv-records`) accepts gzipped JSONL files, uncompressed JSONL files, directories with `updated_*/` subdirectories, or directories containing JSONL.gz files.
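Both record inputs boil down to streaming JSONL. As a minimal sketch (illustrative Python, not the tool's Rust implementation), a gzipped JSONL file can be consumed one record at a time without loading it fully:

```python
import gzip
import json
import os
import tempfile

def iter_records(path):
    """Stream JSON records from a .jsonl.gz file, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Demo: round-trip a single record through a temporary .jsonl.gz file.
tmp = tempfile.NamedTemporaryFile(suffix=".jsonl.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt", encoding="utf-8") as fh:
    fh.write(json.dumps({"arxiv_id": "2403.03542"}) + "\n")
records = list(iter_records(tmp.name))
os.unlink(tmp.name)
print(records)
```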
Each line of the output is a JSON object representing a cited arXiv work:
```json
{
  "arxiv_doi": "10.48550/arXiv.2403.03542",
  "arxiv_id": "2403.03542",
  "reference_count": 5,
  "citation_count": 3,
  "cited_by": [
    {
      "citing_doi": "10.5678/citing-paper",
      "provenance": "publisher",
      "raw_match": "arXiv:2403.03542",
      "ref_json": "{...reference metadata...}"
    }
  ]
}
```

The `--outputs` option controls which files are generated:
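Because each output line is a self-contained JSON object, downstream analysis is straightforward. A minimal Python sketch (not part of the tool; `top_cited` is a hypothetical helper) that ranks cited works by `citation_count` from `results.jsonl`-style lines:

```python
import json

def top_cited(lines, n=10):
    """Return the n most-cited arXiv IDs from JSONL output lines."""
    counts = {}
    for line in lines:
        record = json.loads(line)
        counts[record["arxiv_id"]] = record["citation_count"]
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]

# One record shaped like the example above, with cited_by elided.
sample = ('{"arxiv_doi": "10.48550/arXiv.2403.03542", "arxiv_id": "2403.03542", '
          '"reference_count": 5, "citation_count": 3, "cited_by": []}')
print(top_cited([sample]))
```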
| Output Type | Filename | Description |
|---|---|---|
| valid | results.jsonl | All validated arXiv citations |
| failed | results.failed | Citations that failed validation |
| publisher | results.publisher | Publisher-asserted references only |
| crossref | results.crossref | Crossref-matched references only |
| mined | results.mined | Text-extracted references only |
Use `--outputs all` to generate all output types.
Each citation includes a `provenance` field indicating how the arXiv ID was obtained: `publisher` means the ID was explicitly provided in the reference metadata, `crossref` means it was matched by Crossref, and `mined` means it was extracted from unstructured text.
The extractor recognizes the modern format (`arXiv:2403.03542`, `arXiv.2403.03542v2`), the old format (`arXiv:hep-ph/9901234`, `arXiv:cs.DM/9910013`), the DOI format (`10.48550/arXiv.2403.03542`), and the URL format (`arxiv.org/abs/2403.03542`).
References must contain "arxiv" context to match; a bare `2403.03542` will not match.
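The matching rules above can be approximated with regular expressions. This is an illustrative Python sketch, not the tool's actual patterns; `MODERN`, `OLD`, and `extract_arxiv_ids` are names invented here:

```python
import re

# Rough approximations of the recognized formats. Every pattern requires
# explicit "arxiv" context, so a bare ID like "2403.03542" never matches.
MODERN = r"(?:arxiv[:.]|arxiv\.org/abs/|10\.48550/arxiv\.)(\d{4}\.\d{4,5})(?:v\d+)?"
OLD = r"arxiv:([a-z-]+(?:\.[a-z]{2})?/\d{7})"

def extract_arxiv_ids(reference_text: str) -> list[str]:
    """Extract candidate arXiv IDs from a reference string."""
    ids = []
    for pattern in (MODERN, OLD):
        ids.extend(re.findall(pattern, reference_text, flags=re.IGNORECASE))
    return ids

print(extract_arxiv_ids("see arXiv:2403.03542v2 and arXiv:hep-ph/9901234"))
```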
The tool uses a streaming architecture to process large datasets with bounded memory. Tar.gz input files are streamed without full extraction. References are written to per-partition Parquet files keyed by arXiv ID prefix (the first four characters). Partitions are processed in parallel using Polars for efficient group-by operations. Validation uses a memory-efficient finite state transducer for arXiv ID lookup.
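The partition-then-validate flow can be pictured with a small sketch. This is illustrative Python, not the Rust implementation: a plain set stands in for the FST (both answer membership queries; the FST just does so in far less memory), and the helper names are hypothetical:

```python
from collections import defaultdict

def partition_by_prefix(references):
    """Bucket (arxiv_id, citing_doi) pairs by the first 4 characters of the ID."""
    partitions = defaultdict(list)
    for arxiv_id, citing_doi in references:
        partitions[arxiv_id[:4]].append((arxiv_id, citing_doi))
    return partitions

def validate_partition(partition, known_ids):
    """Keep only citations whose arXiv ID exists in the index."""
    return [(aid, doi) for aid, doi in partition if aid in known_ids]

refs = [("2403.03542", "10.5678/citing-paper"), ("9999.99999", "10.5678/other")]
known = {"2403.03542"}          # stand-in for the FST index
parts = partition_by_prefix(refs)
valid = validate_partition(parts["2403"], known)
print(valid)
```

Partitioning by prefix keeps each group-by small enough to process independently and in parallel.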
Long-running pipelines can be resumed from checkpoint files stored in the partition directory.
```shell
cargo build            # Debug build
cargo build --release  # Release build
cargo test             # Run tests
cargo fmt              # Format code
cargo clippy           # Lint
```