Skip to content

refactor(readers): extract shared ISO 2709 primitives (bd-qzew PR1)#98

Merged
dchud merged 2 commits intomainfrom
feat/bd-qzew-extract-iso2709
Apr 19, 2026
Merged

refactor(readers): extract shared ISO 2709 primitives (bd-qzew PR1)#98
dchud merged 2 commits intomainfrom
feat/bd-qzew-extract-iso2709

Conversation

@dchud
Copy link
Copy Markdown
Owner

@dchud dchud commented Apr 19, 2026

Summary

PR1 of 2 for bd-qzew (MarcError enrichment). Mechanical extraction only — no semantic or error-API changes. PR2 (enrichment via ParseContext) follows once this lands.

The three MARC readers (bibliographic, authority, holdings) each carried their own near-duplicate copy of the byte-level ISO 2709 parsing loop. This consolidates the truly-identical pieces into a new src/iso2709.rs module.

What's extracted into src/iso2709.rs

  • Constants: FIELD_TERMINATOR, SUBFIELD_DELIMITER, RECORD_TERMINATOR, LEADER_LEN, DIRECTORY_ENTRY_LEN
  • read_leader_bytes — 24-byte leader read with EOF-as-Ok(None) handling
  • read_record_data — truncation-aware remaining-bytes read
  • parse_4digits / parse_5digits — ASCII digit parsers
  • parse_directory_entry — single 12-byte directory entry parse (exported for PR2 use)
  • is_control_field_tag — tag predicate (00X, < \"010\") (exported for PR2 use)

What each reader keeps (record-type-specific)

  • Leader validation: type z for authority, x/y/v/u for holdings, any for bib
  • Subfield-parsing loop (see deferral note below)
  • Field dispatch by tag (1XX → heading for authority, 852 → location for holdings, field bag for bib)

Deferred to PR2 (deliberate scope reduction)

The bd-qzew description listed parse_data_field and parse_subfields as primitives to extract. These are not in this PR because the three readers have subtly different subfield-parsing semantics that would require either three variants or a semantic change:

Reader UTF-8 mode On unrecognized byte
bib lossy error
authority lossy skip
holdings strict (errors on invalid UTF-8) skip

PR2 introduces ParseContext and formalizes per-format parsing behavior — that's the right place to converge.

Dead-code preservation

A latent recovery comparison appears in all three readers:
```rust
if record_data.len() < (record_length - 24) && self.recovery_mode != RecoveryMode::Strict { ... }
```
This branch is currently unreachable because the buffer is allocated at full length. Preserved verbatim across all three readers to maintain behavioral parity. PR2 wires up proper truncation handling via ParseContext.

Net diff

File Change
src/iso2709.rs (new) +280 (~120 are tests)
src/lib.rs +1 (mod registration)
src/reader.rs -52 net
src/authority_reader.rs -19 net
src/holdings_reader.rs -19 net

Test plan

  • .cargo/check.sh clean locally (rustfmt, clippy, rustdoc, audit, maturin, 489 unit tests + 30 doc tests + 611 Python tests)
  • Pre-push hook ran the full check on push
  • CI green on all platforms (macOS, Linux, Windows)
  • Reviewer eyeballs the deferred-scope rationale in this description

🤖 Generated with Claude Code

dchud and others added 2 commits April 19, 2026 14:03
Captures the brainstorming outcome for the MarcError enrichment work:
- PR sequencing (PR1 mechanical extract, PR2 enrichment)
- ParseContext shape and filename-plumbing API (with_source + from_path)
- Restructured MarcError variant list and per-variant requirements
- Python exception subclass hierarchy (strict pymarc parent superset)
- Pymarc-compat docs scoped tightly to exception layer
- src-python ParseError unification with MarcError
- Self-review addenda: pickle round-trip needs explicit __reduce__,
  _format() definition, PR1-vs-PR2 plan separation

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The three MARC readers (bibliographic, authority, holdings) each carried
their own near-duplicate copy of the byte-level ISO 2709 parsing loop.
This change consolidates the truly-identical pieces into a new
`src/iso2709.rs` module that all three readers call into. No semantic or
error-API changes; existing tests pass unmodified.

Extracted into src/iso2709.rs:
- Constants: FIELD_TERMINATOR, SUBFIELD_DELIMITER, RECORD_TERMINATOR,
  LEADER_LEN, DIRECTORY_ENTRY_LEN
- read_leader_bytes: 24-byte leader read with EOF-as-Ok(None) handling
- read_record_data: truncation-aware remaining-bytes read
- parse_4digits / parse_5digits: ASCII digit parsers
- parse_directory_entry: single 12-byte directory entry
- is_control_field_tag: tag predicate (00X, < "010")

Each reader keeps:
- Record-type-specific leader validation (z for authority; x/y/v/u for
  holdings; bib accepts any)
- Its own subfield-parsing loop (lossy vs strict UTF-8, error-on-
  unrecognized vs skip-unrecognized differ across readers — convergence
  deferred to PR2 where ParseContext formalizes per-format behavior)
- Field dispatch by tag (1XX -> heading for authority; 852 -> location
  for holdings; field bag for bib)

A latent dead-code recovery comparison (`record_data.len() < (record_length - 24)`)
was preserved verbatim in all three readers to maintain behavioral parity;
proper truncation handling lands in PR2.

Changes:
  src/iso2709.rs           +280 lines (~120 of which are tests)
  src/lib.rs               +1 line (mod registration)
  src/reader.rs            -52 lines net
  src/authority_reader.rs  -19 lines net
  src/holdings_reader.rs   -19 lines net

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dchud dchud self-assigned this Apr 19, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 19, 2026

Merging this PR will improve performance by 37.62%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 21 improved benchmarks
✅ 39 untouched benchmarks
⏩ 16 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime test_sequential_reading_1k 9.7 ms 8.6 ms +12.72%
WallTime test_write_only_10k_rustfile 50.3 ms 44.5 ms +12.94%
WallTime test_threaded_reading_4x_1k 96.6 ms 76.6 ms +26.03%
WallTime test_threaded_reading_4x_10k 975.4 ms 772.7 ms +26.24%
WallTime test_threaded_reading_2x_10k 283 ms 238.3 ms +18.8%
WallTime test_file_parallel_4x_10k_with_extraction 1,079 ms 815.3 ms +32.35%
WallTime test_parallel_read_4x_10k 969.7 ms 762.3 ms +27.21%
WallTime test_threading_speedup_2x_10k 282.8 ms 241.2 ms +17.27%
WallTime test_threaded_with_title_extraction_2x_10k 289.9 ms 253.8 ms +14.22%
WallTime test_threaded_with_title_extraction_4x_10k 991.4 ms 781.7 ms +26.82%
WallTime test_parallel_read_4x_1k 94.3 ms 76.6 ms +23.02%
WallTime test_parallel_read_with_extract_4x_1k 104.2 ms 82.9 ms +25.64%
WallTime test_threaded_reading_1k 28.6 ms 23.8 ms +20.01%
WallTime test_threading_speedup_4x_10k 978.8 ms 744.3 ms +31.5%
WallTime test_file_sequential_4x_10k 343 ms 306.2 ms +12.02%
WallTime test_file_sequential_2x_10k 167.3 ms 150.9 ms +10.89%
WallTime test_file_parallel_2x_10k 303.2 ms 249.2 ms +21.65%
WallTime test_parallel_read_with_extract_4x_10k 1,058.5 ms 855.3 ms +23.75%
WallTime test_file_parallel_4x_10k 996.6 ms 724.2 ms +37.62%
WallTime test_file_sequential_1x_10k 83.8 ms 75.1 ms +11.65%
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing feat/bd-qzew-extract-iso2709 (2ca456d) with main (5d4b1fd)2

Open in CodSpeed

Footnotes

  1. 16 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (2ca456d) during the generation of this report, so 5d4b1fd was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@dchud dchud merged commit 8129ac4 into main Apr 19, 2026
48 checks passed
@dchud dchud deleted the feat/bd-qzew-extract-iso2709 branch April 19, 2026 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant