refactor(readers): extract shared ISO 2709 primitives (bd-qzew PR1)#98
refactor(readers): extract shared ISO 2709 primitives (bd-qzew PR1)#98
Conversation
Captures the brainstorming outcome for the MarcError enrichment work: - PR sequencing (PR1 mechanical extract, PR2 enrichment) - ParseContext shape and filename-plumbing API (with_source + from_path) - Restructured MarcError variant list and per-variant requirements - Python exception subclass hierarchy (strict pymarc parent superset) - Pymarc-compat docs scoped tightly to exception layer - src-python ParseError unification with MarcError - Self-review addenda: pickle round-trip needs explicit __reduce__, _format() definition, PR1-vs-PR2 plan separation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The three MARC readers (bibliographic, authority, holdings) each carried their own near-duplicate copy of the byte-level ISO 2709 parsing loop. This change consolidates the truly-identical pieces into a new `src/iso2709.rs` module that all three readers call into. No semantic or error-API changes; existing tests pass unmodified. Extracted into src/iso2709.rs: - Constants: FIELD_TERMINATOR, SUBFIELD_DELIMITER, RECORD_TERMINATOR, LEADER_LEN, DIRECTORY_ENTRY_LEN - read_leader_bytes: 24-byte leader read with EOF-as-Ok(None) handling - read_record_data: truncation-aware remaining-bytes read - parse_4digits / parse_5digits: ASCII digit parsers - parse_directory_entry: single 12-byte directory entry - is_control_field_tag: tag predicate (00X, < "010") Each reader keeps: - Record-type-specific leader validation (z for authority; x/y/v/u for holdings; bib accepts any) - Its own subfield-parsing loop (lossy vs strict UTF-8, error-on- unrecognized vs skip-unrecognized differ across readers — convergence deferred to PR2 where ParseContext formalizes per-format behavior) - Field dispatch by tag (1XX -> heading for authority; 852 -> location for holdings; field bag for bib) A latent dead-code recovery comparison (`record_data.len() < (record_length - 24)`) was preserved verbatim in all three readers to maintain behavioral parity; proper truncation handling lands in PR2. Changes: src/iso2709.rs +280 lines (~120 of which are tests) src/lib.rs +1 line (mod registration) src/reader.rs -52 lines net src/authority_reader.rs -19 lines net src/holdings_reader.rs -19 lines net Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merging this PR will improve performance by 37.62%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | WallTime | test_sequential_reading_1k |
9.7 ms | 8.6 ms | +12.72% |
| ⚡ | WallTime | test_write_only_10k_rustfile |
50.3 ms | 44.5 ms | +12.94% |
| ⚡ | WallTime | test_threaded_reading_4x_1k |
96.6 ms | 76.6 ms | +26.03% |
| ⚡ | WallTime | test_threaded_reading_4x_10k |
975.4 ms | 772.7 ms | +26.24% |
| ⚡ | WallTime | test_threaded_reading_2x_10k |
283 ms | 238.3 ms | +18.8% |
| ⚡ | WallTime | test_file_parallel_4x_10k_with_extraction |
1,079 ms | 815.3 ms | +32.35% |
| ⚡ | WallTime | test_parallel_read_4x_10k |
969.7 ms | 762.3 ms | +27.21% |
| ⚡ | WallTime | test_threading_speedup_2x_10k |
282.8 ms | 241.2 ms | +17.27% |
| ⚡ | WallTime | test_threaded_with_title_extraction_2x_10k |
289.9 ms | 253.8 ms | +14.22% |
| ⚡ | WallTime | test_threaded_with_title_extraction_4x_10k |
991.4 ms | 781.7 ms | +26.82% |
| ⚡ | WallTime | test_parallel_read_4x_1k |
94.3 ms | 76.6 ms | +23.02% |
| ⚡ | WallTime | test_parallel_read_with_extract_4x_1k |
104.2 ms | 82.9 ms | +25.64% |
| ⚡ | WallTime | test_threaded_reading_1k |
28.6 ms | 23.8 ms | +20.01% |
| ⚡ | WallTime | test_threading_speedup_4x_10k |
978.8 ms | 744.3 ms | +31.5% |
| ⚡ | WallTime | test_file_sequential_4x_10k |
343 ms | 306.2 ms | +12.02% |
| ⚡ | WallTime | test_file_sequential_2x_10k |
167.3 ms | 150.9 ms | +10.89% |
| ⚡ | WallTime | test_file_parallel_2x_10k |
303.2 ms | 249.2 ms | +21.65% |
| ⚡ | WallTime | test_parallel_read_with_extract_4x_10k |
1,058.5 ms | 855.3 ms | +23.75% |
| ⚡ | WallTime | test_file_parallel_4x_10k |
996.6 ms | 724.2 ms | +37.62% |
| ⚡ | WallTime | test_file_sequential_1x_10k |
83.8 ms | 75.1 ms | +11.65% |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Comparing feat/bd-qzew-extract-iso2709 (2ca456d) with main (5d4b1fd)2
Footnotes
-
16 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
-
No successful run was found on
main(2ca456d) during the generation of this report, so 5d4b1fd was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩
Summary
PR1 of 2 for bd-qzew (MarcError enrichment). Mechanical extraction only — no semantic or error-API changes. PR2 (enrichment via
ParseContext) follows once this lands.The three MARC readers (bibliographic, authority, holdings) each carried their own near-duplicate copy of the byte-level ISO 2709 parsing loop. This consolidates the truly-identical pieces into a new
src/iso2709.rsmodule.What's extracted into
src/iso2709.rsFIELD_TERMINATOR,SUBFIELD_DELIMITER,RECORD_TERMINATOR,LEADER_LEN,DIRECTORY_ENTRY_LENread_leader_bytes— 24-byte leader read with EOF-as-Ok(None)handlingread_record_data— truncation-aware remaining-bytes readparse_4digits/parse_5digits— ASCII digit parsersparse_directory_entry— single 12-byte directory entry parse (exported for PR2 use)is_control_field_tag— tag predicate (00X,< \"010\") (exported for PR2 use)What each reader keeps (record-type-specific)
zfor authority,x/y/v/ufor holdings, any for bibDeferred to PR2 (deliberate scope reduction)
The bd-qzew description listed
parse_data_fieldandparse_subfieldsas primitives to extract. These are not in this PR because the three readers have subtly different subfield-parsing semantics that would require either three variants or a semantic change:PR2 introduces
ParseContextand formalizes per-format parsing behavior — that's the right place to converge.Dead-code preservation
A latent recovery comparison appears in all three readers:
```rust
if record_data.len() < (record_length - 24) && self.recovery_mode != RecoveryMode::Strict { ... }
```
This branch is currently unreachable because the buffer is allocated at full length. Preserved verbatim across all three readers to maintain behavioral parity. PR2 wires up proper truncation handling via
ParseContext.Net diff
src/iso2709.rs(new)src/lib.rssrc/reader.rssrc/authority_reader.rssrc/holdings_reader.rsTest plan
.cargo/check.shclean locally (rustfmt, clippy, rustdoc, audit, maturin, 489 unit tests + 30 doc tests + 611 Python tests)🤖 Generated with Claude Code