refactor(readers): extract shared ISO 2709 primitives (bd-qzew PR1) by dchud · Pull Request #98 · dchud/mrrc

dchud · 2026-04-19T18:04:54Z

Summary

PR1 of 2 for bd-qzew (MarcError enrichment). Mechanical extraction only — no semantic or error-API changes. PR2 (enrichment via ParseContext) follows once this lands.

The three MARC readers (bibliographic, authority, holdings) each carried their own near-duplicate copy of the byte-level ISO 2709 parsing loop. This consolidates the truly-identical pieces into a new src/iso2709.rs module.

What's extracted into `src/iso2709.rs`

Constants: FIELD_TERMINATOR, SUBFIELD_DELIMITER, RECORD_TERMINATOR, LEADER_LEN, DIRECTORY_ENTRY_LEN
read_leader_bytes — 24-byte leader read with EOF-as-Ok(None) handling
read_record_data — truncation-aware remaining-bytes read
parse_4digits / parse_5digits — ASCII digit parsers
parse_directory_entry — single 12-byte directory entry parse (exported for PR2 use)
is_control_field_tag — tag predicate (00X, < \"010\") (exported for PR2 use)

What each reader keeps (record-type-specific)

Leader validation: type z for authority, x/y/v/u for holdings, any for bib
Subfield-parsing loop (see deferral note below)
Field dispatch by tag (1XX → heading for authority, 852 → location for holdings, field bag for bib)

Deferred to PR2 (deliberate scope reduction)

The bd-qzew description listed parse_data_field and parse_subfields as primitives to extract. These are not in this PR because the three readers have subtly different subfield-parsing semantics that would require either three variants or a semantic change:

Reader	UTF-8 mode	On unrecognized byte
bib	lossy	error
authority	lossy	skip
holdings	strict (errors on invalid UTF-8)	skip

PR2 introduces ParseContext and formalizes per-format parsing behavior — that's the right place to converge.

Dead-code preservation

A latent recovery comparison appears in all three readers:
```rust
if record_data.len() < (record_length - 24) && self.recovery_mode != RecoveryMode::Strict { ... }
```
This branch is currently unreachable because the buffer is allocated at full length. Preserved verbatim across all three readers to maintain behavioral parity. PR2 wires up proper truncation handling via ParseContext.

Net diff

File	Change
`src/iso2709.rs` (new)	+280 (~120 are tests)
`src/lib.rs`	+1 (mod registration)
`src/reader.rs`	-52 net
`src/authority_reader.rs`	-19 net
`src/holdings_reader.rs`	-19 net

Test plan

.cargo/check.sh clean locally (rustfmt, clippy, rustdoc, audit, maturin, 489 unit tests + 30 doc tests + 611 Python tests)
Pre-push hook ran the full check on push
CI green on all platforms (macOS, Linux, Windows)
Reviewer eyeballs the deferred-scope rationale in this description

🤖 Generated with Claude Code

Captures the brainstorming outcome for the MarcError enrichment work: - PR sequencing (PR1 mechanical extract, PR2 enrichment) - ParseContext shape and filename-plumbing API (with_source + from_path) - Restructured MarcError variant list and per-variant requirements - Python exception subclass hierarchy (strict pymarc parent superset) - Pymarc-compat docs scoped tightly to exception layer - src-python ParseError unification with MarcError - Self-review addenda: pickle round-trip needs explicit __reduce__, _format() definition, PR1-vs-PR2 plan separation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The three MARC readers (bibliographic, authority, holdings) each carried their own near-duplicate copy of the byte-level ISO 2709 parsing loop. This change consolidates the truly-identical pieces into a new `src/iso2709.rs` module that all three readers call into. No semantic or error-API changes; existing tests pass unmodified. Extracted into src/iso2709.rs: - Constants: FIELD_TERMINATOR, SUBFIELD_DELIMITER, RECORD_TERMINATOR, LEADER_LEN, DIRECTORY_ENTRY_LEN - read_leader_bytes: 24-byte leader read with EOF-as-Ok(None) handling - read_record_data: truncation-aware remaining-bytes read - parse_4digits / parse_5digits: ASCII digit parsers - parse_directory_entry: single 12-byte directory entry - is_control_field_tag: tag predicate (00X, < "010") Each reader keeps: - Record-type-specific leader validation (z for authority; x/y/v/u for holdings; bib accepts any) - Its own subfield-parsing loop (lossy vs strict UTF-8, error-on- unrecognized vs skip-unrecognized differ across readers — convergence deferred to PR2 where ParseContext formalizes per-format behavior) - Field dispatch by tag (1XX -> heading for authority; 852 -> location for holdings; field bag for bib) A latent dead-code recovery comparison (`record_data.len() < (record_length - 24)`) was preserved verbatim in all three readers to maintain behavioral parity; proper truncation handling lands in PR2. Changes: src/iso2709.rs +280 lines (~120 of which are tests) src/lib.rs +1 line (mod registration) src/reader.rs -52 lines net src/authority_reader.rs -19 lines net src/holdings_reader.rs -19 lines net Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codspeed-hq · 2026-04-19T18:10:51Z

Merging this PR will improve performance by 37.62%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 21 improved benchmarks
✅ 39 untouched benchmarks
⏩ 16 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	WallTime	`test_sequential_reading_1k`	9.7 ms	8.6 ms	+12.72%
⚡	WallTime	`test_write_only_10k_rustfile`	50.3 ms	44.5 ms	+12.94%
⚡	WallTime	`test_threaded_reading_4x_1k`	96.6 ms	76.6 ms	+26.03%
⚡	WallTime	`test_threaded_reading_4x_10k`	975.4 ms	772.7 ms	+26.24%
⚡	WallTime	`test_threaded_reading_2x_10k`	283 ms	238.3 ms	+18.8%
⚡	WallTime	`test_file_parallel_4x_10k_with_extraction`	1,079 ms	815.3 ms	+32.35%
⚡	WallTime	`test_parallel_read_4x_10k`	969.7 ms	762.3 ms	+27.21%
⚡	WallTime	`test_threading_speedup_2x_10k`	282.8 ms	241.2 ms	+17.27%
⚡	WallTime	`test_threaded_with_title_extraction_2x_10k`	289.9 ms	253.8 ms	+14.22%
⚡	WallTime	`test_threaded_with_title_extraction_4x_10k`	991.4 ms	781.7 ms	+26.82%
⚡	WallTime	`test_parallel_read_4x_1k`	94.3 ms	76.6 ms	+23.02%
⚡	WallTime	`test_parallel_read_with_extract_4x_1k`	104.2 ms	82.9 ms	+25.64%
⚡	WallTime	`test_threaded_reading_1k`	28.6 ms	23.8 ms	+20.01%
⚡	WallTime	`test_threading_speedup_4x_10k`	978.8 ms	744.3 ms	+31.5%
⚡	WallTime	`test_file_sequential_4x_10k`	343 ms	306.2 ms	+12.02%
⚡	WallTime	`test_file_sequential_2x_10k`	167.3 ms	150.9 ms	+10.89%
⚡	WallTime	`test_file_parallel_2x_10k`	303.2 ms	249.2 ms	+21.65%
⚡	WallTime	`test_parallel_read_with_extract_4x_10k`	1,058.5 ms	855.3 ms	+23.75%
⚡	WallTime	`test_file_parallel_4x_10k`	996.6 ms	724.2 ms	+37.62%
⚡	WallTime	`test_file_sequential_1x_10k`	83.8 ms	75.1 ms	+11.65%
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

_{Comparing feat/bd-qzew-extract-iso2709 (2ca456d) with main (5d4b1fd)²}

16 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
No successful run was found on main (2ca456d) during the generation of this report, so 5d4b1fd was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩

dchud and others added 2 commits April 19, 2026 14:03

dchud self-assigned this Apr 19, 2026

dchud merged commit 8129ac4 into main Apr 19, 2026
48 checks passed

dchud deleted the feat/bd-qzew-extract-iso2709 branch April 19, 2026 18:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(readers): extract shared ISO 2709 primitives (bd-qzew PR1)#98

refactor(readers): extract shared ISO 2709 primitives (bd-qzew PR1)#98
dchud merged 2 commits intomainfrom
feat/bd-qzew-extract-iso2709

dchud commented Apr 19, 2026 •

edited

Loading

Uh oh!

codspeed-hq bot commented Apr 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dchud commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's extracted into src/iso2709.rs

What each reader keeps (record-type-specific)

Deferred to PR2 (deliberate scope reduction)

Dead-code preservation

Net diff

Test plan

Uh oh!

codspeed-hq bot commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will improve performance by 37.62%

Performance Changes

Footnotes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dchud commented Apr 19, 2026 •

edited

Loading

What's extracted into `src/iso2709.rs`

codspeed-hq bot commented Apr 19, 2026 •

edited

Loading