Pymarc-compatible current_exception and current_chunk accessors on MARCReader#182
Merged
Conversation
Closes #156 The Python MARCReader wrapper now matches pymarc's default-reader shape for the diagnostic accessors. After each __next__ step: - current_chunk holds the bytes of the record just read from the source (declared-length bytes per the leader). Set on every successful chunk read, regardless of whether the parse step then succeeds or fails. - current_exception holds the typed MrrcException caught by permissive=True, or None on a clean read. Implementation: - Rust binding: PyMARCReader carries last_chunk: Option<Vec<u8>>, populated in both read_record() and __next__() after reading the record bytes, before the parse step. Exposed via #[getter] last_chunk on _MARCReader. - Python wrapper: MARCReader.__init__ initializes current_exception and current_chunk to None; __next__ reads _inner.last_chunk and stashes both, mirroring pymarc semantics. The clean-read path resets current_exception so callers don't see stale state. - Type stub: _MARCReader.last_chunk getter. Tests: tests/python/test_pymarc_compat_accessors.py covers default state, clean reads, permissive swallow, strict-mode raises, and cross-library iteration shape parity (the cross-library test importorskips pymarc so the project doesn't gain a hard dependency). Documented two divergences from pymarc in docs/guides/migration-from-pymarc.md: - Encoding strictness — mrrc raises EncodingError on invalid UTF-8 in subfield values where pymarc applies lossy substitution silently. Same iteration shape under permissive=True; callers using `except Exception:` keep working. - current_chunk None on byte-read errors — when the underlying read of the next record's bytes fails before parsing begins (truncated stream, I/O error), current_chunk may be None even though current_exception is set. For parse failures of fully-read chunks (the common case), current_chunk carries the full record bytes as pymarc does. Bead: bd-0x73.7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merging this PR will degrade performance by 18.51%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | WallTime | test_pipeline_parallel_4x_10k_threaded |
118.2 ms | 102.1 ms | +15.81% |
| ⚡ | WallTime | test_pipeline_sequential_extraction_4x_10k |
118.5 ms | 104.1 ms | +13.86% |
| ⚡ | WallTime | test_process_4_files_sequential |
99.8 ms | 84.5 ms | +18.13% |
| ⚡ | WallTime | test_process_4_files_parallel_4_threads |
118.3 ms | 97 ms | +21.96% |
| ❌ | WallTime | test_threaded_reading_2x_10k |
285.3 ms | 350.1 ms | -18.51% |
| ⚡ | WallTime | test_pipeline_sequential_1x_10k |
22.4 ms | 20.3 ms | +10.21% |
| ⚡ | WallTime | test_pipeline_sequential_4x_10k |
99.6 ms | 85.5 ms | +16.55% |
| ⚡ | WallTime | test_pipeline_parallel_2x_10k_threaded |
54.5 ms | 47.5 ms | +14.71% |
Comparing feat/permissive-pymarc-compat (b1e0f2e) with main (297626a)
Footnotes
-
16 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
…ccessors Four additional pytest cases for the pymarc-compat accessors: - Multi-record sequencing: walk a 3-record stream; current_chunk reflects each successive record's leader-declared length and content, not just the first. - EOF retention: after StopIteration, current_chunk and current_exception retain their last values (matches pymarc). - Default strict mode: current_chunk tracks on every successful chunk read regardless of permissive, so even strict callers get it populated. - iter_with_errors independence: the alternate diagnostic surface (yielded tuple) does not clobber the pymarc-compat accessors; they stay at their initial None when only iter_with_errors is exercised. Documents the two surfaces' non-interference contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dchud
added a commit
that referenced
this pull request
May 11, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dchud
added a commit
that referenced
this pull request
May 11, 2026
Splits the responsibilities cleanly between the two error-handling docs to remove the duplication my earlier commit introduced: docs/reference/error-handling.md (the catalog — for users writing new mrrc code): - Updates the stale "MARCReader.current_exception / current_chunk" section. The reference said "mrrc does not currently expose these reader-side attributes" — predated PR #182 (bd-0x73.7). Replaced with the actual landed semantics plus the encoding-strictness and byte-read-error divergences. - Adds a "Known hierarchy divergences from pymarc" subsection noting the FatalReaderError parentage difference (its tree diagram showed the actual mrrc shape but didn't call out the comparison to pymarc) and the PymarcException → MrrcException base-class rename. - Expands the omissions row in the mapping table into a full sub-table with rationale and mrrc-equivalent behavior for NoFieldsFound, WriteNeedsRecord, NoActiveFile, BadLeaderValue, MissingLinkedFields. docs/guides/migration-from-pymarc.md (the porter's recipe — for users with existing pymarc code in front of them): - Drops the duplicated class-name mapping table and the omissions table that just landed; replaced with a one-sentence summary and a link to the reference for the catalog. - Keeps the inline migration recipes (base-class rename, the two FatalReaderError catch patterns) since porters want them inline, not "go read the reference." - Trims explanatory prose for the FatalReaderError divergence and links to the reference for the rationale. Net diff: -66 noise lines in the guide, +93 substance lines in the reference. CHANGELOG entry updated to reflect the split. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dchud
added a commit
that referenced
this pull request
May 11, 2026
* Pymarc exception-class-name parity verification + documentation Closes #155 bd-0x73.6's narrow scope per the 2026-05-03 re-scope: verify that a pymarc-shaped error-handling code path keeps working after a port, and document the small gaps where it doesn't. Adds tests/python/test_pymarc_exception_parity.py — 47 cases that pin: - Every pymarc 5.3.1 exception class name a typical except clause catches is importable from both mrrc.exceptions and the top-level mrrc package (10 names: RecordLengthInvalid, RecordLeaderInvalid, RecordDirectoryInvalid, BaseAddressInvalid, BaseAddressNotFound, EndOfRecordNotFound, TruncatedRecord, FieldNotFound, FatalReaderError, BadSubfieldCodeWarning). - Each shared-name class is a proper Exception/Warning subclass and, for the exceptions, inherits from MrrcException so callers can catch the mrrc base class instead of PymarcException. - pymarc names mrrc deliberately omits (NoFieldsFound, WriteNeedsRecord, NoActiveFile, BadLeaderValue, MissingLinkedFields) remain absent — guarding against accidental drift — and each omission carries a one-line rationale in the test source. - A pymarc-shape except RecordDirectoryInvalid: actually catches what mrrc raises on the E101 non-ASCII-tag fixture. - A port using except MrrcException: catches every typed variant. Extends docs/guides/migration-from-pymarc.md with the class-name mapping table, the omissions table with rationale per omission, and an explicit section on the FatalReaderError inheritance-hierarchy divergence (mrrc keeps the fatal record-level classes as siblings under MrrcException; FatalReaderError in mrrc means "recovered-error cap exceeded" specifically). Two pymarc-compatible migration recipes spelled out for that divergence. No code changes — the verification is the deliverable. Bead: bd-0x73.6 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Consolidate pymarc-compat docs: catalog in reference, recipes in guide Splits the responsibilities cleanly between the two error-handling docs to remove the duplication my earlier commit introduced: docs/reference/error-handling.md (the catalog — for users writing new mrrc code): - Updates the stale "MARCReader.current_exception / current_chunk" section. The reference said "mrrc does not currently expose these reader-side attributes" — predated PR #182 (bd-0x73.7). Replaced with the actual landed semantics plus the encoding-strictness and byte-read-error divergences. - Adds a "Known hierarchy divergences from pymarc" subsection noting the FatalReaderError parentage difference (its tree diagram showed the actual mrrc shape but didn't call out the comparison to pymarc) and the PymarcException → MrrcException base-class rename. - Expands the omissions row in the mapping table into a full sub-table with rationale and mrrc-equivalent behavior for NoFieldsFound, WriteNeedsRecord, NoActiveFile, BadLeaderValue, MissingLinkedFields. docs/guides/migration-from-pymarc.md (the porter's recipe — for users with existing pymarc code in front of them): - Drops the duplicated class-name mapping table and the omissions table that just landed; replaced with a one-sentence summary and a link to the reference for the catalog. - Keeps the inline migration recipes (base-class rename, the two FatalReaderError catch patterns) since porters want them inline, not "go read the reference." - Trims explanatory prose for the FatalReaderError divergence and links to the reference for the rationale. Net diff: -66 noise lines in the guide, +93 substance lines in the reference. CHANGELOG entry updated to reflect the split. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #156
Summary
After each
MARCReader.__next__step, two pymarc-compatible accessors carry diagnostic information about what was just read:current_chunk— bytes of the record just read from the source (declared length per the leader). Set on every successful chunk read, regardless of whether the parse step then succeeded or failed.current_exception— typedMrrcExceptionswallowed bypermissive=True, orNoneon a clean read.This is the bead's recommended Option A: enrich
permissive=Truerather than add a newpymarc_compatmode. Existing pymarc-shape error-diagnosis code (if reader.current_exception: ...) works against mrrc unchanged.Implementation
src-python/src/readers.rs):PyMARCReadercarrieslast_chunk: Option<Vec<u8>>, populated in bothread_record()and__next__()after the chunk is read, before the parse step. Exposed via#[getter] fn last_chunk.mrrc/__init__.py):MARCReader.__init__initializes the two attributes toNone;__next__reads_inner.last_chunkand stashes both. Clean-read path resetscurrent_exceptionso callers don't see stale state.mrrc/_mrrc.pyi): documented_MARCReader.last_chunkproperty.Tests
tests/python/test_pymarc_compat_accessors.pycovers:Nonebefore any iteration).current_chunkmatches leader-declared length,current_exceptionstaysNone.current_exceptionafter a prior failure.pytest.importorskip("pymarc")so we don't gain a hard dep.Documented divergences
Updated
docs/guides/migration-from-pymarc.md:EncodingError(swallowed viacurrent_exceptionunderpermissive=True) on invalid UTF-8 in subfield values where pymarc applies lossy substitution silently. Iteration shape unchanged.current_chunkon byte-read errors: When the underlying read of the next record's bytes fails before parsing begins (truncated stream, I/O error),current_chunkmay beNoneeven thoughcurrent_exceptionis set. For parse failures of fully-read chunks (the common case),current_chunkcarries the full record bytes as pymarc does.Verification
cargo test --test fuzz_regressions+ full Python suite — 538 Rust + 700+ Python tests passtests/python/test_pymarc_compat_accessors.py— 5 pass, 1 skip (pymarc parity, since pymarc isn't a project dep).cargo/check.sh --quickand full — all greenBead: bd-0x73.7
🤖 Generated with Claude Code