Skip to content

Pymarc-compatible current_exception and current_chunk accessors on MARCReader#182

Merged
dchud merged 2 commits into
mainfrom
feat/permissive-pymarc-compat
May 11, 2026
Merged

Pymarc-compatible current_exception and current_chunk accessors on MARCReader#182
dchud merged 2 commits into
mainfrom
feat/permissive-pymarc-compat

Conversation

@dchud
Copy link
Copy Markdown
Owner

@dchud dchud commented May 10, 2026

Closes #156

Summary

After each MARCReader.__next__ step, two pymarc-compatible accessors carry diagnostic information about what was just read:

  • current_chunk — bytes of the record just read from the source (declared length per the leader). Set on every successful chunk read, regardless of whether the parse step then succeeded or failed.
  • current_exception — typed MrrcException swallowed by permissive=True, or None on a clean read.

This is the bead's recommended Option A: enrich permissive=True rather than add a new pymarc_compat mode. Existing pymarc-shape error-diagnosis code (if reader.current_exception: ...) works against mrrc unchanged.

Implementation

  • Rust binding (src-python/src/readers.rs): PyMARCReader carries last_chunk: Option<Vec<u8>>, populated in both read_record() and __next__() after the chunk is read, before the parse step. Exposed via #[getter] fn last_chunk.
  • Python wrapper (mrrc/__init__.py): MARCReader.__init__ initializes the two attributes to None; __next__ reads _inner.last_chunk and stashes both. Clean-read path resets current_exception so callers don't see stale state.
  • Type stub (mrrc/_mrrc.pyi): documented _MARCReader.last_chunk property.

Tests

tests/python/test_pymarc_compat_accessors.py covers:

  • Default state (both None before any iteration).
  • Clean read: current_chunk matches leader-declared length, current_exception stays None.
  • Clean iteration clears stale current_exception after a prior failure.
  • Permissive swallow: both accessors populated, full chunk bytes preserved.
  • Strict mode raises (the accessors aren't the intended catch surface).
  • Cross-library iteration-shape parity vs pymarc — guarded by pytest.importorskip("pymarc") so we don't gain a hard dep.

Documented divergences

Updated docs/guides/migration-from-pymarc.md:

  • Encoding strictness: mrrc raises EncodingError (swallowed via current_exception under permissive=True) on invalid UTF-8 in subfield values where pymarc applies lossy substitution silently. Iteration shape unchanged.
  • current_chunk on byte-read errors: When the underlying read of the next record's bytes fails before parsing begins (truncated stream, I/O error), current_chunk may be None even though current_exception is set. For parse failures of fully-read chunks (the common case), current_chunk carries the full record bytes as pymarc does.

Verification

  • cargo test --test fuzz_regressions + full Python suite — 538 Rust + 700+ Python tests pass
  • tests/python/test_pymarc_compat_accessors.py — 5 pass, 1 skip (pymarc parity, since pymarc isn't a project dep)
  • .cargo/check.sh --quick and full — all green
  • PR CI green on all platforms

Bead: bd-0x73.7

🤖 Generated with Claude Code

Closes #156

The Python MARCReader wrapper now matches pymarc's default-reader
shape for the diagnostic accessors. After each __next__ step:

- current_chunk holds the bytes of the record just read from the
  source (declared-length bytes per the leader). Set on every
  successful chunk read, regardless of whether the parse step then
  succeeds or fails.
- current_exception holds the typed MrrcException caught by
  permissive=True, or None on a clean read.

Implementation:

- Rust binding: PyMARCReader carries last_chunk: Option<Vec<u8>>,
  populated in both read_record() and __next__() after reading the
  record bytes, before the parse step. Exposed via #[getter]
  last_chunk on _MARCReader.
- Python wrapper: MARCReader.__init__ initializes current_exception
  and current_chunk to None; __next__ reads _inner.last_chunk and
  stashes both, mirroring pymarc semantics. The clean-read path
  resets current_exception so callers don't see stale state.
- Type stub: _MARCReader.last_chunk getter.

Tests: tests/python/test_pymarc_compat_accessors.py covers default
state, clean reads, permissive swallow, strict-mode raises, and
cross-library iteration shape parity (the cross-library test
importorskips pymarc so the project doesn't gain a hard dependency).

Documented two divergences from pymarc in
docs/guides/migration-from-pymarc.md:

- Encoding strictness — mrrc raises EncodingError on invalid UTF-8 in
  subfield values where pymarc applies lossy substitution silently.
  Same iteration shape under permissive=True; callers using
  `except Exception:` keep working.
- current_chunk None on byte-read errors — when the underlying
  read of the next record's bytes fails before parsing begins
  (truncated stream, I/O error), current_chunk may be None even
  though current_exception is set. For parse failures of
  fully-read chunks (the common case), current_chunk carries
  the full record bytes as pymarc does.

Bead: bd-0x73.7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dchud dchud self-assigned this May 10, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 10, 2026

Merging this PR will degrade performance by 18.51%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 7 improved benchmarks
❌ 1 regressed benchmark
✅ 52 untouched benchmarks
⏩ 16 skipped benchmarks1

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
WallTime test_pipeline_parallel_4x_10k_threaded 118.2 ms 102.1 ms +15.81%
WallTime test_pipeline_sequential_extraction_4x_10k 118.5 ms 104.1 ms +13.86%
WallTime test_process_4_files_sequential 99.8 ms 84.5 ms +18.13%
WallTime test_process_4_files_parallel_4_threads 118.3 ms 97 ms +21.96%
WallTime test_threaded_reading_2x_10k 285.3 ms 350.1 ms -18.51%
WallTime test_pipeline_sequential_1x_10k 22.4 ms 20.3 ms +10.21%
WallTime test_pipeline_sequential_4x_10k 99.6 ms 85.5 ms +16.55%
WallTime test_pipeline_parallel_2x_10k_threaded 54.5 ms 47.5 ms +14.71%

Comparing feat/permissive-pymarc-compat (b1e0f2e) with main (297626a)

Open in CodSpeed

Footnotes

  1. 16 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

…ccessors

Four additional pytest cases for the pymarc-compat accessors:

- Multi-record sequencing: walk a 3-record stream; current_chunk
  reflects each successive record's leader-declared length and
  content, not just the first.
- EOF retention: after StopIteration, current_chunk and
  current_exception retain their last values (matches pymarc).
- Default strict mode: current_chunk tracks on every successful
  chunk read regardless of permissive, so even strict callers get
  it populated.
- iter_with_errors independence: the alternate diagnostic surface
  (yielded tuple) does not clobber the pymarc-compat accessors;
  they stay at their initial None when only iter_with_errors is
  exercised. Documents the two surfaces' non-interference contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dchud dchud merged commit ffeccfd into main May 11, 2026
47 of 48 checks passed
@dchud dchud deleted the feat/permissive-pymarc-compat branch May 11, 2026 00:01
dchud added a commit that referenced this pull request May 11, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dchud added a commit that referenced this pull request May 11, 2026
Splits the responsibilities cleanly between the two error-handling
docs to remove the duplication my earlier commit introduced:

docs/reference/error-handling.md (the catalog — for users writing
new mrrc code):
- Updates the stale "MARCReader.current_exception / current_chunk"
  section. The reference said "mrrc does not currently expose these
  reader-side attributes" — predated PR #182 (bd-0x73.7). Replaced
  with the actual landed semantics plus the encoding-strictness
  and byte-read-error divergences.
- Adds a "Known hierarchy divergences from pymarc" subsection
  noting the FatalReaderError parentage difference (its tree
  diagram showed the actual mrrc shape but didn't call out the
  comparison to pymarc) and the PymarcException → MrrcException
  base-class rename.
- Expands the omissions row in the mapping table into a full
  sub-table with rationale and mrrc-equivalent behavior for
  NoFieldsFound, WriteNeedsRecord, NoActiveFile, BadLeaderValue,
  MissingLinkedFields.

docs/guides/migration-from-pymarc.md (the porter's recipe — for
users with existing pymarc code in front of them):
- Drops the duplicated class-name mapping table and the omissions
  table that just landed; replaced with a one-sentence summary and
  a link to the reference for the catalog.
- Keeps the inline migration recipes (base-class rename, the two
  FatalReaderError catch patterns) since porters want them inline,
  not "go read the reference."
- Trims explanatory prose for the FatalReaderError divergence and
  links to the reference for the rationale.

Net diff: -66 noise lines in the guide, +93 substance lines in the
reference. CHANGELOG entry updated to reflect the split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dchud added a commit that referenced this pull request May 11, 2026
* Pymarc exception-class-name parity verification + documentation

Closes #155

bd-0x73.6's narrow scope per the 2026-05-03 re-scope: verify that a
pymarc-shaped error-handling code path keeps working after a port,
and document the small gaps where it doesn't.

Adds tests/python/test_pymarc_exception_parity.py — 47 cases that
pin:

- Every pymarc 5.3.1 exception class name a typical except clause
  catches is importable from both mrrc.exceptions and the top-level
  mrrc package (10 names: RecordLengthInvalid, RecordLeaderInvalid,
  RecordDirectoryInvalid, BaseAddressInvalid, BaseAddressNotFound,
  EndOfRecordNotFound, TruncatedRecord, FieldNotFound,
  FatalReaderError, BadSubfieldCodeWarning).
- Each shared-name class is a proper Exception/Warning subclass and,
  for the exceptions, inherits from MrrcException so callers can
  catch the mrrc base class instead of PymarcException.
- pymarc names mrrc deliberately omits (NoFieldsFound,
  WriteNeedsRecord, NoActiveFile, BadLeaderValue, MissingLinkedFields)
  remain absent — guarding against accidental drift — and each
  omission carries a one-line rationale in the test source.
- A pymarc-shape except RecordDirectoryInvalid: actually catches
  what mrrc raises on the E101 non-ASCII-tag fixture.
- A port using except MrrcException: catches every typed variant.

Extends docs/guides/migration-from-pymarc.md with the class-name
mapping table, the omissions table with rationale per omission, and
an explicit section on the FatalReaderError inheritance-hierarchy
divergence (mrrc keeps the fatal record-level classes as siblings
under MrrcException; FatalReaderError in mrrc means "recovered-error
cap exceeded" specifically). Two pymarc-compatible migration recipes
spelled out for that divergence.

No code changes — the verification is the deliverable.

Bead: bd-0x73.6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Consolidate pymarc-compat docs: catalog in reference, recipes in guide

Splits the responsibilities cleanly between the two error-handling
docs to remove the duplication my earlier commit introduced:

docs/reference/error-handling.md (the catalog — for users writing
new mrrc code):
- Updates the stale "MARCReader.current_exception / current_chunk"
  section. The reference said "mrrc does not currently expose these
  reader-side attributes" — predated PR #182 (bd-0x73.7). Replaced
  with the actual landed semantics plus the encoding-strictness
  and byte-read-error divergences.
- Adds a "Known hierarchy divergences from pymarc" subsection
  noting the FatalReaderError parentage difference (its tree
  diagram showed the actual mrrc shape but didn't call out the
  comparison to pymarc) and the PymarcException → MrrcException
  base-class rename.
- Expands the omissions row in the mapping table into a full
  sub-table with rationale and mrrc-equivalent behavior for
  NoFieldsFound, WriteNeedsRecord, NoActiveFile, BadLeaderValue,
  MissingLinkedFields.

docs/guides/migration-from-pymarc.md (the porter's recipe — for
users with existing pymarc code in front of them):
- Drops the duplicated class-name mapping table and the omissions
  table that just landed; replaced with a one-sentence summary and
  a link to the reference for the catalog.
- Keeps the inline migration recipes (base-class rename, the two
  FatalReaderError catch patterns) since porters want them inline,
  not "go read the reference."
- Trims explanatory prose for the FatalReaderError divergence and
  links to the reference for the rationale.

Net diff: -66 noise lines in the guide, +93 substance lines in the
reference. CHANGELOG entry updated to reflect the split.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

At-least-one strictness mode that matches pymarc default behavior

1 participant