Skip to content

feat(error): enrich MarcError with positional context (bd-qzew PR2)#100

Merged
dchud merged 19 commits intomainfrom
feat/bd-qzew-error-enrichment
Apr 19, 2026
Merged

feat(error): enrich MarcError with positional context (bd-qzew PR2)#100
dchud merged 19 commits intomainfrom
feat/bd-qzew-error-enrichment

Conversation

@dchud
Copy link
Copy Markdown
Owner

@dchud dchud commented Apr 19, 2026

Summary

PR2 of 2 for bd-qzew. Restructures MarcError from string-message tuple variants to structured struct variants carrying positional metadata (record index, byte offsets, 001 control number, source filename, field tag, indicator/subfield position, the bad bytes, etc.), threads a ParseContext through all three readers, and wires typed Python exception classes with the same metadata as kwargs.

PR1 (extract shared ISO 2709 primitives) merged earlier in this branch's history.

What's in this PR

11 commits, organized as:

  • refactor(error) — restructure variants, hand-write Display/detailed, add truncate_bytes helper, #[source] chains
  • feat(iso2709) — introduce ParseContext + ctx.err_* helpers + shared parse_data_field/parse_subfields
  • feat(readers) — wire ParseContext through bib/authority/holdings; with_source/from_path constructors; opportunistic 001 extraction
  • feat(formats) — MARCXML enrichment via XmlError with Box<dyn Error> cause
  • feat(writer) — writer-path errors carry record_index + 001
  • feat(python) — typed exception hierarchy (mrrc-specific subclasses extend pymarc parents); kwargs/repr/_format/detailed; PyO3 typed-exception construction; ParseError routes through unified mapping
  • test — snapshot tests via insta (Rust) and syrupy (Python)
  • docsdocs/reference/error-handling.md + CHANGELOG [Unreleased] entry
  • Plus 3 fixup commits applying findings from architect review (G1), security audit (G2), and pre-merge comprehensive review (G4)

Quality gates

Per the bead notes' four-gate review plan:

  • G1 — Architect review of the new variant shape after commit 1. Surfaced 5 should-fix items (empty-header Display fallback, detailed() label alignment bug, scope migration helpers to pub(crate), etc.). All applied as a fixup commit.
  • G2 — Security audit of the PyO3 typed-exception construction + ParseError unification after commit 6. Surfaced 2 medium-severity issues: pickle __setstate__ blindly accepting arbitrary keys (could shadow methods), and missing __cause__ chaining when typed-exception construction fails. Both fixed as a fixup commit.
  • G3 — Snapshot baseline review during commit 7. Reviewed every Rust + Python snapshot baseline before accepting; uncovered an alignment drift between languages on TruncatedRecord and tightened the Python detailed() to mirror Rust byte-for-byte.
  • G4 — Comprehensive pre-merge review before opening this PR. Surfaced 3 blockers: cross-language detailed() drift on InvalidIndicator, TruncatedRecord losing positional context in Strict mode, and recovery.rs lenient-path positionless errors. The first two are fixed in the final commit. The third is filed as a follow-up bead because that dispatch path is currently dead code (documented in PR1).

Pymarc compatibility

Strict superset:

  • All pymarc class names exist with the same parents.
  • New mrrc-specific classes (InvalidIndicator, BadSubfieldCode, InvalidField, TruncatedRecord, EncodingError, XmlError, JsonError, WriterError) extend the closest pymarc parent so existing except RecordDirectoryInvalid:-style catches keep catching them.
  • Bare-constructor compatibility preserved: raise RecordLeaderInvalid() still works.
  • Pickle round-trip preserves all positional attributes (with a security whitelist on incoming attribute names).

See docs/reference/error-handling.md for the full hierarchy diagram, the per-variant field reference, and the migration story scoped to the exception layer only.

Follow-up beads (filed during G4)

  • bd-3ois — Thread ParseContext through recovery.rs salvage path
  • bd-6zvz — Enrich MARCJSON parse-path errors via JsonError variant (the actual serde_json parse lives in src-python/src/formats.rs)
  • bd-4axe — Populate ParseError positional metadata at boundary-scanner sites
  • bd-wc5r — FFI integration test verifying typed Python exception construction for each MarcError variant
  • bd-umwp — Add typed exception class stubs to mrrc/_mrrc.pyi
  • Plus bd-uj7c (filed during G2) — Cap per-stream error count in lenient/permissive recovery

Net diff

41 files changed, +3385 / -489.

Area Change
src/error.rs Restructured enum + Display/detailed + truncate_bytes + helpers
src/iso2709.rs New ParseContext + ctx.err_* + shared subfield parser
src/{reader,authority_reader,holdings_reader}.rs Wire ParseContext, add with_source/from_path, opportunistic 001
src/marcxml.rs XmlError construction at parse sites
src/writer.rs WriterError with record context
mrrc/__init__.py Exception class hierarchy + kwargs + pickle + format/detailed
src-python/src/{error,parse_error}.rs PyO3 typed-exception construction
tests/python/test_errors.py (new) 36 tests (hierarchy, bare-construct, pickle, snapshot)
src/snapshots/ (new) 6 Rust snapshot baselines
tests/python/__snapshots__/ (new) 7 Python snapshot baselines
docs/reference/error-handling.md (new) Full reference page
CHANGELOG.md [Unreleased] entry
Cargo.toml, pyproject.toml insta, syrupy dev deps

Test plan

  • .cargo/check.sh clean locally (rustfmt, clippy, 495 unit, 30 doc, 643 Python tests, audit, maturin)
  • Pre-commit hook ran the check on every commit
  • All four quality gates run with findings addressed (G1, G2 inline; G3 baseline review; G4 blockers fixed, should-fixes filed as follow-ups)
  • CI green on all platforms (macOS, Linux, Windows)

🤖 Generated with Claude Code

dchud and others added 11 commits April 19, 2026 15:16
…xt fields

Replaces the previous seven string-message variants with a structured set
of variants that each carry positional metadata (record index, byte
offsets, 001 control number, source filename, field tag, and the bad
bytes capped at 32). The default Display impl produces an actionable
one-liner with the byte offset visually subordinate; a separate
detailed() method renders a multi-line diagnostic.

Variants mirror the per-variant requirements documented in the
bd-qzew bead: InvalidLeader, RecordLengthInvalid, BaseAddressInvalid,
BaseAddressNotFound, DirectoryInvalid, TruncatedRecord,
EndOfRecordNotFound, InvalidIndicator, BadSubfieldCode, InvalidField,
EncodingError, FieldNotFound, IoError, XmlError, JsonError, WriterError.
Wrapping variants (IoError, XmlError, JsonError) carry their underlying
cause via #[source] so std::error::Error::source() walks the chain.

This commit is the wire-shape change only: every existing call site is
mechanically updated to construct the new variants via positional-less
helper constructors (invalid_field_msg, encoding_msg, truncated_msg,
leader_msg, writer_msg). All optional positional fields are populated
with None at this stage. Subsequent commits introduce ParseContext and
thread real positional metadata through the readers, formats, writer,
and Python bindings.

Other changes:
- truncate_bytes helper for the 32-byte found cap
- Field renamed from `source` to `source_name` to avoid collision with
  thiserror's auto-detected error-source field name
- src-python error mapping consumes the new shape via a single
  placeholder fall-through that produces bare PyValueError/PyIOError
  with the new Display output; typed-exception construction lands in a
  later commit
- crate-level allow(clippy::result_large_err): MarcError carries
  structured metadata by design; boxing every Result<T, MarcError>
  would add an allocation without addressing a real perf concern

All 495 Rust tests + 611 Python tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ment, scope migration helpers

Follow-up polish from architecture review of the previous error-type
restructure.

- Display::render_oneline: when no positional context fields are
  populated, lead with the variant name (e.g., "BaseAddressNotFound:
  base address not found") so the rendered message is at least as
  informative as the previous string-message form. Otherwise an empty
  header produced a body-only string that obscured what kind of error
  was raised.
- detailed(): label-padding bug fixed. The previous arithmetic
  (10 - label.len() - 1) underflowed for labels longer than 10
  characters (e.g., "record-relative:") and produced visually drifting
  columns. Now compute the padding from the widest label in the actual
  output so columns align consistently regardless of which fields are
  populated.
- Migration constructor helpers (invalid_field_msg, encoding_msg,
  truncated_msg, leader_msg) demoted to pub(crate). They exist to
  support the mechanical update of internal call sites and are not part
  of the long-term API; the proper construction path will be the
  ParseContext::error_here helper added in the next change. The unused
  writer_msg helper is removed entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the parsing context that error-enrichment work threads through the
ISO 2709 reader stack. ParseContext carries source filename, record
index, stream and record-relative byte offsets, the 001 control number
once available, and the field/subfield/indicator currently being
parsed. A family of ctx.err_* methods constructs the appropriate
MarcError variant with all positional fields auto-populated from the
context, so call sites don't have to repeat the metadata at every
error-construction site.

Also lands the deferred-from-the-extraction-PR shared subfield
parsing primitives:

- DataFieldParseConfig with three named const profiles
  (BIBLIOGRAPHIC, AUTHORITY, HOLDINGS) capturing each existing reader's
  historical behavior for two orthogonal axes:
    * structure: Strict vs Permissive handling of unrecognized bytes
      between subfields
    * utf8: Lossy vs Strict decoding of subfield value bytes
- parse_data_field: parses indicators + subfields into a Field per the
  given config, attaching positional metadata to errors via the context
- parse_subfields: walks subfield bytes after the indicators

This commit is purely additive. The existing iso2709 primitives and
all three readers are unchanged; the next commit rewires them to use
ParseContext and the shared subfield parser.

All checks pass (495 Rust tests + 611 Python tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ings readers

Each reader now owns a ParseContext that tracks the current record
index, byte offsets, source filename, and field-tag-being-parsed, so
errors raised anywhere in the read loop carry consistent positional
metadata. Also adds opportunistic 001 extraction (the value is captured
into ctx.record_control_number as soon as the 001 control field is
parsed, so all subsequent errors for that record include it) and
filename plumbing via two new constructors per reader:

- with_source(name): builder method that sets source_name on the
  context for callers that have a name (Python file objects, in-memory
  cursors, custom stream identifiers).
- from_path(path): convenience constructor on Reader<File> that opens
  the file and sets source_name to path.display().to_string().

Subfield parsing now goes through the shared
iso2709::parse_data_field with the appropriate per-reader
DataFieldParseConfig:
- bib: BIBLIOGRAPHIC (strict structure, lossy UTF-8)
- authority: AUTHORITY (permissive structure, lossy UTF-8)
- holdings: HOLDINGS (permissive structure, strict UTF-8)
This eliminates the three near-duplicate inline subfield-parsing loops
that the readers carried, while preserving each reader's historical
semantic behavior. Holdings's two-pass IndexMap dispatch is collapsed
into the same single-pass tag-routing pattern as authority.

Error-construction sites that previously used the migration helpers
(MarcError::invalid_field_msg / truncated_msg / etc.) are rewritten to
use ctx.err_* methods, so errors get full positional context. Helpers
constructed inside the iso2709 primitives (parse_4digits,
parse_5digits) still raise context-less errors when called from the
directory walk; threading ctx into those primitives is intentionally
deferred so primitive signatures don't change in this commit.

All 495 Rust tests + 611 Python tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ror variant

XML parse failures in marcxml_to_record and marcxml_to_records now
construct MarcError::XmlError wrapping the underlying quick_xml error
via #[source], so std::error::Error::source() walks the chain and
callers can downcast to inspect the original parser failure.

Adds ParseContext::err_xml and ParseContext::err_json constructors for
the format-level wrappers. The XmlError variant's cause type changes
to Box<dyn Error + Send + Sync> so any of quick_xml's error families
(Error, DeError, etc.) can be wrapped uniformly.

MARCJSON-structural errors (missing leader, wrong shape, etc.) stay as
InvalidField since they describe MARC-level structural problems rather
than JSON parser failures; the marcjson module operates on already-
parsed serde_json::Value inputs and has no JSON parser sites of its
own. JsonError remains available for any future call site that does
own a serde_json parse step.

Byte-offset extraction for both formats is left as Option<None> for
now: quick_xml does not expose a byte position from DeError, and
translating serde_json::Error::line()/column() to a byte offset
requires the original input bytes (out of scope for the current
abstraction). The underlying parser positions remain available via
std::error::Error::source() and the wrapped cause's own Display.

All 495 Rust tests + 611 Python tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three writer error sites now construct MarcError::WriterError with the
1-based record index in the output stream and the record's 001 control
number where available, rather than generic InvalidField wrappers:

- "Cannot write to a finished writer" raises WriterError with no
  record context (no record is being processed at that point).
- u32 overflow on record length raises WriterError with the record
  index, 001, and the actual byte count in the message.
- u32 overflow on base address: same shape.

The record_index/record_control_number snapshot happens once at the
top of write_record so multiple potential failure sites share the
same context without re-extracting.

All 495 Rust tests + 611 Python tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the Python exception class hierarchy to expose the structured
positional metadata that the Rust MarcError carries, while preserving
pymarc-style catch behavior.

mrrc/__init__.py:

- All exception classes now share a _MrrcExceptionBase mixin that
  accepts ten positional kwargs (record_index, record_control_number,
  field_tag, indicator_position, subfield_code, found, expected,
  byte_offset, record_byte_offset, source). Every kwarg is optional so
  bare-constructor compatibility is preserved (raise RecordLeaderInvalid()
  still works).
- The mixin implements __reduce__ and __setstate__ so kwargs survive a
  pickle round-trip. By default Exception.__reduce__ only round-trips
  args, dropping instance __dict__; the explicit reducer fixes this.
- _format() produces the actionable one-liner shown by str(err);
  detailed() produces the multi-line diagnostic. Both shapes mirror the
  Rust Display and MarcError::detailed() outputs.
- New mrrc-specific subclasses extend the closest pymarc parent so
  pymarc-style catches still trigger:
    InvalidIndicator(RecordDirectoryInvalid)
    BadSubfieldCode(RecordDirectoryInvalid)
    InvalidField(RecordDirectoryInvalid)
    TruncatedRecord(EndOfRecordNotFound)
    EncodingError(MrrcException)
    XmlError(MrrcException)
    JsonError(MrrcException)
    WriterError(MrrcException)

src-python/src/error.rs:

- marc_error_to_py_err is rewritten to construct the typed Python
  exception class corresponding to each MarcError variant, passing
  positional context as kwargs. Falls through to PyValueError /
  PyIOError if typed-exception construction fails for any reason
  (missing class, kwarg rejected) so an error is never dropped silently.
- IoError variants continue to surface as Python's built-in OSError via
  PyIOError, matching pymarc.

src-python/src/parse_error.rs:

- ParseError::to_py_err now routes each variant through
  marc_error_to_py_err so the boundary-scanner / buffered-reader paths
  raise the same typed Python exception classes as the synchronous
  reader path. The variant API is unchanged so the many existing
  ParseError construction sites need no edits.

All 495 Rust tests + 611 Python tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…as __cause__

Two refinements from the security audit of the typed-exception
construction layer.

mrrc/__init__.py:

- __setstate__ now whitelists incoming state keys against the per-class
  allowed set (_POSITIONAL_FIELDS plus _pickle_extra_fields). Without
  this, a maliciously-crafted pickle could setattr arbitrary names —
  including method names — and shadow class methods on the instance.
  Pickle deserialization itself is the RCE primitive (anyone unpickling
  untrusted data has already lost), but blind setattr unnecessarily
  amplifies the blast radius from "you executed attacker code" to
  "subsequent operations on the unpickled object also execute attacker
  logic." Cheap to fix, no API impact.
- Subclasses with extra __init__ kwargs (InvalidField, TruncatedRecord,
  EncodingError, XmlError, JsonError, WriterError) now declare those
  fields via _pickle_extra_fields so they actually round-trip through
  pickle instead of being silently dropped. Functional bug, surfaced
  alongside the security fix.

src-python/src/error.rs:

- Typed-exception construction failures now chain as __cause__ on the
  fallback PyValueError/PyIOError. A broken install (mrrc module not
  importable, class missing, kwargs rejected) is now debuggable
  instead of silently swallowed.

Filed bd-uj7c for the related "cap per-stream error count in lenient
recovery" concern, which is out of scope for this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds insta and syrupy as dev dependencies and pins the externally-
visible error format for representative MarcError variants and Python
typed exception classes.

Rust (src/error.rs):
- 6 snapshot tests covering Display (one-liner) and detailed()
  (multi-line) for representative variants: InvalidIndicator with full
  context, BaseAddressNotFound with no context (kind-name fallback path
  introduced in the earlier fixup), DirectoryInvalid with truncated
  found bytes, TruncatedRecord detailed(), and WriterError with
  record context.
- Snapshots live in src/snapshots/. Run `cargo insta review` to
  inspect/accept changes when these drift.

Python (tests/python/test_errors.py):
- 36 tests total: hierarchy assertions (subclass relationships,
  pymarc-style catch behavior), bare-constructor compatibility for
  every exception class, pickle round-trip (including subclass
  extras and the security whitelist), and 7 snapshot tests
  pinning str/repr/detailed output for the same variants as the
  Rust suite.
- Snapshots live in tests/python/__snapshots__/ as syrupy .ambr files.

Format alignment:
- Python detailed() reworked to mirror the Rust label-padding
  algorithm (compute width from widest label across the actual
  populated rows, pad each label to that width). Output is now
  byte-for-byte identical to Rust's MarcError::detailed() for the
  same variant + populated fields.
- Subclasses with extra context fields (TruncatedRecord:
  expected_length/actual_length) hook into _extra_detail_rows so
  detailed() doesn't need a per-class override of the whole method.

Total: 495 Rust unit tests (was 489) + 643 Python tests (was 611) +
30 doc tests + audit + maturin all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds docs/reference/error-handling.md documenting:
- The exception hierarchy (pymarc-named parents + mrrc-specific
  subclasses, with the "you also catch X" subclass behavior table)
- Pymarc exception compatibility, scoped tightly to the exception
  layer (with explicit out-of-scope list for record/reader/writer
  APIs and format coverage)
- The optional symbol-level alias snippet for projects mid-migration
- Three before/after patterns showing what callers gain on the
  exception layer (same-catch-more-context, opt-in granularity,
  diagnostic dump)
- Per-variant field reference table
- Position semantics by format (ISO 2709 / MARCXML / MARCJSON)
- Source filename plumbing via with_source / from_path
- Recovery-mode interaction
- Pickle round-trip and the __setstate__ whitelist behavior

Adds the page to the Reference section of mkdocs.yml nav.

CHANGELOG entry under [Unreleased] documents:
- Breaking: MarcError variant restructure, removed InvalidRecord,
  Display format change
- Added: positional context, ParseContext, shared subfield
  primitives, with_source/from_path constructors, opportunistic 001
  extraction, detailed() rendering, mrrc-specific Python subclasses
  with positional kwargs/pickle/_format/detailed, PyO3 typed
  exception construction with __cause__ chaining, the docs page,
  snapshot tests via insta/syrupy
- Changed: source() chain walks underlying causes, From<io::Error>
  produces new struct shape, ParseError routes through
  marc_error_to_py_err, shared ISO 2709 primitives in iso2709.rs

No version bump; project remains on 0.7.x while related error beads
land before the next release cut.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…catedRecord

Two issues surfaced by the pre-merge comprehensive review.

Cross-language detailed() drift on InvalidIndicator
---------------------------------------------------

The Rust and Python detailed() outputs disagreed on column alignment
for InvalidIndicator: Rust pushed ("indicator", "1: found ...") (label
without colon, value carrying it), Python pushed ("indicator 1:",
"found ...") (number+colon in label). With column-padding to the
widest label, the visible whitespace landed in different places.

Aligned Rust to match Python's shape — label is now "indicator 0:" or
"indicator 1:" and the value is just "found X, expected Y". Snapshot
re-captured. Rust and Python detailed() outputs are now byte-for-byte
identical for InvalidIndicator (matching the same property already
verified for TruncatedRecord).

TruncatedRecord lost positional context in Strict mode
------------------------------------------------------

iso2709::read_record_data raised TruncatedRecord via the migration
helper truncated_msg in Strict mode, producing an error with all
positional fields set to None. Since strict-mode truncation is the
most common error in production, this contradicted bd-qzew's contract
("Always populated for TruncatedRecord: record_index, byte_offset,
record_byte_offset").

read_record_data now takes &ParseContext and constructs the error via
ctx.err_truncated_record(expected_length, actual_length). Switched
the underlying read from read_exact to a loop tracking bytes_read so
actual_length is accurate. Updated all three reader call sites and
the iso2709 self-tests.

Follow-up beads filed for related G4 findings:
- bd-3ois: thread ParseContext through recovery.rs salvage path
  (currently unreachable due to dead-code dispatch, but a latent gap
  for when recovery is properly wired)
- bd-6zvz: enrich MARCJSON parse-path errors via JsonError variant
  (the actual serde_json parse lives in src-python/src/formats.rs,
  not src/marcjson.rs)
- bd-4axe: populate ParseError positional metadata at boundary-
  scanner sites
- bd-wc5r: FFI integration test (Rust → PyO3 → Python) verifying
  typed exception construction for each MarcError variant
- bd-umwp: add typed exception class stubs to mrrc/_mrrc.pyi for
  mypy/pyright workflows

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Apr 19, 2026

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

✅ 60 untouched benchmarks
⏩ 16 skipped benchmarks1


Comparing feat/bd-qzew-error-enrichment (92da5e2) with main (12272ba)2

Open in CodSpeed

Footnotes

  1. 16 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on main (92da5e2) during the generation of this report, so 12272ba was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@dchud dchud marked this pull request as draft April 19, 2026 20:32
dchud and others added 6 commits April 19, 2026 16:33
…s get syrupy

The wheel-test CI workflows were pip-installing test deps as a hardcoded
list (pytest, pytest-benchmark) parallel to pyproject.toml's dev extra.
Adding syrupy in commit 7 (for snapshot tests) updated pyproject.toml
but not the workflow lists, so the wheel-test runners on macOS/Linux/
Windows failed at fixture setup with "fixture 'snapshot' not found".

Fix: extract test runtime deps into a new [test] optional dependency in
pyproject.toml. Both wheel-test workflows now install the wheel together
with that extra (`uv pip install "${WHEEL}[test]"`), so the dep list has
a single source of truth and any future test dep added to [test] is
automatically picked up by CI without a parallel workflow edit.

The dev extra still carries the same test deps explicitly (rather than
extending [test]) to avoid the circular-extra resolution that some
package managers handle awkwardly. Both lists are kept aligned by
review; the test extra is the canonical source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Type checkers (mypy, pyright) read .py source directly when no .pyi
exists for the module. The mrrc exception classes set their positional
context attributes via setattr in __init__, which leaves them as Any
from a type checker's perspective.

Adds class-level attribute annotations to _MrrcExceptionBase and to
each subclass with extra fields (message on InvalidField/EncodingError/
XmlError/JsonError/WriterError, expected_length/actual_length on
TruncatedRecord) so checkers see the real types when callers inspect
attributes after `except mrrc.InvalidIndicator as e: e.record_index`.

No runtime behavior change. Closes bd-umwp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The actual serde_json::from_str call for MARCJSON parsing lives in the
PyO3 binding (src-python/src/formats.rs::marcjson_to_record), not in
src/marcjson.rs which operates on already-parsed serde_json::Value
inputs. The JsonError variant and ParseContext::err_json infrastructure
introduced in PR commit 4 were therefore unreachable; this commit hooks
them up at the actual parse site.

Both json_to_record and marcjson_to_record now construct a transient
ParseContext, use ctx.err_json to wrap the underlying serde_json::Error,
and route through marc_error_to_py_err so the typed JsonError Python
exception class is raised (rather than a bare PyValueError("Invalid
MARCJSON: ...")). The wrapped serde_json::Error is preserved via
#[source] so callers can downcast to inspect line/column.

Closes bd-6zvz.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restructures ParseError from a 3-variant string-message enum into a
struct with a kind discriminator (ParseErrorKind) and an optional
ParseErrorContext (record_index, byte_offset, source_name). Adds
builder methods (with_record_index, with_byte_offset, with_source) so
callers can attach context fluently when they have it; to_py_err
forwards any populated context into the constructed MarcError so the
typed Python exception receives the same positional attributes as the
synchronous reader path.

The 30+ existing construction sites in src-python/src/{backend,
unified_reader,batched_reader,buffered_reader}.rs are migrated
mechanically to function-call form (ParseError::invalid_record(msg),
::record_boundary_error(msg), ::io_error(msg)) — no contextual change
at the call sites themselves because the boundary scanner / buffered
reader paths don't currently track records_read counters or
incremental byte offsets that would feed the new context fields.

The builder methods are #[allow(dead_code)] for now — they're
deliberately retained as infrastructure so per-site context tracking
can be added incrementally as the boundary-scanner / buffered-reader
loops gain counters, without further ParseError surgery.

Closes bd-4axe (the structural unification + context infrastructure);
populating the context at specific scan sites needs a separate
records-read-tracking change in the ReaderBackend iteration loop and
is out of scope for this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
src/recovery.rs's try_recover_record was constructing MarcError values
via the legacy migration helpers (truncated_msg, invalid_field_msg) which
produce errors with all positional fields set to None. Recovered-mode
errors therefore carried no record_index, byte_offset, source_name, or
001 control number — contradicting bd-qzew's contract that errors carry
positional metadata in all recovery modes.

Adds &ParseContext as a parameter to try_recover_record; the bib reader
passes its own ctx through so any error raised during salvage inherits
the same positional shape as strict-mode errors. The truncated-no-
directory case raises ctx.err_truncated_record; the field-data-not-
available case raises ctx.err_invalid_field with the field tag set on
a cloned context.

Note: the dispatch path from MarcReader to try_recover_record is
currently unreachable in practice (the buffer-length comparison can
never trigger because read_record_data returns a buffer of full length
even on a short read). The contract is now correct so when the dead-
code dispatch is fixed in a future change, recovered errors carry full
context automatically.

Removes the now-unused truncated_msg migration helper (the only caller
was recovery.rs).

Closes bd-3ois.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds TestFfiTypedExceptions in tests/python/test_errors.py that drives
the Rust → PyO3 → Python typed-exception conversion path with malformed
MARC inputs and asserts the raised Python class plus populated
positional attributes. Surfaces gaps the unit-only tests missed.

Two real bugs caught and fixed by this test:

1. Authority/holdings reader iterators were rewrapping every MarcError
   as a bare PyValueError("Failed to parse record: ...") via
   pyo3::exceptions::PyValueError::new_err, bypassing the typed
   exception construction entirely. Replaced with
   crate::error::marc_error_to_py_err so callers see the proper
   typed class.

2. Boundary-scanner detection of truncation in
   src-python/src/{backend,unified_reader,reader_helpers}.rs was
   constructing ParseError::invalid_record with a "Truncated record:
   expected N, got M" string, which mapped to mrrc.InvalidField
   instead of mrrc.TruncatedRecord on the Python side.
   Added ParseError::truncated_record(expected, actual) constructor
   that maps to MarcError::TruncatedRecord; updated all 4 call sites.
   Now mrrc.TruncatedRecord surfaces with expected_length/actual_length
   populated.

Closes bd-wc5r.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dchud dchud marked this pull request as ready for review April 19, 2026 20:50
dchud and others added 2 commits April 19, 2026 17:37
bd-qzew added ~380 lines of exception class hierarchy to mrrc/__init__.py,
pushing it past the threshold where the structure starts hurting (mixing
re-exports, module-level helpers, exception classes, format constants,
and several large wrapper classes). The exception hierarchy is the
easiest unit to extract cleanly since nothing else in the file depends
on it at import time.

Moves _MrrcExceptionBase, _POSITIONAL_FIELDS, MrrcException, all 12
mrrc-named exception classes (RecordLengthInvalid, RecordLeaderInvalid,
BaseAddressInvalid, BaseAddressNotFound, RecordDirectoryInvalid,
EndOfRecordNotFound, FieldNotFound, FatalReaderError, InvalidIndicator,
BadSubfieldCode, InvalidField, TruncatedRecord, EncodingError, XmlError,
JsonError, WriterError), and BadSubfieldCodeWarning to a new
mrrc/exceptions.py module.

mrrc/__init__.py re-exports all of them so the public API
(mrrc.InvalidIndicator, mrrc.RecordLeaderInvalid, etc.) is byte-for-byte
identical. Aligns with pymarc's pymarc/exceptions.py convention.

mrrc/__init__.py reduced from 2339 to 1971 lines (-368 net after the
imports added back). All 647 Python tests pass unmodified.

Filed bd-mcei for the larger refactor of the remaining wrapper classes
(Field, Record, Leader, MARCReader, MARCWriter) which together are
~1500 more lines and would naturally live in their own modules
following the same pymarc-aligned pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project's `requires-python = ">=3.10"` already covers the syntax
improvements people typically reach for the future import to enable:
PEP 604 union types (`int | None`) work natively from 3.10+, and PEP 585
generic builtins (`list[int]`, `dict[str, int]`) from 3.9+.

PEP 563 (postponed annotations) is still not the default in any released
Python version — but neither file uses any forward references that would
need the lazy-evaluation behavior.

Removes the import from mrrc/exceptions.py and tests/python/test_errors.py
where I added it during PR2 without checking whether it was actually
needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dchud dchud merged commit 8860ef0 into main Apr 19, 2026
47 of 48 checks passed
@dchud dchud deleted the feat/bd-qzew-error-enrichment branch April 19, 2026 22:29
dchud added a commit that referenced this pull request Apr 19, 2026
Empty bead created accidentally by a failed jq parse during follow-up
filing in PR #100. The actual work it would describe is delivered in
bd-3ois (now closed) via PR #100 commit 60fd27d.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dchud added a commit that referenced this pull request Apr 19, 2026
PR #100 merged with the full bd-qzew error enrichment work. Followups
bd-44wb (P1, stable error codes + help URLs), bd-kodg (P1, structured
to_dict/to_json + hex-dump diagnostics), and bd-uj7c (P2, per-stream
error cap) target the same 0.8.0 release window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant