feat(template): add skip_row_if predicate for row filtering (P0-2)#26
Merged
Conversation
b97a0dc to
9133437
Compare
Reports often interleave data rows with shapes that share the column geometry of real records but are not records themselves: subtotal rows with a blank discriminator, day-of-week markers, grand-total rows that zero out the key column and populate only the total. The previous behavior was to extract those rows as records, then surface every implausible value as ``wrong_type`` / ``missing_required`` — burying real errors under noise. Adds ``locate.skip_row_if`` as a list of predicates. Each predicate is one ``SkipRowRule`` and supports any combination of three fields: - ``all_blank: [col, ...]`` — every listed column must be blank. - ``non_blank: [col, ...]`` — every listed column must be non-blank. - ``column: name`` + ``value_pattern: regex`` — single column's stringified value must full-match the regex. Fields set on the same rule are AND-ed (so a compound rule can drop the "blank discriminator AND populated total" grand-total row). Multiple rules in the list are OR-ed (any rule's match drops the row). Matching rows are silently filtered before field coercion — no record in canonical, no row error. Unknown column names in a rule are ignored (the rule simply can't match), so a template that misnames a column won't crash extraction; the rule just never fires. Graduates the three P0-2 xfail tests in ``tests/test_field_scan_gaps.py`` and adds a "Skipping rows during extraction" section to ``docs/guides/templates.md``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9133437 to
2224d17
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes P0-2 — declarative row filtering during extraction.
Reports often interleave data rows with shapes that share the column geometry of real records but are not records themselves: subtotals with a blank discriminator, day-of-week marker rows, grand-totals that zero out the key column and populate only the total. Previously, crease extracted those rows and the validator surfaced every implausible value as
wrong_type/missing_required, burying real errors under noise. Operators had to either pre-process the file or add per-fieldnull_tokenshacks.locate.skip_row_ifis now a first-class list of predicates that drops matching rows before field coercion.Naming note (after v1.2.0 blocks rebase)
The blocks PR (#37) introduced a class also named
SkipRowRulefor in-block row filtering with a different shape (column+cell_pattern/match_blank). This PR's class was renamed toLocateSkipRuleto avoid the collision; the helper functions inextractor.pywere correspondingly renamed (_row_matches_skip→_row_matches_locate_skip, etc.). The YAML field nameskip_row_ifis unchanged.API
Each list entry is a
LocateSkipRule. Three optional fields:all_blank: [col, ...]— every listed column must be blank on the row.non_blank: [col, ...]— every listed column must carry a non-blank value.column: name+value_pattern: regex— that column's stringified value must full-match the regex.Fields set on the same rule AND together. Multiple rules in the list OR together. Matching rows are silently filtered — no record in canonical, no row error.
Test plan
uv run pytest tests/test_field_scan_gaps.py -q— three P0-2 tests graduate.uv run pytest -q— full suite green after rebase.uvx --from 'ruff==0.6.9' ruff check .andruff format --check .— clean.All fixtures and copy use only fictitious values per CLAUDE.md.
🤖 Generated with Claude Code