
feat(template): add skip_row_if predicate for row filtering (P0-2) #26

Merged
dev360 merged 1 commit into main from feat/p0-2-skip-row-if
May 22, 2026

dev360 commented May 21, 2026

Summary

Closes P0-2 — declarative row filtering during extraction.

Reports often interleave data rows with shapes that share the column geometry of real records but are not records themselves: subtotals with a blank discriminator, day-of-week marker rows, grand-totals that zero out the key column and populate only the total. Previously, crease extracted those rows and the validator surfaced every implausible value as wrong_type / missing_required, burying real errors under noise. Operators had to either pre-process the file or add per-field null_tokens hacks.

locate.skip_row_if is now a first-class list of predicates that drops matching rows before field coercion.

Naming note (after v1.2.0 blocks rebase)

The blocks PR (#37) introduced a class also named SkipRowRule for in-block row filtering with a different shape (column + cell_pattern / match_blank). This PR's class was renamed to LocateSkipRule to avoid the collision; the helper functions in extractor.py were renamed to match (_row_matches_skip → _row_matches_locate_skip, etc.). The YAML field name skip_row_if is unchanged.

API

locate:
  skip_row_if:
    # subtotal rows: blank discriminator
    - all_blank: [customer]
    # day-of-week marker rows
    - column: label
      value_pattern: "^(MONDAY|TUESDAY|WEDNESDAY|THURSDAY|FRIDAY|SATURDAY|SUNDAY)$"
    # grand-total row: blank discriminator AND populated total
    - all_blank: [site]
      non_blank: [head_count]

Each list entry is a LocateSkipRule. It supports three optional predicates, which can be combined:

  • all_blank: [col, ...] — every listed column must be blank on the row.
  • non_blank: [col, ...] — every listed column must carry a non-blank value.
  • column: name + value_pattern: regex — that column's stringified value must full-match the regex.

Fields set on the same rule AND together. Multiple rules in the list OR together. Matching rows are silently filtered — no record in canonical, no row error.
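
The AND/OR semantics can be sketched in a few lines of Python. This is an illustration, not the code in extractor.py: only the field names come from the YAML above. The dataclass layout, the helper names (is_blank, rule_matches, should_skip), and the definition of "blank" (None or whitespace-only) are assumptions, as is the choice that a rule naming a column the row lacks never matches.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class LocateSkipRule:
    """Illustrative stand-in; only the field names come from the YAML above."""

    all_blank: list[str] = field(default_factory=list)
    non_blank: list[str] = field(default_factory=list)
    column: Optional[str] = None
    value_pattern: Optional[str] = None


def is_blank(value: Any) -> bool:
    # Assumption: None and whitespace-only strings count as blank.
    return value is None or str(value).strip() == ""


def rule_matches(rule: LocateSkipRule, row: dict[str, Any]) -> bool:
    has_pattern = rule.column is not None and rule.value_pattern is not None
    referenced = [*rule.all_blank, *rule.non_blank]
    if has_pattern:
        referenced.append(rule.column)
    # Assumption: an empty rule, or one naming a column the row lacks, never matches.
    if not referenced or any(col not in row for col in referenced):
        return False
    # Fields set on the same rule are AND-ed: every check below must pass.
    if not all(is_blank(row[col]) for col in rule.all_blank):
        return False
    if any(is_blank(row[col]) for col in rule.non_blank):
        return False
    if has_pattern and re.fullmatch(rule.value_pattern, str(row[rule.column])) is None:
        return False
    return True


def should_skip(rules: list[LocateSkipRule], row: dict[str, Any]) -> bool:
    # Rules in the list are OR-ed: any single match drops the row.
    return any(rule_matches(rule, row) for rule in rules)


rules = [
    LocateSkipRule(all_blank=["customer"]),
    LocateSkipRule(column="label", value_pattern="^(MONDAY|TUESDAY)$"),
    LocateSkipRule(all_blank=["site"], non_blank=["head_count"]),
]

rows = [
    {"customer": "Example Customer A", "label": "detail", "site": "North", "head_count": 12},
    {"customer": "", "label": "subtotal", "site": "North", "head_count": 30},
    {"customer": "Example Customer A", "label": "MONDAY", "site": "North", "head_count": None},
    {"customer": "Example Customer A", "label": "MONDAY shift", "site": "North", "head_count": 4},
    {"customer": "Totals", "label": "total", "site": "", "head_count": 42},
    {"customer": "Example Customer A", "label": "detail", "site": "", "head_count": None},
]

kept = [row for row in rows if not should_skip(rules, row)]
```

Under these assumptions, the subtotal, the MONDAY marker, and the grand-total rows are dropped. "MONDAY shift" survives because the pattern must match the whole value, and the last row survives because the compound rule needs a blank site and a populated head_count at the same time.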

Test plan

  • uv run pytest tests/test_field_scan_gaps.py -q — three P0-2 tests graduate.
  • uv run pytest -q — full suite green after rebase.
  • uvx --from 'ruff==0.6.9' ruff check . and ruff format --check . — clean.

All fixtures and copy use only fictitious values per CLAUDE.md.

🤖 Generated with Claude Code

dev360 force-pushed the feat/p0-2-skip-row-if branch from b97a0dc to 9133437 May 22, 2026 13:43
dev360 enabled auto-merge (squash) May 22, 2026 15:47
Reports often interleave data rows with shapes that share the column
geometry of real records but are not records themselves: subtotal rows
with a blank discriminator, day-of-week markers, grand-total rows that
zero out the key column and populate only the total. The previous
behavior was to extract those rows as records, then surface every
implausible value as ``wrong_type`` / ``missing_required`` — burying
real errors under noise.

Adds ``locate.skip_row_if`` as a list of predicates. Each predicate is
one ``SkipRowRule`` and supports any combination of three fields:

- ``all_blank: [col, ...]`` — every listed column must be blank.
- ``non_blank: [col, ...]`` — every listed column must be non-blank.
- ``column: name`` + ``value_pattern: regex`` — single column's
  stringified value must full-match the regex.

Fields set on the same rule are AND-ed (so a compound rule can drop
the "blank discriminator AND populated total" grand-total row).
Multiple rules in the list are OR-ed (any rule's match drops the row).

Matching rows are silently filtered before field coercion — no record
in canonical, no row error. Unknown column names in a rule are
ignored (the rule simply can't match), so a template that misnames a
column won't crash extraction; the rule just never fires.
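
A minimal sketch of the ordering described above, assuming a hypothetical extract loop and coerce step (neither name is taken from the codebase). It shows two things: skip predicates run before coercion, so a dropped row produces neither a record nor an error, and a rule that names a column the row does not have never fires.

```python
from typing import Any, Callable

Row = dict[str, Any]
Predicate = Callable[[Row], bool]


def is_blank(value: Any) -> bool:
    # Assumption: None and whitespace-only strings count as blank.
    return value is None or str(value).strip() == ""


def all_blank(*columns: str) -> Predicate:
    def predicate(row: Row) -> bool:
        # A column the row does not have makes the rule unable to match.
        if any(col not in row for col in columns):
            return False
        return all(is_blank(row[col]) for col in columns)

    return predicate


def coerce(row: Row) -> Row:
    # Hypothetical coercion step: site is required, head_count must be an int.
    if is_blank(row["site"]):
        raise ValueError("missing_required: site")
    return {"site": row["site"], "head_count": int(row["head_count"])}


def extract(rows: list[Row], skip_if: list[Predicate]) -> tuple[list[Row], list[tuple[int, str]]]:
    records, errors = [], []
    for index, row in enumerate(rows):
        if any(pred(row) for pred in skip_if):
            continue  # filtered before coercion: no record, no row error
        try:
            records.append(coerce(row))
        except ValueError as exc:
            errors.append((index, str(exc)))
    return records, errors


rows = [
    {"site": "North", "head_count": "12"},
    {"site": "", "head_count": "12"},  # grand-total row with a blank key column
]

# Correctly named column: the grand-total row is dropped without an error.
records, errors = extract(rows, [all_blank("site")])

# Misnamed column: the rule never fires, so the row reaches coercion and fails there.
records_typo, errors_typo = extract(rows, [all_blank("sight")])
```

The second call is the practical consequence of ignoring unknown columns: extraction does not crash, but the noise the rule was meant to suppress comes back as row errors, which is the visible hint that a rule is misnamed.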

Graduates the three P0-2 xfail tests in ``tests/test_field_scan_gaps.py``
and adds a "Skipping rows during extraction" section to
``docs/guides/templates.md``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dev360 force-pushed the feat/p0-2-skip-row-if branch from 9133437 to 2224d17 May 22, 2026 15:47
dev360 merged commit 90e9113 into main May 22, 2026
8 checks passed
dev360 deleted the feat/p0-2-skip-row-if branch May 22, 2026 15:47