feat(template): add `blocks:` v2 grammar for repeating sections #37
Merged
feat(template): add Block, Capture, CellAnchor schema with Entity.block ref
Adds the template-model layer for the upcoming `blocks:` grammar.
Extraction is wired in a follow-up; this commit only teaches the
schema what a v2 template looks like.
Grammar shape:
    blocks:
      - name: daily_section
        tab_pattern: ...
        starts_at: { column: D, cell_pattern: '^DELIVERY SCHEDULE$' }
        ends_at:   { column: A, cell_pattern: '^={3,}$',
                     strategy: last_in_block }
        separator_rows:
          - { column: A, cell_pattern: '^={3,}$' }
          - { column: A, match_blank: true }
        captures:
          - field: delivery_date
            from: { column: D, cell_pattern: '^(MON|...) (.+)$', regex_group: 2 }
            type: date
    entities:
      - name: delivery
        block: daily_section   # entity scoped to each block instance;
                               # captures merge onto every emitted row
        ...
New Pydantic models (all `extra="forbid"`):
  - CellAnchor   column + cell_pattern + regex_group + on_multiple
  - EndAnchor    CellAnchor + strategy (first_in_block | last_in_block)
  - SkipRowRule  column + (cell_pattern XOR match_blank: true)
  - Capture      field + from (CellAnchor) + type + required + propagate
  - Block        name + tab_pattern + starts_at + ends_at + ...
`Entity` gains an optional `block: str` field. When set, the entity is
extraction-scoped to each instance of the named block and inherits its
captures.
`Template.version` is now `Literal[1, 2]` (was `int = 1`). A v2-grammar
template (`blocks:` non-empty or any `Entity.block` set) requires
`version: 2`.
Cross-tree validation at `Template.model_validate`:
  - blocks-requires-v2      `blocks:` without `version: 2`
  - duplicate block name    two blocks share a `name`
  - block_ref_not_found     `Entity.block` references unknown block
  - entity_tab_with_block   entity sets `locate.tab`/`tab_pattern` while
                            also setting `block:` (the block owns tab scope)
  - field_shadow_collision  capture on block B and FieldSpec on an entity
                            that targets B share a name
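The five checks above can be sketched with plain dataclasses. This is an illustration only, not the Pydantic validator from the commit: the `Block` and `Entity` stand-ins, the `cross_tree_errors` name, and the choice to collect codes into a list are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:  # hypothetical stand-in for the real Block model
    name: str
    captures: list = field(default_factory=list)  # capture field names


@dataclass
class Entity:  # hypothetical stand-in for the real Entity model
    name: str
    block: Optional[str] = None
    tab: Optional[str] = None  # stands in for locate.tab / tab_pattern
    fields: list = field(default_factory=list)  # FieldSpec names


def cross_tree_errors(version, blocks, entities):
    """Return the cross-tree error codes a template would trigger."""
    errors = []
    # A v2-grammar template is one with blocks or any Entity.block set.
    if (blocks or any(e.block for e in entities)) and version != 2:
        errors.append("blocks-requires-v2")
    names = [b.name for b in blocks]
    if len(names) != len(set(names)):
        errors.append("duplicate block name")
    by_name = {b.name: b for b in blocks}
    for entity in entities:
        if entity.block is None:
            continue
        if entity.tab is not None:
            errors.append("entity_tab_with_block")
        block = by_name.get(entity.block)
        if block is None:
            errors.append("block_ref_not_found")
        elif set(block.captures) & set(entity.fields):
            errors.append("field_shadow_collision")
    return errors
```

For example, `cross_tree_errors(1, [Block("daily_section")], [])` returns `["blocks-requires-v2"]`.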
`CellAnchor.column` accepts an int (0-indexed) OR a single Excel letter
("A".."Z"); a `@field_validator` coerces letter -> int. Multi-letter
columns ("AA") are out of scope for v1.
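A minimal sketch of the letter -> int coercion, written as a plain function rather than the `@field_validator` the commit uses. The `coerce_column` name is hypothetical, and rejecting negative ints and lowercase letters is an assumption the commit message does not state.

```python
def coerce_column(value):
    """Return a 0-indexed column from an int or a single letter "A".."Z"."""
    if isinstance(value, int) and not isinstance(value, bool):
        if value < 0:  # assumption: negative indices are rejected
            raise ValueError("column index must be >= 0")
        return value
    if isinstance(value, str) and len(value) == 1 and "A" <= value <= "Z":
        return ord(value) - ord("A")  # "A" -> 0, "D" -> 3, "Z" -> 25
    # Multi-letter columns such as "AA" are out of scope and land here.
    raise ValueError(f"column must be an int or a single letter A..Z, got {value!r}")
```

Under this sketch, `starts_at: { column: D, ... }` and `starts_at: { column: 3, ... }` address the same column.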
Includes the seed `test_cases/repeating_sections_per_tab/` fixture
that uses the new grammar. Schema validates cleanly; full extraction
fails until the extractor commit lands (entity has no tab/tab_pattern
because the block owns tab scope — that's the correct failure mode
prior to wiring).
103 existing tests pass; the 4 failures are the new seed case waiting
on the extractor.
feat(extractor): implement block scanning, capture resolution, entity scoping
Wires the `blocks:` grammar through the extractor. With this commit, the
seed `test_cases/repeating_sections_per_tab/` case extracts cleanly:
7 delivery rows across 2 daily sections, each carrying the section's
`delivery_date` and `day_of_week` captured from the section header row.
Mechanics:
  1. `_extract_entity` dispatches to a new `_extract_for_block` path when
     `entity.block` is set. The block (looked up by name from
     `Template.blocks`) owns tab targeting via its own `tab_pattern`;
     `find_tabs` is invoked with a stand-in Locate carrying the block's
     tab scope.
  2. `_find_block_instances` makes a single linear pass through the
     tab, collecting all `starts_at` and `ends_at` matches in their
     respective columns. Instances are paired per strategy:
       - `last_in_block` (default, greedy): pick the LAST ends_at hit
         in (starts_at, next_starts_at] or (starts_at, EOF].
       - `first_in_block`: pick the FIRST.
     When `ends_at` is omitted, the instance extends to next start - 1
     or EOF. When `starts_at` matches nothing, emits
     `block_starts_not_found`. When `ends_at` is configured but no
     candidate fires in the window, emits `block_unterminated`.
  3. `_resolve_captures` scans each capture's `from.column` inside the
     instance's row range. `on_multiple` controls picking (first/last/
     error). Zero matches + `required: true` => `capture_no_match`.
     Captured groups are coerced via `_coerce_capture` (returns ISO
     strings for dates to match the rest of the corpus's JSON shape).
  4. `_extract_flat` gains three optional kwargs:
       - `cell_range_override`: synthesized per block instance so the
         existing CellRange row-window machinery scopes extraction.
       - `separator_rows`: applied AFTER header detection (filtering
         pre-header would shift indices out from under header_idx).
       - `extra_fields`: the propagating captures merged onto each
         emitted row.
  5. `_locate.find_header_row` and `resolve_header_row` accept optional
     `min_row` / `max_row` so an entity's `header_anchor` scan inside a
     block instance is restricted to that instance's range. An outer
     occurrence of the anchor text can't leak in.
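The pairing rules in step 2 can be sketched as follows. This is not the real `_find_block_instances`: the function name, the `(column, pattern)` tuple arguments, the list-of-lists row model, and the `(instances, errors)` return shape are assumptions made for illustration.

```python
import re


def _cell_matches(row, column, pattern):
    """True when the cell at `column` exists and matches `pattern`."""
    cell = row[column] if column < len(row) else None
    return cell is not None and re.search(pattern, str(cell)) is not None


def find_block_instances(rows, starts_at, ends_at=None, strategy="last_in_block"):
    """Pair start/end anchors into inclusive (first_row, last_row) instances."""
    starts, ends = [], []
    for i, row in enumerate(rows):  # single linear pass over the tab
        if _cell_matches(row, *starts_at):
            starts.append(i)
        if ends_at is not None and _cell_matches(row, *ends_at):
            ends.append(i)
    if not starts:
        return [], ["block_starts_not_found"]
    instances, errors = [], []
    last_row = len(rows) - 1
    for n, start in enumerate(starts):
        has_next = n + 1 < len(starts)
        if ends_at is None:
            # No end anchor: run to the row before the next start, or EOF.
            instances.append((start, starts[n + 1] - 1 if has_next else last_row))
            continue
        # Window is (start, next_start] or (start, EOF].
        upper = starts[n + 1] if has_next else last_row
        hits = [e for e in ends if start < e <= upper]
        if not hits:
            errors.append("block_unterminated")
            continue
        instances.append((start, hits[-1] if strategy == "last_in_block" else hits[0]))
    return instances, errors
```

With `last_in_block`, an `===` row that appears mid-section does not cut the instance short as long as a later `===` row exists before the next start.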
Captures whose `propagate: false` are still scanned (and still surface
`capture_no_match` / `capture_multiple_matches` if applicable) but are
not merged onto rows.
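A rough sketch of capture picking and the propagate rule described above. The helper names, the default values for `regex_group` and `on_multiple`, and the `(value, error)` return shape are assumptions; the pattern in the usage note is illustrative, not the elided one from the template.

```python
import re


def resolve_capture(cells, pattern, regex_group=0, on_multiple="first", required=True):
    """Scan one column's cells within a block instance.

    Returns (value, error_code). Both are None for an optional capture
    that matched nothing.
    """
    hits = []
    for cell in cells:
        match = re.search(pattern, cell) if cell else None
        if match:
            hits.append(match.group(regex_group))
    if not hits:
        return None, ("capture_no_match" if required else None)
    if len(hits) > 1 and on_multiple == "error":
        return None, "capture_multiple_matches"
    return (hits[-1] if on_multiple == "last" else hits[0]), None


def merge_captures(rows, captures):
    """Merge propagating captures onto every row; rows stay flat dicts.

    `captures` is a list of (field, value, propagate) tuples.
    """
    extra = {name: value for name, value, propagate in captures if propagate}
    return [{**row, **extra} for row in rows]
```

For instance, `resolve_capture(["MON 2026-05-18", "ORD-1"], r"^(MON|TUE) (.+)$", regex_group=2)` yields `("2026-05-18", None)`, and a capture with `propagate: false` is resolved the same way but dropped by `merge_captures`.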
Fixture changes:
  - `expected.json` switched to plural `deliveries:` (matches every
    other corpus case's pluralization convention; `_pluralize`
    produces `deliveries` for `cardinality: many`).
  - `input.xlsx`: harmonized the two header rows so both blocks use
    `DOCK` (one was `DOCK #`). The header-variant case will land with
    a future `FieldSpec.alias:` feature outside the blocks grammar.
107 tests pass on both calamine and openpyxl backends.
feat(extractor): emit structural errors for block / capture failure modes
Adds the five extract-time error codes the `blocks:` grammar needs and
five programmatic negative corpus cases that exercise them end-to-end,
plus eight template-load unit tests for the cross-tree validator.
New STRUCTURAL error codes (all in `_errors.py` STRUCTURAL_TYPES; each
also has a human-readable msg in `validator._structural_msg`):
  - block_starts_not_found    starts_at never fires in a matching tab
  - block_unterminated        ends_at configured but no candidate found
                              before next starts_at or EOF
  - capture_no_match          required capture with zero matches in
                              the block instance
  - capture_multiple_matches  capture has >1 hit and on_multiple=error
  - block_ref_not_found       entity.block names an undeclared block
                              (template-load time only)
`_coerce_capture` no longer swallows coercion failures — it raises
`CoercionError`, which `_resolve_captures` catches and re-emits as a
row-level `wrong_type` keyed to the capture so the validator picks it
up alongside other field-level coercion failures.
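The raise-then-re-emit flow might look like the sketch below. Only the `CoercionError` name and the `wrong_type` code come from the commit; the function names, the supported types, the ISO-only date parsing, and the issue dict shape are assumptions.

```python
from datetime import datetime


class CoercionError(ValueError):
    """A captured string could not be coerced to its declared type."""


def coerce_capture(raw, type_):
    """Coerce a captured string; dates come back as ISO strings."""
    try:
        if type_ == "date":
            # Assumption: only ISO-formatted input is accepted here.
            return datetime.strptime(raw, "%Y-%m-%d").date().isoformat()
        if type_ == "int":
            return int(raw)
    except ValueError as exc:
        raise CoercionError(f"cannot coerce {raw!r} to {type_}") from exc
    return raw  # strings pass through unchanged


def resolve_with_issues(field, raw, type_):
    """Return (value, issues); a coercion failure becomes a wrong_type issue."""
    try:
        return coerce_capture(raw, type_), []
    except CoercionError as exc:
        issue = {"type": "wrong_type", "field": field, "detail": str(exc)}
        return None, [issue]
```

Keying the issue to the capture's field lets the validator treat it like any other field-level coercion failure, so rows still extract and the case ends up needs_review rather than reject.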
New corpus cases (programmatic, regenerable via
`python -m test_cases.generate`), all using synthetic Acme order data:
    blocks_starts_not_found
    blocks_unterminated
    blocks_capture_no_match_required
    blocks_capture_multiple_matches_error
    blocks_capture_wrong_type
The first four set verdict=reject (structural); the wrong_type case is
needs_review (rows still extract, the bad capture is flagged).
Helper notes for fixture authors: `_append_string_row` forces
`data_type = 's'` on cells whose string value starts with `=`.
Without this, openpyxl's `ws.append` writes the value as a formula,
which calamine then evaluates to an empty string when round-tripped
through `to_python(skip_empty_area=False)`. That broke the end-anchor
detection for cells like `====`.
Template-load unit tests in tests/test_blocks_template_validation.py
cover the five cross-tree validator paths plus column-letter coercion
and SkipRowRule's exactly-one-match-mode invariant.
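The exactly-one-match-mode invariant can be illustrated with a small dataclass. This is a sketch, not the Pydantic `SkipRowRule`: the `matches` method and the treatment of whitespace-only cells as blank are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SkipRowRule:  # illustrative stand-in for the Pydantic model
    column: int
    cell_pattern: Optional[str] = None
    match_blank: bool = False

    def __post_init__(self):
        # XOR: exactly one of cell_pattern / match_blank: true must be set.
        if (self.cell_pattern is not None) == self.match_blank:
            raise ValueError("set exactly one of cell_pattern or match_blank: true")

    def matches(self, row):
        cell = row[self.column] if self.column < len(row) else None
        text = "" if cell is None else str(cell)
        if self.match_blank:
            return text.strip() == ""  # assumption: whitespace counts as blank
        return re.search(self.cell_pattern, text) is not None
```

A row filter over data rows then reads `[r for r in rows if not any(rule.matches(r) for rule in rules)]`.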
140 tests pass on both backends.
docs(blocks): grammar guide, errors taxonomy, README bullet, nav entry
Documents the `blocks:` v2 grammar end-to-end:
  - docs/guides/blocks.md      new page; the YAML shape, every Block field,
                               greedy-ends_at rationale, column int-vs-letter,
                               capture knobs, flat-output guarantee, both the
                               template-load and extract-time error tables,
                               the three matchers reference, and explicit v1
                               non-features (no nesting, no above-anchor
                               captures, no merged-cell expansion, no
                               multi-column anchor).
  - docs/guides/templates.md   adds a Versioning section explaining
                               `version: 1` vs `version: 2` and the
                               loader-rejection behavior.
  - docs/guides/streaming.md   adds a section on per-block-instance buffering
                               and the latency-to-first-row contract.
  - docs/reference/errors.md   adds a table for the five new `block_*` /
                               `capture_*` codes plus a note on capture
                               coercion failures re-using `wrong_type`.
  - README.md                  adds a "Repeating sections within one tab"
                               section between "Multi-entity files" and
                               "Scattered metadata", with the same Acme orders
                               example used in the guide.
  - mkdocs.yml                 adds guides/blocks.md to the Guides nav,
                               between layouts and streaming.
`mkdocs build --strict` passes. Full pytest (140 tests) still green.
style: apply ruff-format + ensure JSON fixtures end with newline
Pre-commit fixers fired on CI:
- `ruff-format` reformatted `src/crease/extractor.py`,
`src/crease/template_model.py`, and `test_cases/cases.py`.
- `end-of-file-fixer` added trailing newlines to the new blocks-grammar
corpus JSON files (`expected.json`, `expected_issues.json`).
Also fixes the source of the EoF drift: `test_cases/types.py` now appends
`\n` when writing the two JSON fixture files, so future
`python -m test_cases.generate` runs stay idempotent against the
end-of-file-fixer hook.
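The idempotency fix amounts to writing the trailing newline at the source. A minimal sketch, assuming a hypothetical `write_json_fixture` helper and `indent=2` / `sort_keys=True` formatting (the actual code in `test_cases/types.py` is not shown in this PR text):

```python
import json
from pathlib import Path


def write_json_fixture(path: Path, payload) -> None:
    """Write a JSON fixture that already ends with exactly one newline."""
    # json.dumps never emits a trailing newline, so append one explicitly;
    # end-of-file-fixer then has nothing left to change.
    text = json.dumps(payload, indent=2, sort_keys=True) + "\n"
    path.write_text(text, encoding="utf-8")
```

Regenerating a fixture twice with the same payload produces byte-identical files, which is what keeps the hook quiet.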
dev360 added a commit that referenced this pull request on May 22, 2026
* feat(template): add Block, Capture, CellAnchor schema with Entity.block ref
* feat(extractor): implement block scanning, capture resolution, entity scoping
* feat(extractor): emit structural errors for block / capture failure modes
* docs(blocks): grammar guide, errors taxonomy, README bullet, nav entry
* style: apply ruff-format + ensure JSON fixtures end with newline
Summary
Adds the `blocks:` v2 grammar so a template can describe a tab that
holds multiple repeating sub-sections delimited by anchor patterns,
with per-section metadata that gets merged onto every row inside the
section. The flat `entities:` grammar can't express that pattern today.

Mental model:
Output stays flat: `order_date` from each section's DAY-row is merged
onto every order row from that section. No nested-dict output mode.

What's intentionally not in v1
  - No nesting: no `body.blocks`, no `block: [outer, inner]`.
    Forward-compatible if a future layout needs it.
  - No captures above `starts_at`. A capture's `from` scan is bounded
    by the block instance.
  - No merged-cell expansion: … top-left cell.
  - No multi-column anchor: `column:` is a single int or single letter.
Each cut keeps the v1 grammar small and predictable; the doc page
spells them out so template authors know where the edges are.

Commit slice
  1. `feat(template): add Block, Capture, CellAnchor schema with Entity.block ref`
     Pydantic models, `Literal[1, 2]` version, cross-tree validators.
  2. `feat(extractor): implement block scanning, capture resolution, entity scoping`
     block-instance discovery, capture resolution, separator_rows
     pre-filter, `header_anchor` scoping per instance, captures merged
     onto flat rows.
  3. `feat(extractor): emit structural errors for block / capture failure modes`
     five new `block_*` / `capture_*` error codes; capture coercion
     failures re-use `wrong_type`; five corpus negative cases +
     eight template-load unit tests.
  4. `docs(blocks): grammar guide, errors taxonomy, README bullet, nav entry`
     `docs/guides/blocks.md` (new page), `docs/reference/errors.md`,
     versioning section in `templates.md`, streaming note,
     README bullet, `mkdocs.yml` nav.

New error codes (public contract; see `docs/reference/errors.md`)
  - block_starts_not_found    tab matches `tab_pattern` but `starts_at`
                              never fires
  - block_unterminated        `ends_at` configured but no candidate before
                              next start / EOF
  - capture_no_match          required capture with zero matches in the
                              block instance
  - capture_multiple_matches  capture with `on_multiple: error` matches
                              more than once
  - block_ref_not_found       `entity.block` names an undeclared block
                              (template-load time only)

Test plan
  - `uv run pytest -q`: 140 tests pass (was 107)
  - `uv run mkdocs build --strict` clean
  - `uv run ruff check .` clean
  - … (`python -m test_cases.generate`)
  - … (`ORD-####`, `example.com`)

🤖 Generated with Claude Code