feat(template): add `blocks:` v2 grammar for repeating sections by dev360 · Pull Request #37 · dev360/crease

dev360 · 2026-05-22T13:27:39Z

Summary

Adds the `blocks:` v2 grammar so a template can describe a tab that
holds multiple repeating sub-sections delimited by anchor patterns,
with per-section metadata that gets merged onto every row inside the
section. The flat `entities:` grammar can't express that pattern today.

Mental model:

template_id: weekly_orders
version: 2

blocks:                                 # top-level region declarations
  - name: daily_section
    tab_pattern: '^W-\d+$'
    starts_at: { column: D, cell_pattern: '^ORDER SCHEDULE$' }
    ends_at:   { column: A, cell_pattern: '^={3,}$' }   # quoted: {, } and , end a plain scalar inside a flow mapping
    captures:
      - field: order_date
        from: { column: D, cell_pattern: '^DAY (\d+-\d+-\d+)$', regex_group: 1 }
        type: date
        date_formats: ['%m-%d-%Y']

entities:
  - name: order
    block: daily_section                # ← scope this entity to each block instance
    cardinality: many
    locate:
      orientation: flat
      header_anchor: { text: ORDER_ID, match_mode: exact }
    fields: [...]

Output stays flat — order_date from each section's DAY-row is merged
onto every order row from that section. No nested-dict output mode.
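
As a rough illustration of the flat-output rule (not the extractor's actual code), the sketch below merges one block instance's captures onto each of its rows. `merge_captures` and the sample values are hypothetical names chosen for this example.

```python
from typing import Any


def merge_captures(
    rows: list[dict[str, Any]], captures: dict[str, Any]
) -> list[dict[str, Any]]:
    """Return new flat rows with the block's captures added to each one."""
    # Row fields are applied last, so they would win on a name clash.
    # That ordering is an assumption of this sketch; the template loader
    # described in this PR rejects such clashes (field_shadow_collision).
    return [{**captures, **row} for row in rows]


# Two order rows found inside one section whose DAY-row captured a date.
section_rows = [{"order_id": "ORD-0001"}, {"order_id": "ORD-0002"}]
flat = merge_captures(section_rows, {"order_date": "2026-05-18"})
```

Every row stays a single flat dict; nothing is nested under the section.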

What's intentionally not in v1

- Nested blocks. Entities reference a block by name. There is no
  `body.blocks`, no `block: [outer, inner]`. Forward-compatible if a
  future layout needs it.
- Captures above `starts_at`. A capture's `from` scan is bounded
  by the block instance.
- Merged-cell expansion. Anchors must point at the merge's
  top-left cell.
- Multi-column anchor. `column:` is a single int or a single letter.

Each cut keeps the v1 grammar small and predictable; the doc page
spells them out so template authors know where the edges are.
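
To make the single-column rule concrete, here is a minimal sketch of the letter-to-index coercion. It is a plain function rather than the Pydantic validator the PR adds, `column_to_index` is a hypothetical name, and rejecting lowercase letters and negative ints is an assumption of this sketch.

```python
from typing import Union


def column_to_index(column: Union[int, str]) -> int:
    """Coerce a `column:` value to a 0-indexed int ("A" -> 0, "Z" -> 25)."""
    if isinstance(column, int):
        if column < 0:
            raise ValueError("column index must be >= 0")
        return column
    # Exactly one uppercase ASCII letter; "AA" and friends are rejected
    # because multi-letter columns are out of scope.
    if len(column) != 1 or not "A" <= column <= "Z":
        raise ValueError(
            f"column must be an int or a single letter A-Z, got {column!r}"
        )
    return ord(column) - ord("A")
```

With this shape, `column: D` and `column: 3` address the same column.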

Commit slice

- feat(template): add Block, Capture, CellAnchor schema with Entity.block ref
  Pydantic models, `Literal[1, 2]` version, cross-tree validators.
- feat(extractor): implement block scanning, capture resolution, entity scoping
  Block-instance discovery, capture resolution, `separator_rows`
  pre-filter, `header_anchor` scoping per instance, captures merged
  onto flat rows.
- feat(extractor): emit structural errors for block / capture failure modes
  Five new `block_*` / `capture_*` error codes; capture coercion
  failures re-use `wrong_type`; five corpus negative cases +
  eight template-load unit tests.
- docs(blocks): grammar guide, errors taxonomy, README bullet, nav entry
  docs/guides/blocks.md (new page), docs/reference/errors.md,
  versioning section in templates.md, streaming note,
  README bullet, mkdocs.yml nav.

New error codes (public contract — see `docs/reference/errors.md`)

| Code | Severity | When |
| --- | --- | --- |
| `block_starts_not_found` | structural | tab matches `tab_pattern` but `starts_at` never fires |
| `block_unterminated` | structural | `ends_at` configured but no candidate before next start / EOF |
| `capture_no_match` | structural | required capture with zero matches in the instance |
| `capture_multiple_matches` | structural | capture with `on_multiple: error` matches more than once |
| `block_ref_not_found` | structural | template-load: entity targets an undeclared block |
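
For readers wondering when the two `capture_*` codes fire, the following is a small sketch of resolving one capture inside one block instance. `resolve_capture` and `CaptureIssue` are hypothetical names, and the defaults (`on_multiple="first"`, `required=True`) are assumptions of this sketch, not documented defaults.

```python
import re
from typing import Optional


class CaptureIssue(Exception):
    """Hypothetical carrier for a structural error code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def resolve_capture(
    cells: list[Optional[str]],
    cell_pattern: str,
    regex_group: int = 1,
    on_multiple: str = "first",
    required: bool = True,
) -> Optional[str]:
    """Scan one column of a block instance and return the captured group."""
    pattern = re.compile(cell_pattern)
    hits = []
    for cell in cells:
        match = pattern.search(cell) if cell is not None else None
        if match:
            hits.append(match.group(regex_group))
    if not hits:
        if required:
            raise CaptureIssue("capture_no_match")
        return None
    if len(hits) > 1 and on_multiple == "error":
        raise CaptureIssue("capture_multiple_matches")
    return hits[-1] if on_multiple == "last" else hits[0]


# Column D of one section: the header cell, the DAY-row, then blanks.
column_d = ["ORDER SCHEDULE", "DAY 05-18-2026", None, None]
raw_date = resolve_capture(column_d, r"^DAY (\d+-\d+-\d+)$")
```

The returned value is still the raw string; type coercion is a separate step.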

Test plan

- `uv run pytest -q`: 140 tests pass (was 107)
- Both calamine and openpyxl backends covered for every new corpus case
- `uv run mkdocs build --strict` clean
- `uv run ruff check .` clean
- Corpus fixtures regenerate deterministically (`python -m test_cases.generate`)
- Synthetic fixtures only (Acme/Globex/Hooli vocabulary, ORD-####, example.com)

🤖 Generated with Claude Code

…ck ref

Adds the template-model layer for the upcoming `blocks:` grammar.
Extraction is wired in a follow-up; this commit only teaches the schema
what a v2 template looks like.

Grammar shape:

    blocks:
      - name: daily_section
        tab_pattern: ...
        starts_at: { column: D, cell_pattern: '^DELIVERY SCHEDULE$' }
        ends_at:   { column: A, cell_pattern: '^={3,}$', strategy: last_in_block }
        separator_rows:
          - { column: A, cell_pattern: '^={3,}$' }
          - { column: A, match_blank: true }
        captures:
          - field: delivery_date
            from: { column: D, cell_pattern: '^(MON|...) (.+)$', regex_group: 2 }
            type: date

    entities:
      - name: delivery
        block: daily_section    # entity scoped to each block instance;
                                # captures merge onto every emitted row
        ...

New Pydantic models (all `extra="forbid"`):

- CellAnchor   column + cell_pattern + regex_group + on_multiple
- EndAnchor    CellAnchor + strategy (first_in_block | last_in_block)
- SkipRowRule  column + (cell_pattern XOR match_blank: true)
- Capture      field + from (CellAnchor) + type + required + propagate
- Block        name + tab_pattern + starts_at + ends_at + ...

`Entity` gains an optional `block: str` field. When set, the entity is
extraction-scoped to each instance of the named block and inherits its
captures.

`Template.version` is now `Literal[1, 2]` (was `int = 1`). A v2-grammar
template (`blocks:` non-empty or any `Entity.block` set) requires
`version: 2`.

Cross-tree validation at `Template.model_validate`:

- blocks-requires-v2      `blocks:` without `version: 2`
- duplicate block name    two blocks share a `name`
- block_ref_not_found     `Entity.block` references unknown block
- entity_tab_with_block   entity sets `locate.tab`/`tab_pattern` while
                          also setting `block:` (the block owns tab scope)
- field_shadow_collision  capture on block B and FieldSpec on an entity
                          that targets B share a name

`CellAnchor.column` accepts an int (0-indexed) OR a single Excel letter
("A".."Z"); a `@field_validator` coerces letter -> int. Multi-letter
columns ("AA") are out of scope for v1.

Includes the seed `test_cases/repeating_sections_per_tab/` fixture that
uses the new grammar. Schema validates cleanly; full extraction fails
until the extractor commit lands (the entity has no tab/tab_pattern
because the block owns tab scope, which is the correct failure mode
prior to wiring). 103 existing tests pass; the 4 failures are the new
seed case waiting on the extractor.
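
The five cross-tree rules can be sketched in isolation as below. This is a simplified stand-in using stdlib dataclasses instead of the Pydantic models named above, and it returns rule names rather than raising; `cross_tree_errors`, `capture_fields`, `field_names`, and the single `tab` attribute are hypothetical simplifications.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    name: str
    capture_fields: list[str] = field(default_factory=list)


@dataclass
class Entity:
    name: str
    block: Optional[str] = None
    tab: Optional[str] = None  # stands in for locate.tab / tab_pattern
    field_names: list[str] = field(default_factory=list)


def cross_tree_errors(
    version: int, blocks: list[Block], entities: list[Entity]
) -> list[str]:
    """Return the name of every cross-tree rule the template violates."""
    errors: list[str] = []
    uses_v2 = bool(blocks) or any(e.block is not None for e in entities)
    if uses_v2 and version != 2:
        errors.append("blocks-requires-v2")
    names = [b.name for b in blocks]
    if len(names) != len(set(names)):
        errors.append("duplicate block name")
    by_name = {b.name: b for b in blocks}
    for entity in entities:
        if entity.block is None:
            continue
        if entity.tab is not None:
            # The block owns tab scope, so the entity must not set it too.
            errors.append("entity_tab_with_block")
        block = by_name.get(entity.block)
        if block is None:
            errors.append("block_ref_not_found")
        elif set(block.capture_fields) & set(entity.field_names):
            errors.append("field_shadow_collision")
    return errors
```

A template with one `daily_section` block and an `order` entity pointing at it yields an empty list under `version: 2` and `["blocks-requires-v2"]` under `version: 1`.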

… scoping

Wires the `blocks:` grammar through the extractor. With this commit, the
seed `test_cases/repeating_sections_per_tab/` case extracts cleanly:
7 delivery rows across 2 daily sections, each carrying the section's
`delivery_date` and `day_of_week` captured from the section header row.

Mechanics:

1. `_extract_entity` dispatches to a new `_extract_for_block` path when
   `entity.block` is set. The block (looked up by name from
   `Template.blocks`) owns tab targeting via its own `tab_pattern`;
   `find_tabs` is invoked with a stand-in Locate carrying the block's
   tab scope.

2. `_find_block_instances` makes a single linear pass through the tab,
   collecting all `starts_at` and `ends_at` matches in their respective
   columns. Instances are paired per strategy:
   - `last_in_block` (default, greedy): pick the LAST ends_at hit in
     (starts_at, next_starts_at] or (starts_at, EOF].
   - `first_in_block`: pick the FIRST.

   When `ends_at` is omitted, the instance extends to next start - 1 or
   EOF. When `starts_at` matches nothing, emits `block_starts_not_found`.
   When `ends_at` is configured but no candidate fires in the window,
   emits `block_unterminated`.

3. `_resolve_captures` scans each capture's `from.column` inside the
   instance's row range. `on_multiple` controls picking
   (first/last/error). Zero matches + `required: true` =>
   `capture_no_match`. Captured groups are coerced via `_coerce_capture`
   (returns ISO strings for dates to match the rest of the corpus's
   JSON shape).

4. `_extract_flat` gains three optional kwargs:
   - `cell_range_override`: synthesized per block instance so the
     existing CellRange row-window machinery scopes extraction.
   - `separator_rows`: applied AFTER header detection (filtering
     pre-header would shift indices out from under header_idx).
   - `extra_fields`: the propagating captures merged onto each emitted
     row.

5. `_locate.find_header_row` and `resolve_header_row` accept optional
   `min_row` / `max_row` so an entity's `header_anchor` scan inside a
   block instance is restricted to that instance's range. An outer
   occurrence of the anchor text can't leak in.

Captures whose `propagate: false` are still scanned (and still surface
`capture_no_match` / `capture_multiple_matches` if applicable) but are
not merged onto rows.

Fixture changes:

- `expected.json` switched to plural `deliveries:` (matches every other
  corpus case's pluralization convention; `_pluralize` produces
  `deliveries` for `cardinality: many`).
- `input.xlsx`: harmonized the two header rows so both blocks use `DOCK`
  (one was `DOCK #`). The header-variant case will land with a future
  `FieldSpec.alias:` feature outside the blocks grammar.

107 tests pass on both calamine and openpyxl backends.
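
The pairing rules in step 2 can be sketched as follows. This is an illustration under stated assumptions, not `_find_block_instances` itself: it takes pre-collected, sorted hit rows instead of scanning the sheet, treats the chosen end row as part of the instance, and uses the hypothetical name `find_block_instances`.

```python
from typing import Optional


def find_block_instances(
    start_rows: list[int],
    end_rows: Optional[list[int]],
    last_row: int,
    strategy: str = "last_in_block",
) -> tuple[list[tuple[int, int]], list[str]]:
    """Pair anchor hits into inclusive (first_row, last_row) ranges.

    `start_rows` / `end_rows` are sorted row indices where each anchor
    matched; `end_rows=None` means the template declares no `ends_at`.
    """
    if not start_rows:
        return [], ["block_starts_not_found"]
    instances: list[tuple[int, int]] = []
    errors: list[str] = []
    for i, start in enumerate(start_rows):
        next_start = start_rows[i + 1] if i + 1 < len(start_rows) else None
        if end_rows is None:
            # No ends_at: run to the row before the next start, or to EOF.
            end = next_start - 1 if next_start is not None else last_row
            instances.append((start, end))
            continue
        upper = next_start if next_start is not None else last_row
        # Candidate window is (start, upper], matching the intervals above.
        candidates = [r for r in end_rows if start < r <= upper]
        if not candidates:
            errors.append("block_unterminated")
            continue
        end = candidates[-1] if strategy == "last_in_block" else candidates[0]
        instances.append((start, end))
    return instances, errors
```

With starts at rows 2 and 10 and `====` hits at rows 8, 9, and 18, the greedy default ends the first instance at row 9, while `first_in_block` ends it at row 8.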

…odes

Adds the five structural error codes the `blocks:` grammar needs (four
raised at extract time, plus `block_ref_not_found` at template load) and
five programmatic negative corpus cases that exercise them end-to-end,
plus eight template-load unit tests for the cross-tree validator.

New STRUCTURAL error codes (all in `_errors.py` STRUCTURAL_TYPES; each
also has a human-readable msg in `validator._structural_msg`):

- block_starts_not_found    starts_at never fires in a matching tab
- block_unterminated        ends_at configured but no candidate found
                            before next starts_at or EOF
- capture_no_match          required capture with zero matches in the
                            block instance
- capture_multiple_matches  capture has >1 hit and on_multiple=error
- block_ref_not_found       entity.block names an undeclared block
                            (template-load time only)

`_coerce_capture` no longer swallows coercion failures: it raises
`CoercionError`, which `_resolve_captures` catches and re-emits as a
row-level `wrong_type` keyed to the capture so the validator picks it up
alongside other field-level coercion failures.

New corpus cases (programmatic, regenerable via
`python -m test_cases.generate`), all using synthetic Acme order data:

- blocks_starts_not_found
- blocks_unterminated
- blocks_capture_no_match_required
- blocks_capture_multiple_matches_error
- blocks_capture_wrong_type

The first four set verdict=reject (structural); the wrong_type case is
needs_review (rows still extract, the bad capture is flagged).

Helper notes for fixture authors: `_append_string_row` forces
`data_type = 's'` on cells whose string value starts with `=`. Without
this, openpyxl's `ws.append` writes the value as a formula, which
calamine then evaluates to an empty string when round-tripped through
`to_python(skip_empty_area=False)`. That broke the end-anchor detection
for cells like `====`.

Template-load unit tests in tests/test_blocks_template_validation.py
cover the five cross-tree validator paths plus column-letter coercion
and SkipRowRule's exactly-one-match-mode invariant.

140 tests pass on both backends.
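
The raise-then-re-emit flow for capture coercion might look roughly like this. The function names mirror the ones mentioned above without the leading underscore, but the bodies, the `date_formats` handling, and the issue dict shape are assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional


class CoercionError(ValueError):
    """Raised when a captured string cannot be coerced to its type."""


def coerce_capture(raw: str, type_: str, date_formats: list[str]) -> str:
    """Coerce a captured string; dates come back as ISO strings."""
    if type_ != "date":
        return raw
    for fmt in date_formats:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue  # try the next declared format
    raise CoercionError(f"{raw!r} matches none of {date_formats}")


def resolve_capture_value(
    field: str, raw: str, date_formats: list[str], issues: list[dict]
) -> Optional[str]:
    """Coerce one date capture, recording a wrong_type issue on failure."""
    try:
        return coerce_capture(raw, "date", date_formats)
    except CoercionError as exc:
        # Hypothetical issue shape: keyed to the capture's field name.
        issues.append({"type": "wrong_type", "field": field, "message": str(exc)})
        return None
```

Because the failure is recorded rather than raised further, the rows of that instance can still be emitted, which fits the needs_review verdict of the wrong_type corpus case.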

Documents the `blocks:` v2 grammar end-to-end:

- docs/guides/blocks.md: new page; the YAML shape, every Block field,
  greedy-ends_at rationale, column int-vs-letter, capture knobs,
  flat-output guarantee, both the template-load and extract-time error
  tables, the three matchers reference, and explicit v1 non-features
  (no nesting, no above-anchor captures, no merged-cell expansion, no
  multi-column anchor).
- docs/guides/templates.md adds a Versioning section explaining
  `version: 1` vs `version: 2` and the loader-rejection behavior.
- docs/guides/streaming.md adds a section on per-block-instance
  buffering and the latency-to-first-row contract.
- docs/reference/errors.md adds a table for the five new `block_*` /
  `capture_*` codes plus a note on capture coercion failures re-using
  `wrong_type`.
- README.md adds a "Repeating sections within one tab" section between
  "Multi-entity files" and "Scattered metadata", with the same Acme
  orders example used in the guide.
- mkdocs.yml adds guides/blocks.md to the Guides nav, between layouts
  and streaming.

`mkdocs build --strict` passes. Full pytest (140 tests) still green.

Pre-commit fixers fired on CI:

- `ruff-format` reformatted `src/crease/extractor.py`,
  `src/crease/template_model.py`, and `test_cases/cases.py`.
- `end-of-file-fixer` added trailing newlines to the new blocks-grammar
  corpus JSON files (`expected.json`, `expected_issues.json`).

Also fixes the source of the EoF drift: `test_cases/types.py` now
appends `\n` when writing the two JSON fixture files, so future
`python -m test_cases.generate` runs stay idempotent against the
end-of-file-fixer hook.
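
A minimal sketch of the trailing-newline fix, assuming the fixtures are written with `json.dumps`. `render_json_fixture`, `write_json_fixture`, and the `indent` / `sort_keys` settings are hypothetical and not taken from `test_cases/types.py`.

```python
import json
from pathlib import Path
from typing import Any


def render_json_fixture(payload: Any) -> str:
    """Serialize a fixture so the text ends in exactly one newline."""
    # json.dumps never emits a trailing newline, so appending one here
    # gives the end-of-file-fixer hook nothing left to change.
    return json.dumps(payload, indent=2, sort_keys=True) + "\n"


def write_json_fixture(path: Path, payload: Any) -> None:
    """Write the rendered fixture to disk as UTF-8."""
    path.write_text(render_json_fixture(payload), encoding="utf-8")
```

Since rendering is deterministic for a given payload, regenerating the corpus should produce byte-identical files.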


dev360 added 5 commits May 21, 2026 16:23

dev360 merged commit 7126a3a into main May 22, 2026
8 checks passed

dev360 deleted the feat/blocks-grammar branch May 22, 2026 13:34

dev360 mentioned this pull request May 22, 2026

feat(template): add skip_row_if predicate for row filtering (P0-2) #26

Merged

3 tasks
