
feat(template): add blocks: v2 grammar for repeating sections#37

Merged
dev360 merged 5 commits into main from feat/blocks-grammar
May 22, 2026

Conversation


@dev360 dev360 commented May 22, 2026

Summary

Adds the blocks: v2 grammar so a template can describe a tab that
holds multiple repeating sub-sections delimited by anchor patterns,
with per-section metadata that gets merged onto every row inside the
section. The flat entities: grammar can't express that pattern today.

Mental model:

template_id: weekly_orders
version: 2

blocks:                                 # top-level region declarations
  - name: daily_section
    tab_pattern: ^W-\d+$
    starts_at: { column: D, cell_pattern: ^ORDER SCHEDULE$ }
    ends_at:   { column: A, cell_pattern: '^={3,}$' }   # quoted: braces and commas end a plain scalar in flow style
    captures:
      - field: order_date
        from: { column: D, cell_pattern: ^DAY (\d+-\d+-\d+)$, regex_group: 1 }
        type: date
        date_formats: ['%m-%d-%Y']

entities:
  - name: order
    block: daily_section                # ← scope this entity to each block instance
    cardinality: many
    locate:
      orientation: flat
      header_anchor: { text: ORDER_ID, match_mode: exact }
    fields: [...]

Output stays flat — order_date from each section's DAY-row is merged
onto every order row from that section. No nested-dict output mode.
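
A minimal sketch of the flat-output guarantee, assuming each block instance has already been reduced to a dict of captures plus a list of row dicts. `BlockInstance` and `flatten` are hypothetical names used for illustration only; they are not the extractor's API.

```python
from dataclasses import dataclass, field


@dataclass
class BlockInstance:
    """Hypothetical container: one section's captures and its extracted rows."""
    captures: dict
    rows: list = field(default_factory=list)


def flatten(instances):
    """Merge each instance's captures onto every row it contains."""
    flat = []
    for inst in instances:
        for row in inst.rows:
            # Captures are copied onto the row; no nested structure is emitted.
            flat.append({**row, **inst.captures})
    return flat


sections = [
    BlockInstance({"order_date": "2026-05-18"},
                  [{"order_id": "ORD-0001"}, {"order_id": "ORD-0002"}]),
    BlockInstance({"order_date": "2026-05-19"}, [{"order_id": "ORD-0003"}]),
]
for row in flatten(sections):
    print(row)
```

Every emitted row carries the `order_date` of the section it came from, and the result is a single list of flat dicts. The sketch lets a capture overwrite a row key of the same name; it assumes such name clashes are rejected earlier, at template load.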

What's intentionally not in v1

  • Nested blocks. Entities reference a block by name. There is no
    body.blocks, no block: [outer, inner]. Forward-compatible if a
    future layout needs it.
  • Captures above starts_at. A capture's from scan is bounded
    by the block instance.
  • Merged-cell expansion. Anchors must point at the merge's
    top-left cell.
  • Multi-column anchor. column: is a single int or single letter.

Each cut keeps the v1 grammar small and predictable; the doc page
spells them out so template authors know where the edges are.

Commit slice

  1. feat(template): add Block, Capture, CellAnchor schema with Entity.block ref
    — Pydantic models, Literal[1, 2] version, cross-tree validators.
  2. feat(extractor): implement block scanning, capture resolution, entity scoping
    — block-instance discovery, capture resolution, separator_rows
    pre-filter, header_anchor scoping per instance, captures merged
    onto flat rows.
  3. feat(extractor): emit structural errors for block / capture failure modes
    — five new block_* / capture_* error codes; capture coercion
    failures re-use wrong_type; five corpus negative cases +
    eight template-load unit tests.
  4. docs(blocks): grammar guide, errors taxonomy, README bullet, nav entry
    docs/guides/blocks.md (new page), docs/reference/errors.md,
    versioning section in templates.md, streaming note,
    README bullet, mkdocs.yml nav.

New error codes (public contract — see docs/reference/errors.md)

Code                      Severity    When
block_starts_not_found    structural  tab matches tab_pattern but starts_at never fires
block_unterminated        structural  ends_at configured but no candidate before next start / EOF
capture_no_match          structural  required capture with zero matches in the instance
capture_multiple_matches  structural  capture with on_multiple: error matches more than once
block_ref_not_found       structural  template-load: entity targets an undeclared block

Test plan

  • uv run pytest -q: 140 tests pass (was 107)
  • Both calamine and openpyxl backends covered for every new corpus case
  • uv run mkdocs build --strict clean
  • uv run ruff check . clean
  • Corpus fixtures regenerate deterministically (python -m test_cases.generate)
  • Synthetic fixtures only (Acme/Globex/Hooli vocabulary, ORD-####, example.com)

🤖 Generated with Claude Code

dev360 added 5 commits May 21, 2026 16:23
@dev360 dev360 merged commit 7126a3a into main May 22, 2026
8 checks passed
@dev360 dev360 deleted the feat/blocks-grammar branch May 22, 2026 13:34
dev360 added a commit that referenced this pull request May 22, 2026
* feat(template): add Block, Capture, CellAnchor schema with Entity.block ref

Adds the template-model layer for the upcoming `blocks:` grammar.
Extraction is wired in a follow-up; this commit only teaches the
schema what a v2 template looks like.

Grammar shape:

  blocks:
    - name: daily_section
      tab_pattern: ...
      starts_at: { column: D, cell_pattern: ^DELIVERY SCHEDULE$ }
      ends_at:   { column: A, cell_pattern: '^={3,}$',
                   strategy: last_in_block }
      separator_rows:
        - { column: A, cell_pattern: '^={3,}$' }
        - { column: A, match_blank: true }
      captures:
        - field: delivery_date
          from: { column: D, cell_pattern: ^(MON|...) (.+)$, regex_group: 2 }
          type: date

  entities:
    - name: delivery
      block: daily_section    # entity scoped to each block instance;
                              # captures merge onto every emitted row
      ...

New Pydantic models (all `extra="forbid"`):

  - CellAnchor       column + cell_pattern + regex_group + on_multiple
  - EndAnchor        CellAnchor + strategy (first_in_block | last_in_block)
  - SkipRowRule      column + (cell_pattern XOR match_blank: true)
  - Capture          field + from (CellAnchor) + type + required + propagate
  - Block            name + tab_pattern + starts_at + ends_at + ...

`Entity` gains an optional `block: str` field. When set, the entity is
extraction-scoped to each instance of the named block and inherits its
captures.

`Template.version` is now `Literal[1, 2]` (was `int = 1`). A v2-grammar
template (`blocks:` non-empty or any `Entity.block` set) requires
`version: 2`.

Cross-tree validation at `Template.model_validate`:

  - blocks-requires-v2     `blocks:` without `version: 2`
  - duplicate block name   two blocks share a `name`
  - block_ref_not_found    `Entity.block` references unknown block
  - entity_tab_with_block  entity sets `locate.tab`/`tab_pattern` while
                           also setting `block:` (the block owns tab scope)
  - field_shadow_collision capture on block B and FieldSpec on an entity
                           that targets B share a name
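
The five checks above can be sketched in plain Python over a parsed template dict. This is an illustration under stated assumptions, not the Pydantic validator itself: `validate_template` is a hypothetical name, it returns a list of problem labels instead of raising, and it assumes each entry under `fields` is a dict with a `name` key.

```python
def validate_template(template: dict) -> list:
    """Return the cross-tree problems found in a parsed template dict."""
    problems = []
    blocks = template.get("blocks", [])
    entities = template.get("entities", [])

    # A template uses the v2 grammar if it declares blocks or references one.
    uses_v2 = bool(blocks) or any("block" in e for e in entities)
    if uses_v2 and template.get("version", 1) != 2:
        problems.append("blocks-requires-v2")

    names = [b["name"] for b in blocks]
    if len(names) != len(set(names)):
        problems.append("duplicate block name")

    captures_by_block = {
        b["name"]: {c["field"] for c in b.get("captures", [])} for b in blocks
    }
    for entity in entities:
        ref = entity.get("block")
        if ref is None:
            continue
        if ref not in captures_by_block:
            problems.append("block_ref_not_found")
            continue
        locate = entity.get("locate", {})
        if "tab" in locate or "tab_pattern" in locate:
            # The block owns tab scope, so the entity must not set its own.
            problems.append("entity_tab_with_block")
        field_names = {f["name"] for f in entity.get("fields", [])}
        if field_names & captures_by_block[ref]:
            problems.append("field_shadow_collision")
    return problems


bad = {
    "version": 1,
    "blocks": [{"name": "daily_section", "captures": [{"field": "order_date"}]}],
    "entities": [
        {"name": "order", "block": "daily_section",
         "fields": [{"name": "order_date"}]},
        {"name": "ghost", "block": "missing"},
    ],
}
print(validate_template(bad))
```
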

`CellAnchor.column` accepts an int (0-indexed) OR a single Excel letter
("A".."Z"); a `@field_validator` coerces letter -> int. Multi-letter
columns ("AA") are out of scope for v1.
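
A small sketch of the letter-to-index coercion described above. `coerce_column` is a hypothetical standalone function, not the actual `@field_validator`; rejecting lowercase letters and negative ints is an assumption of this sketch.

```python
import string


def coerce_column(value):
    """Return a 0-indexed column for an int or a single letter "A".."Z"."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it explicitly.
        raise ValueError("column must be an int or a single letter")
    if isinstance(value, int):
        if value < 0:
            raise ValueError("column index must be >= 0")
        return value
    if isinstance(value, str) and len(value) == 1 and value in string.ascii_uppercase:
        return ord(value) - ord("A")
    # Multi-letter columns such as "AA" fall through to here.
    raise ValueError(f"unsupported column: {value!r}")


print(coerce_column("D"), coerce_column(3))
```

Both spellings of the same column resolve to one index, so downstream code only ever deals with ints.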

Includes the seed `test_cases/repeating_sections_per_tab/` fixture
that uses the new grammar. Schema validates cleanly; full extraction
fails until the extractor commit lands (entity has no tab/tab_pattern
because the block owns tab scope — that's the correct failure mode
prior to wiring).

103 existing tests pass; the 4 failures are the new seed case waiting
on the extractor.

* feat(extractor): implement block scanning, capture resolution, entity scoping

Wires the `blocks:` grammar through the extractor. With this commit, the
seed `test_cases/repeating_sections_per_tab/` case extracts cleanly:
7 delivery rows across 2 daily sections, each carrying the section's
`delivery_date` and `day_of_week` captured from the section header row.

Mechanics:

  1. `_extract_entity` dispatches to a new `_extract_for_block` path when
     `entity.block` is set. The block (looked up by name from
     `Template.blocks`) owns tab targeting via its own `tab_pattern`;
     `find_tabs` is invoked with a stand-in Locate carrying the block's
     tab scope.

  2. `_find_block_instances` makes a single linear pass through the
     tab, collecting all `starts_at` and `ends_at` matches in their
     respective columns. Instances are paired per strategy:
       - `last_in_block` (default, greedy): pick the LAST ends_at hit
         in (starts_at, next_starts_at] or (starts_at, EOF].
       - `first_in_block`: pick the FIRST.
     When `ends_at` is omitted, the instance extends to next start - 1
     or EOF. When `starts_at` matches nothing, emits
     `block_starts_not_found`. When `ends_at` is configured but no
     candidate fires in the window, emits `block_unterminated`.

  3. `_resolve_captures` scans each capture's `from.column` inside the
     instance's row range. `on_multiple` controls picking (first/last/
     error). Zero matches + `required: true` => `capture_no_match`.
     Captured groups are coerced via `_coerce_capture` (returns ISO
     strings for dates to match the rest of the corpus's JSON shape).

  4. `_extract_flat` gains three optional kwargs:
       - `cell_range_override`: synthesized per block instance so the
         existing CellRange row-window machinery scopes extraction.
       - `separator_rows`: applied AFTER header detection (filtering
         pre-header would shift indices out from under header_idx).
       - `extra_fields`: the propagating captures merged onto each
         emitted row.

  5. `_locate.find_header_row` and `resolve_header_row` accept optional
     `min_row` / `max_row` so an entity's `header_anchor` scan inside a
     block instance is restricted to that instance's range. An outer
     occurrence of the anchor text can't leak in.
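
The pairing rule in step 2 can be sketched as follows. This is an illustration, not `_find_block_instances` itself: it takes two plain lists of cell strings (one per row), uses the window `(starts_at, next_starts_at]` exactly as described above, and leaves out the "`ends_at` omitted" case. The function name and return shape are assumptions.

```python
import re


def find_block_instances(start_col, end_col, start_re, end_re,
                         strategy="last_in_block"):
    """Pair start/end anchor hits into (first_row, last_row) instances.

    start_col / end_col hold the cell strings of the columns the two
    anchors watch, one entry per row. Returns (instances, errors).
    """
    n_rows = len(start_col)
    starts = [i for i, v in enumerate(start_col) if re.search(start_re, v or "")]
    ends = [i for i, v in enumerate(end_col) if re.search(end_re, v or "")]
    if not starts:
        return [], ["block_starts_not_found"]

    instances, errors = [], []
    for k, start in enumerate(starts):
        # The window closes at the next start row, or at the tab's last row.
        limit = starts[k + 1] if k + 1 < len(starts) else n_rows - 1
        hits = [e for e in ends if start < e <= limit]
        if not hits:
            errors.append("block_unterminated")
            continue
        # Greedy default takes the last hit; first_in_block takes the first.
        end = hits[-1] if strategy == "last_in_block" else hits[0]
        instances.append((start, end))
    return instances, errors


col_d = ["ORDER SCHEDULE", "DAY 05-18-2026", "", "",
         "ORDER SCHEDULE", "DAY 05-19-2026", "", ""]
col_a = ["", "", "ORD-0001", "====", "", "", "ORD-0002", "===="]
print(find_block_instances(col_d, col_a, r"^ORDER SCHEDULE$", r"^={3,}$"))
```

The greedy default matters when a separator line also appears inside a section: `last_in_block` keeps the section open until the final separator before the next start.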

Captures whose `propagate: false` are still scanned (and still surface
`capture_no_match` / `capture_multiple_matches` if applicable) but are
not merged onto rows.
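
A hedged sketch of how `on_multiple`, `required`, and `propagate` could interact during capture resolution. The function names, the spec dict layout, and the defaults (`on_multiple="first"`, `required=True`, `propagate=True`) are assumptions made for this example; the source does not state the defaults.

```python
import re


def resolve_capture(cells, pattern, group=1, on_multiple="first", required=True):
    """Scan one column slice of a block instance for a capture value.

    Returns (value, error_code). Both are None when an optional capture
    simply does not occur.
    """
    matches = (re.search(pattern, c or "") for c in cells)
    found = [m.group(group) for m in matches if m]
    if not found:
        return None, ("capture_no_match" if required else None)
    if len(found) > 1 and on_multiple == "error":
        return None, "capture_multiple_matches"
    return (found[-1] if on_multiple == "last" else found[0]), None


def resolve_captures(specs, column_cells):
    """Resolve every capture; only propagating ones land in extra_fields."""
    extra_fields, errors = {}, []
    for spec in specs:
        value, error = resolve_capture(
            column_cells[spec["column"]],
            spec["pattern"],
            group=spec.get("regex_group", 1),
            on_multiple=spec.get("on_multiple", "first"),
            required=spec.get("required", True),
        )
        if error:
            # Errors surface even for captures that would not propagate.
            errors.append((spec["field"], error))
        elif value is not None and spec.get("propagate", True):
            extra_fields[spec["field"]] = value
    return extra_fields, errors


cells = {3: ["ORDER SCHEDULE", "DAY 05-18-2026", "", ""]}
specs = [
    {"field": "order_date", "column": 3, "pattern": r"^DAY (\d+-\d+-\d+)$"},
    {"field": "internal_tag", "column": 3, "pattern": r"^(ORDER) SCHEDULE$",
     "propagate": False},
]
print(resolve_captures(specs, cells))
```
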

Fixture changes:
  - `expected.json` switched to plural `deliveries:` (matches every
    other corpus case's pluralization convention; `_pluralize`
    produces `deliveries` for `cardinality: many`).
  - `input.xlsx`: harmonized the two header rows so both blocks use
    `DOCK` (one was `DOCK #`). The header-variant case will land with
    a future `FieldSpec.alias:` feature outside the blocks grammar.

107 tests pass on both calamine and openpyxl backends.

* feat(extractor): emit structural errors for block / capture failure modes

Adds the five error codes the `blocks:` grammar needs (four raised at
extract time, one at template load) and five programmatic negative
corpus cases that exercise them end-to-end, plus eight template-load
unit tests for the cross-tree validator.

New STRUCTURAL error codes (all in `_errors.py` STRUCTURAL_TYPES; each
also has a human-readable msg in `validator._structural_msg`):

  - block_starts_not_found     starts_at never fires in a matching tab
  - block_unterminated         ends_at configured but no candidate found
                               before next starts_at or EOF
  - capture_no_match           required capture with zero matches in
                               the block instance
  - capture_multiple_matches   capture has >1 hit and on_multiple=error
  - block_ref_not_found        entity.block names an undeclared block
                               (template-load time only)

`_coerce_capture` no longer swallows coercion failures — it raises
`CoercionError`, which `_resolve_captures` catches and re-emits as a
row-level `wrong_type` keyed to the capture so the validator picks it
up alongside other field-level coercion failures.
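
The raise-then-re-emit flow can be sketched like this. `CoercionError` here is a stand-in class defined for the example, and `coerce_capture`, `resolve_typed_capture`, and the issue dict shape are hypothetical; only the overall behavior (raise on failure, catch, record `wrong_type`) comes from the description above.

```python
from datetime import datetime


class CoercionError(ValueError):
    """Stand-in: a captured string cannot be coerced to the declared type."""


def coerce_capture(raw, type_name, date_formats=("%m-%d-%Y",)):
    """Coerce a captured string; dates come back as ISO strings."""
    if type_name == "date":
        for fmt in date_formats:
            try:
                return datetime.strptime(raw, fmt).date().isoformat()
            except ValueError:
                continue
        raise CoercionError(f"{raw!r} matches none of {list(date_formats)}")
    if type_name == "int":
        try:
            return int(raw)
        except ValueError as exc:
            raise CoercionError(str(exc)) from exc
    return raw


def resolve_typed_capture(field, raw, type_name, issues):
    """Return the coerced value, or None after recording a wrong_type issue."""
    try:
        return coerce_capture(raw, type_name)
    except CoercionError as exc:
        # The failure is flagged against the capture instead of being dropped.
        issues.append({"code": "wrong_type", "field": field, "detail": str(exc)})
        return None


issues = []
print(resolve_typed_capture("order_date", "05-18-2026", "date", issues))
print(resolve_typed_capture("order_date", "someday", "date", issues))
print(issues)
```

Because the failure becomes an issue record and not an exception at the top level, row extraction can continue, which fits the `needs_review` verdict described for the wrong_type corpus case.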

New corpus cases (programmatic, regenerable via
`python -m test_cases.generate`), all using synthetic Acme order data:

  blocks_starts_not_found
  blocks_unterminated
  blocks_capture_no_match_required
  blocks_capture_multiple_matches_error
  blocks_capture_wrong_type

The first four set verdict=reject (structural); the wrong_type case is
needs_review (rows still extract, the bad capture is flagged).

Helper notes for fixture authors: `_append_string_row` forces
`data_type = 's'` on cells whose string value starts with `=`.
Without this, openpyxl's `ws.append` writes the value as a formula,
which calamine then evaluates to an empty string when round-tripped
through `to_python(skip_empty_area=False)`. That broke the end-anchor
detection for cells like `====`.

Template-load unit tests in tests/test_blocks_template_validation.py
cover the five cross-tree validator paths plus column-letter coercion
and SkipRowRule's exactly-one-match-mode invariant.

140 tests pass on both backends.

* docs(blocks): grammar guide, errors taxonomy, README bullet, nav entry

Documents the `blocks:` v2 grammar end-to-end:

  - docs/guides/blocks.md         new page; the YAML shape, every Block
                                  field, greedy-ends_at rationale,
                                  column int-vs-letter, capture knobs,
                                  flat-output guarantee, both the
                                  template-load and extract-time error
                                  tables, the three matchers reference,
                                  and explicit v1 non-features (no
                                  nesting, no above-anchor captures,
                                  no merged-cell expansion, no
                                  multi-column anchor).
  - docs/guides/templates.md      adds a Versioning section explaining
                                  `version: 1` vs `version: 2` and the
                                  loader-rejection behavior.
  - docs/guides/streaming.md      adds a section on per-block-instance
                                  buffering and the latency-to-first-row
                                  contract.
  - docs/reference/errors.md      adds a table for the five new
                                  `block_*` / `capture_*` codes plus a
                                  note on capture coercion failures
                                  re-using `wrong_type`.
  - README.md                     adds a "Repeating sections within
                                  one tab" section between
                                  "Multi-entity files" and "Scattered
                                  metadata", with the same Acme orders
                                  example used in the guide.
  - mkdocs.yml                    adds guides/blocks.md to the Guides
                                  nav, between layouts and streaming.

`mkdocs build --strict` passes. Full pytest (140 tests) still green.

* style: apply ruff-format + ensure JSON fixtures end with newline

Pre-commit fixers fired on CI:
  - `ruff-format` reformatted `src/crease/extractor.py`,
    `src/crease/template_model.py`, and `test_cases/cases.py`.
  - `end-of-file-fixer` added trailing newlines to the new blocks-grammar
    corpus JSON files (`expected.json`, `expected_issues.json`).

Also fixes the source of the EoF drift: `test_cases/types.py` now appends
`\n` when writing the two JSON fixture files, so future
`python -m test_cases.generate` runs stay idempotent against the
end-of-file-fixer hook.
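
A minimal sketch of the idempotency fix, assuming the fixtures are written with `json.dumps`. `write_json_fixture` is a hypothetical helper, and the `indent` and `sort_keys` settings are assumptions; the relevant part is the single trailing `"\n"`.

```python
import json
from pathlib import Path


def write_json_fixture(path, payload):
    """Write a JSON fixture that already ends with exactly one newline."""
    # json.dumps emits no trailing newline, so one is appended here. The
    # end-of-file-fixer hook then has nothing left to change.
    text = json.dumps(payload, indent=2, sort_keys=True) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return text
```

Regenerating the corpus twice then yields the same content both times, so the hook no longer produces a diff after `python -m test_cases.generate`.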