28 changes: 28 additions & 0 deletions docs/guides/templates.md
@@ -105,6 +105,34 @@ by OR across rules. `value_pattern` is a regex full-matched against the
 stringified cell value; combine it with `column:` to pin a single
 column.

+## Disambiguating anchored labels
+
+When a worksheet stacks two cover-sheet-style blocks side by side and
+both carry the same labels (a "REPORTING" block in column A and a
+"BILLING" block in column D, each with its own `Company:` /
+`Email:` rows), an `anchor` whose `label_match: "Company:"` would
+default to the first hit. Two optional fields scope the search:
+
+- `column: int` — restrict the scan to a single 0-indexed column.
+- `nth: int` — pick the Nth match (1-indexed; default 1).
+
+```yaml
+fields:
+  - name: reporting_company
+    type: string
+    anchor: { label_match: "Company:", column: 0, value_at: right, offset: 1 }
+  - name: billing_company
+    type: string
+    anchor: { label_match: "Company:", column: 3, value_at: right, offset: 1 }
+  - name: section_two_carrier
+    type: string
+    anchor:
+      label_match: "SHIPPING INFORMATION"
+      nth: 2   # the second occurrence of the label
+      value_at: right
+      offset: 2
+```
+
 ## Templates that pin the read backend

 Crease reads spreadsheets through two interchangeable backends — calamine
18 changes: 14 additions & 4 deletions src/crease/extractor.py
@@ -866,16 +866,26 @@ def _extract_anchored(
 def _find_anchor(grid: list[list[Any]], anchor) -> tuple[int, int] | None:
     target = anchor.label_match
     mode = anchor.match_mode
+    pinned_col = anchor.column
+    nth = max(1, anchor.nth)
+    seen = 0
     for r, row in enumerate(grid):
         for c, val in enumerate(row):
+            if pinned_col is not None and c != pinned_col:
+                continue
             if val is None:
                 continue
             s = str(val).strip()
             if mode == "exact" and s == target:
-                return (r, c)
-            if mode == "contains" and target in s:
-                return (r, c)
-            if mode == "regex" and re.search(target, s):
+                pass
+            elif mode == "contains" and target in s:
+                pass
+            elif mode == "regex" and re.search(target, s):
+                pass
+            else:
+                continue
+            seen += 1
+            if seen == nth:
                 return (r, c)
     return None
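
The scan in `_find_anchor` can be exercised in isolation. The following is a minimal sketch under stated assumptions, not the project's code: `Anchor` here is a plain dataclass standing in for the pydantic model, `find_anchor` is a hypothetical standalone copy of the new loop, and the grid contents (apart from "Globex Corp", which the test asserts on) are made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Anchor:
    # Hypothetical stand-in for the pydantic Anchor model in the diff.
    label_match: str
    match_mode: str = "contains"
    column: Optional[int] = None  # None = search every column
    nth: int = 1                  # 1-indexed occurrence to return


def find_anchor(grid: list[list[Any]], anchor: Anchor) -> Optional[tuple[int, int]]:
    """Return (row, col) of the nth matching label, scanning row by row."""
    nth = max(1, anchor.nth)  # values below 1 are clamped to the first match
    seen = 0
    for r, row in enumerate(grid):
        for c, val in enumerate(row):
            if anchor.column is not None and c != anchor.column:
                continue  # outside the pinned column
            if val is None:
                continue
            s = str(val).strip()
            if anchor.match_mode == "exact":
                hit = s == anchor.label_match
            elif anchor.match_mode == "contains":
                hit = anchor.label_match in s
            elif anchor.match_mode == "regex":
                hit = re.search(anchor.label_match, s) is not None
            else:
                hit = False
            if not hit:
                continue
            seen += 1  # only matches that passed the column filter are counted
            if seen == nth:
                return (r, c)
    return None


# Illustrative side-by-side layout: REPORTING in column A, BILLING in column D.
GRID = [
    ["REPORTING", None, None, "BILLING", None],
    ["Company:", "Acme Inc", None, "Company:", "Globex Corp"],
    ["Email:", "ops@acme.test", None, "Email:", "ap@globex.test"],
]

first = find_anchor(GRID, Anchor("Company:"))                       # (1, 0)
billing = find_anchor(GRID, Anchor("Company:", column=3))           # (1, 3)
second = find_anchor(GRID, Anchor("Company:", nth=2))               # (1, 3)
missing = find_anchor(GRID, Anchor("Company:", column=3, nth=2))    # None
clamped = find_anchor(GRID, Anchor("Company:", nth=0))              # (1, 0)
```

Two behaviours follow from the loop order. The scan is row-major, so `nth: 2` without `column` selects the next match on the same row before any match on a lower row. When both fields are set, `nth` counts only the matches inside the pinned column, which is why `column=3, nth=2` finds nothing in this grid.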

2 changes: 2 additions & 0 deletions src/crease/template_model.py
@@ -51,6 +51,8 @@ class Anchor(BaseModel):
     match_mode: MatchMode = "contains"
     value_at: Direction = "right"
     offset: int = 1
+    column: int | None = None  # restrict label search to a single column; None = any column
+    nth: int = 1  # 1-indexed match to return when the label appears more than once


 class DataEnd(BaseModel):
11 changes: 2 additions & 9 deletions tests/test_field_scan_gaps.py
@@ -356,11 +356,6 @@ def build(wb):
 # ======================================================================


-@pytest.mark.xfail(
-    strict=True,
-    reason="P1-1: Anchor.column not yet implemented; duplicated labels in side-by-side "
-    "blocks always match the first occurrence.",
-)
 def test_anchor_column_scopes_match_to_one_column(tmp_path):
     """Two side-by-side blocks (REPORTING in col A, BILLING in col D) carry
     the same labels. ``anchor.column: 3`` should restrict the search to the
@@ -379,6 +374,7 @@ def build(wb):
 """
 template_id: anchor_column_scope
 version: 1
+description: P1-1 fixture - anchor.column scopes label search
 entities:
   - name: cover
     cardinality: one
@@ -400,10 +396,6 @@ def build(wb):
     assert result.canonical["cover"]["billing_company"] == "Globex Corp"


-@pytest.mark.xfail(
-    strict=True,
-    reason="P1-1: Anchor.nth not yet implemented; cannot pick the Nth occurrence " "of an ambiguous label.",
-)
 def test_anchor_nth_picks_second_match(tmp_path):
     """A label ``SHIPPING INFORMATION`` appears twice on the sheet (a header
     label at row 0 and a sub-section label at row 4). ``nth: 2`` should pick
@@ -424,6 +416,7 @@ def build(wb):
 """
 template_id: anchor_nth
 version: 1
+description: P1-1 fixture - anchor.nth picks second match
 entities:
   - name: cover
     cardinality: one