feat(extract): warn on duplicate source_column; add source_column_index (P1-2) by dev360 · Pull Request #24 · dev360/crease

dev360 · 2026-05-21T21:16:08Z

Summary

Closes P1-2 — duplicate-header detection in _extract_flat.

When two header cells in the same row carry the same normalized text (e.g. DATE in two columns), a template field with source_column: "DATE" used to silently bind to the first occurrence with no signal that the choice was ambiguous.

This PR:

Builds a header → all-matching-columns map in _extract_flat (currently headers.index(wanted) returns only the first).
Emits a header_duplicated warning in report.errors() whenever a field's source_column matches more than one header and the field doesn't pin a specific occurrence. The bind still picks the first column so extraction proceeds, but the operator now sees the ambiguity in the report instead of finding out downstream.
Adds an optional source_column_index: int (0-indexed across matches) on FieldSpec so two fields can map to two occurrences of the same header name.
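
To illustrate the first point, here is a minimal sketch of a header → all-matching-columns map next to the first-match lookup it replaces. The helper name build_header_map and the sample header row are hypothetical and not taken from the crease codebase; the real _extract_flat may structure this differently.

```python
from collections import defaultdict


def build_header_map(headers):
    """Map each normalized header text to every column position where it occurs."""
    columns = defaultdict(list)
    for position, text in enumerate(headers):
        columns[text].append(position)
    return dict(columns)


# Fictitious header row with two side-by-side sub-tables sharing names.
headers = ["ID", "DATE", "WEIGHT", "DATE", "WEIGHT"]

header_map = build_header_map(headers)

# list.index only ever reports the first match, so the second DATE
# column is invisible to a lookup written this way.
first_only = headers.index("DATE")
```

With the map in hand, any entry whose list holds more than one position marks an ambiguous header.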

Why this gap

This is the kind of silent-correctness bug crease's "fail loudly with coordinates" pitch is supposed to catch. The shape comes up in any report where an operator stacked two side-by-side sub-tables that share a header (an opening date and a closing date, an "in" weight and an "out" weight, etc.) — the canonical output looks fine, just wrong, and nothing in the report tells you so.

API additions

FieldSpec.source_column_index: int | None = None — disambiguates when source_column matches multiple header cells.
New error.type code header_duplicated (cell-severity warning). Added to the README error-code taxonomy and docs/guides/templates.md includes a worked example.
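
The two additions could interact roughly as sketched below. This is an assumption-laden illustration, not the PR's code: the FieldSpec here is a stand-in dataclass carrying only the fields named above, resolve_column is a hypothetical helper, and the warning is modeled as a plain dict whose keys other than type are invented for the example. The PR does not say how an out-of-range source_column_index is handled; this sketch simply lets IndexError propagate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSpec:
    """Stand-in for crease's FieldSpec; only the attributes this PR discusses."""
    name: str
    source_column: str
    source_column_index: Optional[int] = None  # 0-indexed across matches


def resolve_column(field, header_map):
    """Return (column_position, warning_or_None) for one field.

    header_map maps header text to the list of column positions carrying it.
    """
    matches = header_map.get(field.source_column, [])
    if not matches:
        raise KeyError(f"no header named {field.source_column!r}")

    # An explicit index pins one occurrence, so there is nothing to warn about.
    if field.source_column_index is not None:
        return matches[field.source_column_index], None

    warning = None
    if len(matches) > 1:
        # Hypothetical payload shape; only the type code comes from the PR.
        warning = {
            "type": "header_duplicated",
            "source_column": field.source_column,
            "columns": list(matches),
        }
    # Unpinned fields still bind to the first occurrence.
    return matches[0], warning


header_map = {"ID": [0], "DATE": [1, 3]}

unpinned = resolve_column(FieldSpec("opened", "DATE"), header_map)
pinned = resolve_column(FieldSpec("closed", "DATE", source_column_index=1), header_map)
unique = resolve_column(FieldSpec("record_id", "ID"), header_map)
```

Here the unpinned DATE field binds to column 1 and carries a warning, while the pinned field binds to column 3 without one.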

Test plan

uv run pytest tests/test_field_scan_gaps.py -q — P0-4 ×2 and P1-2 ×2 pass; 25 gaps remain xfail.
uv run pytest -q — 106 passed, 35 xfailed.
uvx --from 'ruff==0.6.9' ruff check . and ruff format --check . — clean.

All fixtures and copy use only fictitious values per CLAUDE.md.

🤖 Generated with Claude Code

When two header cells in the same row carry the same normalized text (e.g. ``DATE`` in columns A and F), a template field with ``source_column: "DATE"`` used to silently bind to the first column. Operators only noticed when a downstream consumer flagged the wrong date, by which point the canonical output was indistinguishable from a single-DATE workbook.

``_extract_flat`` now builds a header → all-columns map and:

- Emits a structured ``header_duplicated`` warning whenever a field binds to an ambiguous header without an explicit disambiguator. The bind still picks the first occurrence so extraction proceeds, but the ambiguity surfaces in ``report.errors()`` instead of staying silent.
- Honors a new optional ``source_column_index: int`` on ``FieldSpec`` (0-indexed across the matches) to bind a field to a specific occurrence, so two fields can then map to two occurrences of the same header.

Adds the ``header_duplicated`` error code to the README taxonomy and a new "Disambiguating duplicated headers" section in ``docs/guides/templates.md``. Graduates the two xfail tests in ``tests/test_field_scan_gaps.py::§P1-2`` to real acceptance checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dev360 force-pushed the feat/p1-2-duplicate-headers branch from dfa8494 to 701d918 Compare May 22, 2026 13:38

dev360 wants to merge 1 commit into main from feat/p1-2-duplicate-headers
