
feat(extract): warn on duplicate source_column; add source_column_index (P1-2)#24

Open
dev360 wants to merge 1 commit into main from feat/p1-2-duplicate-headers

dev360 commented on May 21, 2026

Summary

Closes P1-2 — duplicate-header detection in _extract_flat.

When two header cells in the same row carry the same normalized text (e.g. DATE in two columns), a template field with source_column: "DATE" used to silently bind to the first occurrence with no signal that the choice was ambiguous.

This PR:

  • Builds a header → all-matching-columns map in _extract_flat (currently headers.index(wanted) returns only the first).
  • Emits a header_duplicated warning in report.errors() whenever a field's source_column matches more than one header and the field doesn't pin a specific occurrence. The bind still picks the first column so extraction proceeds; the operator now sees the ambiguity instead of finding out downstream.
  • Adds an optional source_column_index: int (0-indexed across matches) on FieldSpec so two fields can map to two occurrences of the same header name.
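
To illustrate the first bullet, here is a minimal sketch of a header → all-matching-columns map. It is not the code from this PR: `build_header_map` is a hypothetical helper name, and the normalization used here (strip plus uppercase) is an assumption, since the description does not say how header text is normalized.

```python
from collections import defaultdict


def build_header_map(headers: list[str]) -> dict[str, list[int]]:
    """Map each normalized header text to every column index where it appears."""
    columns: dict[str, list[int]] = defaultdict(list)
    for col, text in enumerate(headers):
        # Assumed normalization: trim whitespace and uppercase.
        columns[text.strip().upper()].append(col)
    return dict(columns)


# Fictitious header row with DATE appearing twice.
headers = ["DATE", "WEIGHT", "NOTES", "DATE"]
header_map = build_header_map(headers)

# list.index stops at the first match, so the second DATE column is invisible to it.
first_only = headers.index("DATE")

# The map keeps every match, which is what makes the ambiguity detectable.
all_matches = header_map["DATE"]
duplicated = {name for name, cols in header_map.items() if len(cols) > 1}
```

With the map in hand, a header is ambiguous exactly when its list of columns has more than one entry.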

Why this gap

This is the kind of silent-correctness bug crease's "fail loudly with coordinates" pitch is supposed to catch. The shape comes up in any report where an operator stacked two side-by-side sub-tables that share a header (an opening date and a closing date, an "in" weight and an "out" weight, etc.) — the canonical output looks fine, just wrong, and nothing in the report tells you so.

API additions

  • FieldSpec.source_column_index: int | None = None — disambiguates when source_column matches multiple header cells.
  • New error.type code header_duplicated (cell-severity warning). It is added to the README error-code taxonomy, and docs/guides/templates.md includes a worked example.
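
A rough sketch of how the two additions might interact, under stated assumptions: the `FieldSpec` below is a stand-in dataclass (only `source_column` and `source_column_index` come from this PR; `name` is hypothetical), `resolve_column` is a hypothetical helper, and the warning is modelled as a plain dict rather than whatever structure report.errors() actually returns.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FieldSpec:
    # Stand-in for the real FieldSpec; `name` is a hypothetical attribute.
    name: str
    source_column: str
    source_column_index: int | None = None  # 0-indexed across the matches


def resolve_column(
    field: FieldSpec,
    header_map: dict[str, list[int]],
    warnings: list[dict],
) -> int:
    """Pick the column a field binds to, recording a warning when ambiguous."""
    matches = header_map.get(field.source_column, [])
    if not matches:
        raise KeyError(f"no header matches {field.source_column!r}")
    if field.source_column_index is not None:
        # Explicit disambiguator: no warning. An out-of-range index raises
        # IndexError here; the real behavior is not stated in the PR.
        return matches[field.source_column_index]
    if len(matches) > 1:
        warnings.append(
            {"type": "header_duplicated", "field": field.name, "columns": matches}
        )
    # Fall back to the first occurrence so extraction can proceed.
    return matches[0]


# Fictitious layout: DATE appears in columns 0 and 5.
header_map = {"DATE": [0, 5], "WEIGHT": [1]}
warnings: list[dict] = []

opened = resolve_column(FieldSpec("opened_on", "DATE", 0), header_map, warnings)
closed = resolve_column(FieldSpec("closed_on", "DATE", 1), header_map, warnings)
unpinned = resolve_column(FieldSpec("date", "DATE"), header_map, warnings)
```

In this sketch the two pinned fields bind to different DATE columns without any warning, while the unpinned field still binds to the first column and leaves one `header_duplicated` entry behind.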

Test plan

  • uv run pytest tests/test_field_scan_gaps.py -q — P0-4 ×2 and P1-2 ×2 pass; 25 gaps remain xfail.
  • uv run pytest -q — 106 passed, 35 xfailed.
  • uvx --from 'ruff==0.6.9' ruff check . and ruff format --check . — clean.

All fixtures and copy use only fictitious values per CLAUDE.md.

🤖 Generated with Claude Code

When two header cells in the same row carry the same normalized text
(e.g. ``DATE`` in columns A and F), a template field with
``source_column: "DATE"`` used to silently bind to the first column.
Operators only noticed when a downstream consumer flagged the wrong
date — by which point the canonical output was indistinguishable from
a single-DATE workbook.

``_extract_flat`` now builds a header → all-columns map and:

- Emits a structured ``header_duplicated`` warning whenever a field
  binds to an ambiguous header without an explicit disambiguator. The
  bind still picks the first occurrence so extraction proceeds, but the
  ambiguity surfaces in ``report.errors()`` instead of staying silent.
- Honors a new optional ``source_column_index: int`` on ``FieldSpec``
  (0-indexed across the matches) to bind a field to a specific
  occurrence — two fields can then map to two occurrences of the same
  header.

Adds the ``header_duplicated`` error code to the README taxonomy and a
new "Disambiguating duplicated headers" section in
``docs/guides/templates.md``. Graduates the two xfail tests in
``tests/test_field_scan_gaps.py::§P1-2`` to real acceptance checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dev360 force-pushed the feat/p1-2-duplicate-headers branch from dfa8494 to 701d918 on May 22, 2026 13:38