feat(extract): warn on duplicate source_column; add source_column_index (P1-2) #24
Open
dev360 wants to merge 1 commit into
When two header cells in the same row carry the same normalized text (e.g. ``DATE`` in columns A and F), a template field with ``source_column: "DATE"`` used to silently bind to the first column. Operators only noticed when a downstream consumer flagged the wrong date, by which point the canonical output was indistinguishable from a single-DATE workbook.

``_extract_flat`` now builds a header → all-columns map and:

- Emits a structured ``header_duplicated`` warning whenever a field binds to an ambiguous header without an explicit disambiguator. The bind still picks the first occurrence so extraction proceeds, but the ambiguity surfaces in ``report.errors()`` instead of staying silent.
- Honors a new optional ``source_column_index: int`` on ``FieldSpec`` (0-indexed across the matches) to bind a field to a specific occurrence, so two fields can map to two occurrences of the same header.

Adds the ``header_duplicated`` error code to the README taxonomy and a new "Disambiguating duplicated headers" section in ``docs/guides/templates.md``. Graduates the two xfail tests in ``tests/test_field_scan_gaps.py::§P1-2`` to real acceptance checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
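The lookup described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: ``Report``, ``build_header_map``, ``resolve_column``, the strip-and-uppercase normalization, and the shape of the warning dict are all assumptions made for the example. Only the ``header_duplicated`` code, ``report.errors()``, and the 0-indexed ``source_column_index`` semantics come from the description above.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical stand-in for the project's report object."""

    _errors: list[dict] = field(default_factory=list)

    def add(self, error: dict) -> None:
        self._errors.append(error)

    def errors(self) -> list[dict]:
        return list(self._errors)


def normalize(text: str) -> str:
    # Assumed normalization; the real rules may differ.
    return text.strip().upper()


def build_header_map(headers: list[str]) -> dict[str, list[int]]:
    """Map each normalized header to every column position where it appears."""
    columns: dict[str, list[int]] = defaultdict(list)
    for position, text in enumerate(headers):
        columns[normalize(text)].append(position)
    return dict(columns)


def resolve_column(
    header_map: dict[str, list[int]],
    wanted: str,
    report: Report,
    source_column_index: int | None = None,
) -> int:
    """Return the column a field binds to, warning when the choice is ambiguous."""
    # A missing header raises KeyError here; that case is out of scope for this sketch.
    matches = header_map[normalize(wanted)]
    if source_column_index is not None:
        # Explicit disambiguator: pick that occurrence, no warning.
        return matches[source_column_index]
    if len(matches) > 1:
        # Ambiguous bind: still proceed with the first match, but say so.
        report.add({"type": "header_duplicated", "header": wanted, "columns": matches})
    return matches[0]


# DATE appears in columns A and F (positions 0 and 5).
headers = ["DATE", "SITE", "WEIGHT", "NOTES", "OPERATOR", "DATE"]
header_map = build_header_map(headers)
report = Report()

first = resolve_column(header_map, "DATE", report)
second = resolve_column(header_map, "DATE", report, source_column_index=1)
```

Here the unpinned lookup binds to position 0 and records one warning, while the pinned lookup binds to position 5 without adding another.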
dfa8494 to 701d918
Summary

Closes P1-2: duplicate-header detection in ``_extract_flat``.

When two header cells in the same row carry the same normalized text (e.g. ``DATE`` in two columns), a template field with ``source_column: "DATE"`` used to silently bind to the first occurrence with no signal that the choice was ambiguous.

This PR:

- Builds a header → all-columns map in ``_extract_flat`` (currently ``headers.index(wanted)`` returns only the first).
- Emits a structured ``header_duplicated`` error on ``report.errors()`` whenever a field's ``source_column`` matches more than one header and the field doesn't pin a specific occurrence. Bind still picks the first column so extraction proceeds; the operator now sees the ambiguity instead of finding out downstream.
- Adds an optional ``source_column_index: int`` (0-indexed across matches) on ``FieldSpec`` so two fields can map to two occurrences of the same header name.

Why this gap

This is the kind of silent-correctness bug that crease's "fail loudly with coordinates" pitch is supposed to catch. The shape comes up in any report where an operator stacked two side-by-side sub-tables that share a header (an opening date and a closing date, an "in" weight and an "out" weight, etc.): the canonical output looks fine but is wrong, and nothing in the report tells you so.

API additions

- ``FieldSpec.source_column_index: int | None = None``: disambiguates when ``source_column`` matches multiple header cells.
- ``error.type`` code ``header_duplicated`` (cell-severity warning). Added to the README error-code taxonomy; ``docs/guides/templates.md`` includes a worked example.

Test plan

- ``uv run pytest tests/test_field_scan_gaps.py -q``: P0-4 ×2 and P1-2 ×2 pass; 25 gaps remain xfail.
- ``uv run pytest -q``: 106 passed, 35 xfailed.
- ``uvx --from 'ruff==0.6.9' ruff check .`` and ``ruff format --check .``: clean.

All fixtures and copy use only fictitious values per CLAUDE.md.
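To illustrate the ``source_column_index`` field listed under API additions, here is a hedged sketch of two fields reading two occurrences of the same header from one row. The ``FieldSpec`` below is a cut-down stand-in: its ``name`` attribute, the ``pick_value`` helper, and the exact-match header comparison are assumptions for the example, and the dates are fictitious.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Illustrative subset of a field spec, not the project's real class."""

    name: str  # hypothetical attribute used as the output key
    source_column: str
    source_column_index: int | None = None  # 0-indexed across matching headers


def pick_value(spec: FieldSpec, headers: list[str], row: list[str]) -> str:
    """Read the cell under the occurrence of the header that the spec pins."""
    matches = [i for i, header in enumerate(headers) if header == spec.source_column]
    # An unpinned field falls back to the first occurrence.
    occurrence = 0 if spec.source_column_index is None else spec.source_column_index
    return row[matches[occurrence]]


# Two side-by-side sub-tables that share their headers.
headers = ["DATE", "WEIGHT", "DATE", "WEIGHT"]
row = ["2001-01-01", "10.0", "2001-01-05", "9.5"]

fields = [
    FieldSpec("opened_on", "DATE", source_column_index=0),
    FieldSpec("closed_on", "DATE", source_column_index=1),
]
record = {spec.name: pick_value(spec, headers, row) for spec in fields}
```

With both fields pinned, ``record`` carries the opening and closing dates separately instead of reading the first ``DATE`` column twice.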
🤖 Generated with Claude Code