Skip to content

fix(db): enforce UNIQUE on edges so INSERT OR IGNORE actually dedupes#102

Open
andreinknv wants to merge 1 commit into
colbymchenry:mainfrom
andreinknv:fix/edges-unique-constraint
Open

fix(db): enforce UNIQUE on edges so INSERT OR IGNORE actually dedupes#102
andreinknv wants to merge 1 commit into
colbymchenry:mainfrom
andreinknv:fix/edges-unique-constraint

Conversation

@andreinknv
Copy link
Copy Markdown
Contributor

Summary

The edges table has id INTEGER PRIMARY KEY AUTOINCREMENT and no other UNIQUE constraint. The codebase uses INSERT OR IGNORE INTO edges ... clearly intending dedup, but autoincrement IDs never conflict, so the OR IGNORE was silently a no-op. Any path that re-emits the same edge (resolver retries, partial-failure re-runs, framework extractors that double-emit) accumulated duplicates → inflated call graphs in codegraph_callers / codegraph_callees results.

What changed

  • Schema: src/db/schema.sql adds UNIQUE INDEX idx_edges_unique ON edges(source, target, kind, COALESCE(line, -1), COALESCE(col, -1)) for fresh installs.
  • Migration v4 (src/db/migrations.ts): existing databases first dedupe pre-existing duplicates (DELETE ... WHERE id NOT IN (SELECT MIN(id) ... GROUP BY ...)) then create the index. Both run inside the migration transaction wrapper, so a crash leaves the DB consistent.
  • COALESCE(line, -1), COALESCE(col, -1) in the unique key: SQLite treats raw NULLs in a UNIQUE index as distinct, which would defeat dedup for edges that don't carry a line/column. The codebase's line numbers are 1-indexed and column numbers are non-negative everywhere — -1 is a safe sentinel (verified via grep across src/).
  • CURRENT_SCHEMA_VERSION bumped 3 → 4. Two existing version-pinned tests updated.

Files changed

File Change
src/db/schema.sql Add UNIQUE INDEX idx_edges_unique for fresh installs
src/db/migrations.ts Bump version to 4; add migration v4 (dedup + index)
__tests__/edges-unique.test.ts 7 regression tests
__tests__/foundation.test.ts Update expected schema version
__tests__/pr19-improvements.test.ts Update expected schema version

Test coverage

  • Exact duplicates (same source/target/kind/line/col) collapse to one row
  • NULL line edges with same source/target/kind collapse via COALESCE
  • Different lines / different kinds remain distinct
  • Batch insertEdges dedupes within one call
  • Re-emission across 100 cycles leaves a single row
  • Migration test: simulate a v3-shaped DB, insert duplicates via raw SQL, run runMigrations(db, 3), assert dedup occurred and the constraint is now enforced

Test plan

  • npm test: 387/387 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation — unrelated to this change)
  • npx tsc --noEmit clean
  • Independent reviewer pass before pushing — verdict APPROVE (informational nits only: schema.sql header comment still says "Version 1", and the migration test simulates the v3 state by surgically rolling back rather than reproducing the historical schema.sql; both deemed acceptable)

🤖 Generated with Claude Code

The edges table has `id INTEGER PRIMARY KEY AUTOINCREMENT` and no other
UNIQUE constraint. The codebase uses

    INSERT OR IGNORE INTO edges (source, target, kind, ...) VALUES (...)

clearly intending dedup, but the only candidate key for OR IGNORE was
the autoincrement id (which never conflicts) — so the OR IGNORE was a
silent no-op. Any code path that re-emits the same edge (resolver retries,
partial-failure re-runs, framework extractors that double-emit) silently
inserted duplicates, inflating call graphs in codegraph_callers/callees.

This change adds a real UNIQUE index on the natural key:

    UNIQUE INDEX idx_edges_unique
      ON edges(source, target, kind, COALESCE(line, -1), COALESCE(col, -1))

COALESCE keeps two NULL line/col values comparable as equal — SQLite
treats raw NULLs in a UNIQUE index as distinct, which would otherwise
defeat dedup for edges that don't carry a line/col (1-indexed everywhere
in this codebase, so -1 is a safe sentinel).

Migration v4 first deduplicates pre-existing rows (DELETE ... WHERE id
NOT IN (SELECT MIN(id) FROM edges GROUP BY source, target, kind,
COALESCE(line, -1), COALESCE(col, -1))) then creates the index. Both run
inside the migration transaction wrapper so a crash leaves the DB
consistent.

CURRENT_SCHEMA_VERSION bumped to 4. Two existing version-pinned tests
updated to match.

## Files changed

| File | Change |
|---|---|
| src/db/schema.sql | Add UNIQUE INDEX idx_edges_unique for fresh installs |
| src/db/migrations.ts | Bump version to 4; add migration v4 (dedup + index) |
| __tests__/edges-unique.test.ts | 7 regression tests |
| __tests__/foundation.test.ts | Update expected schema version |
| __tests__/pr19-improvements.test.ts | Update expected schema version |

## Test plan

- [x] npm test: 387/387 pass on macOS (one pre-existing fs.watch flake under parallel load, passes in isolation)
- [x] npx tsc --noEmit clean
- [x] Independent reviewer pass before pushing — APPROVE; nits-only

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant