Skip to content

test: disable FixDifferences in TestCutoverAtomicityWithConcurrentWrites for #746 repro#764

Merged
morgo merged 1 commit into
block:mainfrom
morgo:repro-746-no-fixdifferences
May 1, 2026
Merged

test: disable FixDifferences in TestCutoverAtomicityWithConcurrentWrites for #746 repro#764
morgo merged 1 commit into
block:mainfrom
morgo:repro-746-no-fixdifferences

Conversation

@morgo
Copy link
Copy Markdown
Collaborator

@morgo morgo commented May 1, 2026

Summary

The root cause of TestCutoverAtomicityWithConcurrentWrites is not yet known. However, it looks like it might be changes lost in the replication stream. Because of the checksum fixing differences, these changes can only be between the checksum start and cutover. By disabling FixDifferences for this test it extends the time window and makes the race hopefully more likely to occur.

🤖 Generated with Claude Code

The checksum step's FixDifferences=true repairs any rows that the copy
or applier path lost during the migration. That repair is what makes
TestCutoverAtomicityWithConcurrentWrites flaky-rather-than-broken in
CI: copy-phase loss is masked, and only the small post-checksum /
pre-cutover window leaks through. id=118 evidence (BUG.md, 2026-05-01)
showed a mid-keyspace miss, which fits a copy-phase loss that
checksum *also* didn't catch — likely because the writer committed
inside checksum's own snapshot window.

Reuse the existing useTestCutover hook (currently set only by this
test) to flip FixDifferences off. With repair disabled, any divergence
the checksum sees becomes a "checksum mismatch" error before partial
cutover runs, and the existing inspectDifferences pass logs the
diverged PKs at Warn. The test now special-cases that error path so
CI gets a clear repro signal instead of confusingly failing on a
later "_old table should exist" assertion.

No behavior change in production — useTestCutover is a test-only
field on Migration with no CLI binding. All other tests use
FixDifferences: true unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morgo morgo marked this pull request as ready for review May 1, 2026 15:34
@morgo morgo merged commit 501e30c into block:main May 1, 2026
12 of 16 checks passed
@morgo morgo deleted the repro-746-no-fixdifferences branch May 1, 2026 16:04
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
… state

`chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr`
and `checkpointHighPtr` were Nil, with the comment "every key is above
because we haven't started copying." This caused the deltaMap and bufferedMap
HasChanged paths to silently drop binlog events for every key during the
window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables
the filter, just before copier startup) and the first `chunker.Next()` call
(which sets `chunkPtr` non-nil).

The intent was: "the chunker's later SELECT will pick the row up from the
source table." That holds only for rows committed before the SELECT's
snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE
`chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late)
AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE).
The row is lost.

This is the root cause of issue block#746. PR block#764 disabled FixDifferences in
TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the
checksum step instead of being masked by repair. After that, the failure
surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent
signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)`
(`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer
transactions committed during the dead zone, autoinc funnels them through in
PK order, all silently dropped."

Fix: return FALSE in the IsNil branch, per this method's own contract above
("if there is any ambiguity, it's important to return FALSE"). The change is
buffered into the delta map. When the chunker advances and the relevant
chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it
via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`.
Idempotent with anything the chunker's own SELECT picked up.

Also updates two tests that asserted the buggy behavior:

  - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after
    a DELETE issued before the chunker's first `Next()`. Now the deltas
    are buffered and drained correctly when feedback advances the
    watermark; the post-final-flush count assertion is unchanged.

  - `TestOptimisticChunkerBasic` previously expected
    `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()`
    (before the first `Next()`). Now that returns FALSE for the same reason.

Verified locally:
  - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/` — 10/10 PASS.
  - Full `pkg/migration` suite — PASS (~3 min).
  - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
… state

`chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr`
and `checkpointHighPtr` were Nil, with the comment "every key is above
because we haven't started copying." This caused the deltaMap and bufferedMap
HasChanged paths to silently drop binlog events for every key during the
window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables
the filter, just before copier startup) and the first `chunker.Next()` call
(which sets `chunkPtr` non-nil).

The intent was: "the chunker's later SELECT will pick the row up from the
source table." That holds only for rows committed before the SELECT's
snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE
`chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late)
AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE).
The row is lost.

This is the root cause of issue block#746. PR block#764 disabled FixDifferences in
TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the
checksum step instead of being masked by repair. After that, the failure
surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent
signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)`
(`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer
transactions committed during the dead zone, autoinc funnels them through in
PK order, all silently dropped."

Fix: return FALSE in the IsNil branch, per this method's own contract above
("if there is any ambiguity, it's important to return FALSE"). The change is
buffered into the delta map. When the chunker advances and the relevant
chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it
via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`.
Idempotent with anything the chunker's own SELECT picked up.

Also updates two tests that asserted the buggy behavior:

  - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after
    a DELETE issued before the chunker's first `Next()`. Now the deltas
    are buffered and drained correctly when feedback advances the
    watermark; the post-final-flush count assertion is unchanged.

  - `TestOptimisticChunkerBasic` previously expected
    `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()`
    (before the first `Next()`). Now that returns FALSE for the same reason.

Verified locally:
  - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/` — 10/10 PASS.
  - Full `pkg/migration` suite — PASS (~3 min).
  - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants