test: disable FixDifferences in TestCutoverAtomicityWithConcurrentWrites for #746 repro#764
Merged
Merged
Conversation
The checksum step's FixDifferences=true repairs any rows that the copy or applier path lost during the migration. That repair is what makes TestCutoverAtomicityWithConcurrentWrites flaky-rather-than-broken in CI: copy-phase loss is masked, and only the small post-checksum / pre-cutover window leaks through. id=118 evidence (BUG.md, 2026-05-01) showed a mid-keyspace miss, which fits a copy-phase loss that checksum *also* didn't catch — likely because the writer committed inside checksum's own snapshot window. Reuse the existing useTestCutover hook (currently set only by this test) to flip FixDifferences off. With repair disabled, any divergence the checksum sees becomes a "checksum mismatch" error before partial cutover runs, and the existing inspectDifferences pass logs the diverged PKs at Warn. The test now special-cases that error path so CI gets a clear repro signal instead of confusingly failing on a later "_old table should exist" assertion. No behavior change in production — useTestCutover is a test-only field on Migration with no CLI binding. All other tests use FixDifferences: true unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aparajon
approved these changes
May 1, 2026
morgo
added a commit
to morgo/spirit
that referenced
this pull request
May 3, 2026
… state `chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr` and `checkpointHighPtr` were Nil, with the comment "every key is above because we haven't started copying." This caused the deltaMap and bufferedMap HasChanged paths to silently drop binlog events for every key during the window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables the filter, just before copier startup) and the first `chunker.Next()` call (which sets `chunkPtr` non-nil). The intent was: "the chunker's later SELECT will pick the row up from the source table." That holds only for rows committed before the SELECT's snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE `chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late) AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE). The row is lost. This is the root cause of issue block#746. PR block#764 disabled FixDifferences in TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the checksum step instead of being masked by repair. After that, the failure surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)` (`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer transactions committed during the dead zone, autoinc funnels them through in PK order, all silently dropped." Fix: return FALSE in the IsNil branch, per this method's own contract above ("if there is any ambiguity, it's important to return FALSE"). The change is buffered into the delta map. When the chunker advances and the relevant chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`. Idempotent with anything the chunker's own SELECT picked up. Also updates two tests that asserted the buggy behavior: - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after a DELETE issued before the chunker's first `Next()`. Now the deltas are buffered and drained correctly when feedback advances the watermark; the post-final-flush count assertion is unchanged. - `TestOptimisticChunkerBasic` previously expected `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()` (before the first `Next()`). Now that returns FALSE for the same reason. Verified locally: - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/` — 10/10 PASS. - Full `pkg/migration` suite — PASS (~3 min). - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS. Refs block#746, block#801, block#797, block#795, block#804. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo
added a commit
to morgo/spirit
that referenced
this pull request
May 3, 2026
… state `chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr` and `checkpointHighPtr` were Nil, with the comment "every key is above because we haven't started copying." This caused the deltaMap and bufferedMap HasChanged paths to silently drop binlog events for every key during the window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables the filter, just before copier startup) and the first `chunker.Next()` call (which sets `chunkPtr` non-nil). The intent was: "the chunker's later SELECT will pick the row up from the source table." That holds only for rows committed before the SELECT's snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE `chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late) AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE). The row is lost. This is the root cause of issue block#746. PR block#764 disabled FixDifferences in TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the checksum step instead of being masked by repair. After that, the failure surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)` (`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer transactions committed during the dead zone, autoinc funnels them through in PK order, all silently dropped." Fix: return FALSE in the IsNil branch, per this method's own contract above ("if there is any ambiguity, it's important to return FALSE"). The change is buffered into the delta map. When the chunker advances and the relevant chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`. Idempotent with anything the chunker's own SELECT picked up. Also updates two tests that asserted the buggy behavior: - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after a DELETE issued before the chunker's first `Next()`. Now the deltas are buffered and drained correctly when feedback advances the watermark; the post-final-flush count assertion is unchanged. - `TestOptimisticChunkerBasic` previously expected `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()` (before the first `Next()`). Now that returns FALSE for the same reason. Verified locally: - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/` — 10/10 PASS. - Full `pkg/migration` suite — PASS (~3 min). - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS. Refs block#746, block#801, block#797, block#795, block#804. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The root cause of TestCutoverAtomicityWithConcurrentWrites is not yet known. However, it looks like it might be changes lost in the replication stream. Because of the checksum fixing differences, these changes can only be between the checksum start and cutover. By disabling FixDifferences for this test it extends the time window and makes the race hopefully more likely to occur.
🤖 Generated with Claude Code