fix(#746): KeyAboveHighWatermark must not silently drop in IsNil state #808
Merged
Force-pushed c2e825a to a8d916f.
… state `chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr` and `checkpointHighPtr` were Nil, with the comment "every key is above because we haven't started copying." This caused the deltaMap and bufferedMap HasChanged paths to silently drop binlog events for every key during the window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables the filter, just before copier startup) and the first `chunker.Next()` call (which sets `chunkPtr` non-nil).

The intent was: "the chunker's later SELECT will pick the row up from the source table." That holds only for rows committed before the SELECT's snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE `chunkPtr` is updated is in a dead zone: invisible to the SELECT (too late) AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE). The row is lost. This is the root cause of issue #746.

PR #764 disabled FixDifferences in TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the checksum step instead of being masked by repair. After that, the failure surfaced deterministically across #801, #797, #795, #804 with a consistent signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)` (`pk=62,63,64,65`; `pk=102-105`; etc.), exactly the shape of "several writer transactions committed during the dead zone, autoinc funnels them through in PK order, all silently dropped."

Fix: return FALSE in the IsNil branch, per this method's own contract above ("if there is any ambiguity, it's important to return FALSE"). The change is buffered into the delta map. When the chunker advances and the relevant chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`. Idempotent with anything the chunker's own SELECT picked up.
Also updates two tests that asserted the buggy behavior:

- `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after a DELETE issued before the chunker's first `Next()`. Now the deltas are buffered and drained correctly when feedback advances the watermark; the post-final-flush count assertion is unchanged.
- `TestOptimisticChunkerBasic` previously expected `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()` (before the first `Next()`). Now that returns FALSE for the same reason.

Verified locally:

- `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/` — 10/10 PASS.
- Full `pkg/migration` suite — PASS (~3 min).
- Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS.

Refs #746, #801, #797, #795, #804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t prior workarounds

Composite chunker had the same IsNil-returns-TRUE bug in two places:

- The "no chunkPtrs and no checkpoint" branch (mirror of the optimistic chunker).
- The "no chunkPtrs but above checkpointHighPtr" branch (resume path).

Both flip to returning FALSE for the same reason: the doc comment on KeyAboveHighWatermark says ambiguity should return FALSE, and these two branches were exactly the ambiguous "we haven't dispatched yet" state.

Tests updated to match: chunker_composite_test.go's two assertions that encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before Open() / before OpenAtWatermark()) now assert FALSE.

Also reverts two prior workarounds that the issue #746 root cause makes unnecessary:

- cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around the _old/_new checksum compare with a strict single-shot `require.Equal`. The original wrapper was added under the theory that "Docker MySQL has variable I/O latency that delays final event delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be masked by 5 s of polling once cutover has completed. With the IsNil fix the deficit doesn't occur in the first place, so the polling is pure noise.
- cutover.go: removes the second back-to-back `FlushUnderTableLock` call introduced by 3001db2. Each call already does flush → BlockWait → flush internally, so the flush's own binlog events (REPLACE INTO _new / DELETE FROM _new) are read back in the same call's BlockWait. The doubled call was a speculative "rare case" hedge for the same divergence the IsNil bug actually caused. A single call is sufficient and matches BUG.md's protocol.

All four changes verified:

- go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/ — 10/10 PASS.
- Full pkg/migration suite — PASS (~4 min).
- Full pkg/table, pkg/repl, pkg/copier suites — PASS.

Refs #746, #801, #797, #795, #804.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fered+buffered

The cutover atomicity test (TestCutoverAtomicityWithConcurrentWrites) has proven its value as a probe for issue #746: it surfaced the KeyAboveHighWatermark IsNil bug across multiple unrelated PRs (#801, #797, #795, #804) once FixDifferences was disabled for it. Until now it only exercised the optimistic chunker in unbuffered mode. Both the composite chunker and the buffered copier path have the same shape and the same dead-zone window between SetWatermarkOptimization(true) and the chunker's first Next(), so they should be probed too.

Refactor the test body into a helper runCutoverAtomicityTest(t, tableName, schema, buffered) and wrap it in four subtests:

- optimistic_unbuffered (table t1concurrent_oub)
- optimistic_buffered (table t1concurrent_obu)
- composite_unbuffered (table t1concurrent_cub)
- composite_buffered (table t1concurrent_cbu)

The composite variants use a (id, x_token) PK so NewChunker selects the composite chunker (optimistic requires single-column auto_increment). Buffered variants opt in via WithBuffered(true); they Skip on the minimal-RBR test runner since buffered requires binlog_row_image=FULL. The four subtests run in parallel under t.Parallel() with unique table names so they don't collide on _<name>_old / _<name>_new artifacts.

The failure-diagnostic block (max(id), missing PKs, diverged rows) now parameterizes the table name so each subtest's logs name themselves correctly. migrationConcurrentWriteThread / doOneMigrationWriteLoop now take a tableName parameter and build their SQL with fmt.Sprintf instead of hard-coding `t1concurrent`.

Verified locally: all 4 variants pass standalone and together (go test -count=5 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/), full pkg/migration suite passes (~3 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed a8d916f to 5932c32.
aparajon approved these changes on May 3, 2026.
morgo added a commit to morgo/spirit that referenced this pull request on May 3, 2026.
…+ deltaMap.Flush

The second issue #746 bug (post-PR #808) is currently invisible at failure time because:

1. The post-cutover CRC compare in TestCutoverAtomicityWithConcurrentWrites uses require.Equal, which halts execution before the diagnostic block that lists missing/diverged PKs can run. Each failure only prints two CRC integers, no per-PK detail.
2. deltaMap.Flush logs num_keys and affected_rows at Debug level but only emits a single info-free line. Even when affected_rows < num_keys (the shape we keep seeing on the failing optimistic_unbuffered variant on the RBR-minimal lane), nothing flags it.

This patch addresses both:

- cutover_test.go: switch the post-cutover CRC require.Equal to assert.Equal so the diagnostic block runs and prints the missing/diverged PK list. The final require.Equal on row counts at the end of the function still hard-fails the test, so the failure mode is unchanged. The function continues until that final assertion, giving us the per-PK signal we need to debug the second bug.
- subscription_map.go: when a flush statement reports affected_rows strictly less than num_keys, emit a Warn-level "issue #746 second-bug" log line with the table, num_keys, affected_rows, missing_at_least, and the statement. This is the smoking-gun pattern observed in CI runs of PR #813 / #814 where the SELECT FROM original WHERE pk IN (...) returns fewer rows than the delta map had keys, i.e. a key was in the delta map but not visible in `original` at the statement's snapshot. The check has false negatives when missing-source rows coincide with target-side replacements that bump affected_rows back up; a tighter check would need an explicit COUNT pre-flight, which we skip for cost.

Verified locally:

- go test -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/ — all 4 variants pass (1.6s).
- Full pkg/repl suite — pass (32s).
- go vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes #746.
Root cause
`chunkerOptimistic.KeyAboveHighWatermark` (and `chunkerComposite.KeyAboveHighWatermark`) returned `true` when the chunker hadn't dispatched any chunks yet (`chunkPtr.IsNil() && checkpointHighPtr.IsNil()`), with the comment "every key is above because we haven't started copying." This caused `deltaMap.HasChanged` and `bufferedMap.HasChanged` to silently drop every binlog event during the window between `runner.go`'s `SetWatermarkOptimization(true)` (filter on, just before copier startup) and the first `chunker.Next()` (chunkPtr set).

The intent was: "the chunker's later SELECT will pick the row up from source." That's only true for rows committed before the SELECT's snapshot. A writer that commits AFTER the snapshot but BEFORE chunkPtr is updated is in a dead zone: invisible to the SELECT (too late) AND silently dropped by the applier (`KeyAboveHighWatermark` returns `true`). The row is permanently lost.

Likelihood in practical workloads
For a real workload to lose a row, all of the following must coincide:
- A write must commit in the window between `SetWatermarkOptimization(true)` and the first `chunker.Next()` advancing `chunkPtr`. This window stretches under CPU/I/O pressure (Docker, race detector, noisy neighbours).
- The checksum (`FixDifferences=true`, default in production) must fail to repair the divergence. Checksum re-checks every chunk and re-copies any whose CRC doesn't match, so the bug is masked the vast majority of the time.

Empirically this matched the failure rate observed in CI: across 99 failed docker-compose runs from 2026-04-30/05-01, only 2 hits on `TestCutoverAtomicityWithConcurrentWrites` (the test purpose-built to provoke this). And before PR #764 disabled `FixDifferences` for that test, even those two hits sometimes recovered after the checksum repair: the original #746 report was a 2-row deficit, not all four PKs in the dead zone.

In a production OLTP migration the practical risk is small but not zero: enough sustained write traffic during setup gives a non-zero chance of one or more silent row losses. The failure mode is silent data loss (rows present in source but missing from target after cutover), so it warrants a fix at any frequency. Most existing users on busy schemas have probably not been bitten, but anyone who has should re-checksum source vs the migrated table.
This is what makes a `v0.12.1` cherry-pick the right call.

Why it took so long to pin down
In production, `FixDifferences=true` re-copies any chunk whose CRC doesn't match, so this bug was almost always masked. The rare flake of `TestCutoverAtomicityWithConcurrentWrites` was the slice that escaped repair.

PR #764 disabled `FixDifferences` for that test specifically so copy-phase loss would surface at the checksum step instead. After that landed, the failure became deterministic across #801, #797, #795, #804 with a consistent signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)` (`pk=62-65`; `pk=102-105`; etc.), exactly the shape of several writer transactions committing during the dead zone, autoinc funneling them through in PK order, all silently dropped.

What this PR does
Optimistic chunker (`chunker_optimistic.go`): the IsNil branch now returns `false` (buffer the change) instead of `true` (drop). Per the method's own contract: "if there is any ambiguity, it's important to return FALSE."

Composite chunker (`chunker_composite.go`): same fix in two places. The `len(chunkPtrs) == 0 && checkpointHighPtr.IsNil()` branch and the `len(chunkPtrs) == 0 && key > checkpointHighPtr` resume branch both now return `false` for the same reason.

Test coverage extended to 4 variants.
`TestCutoverAtomicityWithConcurrentWrites` previously only exercised the optimistic chunker in unbuffered copier mode. The cutover-atomicity probe has proven its value as the canonical detector for #746-shaped bugs, and both the composite chunker and the buffered copier path have the same dead-zone window between `SetWatermarkOptimization(true)` and the chunker's first `Next()`. The test body is now extracted into `runCutoverAtomicityTest(t, tableName, schemaTmpl, buffered)` and invoked four times via `t.Run`:

- `optimistic_unbuffered`
- `optimistic_buffered`
- `composite_unbuffered` (`(id, x_token)` PK)
- `composite_buffered`

The composite variants use a `PRIMARY KEY (id, x_token)` so `NewChunker` selects the composite chunker (the optimistic chunker only handles single-column auto_increment). Buffered variants opt in via `WithBuffered(true)` and skip on the minimal-RBR runner since buffered requires `binlog_row_image=FULL`. All four subtests run in parallel under `t.Parallel()` with unique `t1concurrent_*` table names so they don't collide on `_<name>_old` / `_<name>_new` artifacts.

Test updates for the four tests that encoded the buggy behavior:
- `TestReplClientComplex`: previously expected `GetDeltaLen() == 0` after a `DELETE` issued before the chunker's first `Next()`. Now expects the deletes to accumulate in the delta map and be drained when feedback advances the watermark. End-state row count assertion unchanged.
- `TestOptimisticChunkerBasic`: flipped the `KeyAboveHighWatermark(1)` assertion from TRUE to FALSE for the pre-Next state.
- `TestCompositeChunkerWatermarkOptimizations`: same flip for both pre-Open and pre-Next states.
- `TestCompositeChunkerCheckpointHighPtr`: same flip for the pre-OpenAtWatermark state.

Reverts two prior workarounds that the root-cause fix makes unnecessary:
- `require.Eventually(5s, 250ms, ...)` in the cutover atomicity test. The wrapper was added under the theory that "Docker MySQL has variable I/O latency that delays final event delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be masked by 5 s of polling once cutover has completed; it stays a deficit. Replaced with a strict single-shot `require.Equal`.
- The second `FlushUnderTableLock` in `executeRenameUnderLock` (commit 3001db2). Each call already does flush → BlockWait → flush internally, so the flush's own binlog events (REPLACE INTO _new / DELETE FROM _new) are read back in the same call's BlockWait. The doubled call was a speculative hedge for the same divergence the IsNil bug actually caused. A single call is sufficient.

Verified locally
- `go test -count=5 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/` — all 4 variants × 5 iterations PASS.
- Full `pkg/migration` suite — PASS (~3 min).
- Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS.

Cherry-pick
Recommend cherry-picking the chunker fixes (commit `045117c` for optimistic + the chunker portion of the composite-and-reverts commit) to `v0.12.1`. The two reverts (the require.Eventually wrapper, the second FlushUnderTableLock) are cleanups and don't strictly need to ride along. The 4-variant test extension is independent and not needed in the patch release, but harmless if cherry-picking the whole branch is simpler.

Refs
🤖 Generated with Claude Code