
fix(#746): KeyAboveHighWatermark must not silently drop in IsNil state (#808)

Merged
morgo merged 3 commits into block:main from morgo:fix-746-key-above-watermark-isnil on May 3, 2026

Conversation

@morgo (Collaborator) commented May 3, 2026

Fixes #746.

Root cause

chunkerOptimistic.KeyAboveHighWatermark (and chunkerComposite.KeyAboveHighWatermark) returned true when the chunker hadn't dispatched any chunks yet (chunkPtr.IsNil() && checkpointHighPtr.IsNil()), with the comment "every key is above because we haven't started copying." This caused deltaMap.HasChanged and bufferedMap.HasChanged to silently drop every binlog event during the window between runner.go's SetWatermarkOptimization(true) (filter on, just before copier startup) and the first chunker.Next() (chunkPtr set).

The intent was: "the chunker's later SELECT will pick the row up from source." That's only true for rows committed before the SELECT's snapshot. A writer that commits AFTER the snapshot but BEFORE chunkPtr is updated is in a dead zone — invisible to the SELECT (too late) AND silently dropped by the applier (KeyAboveHighWatermark returns true). The row is permanently lost.

Likelihood in practical workloads

For a real workload to lose a row, all of the following must coincide:

  1. The migration must run while the source table is actively receiving writes — idle or low-traffic migrations are not affected.
  2. A writer must commit during a window of a few milliseconds between SetWatermarkOptimization(true) and the first chunker.Next() advancing chunkPtr. This window stretches under CPU/I/O pressure (Docker, race detector, noisy neighbours).
  3. The committed row must be one whose chunk has not yet been read by the copier's snapshot. (Rows committed before the SELECT's snapshot are picked up by the copier directly and aren't affected.)
  4. The downstream checksum (FixDifferences=true, default in production) must fail to repair the divergence. Checksum re-checks every chunk and re-copies any whose CRC doesn't match — so the bug is masked the vast majority of the time.

Empirically this matched the failure rate observed in CI: across 99 failed docker-compose runs from 2026-04-30/05-01, only 2 hits on TestCutoverAtomicityWithConcurrentWrites (the test purpose-built to provoke this). And before PR #764 disabled FixDifferences for that test, even those two hits sometimes recovered after the checksum repair — the original #746 report was a 2-row deficit, not all four PKs in the dead zone.

In a production OLTP migration the practical risk is small but not zero: enough sustained write traffic during setup gives a non-zero chance of one or more silent row losses. The failure mode is silent data loss (rows present in source but missing from target after cutover), so it warrants a fix at any frequency. Most existing users on busy schemas have probably not been bitten, but anyone who suspects they have should re-checksum the source against the migrated table.

This is what makes a v0.12.1 cherry-pick the right call.

Why it took so long to pin down

In production, FixDifferences=true re-copies any chunk whose CRC doesn't match, so this bug was almost always masked. The rare flake of TestCutoverAtomicityWithConcurrentWrites was the slice that escaped repair.

PR #764 disabled FixDifferences for that test specifically so copy-phase loss would surface at the checksum step instead. After that landed, the failure became deterministic across #801, #797, #795, #804 with a consistent signature: a contiguous run of 3-5 PKs missing from chunk [1, 1001) (pk=62-65; pk=102-105; etc.) — exactly the shape of several writer transactions committing during the dead zone, autoinc funneling them through in PK order, all silently dropped.

What this PR does

  1. Optimistic chunker (chunker_optimistic.go): the IsNil branch now returns false (buffer the change) instead of true (drop). Per the method's own contract: "if there is any ambiguity, it's important to return FALSE."

  2. Composite chunker (chunker_composite.go): same fix in two places. The len(chunkPtrs) == 0 && checkpointHighPtr.IsNil() branch and the len(chunkPtrs) == 0 && key > checkpointHighPtr resume branch — both now return false for the same reason.

  3. Test coverage extended to 4 variants. TestCutoverAtomicityWithConcurrentWrites previously only exercised the optimistic chunker in unbuffered copier mode. The cutover-atomicity probe has proven its value as the canonical detector for #746-shaped bugs, and both the composite chunker and the buffered copier path have the same dead-zone window between SetWatermarkOptimization(true) and the chunker's first Next(). The test body is now extracted into runCutoverAtomicityTest(t, tableName, schemaTmpl, buffered) and invoked four times via t.Run:

    | variant | chunker | copier mode |
    |---|---|---|
    | optimistic_unbuffered | optimistic (single-col auto_inc PK) | unbuffered |
    | optimistic_buffered | optimistic | buffered |
    | composite_unbuffered | composite (forced via (id, x_token) PK) | unbuffered |
    | composite_buffered | composite | buffered |

    The composite variants use a PRIMARY KEY (id, x_token) so NewChunker selects the composite chunker (which only handles single-column auto_increment otherwise). Buffered variants opt in via WithBuffered(true) and skip on the minimal-RBR runner since buffered requires binlog_row_image=FULL. All four subtests run in parallel under t.Parallel() with unique t1concurrent_* table names so they don't collide on _<name>_old / _<name>_new artifacts.

  4. Test updates for the four tests that encoded the buggy behavior:

    • TestReplClientComplex — previously expected GetDeltaLen() == 0 after a DELETE issued before the chunker's first Next(). Now expects the deletes to accumulate in the delta map and be drained when feedback advances the watermark. End-state row count assertion unchanged.
    • TestOptimisticChunkerBasic — flipped KeyAboveHighWatermark(1) assertion from TRUE to FALSE for the pre-Next state.
    • TestCompositeChunkerWatermarkOptimizations — same flip for both pre-Open and pre-Next states.
    • TestCompositeChunkerCheckpointHighPtr — same flip for the pre-OpenAtWatermark state.
  5. Reverts two prior workarounds that the root-cause fix makes unnecessary:

    • require.Eventually(5s, 250ms, ...) in the cutover atomicity test. The wrapper was added under the theory that "Docker MySQL has variable I/O latency that delays final event delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be masked by 5 s of polling once cutover has completed — it stays a deficit. Replaced with a strict single-shot require.Equal.
    • The second FlushUnderTableLock in executeRenameUnderLock (commit 3001db2). Each call already does flush → BlockWait → flush internally, so the flush's own binlog events (REPLACE INTO _new / DELETE FROM _new) are read back in the same call's BlockWait. The doubled call was a speculative hedge for the same divergence the IsNil bug actually caused. A single call is sufficient.

Verified locally

  • go test -count=5 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/ — all 4 variants × 5 iterations PASS.
  • A single run produces:
    --- PASS: TestCutoverAtomicityWithConcurrentWrites/composite_unbuffered (2.86s)
    --- PASS: TestCutoverAtomicityWithConcurrentWrites/optimistic_buffered  (3.04s)
    --- PASS: TestCutoverAtomicityWithConcurrentWrites/optimistic_unbuffered (3.21s)
    --- PASS: TestCutoverAtomicityWithConcurrentWrites/composite_buffered   (3.22s)
    
  • Full pkg/migration suite — PASS (~3 min).
  • Full pkg/repl, pkg/copier, pkg/table suites — PASS.

Cherry-pick

Recommend cherry-picking the chunker fixes (commit 045117c for optimistic + the chunker portion of the composite-and-reverts commit) to v0.12.1. The two reverts (require.Eventually wrapper, second FlushUnderTableLock) are cleanups and don't strictly need to ride along. The 4-variant test extension is independent and not needed in the patch release — but harmless if cherry-picking the whole branch is simpler.

Refs

🤖 Generated with Claude Code

@morgo morgo force-pushed the fix-746-key-above-watermark-isnil branch from c2e825a to a8d916f on May 3, 2026 01:33
morgo and others added 3 commits May 2, 2026 20:12
… state

`chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr`
and `checkpointHighPtr` were Nil, with the comment "every key is above
because we haven't started copying." This caused the deltaMap and bufferedMap
HasChanged paths to silently drop binlog events for every key during the
window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables
the filter, just before copier startup) and the first `chunker.Next()` call
(which sets `chunkPtr` non-nil).

The intent was: "the chunker's later SELECT will pick the row up from the
source table." That holds only for rows committed before the SELECT's
snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE
`chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late)
AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE).
The row is lost.

This is the root cause of issue block#746. PR block#764 disabled FixDifferences in
TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the
checksum step instead of being masked by repair. After that, the failure
surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent
signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)`
(`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer
transactions committed during the dead zone, autoinc funnels them through in
PK order, all silently dropped."

Fix: return FALSE in the IsNil branch, per this method's own contract above
("if there is any ambiguity, it's important to return FALSE"). The change is
buffered into the delta map. When the chunker advances and the relevant
chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it
via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`.
Idempotent with anything the chunker's own SELECT picked up.

Also updates two tests that asserted the buggy behavior:

  - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after
    a DELETE issued before the chunker's first `Next()`. Now the deltas
    are buffered and drained correctly when feedback advances the
    watermark; the post-final-flush count assertion is unchanged.

  - `TestOptimisticChunkerBasic` previously expected
    `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()`
    (before the first `Next()`). Now that returns FALSE for the same reason.

Verified locally:
  - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/` — 10/10 PASS.
  - Full `pkg/migration` suite — PASS (~3 min).
  - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t prior workarounds

Composite chunker had the same IsNil-returns-TRUE bug in two places:
  - The "no chunkPtrs and no checkpoint" branch (mirror of optimistic chunker).
  - The "no chunkPtrs but above checkpointHighPtr" branch (resume path).

Both flip to returning FALSE for the same reason: the doc comment on
KeyAboveHighWatermark says ambiguity should return FALSE, and these two
branches were exactly the ambiguous "we haven't dispatched yet" state.

Tests updated to match: chunker_composite_test.go's two assertions that
encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before
Open() / before OpenAtWatermark()) now assert FALSE.

Also reverts two prior workarounds that the issue-block#746 root cause makes
unnecessary:

  - cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around
    the _old/_new checksum compare with a strict single-shot
    `require.Equal`. The original wrapper was added under the theory that
    "Docker MySQL has variable I/O latency that delays final event
    delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be
    masked by 5 s of polling once cutover has completed. With the IsNil
    fix the deficit doesn't occur in the first place, so the polling is
    pure noise.

  - cutover.go: removes the second back-to-back `FlushUnderTableLock`
    call introduced by 3001db2. Each call already does
    flush → BlockWait → flush internally, so the flush's own
    binlog events (REPLACE INTO _new / DELETE FROM _new) are read back
    in the same call's BlockWait. The doubled call was a speculative
    "rare case" hedge for the same divergence the IsNil bug actually
    caused. A single call is sufficient and matches BUG.md's protocol.

All four changes verified:
  - go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/ — 10/10 PASS.
  - Full pkg/migration suite — PASS (~4 min).
  - Full pkg/table, pkg/repl, pkg/copier suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fered+buffered

The cutover atomicity test (TestCutoverAtomicityWithConcurrentWrites) has
proven its value as a probe for issue block#746 — it surfaced the
KeyAboveHighWatermark IsNil bug across multiple unrelated PRs (block#801,
block#797, block#795, block#804) once FixDifferences was disabled for it. Until now it
only exercised the optimistic chunker in unbuffered mode. Both the
composite chunker and the buffered copier path have the same shape and
the same dead-zone window between SetWatermarkOptimization(true) and
the chunker's first Next(), so they should be probed too.

Refactor the test body into a helper runCutoverAtomicityTest(t,
tableName, schema, buffered) and wrap it in four subtests:

  - optimistic_unbuffered  (table t1concurrent_oub)
  - optimistic_buffered    (table t1concurrent_obu)
  - composite_unbuffered   (table t1concurrent_cub)
  - composite_buffered     (table t1concurrent_cbu)

The composite variants use a (id, x_token) PK so NewChunker selects the
composite chunker (optimistic requires single-column auto_increment).
Buffered variants opt in via WithBuffered(true); they Skip on the
minimal-RBR test runner since buffered requires binlog_row_image=FULL.

The four subtests run in parallel under t.Parallel() with unique table
names so they don't collide on _<name>_old / _<name>_new artifacts. The
failure-diagnostic block (max(id), missing PKs, diverged rows) now
parameterizes the table name so each subtest's logs name themselves
correctly.

migrationConcurrentWriteThread / doOneMigrationWriteLoop now take a
tableName parameter and build their SQL with fmt.Sprintf instead of
hard-coding `t1concurrent`.

Verified locally: all 4 variants pass standalone and together
(go test -count=5 -run TestCutoverAtomicityWithConcurrentWrites
./pkg/migration/), full pkg/migration suite passes (~3 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morgo morgo force-pushed the fix-746-key-above-watermark-isnil branch from a8d916f to 5932c32 on May 3, 2026 02:21
@morgo morgo merged commit be718c6 into block:main May 3, 2026
12 checks passed
@morgo morgo deleted the fix-746-key-above-watermark-isnil branch May 3, 2026 13:25
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
…+ deltaMap.Flush

The second issue-block#746 bug (post-PR-block#808) is currently invisible at failure
time because:

  1. The post-cutover CRC compare in TestCutoverAtomicityWithConcurrentWrites
     uses require.Equal, which halts execution before the diagnostic block
     that lists missing/diverged PKs can run. Each failure only prints two
     CRC integers, no per-PK detail.

  2. deltaMap.Flush logs num_keys and affected_rows at Debug level, but nothing
     flags the suspicious case where affected_rows < num_keys (the shape we keep
     seeing on the failing optimistic_unbuffered variant on the RBR-minimal lane).

This patch addresses both:

  - cutover_test.go: switch the post-cutover CRC require.Equal to
    assert.Equal so the diagnostic block runs and prints the missing/
    diverged PK list. The final require.Equal on row counts at the end of
    the function still hard-fails the test, so the failure mode is
    unchanged. The function continues until that final assertion, giving
    us the per-PK signal we need to debug the second bug.

  - subscription_map.go: when a flush statement reports affected_rows
    strictly less than num_keys, emit a Warn-level "issue block#746 second-bug"
    log line with the table, num_keys, affected_rows, missing_at_least,
    and the statement. This is the smoking-gun pattern observed in CI
    runs of PR block#813 / block#814 where the SELECT FROM original WHERE pk IN (...)
    returns fewer rows than the delta map had keys — i.e. a key was in
    the delta map but not visible in `original` at the statement's
    snapshot. The check has false negatives when missing-source rows
    coincide with target-side replacements that bump affected_rows back
    up; a tighter check would need an explicit COUNT pre-flight which we
    skip for cost.

Verified locally:
  - go test -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/
    — all 4 variants pass (1.6s).
  - Full pkg/repl suite — pass (32s).
  - go vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

flaky test: TestCutoverAtomicityWithConcurrentWrites
