
fix(#746): KeyAboveHighWatermark must not silently drop in IsNil state (#808)

Merged
morgo merged 3 commits into block:main from morgo:fix-746-key-above-watermark-isnil on May 3, 2026

Conversation

@morgo (Collaborator) commented May 3, 2026

Fixes #746.

Root cause

chunkerOptimistic.KeyAboveHighWatermark (and chunkerComposite.KeyAboveHighWatermark) returned true when the chunker hadn't dispatched any chunks yet (chunkPtr.IsNil() && checkpointHighPtr.IsNil()), with the comment "every key is above because we haven't started copying." This caused deltaMap.HasChanged and bufferedMap.HasChanged to silently drop every binlog event during the window between runner.go's SetWatermarkOptimization(true) (filter on, just before copier startup) and the first chunker.Next() (chunkPtr set).

The intent was: "the chunker's later SELECT will pick the row up from source." That's only true for rows committed before the SELECT's snapshot. A writer that commits AFTER the snapshot but BEFORE chunkPtr is updated is in a dead zone — invisible to the SELECT (too late) AND silently dropped by the applier (KeyAboveHighWatermark returns true). The row is permanently lost.

Likelihood in practical workloads

For a real workload to lose a row, all of the following must coincide:

  1. The migration must run while the source table is actively receiving writes — idle or low-traffic migrations are not affected.
  2. A writer must commit during a window of a few milliseconds between SetWatermarkOptimization(true) and the first chunker.Next() advancing chunkPtr. This window stretches under CPU/I/O pressure (Docker, race detector, noisy neighbours).
  3. The committed row must be one whose chunk has not yet been read by the copier's snapshot. (Rows committed before the SELECT's snapshot are picked up by the copier directly and aren't affected.)
  4. The downstream checksum (FixDifferences=true, default in production) must fail to repair the divergence. Checksum re-checks every chunk and re-copies any whose CRC doesn't match — so the bug is masked the vast majority of the time.

Empirically this matched the failure rate observed in CI: across 99 failed docker-compose runs from 2026-04-30/05-01, only 2 hits on TestCutoverAtomicityWithConcurrentWrites (the test purpose-built to provoke this). And before PR #764 disabled FixDifferences for that test, even those two hits sometimes recovered after the checksum repair — the original #746 report was a 2-row deficit, not all four PKs in the dead zone.

In a production OLTP migration the practical risk is small but not zero: enough sustained write traffic during setup gives a non-zero chance of one or more silent row losses. The failure mode is silent data loss (rows present in source but missing from target after cutover), so it warrants a fix at any frequency. Most existing users on busy schemas have probably not been bitten, but anyone who suspects they have should re-checksum the source against the migrated table.

This is what makes a v0.12.1 cherry-pick the right call.

Why it took so long to pin down

In production, FixDifferences=true re-copies any chunk whose CRC doesn't match, so this bug was almost always masked. The rare flake of TestCutoverAtomicityWithConcurrentWrites was the slice that escaped repair.

PR #764 disabled FixDifferences for that test specifically so copy-phase loss would surface at the checksum step instead. After that landed, the failure became deterministic across #801, #797, #795, #804 with a consistent signature: a contiguous run of 3-5 PKs missing from chunk [1, 1001) (pk=62-65; pk=102-105; etc.) — exactly the shape of several writer transactions committing during the dead zone, autoinc funneling them through in PK order, all silently dropped.

What this PR does

  1. Optimistic chunker (chunker_optimistic.go): the IsNil branch now returns false (buffer the change) instead of true (drop). Per the method's own contract: "if there is any ambiguity, it's important to return FALSE."

  2. Composite chunker (chunker_composite.go): same fix in two places. The len(chunkPtrs) == 0 && checkpointHighPtr.IsNil() branch and the len(chunkPtrs) == 0 && key > checkpointHighPtr resume branch — both now return false for the same reason.

  3. Test coverage extended to 4 variants. TestCutoverAtomicityWithConcurrentWrites previously only exercised the optimistic chunker in unbuffered copier mode. The cutover-atomicity probe has proven its value as the canonical detector for #746-shaped bugs, and both the composite chunker and the buffered copier path have the same dead-zone window between SetWatermarkOptimization(true) and the chunker's first Next(). The test body is now extracted into runCutoverAtomicityTest(t, tableName, schemaTmpl, buffered) and invoked four times via t.Run:

    | variant | chunker | copier mode |
    |---|---|---|
    | optimistic_unbuffered | optimistic (single-col auto_inc PK) | unbuffered |
    | optimistic_buffered | optimistic | buffered |
    | composite_unbuffered | composite (forced via (id, x_token) PK) | unbuffered |
    | composite_buffered | composite | buffered |

    The composite variants use a PRIMARY KEY (id, x_token) so NewChunker selects the composite chunker (which only handles single-column auto_increment otherwise). Buffered variants opt in via WithBuffered(true) and skip on the minimal-RBR runner since buffered requires binlog_row_image=FULL. All four subtests run in parallel under t.Parallel() with unique t1concurrent_* table names so they don't collide on _<name>_old / _<name>_new artifacts.

  4. Test updates for the four tests that encoded the buggy behavior:

    • TestReplClientComplex — previously expected GetDeltaLen() == 0 after a DELETE issued before the chunker's first Next(). Now expects the deletes to accumulate in the delta map and be drained when feedback advances the watermark. End-state row count assertion unchanged.
    • TestOptimisticChunkerBasic — flipped KeyAboveHighWatermark(1) assertion from TRUE to FALSE for the pre-Next state.
    • TestCompositeChunkerWatermarkOptimizations — same flip for both pre-Open and pre-Next states.
    • TestCompositeChunkerCheckpointHighPtr — same flip for the pre-OpenAtWatermark state.
  5. Reverts two prior workarounds that the root-cause fix makes unnecessary:

    • require.Eventually(5s, 250ms, ...) in the cutover atomicity test. The wrapper was added under the theory that "Docker MySQL has variable I/O latency that delays final event delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be masked by 5 s of polling once cutover has completed — it stays a deficit. Replaced with a strict single-shot require.Equal.
    • The second FlushUnderTableLock in executeRenameUnderLock (commit 3001db2). Each call already does flush → BlockWait → flush internally, so the flush's own binlog events (REPLACE INTO _new / DELETE FROM _new) are read back in the same call's BlockWait. The doubled call was a speculative hedge for the same divergence the IsNil bug actually caused. A single call is sufficient.

Verified locally

  • go test -count=5 -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/ — all 4 variants × 5 iterations PASS.
  • A single run produces:
    --- PASS: TestCutoverAtomicityWithConcurrentWrites/composite_unbuffered (2.86s)
    --- PASS: TestCutoverAtomicityWithConcurrentWrites/optimistic_buffered  (3.04s)
    --- PASS: TestCutoverAtomicityWithConcurrentWrites/optimistic_unbuffered (3.21s)
    --- PASS: TestCutoverAtomicityWithConcurrentWrites/composite_buffered   (3.22s)
    
  • Full pkg/migration suite — PASS (~3 min).
  • Full pkg/repl, pkg/copier, pkg/table suites — PASS.

Cherry-pick

Recommend cherry-picking the chunker fixes (commit 045117c for optimistic + the chunker portion of the composite-and-reverts commit) to v0.12.1. The two reverts (require.Eventually wrapper, second FlushUnderTableLock) are cleanups and don't strictly need to ride along. The 4-variant test extension is independent and not needed in the patch release — but harmless if cherry-picking the whole branch is simpler.

Refs

🤖 Generated with Claude Code

@morgo morgo force-pushed the fix-746-key-above-watermark-isnil branch from c2e825a to a8d916f on May 3, 2026 01:33
morgo and others added 3 commits May 2, 2026 20:12
… state

`chunkerOptimistic.KeyAboveHighWatermark` returned TRUE when both `chunkPtr`
and `checkpointHighPtr` were Nil, with the comment "every key is above
because we haven't started copying." This caused the deltaMap and bufferedMap
HasChanged paths to silently drop binlog events for every key during the
window between `runner.go`'s `SetWatermarkOptimization(true)` (which enables
the filter, just before copier startup) and the first `chunker.Next()` call
(which sets `chunkPtr` non-nil).

The intent was: "the chunker's later SELECT will pick the row up from the
source table." That holds only for rows committed before the SELECT's
snapshot. A writer that commits AFTER the snapshot timestamp but BEFORE
`chunkPtr` is updated is in a dead zone — invisible to the SELECT (too late)
AND silently dropped by the applier (KeyAboveHighWatermark returns TRUE).
The row is lost.

This is the root cause of issue block#746. PR block#764 disabled FixDifferences in
TestCutoverAtomicityWithConcurrentWrites so copy-phase loss surfaces at the
checksum step instead of being masked by repair. After that, the failure
surfaced deterministically across block#801, block#797, block#795, block#804 with a consistent
signature: a contiguous run of 3-5 PKs missing from chunk `[1, 1001)`
(`pk=62,63,64,65`; `pk=102-105`; etc.) — exactly the shape of "several writer
transactions committed during the dead zone, autoinc funnels them through in
PK order, all silently dropped."

Fix: return FALSE in the IsNil branch, per this method's own contract above
("if there is any ambiguity, it's important to return FALSE"). The change is
buffered into the delta map. When the chunker advances and the relevant
chunk's `Feedback` opens up `KeyBelowLowWatermark`, the next flush applies it
via `REPLACE INTO _new ... SELECT FROM original WHERE id IN (...)`.
Idempotent with anything the chunker's own SELECT picked up.

Also updates two tests that asserted the buggy behavior:

  - `TestReplClientComplex` previously expected `GetDeltaLen() == 0` after
    a DELETE issued before the chunker's first `Next()`. Now the deltas
    are buffered and drained correctly when feedback advances the
    watermark; the post-final-flush count assertion is unchanged.

  - `TestOptimisticChunkerBasic` previously expected
    `KeyAboveHighWatermark(1) == true` immediately after `chunker.Open()`
    (before the first `Next()`). Now that returns FALSE for the same reason.

Verified locally:
  - `go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/` — 10/10 PASS.
  - Full `pkg/migration` suite — PASS (~3 min).
  - Full `pkg/repl`, `pkg/copier`, `pkg/table` suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t prior workarounds

Composite chunker had the same IsNil-returns-TRUE bug in two places:
  - The "no chunkPtrs and no checkpoint" branch (mirror of optimistic chunker).
  - The "no chunkPtrs but above checkpointHighPtr" branch (resume path).

Both flip to returning FALSE for the same reason: the doc comment on
KeyAboveHighWatermark says ambiguity should return FALSE, and these two
branches were exactly the ambiguous "we haven't dispatched yet" state.

Tests updated to match: chunker_composite_test.go's two assertions that
encoded the buggy behavior (`KeyAboveHighWatermark` returns TRUE before
Open() / before OpenAtWatermark()) now assert FALSE.

Also reverts two prior workarounds that the issue-block#746 root cause makes
unnecessary:

  - cutover_test.go: replaces `require.Eventually(5s, 250ms, ...)` around
    the _old/_new checksum compare with a strict single-shot
    `require.Equal`. The original wrapper was added under the theory that
    "Docker MySQL has variable I/O latency that delays final event
    delivery" (commits 9ab22fe, f0f44fb), but a 1-row deficit cannot be
    masked by 5 s of polling once cutover has completed. With the IsNil
    fix the deficit doesn't occur in the first place, so the polling is
    pure noise.

  - cutover.go: removes the second back-to-back `FlushUnderTableLock`
    call introduced by 3001db2. Each call already does
    flush → BlockWait → flush internally, so the flush's own
    binlog events (REPLACE INTO _new / DELETE FROM _new) are read back
    in the same call's BlockWait. The doubled call was a speculative
    "rare case" hedge for the same divergence the IsNil bug actually
    caused. A single call is sufficient and matches BUG.md's protocol.

All four changes verified:
  - go test -count=10 -run TestCutoverAtomicityWithConcurrentWrites
    ./pkg/migration/ — 10/10 PASS.
  - Full pkg/migration suite — PASS (~4 min).
  - Full pkg/table, pkg/repl, pkg/copier suites — PASS.

Refs block#746, block#801, block#797, block#795, block#804.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fered+buffered

The cutover atomicity test (TestCutoverAtomicityWithConcurrentWrites) has
proven its value as a probe for issue block#746 — it surfaced the
KeyAboveHighWatermark IsNil bug across multiple unrelated PRs (block#801,
block#797, block#795, block#804) once FixDifferences was disabled for it. Until now it
only exercised the optimistic chunker in unbuffered mode. Both the
composite chunker and the buffered copier path have the same shape and
the same dead-zone window between SetWatermarkOptimization(true) and
the chunker's first Next(), so they should be probed too.

Refactor the test body into a helper runCutoverAtomicityTest(t,
tableName, schema, buffered) and wrap it in four subtests:

  - optimistic_unbuffered  (table t1concurrent_oub)
  - optimistic_buffered    (table t1concurrent_obu)
  - composite_unbuffered   (table t1concurrent_cub)
  - composite_buffered     (table t1concurrent_cbu)

The composite variants use a (id, x_token) PK so NewChunker selects the
composite chunker (optimistic requires single-column auto_increment).
Buffered variants opt in via WithBuffered(true); they Skip on the
minimal-RBR test runner since buffered requires binlog_row_image=FULL.

The four subtests run in parallel under t.Parallel() with unique table
names so they don't collide on _<name>_old / _<name>_new artifacts. The
failure-diagnostic block (max(id), missing PKs, diverged rows) now
parameterizes the table name so each subtest's logs name themselves
correctly.

migrationConcurrentWriteThread / doOneMigrationWriteLoop now take a
tableName parameter and build their SQL with fmt.Sprintf instead of
hard-coding `t1concurrent`.

Verified locally: all 4 variants pass standalone and together
(go test -count=5 -run TestCutoverAtomicityWithConcurrentWrites
./pkg/migration/), full pkg/migration suite passes (~3 min).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morgo morgo force-pushed the fix-746-key-above-watermark-isnil branch from a8d916f to 5932c32 on May 3, 2026 02:21
@morgo morgo merged commit be718c6 into block:main May 3, 2026
12 checks passed
@morgo morgo deleted the fix-746-key-above-watermark-isnil branch May 3, 2026 13:25
morgo added a commit to morgo/spirit that referenced this pull request May 3, 2026
…+ deltaMap.Flush

The second issue-block#746 bug (post-PR-block#808) is currently invisible at failure
time because:

  1. The post-cutover CRC compare in TestCutoverAtomicityWithConcurrentWrites
     uses require.Equal, which halts execution before the diagnostic block
     that lists missing/diverged PKs can run. Each failure only prints two
     CRC integers, no per-PK detail.

  2. deltaMap.Flush logs num_keys and affected_rows at Debug level, but nothing
     flags the suspicious case where affected_rows < num_keys (the shape we keep
     seeing on the failing optimistic_unbuffered variant on the RBR-minimal lane).

This patch addresses both:

  - cutover_test.go: switch the post-cutover CRC require.Equal to
    assert.Equal so the diagnostic block runs and prints the missing/
    diverged PK list. The final require.Equal on row counts at the end of
    the function still hard-fails the test, so the failure mode is
    unchanged. The function continues until that final assertion, giving
    us the per-PK signal we need to debug the second bug.

  - subscription_map.go: when a flush statement reports affected_rows
    strictly less than num_keys, emit a Warn-level "issue block#746 second-bug"
    log line with the table, num_keys, affected_rows, missing_at_least,
    and the statement. This is the smoking-gun pattern observed in CI
    runs of PR block#813 / block#814 where the SELECT FROM original WHERE pk IN (...)
    returns fewer rows than the delta map had keys — i.e. a key was in
    the delta map but not visible in `original` at the statement's
    snapshot. The check has false negatives when missing-source rows
    coincide with target-side replacements that bump affected_rows back
    up; a tighter check would need an explicit COUNT pre-flight which we
    skip for cost.

Verified locally:
  - go test -run TestCutoverAtomicityWithConcurrentWrites ./pkg/migration/
    — all 4 variants pass (1.6s).
  - Full pkg/repl suite — pass (32s).
  - go vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

flaky test: TestCutoverAtomicityWithConcurrentWrites
