fix(#769): stop reusing Migration in TestMultiChangesDifferentSchemas + improve checksum retry diagnostics by morgo · Pull Request #770 · block/spirit

morgo · 2026-05-01T17:09:18Z

Summary

Fixes #769

Fixes the latent flake and improves the diagnostic shape of SingleChecker's retry-exhausted error so that the next occurrence (here or elsewhere) can actually be triaged.

- Behavioral fix. TestMultiChangesDifferentSchemas reused a single *Migration across five sequential Run() calls. Reusing a *Migration is not a supported production path, and stale state from prior failed Runs was the most plausible explanation for the symptom in #769 (three identical context canceled errors on attempts 1–3 with checksum-progress=0/1 0.00%). Build a fresh *Migration per Run in this test.
- Bail out early on cancelled parent ctx in SingleChecker.Run(). Once the parent ctx is cancelled, every subsequent attempt fails the same way at the first ctx-aware call, so return ctx.Err() directly instead of the generic "checksum failed after N attempts" wrapper.
- Differentiate the final retry-exhausted error. Split into:
  - "checksum errored on every attempt (N/N); last error: %w": the underlying error is wrapped, so context cancellations, connection issues, etc. surface verbatim.
  - "checksum found differences on every attempt (N/N). This likely indicates either a bug in Spirit, or a manual modification to the _new table…": keeps the original guidance for the lossy-ALTER / real-bug shape.

  Previously the single message pointed users at "a bug in Spirit, or a manual modification to the _new table" even when every attempt was just context.Canceled.

The pkg/migration/runner.go lossy-unique-index wrapper is unchanged, so TestUniqueIndexAddNonUnique-style tests continue to assert the same user-facing message.

Test plan

TestMultiChangesDifferentSchemas passes locally.
pkg/checksum/... test suite passes (updated TestUnfixableUniqueChecksum to match the new wording on the diff path).
pkg/migration/... test suite passes on a clean re-run.
CI green.

The hard-to-repro nature of the flake means we can't directly prove the fix; the test change removes the only known mechanism for cross-Run state leakage in this test, and the diagnostic changes ensure that if anything like this recurs the failure mode is much clearer.
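To illustrate why reusing one object across Runs can leak state while a fresh object per Run cannot, here is a toy sketch. The type `toyMigration` and its `stale` field are invented for the example and do not reflect Spirit's real *Migration; the point is only the shape of the test change.

```go
package main

import (
	"errors"
	"fmt"
)

// toyMigration is a hypothetical stand-in for *Migration.
type toyMigration struct {
	Statement string
	stale     bool // state left behind by a failed Run (invented for illustration)
}

// Run fails on an empty statement and, in this toy, marks the object stale.
func (m *toyMigration) Run() error {
	if m.stale {
		return errors.New("stale state from a previous failed Run")
	}
	if m.Statement == "" {
		m.stale = true
		return errors.New("no statement")
	}
	return nil
}

// runReused mutates a single object across sequential Runs (the old test shape).
func runReused(stmts []string) []error {
	m := &toyMigration{}
	errs := make([]error, 0, len(stmts))
	for _, s := range stmts {
		m.Statement = s
		errs = append(errs, m.Run())
	}
	return errs
}

// runFresh builds a new object per Run (the new test shape).
func runFresh(stmts []string) []error {
	errs := make([]error, 0, len(stmts))
	for _, s := range stmts {
		m := &toyMigration{Statement: s}
		errs = append(errs, m.Run())
	}
	return errs
}

func main() {
	stmts := []string{"", "ALTER TABLE t1 ENGINE=InnoDB"}
	fmt.Println("reused:", runReused(stmts)[1])
	fmt.Println("fresh: ", runFresh(stmts)[1])
}
```

In the reused variant the second, otherwise valid Run fails because of what the first one left behind; in the fresh variant it succeeds.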

🤖 Generated with Claude Code

…hemas + improve checksum retry diagnostics

The flaky failure on TestMultiChangesDifferentSchemas reused a single *Migration across five sequential Run() calls. Reusing a *Migration is not a supported production path and stale state from prior failed Runs (replication subscriptions, useTestCutover bookkeeping, etc.) has been seen leaking into the next Run as transient checksum failures. Build a fresh *Migration per Run in this test.

Also harden the SingleChecker retry loop:
- Bail out immediately if the parent context is already cancelled, so the caller sees the real cancellation cause instead of the generic "checksum failed after N attempts" wrapper.
- Differentiate the final error: separate "errored on every attempt" (with the underlying error wrapped via %w) from "found differences on every attempt" (the original lossy-ALTER / bug-in-Spirit shape).

The previous single message conflated the two paths and pointed users at "a bug in Spirit, or a manual modification to the _new table" even when every attempt was just context-canceled.

Closes block#769.

The composite-chunker fix duplicated the inline closure that block#770 introduced in TestCheckpoint. Move it to helpers_test.go so both binlog_test.go and resume_test.go share one implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…king

TestE2EBinlogSubscribingCompositeKey and TestE2EBinlogSubscribingRogueValues both assume a 2-chunk layout: an initial chunk with an upper bound and a final open-ended chunk. Under CI load the first chunk's CopyChunk can exceed ChunkerTarget*DynamicPanicFactor (100ms*5=500ms by default), which triggers the composite chunker's panic path and shrinks chunkSize from 1000 to 100. The next prefetch then returns a real upper-bound row, so the second chunk comes back with both bounds and the equality assertion fails.

Mirror the optimistic chunker fix from PR #770: expose chunkerComposite.SetDynamicChunking and have these two tests call it right after setup. Pins ChunkSize=1000 deterministically without depending on internal Feedback ordering or stretching TargetChunkTime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

morgo added 2 commits May 1, 2026 11:07

Merge branch 'main' into fix-issue-769-flaky-multichanges

33387b5

morgo marked this pull request as ready for review May 1, 2026 17:33

aparajon approved these changes May 1, 2026

Merge branch 'main' into fix-issue-769-flaky-multichanges

561dbca

morgo enabled auto-merge May 1, 2026 20:09

morgo merged commit e2c40ec into block:main May 1, 2026
12 checks passed

morgo mentioned this pull request May 1, 2026

fix(#766,#772): pin chunk size in binlog E2E tests via SetDynamicChunking #774

Merged

4 tasks

morgo merged 3 commits into block:main from morgo:fix-issue-769-flaky-multichanges
