fix(#769): stop reusing Migration in TestMultiChangesDifferentSchemas + improve checksum retry diagnostics #770
Merged
…hemas + improve checksum retry diagnostics

The flaky failure on TestMultiChangesDifferentSchemas reused a single *Migration across five sequential Run() calls. Reusing a *Migration is not a supported production path and stale state from prior failed Runs (replication subscriptions, useTestCutover bookkeeping, etc.) has been seen leaking into the next Run as transient checksum failures. Build a fresh *Migration per Run in this test.

Also harden the SingleChecker retry loop:
- Bail out immediately if the parent context is already cancelled, so the caller sees the real cancellation cause instead of the generic "checksum failed after N attempts" wrapper.
- Differentiate the final error: separate "errored on every attempt" (with the underlying error wrapped via %w) from "found differences on every attempt" (the original lossy-ALTER / bug-in-Spirit shape).

The previous single message conflated the two paths and pointed users at "a bug in Spirit, or a manual modification to the _new table" even when every attempt was just context-canceled.

Closes block#769.
aparajon approved these changes May 1, 2026
4 tasks
morgo added a commit to morgo/spirit that referenced this pull request May 1, 2026
The composite-chunker fix duplicated the inline closure that block#770 introduced in TestCheckpoint. Move it to helpers_test.go so both binlog_test.go and resume_test.go share one implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo added a commit that referenced this pull request May 1, 2026
…king

TestE2EBinlogSubscribingCompositeKey and TestE2EBinlogSubscribingRogueValues both assume a 2-chunk layout: an initial chunk with an upper bound and a final open-ended chunk. Under CI load the first chunk's CopyChunk can exceed ChunkerTarget*DynamicPanicFactor (100ms*5=500ms by default), which triggers the composite chunker's panic path and shrinks chunkSize from 1000 to 100. The next prefetch then returns a real upper-bound row, so the second chunk comes back with both bounds and the equality assertion fails.

Mirror the optimistic chunker fix from PR #770: expose chunkerComposite.SetDynamicChunking and have these two tests call it right after setup. Pins ChunkSize=1000 deterministically without depending on internal Feedback ordering or stretching TargetChunkTime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Fixes #769

Fixes the latent flake and improves the diagnostic shape of SingleChecker's retry-exhausted error so the next occurrence (here or elsewhere) is actually triagable.

- Behavioral fix. TestMultiChangesDifferentSchemas reused a single *Migration across five sequential Run() calls. Reusing a *Migration is not a supported production path, and stale state from prior failed Runs was the most plausible explanation for the symptom in #769, "flaky test: TestMultiChangesDifferentSchemas - checksum fails 3 attempts due to context-canceled, likely from Migration reuse" (three identical context canceled errors on attempts 1–3 with checksum-progress=0/1 0.00%). Build a fresh *Migration per Run in this test.
- Bail out early on cancelled parent ctx in SingleChecker.Run(). Once the parent ctx is cancelled, every subsequent attempt fails the same way at the first ctx-aware call, so return ctx.Err() directly instead of the generic "checksum failed after N attempts" wrapper.
- Differentiate the final retry-exhausted error. Split into:
  - "checksum errored on every attempt (N/N); last error: %w": the underlying error is wrapped, so context cancellations / connection issues / etc. surface verbatim.
  - "checksum found differences on every attempt (N/N). This likely indicates either a bug in Spirit, or a manual modification to the _new table…": keeps the original guidance for the lossy-ALTER / real-bug shape.

  Previously the single message pointed users at "a bug in Spirit, or a manual modification to the _new table" even when every attempt was just context.Canceled.

The pkg/migration/runner.go lossy-unique-index wrapper is unchanged, so TestUniqueIndexAddNonUnique-style tests continue to assert the same user-facing message.

Test plan

- TestMultiChangesDifferentSchemas passes locally.
- pkg/checksum/... test suite passes (updated TestUnfixableUniqueChecksum to match the new wording on the diff path).
- pkg/migration/... test suite passes on a clean re-run.

The hard-to-repro nature of the flake means we can't directly prove the fix; the test change removes the only known mechanism for cross-Run state leakage in this test, and the diagnostic changes ensure that if anything like this recurs the failure mode is much clearer.
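The "fresh *Migration per Run" change follows a general pattern that a toy example can show. `fakeMigration` and `runEach` are hypothetical and do not model the state Spirit actually carries between Runs; the sketch only illustrates why a single-use object should be rebuilt for each run.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeMigration is a toy single-use runner. After Run returns, it is left in a
// finished state, standing in for whatever state a real run leaves behind.
type fakeMigration struct {
	statement string
	finished  bool
}

// Run succeeds once and refuses to run again on the same instance.
func (m *fakeMigration) Run() error {
	if m.finished {
		return errors.New("stale state from a previous Run")
	}
	m.finished = true
	return nil
}

// runEach builds a fresh fakeMigration per statement, so no state can carry
// over from one Run to the next.
func runEach(statements []string) error {
	for _, s := range statements {
		m := &fakeMigration{statement: s}
		if err := m.Run(); err != nil {
			return fmt.Errorf("statement %q: %w", s, err)
		}
	}
	return nil
}

func main() {
	fmt.Println(runEach([]string{"ALTER 1", "ALTER 2", "ALTER 3"})) // prints "<nil>"

	// Reusing one instance, by contrast, fails on the second Run.
	reused := &fakeMigration{statement: "ALTER 1"}
	_ = reused.Run()
	fmt.Println(reused.Run() != nil) // prints "true"
}
```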
🤖 Generated with Claude Code