
fix(#769): stop reusing Migration in TestMultiChangesDifferentSchemas + improve checksum retry diagnostics#770

Merged
morgo merged 3 commits into block:main from morgo:fix-issue-769-flaky-multichanges
May 1, 2026

Conversation

@morgo
Collaborator

@morgo morgo commented May 1, 2026

Summary

Fixes #769

Fixes the latent flake and improves the diagnostic shape of SingleChecker's retry-exhausted error so the next occurrence (here or elsewhere) is actually triagable.

  • Behavioral fix. TestMultiChangesDifferentSchemas reused a single *Migration across five sequential Run() calls. Reusing a *Migration is not a supported production path, and stale state from prior failed Runs was the most plausible explanation for the symptom in #769 (three identical context canceled errors on attempts 1–3 with checksum-progress=0/1 0.00%). The test now builds a fresh *Migration per Run.

  • Bail out early on cancelled parent ctx in SingleChecker.Run(). Once the parent ctx is cancelled, every subsequent attempt fails the same way at the first ctx-aware call — return ctx.Err() directly instead of the generic "checksum failed after N attempts" wrapper.

  • Differentiate the final retry-exhausted error. Split into:

    • checksum errored on every attempt (N/N); last error: %w — the underlying error is wrapped, so context cancellations / connection issues / etc. surface verbatim.
    • checksum found differences on every attempt (N/N). This likely indicates either a bug in Spirit, or a manual modification to the _new table… — keeps the original guidance for the lossy-ALTER / real-bug shape.

    Previously the single message pointed users at "a bug in Spirit, or a manual modification to the _new table" even when every attempt was just context.Canceled.

The pkg/migration/runner.go lossy-unique-index wrapper is unchanged, so TestUniqueIndexAddNonUnique-style tests continue to assert the same user-facing message.

Test plan

  • TestMultiChangesDifferentSchemas passes locally.
  • pkg/checksum/... test suite passes (updated TestUnfixableUniqueChecksum to match the new wording on the diff path).
  • pkg/migration/... test suite passes on a clean re-run.
  • CI green.

The hard-to-repro nature of the flake means we can't directly prove the fix; the test change removes the only known mechanism for cross-Run state leakage in this test, and the diagnostic changes ensure that if anything like this recurs the failure mode is much clearer.
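
The test-side change can be pictured with a small sketch. Everything here is hypothetical: `fakeMigration`, `newMigration`, and `runAll` are illustrative names, and the stand-in fails deterministically on reuse, whereas the real leakage was intermittent. The point is only the pattern of constructing a new instance inside the loop rather than sharing one across Runs.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeMigration is a hypothetical single-use stand-in for *Migration.
type fakeMigration struct {
	Statement string
	used      bool // models state left behind by a previous Run
}

// Run succeeds once; a second call on the same instance sees stale state.
func (m *fakeMigration) Run() error {
	if m.used {
		return errors.New("stale state from a previous Run")
	}
	m.used = true
	return nil
}

// newMigration builds a fresh instance for one statement (assumed helper).
func newMigration(statement string) *fakeMigration {
	return &fakeMigration{Statement: statement}
}

// runAll constructs a new migration per statement, so nothing from one Run
// can leak into the next.
func runAll(statements []string) error {
	for _, stmt := range statements {
		m := newMigration(stmt)
		if err := m.Run(); err != nil {
			return fmt.Errorf("%q: %w", stmt, err)
		}
	}
	return nil
}

func main() {
	stmts := []string{"stmt-1", "stmt-2", "stmt-3", "stmt-4", "stmt-5"}
	fmt.Println("fresh instance per Run:", runAll(stmts))

	// Reusing one instance, as the test did before this change.
	shared := newMigration("stmt-1")
	_ = shared.Run()
	fmt.Println("second Run on the same instance:", shared.Run())
}
```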

🤖 Generated with Claude Code

morgo added 2 commits May 1, 2026 11:07
…hemas + improve checksum retry diagnostics

The flaky TestMultiChangesDifferentSchemas test reused a single
*Migration across five sequential Run() calls. Reusing a *Migration is
not a supported production path and stale state from prior failed Runs
(replication subscriptions, useTestCutover bookkeeping, etc.) has been
seen leaking into the next Run as transient checksum failures. Build a
fresh *Migration per Run in this test.

Also harden the SingleChecker retry loop:

- Bail out immediately if the parent context is already cancelled, so
  the caller sees the real cancellation cause instead of the generic
  "checksum failed after N attempts" wrapper.
- Differentiate the final error: separate "errored on every attempt"
  (with the underlying error wrapped via %w) from "found differences
  on every attempt" (the original lossy-ALTER / bug-in-Spirit shape).
  The previous single message conflated the two paths and pointed
  users at "a bug in Spirit, or a manual modification to the _new
  table" even when every attempt was just context-canceled.

Closes block#769.
@morgo morgo marked this pull request as ready for review May 1, 2026 17:33
@morgo morgo enabled auto-merge May 1, 2026 20:09
@morgo morgo merged commit e2c40ec into block:main May 1, 2026
12 checks passed
morgo added a commit to morgo/spirit that referenced this pull request May 1, 2026
The composite-chunker fix duplicated the inline closure that block#770
introduced in TestCheckpoint. Move it to helpers_test.go so both
binlog_test.go and resume_test.go share one implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
morgo added a commit that referenced this pull request May 1, 2026
…king

TestE2EBinlogSubscribingCompositeKey and TestE2EBinlogSubscribingRogueValues
both assume a 2-chunk layout: an initial chunk with an upper bound and a
final open-ended chunk. Under CI load the first chunk's CopyChunk can
exceed ChunkerTarget*DynamicPanicFactor (100ms*5=500ms by default), which
triggers the composite chunker's panic path and shrinks chunkSize from
1000 to 100. The next prefetch then returns a real upper-bound row, so
the second chunk comes back with both bounds and the equality assertion
fails.

Mirror the optimistic chunker fix from PR #770: expose
chunkerComposite.SetDynamicChunking and have these two tests call it
right after setup. Pins ChunkSize=1000 deterministically without
depending on internal Feedback ordering or stretching TargetChunkTime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

flaky test: TestMultiChangesDifferentSchemas — checksum fails 3 attempts due to context-canceled, likely from Migration reuse
