test: cancel before Close in resume teardown to avoid fatalError race#768
Merged
Conversation
The resume tests started Run() in a goroutine and tore down with Close() → cancel() → <-done. Closing the runner while Run was still executing closed r.db, replClient, and the throttler out from under the live goroutine, opening a window where any in-flight code path could trip fatalError() → dropCheckpoint(), removing the checkpoint table the second runner needed. That matches issue block#767: the strict-mode test expected ErrBinlogNotFound but got nil because the checkpoint table was gone by the time m2 read it. resumeFromCheckpoint then returned a "table missing" error not in strict mode's allowlist, fell through to newMigration, and completed successfully. Likely also explains the shutdown side of block#742, on top of the already-fixed MDL release race. Flip the order to cancel() → <-done → Close() so Run unwinds cleanly: ctx cancellation reaches copier/checksum/replClient (all share the runner ctx), the deferred MDL release runs, Run returns, and only then does Close() tear down the rest. TestCheckpointResumeDuringChecksum and TestCheckpointPhantomRow are left alone — the former intentionally drives checksum manually while Run is sentinel-blocked (commit 2730881), the latter has no Run goroutine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecksum This one was originally left alone because its `Close before cancel` ordering was set explicitly in 2730881 ("fix racey test"). But that commit predates the `<-done` sync point — it was choosing the lesser of two races between bare `Close → cancel` and bare `cancel → Close`. With `<-done` in place, `cancel → <-done → Close` is the safe ordering here too: when cancel fires, Run is sitting in waitOnSentinelTable polling and just returns the ctx error. The deferred MDL release runs, Run returns, <-done syncs, then Close tears down. The manually-invoked r.checksum() ran to completion long before — Run isn't going to re-invoke it. Audit confirmed this is the last instance of the pattern in the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aparajon
approved these changes
May 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The resume tests in
pkg/migration/resume_test.gostartedRun()in a goroutine and tore down withClose() → cancel() → <-done. Closing the runner whileRunwas still executing closedr.db,replClient, and the throttler out from under the live goroutine — and any in-flight path that hit those closed resources could tripfatalError() → dropCheckpoint()(removing the checkpoint table the next runner needed) or stallRun's unwind so the deferred MDL release didn't happen before the test moved on.This flips the ordering to
cancel() → <-done → Close()soRununwinds cleanly: ctx cancellation reaches copier / checksum / replClient (they all share the runner ctx), the deferred MDL release runs insideRun,Runreturns, and only then doesClose()tear down the rest. After<-donewe have a hard guarantee thatRunhas fully exited and every defer has fired.Fixes #767 and fixes #742.
Why this is the bug for #767
The test corrupted the checkpoint binlog name and expected
ErrBinlogNotFoundfromm2.Run, but gotnil. WalkingresumeFromCheckpointandsetup, strict mode only catches three sentinels:For strict mode to silently start fresh,
resumeFromCheckpointhas to return an error not in that allowlist. The most likely candidate is the checkpoint row read failing because the table no longer exists — andfatalError()callsdropCheckpoint().fatalErrorcould fire during the racy m1 shutdown (e.g. the binlog readStream or any apply path hitting the now-closedr.dbbefore the readStream exits cleanly).Why this is the bug for #742
The failure was
could not acquire metadata lock for test.strictoldtest-077e2070, lock is held by another connection— m2'sNewMetadataLocksaw m1's MDL still held. The explicitRELEASE_LOCKadded in 25c92e5 made release synchronous, but that fix relies on the MDL goroutine's<-ctx.Done()handler running, which is triggered bylock.Close()— andlock.Close()is adeferinsideRun. The old test ordering closed the runner first, then cancelled, leaving a window whereRunhadn't yet unwound to its defer when other resources were already torn down. Withcancel() → <-done → Close(), by the time the test moves on,Runhas returned and the MDL has been released synchronously — m2 always sees a free lock.Tests left alone
TestCheckpointPhantomRow— drives the runner manually, noRungoroutine to wait on.(
TestCheckpointResumeDuringChecksumwas the one exception I'd carved out initially, but a follow-up audit showed itsClose before cancelwas just choosing the lesser of two races before<-doneexisted. It's now fixed too in the second commit.)Follow-up worth considering (not in this PR)
Strict mode's silent fallback when the checkpoint table is missing is itself questionable — if a caller asked for strict mode and the checkpoint is gone, surfacing that explicitly is more useful than starting fresh. Worth a separate change.
Test plan
TestResumeFromCheckpointStrictBinlogExpiredandTestResumeFromCheckpointStrictTooOldin a loop locally to confirm no flakes🤖 Generated with Claude Code