
test: cancel before Close in resume teardown to avoid fatalError race #768

Merged
morgo merged 3 commits into block:main from morgo:fix-resume-test-teardown-race
May 1, 2026

Conversation

Collaborator

@morgo morgo commented May 1, 2026

Summary

The resume tests in pkg/migration/resume_test.go started Run() in a goroutine and tore down with Close() → cancel() → <-done. Closing the runner while Run was still executing closed r.db, replClient, and the throttler out from under the live goroutine. Any in-flight path that hit those closed resources could then trip fatalError() → dropCheckpoint() (removing the checkpoint table the next runner needed), or stall Run's unwind so that the deferred MDL release had not happened by the time the test moved on.

This flips the ordering to cancel() → <-done → Close() so Run unwinds cleanly: ctx cancellation reaches copier / checksum / replClient (they all share the runner ctx), the deferred MDL release runs inside Run, Run returns, and only then does Close() tear down the rest. After <-done we have a hard guarantee that Run has fully exited and every defer has fired.

Fixes #767 and fixes #742.

Why this is the bug for #767

The test corrupted the checkpoint binlog name and expected ErrBinlogNotFound from m2.Run, but got nil. Walking resumeFromCheckpoint and setup, strict mode only catches three sentinels:

if r.migration.Strict && (errors.Is(err, status.ErrMismatchedAlter) ||
    errors.Is(err, status.ErrBinlogNotFound) ||
    errors.Is(err, status.ErrCheckpointTooOld)) {
    return err
}
// fall through to newMigration

For strict mode to silently start fresh, resumeFromCheckpoint has to return an error not in that allowlist. The most likely candidate is the checkpoint row read failing because the table no longer exists — and fatalError() calls dropCheckpoint(). fatalError could fire during the racy m1 shutdown (e.g. the binlog readStream or any apply path hitting the now-closed r.db before the readStream exits cleanly).

Why this is the bug for #742

The failure was "could not acquire metadata lock for test.strictoldtest-077e2070, lock is held by another connection": m2's NewMetadataLock saw m1's MDL still held. The explicit RELEASE_LOCK added in 25c92e5 made release synchronous, but that fix relies on the MDL goroutine's <-ctx.Done() handler running. That handler is triggered by lock.Close(), and lock.Close() is a defer inside Run. The old test ordering closed the runner first and then cancelled, leaving a window where Run had not yet unwound to its defer while other resources were already torn down. With cancel() → <-done → Close(), Run has returned and the MDL has been released synchronously by the time the test moves on, so m2 always sees a free lock.
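The hand-off between the two runners can be illustrated with a toy model. Everything here is an assumption made for the sketch: `lockTable` is an in-memory stand-in for server-side metadata locks, and `run` and `handOff` are hypothetical names. The only property it demonstrates is that a defer inside the goroutine's function has completed by the time `<-done` returns.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// lockTable is a hypothetical in-memory stand-in for named metadata locks.
type lockTable struct {
	mu   sync.Mutex
	held map[string]bool
}

var errLockHeld = errors.New("lock is held by another connection")

// acquire takes the named lock, or fails if it is already held.
func (t *lockTable) acquire(name string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.held[name] {
		return errLockHeld
	}
	t.held[name] = true
	return nil
}

// release frees the named lock.
func (t *lockTable) release(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.held, name)
}

// run takes the lock and releases it in a defer, then blocks until cancelled.
func run(ctx context.Context, t *lockTable, name string, started chan<- struct{}) error {
	if err := t.acquire(name); err != nil {
		return err
	}
	defer t.release(name)
	close(started)
	<-ctx.Done()
	return ctx.Err()
}

// handOff reports whether the lock was held while run was live, and the
// result of a second acquire attempted only after cancel and <-done.
func handOff() (heldDuringRun bool, afterDone error) {
	t := &lockTable{held: map[string]bool{}}
	ctx, cancel := context.WithCancel(context.Background())
	started := make(chan struct{})
	done := make(chan error, 1)
	go func() { done <- run(ctx, t, "test.t1", started) }()
	<-started
	heldDuringRun = errors.Is(t.acquire("test.t1"), errLockHeld)
	cancel()
	<-done // run has returned, so its deferred release has already executed
	return heldDuringRun, t.acquire("test.t1")
}

func main() {
	held, err := handOff()
	fmt.Println(held, err) // prints: true <nil>
}
```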

Tests left alone

  • TestCheckpointPhantomRow — drives the runner manually, no Run goroutine to wait on.

(TestCheckpointResumeDuringChecksum was the one exception I'd carved out initially, but a follow-up audit showed its Close before cancel was just choosing the lesser of two races before <-done existed. It's now fixed too in the second commit.)

Follow-up worth considering (not in this PR)

Strict mode's silent fallback when the checkpoint table is missing is itself questionable — if a caller asked for strict mode and the checkpoint is gone, surfacing that explicitly is more useful than starting fresh. Worth a separate change.

Test plan

  • CI green on the affected resume tests
  • Run TestResumeFromCheckpointStrictBinlogExpired and TestResumeFromCheckpointStrictTooOld in a loop locally to confirm no flakes

🤖 Generated with Claude Code

morgo and others added 3 commits May 1, 2026 10:29
The resume tests started Run() in a goroutine and tore down with
Close() → cancel() → <-done. Closing the runner while Run was still
executing closed r.db, replClient, and the throttler out from under
the live goroutine, opening a window where any in-flight code path
could trip fatalError() → dropCheckpoint(), removing the checkpoint
table the second runner needed.

That matches issue block#767: the strict-mode test expected
ErrBinlogNotFound but got nil because the checkpoint table was gone
by the time m2 read it. resumeFromCheckpoint then returned a "table
missing" error not in strict mode's allowlist, fell through to
newMigration, and completed successfully. Likely also explains the
shutdown side of block#742, on top of the already-fixed MDL release race.

Flip the order to cancel() → <-done → Close() so Run unwinds
cleanly: ctx cancellation reaches copier/checksum/replClient (all
share the runner ctx), the deferred MDL release runs, Run returns,
and only then does Close() tear down the rest.

TestCheckpointResumeDuringChecksum and TestCheckpointPhantomRow are
left alone — the former intentionally drives checksum manually while
Run is sentinel-blocked (commit 2730881), the latter has no Run
goroutine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecksum

This one was originally left alone because its `Close before cancel`
ordering was set explicitly in 2730881 ("fix racey test"). But that
commit predates the `<-done` sync point — it was choosing the lesser
of two races between bare `Close → cancel` and bare `cancel → Close`.

With `<-done` in place, `cancel → <-done → Close` is the safe
ordering here too: when cancel fires, Run is sitting in
waitOnSentinelTable polling and just returns the ctx error. The
deferred MDL release runs, Run returns, <-done syncs, then Close
tears down. The manually-invoked r.checksum() ran to completion
long before — Run isn't going to re-invoke it.

Audit confirmed this is the last instance of the pattern in the
repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morgo morgo merged commit 6b092c8 into block:main May 1, 2026
12 checks passed
@morgo morgo deleted the fix-resume-test-teardown-race branch May 1, 2026 20:09

Development

Successfully merging this pull request may close these issues.

  • flaky test: TestResumeFromCheckpointStrictBinlogExpired
  • flaky test: TestResumeFromCheckpointStrictTooOld
