test: cancel before Close in resume teardown to avoid fatalError race by morgo · Pull Request #768 · block/spirit

morgo · 2026-05-01T16:30:40Z

Summary

The resume tests in pkg/migration/resume_test.go started Run() in a goroutine and tore down with Close() → cancel() → <-done. Closing the runner while Run was still executing closed r.db, replClient, and the throttler out from under the live goroutine. Any in-flight path that hit those closed resources could then do one of two things: trip fatalError() → dropCheckpoint(), removing the checkpoint table the next runner needed, or stall Run's unwind so that the deferred MDL release had not happened by the time the test moved on.

This flips the ordering to cancel() → <-done → Close() so Run unwinds cleanly: ctx cancellation reaches copier / checksum / replClient (they all share the runner ctx), the deferred MDL release runs inside Run, Run returns, and only then does Close() tear down the rest. After <-done we have a hard guarantee that Run has fully exited and every defer has fired.
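
The ordering can be illustrated with a minimal sketch. Everything here is a stand-in: `fakeRunner`, its `record` helper, and `teardownOrder` are hypothetical names, not spirit's API, and the "lock release" is just a recorded event. The point is only that the defer inside Run fires before `done` is closed, so Close() is guaranteed to run afterwards.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// fakeRunner is a hypothetical stand-in for the migration runner.
// It only records the order in which teardown steps happen.
type fakeRunner struct {
	mu     sync.Mutex
	events []string
}

func (r *fakeRunner) record(e string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.events = append(r.events, e)
}

// Run blocks until ctx is cancelled, then unwinds through its defer,
// which stands in for the deferred MDL release.
func (r *fakeRunner) Run(ctx context.Context) error {
	defer r.record("run: deferred lock release")
	<-ctx.Done()
	return ctx.Err()
}

// Close stands in for tearing down the db, replClient, and throttler.
func (r *fakeRunner) Close() {
	r.record("close: resources torn down")
}

// teardownOrder performs cancel -> <-done -> Close and returns the
// order in which the events were recorded.
func teardownOrder() []string {
	r := &fakeRunner{}
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done) // closed only after Run has returned
		_ = r.Run(ctx)
	}()

	cancel() // 1. ask Run to stop
	<-done   // 2. wait until Run and its defers have finished
	r.Close() // 3. only now tear down shared resources

	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]string(nil), r.events...)
}

func main() {
	for _, e := range teardownOrder() {
		fmt.Println(e)
	}
}
```

With the old ordering, Close() would be recorded first while Run was still blocked, which is the window the summary describes.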

Fixes #767 and fixes #742.

Why this is the bug for #767

The test corrupted the checkpoint binlog name and expected ErrBinlogNotFound from m2.Run, but got nil. Walking resumeFromCheckpoint and setup, strict mode only catches three sentinels:

if r.migration.Strict && (errors.Is(err, status.ErrMismatchedAlter) ||
    errors.Is(err, status.ErrBinlogNotFound) ||
    errors.Is(err, status.ErrCheckpointTooOld)) {
    return err
}
// fall through to newMigration

For strict mode to silently start fresh, resumeFromCheckpoint has to return an error not in that allowlist. The most likely candidate is the checkpoint row read failing because the table no longer exists — and fatalError() calls dropCheckpoint(). fatalError could fire during the racy m1 shutdown (e.g. the binlog readStream or any apply path hitting the now-closed r.db before the readStream exits cleanly).
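
A small sketch of that fall-through, under stated assumptions: the sentinels below are local stand-ins for the `status` package errors, `errTableMissing` is a hypothetical error representing a failed checkpoint read, and `decide` only mirrors the shape of the check quoted above rather than the real setup code.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the sentinels in the status package.
var (
	ErrMismatchedAlter  = errors.New("mismatched alter")
	ErrBinlogNotFound   = errors.New("binlog not found")
	ErrCheckpointTooOld = errors.New("checkpoint too old")

	// errTableMissing is an assumed error for a checkpoint read that
	// fails because the table was dropped; it is not in the allowlist.
	errTableMissing = errors.New("checkpoint table does not exist")
)

// decide mirrors the shape of the strict-mode check: only the three
// allowlisted sentinels are returned to the caller; any other resume
// error falls through to starting a new migration.
func decide(strict bool, resumeErr error) string {
	if resumeErr == nil {
		return "resumed"
	}
	if strict && (errors.Is(resumeErr, ErrMismatchedAlter) ||
		errors.Is(resumeErr, ErrBinlogNotFound) ||
		errors.Is(resumeErr, ErrCheckpointTooOld)) {
		return "returned error"
	}
	return "started fresh"
}

func main() {
	// A wrapped sentinel still matches via errors.Is.
	wrapped := fmt.Errorf("resume: %w", ErrBinlogNotFound)
	fmt.Println(decide(true, wrapped))
	// A missing-table error is not allowlisted, so strict mode falls through.
	fmt.Println(decide(true, errTableMissing))
}
```

In this sketch a dropped checkpoint table turns an expected ErrBinlogNotFound into a silent fresh start, which matches the nil result seen in #767.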

Why this is the bug for #742

The failure was "could not acquire metadata lock for test.strictoldtest-077e2070, lock is held by another connection": m2's NewMetadataLock saw m1's MDL still held. The explicit RELEASE_LOCK added in 25c92e5 made release synchronous, but that fix relies on the MDL goroutine's <-ctx.Done() handler running. That handler is triggered by lock.Close(), and lock.Close() is a defer inside Run. The old test ordering closed the runner first and cancelled second, leaving a window where Run had not yet unwound to its defer while other resources were already torn down. With cancel() → <-done → Close(), by the time the test moves on, Run has returned and the MDL has been released synchronously, so m2 always sees a free lock.
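
The lock hand-off can be sketched with an in-memory stand-in for the server's named locks. `lockTable`, `run`, and `secondRunnerAcquires` are hypothetical names, and the real MDL code involves a database connection that this sketch does not model; it only shows why waiting on `done` guarantees that the second acquire succeeds.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// lockTable is a hypothetical in-memory stand-in for named locks.
type lockTable struct {
	mu   sync.Mutex
	held map[string]bool
}

func (t *lockTable) acquire(name string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.held[name] {
		return errors.New("lock is held by another connection")
	}
	t.held[name] = true
	return nil
}

func (t *lockTable) release(name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.held, name)
}

// run takes the lock, signals that it holds it, and releases it in a
// defer, mirroring a lock release that is deferred inside Run.
func run(ctx context.Context, t *lockTable, name string, ready chan<- struct{}) error {
	if err := t.acquire(name); err != nil {
		return err
	}
	defer t.release(name)
	close(ready)
	<-ctx.Done()
	return ctx.Err()
}

// secondRunnerAcquires tears down a first runner with cancel -> <-done
// and reports whether a second acquire of the same lock succeeds.
func secondRunnerAcquires() bool {
	t := &lockTable{held: map[string]bool{}}
	ctx, cancel := context.WithCancel(context.Background())
	ready := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		_ = run(ctx, t, "test.t1", ready)
	}()

	<-ready  // first runner now holds the lock
	cancel() // ask it to stop
	<-done   // its deferred release has run by the time this returns
	return t.acquire("test.t1") == nil
}

func main() {
	fmt.Println(secondRunnerAcquires())
}
```

Skipping the `<-done` wait would make the second acquire race with the deferred release, which is the kind of window behind the "lock is held by another connection" failure.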

Tests left alone

TestCheckpointPhantomRow — drives the runner manually, no Run goroutine to wait on.

(TestCheckpointResumeDuringChecksum was the one exception I'd carved out initially, but a follow-up audit showed its Close before cancel was just choosing the lesser of two races before <-done existed. It's now fixed too in the second commit.)

Follow-up worth considering (not in this PR)

Strict mode's silent fallback when the checkpoint table is missing is itself questionable — if a caller asked for strict mode and the checkpoint is gone, surfacing that explicitly is more useful than starting fresh. Worth a separate change.

Test plan

CI green on the affected resume tests
Run TestResumeFromCheckpointStrictBinlogExpired and TestResumeFromCheckpointStrictTooOld in a loop locally to confirm no flakes

🤖 Generated with Claude Code

The resume tests started Run() in a goroutine and tore down with Close() → cancel() → <-done. Closing the runner while Run was still executing closed r.db, replClient, and the throttler out from under the live goroutine, opening a window where any in-flight code path could trip fatalError() → dropCheckpoint(), removing the checkpoint table the second runner needed.

That matches issue block#767: the strict-mode test expected ErrBinlogNotFound but got nil because the checkpoint table was gone by the time m2 read it. resumeFromCheckpoint then returned a "table missing" error not in strict mode's allowlist, fell through to newMigration, and completed successfully. Likely also explains the shutdown side of block#742, on top of the already-fixed MDL release race.

Flip the order to cancel() → <-done → Close() so Run unwinds cleanly: ctx cancellation reaches copier/checksum/replClient (all share the runner ctx), the deferred MDL release runs, Run returns, and only then does Close() tear down the rest.

TestCheckpointResumeDuringChecksum and TestCheckpointPhantomRow are left alone: the former intentionally drives checksum manually while Run is sentinel-blocked (commit 2730881), the latter has no Run goroutine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ecksum

This one was originally left alone because its `Close before cancel` ordering was set explicitly in 2730881 ("fix racey test"). But that commit predates the `<-done` sync point: it was choosing the lesser of two races between bare `Close → cancel` and bare `cancel → Close`.

With `<-done` in place, `cancel → <-done → Close` is the safe ordering here too: when cancel fires, Run is sitting in waitOnSentinelTable polling and just returns the ctx error. The deferred MDL release runs, Run returns, <-done syncs, then Close tears down. The manually-invoked r.checksum() ran to completion long before, and Run isn't going to re-invoke it.

Audit confirmed this is the last instance of the pattern in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

morgo and others added 3 commits May 1, 2026 10:29

Merge branch 'main' into fix-resume-test-teardown-race

55a7460

aparajon approved these changes May 1, 2026

morgo merged commit 6b092c8 into block:main May 1, 2026
12 checks passed

morgo deleted the fix-resume-test-teardown-race branch May 1, 2026 20:09
