Skip to content

fix(#773): wait for WatchTask goroutines in Runner.Close()#776

Merged
morgo merged 3 commits into
block:mainfrom
morgo:fix-issue-773-cleanup-checkpoint-race
May 1, 2026
Merged

fix(#773): wait for WatchTask goroutines in Runner.Close()#776
morgo merged 3 commits into
block:mainfrom
morgo:fix-issue-773-cleanup-checkpoint-race

Conversation

@morgo
Copy link
Copy Markdown
Collaborator

@morgo morgo commented May 1, 2026

Summary

  • Fixes flaky test: TestResumeFromCheckpointCleanupOnFailure #773
  • status.WatchTask spawns the periodic status and checkpoint dumpers as fire-and-forget goroutines that exit only when their parent ctx is done. Runner.Close() did not wait for them: a late DumpCheckpoint INSERT could land after Close() returned, racing with test-side mutations of the checkpoint table.
  • In TestResumeFromCheckpointCleanupOnFailure this manifests as the checkpoint UPDATE being out-raced by a background INSERT — the new row has a higher auto-increment id and a real binlog name, so the resumed migration's ORDER BY id DESC LIMIT 1 picks the valid row and sets usedResumeFromCheckpoint=true, failing require.False(t, m2.usedResumeFromCheckpoint).
  • Have WatchTask return a wait function. Both callers (pkg/migration, pkg/move) store it and invoke it from Close() (after cancelling their ctx) before tearing down DB connections. Close() also now invokes the runner's cancelFunc explicitly so the background goroutines are guaranteed to observe ctx.Done() even on paths that bypass Run's deferred cancel.
  • This is the broader of the two fixes morgo proposed in the issue (option 2). It eliminates this whole class of "late background-goroutine writes after Close" races, not just the specific symptom in flaky test: TestResumeFromCheckpointCleanupOnFailure #773.

Test plan

  • go test -run TestResumeFromCheckpointCleanupOnFailure -count=5 ./pkg/migration/
  • go test -run TestResumeFrom -count=2 ./pkg/migration/ (all resume tests, ~130s)
  • go test -short ./pkg/migration/... (no regressions; only pre-existing privilege-test failures unrelated to this change)
  • go build ./... && go vet ./...

🤖 Generated with Claude Code

morgo and others added 3 commits May 1, 2026 15:16
status.WatchTask spawns the periodic status and checkpoint dumpers as
fire-and-forget goroutines that exit only when their parent ctx is done.
Runner.Close() did not wait for those goroutines: a late
DumpCheckpoint INSERT could land after Close() returned, racing with
test-side mutations of the checkpoint table.

In TestResumeFromCheckpointCleanupOnFailure this manifested as the
checkpoint UPDATE being "out-raced" by a background INSERT — the new
row had a higher auto-increment id and a real binlog name, so the
resumed migration's `ORDER BY id DESC LIMIT 1` picked the valid row
and set usedResumeFromCheckpoint=true.

Have WatchTask return a wait function. Both callers (pkg/migration,
pkg/move) store it and invoke it from Close() (after cancelling their
ctx) before tearing down DB connections. Close() also now invokes the
runner's cancelFunc explicitly so the background goroutines are
guaranteed to observe ctx.Done() even on paths that bypass Run's
deferred cancel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleaner than wg.Add(N) + per-goroutine defer wg.Done(). Available since
Go 1.25; this module is on 1.26.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@morgo morgo merged commit 1e6930d into block:main May 1, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

flaky test: TestResumeFromCheckpointCleanupOnFailure

2 participants