fix(scheduler): stop FragmentTable cleanup goroutine on shutdown #7599

Open
sandy2008 wants to merge 1 commit into cortexproject:master from sandy2008:fix/scheduler-fragmenttable-cleanup-lifecycle

@sandy2008

What this PR does

Fixes a goroutine/ticker leak in the query-scheduler's FragmentTable (issue #7596).

NewFragmentTable (pkg/scheduler/fragment_table/fragment_table.go) starts a periodicCleanup goroutine driven by a time.Ticker, but it had no stop mechanism: the loop ranged over ticker.C forever, the deferred ticker.Stop() was unreachable, and FragmentTable had no Close/Stop method. The goroutine and ticker therefore lived for the whole process lifetime and were orphaned on scheduler shutdown.

This adds an idempotent Close() (a done channel closed once via sync.Once); periodicCleanup now selects on the ticker and the done channel. The scheduler closes the fragment table in its stopping hook so the goroutine is reclaimed on shutdown.

Why

FragmentTable is a per-scheduler singleton (NewScheduler → fragment_table.NewFragmentTable(2 * time.Minute)), constructed unconditionally, so in production it does not leak repeatedly. However, the goroutine was orphaned rather than gracefully stopped on scheduler shutdown, and an unstoppable background goroutine is inconsistent with the scheduler's services.Service lifecycle. It is also a go.uber.org/goleak hazard: each NewFragmentTable call in tests leaked a goroutine with no way to stop it.

How the fix resolves it

  • Close() closes the done channel exactly once via sync.Once, so it is safe to call repeatedly and concurrently.
  • periodicCleanup returns on <-f.done, so the deferred ticker.Stop() now runs and both the goroutine and the ticker are released.
  • Scheduler.stopping calls s.fragmentTable.Close() before stopping the subservices manager. Closing is a non-blocking channel close, independent of subservice teardown, so the ordering is harmless and there is no deadlock. Close() only touches done/closeOnce, never the mu/mappings state, so it cannot race with GetAddrByID/AddAddressByID/cleanupExpired.
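The ordering and concurrency claims in the list above can be illustrated with a small sketch. The lower-case `scheduler` and `fragmentTable` types and the `stopSubservices` field are hypothetical stand-ins; in particular, `stopSubservices` only represents "stop the subservices manager" and is not the real call used in scheduler.go.

```go
package main

import (
	"fmt"
	"sync"
)

// fragmentTable keeps only the two fields that Close touches.
type fragmentTable struct {
	done      chan struct{}
	closeOnce sync.Once
}

func newFragmentTable() *fragmentTable {
	return &fragmentTable{done: make(chan struct{})}
}

// Close is a non-blocking channel close guarded by sync.Once.
func (f *fragmentTable) Close() {
	f.closeOnce.Do(func() { close(f.done) })
}

// scheduler is a hypothetical stand-in for the real Scheduler.
type scheduler struct {
	fragmentTable   *fragmentTable
	stopSubservices func() error // assumption: represents subservice teardown
}

// stopping closes the table first; because Close cannot block, it cannot
// delay or deadlock the subservice teardown that follows.
func (s *scheduler) stopping(_ error) error {
	s.fragmentTable.Close()
	return s.stopSubservices()
}

func main() {
	s := &scheduler{
		fragmentTable:   newFragmentTable(),
		stopSubservices: func() error { return nil },
	}

	// Concurrent and repeated Close calls are safe.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.fragmentTable.Close()
		}()
	}
	wg.Wait()

	err := s.stopping(nil)
	fmt.Println("stopping returned:", err)
}
```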

Why not the services.Service approach

We considered making FragmentTable a services.Service (services.NewTimerService) registered in the scheduler's services.Manager. We rejected it for this fix after empirically testing both: in the one scenario where it would "help" — tests that construct a Scheduler via NewScheduler but never start it — it is a lateral move: it removes the periodicCleanup goroutine but adds a services.Manager listener goroutine, so those tests still leak. Meanwhile it changes behavior (cleanup runs only once the service is Started) and requires start/stop wiring across the existing fragment-table tests. The minimal done/sync.Once/Close() idiom preserves semantics, stays tightly scoped, and fixes the production leak identically — mirroring #7578's tight-scope-for-a-bugfix approach.

Tests

  • A package-level goleak.VerifyTestMain guard, plus t.Cleanup(ft.Close) on every existing test that constructs a table, so no FragmentTable leaks in the package's tests.
  • TestFragmentTable_Close: asserts Close stops the cleanup goroutine (enforced by the goleak guard) and is idempotent (a second call does not panic).
  • fail-without-fix verified: reverting only periodicCleanup to for range ticker.C { ... } (keeping Close/done so it compiles) makes the goleak guard fail, reporting the leaked periodicCleanup goroutine created by NewFragmentTable.

Scope

Two small edits in pkg/scheduler/fragment_table/fragment_table.go, a one-line Close() call in pkg/scheduler/scheduler.go's stopping, and pkg/scheduler/fragment_table/fragment_table_test.go test additions; plus one CHANGELOG.md line. No flags, config, or production behavior change beyond reclaiming the goroutine on shutdown.

Out of scope (pre-existing, noted for follow-up): three TestQueryFragmentRegistry* tests in scheduler_test.go construct a Scheduler via NewScheduler but never start/stop it, so they leak background goroutines in-test (the fragment-table cleanup goroutine and services.Manager listener goroutines). This is unchanged from master and not surfaced by CI (the scheduler package has no goleak guard); closing only the fragment table would not make those tests leak-clean — fixing them means starting/stopping the scheduler (or its manager) in those tests, a separate change.

Which issue(s) this PR fixes

Fixes #7596

Checklist

  • CHANGELOG.md updated — [BUGFIX] Query Scheduler entry.
  • Documentation updated — N/A; no flags or config changed (make doc not required).
  • Tests: goleak guard + regression/idempotency test; fail-without-fix verified.
  • Commit signed off (DCO).

Test plan

  • gofmt -l — clean; go vet -tags "netgo slicelabels" ./pkg/scheduler/... — clean
  • go test -tags "netgo slicelabels" -race -count=20 ./pkg/scheduler/fragment_table/... — PASS (goleak guard runs each iteration)
  • go test -tags "netgo slicelabels" -race -count=1 ./pkg/scheduler/ — PASS (full package, exercises the scheduler start/stop → Close() path)
  • Reverting periodicCleanup to the buggy for range ticker.C loop makes the goleak guard fail.

🤖 Generated with Claude Code

@sandy2008 sandy2008 force-pushed the fix/scheduler-fragmenttable-cleanup-lifecycle branch from 88d3222 to 232523b Compare June 7, 2026 13:09
@sandy2008 sandy2008 marked this pull request as ready for review June 7, 2026 13:10
@dosubot dosubot Bot added the go and type/bug labels Jun 7, 2026
NewFragmentTable starts a periodicCleanup goroutine (driven by a ticker)
that had no stop mechanism: the loop ranged over ticker.C forever, the
deferred ticker.Stop() was unreachable, and FragmentTable had no Close
method, so the goroutine and ticker lived for the whole process lifetime
and could not be reclaimed (issue cortexproject#7596).

Add an idempotent Close() (done channel + sync.Once); periodicCleanup now
selects on the done channel and returns. The scheduler closes the fragment
table in its stopping hook so the goroutine is reclaimed on shutdown. A
package goleak guard plus a regression test cover it, and the existing
tests now stop the table they create.

Fixes cortexproject#7596

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/scheduler-fragmenttable-cleanup-lifecycle branch from 232523b to c7c60df Compare June 7, 2026 13:22

@SungJin1212 SungJin1212 left a comment

thanks

@dosubot dosubot Bot added the lgtm label Jun 8, 2026

Labels

go, lgtm, size/M, type/bug

Development

Successfully merging this pull request may close these issues.

Scheduler: FragmentTable starts a periodicCleanup goroutine/ticker that can never be stopped (no Close/Stop)