[BUGFIX] Ring: fix DoBatch cleanup callback blocked forever on callback panic #7559

Open
sandy2008 wants to merge 3 commits into cortexproject:master from sandy2008:fix/distributor-dobatch-context-leak

@sandy2008 sandy2008 commented May 27, 2026

What this PR does

The closures submitted by ring.DoBatch called wg.Done() inline, after callback() and tracker.record(). If callback panicked, wg.Done() was skipped, so wg.Wait() blocked forever and the cleanup callback never ran.

Symptom: any panic inside a DoBatch callback permanently blocked the cleanup goroutine. For the distributor, this leaked the context.WithTimeout timer (until RemoteTimeout expired) and request buffers (req.Timeseries, req.Free()) — they were never reclaimed. The same issue affected all three production DoBatch callers: distributor, alertmanager distributor, and multitenant alertmanager.

Fix: move wg.Done() to defer wg.Done() so it runs even during panic unwinding, ensuring the cleanup goroutine always completes. An earlier iteration also added defer cancel() to the distributor's doBatch function, but this was intentionally reverted after review found it contradicts the design intent: "Use a background context to make sure all ingesters get samples even if we return early." The defer wg.Done() fix makes the existing cleanup callback (which already calls cancel()) reliable on all paths, preserving send semantics without cancelling in-flight ingester RPCs.

Which issue(s) this PR fixes

Fixes #7558

Checklist

  • CHANGELOG.md updated — not yet, will add if maintainers confirm the fix direction.
  • Documentation updated — not applicable, no flags or config changed.
  • Tests added — TestDoBatchCleanupCalledOnCallbackPanic verifies cleanup runs even when callback panics.

Test plan

  • go vet ./pkg/ring/... ./pkg/distributor/... — clean
  • go test -tags "netgo slicelabels" -count=1 -timeout 120s ./pkg/ring/... — 10/10 packages pass
  • go test -tags "netgo slicelabels" -run TestDistributor_BatchTimeoutMetric ./pkg/distributor/... — passes
  • New test passes with fix, fails without (verified by reverting defer during development)

DoBatch's submitted closures called wg.Done() inline after callback()
and tracker.record(). If callback panicked, wg.Done() was skipped,
causing wg.Wait() to block forever and the cleanup callback to never
execute. This leaked context timers, request buffers, and any other
resources owned by the cleanup function for all DoBatch callers
(distributor, alertmanager).

Move wg.Done() to a defer so it runs even during panic unwinding,
ensuring the cleanup goroutine always completes.

Fixes cortexproject#7558

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@dosubot dosubot Bot added component/ring go Pull requests that update Go code type/bug labels May 27, 2026
@sandy2008 sandy2008 changed the title Fix DoBatch cleanup callback blocked forever on callback panic [BUGFIX] Ring: fix DoBatch cleanup callback blocked forever on callback panic May 27, 2026

Labels

component/ring go Pull requests that update Go code size/L type/bug

Development

Successfully merging this pull request may close these issues.

[BUG] Distributor: context.WithTimeout timer leak in doBatch — cancel() not deferred, relies solely on ring.DoBatch cleanup callback
