AI Tool Usage Notice
This issue was drafted with the help of an AI tool. All content has been reviewed and validated by the submitter, who is responsible for its accuracy and quality. See the Generative AI Contribution Policy for details.
Describe the bug
In DefaultMultiTenantManager, newManager starts and registers a per-tenant notifier before invoking managerFactory, and does not clean it up if managerFactory fails.
newManager calls getOrCreateNotifier (pkg/ruler/manager.go:303), which starts the notifier's discovery + notification goroutines via n.run() (manager.go:373) and registers it in r.notifiers (manager.go:380).
- It then returns
r.managerFactory(...) directly (manager.go:308).
- If
managerFactory returns an error, the caller createRulesManager logs and returns nil (manager.go:267-272); the user is never added to r.userManagers (line 276 is unreachable on this path).
The only place a notifier is stopped is removeNotifier (manager.go:311-320), called from the user-deletion loop in SyncRuleGroups, which iterates r.userManagers. Since this tenant was never added there, removeNotifier is never called for it. The notifier and its two long-lived goroutines (discovery.Manager.Run and notifier.Manager.Run, started in pkg/ruler/notifier.go:60-72) are orphaned until process shutdown.
DefaultTenantManagerFactory returns real errors — e.g. resolveFrontendClient failure, or when neither a PromQL engine nor a frontend client is configured (pkg/ruler/compat.go) — so this is reachable on a misconfigured ruler or a transiently unresolvable query-frontend. Under repeated transient failures plus tenant churn, leaked notifiers, goroutines, and per-tenant registries accumulate and grow memory over time.
A related lower-severity variant: getOrCreateNotifier calls n.run() (manager.go:373) then applyConfig (manager.go:376-378); if applyConfig fails it returns without n.stop() and without map insertion, leaking those goroutines too (guarded by a "should never fail" comment).
To Reproduce
Steps to reproduce the behavior:
- Start the Cortex ruler (master) configured so
DefaultTenantManagerFactory fails for a tenant — e.g. -ruler.frontend-address set to an unresolvable address, or neither an internal PromQL engine nor a frontend client configured.
- Trigger a rule sync for tenant
T so managerFactory errors on first creation.
- Remove
T's rule groups before any successful sync.
- Via a goroutine profile, observe
T's notifier discovery/notification goroutines are still running — removeNotifier(T) was never called.
Expected behavior
When managerFactory returns an error, newManager should stop and de-register the notifier it just created (call removeNotifier(userID) / n.stop() and drop the per-user registry) before returning, mirroring the existing user-removal cleanup. The applyConfig failure path in getOrCreateNotifier should likewise call n.stop() before returning.
Environment:
- Infrastructure: any (code-level resource leak)
- Deployment tool: any
Additional Context
Introduced by PR #6151 ("Support Ruler to query Query Frontend"), which changed ManagerFactory to return an error and made newManager propagate it without adding cleanup (previously the factory had no error return). Prior fix #4718 added removeNotifier only to the user-removal path, not the creation-failure path.
Key references: pkg/ruler/manager.go:297-309 (newManager), :311-320 (removeNotifier), :322-382 (getOrCreateNotifier), :263-278 (createRulesManager); pkg/ruler/notifier.go:60-72 (run), :86-90 (stop).
AI Tool Usage Notice
This issue was drafted with the help of an AI tool. All content has been reviewed and validated by the submitter, who is responsible for its accuracy and quality. See the Generative AI Contribution Policy for details.
Describe the bug
In
DefaultMultiTenantManager,newManagerstarts and registers a per-tenant notifier before invokingmanagerFactory, and does not clean it up ifmanagerFactoryfails.newManagercallsgetOrCreateNotifier(pkg/ruler/manager.go:303), which starts the notifier's discovery + notification goroutines vian.run()(manager.go:373) and registers it inr.notifiers(manager.go:380).r.managerFactory(...)directly (manager.go:308).managerFactoryreturns an error, the callercreateRulesManagerlogs and returnsnil(manager.go:267-272); the user is never added tor.userManagers(line 276 is unreachable on this path).The only place a notifier is stopped is
removeNotifier(manager.go:311-320), called from the user-deletion loop inSyncRuleGroups, which iteratesr.userManagers. Since this tenant was never added there,removeNotifieris never called for it. The notifier and its two long-lived goroutines (discovery.Manager.Runandnotifier.Manager.Run, started inpkg/ruler/notifier.go:60-72) are orphaned until process shutdown.DefaultTenantManagerFactoryreturns real errors — e.g.resolveFrontendClientfailure, or when neither a PromQL engine nor a frontend client is configured (pkg/ruler/compat.go) — so this is reachable on a misconfigured ruler or a transiently unresolvable query-frontend. Under repeated transient failures plus tenant churn, leaked notifiers, goroutines, and per-tenant registries accumulate and grow memory over time.A related lower-severity variant:
getOrCreateNotifiercallsn.run()(manager.go:373) thenapplyConfig(manager.go:376-378); ifapplyConfigfails it returns withoutn.stop()and without map insertion, leaking those goroutines too (guarded by a "should never fail" comment).To Reproduce
Steps to reproduce the behavior:
DefaultTenantManagerFactoryfails for a tenant — e.g.-ruler.frontend-addressset to an unresolvable address, or neither an internal PromQL engine nor a frontend client configured.TsomanagerFactoryerrors on first creation.T's rule groups before any successful sync.T's notifier discovery/notification goroutines are still running —removeNotifier(T)was never called.Expected behavior
When
managerFactoryreturns an error,newManagershould stop and de-register the notifier it just created (callremoveNotifier(userID)/n.stop()and drop the per-user registry) before returning, mirroring the existing user-removal cleanup. TheapplyConfigfailure path ingetOrCreateNotifiershould likewise calln.stop()before returning.Environment:
Additional Context
Introduced by PR #6151 ("Support Ruler to query Query Frontend"), which changed
ManagerFactoryto return an error and madenewManagerpropagate it without adding cleanup (previously the factory had no error return). Prior fix #4718 addedremoveNotifieronly to the user-removal path, not the creation-failure path.Key references:
pkg/ruler/manager.go:297-309(newManager),:311-320(removeNotifier),:322-382(getOrCreateNotifier),:263-278(createRulesManager);pkg/ruler/notifier.go:60-72(run),:86-90(stop).