Skip to content

Ruler: per-tenant notifier and its service-discovery goroutines leak when managerFactory returns an error #7595

@sandy2008

Description

@sandy2008

AI Tool Usage Notice

This issue was drafted with the help of an AI tool. All content has been reviewed and validated by the submitter, who is responsible for its accuracy and quality. See the Generative AI Contribution Policy for details.

Describe the bug

In DefaultMultiTenantManager, newManager starts and registers a per-tenant notifier before invoking managerFactory, and does not clean it up if managerFactory fails.

  • newManager calls getOrCreateNotifier (pkg/ruler/manager.go:303), which starts the notifier's discovery + notification goroutines via n.run() (manager.go:373) and registers it in r.notifiers (manager.go:380).
  • It then returns r.managerFactory(...) directly (manager.go:308).
  • If managerFactory returns an error, the caller createRulesManager logs and returns nil (manager.go:267-272); the user is never added to r.userManagers (line 276 is unreachable on this path).

The only place a notifier is stopped is removeNotifier (manager.go:311-320), called from the user-deletion loop in SyncRuleGroups, which iterates r.userManagers. Since this tenant was never added there, removeNotifier is never called for it. The notifier and its two long-lived goroutines (discovery.Manager.Run and notifier.Manager.Run, started in pkg/ruler/notifier.go:60-72) are orphaned until process shutdown.

DefaultTenantManagerFactory returns real errors — e.g. resolveFrontendClient failure, or when neither a PromQL engine nor a frontend client is configured (pkg/ruler/compat.go) — so this is reachable on a misconfigured ruler or a transiently unresolvable query-frontend. Under repeated transient failures plus tenant churn, leaked notifiers, goroutines, and per-tenant registries accumulate and grow memory over time.

A related lower-severity variant: getOrCreateNotifier calls n.run() (manager.go:373) then applyConfig (manager.go:376-378); if applyConfig fails it returns without n.stop() and without map insertion, leaking those goroutines too (guarded by a "should never fail" comment).

To Reproduce

Steps to reproduce the behavior:

  1. Start the Cortex ruler (master) configured so DefaultTenantManagerFactory fails for a tenant — e.g. -ruler.frontend-address set to an unresolvable address, or neither an internal PromQL engine nor a frontend client configured.
  2. Trigger a rule sync for tenant T so managerFactory errors on first creation.
  3. Remove T's rule groups before any successful sync.
  4. Via a goroutine profile, observe T's notifier discovery/notification goroutines are still running — removeNotifier(T) was never called.

Expected behavior

When managerFactory returns an error, newManager should stop and de-register the notifier it just created (call removeNotifier(userID) / n.stop() and drop the per-user registry) before returning, mirroring the existing user-removal cleanup. The applyConfig failure path in getOrCreateNotifier should likewise call n.stop() before returning.

Environment:

  • Infrastructure: any (code-level resource leak)
  • Deployment tool: any

Additional Context

Introduced by PR #6151 ("Support Ruler to query Query Frontend"), which changed ManagerFactory to return an error and made newManager propagate it without adding cleanup (previously the factory had no error return). Prior fix #4718 added removeNotifier only to the user-removal path, not the creation-failure path.

Key references: pkg/ruler/manager.go:297-309 (newManager), :311-320 (removeNotifier), :322-382 (getOrCreateNotifier), :263-278 (createRulesManager); pkg/ruler/notifier.go:60-72 (run), :86-90 (stop).

Metadata

Metadata

Assignees

No one assigned

    Labels

    component/rulesBits & bobs todo with rules and alerts: the ruler, config service etc.type/bug

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions