Ruler never unregisters group metrics #2033
Labels
component/rules
Bits & bobs todo with rules and alerts: the ruler, config service etc.
stale
type/observability
To help know what is going on inside Cortex
As a new group is created (e.g. on resharding when a ruler starts or stops) it registers metrics with Prometheus, but there is no code to unregister them when the group stops being used (e.g. on another resharding). Over time this will build up a substantial number of useless metrics.
This makes it hard to observe how well the ruler is keeping up, since `time() - cortex_prometheus_rule_group_last_evaluation_timestamp_seconds` is ever-increasing for the left-behind metrics.

The metric registration is done in Prometheus code; the ruler calls `Update()` with a list of files, but nobody is checking which files have disappeared since the last update.
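The missing check could be sketched roughly as follows. This is only an illustration of tracking which files disappeared between two updates, not the actual Cortex or Prometheus code: `staleGroupTracker` and its `update` method are hypothetical names, and the place where real code would unregister the per-group metrics is marked with a comment.

```go
package main

import (
	"fmt"
	"sort"
)

// staleGroupTracker remembers which rule files were passed on the previous
// update so that files which have since disappeared can be detected.
// Hypothetical helper for illustration; not part of Cortex or Prometheus.
type staleGroupTracker struct {
	lastFiles map[string]struct{}
}

// update records the current file list and returns, in sorted order, the
// files that were present on the previous call but are missing now.
func (t *staleGroupTracker) update(files []string) []string {
	current := make(map[string]struct{}, len(files))
	for _, f := range files {
		current[f] = struct{}{}
	}

	var removed []string
	for f := range t.lastFiles {
		if _, ok := current[f]; !ok {
			removed = append(removed, f)
		}
	}
	sort.Strings(removed)

	t.lastFiles = current
	return removed
}

func main() {
	t := &staleGroupTracker{}
	t.update([]string{"a.yaml", "b.yaml"})

	// Simulate a resharding: a.yaml moves away, c.yaml arrives.
	for _, f := range t.update([]string{"b.yaml", "c.yaml"}) {
		// A real fix would unregister or delete the group metrics
		// belonging to this file here.
		fmt.Println("stale rule file:", f)
	}
}
```

Calling `update` once per resharding keeps the bookkeeping to a single set difference, so the cleanup cost is proportional to the number of files rather than the number of series.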