Ruler never unregisters group metrics #2033

Closed
bboreham opened this issue Jan 24, 2020 · 5 comments
Labels
component/rules (Bits & bobs todo with rules and alerts: the ruler, config service etc.)
stale
type/observability (To help know what is going on inside Cortex)

Comments

@bboreham
Contributor

bboreham commented Jan 24, 2020

When a new group is created (e.g. on resharding when a ruler starts or stops), it registers metrics with Prometheus, but no code unregisters them when the group stops being used (e.g. on another resharding). Over time this builds up a substantial number of useless metrics.

This makes it hard to observe how well the ruler is keeping up, since `time() - cortex_prometheus_rule_group_last_evaluation_timestamp_seconds` is ever-increasing for the left-behind metrics.
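To illustrate why the left-behind series distort that measurement, here is a minimal Go sketch. It is not Cortex or Prometheus code: `maxEvalLag`, the group names, and the timestamps are all made up for illustration. It assumes the lag is read as the maximum of "now minus last evaluation timestamp" across all series.

```go
package main

import "fmt"

// maxEvalLag is a hypothetical helper that mirrors taking the maximum of
// time() - last_evaluation_timestamp across all series. It returns the
// largest gap, in seconds, between now and any group's last evaluation.
func maxEvalLag(now float64, lastEval map[string]float64) float64 {
	lag := 0.0
	for _, ts := range lastEval {
		if d := now - ts; d > lag {
			lag = d
		}
	}
	return lag
}

func main() {
	// Illustrative timestamps only (seconds).
	healthy := map[string]float64{
		"active-group": 990, // evaluated 10s before now
	}
	withOrphan := map[string]float64{
		"active-group":   990,
		"orphaned-group": 400, // group moved to another ruler; never updated again
	}

	now := 1000.0
	fmt.Println("lag without orphan:", maxEvalLag(now, healthy))
	fmt.Println("lag with orphan:   ", maxEvalLag(now, withOrphan))
}
```

With the orphaned series present, the reported lag is dominated by a group this ruler no longer evaluates, and it keeps growing as `now` advances.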

The metric registration is done in Prometheus code; the ruler calls `Update()` with a list of files, but nothing checks which files have disappeared since the last update.
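The missing check could look roughly like the sketch below. This is a self-contained toy, not the Prometheus or Cortex implementation: `groupRegistry`, its `Update` method, and the file names are hypothetical, and a plain map stands in for real metric registration. It only shows the idea of diffing the previous file set against the new one and dropping whatever disappeared.

```go
package main

import (
	"fmt"
	"sort"
)

// groupRegistry is a hypothetical stand-in for the per-group metrics the
// ruler registers. It is not the Prometheus or Cortex API.
type groupRegistry struct {
	registered map[string]struct{} // rule files that currently have metrics
}

func newGroupRegistry() *groupRegistry {
	return &groupRegistry{registered: make(map[string]struct{})}
}

// Update registers metrics for every file in files and unregisters metrics
// for files that were present before but are missing now. It returns the
// removed files in sorted order.
func (r *groupRegistry) Update(files []string) []string {
	current := make(map[string]struct{}, len(files))
	for _, f := range files {
		current[f] = struct{}{}
		r.registered[f] = struct{}{} // registering twice is a no-op here
	}

	// Collect files that were registered earlier but are absent from this update.
	var removed []string
	for f := range r.registered {
		if _, ok := current[f]; !ok {
			removed = append(removed, f)
		}
	}
	for _, f := range removed {
		delete(r.registered, f) // stand-in for unregistering the group's metrics
	}
	sort.Strings(removed)
	return removed
}

func main() {
	r := newGroupRegistry()
	r.Update([]string{"tenant-a/rules.yaml", "tenant-b/rules.yaml"})

	// After a resharding, tenant-b is owned by another ruler.
	removed := r.Update([]string{"tenant-a/rules.yaml"})
	fmt.Println("unregistered:", removed, "still registered:", len(r.registered))
}
```

In a real fix the delete step would have to remove the group's labelled series from the actual metric vectors; how that is done depends on the upstream change referenced later in this thread.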

@bboreham added the type/observability and component/rules labels on Jan 24, 2020
@bboreham
Contributor Author

Upstream issue prometheus/prometheus#6689

@jtlisi
Contributor

jtlisi commented Feb 3, 2020

prometheus/prometheus#6693

Now that this has been merged, we should be able to update our vendored Prometheus version to solve this issue.

@bboreham
Contributor Author

bboreham commented Feb 4, 2020

NB we had a request to vendor only tagged versions of Prometheus, which would imply waiting for 2.16 to ship.

@pracucci
Contributor

pracucci commented Feb 5, 2020

> NB we had a request to vendor only tagged versions of Prometheus, which would imply waiting for 2.16 to ship.

We had that request, but I feel we haven't discussed it much. I'm personally dubious about it. Forcing the version in master to always match a tagged version of Thanos and Prometheus may significantly slow down development, especially on experimental features like the blocks storage.

@stale

stale bot commented Apr 5, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 5, 2020
@stale stale bot closed this as completed Apr 20, 2020