Expose paused and retired workers separately in prometheus #8613
Having paused and retiring and Thoughts about removing |
I think we don't have any strong preferences about keeping/removing the |
I said "kinda". It messes up anything that assumes states other than "connected" are exclusive, or that (eg) a chart of all states other than "connected" would make sense. Instead, one would need logic that includes |
Right, that makes sense. It's a shame we didn't introduce the split metrics earlier. For now, we're mostly interested in the paused metric because it may highlight problematic cluster behavior (a cluster that is too small), while the retiring signal is a bit noisy and usually harmless. So I don't have a solution for the "nice chart" problem, but when using this as a tag, the retired metric as a standalone thing already makes sense.
Unit Test Results: See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 29 files ±0, 29 suites ±0, 11h 52m 24s (-2m 15s). For more details on these failures and errors, see this check. Results for commit dfc84cb. ± Comparison against base commit d68a5d9.
I'd be fine with removing the paused_or_retiring metric; my main concern was backward compatibility. We mostly care about paused, as @fjetter said.
FWIW, I'm +1 for removing |
@phofl: Should we move this forward by removing |
Yep, let's do that.
…heus_paused_workers
Thanks, @phofl! One out-of-scope comment that doesn't need to be addressed here as it's not a regression.
worker_states.add_metric(
    ["paused_or_retiring"], len(self.server.workers) - len(self.server.running)
    ["retiring"],
Out of scope: Technically, this definition is slightly off because there's a brief delay between registering a worker on the scheduler (with the worker being in the init state) and updating its state to running on the scheduler.
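To illustrate the point above, here is a minimal sketch of how a count derived by subtraction can differ from explicit per-status counts. The `Status` enum below is a stand-in with only the members needed here, and treating `closing_gracefully` as the "retiring" status is an assumption made for the example, not something this thread confirms.

```python
from enum import Enum


class Status(Enum):
    # Stand-in enum for illustration; not the scheduler's real class.
    init = "init"
    running = "running"
    paused = "paused"
    closing_gracefully = "closing_gracefully"  # assumed to mean "retiring"


# Hypothetical snapshot: one worker has registered but is still in init.
statuses = [
    Status.running,
    Status.running,
    Status.paused,
    Status.closing_gracefully,
    Status.init,
]

running = sum(s is Status.running for s in statuses)

# Derived by subtraction, as in "all workers minus running workers".
derived = len(statuses) - running

# Counted explicitly per status.
paused = sum(s is Status.paused for s in statuses)
retiring = sum(s is Status.closing_gracefully for s in statuses)
explicit = paused + retiring

# The gap is the worker that is neither running, paused, nor retiring.
gap = derived - explicit
```

In this snapshot the subtraction also counts the worker that is still in `init`, so the derived value is one higher than the sum of the explicit counts.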
paused_workers = len(
    [w for w in self.server.workers.values() if w.status == Status.paused]
)
I suspect looping over this once every poll should be fine; we can address this later if it turns out to be a problem with very large clusters.
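If the per-poll loop ever did become a concern, one possible follow-up would be to tally all statuses in a single pass rather than scanning once per status. This is only a sketch: `Status`, `Worker`, and `count_by_status` are hypothetical stand-ins, and mapping "retiring" to `closing_gracefully` is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Stand-in enum for illustration; not the scheduler's real class.
    init = "init"
    running = "running"
    paused = "paused"
    closing_gracefully = "closing_gracefully"  # assumed to mean "retiring"


@dataclass
class Worker:
    # Hypothetical minimal worker record with just a status field.
    status: Status


def count_by_status(workers):
    """Tally workers per status in one pass over the mapping's values."""
    return Counter(w.status for w in workers.values())


workers = {
    "a": Worker(Status.running),
    "b": Worker(Status.paused),
    "c": Worker(Status.paused),
    "d": Worker(Status.closing_gracefully),
}

counts = count_by_status(workers)
paused_workers = counts[Status.paused]
retiring_workers = counts[Status.closing_gracefully]
```

A `Counter` returns 0 for statuses that no worker currently has, so each metric can be read without checking for missing keys.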
Closes #xxxx
pre-commit run --all-files
cc @ntabris for the grafana dashboards