Expose paused and retired workers separately in prometheus #8613

phofl · 2024-04-11T13:14:01Z

Closes #xxxx

Tests added / passed
Passes pre-commit run --all-files

cc @ntabris for the grafana dashboards

ntabris · 2024-04-11T13:23:57Z

Having paused and retiring and paused_or_retiring makes things more complicated for me, because various things would need to include that iff paused and retiring are not present (and not include if they are, otherwise we'd be double counting).

Thoughts about removing paused_or_retiring? I know this would be a breaking change in some sense, but it's also kinda a breaking change to have more non-exclusive states.

fjetter · 2024-04-11T13:28:22Z

> Thoughts about removing paused_or_retiring? I know this would be a breaking change in some sense, but it's also kinda a breaking change to have more non-exclusive states.

I think we don't have any strong preferences about keeping/removing the paused_or_retiring metric. Can you elaborate how adding those would be a breaking change?

ntabris · 2024-04-11T13:31:57Z

> I think we don't have any strong preferences about keeping/removing the paused_or_retiring metric. Can you elaborate how adding those would be a breaking change?

I said "kinda". It messes up anything that assumes states other than "connected" are exclusive, or that (eg) a chart of all states other than "connected" would make sense.

Instead, one would need logic that includes paused_or_retiring or [paused and retiring] but not both... which isn't very straightforward in Prometheus (I'm still thinking about how to do this).
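The "one or the other, but not both" selection described above can be sketched on the consumer side in Python. This is only an illustration: the function name and the shape of the input (a plain mapping from state label to the latest sample value) are assumptions made for this example, not part of the scheduler or of any Prometheus client API.

```python
from __future__ import annotations

# State labels taken from the discussion; everything else here is hypothetical.
SPLIT_STATES = {"paused", "retiring"}
COMBINED_STATE = "paused_or_retiring"


def exclusive_state_counts(samples: dict[str, float]) -> dict[str, float]:
    """Return the samples with double counting removed.

    If either split state is present, the combined state is dropped,
    because it would count the same workers a second time. Otherwise
    the samples are returned unchanged.
    """
    if SPLIT_STATES & samples.keys():
        return {s: v for s, v in samples.items() if s != COMBINED_STATE}
    return dict(samples)


# Hypothetical samples from a scheduler that exposes only the combined state
old_style = {"connected": 10, "paused_or_retiring": 3}
# Hypothetical samples from a scheduler that exposes both split and combined states
mixed_style = {"connected": 10, "paused": 1, "retiring": 2, "paused_or_retiring": 3}

old_result = exclusive_state_counts(old_style)
mixed_result = exclusive_state_counts(mixed_style)
```

Removing paused_or_retiring at the source, as proposed in this thread, would make this kind of fallback logic unnecessary.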

fjetter · 2024-04-11T13:35:41Z

> Instead, one would need logic that includes paused_or_retiring or [paused and retiring] but not both... which isn't very straightforward in Prometheus (I'm still thinking about how to do this).

Right, that makes sense. It's a shame we didn't introduce the split metrics earlier. For now, we're mostly interested in the paused metric because this may highlight a problematic cluster behavior (too small in size) while the retiring signal is a little bit of noise and is usually harmless.

So I don't have a solution for the "nice chart" problem but when using this as a tag, the retired metric as a standalone thing already makes sense.

github-actions · 2024-04-11T13:50:36Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

29 files ±0 29 suites ±0 11h 52m 24s ⏱️ - 2m 15s
4 087 tests ±0 3 970 ✅ +1 112 💤 ±0 4 ❌ - 2 1 🔥 +1
55 287 runs +1 52 844 ✅ +4 2 438 💤 - 1 4 ❌ - 3 1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit dfc84cb. ± Comparison against base commit d68a5d9.

♻️ This comment has been updated with latest results.

phofl · 2024-04-11T13:54:04Z

I'd be fine with removing the paused_or_retiring metric; my main concern was backward compatibility.

We mostly care about paused, as @fjetter said.

hendrikmakait · 2024-06-25T19:13:16Z

FWIW, I'm +1 for removing paused_or_retiring. We don't have a good story for backward compatibility yet, and this change seems like a net improvement.

hendrikmakait · 2024-07-16T17:35:43Z

@phofl: Should we move this forward by removing paused_or_retiring?

phofl · 2024-07-17T08:11:36Z

Yep, let's do that.

hendrikmakait

Thanks, @phofl! One out-of-scope comment that doesn't need to be addressed here as it's not a regression.

hendrikmakait · 2024-07-18T18:46:49Z

distributed/http/scheduler/prometheus/core.py

        worker_states.add_metric(
-            ["paused_or_retiring"], len(self.server.workers) - len(self.server.running)
+            ["retiring"],


Out of scope: Technically, this definition is slightly off because there's a brief delay between registering a worker on the scheduler (with the worker being in the init state) and updating its state to running on the scheduler.
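A small sketch of why a count derived by subtraction can be slightly off. It assumes the retiring count is computed as "all workers minus running minus paused" (the diff above is truncated, so this is a guess at its shape) and that retiring workers report a closing_gracefully status. The `Status` enum below is a local stand-in for illustration, not the one from distributed.

```python
from enum import Enum


class Status(Enum):
    """Local stand-in enum for illustration only."""

    init = "init"
    running = "running"
    paused = "paused"
    closing_gracefully = "closing_gracefully"


# Hypothetical snapshot: one worker has registered but is still in init
statuses = [
    Status.running,
    Status.running,
    Status.init,
    Status.paused,
    Status.closing_gracefully,
]

running = sum(s is Status.running for s in statuses)
paused = sum(s is Status.paused for s in statuses)

# Subtraction treats every worker that is neither running nor paused as retiring
retiring_by_subtraction = len(statuses) - running - paused

# Counting the status explicitly leaves the init worker out
retiring_explicit = sum(s is Status.closing_gracefully for s in statuses)
```

In this snapshot the subtraction reports one more retiring worker than the explicit count, and the difference is exactly the worker still in init.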

hendrikmakait · 2024-07-18T18:49:51Z

distributed/http/scheduler/prometheus/core.py

+        paused_workers = len(
+            [w for w in self.server.workers.values() if w.status == Status.paused]
+        )


I suspect looping over this once every poll should be fine; we can address this later if it becomes a problem with very large clusters.
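If the per-poll loop ever did become a concern, one option would be to tally all statuses in a single pass instead of building one list per state. The sketch below uses `collections.Counter`; `WorkerInfo` and the `Status` enum are hypothetical stand-ins for the scheduler's worker objects, and the worker addresses are made up.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Local stand-in enum for illustration only."""

    running = "running"
    paused = "paused"
    closing_gracefully = "closing_gracefully"


@dataclass
class WorkerInfo:
    """Hypothetical stand-in for a scheduler-side worker record."""

    status: Status


def count_worker_statuses(workers: dict[str, WorkerInfo]) -> Counter:
    """Tally workers by status in one pass over the mapping."""
    return Counter(w.status for w in workers.values())


workers = {
    "tcp://10.0.0.1:1234": WorkerInfo(Status.running),
    "tcp://10.0.0.2:1234": WorkerInfo(Status.paused),
    "tcp://10.0.0.3:1234": WorkerInfo(Status.paused),
    "tcp://10.0.0.4:1234": WorkerInfo(Status.closing_gracefully),
}

counts = count_worker_statuses(workers)
paused_workers = counts[Status.paused]
```

This is still linear in the number of workers, so it only helps when several per-state counts are needed from the same poll.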

Expose paused and retired workers separately in prometheus

bcc262c

phofl requested a review from fjetter as a code owner April 11, 2024 13:14

Merge branch 'main' into prometheus_paused_workers

97675e9

phofl added 2 commits July 17, 2024 10:15

Update

30bee7e

Merge remote-tracking branch 'refs/remotes/upstream/main' into promet…

dfc84cb

…heus_paused_workers

hendrikmakait approved these changes Jul 18, 2024

hendrikmakait reviewed Jul 18, 2024

phofl merged commit 30c0d29 into dask:main Jul 18, 2024
29 of 36 checks passed

phofl deleted the prometheus_paused_workers branch July 18, 2024 19:45
