
Track Event Loop intervals in dashboard plot #5964

Merged: 9 commits merged into dask:main from the event-loop-dashboard branch on Mar 25, 2022

Conversation

@mrocklin (Member) commented Mar 19, 2022

[Screenshot: the new event loop interval plot in the dashboard]

@github-actions bot (Contributor) commented Mar 19, 2022

Unit Test Results

    12 files  ±0      12 suites  ±0   6h 15m 11s ⏱️ +6m 53s
 2 670 tests +1    2 589 ✔️ +2     80 💤 −2   1 ❌ +1
15 948 runs  +6   15 104 ✔️ +7    843 💤 −2   1 ❌ +1

For more details on these failures, see this check.

Results for commit 55e8090. ± Comparison against base commit 5c7d555.

♻️ This comment has been updated with latest results.

@mrocklin (Member Author)

cc @graingert if you have time to take a quick look

@ncclementi (Member)

@mrocklin can you add the corresponding entry to this doc page, please? https://github.com/dask/distributed/blob/main/docs/source/http_services.rst

@phobson left a comment

Looks good to me. I pulled down the branch and tested it out. It worked as expected on a handful of the notebooks in the dask-tutorial and dask-examples repos.

@mrocklin (Member Author)

Merging in 24 hours if there are no further comments.

@graingert (Member)

The tick counter itself seems problematic (tracked in #5908). I'd rather not build anything on top of it until we've determined a replacement for it.

@mrocklin mentioned this pull request on Mar 22, 2022
@mrocklin (Member Author)

> The tick counter itself seems problematic (tracked in #5908). I'd rather not build anything on top of it until we've determined a replacement for it.

If it's problematic for tests (I'd like to learn more about how), then let's fix it in tests (#5977 is one simple attempt at this). I'm open to replacements for this structure if anyone has ideas. If no one has ideas, then I'm inclined to push forward though. This is important for tracking system health.

First things first though, any suggestions on alternatives? I'd love to find something better.

@gjoseph92 (Collaborator) left a comment

I am skeptical that when the event loop is blocked, bokeh will actually be able to send updates to the dashboard showing that this is happening (i.e. the chart may either be flickering around 0.02, or frozen at 0.02, but not go above 0.02 except for a brief, instantaneous high flicker right after the loop un-blocks). I haven't tested this—could you confirm that (time.sleep in an async client.run_on_scheduler, etc.)?

If the dashboard doesn't adequately show this, maybe we should change the visualization to something that would show the delay post-hoc (a timeseries of loop intervals?). Ideally, we'd have the plot increment itself client-side (JS), and then zero itself back out when it receives an update from the scheduler.
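
For anyone who wants to try the experiment suggested above, here is a minimal sketch (the blocking function and its name are illustrative; `Client.run_on_scheduler` is the real distributed API):

```python
import time

from dask.distributed import Client

client = Client()  # assumes a throwaway local cluster is fine for this test


async def block_loop(dask_scheduler=None):
    # time.sleep (not asyncio.sleep) deliberately blocks the scheduler's
    # event loop for three seconds while the dashboard is open.
    time.sleep(3)


# run_on_scheduler executes the function inside the scheduler process, so the
# sleep lands on the scheduler's event loop thread; watch how the
# /individual-event-loop plot behaves during and after the pause.
client.run_on_scheduler(block_loop)
```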

```python
measure_tick_interval = parse_timedelta(
self._tick_counter = 0
self._tick_count = 0
self._tick_count_last = time()
```
Collaborator

Let's use a monotonic timer here #4528

Member Author

We currently use metrics.time because of Windows support. If we want to switch to something better, I'm all in favor, but I hesitate to launch into this immediately. Instead I'd rather keep the standard way of doing things and then consider switching to monotonic en masse. Thoughts?
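
For context, a rough illustration of the difference being discussed (illustrative only; `distributed.metrics.time` is roughly a wall clock on most platforms, while a monotonic clock cannot jump when the system time is adjusted):

```python
import time

start_wall = time.time()       # wall clock; can jump if system time is adjusted (e.g. NTP)
start_mono = time.monotonic()  # monotonic clock, as proposed in #4528; only moves forward

time.sleep(0.02)  # stand-in for one event-loop tick

print(f"wall:      {time.time() - start_wall:.4f}s")
print(f"monotonic: {time.monotonic() - start_mono:.4f}s")
```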

Comment on lines 244 to 245
```python
pc = PeriodicCallback(self._measure_tick, self._tick_interval * 1000)
self.periodic_callbacks["tick"] = pc
```
Collaborator

Suggested change:

```diff
- pc = PeriodicCallback(self._measure_tick, self._tick_interval * 1000)
- self.periodic_callbacks["tick"] = pc
+ self.periodic_callbacks["tick"] = PeriodicCallback(self._measure_tick, self._tick_interval * 1000)
```

pc is unused otherwise

Member Author

Sure, done. I also did this for the lines above.

```diff
@@ -97,6 +98,7 @@
     "/individual-compute-time-per-key": individual_doc(ComputePerKey, 500),
     "/individual-aggregate-time-per-action": individual_doc(AggregateAction, 500),
     "/individual-scheduler-system": individual_doc(SystemMonitor, 500),
+    "/individual-event-loop": individual_doc(EventLoop, 500),
```
Collaborator

Should the interval for this plot correspond to the tick cycle? Otherwise it will run unnecessarily often.

Member Author

It's pretty cheap. I don't think that this matters much.

```python
    toolbar_location="above",
    **kwargs,
)
self.root.vbar(x="names", top="values", width=0.9, source=self.source)
```
Collaborator

Ideally I'd like to see color as well (yellow if the interval is over a certain threshold, like the worker memory plot). I think 0.1s would be a reasonable threshold; this is asyncio's default slow_callback_duration when debug mode is turned on.
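
A minimal sketch of that threshold-coloring idea, assuming a server-side color column (this is not the PR's code; names and figure options are illustrative):

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

THRESHOLD = 0.1  # asyncio's default slow_callback_duration in debug mode

source = ColumnDataSource(data={"names": [], "values": [], "color": []})

fig = figure(x_range=[], title="Event Loop Interval", toolbar_location="above")
fig.vbar(x="names", top="values", width=0.9, color="color", source=source)


def update(names, intervals):
    # Color each bar yellow once the measured interval crosses the threshold,
    # similar in spirit to the worker memory plot.
    colors = ["yellow" if v > THRESHOLD else "blue" for v in intervals]
    source.data = {"names": names, "values": intervals, "color": colors}
```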

Member Author

I agree with the ideal. However, I'd like to get this in and focus elsewhere for now if possible.

@mrocklin (Member Author)

> I am skeptical that when the event loop is blocked, bokeh will actually be able to send updates to the dashboard showing that this is happening (i.e. the chart may either be flickering around 0.02, or frozen at 0.02, but not go above 0.02 except for a brief, instantaneous high flicker right after the loop un-blocks). I haven't tested this—could you confirm that (time.sleep in an async client.run_on_scheduler, etc.)?
>
> If the dashboard doesn't adequately show this, maybe we should change the visualization to something that would show the delay post-hoc (a timeseries of loop intervals?). Ideally, we'd have the plot increment itself client-side (JS), and then zero itself back out when it receives an update from the scheduler.

You're right to be skeptical: when the scheduler event loop is blocked, the dashboard certainly won't update. However, we will get the signal afterwards. We track the amount of time since the last time we cycled things, and use that.
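
A rough sketch of that mechanism (not the exact distributed code; names are illustrative): a periodic callback records how long it has been since it last ran, so a blocked loop shows up as a large delta on the first run after it unblocks.

```python
import time

from tornado.ioloop import PeriodicCallback

tick_interval = 0.02  # seconds between scheduled ticks
last_tick = time.monotonic()
latest_tick_duration = 0.0


def measure_tick():
    # Runs on the event loop; if the loop was blocked, `now - last_tick`
    # is much larger than tick_interval, and that value is what gets plotted.
    global last_tick, latest_tick_duration
    now = time.monotonic()
    latest_tick_duration = now - last_tick
    last_tick = now


# PeriodicCallback takes its interval in milliseconds and needs a running
# IOLoop (the scheduler's, in practice) to actually fire.
pc = PeriodicCallback(measure_tick, tick_interval * 1000)
pc.start()
```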

For workers that are missing I'm actually considering integrating this value into the other bar charts, like memory. We have a wc.last_seen value that is useful here. I consider that to be other work though.

@gjoseph92 (Collaborator)

> However, we will get the signal afterwards.

I'm worried the signal will be nearly immediately overridden by the next tick measurement cycle. What if the scheduler event loop is blocked for 10s straight (as it is most times update_graph_hlg runs...), then goes back to normal? Won't the chart stay at 0.01s for all 10 seconds, then jump to 10s for an instant once all the blocked PeriodicCallbacks get to run again, but then go back to normal at most 500ms later when EventLoop.update runs again? That gives you a half-second window in which to notice something was wrong. Whereas a timeseries plot would make this easier to see.

I'm really excited to have visibility into this, but I'm worried that a visualization that's biased towards only showing healthy event loop times is even worse than no indicator at all.

@mrocklin (Member Author)

> I'm really excited to have visibility into this, but I'm worried that a visualization that's biased towards only showing healthy event loop times is even worse than no indicator at all.

In practice this thing is super valuable. I'd encourage folks to try it out before raising too many concerns.

> I'm worried the signal will be nearly immediately overridden by the next tick measurement cycle. What if the scheduler event loop is blocked for 10s straight (as it is most times update_graph_hlg runs...), then goes back to normal? Won't the chart stay at 0.01s for all 10 seconds, then jump to 10s for an instant once all the blocked PeriodicCallbacks get to run again, but then go back to normal at most 500ms later when EventLoop.update runs again? That gives you a half-second window in which to notice something was wrong. Whereas a timeseries plot would make this easier to see.

Again, in practice I'm getting good signal from this thing.

At one point I started doing an EWMA on it with a lighter-transparency bar. I think that would give you what you're looking for. A timeseries would also be good. All of these ideas are good. I've got what I need here though, and I'm focused on other things. I think that this is a strict improvement over the current state, doesn't add significant complexity (I think), and so I think it should go in.

If you're saying "Yes, this is cool, in the future it would be even cooler if it did X, Y, and Z" then I'm entirely aligned.

If you're saying "This is cool, but to go in it needs to be cooler and do X, Y, and Z" then I think that that would engage a conversation about prioritization, and whether or not X, Y, and Z are the highest priority things at the moment.

If you're saying "This isn't cool, and is a degradation of current behavior" then let's pause and chat about it. My experience using this is that it's very helpful. It's not perfect, but I don't think it needs to be.
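
For reference, a minimal sketch of the EWMA smoothing mentioned above (illustrative, not the PR's implementation): a smoothed value decays gradually after a spike instead of snapping back on the next update.

```python
def ewma(previous: float, new: float, alpha: float = 0.1) -> float:
    """Exponentially weighted moving average; higher alpha reacts faster."""
    return alpha * new + (1 - alpha) * previous


smoothed = 0.02  # start near the configured tick interval
for measurement in [0.02, 0.02, 10.0, 0.02, 0.02, 0.02]:
    smoothed = ewma(smoothed, measurement)
    # After the 10 s blockage the smoothed value stays elevated for several
    # updates, so a brief spike remains visible on the plot for a while.
    print(f"raw={measurement:5.2f}s  smoothed={smoothed:5.2f}s")
```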

@gjoseph92 (Collaborator)

I'm saying "I'd like to see a GIF of what a single 3sec pause of the scheduler event loop looks like with this chart."

I definitely agree this is helpful for shuffling. But there are other cases in which we're also worried about event loop blocking:

I think all of those cases above would be things this plot would show clearly, since they'd be brief but repeated interruptions instead of long but rare interruptions. So I also imagine this will be an improvement. But I still want to understand how misleading this will be in other cases before being sure of that judgement.

@mrocklin (Member Author)

> I'm saying "I'd like to see a GIF of what a single 3sec pause of the scheduler event loop looks like with this chart."

I'm not sure if this is what you're asking, but it's unlikely that I'll take the time to make that GIF.

You're absolutely right though that this won't catch everything. If that's your concern then I absolutely cede the point.

In practice, smooth sailing, followed by a large pause, followed by smooth sailing isn't as common, I don't think, as general cruddy performance for an extended period of time.

@fjetter (Member) left a comment

I agree with both @gjoseph92's concern and @mrocklin's urge to move forward.

I agree that many use cases would actually want a more detailed visualization, e.g. using an additional EWMA as suggested. I think the use case of high-performing extensions (e.g. stealing, shuffling) is where this is good enough, but for loop-health debugging in "ordinary" circumstances I can easily see that we need more.
However, the first iteration doesn't need to be perfect. I think it's perfectly fine to ask for more, but if @mrocklin is not willing to add more to this, that's also fine. In almost all situations it should be OK not to increase the scope of a PR if the additional scope consists of improvements rather than actual problems. From what I understand, most raised concerns are enhancements on this PR and we can follow up. Is this understanding correct?

Re: "we should get rid of the measure-tick thing"

The way I see it, this system has existed for a while now and this PR improves visibility into this metric. Whether this metric is a problem or not can be discussed. We're not stating this anywhere, but I would consider most individual-* plots to be experimental. If we remove the internal instrumentation supporting a given plot, we'll just remove the plot together with the instrumentation.

I had a minor comment about the worker heartbeat, but this is not a blocker. If there is nothing else, we're good to go from my end.

distributed/worker.py (review comment resolved)
@mrocklin (Member Author)

Woot. Green tests.

@fjetter (Member) commented Mar 23, 2022

> Woot. Green tests.

I think your tests haven't even been scheduled yet. There seems to be something wrong with GH actions :)

@gjoseph92 (Collaborator)

> From what I understand, most raised concerns are enhancements

My concern was about misleading users, or getting misleading bug reports like "the scheduler isn't doing anything and I see event loop blocked errors, but the event loop health plot looks fine." If this were going to be on a highly-visible page like /status, I'd want to see EWMA or something before merging. Given that this is very low-visibility right now though, I'm ok merging as-is, since it's still very useful and I agree we have other things to do. I just wouldn't want to make it more visible until we address this.

@mrocklin (Member Author)

Any objection to merging?

@mrocklin mrocklin merged commit 2fbf9eb into dask:main Mar 25, 2022
@mrocklin mrocklin deleted the event-loop-dashboard branch March 25, 2022 18:38