
Track Event Loop intervals in dashboard plot #5964

Merged: 9 commits merged into dask:main from the event-loop-dashboard branch on Mar 25, 2022

Conversation

@mrocklin (Member) commented Mar 19, 2022

[Screenshot: the new event loop interval plot in the dashboard]

@github-actions bot (Contributor) commented Mar 19, 2022

Unit Test Results

    12 files  ±0      12 suites  ±0   6h 15m 11s ⏱️ +6m 53s
 2 670 tests +1    2 589 ✔️ +2     80 💤 −2   1 ❌ +1
15 948 runs  +6   15 104 ✔️ +7    843 💤 −2   1 ❌ +1

For more details on these failures, see this check.

Results for commit 55e8090. ± Comparison against base commit 5c7d555.

♻️ This comment has been updated with latest results.

@mrocklin (Member Author)

cc @graingert if you have time to take a quick look

@ncclementi (Member)

@mrocklin can you add the corresponding entry to this doc page, please? https://github.com/dask/distributed/blob/main/docs/source/http_services.rst

@phobson left a comment

Looks good to me. I pulled down the branch and tested it out. It worked as expected on a handful of the notebooks in the dask-tutorial and dask-examples repos.

@mrocklin (Member Author)

Merging in 24 hours if there are no further comments.

@graingert (Member)

The tick counter itself seems problematic (tracked in #5908). I'd rather not build anything on top of it until we've determined a replacement for it.

@mrocklin mentioned this pull request on Mar 22, 2022
@mrocklin (Member Author)

> The tick counter itself seems problematic (tracked in #5908). I'd rather not build anything on top of it until we've determined a replacement for it.

If it's problematic for tests (I'd like to learn more about how), then let's fix it in tests (#5977 is one simple attempt at this). I'm open to replacements for this structure if anyone has ideas. If no one has ideas, then I'm inclined to push forward though. This is important for tracking system health.

First things first though, any suggestions on alternatives? I'd love to find something better.

@gjoseph92 (Collaborator) left a comment

I am skeptical that when the event loop is blocked, bokeh will actually be able to send updates to the dashboard showing that this is happening (i.e. the chart may either be flickering around 0.02, or frozen at 0.02, but not go above 0.02 except for a brief, instantaneous high flicker right after the loop un-blocks). I haven't tested this—could you confirm that (time.sleep in an async client.run_on_scheduler, etc.)?

If the dashboard doesn't adequately show this, maybe we should change the visualization to something that would show the delay post-hoc (a timeseries of loop intervals?). Ideally, we'd have the plot increment itself client-side (JS), and then zero itself back out when it receives an update from the scheduler.
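
For anyone who wants to try the experiment suggested above, here is a minimal sketch (the blocking function and its name are illustrative; `Client.run_on_scheduler` is the real distributed API):

```python
import time

from dask.distributed import Client

client = Client()  # assumes a throwaway local cluster is fine for this test


async def block_loop(dask_scheduler=None):
    # time.sleep (not asyncio.sleep) deliberately blocks the scheduler's
    # event loop for three seconds while the dashboard is open.
    time.sleep(3)


# run_on_scheduler executes the function inside the scheduler process, so the
# sleep lands on the scheduler's event loop thread; watch how the
# /individual-event-loop plot behaves during and after the pause.
client.run_on_scheduler(block_loop)
```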

```python
measure_tick_interval = parse_timedelta(
self._tick_counter = 0
self._tick_count = 0
self._tick_count_last = time()
```
Collaborator

Let's use a monotonic timer here #4528

Member Author

We currently use metrics.time because of Windows support. If we want to switch to something better, I'm all in favor, but I hesitate to launch into this immediately. Instead I'd rather keep the standard way of doing things and then consider switching to monotonic en masse. Thoughts?
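
For context, a rough illustration of the difference being discussed (illustrative only; `distributed.metrics.time` is roughly a wall clock on most platforms, while a monotonic clock cannot jump when the system time is adjusted):

```python
import time

start_wall = time.time()       # wall clock; can jump if system time is adjusted (e.g. NTP)
start_mono = time.monotonic()  # monotonic clock, as proposed in #4528; only moves forward

time.sleep(0.02)  # stand-in for one event-loop tick

print(f"wall:      {time.time() - start_wall:.4f}s")
print(f"monotonic: {time.monotonic() - start_mono:.4f}s")
```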

Comment on lines 244 to 245
```python
pc = PeriodicCallback(self._measure_tick, self._tick_interval * 1000)
self.periodic_callbacks["tick"] = pc
```
Collaborator

Suggested change:

```diff
- pc = PeriodicCallback(self._measure_tick, self._tick_interval * 1000)
- self.periodic_callbacks["tick"] = pc
+ self.periodic_callbacks["tick"] = PeriodicCallback(self._measure_tick, self._tick_interval * 1000)
```

pc is unused otherwise

Member Author

Sure, done. I also did this for the lines above.

```diff
@@ -97,6 +98,7 @@
     "/individual-compute-time-per-key": individual_doc(ComputePerKey, 500),
     "/individual-aggregate-time-per-action": individual_doc(AggregateAction, 500),
     "/individual-scheduler-system": individual_doc(SystemMonitor, 500),
+    "/individual-event-loop": individual_doc(EventLoop, 500),
```
Collaborator

Should the interval for this plot correspond to the tick cycle? Otherwise it will run unnecessarily often.

Member Author

It's pretty cheap. I don't think that this matters much.

```python
    toolbar_location="above",
    **kwargs,
)
self.root.vbar(x="names", top="values", width=0.9, source=self.source)
```
Collaborator

Ideally I'd like to see color as well (yellow if the interval is over a certain threshold, like the worker memory plot). I think 0.1s would be a reasonable threshold; this is asyncio's default slow_callback_duration when debug mode is turned on.
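
A minimal sketch of that threshold-coloring idea, assuming a server-side color column (this is not the PR's code; names and figure options are illustrative):

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

THRESHOLD = 0.1  # asyncio's default slow_callback_duration in debug mode

source = ColumnDataSource(data={"names": [], "values": [], "color": []})

fig = figure(x_range=[], title="Event Loop Interval", toolbar_location="above")
fig.vbar(x="names", top="values", width=0.9, color="color", source=source)


def update(names, intervals):
    # Color each bar yellow once the measured interval crosses the threshold,
    # similar in spirit to the worker memory plot.
    colors = ["yellow" if v > THRESHOLD else "blue" for v in intervals]
    source.data = {"names": names, "values": intervals, "color": colors}
```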

Member Author

I agree with the ideal. However, I'd like to get this in and focus elsewhere for now if possible.

@mrocklin (Member Author)

> I am skeptical that when the event loop is blocked, bokeh will actually be able to send updates to the dashboard showing that this is happening (i.e. the chart may either be flickering around 0.02, or frozen at 0.02, but not go above 0.02 except for a brief, instantaneous high flicker right after the loop un-blocks). I haven't tested this—could you confirm that (time.sleep in an async client.run_on_scheduler, etc.)?
>
> If the dashboard doesn't adequately show this, maybe we should change the visualization to something that would show the delay post-hoc (a timeseries of loop intervals?). Ideally, we'd have the plot increment itself client-side (JS), and then zero itself back out when it receives an update from the scheduler.

You're right to be skeptical: when the scheduler event loop is blocked, the dashboard certainly won't update. However, we will get the signal afterwards. We track the amount of time since the last time we cycled things, and use that.
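
A rough sketch of that mechanism (not the exact distributed code; names are illustrative): a periodic callback records how long it has been since it last ran, so a blocked loop shows up as a large delta on the first run after it unblocks.

```python
import time

from tornado.ioloop import PeriodicCallback

tick_interval = 0.02  # seconds between scheduled ticks
last_tick = time.monotonic()
latest_tick_duration = 0.0


def measure_tick():
    # Runs on the event loop; if the loop was blocked, `now - last_tick`
    # is much larger than tick_interval, and that value is what gets plotted.
    global last_tick, latest_tick_duration
    now = time.monotonic()
    latest_tick_duration = now - last_tick
    last_tick = now


# PeriodicCallback takes its interval in milliseconds and needs a running
# IOLoop (the scheduler's, in practice) to actually fire.
pc = PeriodicCallback(measure_tick, tick_interval * 1000)
pc.start()
```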

For workers that are missing I'm actually considering integrating this value into the other bar charts, like memory. We have a wc.last_seen value that is useful here. I consider that to be other work though.

@gjoseph92 (Collaborator)

> However, we will get the signal afterwards.

I'm worried the signal will be nearly immediately overridden by the next tick measurement cycle. What if the scheduler event loop is blocked for 10s straight (as it is most times update_graph_hlg runs...), then goes back to normal? Won't the chart stay at 0.01s for all 10 seconds, then jump to 10s for an instant once all the blocked PeriodicCallbacks get to run again, but then go back to normal at most 500ms later when EventLoop.update runs again? That gives you a half-second window in which to notice something was wrong. Whereas a timeseries plot would make this easier to see.

I'm really excited to have visibility into this, but I'm worried that a visualization that's biased towards only showing healthy event loop times is even worse than no indicator at all.

@mrocklin (Member Author)

> I'm really excited to have visibility into this, but I'm worried that a visualization that's biased towards only showing healthy event loop times is even worse than no indicator at all.

In practice this thing is super valuable. I'd encourage folks to try it out before raising too many concerns.

> I'm worried the signal will be nearly immediately overridden by the next tick measurement cycle. What if the scheduler event loop is blocked for 10s straight (as it is most times update_graph_hlg runs...), then goes back to normal? Won't the chart stay at 0.01s for all 10 seconds, then jump to 10s for an instant once all the blocked PeriodicCallbacks get to run again, but then go back to normal at most 500ms later when EventLoop.update runs again? That gives you a half-second window in which to notice something was wrong. Whereas a timeseries plot would make this easier to see.

Again, in practice I'm getting good signal from this thing.

At one point I started doing an EWMA on it with a lighter-transparency bar. I think that would give you what you're looking for. A timeseries would also be good. All of these ideas are good. I've got what I need here though, and I'm focused on other things. I think that this is a strict improvement over the current state, doesn't add significant complexity (I think), and so I think it should go in.

If you're saying "Yes, this is cool, in the future it would be even cooler if it did X, Y, and Z" then I'm entirely aligned.

If you're saying "This is cool, but to go in it needs to be cooler and do X, Y, and Z" then I think that that would engage a conversation about prioritization, and whether or not X, Y, and Z are the highest priority things at the moment.

If you're saying "This isn't cool, and is a degradation of current behavior" then let's pause and chat about it. My experience using this is that it's very helpful. It's not perfect, but I don't think it needs to be.
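
For reference, a minimal sketch of the EWMA smoothing mentioned above (illustrative, not the PR's implementation): a smoothed value decays gradually after a spike instead of snapping back on the next update.

```python
def ewma(previous: float, new: float, alpha: float = 0.1) -> float:
    """Exponentially weighted moving average; higher alpha reacts faster."""
    return alpha * new + (1 - alpha) * previous


smoothed = 0.02  # start near the configured tick interval
for measurement in [0.02, 0.02, 10.0, 0.02, 0.02, 0.02]:
    smoothed = ewma(smoothed, measurement)
    # After the 10 s blockage the smoothed value stays elevated for several
    # updates, so a brief spike remains visible on the plot for a while.
    print(f"raw={measurement:5.2f}s  smoothed={smoothed:5.2f}s")
```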

@gjoseph92 (Collaborator)

I'm saying "I'd like to see a GIF of what a single 3sec pause of the scheduler event loop looks like with this chart."

I definitely agree this is helpful for shuffling. But there are other cases in which we're also worried about event loop blocking:

I think all of those cases above would be things this plot would show clearly, since they'd be brief but repeated interruptions instead of long but rare interruptions. So I also imagine this will be an improvement. But I still want to understand how misleading this will be in other cases before being sure of that judgement.

@mrocklin (Member Author)

> I'm saying "I'd like to see a GIF of what a single 3sec pause of the scheduler event loop looks like with this chart."

I'm not sure if this is what you're asking, but it's unlikely that I'll take the time to make that GIF.

You're absolutely right though that this won't catch everything. If that's your concern then I absolutely cede the point.

In practice, smooth sailing, followed by a large pause, followed by smooth sailing isn't as common, I don't think, as general cruddy performance for an extended period of time.

@fjetter (Member) left a comment

I agree with both @gjoseph92's concern and @mrocklin's urge to move forward.

I agree that many use cases would actually want a more detailed visualization, e.g. using an additional EWMA as suggested. I think the use case of high-performing extensions (e.g. stealing, shuffling) is where this is good enough, but for loop-health debugging in "ordinary" circumstances I can easily see that we need more.
However, the first iteration doesn't need to be perfect. I think it's perfectly fine to ask for more, but if @mrocklin is not willing to add more to this, that's also fine. In almost all situations it should be OK not to increase the scope of a PR if the additional scope consists of improvements rather than actual problems. From what I understand, most raised concerns are enhancements on this PR and we can follow up. Is this understanding correct?

Re: "we should get rid of the measure-tick thing"

The way I see it, this system has existed for a while now and this PR improves visibility into this metric. Whether this metric is a problem or not can be discussed. We're not stating this anywhere, but I would consider most individual-* plots to be experimental. If we remove the internal instrumentation supporting a given plot, we'll just remove the plot together with the instrumentation.

I had a minor comment about the worker heartbeat, but this is not a blocker. If there is nothing else, we're good to go from my end.

distributed/worker.py (review comment resolved)
@mrocklin (Member Author)

Woot. Green tests.

@fjetter (Member) commented Mar 23, 2022

> Woot. Green tests.

I think your tests haven't even been scheduled yet. There seems to be something wrong with GH actions :)

@gjoseph92 (Collaborator)

> From what I understand, most raised concerns are enhancements

My concern was about misleading users, or getting misleading bug reports like "the scheduler isn't doing anything and I see event loop blocked errors, but the event loop health plot looks fine." If this were going to be on a highly-visible page like /status, I'd want to see EWMA or something before merging. Given that this is very low-visibility right now though, I'm ok merging as-is, since it's still very useful and I agree we have other things to do. I just wouldn't want to make it more visible until we address this.

@mrocklin (Member Author)

Any objection to merging?

@mrocklin mrocklin merged commit 2fbf9eb into dask:main Mar 25, 2022
@mrocklin mrocklin deleted the event-loop-dashboard branch March 25, 2022 18:38