Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dashboard component for size of open data transfers #6982

Merged
merged 7 commits into from
Sep 9, 2022

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Aug 31, 2022

Follow-up to #6936

Blocked by #6975

  • adds a plot the size of incoming/outgoing data transfers per worker
  • adds that plot to the tabs on the status page

transferbytes

  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions
Copy link
Contributor

github-actions bot commented Aug 31, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±  0         15 suites  ±0   6h 44m 31s ⏱️ + 16m 51s
  3 082 tests  -   4    2 998 ✔️  -   2    84 💤  - 1  0  - 1 
22 802 runs   - 32  21 818 ✔️  - 30  984 💤  - 1  0  - 1 

Results for commit 06d4d11. ± Comparison against base commit bfc5cfe.

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member

fjetter commented Sep 1, 2022

pic/gif?

@hendrikmakait hendrikmakait marked this pull request as ready for review September 1, 2022 12:27
@crusaderky
Copy link
Collaborator

crusaderky commented Sep 2, 2022

Some design observations:

  • I don't think that setting the x scale to memory_limit is very useful.
    A more reasonable scale would be max(memory_limit * transfer, current), or however you called the new 10% threshold you're implementing in Worker can memory overflow by fetching too much data at once #6208.
    Check WorkerMemory for how the dynamic x scale is implemented (it's tricky to avoid flickering).
  • I struggle reading the two bars side by side for each worker. The red one instinctively feels more important than the blue one. Maybe putting the origin in the center and having two bars come out for each worker (grow to the left for incoming; grow to the right for outgoing) would look more readable?
  • the tooltip for each worker should show both incoming and outgoing
  • it would be nice to dynamically change the title to show the grand total cluster-wide (outgoing and incoming should roughly be the same, minus race conditions, so just pick an arbitrary one)

@hendrikmakait
Copy link
Member Author

I don't think that setting the x scale to memory_limit is very useful.
A more reasonable scale would be max(memory_limit * transfer, current), or however you called the new 10% threshold you're implementing in #6208.
Check WorkerMemory for how the dynamic x scale is implemented (it's tricky to avoid flickering).

Agreed, this should be updated once #6208 is merged. For now, this seems better than adding some superfluous calculations myself.

I struggle reading the two bars side by side for each worker. The red one instinctively feels more important than the blue one. Maybe putting the origin in the center and having two bars come out for each worker (grow to the left for incoming; grow to the right for outgoing) would look more readable?

I stole this pattern from the bandwidth plots (e.g. disk bandwidth). I'd leave it as is for now and create a follow-up issue for design improvements. These might apply to more than this one plot.

@fjetter
Copy link
Member

fjetter commented Sep 2, 2022

I don't think that setting the x scale to memory_limit is very useful.
A more reasonable scale would be max(memory_limit * transfer, current), or however you called the new 10% threshold you're implementing in #6208.
Check WorkerMemory for how the dynamic x scale is implemented (it's tricky to avoid flickering).

Agreed, this should be updated once #6208 is merged. For now, this seems better than adding some superfluous calculations myself.

I suggest to merge the other PR first then.

@fjetter
Copy link
Member

fjetter commented Sep 2, 2022

FWIW I think a great visualization would be an overlay with the bytes stored but I think that's not straight forward with bokeh

(Pardon the pic quality; It shows "reserved bytes" overlaying the unmanaged memory section)

Screenshot 2022-09-02 at 15 20 45

Edit: I'm very happy with this PR already! Just saying :)

@hendrikmakait
Copy link
Member Author

I suggest to merge the other PR first then.
👍

FWIW I think a great visualization would be an overlay with the bytes stored but I think that's not straight forward with bokeh

That's something for another PR and maybe a more experienced Bokeh wiz.

Regarding the uncommented changed from @crusaderky: I'm incorporating those and updating the pic.

@crusaderky
Copy link
Collaborator

FWIW I think a great visualization would be an overlay with the bytes stored but I think that's not straight forward with bokeh

I thought about that; the problem is that we don't have the information about how much of the transfer_incoming_bytes already contribute to unmanaged_memory and how much they are on top of it instead (as they haven't landed yet). We could figure out a way to measure it but I suspect we would need to probe quite deep in the networking stack, as you want to catch it literally as soon as it's copied from the network card's buffer.

@fjetter
Copy link
Member

fjetter commented Sep 2, 2022

I thought about that; the problem is that we don't have the information about how much of the transfer_incoming_bytes already contribute to unmanaged_memory and how much they are on top of it instead (as they haven't landed yet). We could figure out a way to measure it but I suspect we would need to probe quite deep in the networking stack, as you want to catch it literally as soon as it's copied from the network card's buffer.

true. Since it's not straight forward to do anyhow, let's forget I brought it up :)

@hendrikmakait
Copy link
Member Author

hendrikmakait commented Sep 2, 2022

true. Since it's not straight forward to do anyhow, let's forget I brought it up :)

FWIW that's why I added the plot as an additional tab to the status page panel. This way, you can at least observe them at the same time and we can spend some time thinking hard about making a good combined plot.

@fjetter fjetter merged commit 1fd07f0 into dask:main Sep 9, 2022
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants