Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Learn bandwidth over time #2658

Merged
merged 4 commits into from May 13, 2019

Conversation

@mrocklin
Copy link
Member

commented May 5, 2019

In order to schedule tasks intelligently we need to know how long
communications will take. To do this, we need to estimate the bandwidth
of the network. This can vary by orders of magnitude depending on
hardwware.

Previously we asked the user to specify this in configuration.

Now we learn it over time. Each worker keeps an exponentially weighted
moving average for all of its data communications. It sends this
information to the scheduler as part of the heartbeats (which include
lots of other diagnostic information). The scheduler updates its own
measurement accordingly.

Eventually, I may want to make our model of bandwidth more complex,
to include type and network information. I thought I'd start here though.

mrocklin added some commits May 5, 2019

Learn bandwidth over time
In order to schedule tasks intelligently we need to know how long
communications will take.  To do this, we need to estimate the bandwidth
of the network.  This can vary by orders of magnitude depending on
hardwware.

Previously we asked the user to specify this in configuration.

Now we learn it over time.  Each worker keeps an exponentially weighted
moving average for all of its data communications.  It sends this
information to the scheduler as part of the heartbeats (which include
lots of other diagnostic information).  The scheduler updates its own
measurement accordingly.
@mrocklin

This comment has been minimized.

Copy link
Member Author

commented May 6, 2019

@quasiben you might enjoy reviewing this

@StephanErb
Copy link

left a comment

Interesting PR, thanks! This will make it easier for us to operate Dask on a diverse set of hardware platforms.

@@ -1842,6 +1846,7 @@ def gather_dep(self, worker, dep, deps, total_nbytes, cause=None):

total_bytes = sum(self.nbytes.get(dep, 0) for dep in response["data"])
duration = (stop - start) or 0.5
bandwidth = total_bytes / (duration - abs(self.scheduler_delay))

This comment has been minimized.

Copy link
@StephanErb

StephanErb May 10, 2019

There is a non-zero chance this results in a division by zero.

I am wondering if we could simply leave out the abs(self.scheduler_delay)) part?

  • the bandwidth is an approximation anyway
  • the code would be more robust
  • it would be consistent with the metrics computed below

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 10, 2019

Author Member

Thanks for the comment @StephanErb . What you say makes sense to me. I've removed the scheduler_delay change.

@@ -1336,6 +1337,8 @@ def heartbeat_worker(
host_info = host_info or {}

self.host_info[host]["last-seen"] = local_now
frac = 1 / 20 / len(self.workers)

This comment has been minimized.

Copy link
@StephanErb

StephanErb May 10, 2019

This is a bit over my head: Why do we need to divide by the number of workers?

On a large cluster with little activity, I fear that this could lead to very slow convergence.

This comment has been minimized.

Copy link
@mrocklin

mrocklin May 10, 2019

Author Member

Each worker sends a heartbeat independently. So the more workers there are, the more heartbeats there are. This delay time means (I think) that our bandwidth value changes with a time scale of around 20 heartbeats. This is probably about a minute of time, so should be fairly fast on the scheduler side. The worker's heartbeat also has a time scale though, and it only changes when there are actual communications, so that might move more slowly.

@StephanErb
Copy link

left a comment

LGTM, thanks for the update.

@mrocklin

This comment has been minimized.

Copy link
Member Author

commented May 13, 2019

Thanks for the review @StephanErb . I suspect that we will need to modify this in the future after we have more experience with it. Please raise an issue if you notice something odd.

@mrocklin mrocklin merged commit 3142dda into dask:master May 13, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@mrocklin mrocklin deleted the mrocklin:learn-bandwidth branch May 13, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.