Learn heterogeneous bandwidths #2743

mrocklin opened this issue Jun 3, 2019 · 0 comments



commented Jun 3, 2019

To make good scheduling decisions, the scheduler often has to estimate how long transfers will take. Currently, it learns a single, uniform bandwidth as an exponentially weighted moving average of what the workers observe.
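
For readers unfamiliar with the approach, here is a minimal sketch of a uniform exponentially weighted moving average over observed transfers. The class name, the initial guess, and the smoothing factor `alpha` are assumptions made for illustration, not the scheduler's actual code or defaults.

```python
class BandwidthEWMA:
    """Hypothetical single, cluster-wide bandwidth estimate in bytes/second."""

    def __init__(self, initial=100e6, alpha=0.1):
        self.value = initial  # assumed starting guess
        self.alpha = alpha    # assumed smoothing factor, 0 < alpha <= 1

    def update(self, nbytes, seconds):
        """Fold one observed transfer into the running average."""
        if seconds <= 0:
            return self.value  # ignore unusable measurements
        observed = nbytes / seconds
        self.value = (1 - self.alpha) * self.value + self.alpha * observed
        return self.value

    def transfer_time(self, nbytes):
        """Estimated seconds to move nbytes at the current estimate."""
        return nbytes / self.value


est = BandwidthEWMA(initial=100e6, alpha=0.5)
est.update(nbytes=200e6, seconds=1.0)  # a worker reports 200 MB in 1 s
print(est.value, est.transfer_time(150e6))
```

Every transfer, regardless of data type or endpoints, feeds the same number, which is exactly the uniformity assumption discussed here.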

However, this assumption of uniformity breaks down in a few cases:

  1. Different data types often incur different serialization costs (which we bundle into bandwidth here)
  2. Different data types may also move over different transports, as with GPU data and NVLink
  3. Different workers may be closer to or farther from each other. For example, they may be on the same node, in the same rack, or in the same data center
  4. Very small frames often incur some fixed baseline cost
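
To make the last point concrete, here is a small worked example assuming transit time is a fixed latency plus a size-dependent term. The linear model and the numbers (1 ms latency, 1 GB/s link) are illustrative assumptions, not measurements.

```python
def transit_time(nbytes, latency, bandwidth):
    """Assumed model: fixed per-transfer cost plus size-proportional cost."""
    return latency + nbytes / bandwidth


def effective_bandwidth(nbytes, latency, bandwidth):
    """Bandwidth an observer would infer from bytes / elapsed time."""
    return nbytes / transit_time(nbytes, latency, bandwidth)


latency = 1e-3    # assumed 1 ms baseline cost per transfer
bandwidth = 1e9   # assumed 1 GB/s link

small = effective_bandwidth(1e3, latency, bandwidth)  # 1 kB frame
large = effective_bandwidth(1e9, latency, bandwidth)  # 1 GB frame
print(f"small frame: {small:.3g} B/s, large frame: {large:.3g} B/s")
```

Under these assumptions the small frame appears to travel orders of magnitude slower than the large one, so mixing both into a single bandwidth average skews the estimate.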

Learning a model that estimates the total transit time of a piece of data would be useful, but it may also be somewhat tricky. There is a balance to be struck between generalizing across the cluster and across data types, and learning whatever heterogeneity exists.

Also, this needs to be fairly lightweight on the scheduler.
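
One possible shape for such a model, offered only as a sketch: keep a global average plus one average per key, and blend the two according to how many samples the key has seen. The class, the key layout (type name, sender, receiver), and the `min_samples` blending rule are all hypothetical. Each update and lookup is a constant-time dictionary operation, which keeps the cost on the scheduler small.

```python
from collections import defaultdict


class HeterogeneousBandwidth:
    """Hypothetical per-key bandwidth estimates with a global fallback."""

    def __init__(self, initial=100e6, alpha=0.1, min_samples=5):
        self.alpha = alpha              # assumed smoothing factor
        self.min_samples = min_samples  # assumed prior strength for blending
        self.global_bw = initial        # cluster-wide estimate, bytes/second
        self.by_key = {}                # key -> bandwidth estimate
        self.counts = defaultdict(int)  # key -> number of observations

    def update(self, key, nbytes, seconds):
        if seconds <= 0:
            return  # ignore unusable measurements
        observed = nbytes / seconds
        a = self.alpha
        self.global_bw = (1 - a) * self.global_bw + a * observed
        prev = self.by_key.get(key, observed)
        self.by_key[key] = (1 - a) * prev + a * observed
        self.counts[key] += 1

    def bandwidth(self, key):
        n = self.counts.get(key, 0)
        if n == 0:
            return self.global_bw  # unseen key: generalize from the cluster
        w = n / (n + self.min_samples)  # trust the key more as samples grow
        return w * self.by_key[key] + (1 - w) * self.global_bw


model = HeterogeneousBandwidth(initial=100e6, alpha=0.5, min_samples=1)
gpu_key = ("cupy.ndarray", "worker-a", "worker-b")   # illustrative keys
cpu_key = ("numpy.ndarray", "worker-a", "worker-c")
for _ in range(3):
    model.update(gpu_key, nbytes=1e9, seconds=1.0)
    model.update(cpu_key, nbytes=100e6, seconds=1.0)
print(model.bandwidth(gpu_key), model.bandwidth(cpu_key))
```

The number of keys can grow with types times worker pairs, so a real implementation would likely need coarser keys or eviction. That is left open here.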
