
Collect worker-worker and type bandwidth information #3094

Merged · 12 commits · Oct 1, 2019

Conversation

@mrocklin (Member) commented Sep 26, 2019

This collects the bandwidth that we observe, both by type and by worker-worker pair.

This is currently used in dashboard plots (see below) and may be used in scheduling decisions in the future.
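For a rough sense of the bookkeeping involved, here is a minimal sketch (not the PR's actual code; the class and attribute names are illustrative) of maintaining per-type and per-pair running bandwidth estimates with an exponentially weighted moving average, assuming each transfer reports a (source, dest, typename, nbytes, duration) observation:

    from collections import defaultdict

    class BandwidthTracker:
        """Illustrative sketch: smooth observed transfer rates with an
        exponentially weighted moving average, keyed both by payload
        type and by (source, destination) worker pair."""

        def __init__(self, alpha=0.1):
            self.alpha = alpha
            self.bandwidth_types = defaultdict(float)    # typename -> bytes/s
            self.bandwidth_workers = defaultdict(float)  # (src, dst) -> bytes/s

        def record_transfer(self, src, dst, typename, nbytes, duration):
            rate = nbytes / duration  # bytes per second for this transfer
            for table, key in ((self.bandwidth_types, typename),
                               (self.bandwidth_workers, (src, dst))):
                if table[key]:
                    table[key] = (1 - self.alpha) * table[key] + self.alpha * rate
                else:
                    table[key] = rate  # first observation seeds the estimate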

cc @quasiben @pentschev

[Screenshot: dashboard plots of bandwidth by type and by worker pair]

@quasiben (Member)

Very cool!

@mrocklin (Member Author)

This information is also available if people want to dive into the scheduler internals directly:

In [1]: from dask.distributed import Client
In [2]: client = Client()
In [3]: import dask.array as da
In [4]: x = da.random.random((30000, 30000), chunks="128 MiB")

In [5]: y = (x + x.T) - x.mean(axis=0); y.sum().compute()
Out[5]: 449994918.14960176

In [6]: client.cluster.scheduler.bandwidth
Out[6]: 101225272.64331588

In [7]: client.cluster.scheduler.bandwidth_types
Out[7]: defaultdict(float, {'numpy.ndarray': 53980385.799921274})

In [8]: client.cluster.scheduler.bandwidth_workers
Out[8]:
defaultdict(float,
            {('tcp://127.0.0.1:52998',
              'tcp://127.0.0.1:52997'): 30879879.40114621,
             ('tcp://127.0.0.1:52998',
              'tcp://127.0.0.1:53000'): 32331470.115519114,
             ('tcp://127.0.0.1:53000',
              'tcp://127.0.0.1:52997'): 37212961.59090071,
             ('tcp://127.0.0.1:52997',
              'tcp://127.0.0.1:53000'): 16128730.449103117,
             ('tcp://127.0.0.1:52999',
              'tcp://127.0.0.1:52997'): 14742638.825626642,
             ('tcp://127.0.0.1:52997',
              'tcp://127.0.0.1:52999'): 172327687.3918158,
             ('tcp://127.0.0.1:52999',
              'tcp://127.0.0.1:53000'): 317673916.5222039,
             ('tcp://127.0.0.1:52998',
              'tcp://127.0.0.1:52999'): 5480834.635507911,
             ('tcp://127.0.0.1:52999',
              'tcp://127.0.0.1:52998'): 8030296.722465654,
             ('tcp://127.0.0.1:52997',
              'tcp://127.0.0.1:52998'): 137043235.69919452,
             ('tcp://127.0.0.1:53000',
              'tcp://127.0.0.1:52999'): 5711893.887075517,
             ('tcp://127.0.0.1:53000',
              'tcp://127.0.0.1:52998'): 59961461.29405087})
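On a cluster where the scheduler doesn't live in the client process (so client.cluster.scheduler isn't reachable in-process), the same dicts can be fetched over the wire with Client.run_on_scheduler; a sketch, where the helper function is our own rather than part of the PR:

    def get_bandwidths(dask_scheduler):
        # run_on_scheduler injects the Scheduler instance as `dask_scheduler`
        return {
            "overall": dask_scheduler.bandwidth,
            "by_type": dict(dask_scheduler.bandwidth_types),
            "by_workers": dict(dask_scheduler.bandwidth_workers),
        }

    client.run_on_scheduler(get_bandwidths)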

@mrocklin (Member Author)

And now the worker-worker data is also available visually:

[Screenshot: worker-worker bandwidth dashboard plot]

@mrocklin (Member Author)

Now also available in JupyterLab

[Screenshot: bandwidth plots in JupyterLab]

@mrocklin (Member Author) commented Oct 1, 2019

There are problems with this in the many-worker case:

  1. If there are, say, 1000 workers, then this will maintain a roughly million-entry mapping, updated fairly frequently (see the count sketched below).
  2. Workers never learn when other workers die, so they never forget bandwidths to departed peers. They continuously send this stale information to the scheduler, which can be a problem if there are many workers, or if workers are highly adaptive and come and go rapidly.
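To put the first point in numbers: the count of directed (source, destination) pairs grows quadratically with cluster size, so per-pair state quickly becomes the dominant cost:

    for n_workers in (10, 100, 1000):
        n_pairs = n_workers * (n_workers - 1)  # directed pairs
        print(f"{n_workers:>5} workers -> {n_pairs:>9,} pairs")
    # 10 -> 90, 100 -> 9,900, 1000 -> 999,000 (~a million entries)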

@mrocklin (Member Author) commented Oct 1, 2019

OK, resolved. We no longer keep long-term bandwidths on the workers; instead we clear that information and send only diffs up to the scheduler. This should improve scalability as well.
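Roughly, the shape of the fix (a hedged sketch; the names here are illustrative, not distributed's actual internals) is for each worker to buffer measurements between heartbeats, ship them, and then clear its local state so dead peers age out naturally:

    from collections import defaultdict

    class BandwidthDiffBuffer:
        def __init__(self):
            self._samples = defaultdict(list)  # (src, dst) -> [bytes/s, ...]

        def record(self, src, dst, nbytes, duration):
            self._samples[(src, dst)].append(nbytes / duration)

        def heartbeat_payload(self):
            # Report only what was observed since the last heartbeat,
            # then forget it; entries for dead peers simply stop appearing.
            payload = {pair: sum(rates) / len(rates)
                       for pair, rates in self._samples.items()}
            self._samples.clear()
            return payload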

@TomAugspurger (Member)

@mrocklin I'm digging into the worker code now, but perhaps you know off the top of your head: why would there ever be communication from a worker to itself?

[Screenshot: bandwidth-by-worker plot showing a worker-to-self entry]

@mrocklin (Member Author) commented May 13, 2020 via email

@TomAugspurger (Member)

Yep, just noticed that when I restarted the cluster and the boxes were still there. Makes sense, thanks.
