Performance issues with dask.scatter #3333

Hi,

I use dask.distributed to perform a large number of analysis routines on the same data. After pre-processing, I scatter my data to the workers, but it takes a long time to run: with a 28.5 MB data packet, the scatter takes about 7.5 s. This is with 32 workers on 2 nodes, connected via InfiniBand.
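The code snippet from the original post was not captured in this copy of the thread. Below is a hypothetical reconstruction of what the scatter call presumably looked like, based on the reproducer posted later in the thread; the scheduler-file path, variable names, and dtype are assumptions:

```python
# Hypothetical reconstruction -- the original snippet was lost in extraction.
# The scatter flags and array shape follow the reproducer below; the
# scheduler-file path is a placeholder.
from distributed import Client
import numpy as np

client = Client(scheduler_file="scheduler.json")

# With complex64, a 192 x 512 x 38 array is exactly 28.5 MiB, matching the
# size quoted above; the reproducer below uses complex128 (~57 MiB).
data = np.zeros([192, 512, 38], dtype=np.complex64)

future = client.scatter(data, broadcast=True, direct=True)
```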
---

@rkube thanks for the issue. Unfortunately, there is not a whole lot to go on here. Would it be possible to generate a reproducible example?
---

Right, here is the simplest example I could come up with:

```python
from distributed import Client
import numpy as np
import logging
import threading
import queue
import timeit

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s,%(msecs)d %(levelname)s: %(message)s",
    datefmt="%H:%M:%S",
)

def consume(Q, dask_client):
    while True:
        (i, data) = Q.get()
        if i == -1:
            # Sentinel item: stop consuming.
            Q.task_done()
            break
        # Time the scatter of each data packet to all workers.
        tic_sc = timeit.default_timer()
        future = dask_client.scatter(data, broadcast=True, direct=True)
        toc_sc = timeit.default_timer()
        logging.info(f"Scatter took {(toc_sc - tic_sc):6.4f}s")
        Q.task_done()

def main():
    dq = queue.Queue()
    data = np.zeros([192, 512, 38], dtype=np.complex128)
    dask_client = Client(scheduler_file="/scratch/gpfs/rkube/dask_work/scheduler.json")
    worker = threading.Thread(target=consume, args=(dq, dask_client))
    worker.start()

    for i in range(5):
        data = data + np.random.uniform(0.0, 1.0, data.shape)
        dq.put((i, data))
    dq.put((-1, None))  # signal the consumer to stop

    worker.join()
    dq.join()

if __name__ == "__main__":
    main()
```

And here is the output on a cluster running with 64 workers on 2 nodes: […]
---

Running this locally, it takes much less time for me:

```python
from dask.distributed import Client
import numpy as np

client = Client()
data = np.zeros([192, 512, 38], dtype=np.complex128)
%time client.scatter(data, broadcast=True, direct=True)
```

So I suspect that it has something to do with your setup. Maybe your client isn't well connected to your workers? Maybe something else? Unfortunately, as an organization we're not set up to do this level of support for free.
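One way to sanity-check the client-to-worker connection (an illustrative sketch, not from the thread; the scheduler-file path is a placeholder) is to time a no-op round trip and a small direct scatter:

```python
# Illustrative probe: measure round-trip latency and an effective
# client -> worker transfer rate.
import timeit

import numpy as np
from distributed import Client

client = Client(scheduler_file="scheduler.json")  # placeholder path

# Round-trip latency: run a no-op on every worker.
tic = timeit.default_timer()
client.run(lambda: None)
print(f"no-op round trip: {timeit.default_timer() - tic:.3f}s")

# Rough bandwidth: push 10 MiB directly to a worker.
payload = np.zeros(10 * 2**20 // 8)  # 10 MiB of float64
tic = timeit.default_timer()
future = client.scatter(payload, direct=True)
dt = timeit.default_timer() - tic
print(f"scatter 10 MiB: {dt:.3f}s (~{10 / dt:.0f} MiB/s)")
```

If the no-op round trip is already slow, the problem is latency or scheduling rather than raw bandwidth.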
---

Right, I checked it on a single node now. In this configuration the scatter doesn't take much time:

```
$ python processor_dask_mockup.py
distributed.scheduler - INFO - Receive client connection: Client-4a49417a-2b0e-11ea-940c-0894ef80904b
distributed.core - INFO - Starting established connection
09:11:36,968 INFO: Scatter took 0.4835s
09:11:37,338 INFO: Scatter took 0.3693s
09:11:37,755 INFO: Scatter took 0.4166s
09:11:38,163 INFO: Scatter took 0.4073s
09:11:38,575 INFO: Scatter took 0.4112s
```

This is with 64 dask workers running on the same node as the scheduler. In the case reported above, the scheduler was running on one of the two nodes, connected with gigabit ethernet.
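For scale, an editorial back-of-envelope (not from the thread): if the single client connection has to push a separate copy of the array to every worker on the remote node, gigabit ethernet becomes the bottleneck:

```python
# Rough estimate, assuming every copy crosses the wire from the client and
# there is no worker-to-worker replication tree (an assumption -- how
# broadcast replicates depends on the distributed version and settings).
nbytes = 192 * 512 * 38 * 16      # complex128 array: ~59.8 MB per copy
remote_workers = 32               # 64 workers on 2 nodes -> 32 on the far node
gbe = 125e6                       # ~1 Gbit/s ethernet ~= 125 MB/s

print(nbytes / gbe)                    # ~0.5 s for a single copy
print(nbytes * remote_workers / gbe)   # ~15 s worst case for a full broadcast
```

Treat the second figure as an upper bound; the point is that a ~60 MB broadcast over a single gigabit link can plausibly take seconds, while the same scatter within one node never touches the network.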
---

You'll probably have to do some profiling to figure out what is going on. I recommend the […]
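The recommendation is cut off in this copy of the thread. One common way to profile a slow section (a sketch; the `performance_report` context manager ships with newer `dask.distributed` releases and may postdate this issue, and the scheduler-file path and filename are placeholders) is:

```python
# Sketch: capture scheduler/worker activity around the slow scatter in an
# HTML report that can be inspected afterwards.
import numpy as np
from dask.distributed import Client, performance_report

client = Client(scheduler_file="scheduler.json")  # placeholder path
data = np.zeros([192, 512, 38], dtype=np.complex128)

with performance_report(filename="scatter-report.html"):
    future = client.scatter(data, broadcast=True, direct=True)
```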