Hi,
I'm facing an issue that has already been raised in other, more complicated forms; see pangeo-data/pangeo/issues/266, dask/dask/issues/3514, and probably other places. I've read those threads, but I don't yet see a simple solution or a good workaround for my problem.
I'm using Dask for a really basic embarrassingly parallel problem:
- I've got a scientific code that I will call via subprocess.
- I need to run it with exactly 3,123,120 different sets of parameters.
So here is basically what I do:
```python
import os
from subprocess import Popen, PIPE

# param1_set, param2_set, WORK_DIR and client are defined elsewhere
cmd_lines = []
for param1 in param1_set:
    for param2 in param2_set:
        # Some other nested loops
        cmd_lines.append(' '.join(("myscript.sh", param1, param2, ...)))

def run_process(cmd):
    os.chdir(WORK_DIR)
    process = Popen(cmd.split(' '), stdout=PIPE)
    (output, err) = process.communicate()
    exit_code = process.wait()
    return output, err

futures = client.map(run_process, cmd_lines)
results = client.gather(futures)
```
When taking only the first 100,000 elements of cmd_lines, this works with some minor problems:
- It takes between 40 seconds and 1m30s before my workers do anything; I imagine this is the time spent building the graph and sending it to the workers.
- Gathering takes about 20 seconds, which at least seems better than the scattering.
When trying with the full ~3 million parameter sets, I can't see any computation starting, and the Dashboard becomes almost unresponsive.
So an easy workaround is just to submit 100,000 tasks at a time, but I'm not satisfied with that: it unnecessarily complicates what is otherwise really simple code.
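A minimal sketch of that chunked workaround, using the same `run_process` and `cmd_lines` as above (the chunk size of 100,000 is simply what I know works):

```python
# Submit and gather 100,000 tasks at a time instead of all at once.
CHUNK = 100_000
results = []
for start in range(0, len(cmd_lines), CHUNK):
    futures = client.map(run_process, cmd_lines[start:start + CHUNK])
    results.extend(client.gather(futures))
```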
To sum up, I'd like a solution to two problems:
- The first is more of an annoyance: the time it takes before computation actually starts, with no feedback on the Dashboard. For such an embarrassingly parallel workload, I would expect tasks to start right after submission; they could be sent to workers in a streaming fashion, rather than waiting for all 100,000 tasks to reach the workers before anything begins.
- The second, which is probably closely related: the scheduler should be able to handle several million tasks at a time. (A sketch of the kind of manual bundling I'd like to avoid follows this list.)
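For reference, here is a minimal sketch of the manual bundling I'd rather not have to maintain: grouping commands so that each task runs many of them and the scheduler only tracks thousands of tasks instead of millions (the batch size of 100 is an arbitrary choice; `run_process` and `cmd_lines` are as above):

```python
# Run a whole batch of commands inside a single task, so the
# scheduler tracks len(cmd_lines) / BATCH tasks instead of
# one task per command.
BATCH = 100

def run_batch(cmds):
    return [run_process(cmd) for cmd in cmds]

batches = [cmd_lines[i:i + BATCH] for i in range(0, len(cmd_lines), BATCH)]
futures = client.map(run_batch, batches)
results = [res for batch in client.gather(futures) for res in batch]
```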