
Several millions tasks and Client.map #2181

@guillaumeeb

Description


Hi,

I'm facing an issue that has already been raised in other, more complicated forms, see pangeo-data/pangeo/issues/266, dask/dask/issues/3514, and probably other places. I've read those threads, but I cannot yet see a simple solution or good workaround for my problem.

I'm using Dask to solve a really basic embarrassingly parallel problem:

  • I've got a scientific code that I call using subprocess
  • I need to run it with exactly 3,123,120 different sets of parameters.

So here is basically what I do:

import os
from subprocess import Popen, PIPE

cmd_lines = []
for param1 in param1_set:
    for param2 in param2_set:
         # Some other nested loops
         cmd_lines += [' '.join(("myscript.sh", param1, param2, ... ))]

def run_process(cmd):
    # Each task runs one command line in its own subprocess
    os.chdir(WORK_DIR)
    process = Popen(cmd.split(' '), stdout=PIPE)
    (output, err) = process.communicate()
    exit_code = process.wait()
    return output, err

futures = client.map(run_process, cmd_lines)
results = client.gather(futures)

When taking only the first 100,000 elements of cmd_lines, this works with some minor problems:

  • It takes between 40 seconds and 1m30s before my workers do anything; I imagine this is the time spent building the graph and sending it to the workers.
  • Gathering takes about 20 seconds, which is better than the submission side, though.

When trying with the full 3 million parameter sets, I can't see any computation starting, and the Dashboard becomes almost unresponsive.

So an easy workaround is to just submit tasks 100,000 at a time, but I'm not satisfied with this, and it unnecessarily complicates the really simple code here.
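For reference, a minimal sketch of that chunked workaround (assuming a connected `client`, and `run_process`/`cmd_lines` from above; the helper name `run_in_chunks` and the chunk size are my own, not anything from Dask):

```python
def run_in_chunks(client, func, items, chunk_size=100_000):
    """Submit items chunk by chunk so the scheduler never holds
    more than chunk_size tasks at once."""
    results = []
    for i in range(0, len(items), chunk_size):
        # Map one slice, wait for it to finish, then move on
        futures = client.map(func, items[i:i + chunk_size])
        results.extend(client.gather(futures))
    return results

# results = run_in_chunks(client, run_process, cmd_lines)
```

It works, but it serializes the chunks: the scheduler sits idle between a gather and the next map, and the extra driver loop is exactly the complexity I'd like to avoid.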

To sum up, I'd like to have a solution for two problems:

  • The first one is more of an annoyance: the time it takes before computation actually starts, with no feedback on the Dashboard. For such an embarrassingly parallel workload, I would expect tasks to start right after submission; they could be streamed to the workers, rather than waiting for all 100,000 tasks to be sent before anything begins.
  • The second one, which is probably closely related: the scheduler should be able to handle several million tasks at a time.
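Another mitigation that comes up in the linked threads is to shrink the task count itself by having each task run a whole batch of command lines instead of one. A hedged sketch (the `run_batch` helper and the batch size of 1,000 are my own assumptions, not an existing API):

```python
from subprocess import Popen, PIPE

def run_batch(cmds):
    """Run a list of command lines sequentially inside one task,
    returning (exit_code, stdout, stderr) for each command."""
    results = []
    for cmd in cmds:
        process = Popen(cmd.split(' '), stdout=PIPE, stderr=PIPE)
        output, err = process.communicate()
        results.append((process.returncode, output, err))
    return results

# batches = [cmd_lines[i:i + 1000] for i in range(0, len(cmd_lines), 1000)]
# futures = client.map(run_batch, batches)  # ~3,000 tasks instead of ~3 million
```

This keeps the scheduler happy, but again at the cost of extra plumbing, and it loses per-command scheduling granularity, which is why I'd still prefer the scheduler to cope with the full task count.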
