Different behaviour in distributed vs single-machine scheduler: missing task #4750

@maurosilber

Description

I'm adding extra tasks in the optimization step to save some task results to disk. The extra tasks run under dask's single-machine scheduler, but they are not run by the distributed scheduler. This might not be a bug but the intended behavior; in that case, how should I have done it?

Thanks!

Minimal Complete Verifiable Example:

Example 1:

import dask

def optimize(dsk, keys):
    print("Optimizing graph")
    dsk = dask.utils.ensure_dict(dsk)
    # Adding an extra task to the graph
    dsk["extra_task"] = (print, "Running extra task")
    return dsk

dask.config.set(delayed_optimize=optimize)

dask.delayed(print)("Running task").compute()

Running example 1 prints:

Optimizing graph
Running extra task
Running task

Example 2:

Start dask-scheduler and a dask-worker:

dask-scheduler & dask-worker localhost:8786 &

Then add the following to example 1:

import dask
import distributed

client = distributed.Client("localhost:8786")

def optimize(dsk, keys):
    ...

Example 2 prints "Optimizing graph" in the Python process running the script and "Running task" in the dask-worker process, but it never prints "Running extra task".
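A plausible cause (my guess from the observed behaviour, not confirmed from the scheduler source): the graph is culled down to the tasks reachable from the requested keys before scheduling, and "extra_task" has no dependents, so the distributed scheduler drops it. The same effect can be reproduced locally with `dask.optimization.cull`:

```python
from dask.optimization import cull

# Toy graph: one requested task plus an unreachable side-effect task
dsk = {
    "task": (str, 1),
    "extra_task": (str, 2),
}

# cull() keeps only the tasks reachable from the requested keys
culled, dependencies = cull(dsk, ["task"])
print(sorted(culled))  # "extra_task" is dropped
```

Here `culled` contains only `"task"`, because nothing in the requested keys depends on `"extra_task"`.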

Environment:

  • Dask and distributed versions: 2021.04.0
  • Python version: 3.8.8
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): conda

New conda environment from environment-3.8.yaml. Cloned dask and distributed at tag 2021.04.0 and installed both with `pip install -e`.
