Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise informative warning when rescheduling an unknown task #2916

Merged
merged 2 commits into from Aug 3, 2019

Conversation

@jrbourbeau
Copy link
Member

commented Aug 2, 2019

When tasks are quickly asked to be rescheduled, and also quickly canceled, it can lead to the task key being removed from the scheduler, but before the task can be canceled, the task begins running on the worker. Then when Reschedule is raised, the scheduler no longer has a key for the rescheduled task. Here's a (contrived) example of this behavior:

from distributed import Client, Reschedule

if __name__ == "__main__":

    def f(x):
        raise Reschedule

    with Client(processes=False, threads_per_worker=1) as client:
        for i in range(100):
            z = client.submit(f, i)
            z.cancel()

which raises a KeyError when the scheduler attempts to reschedule the (already canceled) task:

distributed.core - ERROR - 'f-eb18a99a0c19477232670692e952f322'
Traceback (most recent call last):
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/core.py", line 416, in handle_comm
    result = await result
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/scheduler.py", line 1489, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/scheduler.py", line 2404, in handle_worker
    await self.handle_stream(comm=comm, extra={"worker": worker})
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/core.py", line 477, in handle_stream
    handler(**merge(extra, msg))
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/scheduler.py", line 4366, in reschedule
    ts = self.tasks[key]
KeyError: 'f-eb18a99a0c19477232670692e952f322'

This PR adds an informative warning if attempting to reschedule a key not found on the scheduler and aborts the rescheduling process.

@jrbourbeau

This comment has been minimized.

Copy link
Member Author

commented Aug 2, 2019

Hm not sure why there's a seg fault for the Python 3.5 Travis build (https://travis-ci.org/dask/distributed/jobs/567059993#L1586). I wanna say it's unrelated to the changes here

@jrbourbeau

This comment has been minimized.

Copy link
Member Author

commented Aug 3, 2019

Seems the failure was transient (thanks to whoever restarted the build)

@mrocklin mrocklin merged commit 20ba1a7 into dask:master Aug 3, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@mrocklin

This comment has been minimized.

Copy link
Member

commented Aug 3, 2019

Thanks @jrbourbeau !

@jrbourbeau jrbourbeau deleted the Quansight-Labs:quick-reschedule branch Aug 3, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.