Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise informative warning when rescheduling an unknown task #2916

Merged
merged 2 commits into from Aug 3, 2019

Conversation

@jrbourbeau
Copy link
Member

@jrbourbeau jrbourbeau commented Aug 2, 2019

When tasks are quickly asked to be rescheduled, and also quickly canceled, it can lead to the task key being removed from the scheduler, but before the task can be canceled, the task begins running on the worker. Then when Reschedule is raised, the scheduler no longer has a key for the rescheduled task. Here's a (contrived) example of this behavior:

from distributed import Client, Reschedule

if __name__ == "__main__":

    def f(x):
        raise Reschedule

    with Client(processes=False, threads_per_worker=1) as client:
        for i in range(100):
            z = client.submit(f, i)
            z.cancel()

which raises a KeyError when the scheduler attempts to reschedule the (already canceled) task:

distributed.core - ERROR - 'f-eb18a99a0c19477232670692e952f322'
Traceback (most recent call last):
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/core.py", line 416, in handle_comm
    result = await result
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/scheduler.py", line 1489, in add_worker
    await self.handle_worker(comm=comm, worker=address)
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/scheduler.py", line 2404, in handle_worker
    await self.handle_stream(comm=comm, extra={"worker": worker})
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/core.py", line 477, in handle_stream
    handler(**merge(extra, msg))
  File "/Users/jbourbeau/github/Quansight-Labs/distributed/distributed/scheduler.py", line 4366, in reschedule
    ts = self.tasks[key]
KeyError: 'f-eb18a99a0c19477232670692e952f322'

This PR adds an informative warning if attempting to reschedule a key not found on the scheduler and aborts the rescheduling process.

@jrbourbeau
Copy link
Member Author

@jrbourbeau jrbourbeau commented Aug 2, 2019

Hm not sure why there's a seg fault for the Python 3.5 Travis build (https://travis-ci.org/dask/distributed/jobs/567059993#L1586). I wanna say it's unrelated to the changes here

Loading

@jrbourbeau
Copy link
Member Author

@jrbourbeau jrbourbeau commented Aug 3, 2019

Seems the failure was transient (thanks to whoever restarted the build)

Loading

@mrocklin mrocklin merged commit 20ba1a7 into dask:master Aug 3, 2019
2 checks passed
Loading
@mrocklin
Copy link
Member

@mrocklin mrocklin commented Aug 3, 2019

Thanks @jrbourbeau !

Loading

@jrbourbeau jrbourbeau deleted the quick-reschedule branch Aug 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants