Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bound the wait for cancelled tasks to complete #82906

Open
DaveCTurner opened this issue Jan 21, 2022 · 3 comments
Open

Bound the wait for cancelled tasks to complete #82906

DaveCTurner opened this issue Jan 21, 2022 · 3 comments
Labels
:Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. >enhancement Team:Distributed Meta label for distributed team

Comments

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Jan 21, 2022

Cancelling a task on a remote node is an active process: we send a cancellation request to the remote node and continue to wait for it to indicate the completion of the task. Typically the task will complete with a TaskCancelledException but it might fail in a different way, or even succeed, if the cancellation loses the race to completion. If the remote node is unable to respond for some reason then today we wait indefinitely, and this means we cannot free the resources held by the listener. If the remote node remains unresponsive for long enough then the build-up of listeners on other nodes can cause a cascading failure.

In contrast, if we specify a timeout on the transport request that triggers the remote task then we complete the listener eagerly at the timeout, although the task is still running remotely. Indeed we don't even attempt to cancel the remote task in this case (#66992) so it just keeps on running.

I believe we should not wait indefinitely for a cancelled task to complete and should instead unilaterally complete the waiting listener with a TaskCancelledException to protect against an unresponsive remote node. Maybe we should do this straight away, similarly to how we handle a timeout, but we could also allow some time for the cancellation to happen gracefully first.

Relates #82337 which describes a similar problem specifically about stats requests, since these are often the source of actual problems in this area.

@DaveCTurner DaveCTurner added >enhancement :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. team-discuss labels Jan 21, 2022
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Jan 21, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Copy link
Contributor Author

We (the @elastic/es-distributed team) discussed this today and agreed to proceed, at least with a short timeout. However this would change the behaviour of cancellations submitted via the tasks API, because such cancellations would no longer report failures to cancel remote tasks. I propose making this change only for cancellations triggered by a client which closes its HTTP connection before it received a response. WDYT?

@dnhatn
Copy link
Member

dnhatn commented Mar 18, 2022

I propose making this change only for cancellations triggered by a client which closes its HTTP connection before it received a response. WDYT?

+1. I think this is the right choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. >enhancement Team:Distributed Meta label for distributed team
Projects
None yet
Development

No branches or pull requests

3 participants