
Support task cancellation cross clusters #55779

Closed
wants to merge 15 commits

Conversation

@dnhatn (Member) commented Apr 27, 2020

Today, task cancellation only works within a single cluster: cancelling a cross-cluster search request leaves its sub-tasks on the remote cluster untouched. This change implements task cancellation across multiple clusters.
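For illustration only, here is a minimal, self-contained Java sketch of the idea (not this PR's code; RemoteClusterClient and its method are hypothetical stand-ins for the real transport layer): when a parent task is cancelled on the coordinating cluster, the cancellation also has to be fanned out to every remote cluster that received child requests.

import java.util.Collection;

interface RemoteClusterClient {
    // Hypothetical stand-in for a remote-cluster transport call that bans and
    // cancels all child tasks of the given parent task on that cluster.
    void cancelChildTasks(String parentTaskId);
}

final class CrossClusterCancellation {
    private final Collection<RemoteClusterClient> remoteClusters;

    CrossClusterCancellation(Collection<RemoteClusterClient> remoteClusters) {
        this.remoteClusters = remoteClusters;
    }

    // Cancel the parent task locally first, then propagate the cancellation to
    // the remote clusters so their sub-tasks do not keep running.
    void cancel(String parentTaskId, Runnable cancelLocally) {
        cancelLocally.run();
        for (RemoteClusterClient remote : remoteClusters) {
            remote.cancelChildTasks(parentTaskId);
        }
    }
}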

@dnhatn added the WIP and :Distributed/Task Management (Issues for anything around the Tasks API - both persistent and node level) labels and removed the WIP label on Apr 27, 2020
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Task Management)

@dnhatn marked this pull request as ready for review on April 27, 2020 03:08
@ywelsch requested a review from Tim-Brooks on April 27, 2020 10:20
@ywelsch (Contributor) left a comment

Thanks Nhat, I've left some comments to discuss. Overall looking very good already.

} else {
final ArrayList<DiscoveryNode> subNodes = new ArrayList<>(entry.getValue());
final DiscoveryNode targetNode = subNodes.remove(0);
if (targetNode.getVersion().onOrAfter(Version.V_8_0_0)) {
Contributor

In the case of "proxy mode" connections (see also ProxyConnectionStrategy), targetNode.getVersion() will always return Version.CURRENT.minimumCompatibilityVersion() AFAICS, which means that this request won't be sent to those nodes.

The actual version of the node on the other side is available via channel.getVersion(). On the other hand, channel.getNode() will return the (possibly fake) DiscoveryNode object that was used to create the connection.

/cc: @tbrooks8 This is quite trappy; is there anything we can change in the transport in the short term to avoid this?
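(Illustration only, with assumed shapes rather than the real org.elasticsearch classes: the point is that the version gate should consult the connection/channel, whose version comes from the transport handshake, rather than the DiscoveryNode, which proxy mode may fake.)

// Sketch of the trap discussed above; Connection is a stand-in for
// Transport.Connection, and the version constant is a placeholder.
final class VersionGateSketch {

    interface Connection {
        int handshakeVersion();     // actual version negotiated with the far side
        int nodeDeclaredVersion();  // version carried by the (possibly fake) DiscoveryNode
    }

    static final int V_8_0_0 = 8_000_000; // placeholder version id

    // Trappy: in proxy mode the DiscoveryNode's version is only the minimum
    // compatibility version, so this check skips nodes that actually support
    // the new request.
    static boolean canSendUsingNodeVersion(Connection connection) {
        return connection.nodeDeclaredVersion() >= V_8_0_0;
    }

    // Safer: use the version that the transport handshake actually established.
    static boolean canSendUsingChannelVersion(Connection connection) {
        return connection.handshakeVersion() >= V_8_0_0;
    }
}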

@dnhatn (Member Author)

I've modified the handshake to update the version of the target node with its actual version. I also exposed the proxy node so that we can check its version. I think we are good here.

@ywelsch (Contributor) left a comment

I've left two more comments for discussion

@rjernst added the Team:Distributed (Meta label for distributed team) label on May 4, 2020
@dnhatn (Member Author) commented May 27, 2020

@ywelsch @javanna Thanks for your reviews. I have reworked this PR due to the changes in #56620. Can you please take a look?

@dnhatn requested review from javanna and ywelsch on May 27, 2020 15:29
@ywelsch (Contributor) left a comment

I've left one comment to discuss before I go deeper into this. Also, I'm wondering if there's a way to test the versioned BWC logic (which is quite complex).

}

private static class HeartbeatRequest extends TransportRequest {
final String nodeId;
Contributor

I wonder if the node id is sufficient here, or whether we need to refer to concrete tasks. Assume that a remote cluster is disconnected during a request and then reconnects fairly quickly, with the parent tasks on the source cluster going away (due to failures of the child requests). In that case, we keep the child task alive on the remote cluster (as long as new requests are being processed), even though there's no point in doing so. Another option (instead of sending the parent task ids to renew) might be to use some numbering scheme: periodically incrementing numbers that correspond to named leases representing an abstract set of long-running requests.
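(A rough sketch of the alternative being discussed, using hypothetical types rather than this PR's code: the heartbeat carries the set of parent task ids that are still live, and the remote side reaps child tasks whose parent was not renewed within a timeout.)

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class ChildTaskReaper {
    // parent task id -> last time (millis) the parent was renewed by a heartbeat
    private final Map<String, Long> lastRenewed = new ConcurrentHashMap<>();

    // Called when a heartbeat from the source cluster lists its live parent tasks.
    void onHeartbeat(Set<String> liveParentTaskIds, long nowMillis) {
        for (String parent : liveParentTaskIds) {
            lastRenewed.put(parent, nowMillis);
        }
    }

    // Periodically cancel child tasks whose parent has not been renewed in time.
    void reapExpired(long nowMillis, long timeoutMillis, Consumer<String> cancelChildrenOf) {
        lastRenewed.forEach((parent, renewedAt) -> {
            if (nowMillis - renewedAt > timeoutMillis) {
                cancelChildrenOf.accept(parent);
                lastRenewed.remove(parent);
            }
        });
    }
}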

@dnhatn (Member Author)

Good point. I will rework this.

@dnhatn (Member Author) commented May 29, 2020

I'm wondering if there's a way to test the versioned BWC logic (which is quite complex).

+1. I will work on this infra first in a separate PR.

@dnhatn (Member Author) commented Sep 11, 2020

This PR is quite old. I am closing it and will work on a new one.

@dnhatn closed this on Sep 11, 2020
@dnhatn deleted the cross-cluster-cancellation branch on September 11, 2020 03:42
Labels
:Distributed/Task Management (Issues for anything around the Tasks API - both persistent and node level), >enhancement, Team:Distributed (Meta label for distributed team), v7.10.0, v8.0.0-alpha1

7 participants