Support task cancellation cross clusters #55779
Conversation
Pinging @elastic/es-distributed (:Distributed/Task Management)
...ava/org/elasticsearch/action/admin/cluster/node/tasks/cancel/TransportCancelTasksAction.java
Thanks Nhat, I've left some comments to discuss. Overall looking very good already.
} else {
    final ArrayList<DiscoveryNode> subNodes = new ArrayList<>(entry.getValue());
    final DiscoveryNode targetNode = subNodes.remove(0);
    if (targetNode.getVersion().onOrAfter(Version.V_8_0_0)) {
In case of "proxy mode" connections (see also ProxyConnectionStrategy), targetNode.getVersion() will always return Version.CURRENT.minimumCompatibilityVersion() AFAICS, which means that this request won't be sent to those nodes. The actual version of the node on the other side is available via channel.getVersion(). On the other hand, channel.getNode() will return the (possibly fake) DiscoveryNode object that was used to create the connection.
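The pitfall described above can be sketched in a self-contained way. This is a hypothetical miniature (the `Connection`, `Version`, and `supportsBanParent` names are stand-ins, not the actual Elasticsearch API): for a proxy-mode connection, the `DiscoveryNode` carries only the minimum compatibility version, so a version gate must use the version negotiated during the transport handshake instead.

```java
// Hypothetical sketch, not the real Elasticsearch API: why a version gate must
// use the channel's handshake version rather than the DiscoveryNode version,
// which is a placeholder for proxy-mode connections.
final class VersionGateSketch {
    // Stand-in for org.elasticsearch.Version.
    record Version(int id) {
        boolean onOrAfter(Version other) { return id >= other.id; }
    }

    static final Version V_8_0_0 = new Version(8_00_00_99);
    static final Version MIN_COMPAT = new Version(7_17_00_99); // placeholder value

    // Stand-in for a transport connection.
    interface Connection {
        Version nodeVersion();    // from DiscoveryNode: may be MIN_COMPAT in proxy mode
        Version channelVersion(); // negotiated during handshake: the real remote version
    }

    // Gate on the handshake version, which is accurate even for proxy connections.
    static boolean supportsBanParent(Connection conn) {
        return conn.channelVersion().onOrAfter(V_8_0_0);
    }

    public static void main(String[] args) {
        Connection proxy = new Connection() {
            public Version nodeVersion() { return MIN_COMPAT; } // fake node version
            public Version channelVersion() { return V_8_0_0; } // actual remote version
        };
        // Gating on nodeVersion() would wrongly skip this node:
        System.out.println("node gate:    " + proxy.nodeVersion().onOrAfter(V_8_0_0)); // false
        System.out.println("channel gate: " + supportsBanParent(proxy));               // true
    }
}
```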
/cc: @tbrooks8 This is quite trappy, anything we can change in the transport in the short term to avoid this?
I've modified the handshaking to update the version of the target node with its actual version. I also exposed the proxy node so that we can check the version of the proxy node. I think we are good here.
...ava/org/elasticsearch/action/admin/cluster/node/tasks/cancel/TransportCancelTasksAction.java
I've left two more comments for discussion
...ava/org/elasticsearch/action/admin/cluster/node/tasks/cancel/TransportCancelTasksAction.java
server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java
server/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java
I've left one comment to discuss before I go deeper into this. Also, I'm wondering if there's a way to test the versioned BWC logic (which is quite complex).
...alClusterTest/java/org/elasticsearch/action/admin/cluster/node/tasks/CancellableTasksIT.java
}

private static class HeartbeatRequest extends TransportRequest {
    final String nodeId;
I wonder if a node id is sufficient here, or whether we need to mention concrete tasks. Assume that a remote cluster is disconnected during a request and then reconnects fairly quickly, with the parent tasks on the source cluster going away (due to failure of the child requests). In that case we keep the child task alive on the remote cluster (as long as new requests are being processed), even though there's no point in doing so. Another option here (instead of sending parent task ids to renew) might be to use some numbering scheme: periodically incrementing numbers which correspond to named leases that represent an abstract set of long-running requests.
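The "named lease" idea above can be sketched as follows. This is a hypothetical illustration (the `LeaseSketch` class and its method names are invented for this comment, not code from the PR): the source cluster periodically sends an incrementing sequence number per lease, and the remote cluster treats a lease whose number stops advancing as dead, making its child tasks cancellable.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the lease-numbering scheme discussed in the review,
// not code from this PR.
final class LeaseSketch {
    // lease name -> last sequence number observed from the source cluster
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    // Called when a heartbeat arrives; returns true if the lease advanced.
    boolean renew(String lease, long seqNo) {
        Long prev = lastSeen.put(lease, seqNo);
        return prev == null || seqNo > prev;
    }

    // Called periodically on the remote cluster: a lease whose number has not
    // reached `expiredBelow` is considered dead and its child tasks cancellable.
    boolean isExpired(String lease, long expiredBelow) {
        Long seen = lastSeen.get(lease);
        return seen == null || seen < expiredBelow;
    }
}
```

Compared with renewing individual parent task ids, one lease number covers an arbitrary set of long-running requests, so the heartbeat payload stays constant-size.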
Good point. I will rework this.
+1. I will work on this infra first in a separate PR.
This PR is quite old. I am closing it and will work on a new one.
Today, task cancellation works only within a single cluster: canceling a cross-cluster search request leaves the sub-tasks on the remote clusters untouched. This change implements task cancellation across multiple clusters.
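At a high level, cross-cluster cancellation fans the cancellation of a parent task out to every cluster that holds child tasks. The following is a minimal sketch under assumed names (`ChildLocation`, `RemoteClusterClient`, and `sendBan` are invented for illustration, not the PR's actual API); it only shows the de-duplicated fan-out, not retries or the version gating discussed above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of fanning a parent-task cancellation out to the
// clusters holding its child tasks; names are invented for illustration.
final class CancelFanOutSketch {
    record ChildLocation(String clusterAlias, long taskId) {}

    interface RemoteClusterClient {
        void sendBan(String clusterAlias, long parentTaskId); // hypothetical RPC
    }

    // Ban the parent task once per distinct cluster that holds children.
    static Set<String> fanOutCancellation(long parentTaskId,
                                          List<ChildLocation> children,
                                          RemoteClusterClient client) {
        Set<String> clusters = new HashSet<>();
        for (ChildLocation child : children) {
            if (clusters.add(child.clusterAlias())) {
                client.sendBan(child.clusterAlias(), parentTaskId);
            }
        }
        return clusters;
    }
}
```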