More robust timeout for repo analysis #101184
Replaces the transport-level timeout with an overall timeout on the whole repository analysis task to ensure that all child tasks terminate promptly. Relates elastic#66992 Closes elastic#101182
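To illustrate the idea of one overall timeout that also stops the child work, here is a minimal sketch using only the JDK. It is not the Elasticsearch implementation: `OverallTimeoutSketch` and `runWithOverallTimeout` are hypothetical names, and `CompletableFuture` stands in for the task and listener machinery used in the real change.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not code from this PR.
final class OverallTimeoutSketch {

    /**
     * Returns a future that completes when every child completes, or fails
     * with a TimeoutException once the overall deadline passes. On timeout,
     * every child is cancelled so that none keeps running in the background.
     */
    static CompletableFuture<Void> runWithOverallTimeout(
        List<CompletableFuture<?>> children,
        long timeout,
        TimeUnit unit,
        ScheduledExecutorService scheduler
    ) {
        CompletableFuture<Void> overall = CompletableFuture.allOf(children.toArray(new CompletableFuture<?>[0]));

        // A single timer covers the whole operation instead of one timeout per request.
        ScheduledFuture<?> timer = scheduler.schedule(() -> {
            // completeExceptionally returns true only for the first completion,
            // so children are cancelled only if the timeout actually won the race.
            if (overall.completeExceptionally(new TimeoutException("overall timeout elapsed"))) {
                children.forEach(child -> child.cancel(true));
            }
        }, timeout, unit);

        // Once the overall future is done (either way) the timer is no longer needed.
        overall.whenComplete((result, error) -> timer.cancel(false));
        return overall;
    }
}
```

The point of the sketch is that the deadline is attached to the parent operation, and hitting it actively cancels the children rather than only failing the caller.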
Hi @DaveCTurner, I've created a changelog YAML for you.
Pinging @elastic/es-distributed (Team:Distributed)
LGTM
A few non-blocking comments mostly for explaining my thoughts when reading the code.
@@ -430,7 +434,7 @@ private boolean isRunning() {
             return false;
         }

-        if (timeoutTimeMillis < currentTimeMillisSupplier.getAsLong()) {
+        if (cancellationListener.isDone()) {
Not for this PR, but a more specific `SubscribableListener#isTimeout` method could help make the logic here more explicit.
A fair point, we can just fail the task on a timeout and make this all a bunch clearer. See 504ebb1.
this.listener = listener;

this.cancellationListener = new SubscribableListener<>();
this.listener = ActionListener.runBefore(listener, () -> cancellationListener.onResponse(null));
I am trying to think through the different scenarios in which `cancellationListener` can already be completed before we invoke `onResponse(null)` here. I think they are all OK. I am writing them down to be explicit, and maybe you can double-check them as well.
- The request fails due to timeout, i.e. `cancellationListener` has already timed out and this runs right before we call `listener.onFailure`. This is fine because `SubscribableListener` accepts only the first completion and silently ignores all future results.
- We are about to call `listener.onResponse` for success while `cancellationListener` times out concurrently. The timeout will set the failure and try to cancel the tasks. This is fine because we don't check the failure object anymore, and cancelling completed or non-existent tasks seems to be a no-op.
- The timeout can fire even after the `listener` is completed. This is fine since the timeout will come after `cancellationListener.onResponse(null)`, and completing a `SubscribableListener` more than once is ignored.
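The three scenarios above all rely on "first completion wins" semantics. A minimal sketch of that behaviour, assuming nothing about the real `SubscribableListener` internals: `FirstCompletionWins` is a hypothetical stand-in, and its methods return a boolean purely so the outcome of each race is visible.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a listener that accepts only its first completion.
// It is not the real SubscribableListener and makes no claim about its internals.
final class FirstCompletionWins {
    private static final Object SUCCESS = new Object();

    // null until completed; afterwards holds SUCCESS or the failure exception.
    private final AtomicReference<Object> result = new AtomicReference<>();

    /** Completes successfully; returns true only if this call was the first completion. */
    boolean onResponse() {
        return result.compareAndSet(null, SUCCESS);
    }

    /** Completes with a failure; returns true only if this call was the first completion. */
    boolean onFailure(Exception e) {
        return result.compareAndSet(null, e);
    }

    boolean isDone() {
        return result.get() != null;
    }

    boolean isFailed() {
        return result.get() instanceof Exception;
    }
}
```

With this shape, a timeout that arrives after a successful completion (or the reverse) is simply dropped, which is why none of the orderings listed above can overwrite an earlier outcome.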
👍 sounds about right, yes.
public void run() {
    assert queue.isEmpty() : "must only run action once";
    assert failure.get() == null : "must only run action once";

    logger.info("running analysis of repository [{}] using path [{}]", request.getRepositoryName(), blobPath);

    cancellationListener.addTimeout(request.getTimeout(), repository.threadPool(), EsExecutors.DIRECT_EXECUTOR_SERVICE);
If the cluster has many nodes and the repo analysis is configured to have high concurrency, would it be expensive to cancel the tasks on the scheduler thread?
Eh perhaps, but I wouldn't expect it to be a problem because (a) I don't see folks changing the concurrency very much and (b) even at 1000 nodes I don't think it'd be a huge deal, the cancel messages are tiny. We cancel things for other reasons on low-latency threads, e.g. `RestCancellableNodeClient`.
@Override
public void onFailure(Exception e) {
    // trigger another isRunning check which will cancel the task if not already failed or cancelled
    var isNowRunning = isRunning();
Nit: I find it a bit surprising that `isRunning` does a bit more than a usual `isXxx` boolean method. Not suggesting any change for this PR though, since it is existing code and the name is not related to what we are trying to fix here.
💔 Backport failed
You can use sqren/backport to manually backport by running
LGTM 👍