Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust timeout for repo analysis #101184

Merged

Conversation

DaveCTurner
Copy link
Contributor

Replaces the transport-level timeout with an overall timeout on the
whole repository analysis task to ensure that all child tasks terminate
promptly.

Relates #66992
Closes #101182

Replaces the transport-level timeout with an overall timeout on the
whole repository analysis task to ensure that all child tasks terminate
promptly.

Relates elastic#66992
Closes elastic#101182
@DaveCTurner DaveCTurner added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.11.1 v8.12.0 labels Oct 21, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Oct 21, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Oct 21, 2023
Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

A few non-blocking comments mostly for explaining my thoughts when reading the code.

@@ -430,7 +434,7 @@ private boolean isRunning() {
return false;
}

if (timeoutTimeMillis < currentTimeMillisSupplier.getAsLong()) {
if (cancellationListener.isDone()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but a more specific SubscribableListener#isTimeout method could help make the logic here more explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fair point, we can just fail the task on a timeout and make this all a bunch clearer. See 504ebb1.

this.listener = listener;

this.cancellationListener = new SubscribableListener<>();
this.listener = ActionListener.runBefore(listener, () -> cancellationListener.onResponse(null));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to think through the difference scenarios how cancellationListener can already be completed before we invoke onResponse(null) here. I think they are all OK. I am writing it down to be explicit and maybe you can double check it as well.

  1. The request fails due to timeout, i.e. cancellationListener is already timed out and this runs right before we are going to call listener.onFailure. This is fine because SubscribableListener accepts only the first completion and silently ignores all future results.
  2. We are about to call listener.onResponse for success while cancellationListener times out concurrently. The timeout will set the failure and try to cancel the tasks. This is fine because we don't check the failure object anymore and cancelling completed or non-existing tasks seem to be a noop.
  3. Timeout can be called even after the listener is completed. This is fine since the timeout will be after cancellationListener.onResponse(null) and completing a SubscribableListener more than once is ignored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 sounds about right, yes.

public void run() {
assert queue.isEmpty() : "must only run action once";
assert failure.get() == null : "must only run action once";

logger.info("running analysis of repository [{}] using path [{}]", request.getRepositoryName(), blobPath);

cancellationListener.addTimeout(request.getTimeout(), repository.threadPool(), EsExecutors.DIRECT_EXECUTOR_SERVICE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cluster has many nodes and the repo analysis is configured to have high concurrency, would it be expensive to cancel the tasks on the scheduler thread?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh perhaps, but I wouldn't expect it to be a problem because (a) I don't see folks changing the concurrency very much and (b) even at 1000 nodes I don't think it'd be a huge deal, the cancel messages are tiny. We cancel things for other reasons on low-latency threads, e.g. RestCancellableNodeClient.

@Override
public void onFailure(Exception e) {
// trigger another isRunning check which will cancel the task if not already failed or cancelled
var isNowRunning = isRunning();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I find it a bit surprising that isRunning does a bit more than an usualy isXxx boolean method. Not suggesting any change for this PR though since it is existing code and the name is not related to what we are trying to fix here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair too. This was done before #82685 introduced org.elasticsearch.tasks.CancellableTask#addListener, but if we migrated to that we could drop the task.isCancelled() check here. I opened #101197.

@DaveCTurner DaveCTurner added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 23, 2023
@elasticsearchmachine elasticsearchmachine merged commit cfb0780 into elastic:main Oct 23, 2023
13 checks passed
@DaveCTurner DaveCTurner deleted the 2023/10/21/repo-analysis-timeout branch October 23, 2023 07:18
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.11 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 101184

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 23, 2023
Replaces the transport-level timeout with an overall timeout on the
whole repository analysis task to ensure that all child tasks terminate
promptly.

Relates elastic#66992
Closes elastic#101182
Copy link
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

elasticsearchmachine pushed a commit that referenced this pull request Oct 23, 2023
Replaces the transport-level timeout with an overall timeout on the
whole repository analysis task to ensure that all child tasks terminate
promptly.

Relates #66992
Closes #101182
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team v8.11.1 v8.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Repository analysis timeout should apply to register operations
4 participants