More robust timeout for repo analysis #101184

DaveCTurner · 2023-10-21T05:35:57Z

Replaces the transport-level timeout with an overall timeout on the
whole repository analysis task to ensure that all child tasks terminate
promptly.

Relates #66992
Closes #101182

elasticsearchmachine · 2023-10-21T05:36:21Z

Hi @DaveCTurner, I've created a changelog YAML for you.

elasticsearchmachine · 2023-10-21T05:36:21Z

Pinging @elastic/es-distributed (Team:Distributed)

ywangd

LGTM

A few non-blocking comments mostly for explaining my thoughts when reading the code.

ywangd · 2023-10-23T00:31:03Z

.../src/main/java/org/elasticsearch/repositories/blobstore/testkit/RepositoryAnalyzeAction.java

@@ -430,7 +434,7 @@ private boolean isRunning() {
                return false;
            }

-            if (timeoutTimeMillis < currentTimeMillisSupplier.getAsLong()) {
+            if (cancellationListener.isDone()) {


Not for this PR, but a more specific SubscribableListener#isTimeout method could help make the logic here more explicit.

A fair point, we can just fail the task on a timeout and make this all a bunch clearer. See 504ebb1.

ywangd · 2023-10-23T01:09:17Z

.../src/main/java/org/elasticsearch/repositories/blobstore/testkit/RepositoryAnalyzeAction.java

-            this.listener = listener;
+
+            this.cancellationListener = new SubscribableListener<>();
+            this.listener = ActionListener.runBefore(listener, () -> cancellationListener.onResponse(null));


I am trying to think through the different scenarios in which cancellationListener can already be completed before we invoke onResponse(null) here. I think they are all OK. I am writing them down to be explicit, and maybe you can double-check them as well.

The request fails due to timeout, i.e. cancellationListener is already timed out and this runs right before we are going to call listener.onFailure. This is fine because SubscribableListener accepts only the first completion and silently ignores all future results.

We are about to call listener.onResponse for success while cancellationListener times out concurrently. The timeout will set the failure and try to cancel the tasks. This is fine because we don't check the failure object anymore, and cancelling completed or non-existent tasks seems to be a no-op.

Timeout can be called even after the listener is completed. This is fine since the timeout will be after cancellationListener.onResponse(null) and completing a SubscribableListener more than once is ignored.

👍 sounds about right, yes.
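
For readers following the scenarios above, here is a minimal sketch of the "first completion wins" behaviour they rely on. `OnceListener` is a hypothetical stand-in written only for illustration; it is not Elasticsearch's `SubscribableListener`, and its boolean return values are an assumption added so the outcome of each call is visible.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a listener that accepts only its first completion.
// This is NOT SubscribableListener; it only illustrates the property discussed above.
final class OnceListener<T> {
    // Immutable record of how the listener was completed.
    private static final class Outcome<T> {
        final T value;
        final Exception failure;

        Outcome(T value, Exception failure) {
            this.value = value;
            this.failure = failure;
        }
    }

    // null until the first completion arrives
    private final AtomicReference<Outcome<T>> outcome = new AtomicReference<>();

    /** Returns true if this call completed the listener, false if it was ignored. */
    boolean onResponse(T value) {
        return outcome.compareAndSet(null, new Outcome<>(value, null));
    }

    /** Returns true if this call completed the listener, false if it was ignored. */
    boolean onFailure(Exception e) {
        return outcome.compareAndSet(null, new Outcome<>(null, e));
    }

    boolean isDone() {
        return outcome.get() != null;
    }

    boolean isFailed() {
        Outcome<T> o = outcome.get();
        return o != null && o.failure != null;
    }

    public static void main(String[] args) {
        // Scenario 1: the timeout fires first, then the wrapped listener calls onResponse(null).
        OnceListener<Void> timedOutFirst = new OnceListener<>();
        boolean timeoutWon = timedOutFirst.onFailure(new TimeoutException("timed out"));
        boolean lateResponseAccepted = timedOutFirst.onResponse(null);
        System.out.println(timeoutWon + " " + lateResponseAccepted + " " + timedOutFirst.isFailed());

        // Scenario 3: the listener completes first, then the timeout fires and is ignored.
        OnceListener<Void> completedFirst = new OnceListener<>();
        boolean responseWon = completedFirst.onResponse(null);
        boolean lateTimeoutAccepted = completedFirst.onFailure(new TimeoutException("timed out"));
        System.out.println(responseWon + " " + lateTimeoutAccepted + " " + completedFirst.isFailed());
    }
}
```

Whichever of the two completions arrives second is dropped by the compare-and-set, which is why the ordering between the timeout and `onResponse(null)` does not matter in any of the scenarios.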

ywangd · 2023-10-23T01:10:45Z

.../src/main/java/org/elasticsearch/repositories/blobstore/testkit/RepositoryAnalyzeAction.java

        public void run() {
            assert queue.isEmpty() : "must only run action once";
            assert failure.get() == null : "must only run action once";

            logger.info("running analysis of repository [{}] using path [{}]", request.getRepositoryName(), blobPath);

+            cancellationListener.addTimeout(request.getTimeout(), repository.threadPool(), EsExecutors.DIRECT_EXECUTOR_SERVICE);


If the cluster has many nodes and the repo analysis is configured to have high concurrency, would it be expensive to cancel the tasks on the scheduler thread?

Eh perhaps, but I wouldn't expect it to be a problem because (a) I don't see folks changing the concurrency very much and (b) even at 1000 nodes I don't think it'd be a huge deal, the cancel messages are tiny. We cancel things for other reasons on low-latency threads, e.g. RestCancellableNodeClient.
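
To make the trade-off in this thread concrete, here is a rough sketch of an overall timeout that cancels child work directly on the scheduler thread. It uses only `java.util.concurrent` and is not the code in this PR: `OverallTimeoutSketch`, `armTimeout`, and the use of `CompletableFuture` for the parent and child tasks are all assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: a single timeout for the whole operation, with the
// cancellation loop running inline on the scheduler thread.
final class OverallTimeoutSketch {

    /**
     * Arms one timeout for the whole operation. If it fires before {@code overall}
     * completes, it fails {@code overall} and cancels every child on the scheduler thread.
     */
    static ScheduledFuture<?> armTimeout(
        CompletableFuture<Void> overall,
        List<CompletableFuture<?>> children,
        ScheduledExecutorService scheduler,
        long timeout,
        TimeUnit unit
    ) {
        ScheduledFuture<?> timer = scheduler.schedule(() -> {
            // completeExceptionally returns false if the operation already finished,
            // in which case there is nothing left to cancel.
            if (overall.completeExceptionally(new TimeoutException("overall timeout elapsed"))) {
                for (CompletableFuture<?> child : children) {
                    child.cancel(true); // a no-op for children that already completed
                }
            }
        }, timeout, unit);
        // Once the operation completes for any reason the timer is no longer needed.
        overall.whenComplete((result, error) -> timer.cancel(false));
        return timer;
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<Void> overall = new CompletableFuture<>();
            CompletableFuture<String> slowChild = new CompletableFuture<>(); // never completes on its own
            armTimeout(overall, List.of(slowChild), scheduler, 50, TimeUnit.MILLISECONDS);

            // Wait for the timeout to fail the parent and cancel the child.
            Throwable failure = overall.handle((result, error) -> error).join();
            Throwable childFailure = slowChild.handle((result, error) -> error).join();
            System.out.println("parent failed with: " + failure.getClass().getSimpleName());
            System.out.println("child failed with: " + childFailure.getClass().getSimpleName());
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

The cancellation loop here is linear in the number of children and does no blocking work, which matches the argument above that running it inline is cheap; if each cancellation were expensive, handing the loop to a separate executor would be the obvious alternative.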

ywangd · 2023-10-23T01:14:27Z

.../src/main/java/org/elasticsearch/repositories/blobstore/testkit/RepositoryAnalyzeAction.java

+            @Override
+            public void onFailure(Exception e) {
+                // trigger another isRunning check which will cancel the task if not already failed or cancelled
+                var isNowRunning = isRunning();


Nit: I find it a bit surprising that isRunning does more than a usual isXxx boolean method. I'm not suggesting any change for this PR, though, since it is existing code and the name is unrelated to what we are trying to fix here.

That's fair too. This was done before #82685 introduced org.elasticsearch.tasks.CancellableTask#addListener, but if we migrated to that we could drop the task.isCancelled() check here. I opened #101197.

elasticsearchmachine · 2023-10-23T07:19:01Z

💔 Backport failed

Status	Branch	Result
❌	8.11	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 101184

fcofdez

LGTM 👍

More robust timeout for repo analysis

c600d7a

DaveCTurner added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.11.1 v8.12.0 labels Oct 21, 2023

DaveCTurner requested a review from ywangd October 21, 2023 05:35

elasticsearchmachine added the Team:Distributed Meta label for distributed team label Oct 21, 2023

Update docs/changelog/101184.yaml

177ee41

DaveCTurner added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Oct 21, 2023

Fix YAML test

19a62e8

DaveCTurner mentioned this pull request Oct 21, 2023

Repo analysis of uncontended register behaviour #101185

Merged

DaveCTurner requested a review from fcofdez October 21, 2023 08:19

disruption.compareAndExchangeReturnsWitness

d8e0d02

DaveCTurner mentioned this pull request Oct 22, 2023

Rename RegisterAnalyzeAction to ContendedR... #101192

Merged

ywangd approved these changes Oct 23, 2023

DaveCTurner added 2 commits October 23, 2023 07:21

Merge branch 'main' into 2023/10/21/repo-analysis-timeout

407c283

Fail on timeout directly, no need to check isDone

504ebb1

DaveCTurner added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 23, 2023

elasticsearchmachine merged commit cfb0780 into elastic:main Oct 23, 2023
13 checks passed

DaveCTurner deleted the 2023/10/21/repo-analysis-timeout branch October 23, 2023 07:18

elasticsearchmachine added the backport pending label Oct 23, 2023

DaveCTurner removed the backport pending label Oct 23, 2023

fcofdez reviewed Oct 23, 2023
