Move Get Snapshots Serialization to Management Pool #83215

original-brownbear · 2022-01-27T15:01:17Z

It's in the title. For large index counts, generating the response gets quite expensive,
both on the transport layer and on the REST layer.
I've seen a couple of clusters slow-log for verbose=false requests or on the REST layer
(it's also easy to reproduce in benchmarks when the repo contains 50k indices).
Better to generate it off the transport threads.

relates #77466


elasticmachine · 2022-01-27T15:01:21Z

Pinging @elastic/es-distributed (Team:Distributed)

elasticsearchmachine · 2022-01-27T15:01:42Z

Hi @original-brownbear, I've created a changelog YAML for you.

original-brownbear · 2022-01-27T15:04:01Z

...n/java/org/elasticsearch/action/admin/cluster/snapshots/get/TransportGetSnapshotsAction.java

@@ -86,7 +86,11 @@ public TransportGetSnapshotsAction(
            GetSnapshotsRequest::new,
            indexNameExpressionResolver,
            GetSnapshotsResponse::new,
-            ThreadPool.Names.SAME
+            ThreadPool.Names.MANAGEMENT // Execute this on the management pool because creating the response can become fairly expensive


I can see this being a little controversial given that we generate the response on the meta pool if verbose=true. But in those cases I'd argue that we will probably be bound by the IO before the compute such that we never go 100% CPU on all pool threads simultaneously.
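As a rough illustration of the idea behind this change (plain `java.util.concurrent`, not Elasticsearch code), the sketch below hands an expensive response-building step to a small dedicated pool so the calling thread returns immediately. The class `ForkingSketch`, the `management-N` thread names, and `buildResponse` are hypothetical stand-ins; in the diff above the only change is the executor name passed to the action (`ThreadPool.Names.SAME` to `ThreadPool.Names.MANAGEMENT`).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ForkingSketch {
    // Hypothetical stand-in for a bounded "management" pool.
    private final ExecutorService management;

    public ForkingSketch(int threads) {
        AtomicInteger counter = new AtomicInteger();
        this.management = Executors.newFixedThreadPool(threads, r -> {
            Thread t = new Thread(r, "management-" + counter.incrementAndGet());
            t.setDaemon(true); // do not keep the JVM alive for this sketch
            return t;
        });
    }

    // Hypothetical expensive step: burns some CPU per index, then reports
    // the name of the thread that did the work.
    private static String buildResponse(int indexCount) {
        long checksum = 0;
        for (int i = 0; i < indexCount; i++) {
            checksum += ("index-" + i).hashCode();
        }
        return Thread.currentThread().getName() + ":" + checksum;
    }

    // The caller (think: a transport thread) only submits the task;
    // the work itself runs on the dedicated pool.
    public CompletableFuture<String> handle(int indexCount) {
        return CompletableFuture.supplyAsync(() -> buildResponse(indexCount), management);
    }

    public void shutdown() {
        management.shutdown();
    }
}
```

Calling `new ForkingSketch(2).handle(50_000)` returns a future right away, and the string it completes with starts with the pool thread's name rather than the caller's.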

DaveCTurner

Seems reasonable with the usual concerns about piling more stuff into the MANAGEMENT threadpool, especially on the master, and losing the natural backpressure mechanism of blocking a transport thread. Still, these requests are cancellable so the backlog should eventually reach equilibrium without breaking anything.

We could of course move the serialization of the verbose=true case over to a MANAGEMENT thread too. I think I'd want there to be a limit on the number of in-flight get-snapshots requests before doing that tho.
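One way such an in-flight limit could look is sketched below. The comment above only asks for a limit and does not say how it would be built, so this is an assumption: the class `InFlightLimiter` and its methods are hypothetical, and it uses a plain `java.util.concurrent.Semaphore` rather than anything from Elasticsearch.

```java
import java.util.concurrent.Semaphore;

public class InFlightLimiter {
    // Each permit represents one request that may be in flight.
    private final Semaphore permits;

    public InFlightLimiter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Try to reserve a slot without blocking the calling thread.
    // Returns false when the limit is already reached.
    public boolean tryStart() {
        return permits.tryAcquire();
    }

    // Must be called exactly once for every successful tryStart(),
    // including when the request fails or is cancelled.
    public void finish() {
        permits.release();
    }

    public int availableSlots() {
        return permits.availablePermits();
    }
}
```

A caller would check `tryStart()` before forking the work and reject (or queue) the request when it returns false. Rejecting instead of blocking is a design choice of this sketch: it restores some backpressure without tying up the submitting thread.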

original-brownbear · 2022-01-27T21:39:32Z

> especially on the master, and losing the natural backpressure mechanism of blocking a transport thread.

Especially on master we don't want that backpressure mechanism IMO :) much worse for its transport threads to slow down.

> We could of course move the serialization of the verbose=true case over to a MANAGEMENT thread too.

Not so important in this case I think. The creation of SnapshotInfo from the repository data is, I think, the slowest part (which isn't happening for ?verbose=true, where we read it straight from the repo). Admittedly I haven't benchmarked it in that much detail yet, but just looking at the code it's pretty obvious that looking up every index from the repo data and TimSorting the index lists for each snapshot is a little heavy once you reach O(10k) snapshots. The writing of the message to the wire on the transport layer is no big deal I think.
And either way, with this change the REST serialization goes to the MANAGEMENT pool in both cases now, that's the painful one.

This one isn't as bad as cluster stats or /_mappings btw. When benchmarking some crazy stuff like 200 snapshots of 50k indices I'm around a 10s block on master for most requests. You'll probably get hit by response size issues before this action gets so slow that requests are piling up right now. This is mainly about removing the warning which has been logged here and there lately and ruling this out as a source of master instability.
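To make the cost described in this comment concrete, here is a minimal sketch of the shape of that work: for each snapshot, resolve every index and sort the resulting list. It is not the Elasticsearch implementation; `SnapshotIndexListSketch`, its method, and the two map parameters are hypothetical, and the real code works on repository data structures rather than plain maps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SnapshotIndexListSketch {
    // snapshotToIndexIds: hypothetical map of snapshot name -> ids of the indices it contains.
    // idToName: hypothetical lookup table from index id to index name.
    public static Map<String, List<String>> sortedIndexNames(
            Map<String, List<Integer>> snapshotToIndexIds,
            Map<Integer, String> idToName) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : snapshotToIndexIds.entrySet()) {
            List<String> names = new ArrayList<>(entry.getValue().size());
            for (Integer id : entry.getValue()) {
                names.add(idToName.get(id)); // one lookup per (snapshot, index) pair
            }
            Collections.sort(names); // one sort per snapshot
            result.put(entry.getKey(), names);
        }
        return result;
    }
}
```

With the benchmark numbers mentioned above, and assuming every one of the 200 snapshots contains all 50k indices, this shape implies 200 × 50,000 = 10,000,000 lookups plus 200 sorts of 50,000 names each for a single request, all on one thread.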

DaveCTurner

LGTM

original-brownbear · 2022-01-31T12:15:17Z

Thanks David!


elasticsearchmachine · 2022-01-31T12:17:15Z

💚 Backport successful

Status	Branch	Result
✅	8.0
✅	7.17


…83324) * Move Get Snapshots Serialization to Management Pool (#83215) It's in the title. For large index counts generating the response, both for transport as well as REST layer gets quite expensive. Better generate it off the transport threads. * fix compilation

Move Get Snapshots Serialization to Management Pool

efef49a


original-brownbear added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v8.1.0 v7.17.1 labels Jan 27, 2022

elasticmachine added the Team:Distributed Meta label for distributed team label Jan 27, 2022

Update docs/changelog/83215.yaml

f3a9a48

original-brownbear commented Jan 27, 2022


original-brownbear requested review from DaveCTurner and henningandersen January 27, 2022 16:00

DaveCTurner reviewed Jan 27, 2022


original-brownbear requested a review from DaveCTurner January 27, 2022 21:39

DaveCTurner approved these changes Jan 31, 2022


original-brownbear added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Jan 31, 2022

original-brownbear merged commit 50fcaac into elastic:master Jan 31, 2022

original-brownbear deleted the move-get-snapshots-to-snapshot-meta branch January 31, 2022 12:15

original-brownbear mentioned this pull request Jan 31, 2022

[8.0] Move Get Snapshots Serialization to Management Pool (#83215) #83323

Merged

original-brownbear mentioned this pull request Jan 31, 2022

[7.17] Move Get Snapshots Serialization to Management Pool (#83215) #83324

Merged

albertzaharovits added v7.17.0 and removed v7.17.1 labels Jan 31, 2022

original-brownbear added v7.17.6 and removed v7.17.0 labels Sep 5, 2022
