Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Get Snapshots Serialization to Management Pool #83215

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented Jan 27, 2022

It's in the title. For large index counts generating the response,
both for transport as well as REST layer gets quite expensive and
I've seen a couple of clusters slow-log for verbose=false requests or on the REST layer.
(also it's easy to reproduce in benchmarks when the repo contains 50k indices)
Better generate it off the transport threads.

relates #77466

It's in the title. For large index counts generating the response,
both for transport as well as REST layer gets quite expensive.
Better generate it off the transport threads.
@original-brownbear original-brownbear added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v8.1.0 v7.17.1 labels Jan 27, 2022
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Jan 27, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Copy link
Collaborator

Hi @original-brownbear, I've created a changelog YAML for you.

@@ -86,7 +86,11 @@ public TransportGetSnapshotsAction(
GetSnapshotsRequest::new,
indexNameExpressionResolver,
GetSnapshotsResponse::new,
ThreadPool.Names.SAME
ThreadPool.Names.MANAGEMENT // Execute this on the management pool because creating the response can become fairly expensive
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see this being a little controversial given that we generate the response on the meta pool if verbose=true. But in those cases I'd argue that we will probably be bound by the IO before the compute such that we never go 100% CPU on all pool threads simultaneously.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable with the usual concerns about piling more stuff into the MANAGEMENT threadpool, especially on the master, and losing the natural backpressure mechanism of blocking a transport thread. Still, these requests are cancellable so the backlog should eventually reach equilibrium without breaking anything.

We could of course move the serialization of the verbose=true case over to a MANAGEMENT thread too. I think I'd want there to be a limit on the number of in-flight get-snapshots requests before doing that tho.

@original-brownbear
Copy link
Member Author

original-brownbear commented Jan 27, 2022

especially on the master, and losing the natural backpressure mechanism of blocking a transport thread.

Especially on master we don't want that backpressure mechanism IMO :) much worse for its transport threads to slow down.

We could of course move the serialization of the verbose=true case over to a MANAGEMENT thread too.

Not so important in this case I think. The creation of SnapshotInfo from the repository data I think is the slowest part (which isn't happening for ?verbose=true where we read it straight from the repo). Admittedly I haven't benchmarked it in that much detail yet but just looking at the code it's pretty obvious that looking up every index from the repo-data and tim-sorting sorting the indices lists for each snapshot is a little heavy once you reach O(10k) snapshots). The writing of the message to the wire on the transport layer is no big deal I think.
And either way, with this change the REST serialization goes to the MANAGEMENT pool in both cases now, that's the painful one.

This one isn't as bad as cluster stats or /_mappings btw. When benchmarking some crazy stuff like 200 snapshots of 50k indices I'm around a 10s block on master for most requests. You'll probably get hit by response size issues before this action gets so slow that requests are piling up right now. This is mainly about removing the warning which has been logged here and there lately and ruling this out as a source of master instability.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Member Author

Thanks David!

@original-brownbear original-brownbear added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Jan 31, 2022
@original-brownbear original-brownbear merged commit 50fcaac into elastic:master Jan 31, 2022
@original-brownbear original-brownbear deleted the move-get-snapshots-to-snapshot-meta branch January 31, 2022 12:15
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 31, 2022
It's in the title. For large index counts generating the response,
both for transport as well as REST layer gets quite expensive.
Better generate it off the transport threads.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 31, 2022
It's in the title. For large index counts generating the response,
both for transport as well as REST layer gets quite expensive.
Better generate it off the transport threads.
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.0
7.17

original-brownbear added a commit that referenced this pull request Feb 3, 2022
It's in the title. For large index counts generating the response,
both for transport as well as REST layer gets quite expensive.
Better generate it off the transport threads.
elasticsearchmachine pushed a commit that referenced this pull request Jul 7, 2022
…83324)

* Move Get Snapshots Serialization to Management Pool (#83215)

It's in the title. For large index counts generating the response,
both for transport as well as REST layer gets quite expensive.
Better generate it off the transport threads.

* fix compilation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport-and-merge Automatically create backport pull requests and merge when ready >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team v7.17.6 v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants