Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fork CCS remote-cluster responses #98124

Merged

Conversation

DaveCTurner
Copy link
Contributor

Today during CCS we process responses from remote clusters on the
transport worker thread, but that's not the right place for this work.
This commit moves it to the SEARCH_COORDINATION pool.

Closes #97997

Today during CCS we process responses from remote clusters on the
transport worker thread, but that's not the right place for this work.
This commit moves it to the `SEARCH_COORDINATION` pool.

Closes elastic#97997
@DaveCTurner DaveCTurner added >bug :Search/Search Search-related issues that do not fall into other categories v8.10.0 labels Aug 2, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Aug 2, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@quux00
Copy link
Contributor

quux00 commented Aug 4, 2023

I did a little ad-hoc testing of this change one of my CCS branches in progress and it seemed the same as before based on my limited testing of it (e.g., I'm currently dealing with a strange async-search CCS race condition and it happens on both SEARCH_COORDINATION threadpool and the DIRECT/transport pool).

I don't know a lot about ES threadpool details. Are there any limitations or considerations I should be aware of that this SEARCH_COORDINATION threadpool has compared to the "DIRECT" (transport) thread pool?

@DaveCTurner
Copy link
Contributor Author

Are there any limitations or considerations I should be aware of

We have to be very careful about what we run on the transport pool. We still use it far too much, with cluster-destroying consequences sometimes, although these days it's not as bad as it used to be. A lot of the places we use it are because historically this has been the default choice, but it's a really bad default choice: we really should default to another pool unless there's very good reasons to keep work on the transport pool. CCS responses in particular can be too large to reasonably process on the transport worker, and CCS has enough latency that we can afford a few µs of forking overhead, and I believe if DIRECT hadn't been the default up until #98131 then we would have chosen a different pool years ago.

In contrast the SEARCH_COORDINATION pool intended for the bits of searches that are kinda lightweight (IO-free) but not light enough to run directly on the transport pool. Today that mostly means the can-match phase, but processing CCS responses also fits into this category IMO. It has a bounded queue with default length of 1000; the bound does not apply to responses, but if the queue was full of response handling then we would start rejecting requests. One slight concern is that it defaults to having ½#CPUs threads, with a max of 5, whereas there are #CPUs transport threads. But then again there's no work-sharing between transport threads so you can hit bottlenecks at any scale there. I can see a few more things that I think make sense to move to SEARCH_COORDINATION, and with enough things happening here I could see an argument for expanding it to having #CPUs threads.

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Aug 4, 2023
Similar to elastic#98124, this action involves doing potentially O(#shards)
work on both sender and receiver so it'd be best to avoid the transport
worker.
Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@DaveCTurner DaveCTurner merged commit 837dcc2 into elastic:main Aug 8, 2023
12 checks passed
DaveCTurner added a commit that referenced this pull request Aug 8, 2023
Similar to #98124, this action involves doing potentially O(#shards)
work on both sender and receiver so it'd be best to avoid the transport
worker.
@DaveCTurner DaveCTurner deleted the 2023/08/02/ccs-response-executor branch August 8, 2023 06:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.10.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop handling CCS-related responses from remote cluster on transport thread
4 participants