Speed up reindex and friends #76978
Comments
Pinging @elastic/es-distributed (Team:Distributed)
@nik9000 nice idea... you are a dreamer, but let's make it happen 👍
I looked into a reindex-performance case recently and the bottleneck was absolutely definitely on the indexing side, the searches were trivially fast in comparison, so all the search-side cleverness with slicing etc didn't actually have much impact. I suspect the fact that we're specifying the doc ID on the way through is a big drag on indexing performance in cases where the doc ID doesn't matter to the end user, because at indexing time we need to check whether each externally-specified doc ID has been seen before or not. I guess you can drop the doc ID with a script or something but maybe we should make this easier.
One area for improvement that I can see is the lack of any buffering between the search and index sides. Each job flip-flops between waiting for a batch of docs to arrive from the search and waiting for a batch of docs to be indexed, but we're never doing both. We really should be searching and indexing at the same time, and ideally doing multiple bulks in parallel to make use of multiple indexing threads on the shards. The buffer would also give us a nice way to see which side is the bottleneck.
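For reference, dropping the externally-specified doc ID is possible today with a one-line reindex script, though it's not obvious. A sketch (index names here are made up; clearing _id in the script should make the destination autogenerate IDs, which skips the has-this-ID-been-seen-before check):

```
POST _reindex
{
  "source": { "index": "src-index" },
  "dest":   { "index": "dst-index" },
  "script": { "source": "ctx._id = null", "lang": "painless" }
}
```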
It'd be easy enough for the job to measure the time it spends waiting on search vs indexing and put it in the task. Reindex's task is pretty voluminous already.
Reindex and friends really could use a buffer, or anything that disconnects the incoming from the outgoing. There are some concerns with scroll timeouts there, but it could be made fine.
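A minimal sketch of the buffer idea: a bounded queue between the search side (producer) and several bulk-indexing workers (consumers), so both sides run at once instead of flip-flopping. The fetch/index callables here are toy stand-ins, not real Elasticsearch client calls:

```python
import queue
import threading
import time

def run_pipelined_reindex(fetch_batch, index_batch, buffer_size=4, indexers=2):
    """Overlap search and indexing via a bounded buffer of document batches."""
    buf = queue.Queue(maxsize=buffer_size)
    indexed_counts = []
    lock = threading.Lock()

    def producer():
        while True:
            batch = fetch_batch()
            if batch is None:           # search side exhausted
                break
            buf.put(batch)              # blocks when the indexers fall behind
        for _ in range(indexers):
            buf.put(None)               # one poison pill per indexer

    def consumer():
        while True:
            batch = buf.get()
            if batch is None:
                break
            index_batch(batch)          # a bulk request in real life
            with lock:
                indexed_counts.append(len(batch))

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(indexers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(indexed_counts)

# Toy stand-ins for the search and index sides.
batches = iter([[1, 2, 3], [4, 5], [6]])
total = run_pipelined_reindex(lambda: next(batches, None),
                              lambda b: time.sleep(0.01))
print(total)  # 6
```

The queue depth also doubles as the diagnostic mentioned above: a buffer that is always full means indexing is the bottleneck; always empty means search is.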
A neat fix would be to move to using PIT and
Reindex, delete-by-query, and update-by-query are positively venerable now and they've mostly served us well over the years. Folks use them all the time with fairly consistent success. But they were never fast. They weren't designed to be. They come from an era when lots of folks had to cobble together a reindex script by hand that scraped the scroll API. They exist because we figured it'd save time if we built the same thing into ES directly. So they operate about as simply as they can: opening a scroll and iterating on it. That's nice and low impact if you have a production cluster serving a huge amount of search traffic.
Folks frequently use reindex to revive a "stuck" cluster, or they're just running it on clusters with capacity to spare. We've understood this but never fundamentally altered the design. We brought in native slicing support, which was helpful but didn't bring as much speed as you'd hope. We took to advising folks to slice the index on a natural key like @timestamp and run many reindexes in parallel - either by hand or with something like gnu-parallel. These are fine recommendations for some folks, but as more folks have started using ES, more folks are having to run this parallel reindex process. It might be time to see if we can help all of them at once.
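For the record, the manual advice above amounts to running one request like this per time slice, in parallel, each with its own range filter (index names and dates here are made up; wait_for_completion=false returns a task you can poll, and ?slices=auto adds native slicing on top):

```
POST _reindex?slices=auto&wait_for_completion=false
{
  "source": {
    "index": "src-index",
    "query": {
      "range": { "@timestamp": { "gte": "2021-08-01", "lt": "2021-08-08" } }
    }
  },
  "dest": { "index": "dst-index" }
}
```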
I sure don't want to prescribe a how though. If we feel like the parallel slicing mechanism makes sense we could look at it. Now that we have persistent tasks and a good tasks API it could be a thing. But I really don't know the right solution. Just that we have an opportunity to save a lot of folks a lot of time.