
Speed up reindex and friends #76978

Open · nik9000 opened this issue Aug 26, 2021 · 5 comments

Labels: :Distributed/Reindex (Issues relating to reindex that are not caused by issues further down), >enhancement, Team:Distributed (Meta label for distributed team)

Comments

nik9000 (Member) commented Aug 26, 2021

Reindex, delete-by-query, and update-by-query are positively venerable now and they've mostly served us well over the years. Folks use them all the time with fairly consistent success. But they were never fast; they weren't designed to be. They come from an era when lots of folks had to cobble together a reindex script by hand that scraped the scroll API. They exist because we figured it'd save time if we built the same thing into ES directly. So they operate about as simply as you can: open a scroll and iterate over it. That's nice and low-impact if you have a production cluster serving a huge amount of search traffic.
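For reference, the hand-rolled scroll-and-copy loop these APIs replaced looks roughly like this. This is a minimal sketch with the Python client; the `src`/`dst` index names and page size are placeholders, not anything from the real implementation:

```python
# Sketch of the scroll-based copy loop that reindex is built on.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# helpers.scan wraps the scroll API: one open scroll, iterated serially.
hits = helpers.scan(es, index="src", query={"query": {"match_all": {}}}, size=1000)

actions = (
    {"_index": "dst", "_id": hit["_id"], "_source": hit["_source"]}
    for hit in hits
)
# bulk() pulls from the generator in chunks, so fetching a scroll page and
# flushing a bulk strictly alternate -- nothing overlaps.
helpers.bulk(es, actions)
```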

Folks frequently use reindex to revive a "stuck" cluster. Or they're running it on clusters with plenty of spare capacity. We've understood this but never fundamentally altered the design. We brought in native slicing support, which was helpful but didn't really bring as much speed as you'd hope. We took to advising folks to slice the index on natural keys like @timestamp and to run many reindexes in parallel, either by hand or with something like gnu-parallel (see the sketch below).
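The manual version of that advice looks something like this. It's a sketch only: the index names, date boundaries, and worker count are all made up, and a shell loop with gnu-parallel and curl would do the same job:

```python
# Sketch of "slice on @timestamp and run many reindexes in parallel".
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

months = [("2021-01-01", "2021-02-01"),
          ("2021-02-01", "2021-03-01"),
          ("2021-03-01", "2021-04-01")]

def reindex_slice(gte, lt):
    # Each call is an independent reindex over one @timestamp range.
    return es.reindex(body={
        "source": {
            "index": "src",
            "query": {"range": {"@timestamp": {"gte": gte, "lt": lt}}},
        },
        "dest": {"index": "dst"},
    })

with ThreadPoolExecutor(max_workers=3) as pool:
    for resp in pool.map(lambda r: reindex_slice(*r), months):
        print(resp["total"], "docs copied")
```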

These are fine recommendations for some folks, but as ES adoption grows, more and more folks are having to run this parallel reindex process by hand. It might be time to see if we can help all of them at once.

I sure don't want to prescribe a how, though. If we feel like the parallel slicing mechanism makes sense we could look at it; now that we have persistent tasks and a good tasks API, it could be a thing. But I really don't know the right solution, just that we have an opportunity to save a lot of folks a lot of time.

nik9000 added the >enhancement and :Distributed/Reindex labels on Aug 26, 2021
elasticmachine added the Team:Distributed label on Aug 26, 2021
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

hadadil commented Aug 26, 2021

@nik9000 nice idea... you are a dreamer, but let's make it happen 👍

DaveCTurner (Contributor) commented

I looked into a reindex-performance case recently and the bottleneck was very clearly on the indexing side; the searches were trivially fast in comparison, so all the search-side cleverness with slicing etc. didn't actually have much impact. I suspect that specifying the doc ID on the way through is a big drag on indexing performance in cases where the doc ID doesn't matter to the end user, because at indexing time we need to check whether each externally-specified doc ID has been seen before. I guess you can drop the doc ID with a script or something, but maybe we should make this easier.
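For context, the indexing-side difference is just whether the bulk actions carry an explicit `_id`. A hand-rolled copy can simply omit it so ES autogenerates IDs and skips the per-document existence check. A sketch, with placeholder index names:

```python
# Sketch: letting ES autogenerate IDs during a copy. Omitting "_id" from
# each bulk action means indexing can append without first checking whether
# an externally-supplied ID has been seen before.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

hits = helpers.scan(es, index="src", query={"query": {"match_all": {}}})
actions = ({"_index": "dst", "_source": hit["_source"]} for hit in hits)  # no "_id"
helpers.bulk(es, actions)
```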

One area for improvement that I can see is the lack of any buffering between the search and index sides. Each job flip-flops between waiting for a batch of docs to arrive from the search and waiting for a batch of docs to be indexed, but we're never doing both at once. We really should be searching and indexing at the same time, and ideally running multiple bulks in parallel to make use of multiple indexing threads on the shards. The buffer would also give us a nice way to see which side is the bottleneck.
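The shape of that buffer, sketched with a bounded queue between a scroll producer and a bulk consumer (all names, sizes, and the queue-depth heuristic are illustrative assumptions, not the real reindex implementation):

```python
# Sketch of buffering between the search and index sides: a bounded queue
# lets the scroll keep fetching while bulks are in flight.
import queue
import threading
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
buf = queue.Queue(maxsize=10_000)  # bounded: applies backpressure
DONE = object()

def producer():
    for hit in helpers.scan(es, index="src", query={"query": {"match_all": {}}}):
        buf.put({"_index": "dst", "_source": hit["_source"]})
    buf.put(DONE)

def drain():
    while (item := buf.get()) is not DONE:
        yield item

threading.Thread(target=producer, daemon=True).start()

# If buf stays full, indexing is the bottleneck; if it stays empty, search is.
for ok, resp in helpers.streaming_bulk(es, drain(), chunk_size=1000):
    if not ok:
        print("failed:", resp)
```

For the multiple-bulks-in-parallel part, the Python client's `helpers.parallel_bulk` does something similar with a thread pool of concurrent bulk requests.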

nik9000 (Member, Author) commented Aug 27, 2021 via email

DaveCTurner (Contributor) commented

> There are some concerns with scroll timeouts there, but it could be made fine.

A neat fix would be to move to using PIT and search_after instead of scroll, because you can keep a PIT alive without consuming any docs. Having said that, perhaps you can keep a scroll alive without consuming any docs too, by setting "size": 0? Not sure; needs some investigation. In any case we don't seem to worry about that today, and inserting a buffer shouldn't really affect the frequency of searches once it's got past the initial burst.
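For reference, the PIT + search_after paging shape looks like this. A sketch only: the index name and page size are placeholders, and the 7.x-era Python client call signatures are assumed:

```python
# Sketch of paging with a point-in-time and search_after instead of scroll.
# The PIT stays alive via keep_alive refreshes whether or not docs are consumed.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
pit = es.open_point_in_time(index="src", keep_alive="1m")["id"]

search_after = None
while True:
    body = {
        "size": 1000,
        "query": {"match_all": {}},
        "pit": {"id": pit, "keep_alive": "1m"},  # keep-alive refreshed per request
        "sort": [{"_shard_doc": "asc"}],         # cheap PIT-only tiebreaker sort
    }
    if search_after is not None:
        body["search_after"] = search_after
    resp = es.search(body=body)
    pit = resp.get("pit_id", pit)  # server may return an updated PIT id
    hits = resp["hits"]["hits"]
    if not hits:
        break
    search_after = hits[-1]["sort"]  # resume point; no server-side cursor consumed
    # ... hand `hits` to the indexing side ...

es.close_point_in_time(body={"id": pit})
```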
