Speed up reindex and friends #76978
Comments
Pinging @elastic/es-distributed (Team:Distributed)
@nik9000 nice idea... you are a dreamer, but let's make it happen 👍
I looked into a reindex-performance case recently and the bottleneck was absolutely definitely on the indexing side, the searches were trivially fast in comparison, so all the search-side cleverness with slicing etc didn't actually have much impact. I suspect the fact that we're specifying the doc ID on the way through is a big drag on indexing performance in cases where the doc ID doesn't matter to the end user, because at indexing time we need to check whether each externally-specified doc ID has been seen before or not. I guess you can drop the doc ID with a script or something but maybe we should make this easier.
One area for improvement that I can see is the lack of any buffering between the search and index sides. Each job flip-flops between waiting for a batch of docs to arrive from the search and waiting for a batch of docs to be indexed, but we're never doing both. We really should be searching and indexing at the same time, and ideally doing multiple bulks in parallel to make use of multiple indexing threads on the shards. The buffer would also give us a nice way to see which side is the bottleneck.
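For reference, dropping the externally-specified doc ID is possible today with a one-line reindex script, though it's not obvious. A sketch (index names here are made up; clearing _id in the script should make the destination autogenerate IDs, which skips the has-this-ID-been-seen-before check):

```
POST _reindex
{
  "source": { "index": "src-index" },
  "dest":   { "index": "dst-index" },
  "script": { "source": "ctx._id = null", "lang": "painless" }
}
```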
It'd be easy enough for the job to measure the time it spends waiting on search vs indexing and put it in the task. Reindex's task is pretty voluminous already.
Reindex and friends really could use a buffer, or anything that disconnects the incoming from the outgoing. There are some concerns with scroll timeouts there, but it could be made fine.
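A minimal sketch of the buffer idea: a bounded queue between the search side (producer) and several bulk-indexing workers (consumers), so both sides run at once instead of flip-flopping. The fetch/index callables here are toy stand-ins, not real Elasticsearch client calls:

```python
import queue
import threading
import time

def run_pipelined_reindex(fetch_batch, index_batch, buffer_size=4, indexers=2):
    """Overlap search and indexing via a bounded buffer of document batches."""
    buf = queue.Queue(maxsize=buffer_size)
    indexed_counts = []
    lock = threading.Lock()

    def producer():
        while True:
            batch = fetch_batch()
            if batch is None:           # search side exhausted
                break
            buf.put(batch)              # blocks when the indexers fall behind
        for _ in range(indexers):
            buf.put(None)               # one poison pill per indexer

    def consumer():
        while True:
            batch = buf.get()
            if batch is None:
                break
            index_batch(batch)          # a bulk request in real life
            with lock:
                indexed_counts.append(len(batch))

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(indexers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(indexed_counts)

# Toy stand-ins for the search and index sides.
batches = iter([[1, 2, 3], [4, 5], [6]])
total = run_pipelined_reindex(lambda: next(batches, None),
                              lambda b: time.sleep(0.01))
print(total)  # 6
```

The queue depth also doubles as the diagnostic mentioned above: a buffer that is always full means indexing is the bottleneck; always empty means search is.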
A neat fix would be to move to using PIT and
Reindex, delete-by-query, and update-by-query are positively venerable now and they've mostly served us well over the years. Folks use them all the time with fairly consistent success. But they were never fast. They weren't designed to be. They come from an era when lots of folks had to cobble together a reindex script by hand that scraped the scroll API. They exist because we figured it'd save time if we built the same thing into ES directly. So they operate about as simply as they can: opening a scroll and iterating on it. That's nice and low impact if you have a production cluster serving a huge amount of search traffic.
Folks frequently use reindex to revive a "stuck" cluster, or they're just running it on clusters with capacity to spare. We've understood this but never fundamentally altered the design. We brought in native slicing support, which was helpful but didn't bring as much speed as you'd hope. We took to advising folks to slice the index on a natural key like @timestamp and run many reindexes in parallel - either by hand or with something like gnu-parallel. These are fine recommendations for some folks, but as more folks have started using ES, more folks are having to run this parallel reindex process. It might be time to see if we can help all of them at once.
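For the record, the manual advice above amounts to running one request like this per time slice, in parallel, each with its own range filter (index names and dates here are made up; wait_for_completion=false returns a task you can poll, and ?slices=auto adds native slicing on top):

```
POST _reindex?slices=auto&wait_for_completion=false
{
  "source": {
    "index": "src-index",
    "query": {
      "range": { "@timestamp": { "gte": "2021-08-01", "lt": "2021-08-08" } }
    }
  },
  "dest": { "index": "dst-index" }
}
```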
I sure don't want to prescribe a how though. If we feel like the parallel slicing mechanism makes sense we could look at it. Now that we have persistent tasks and a good tasks API it could be a thing. But I really don't know the right solution. Just that we have an opportunity to save a lot of folks a lot of time.