
Add a bulk-loading mode to indexes #97534

Open
jpountz opened this issue Jul 10, 2023 · 5 comments
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement Team:Distributed Meta label for distributed team

Comments

@jpountz
Contributor

jpountz commented Jul 10, 2023

Description

A frequent use-case is an initial load of data, during which no searches are expected, followed by rare updates but heavy searching.

For such use-cases, it would be interesting to tune Elasticsearch appropriately for each of these two modes, e.g. the bulk-load mode could:

  • increase the merge factor from 10 to 32
  • increase the flush interval/size to reduce segment flushing
  • disable scheduled refreshes

And then we could also specialize for the rare-update/frequent-search use-case. In addition to bringing the above values back to normal, it could:

  • increase the min merge segment size from 2MB to a much higher value like 100MB

It's already possible to do all these things manually today (see the sketch below), but it would be nice to package it better so that a single setting needs to be updated to change the index "mode".
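For illustration, this is roughly what the manual approach looks like today with dynamic index settings (a sketch only: `my-index` is a placeholder and the values are illustrative, not recommendations):

```
# Bulk-load mode: no scheduled refreshes, larger/rarer flushes, higher merge factor
PUT /my-index/_settings
{
  "index.refresh_interval": "-1",
  "index.translog.flush_threshold_size": "4gb",
  "index.merge.policy.segments_per_tier": 32
}

# Rare-update/heavy-search mode: reset the above to defaults and raise the merge floor
PUT /my-index/_settings
{
  "index.refresh_interval": null,
  "index.translog.flush_threshold_size": null,
  "index.merge.policy.segments_per_tier": null,
  "index.merge.policy.floor_segment": "100mb"
}
```

A single index-level "mode" setting would bundle these adjustments behind one switch and keep them consistent.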

@jpountz jpountz added >enhancement :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. Team:Search Meta label for search team labels Jul 10, 2023
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team and removed Team:Search Meta label for search team labels Jul 10, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor

I also wonder if there's a way we could auto-detect this mode switch. For instance, would it work to start a new (non-data-stream) index in bulk-load mode and then flip it into regular mode on the first search?

@jpountz
Contributor Author

jpountz commented Jul 10, 2023

This sounds like it could be useful to avoid issues with users forgetting to switch back from bulk-load mode to regular mode.

@DaveCTurner
Contributor

In fact I wonder if we could piggy-back these things onto the existing search-idle mode. If a shard is search-idle it already skips scheduled refreshes. Could we also set the merge factor according to search-idleness?

Does adjusting the flush interval or size make a meaningful performance difference in these cases? By default we flush every 12h or 512MiB which already seems pretty relaxed to me.

@jpountz
Contributor Author

jpountz commented Jul 11, 2023

> In fact I wonder if we could piggy-back these things onto the existing search-idle mode. If a shard is search-idle it already skips scheduled refreshes. Could we also set the merge factor according to search-idleness?

I guess we could, but I would be a bit reluctant to apply some of the above ideas to an index that is in a steady state, e.g. the increase in flush size/interval could make recoveries take significantly longer. Another challenge is that the more search-heavy use-cases that include this bulk-load phase generally want their first query to be fast, so even if we made it automatic via the search-idle mechanism, they would still need somewhat complex workflows, e.g. waiting for big merges to complete before their indexes start serving searches.
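(For context, that manual end-of-load workflow today looks roughly like the sketch below, with a placeholder index name; the exact steps depend on the use-case.)

```
# Wait for merging to settle; the call blocks until the force merge completes
POST /my-index/_forcemerge?max_num_segments=1

# Re-enable scheduled refreshes and make the loaded data visible to searches
PUT /my-index/_settings
{
  "index.refresh_interval": null
}

POST /my-index/_refresh
```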

> Does adjusting the flush interval or size make a meaningful performance difference in these cases? By default we flush every 12h or 512MiB which already seems pretty relaxed to me.

For reference, it was changed to 1 min / 10 GiB in #93524, as flushing every 512MiB boils down to flushing every 3-5 seconds with the TSDB track on moderately powerful hardware. On some datasets, like those with kNN vectors that are expensive to merge, saving segment refreshes/flushes can make a good difference.
