
Add a bulk-loading mode to indexes #97534

Open
jpountz opened this issue Jul 10, 2023 · 5 comments
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement Team:Distributed Meta label for distributed team

Comments

@jpountz
Contributor

jpountz commented Jul 10, 2023

Description

A frequent use-case is an initial load of data, during which no searches are expected, followed by rare updates but heavy searching.

For such use-cases, it would be interesting to tune Elasticsearch appropriately for each of these two modes, e.g. the bulk-load mode could:

  • increase the merge factor from 10 to 32
  • increase the flush interval/size to reduce segment flushing
  • disable scheduled refreshes

And then we could also specialize for the rare-update/frequent-search use-case. In addition to bringing the above values back to normal, it could:

  • increase the min merge segment size from 2MB to a much higher value like 100MB

It's already possible to do all these things manually today (see the sketch below), but it would be nice to package it better so that a single setting needs to be updated to change the index "mode".
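For illustration, this is roughly what the manual approach looks like today with dynamic index settings (a sketch only: `my-index` is a placeholder and the values are illustrative, not recommendations):

```
# Bulk-load mode: no scheduled refreshes, larger/rarer flushes, higher merge factor
PUT /my-index/_settings
{
  "index.refresh_interval": "-1",
  "index.translog.flush_threshold_size": "4gb",
  "index.merge.policy.segments_per_tier": 32
}

# Rare-update/heavy-search mode: reset the above to defaults and raise the merge floor
PUT /my-index/_settings
{
  "index.refresh_interval": null,
  "index.translog.flush_threshold_size": null,
  "index.merge.policy.segments_per_tier": null,
  "index.merge.policy.floor_segment": "100mb"
}
```

A single index-level "mode" setting would bundle these adjustments behind one switch and keep them consistent.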

@jpountz jpountz added >enhancement :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. Team:Search Meta label for search team labels Jul 10, 2023
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team and removed Team:Search Meta label for search team labels Jul 10, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor

I also wonder if there's a way we could auto-detect this mode switch. For instance, would it work to start a new (non-data-stream) index in bulk-load mode and then flip it into regular mode on the first search?

@jpountz
Contributor Author

jpountz commented Jul 10, 2023

This sounds like it could be useful to avoid issues with users forgetting to switch back from bulk-load mode to regular mode.

@DaveCTurner
Contributor

In fact I wonder if we could piggy-back these things onto the existing search-idle mode. If a shard is search-idle it already skips scheduled refreshes. Could we also set the merge factor according to search-idleness?

Does adjusting the flush interval or size make a meaningful performance difference in these cases? By default we flush every 12h or 512MiB which already seems pretty relaxed to me.

@jpountz
Contributor Author

jpountz commented Jul 11, 2023

> In fact I wonder if we could piggy-back these things onto the existing search-idle mode. If a shard is search-idle it already skips scheduled refreshes. Could we also set the merge factor according to search-idleness?

I guess we could, but I would be a bit reluctant to apply some of the above ideas to an index that is in a steady state, e.g. the increase in flush size/interval could make recoveries take significantly longer. Another challenge is that the more search-heavy use-cases that include this bulk-load phase generally want their first query to be fast, so even if we made it automatic via the search-idle mechanism, they would still need somewhat complex workflows, e.g. waiting for big merges to complete before their indexes start serving searches.
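(For context, that manual end-of-load workflow today looks roughly like the sketch below, with a placeholder index name; the exact steps depend on the use-case.)

```
# Wait for merging to settle; the call blocks until the force merge completes
POST /my-index/_forcemerge?max_num_segments=1

# Re-enable scheduled refreshes and make the loaded data visible to searches
PUT /my-index/_settings
{
  "index.refresh_interval": null
}

POST /my-index/_refresh
```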

> Does adjusting the flush interval or size make a meaningful performance difference in these cases? By default we flush every 12h or 512MiB which already seems pretty relaxed to me.

For reference, it was changed to 1 min / 10 GiB in #93524, as flushing every 512MiB boils down to flushing every 3-5 seconds with the TSDB track on moderately powerful hardware. On some datasets, like those with kNN vectors that are expensive to merge, saving segment refreshes/flushes can make a good difference.
