
Combine ILM shrink and force merge #73499

Open
dakrone opened this issue May 27, 2021 · 8 comments
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement Team:Data Management Meta label for data/management team

Comments

dakrone (Member) commented May 27, 2021

It's a common use case for an ILM policy to have a shrink action as well as a forcemerge action in the warm phase. However, in order to reduce DTS costs, we should investigate combining these actions.
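For reference, a policy combining the two actions in the warm phase might look like this (an illustrative sketch; the policy name, `min_age`, and shard/segment counts are placeholders):

```
PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}
```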

Currently when performing a shrink, the following actions are taken by ILM (this is a subset):

  • Select a node that will perform the shrink
  • Relocate a single copy of each shard to the node
  • Perform the shrink
  • Shrink creates a new index on the same node, with the same number of replica shards
  • As the new index initializes, it is then replicated to a different node (assuming number_of_replicas=1)
  • Add the new index to the data stream or alias while removing the old index

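Outside of ILM, the first few steps can be approximated with the index APIs (a rough manual sketch; `source-index`, `target-index`, and the node name are placeholders, and waiting for relocation to finish is omitted):

```
# Pin one copy of each shard to a single node and block writes
PUT /source-index/_settings
{
  "index.routing.allocation.require._name": "shrink-node-1",
  "index.blocks.write": true
}

# Shrink into a new index on that node
POST /source-index/_shrink/target-index
```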
The forcemerge action performs a simple forcemerge of the index, but because the index has replicas, the merge work is duplicated on each shard copy, and because merging is non-deterministic, the resulting segments will likely differ between the nodes, leading to replication of segments.

There are at least two things we can do to help reduce DTS costs related to this:

Shrink into an index with zero replicas

When we shrink, ILM currently creates the shrunken index with the same replica count as the original. Since this happens transparently in the background, there is no need to create the shrunken index with a replica. Instead, we can create it with zero replicas, then increase the number of replicas to match the original index's count before deleting the original index.
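The shrink API already accepts settings for the target index, so the zero-replica shrink could be expressed as (sketch; index names are placeholders):

```
POST /source-index/_shrink/target-index
{
  "settings": {
    "index.number_of_replicas": 0
  }
}
```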

Since shrink now has ILM resiliency, it means that in the event that something goes wrong, no data loss occurs, and ILM can retry.

By itself, this doesn't reduce DTS, because the data will still have to be replicated across the zone boundary regardless. However, it pays off when combined with the next enhancement:

Perform forcemerge prior to increasing the replica count

Forcemerge also ends up causing replication across zone boundaries. However, if we perform the forcemerge at a point where the index has no replicas, it only needs to run once, and the merged data will be replicated to a different zone only a single time.

If we combine both of these behaviors, the new behavior looks like:

  • Select a node that will perform the shrink
  • Relocate a single copy of each shard to the node
  • Perform the shrink
  • Shrink creates a new index on the chosen node with 0 replicas
  • The new index is initialized
  • Force merge the shrunken index
  • Increase the number of replicas on the force-merged, shrunken index back to the original index's count (likely 1 replica)
  • Add the new index to the data stream or alias while removing the old index
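The force-merge and replica-increase steps map onto existing APIs roughly as follows (sketch; `target-index` and the replica count are placeholders):

```
# Merge down to a single segment while the index has no replicas
POST /target-index/_forcemerge?max_num_segments=1

# Then restore the original replica count
PUT /target-index/_settings
{
  "index.number_of_replicas": 1
}
```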

Here is a before picture:
[diagram: cross-zone data transfer with the current shrink + forcemerge behavior]

And here is an after picture:
[diagram: cross-zone data transfer with the proposed combined behavior]

In both examples I treated the single-node allocation step (where ILM has to get a copy of each shard onto the same node) as "smart", not sending any data across zones. Still, this step is tedious, and it would be nice if we could skip it.

@dakrone dakrone added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels May 27, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label May 27, 2021
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features (Team:Core/Features)

gaobinlong (Contributor) commented

@dakrone, can I work on this issue? I'm a heavy user of ILM and want to make more contributions to the feature.

dakrone (Member, Author) commented Jul 28, 2021

@gaobinlong I appreciate the interest! For this one though, I think we should hold off. I'm not yet sure of the best way to implement this: whether we want to put something solely in the shrink action, or whether we want to introduce the concept of a "logical plan" into ILM that can re-order or combine steps so they can be optimized.

gaobinlong (Contributor) commented

@dakrone thanks for your reply; I will keep track of this issue and follow the development of ILM.

jpountz (Contributor) commented Oct 27, 2021

whether we want to put something solely in the shrink action, or whether we want to introduce the concept of a "logical plan" into ILM that can re-order or combine steps to be optimized

Maybe one argument for the latter is that we would likely want to also optimize the forcemerge + shrink + searchable_snapshot workflow to replace the step that increases the number of replicas of the shrunken index with taking a snapshot and doing a snapshot recovery?

dakrone (Member, Author) commented Oct 27, 2021

@jpountz yes with a logical plan we could re-order, elide, or enhance actions to make more combinations of actions efficient.

jpountz (Contributor) commented Oct 28, 2021

In addition to the DTS costs, there is another aspect of this proposal that I like a lot: it would reduce the CPU cost of the forcemerge operation by 2x, since it would run on a single shard copy rather than on both the primary and its replica.

This would be a win on its own, plus we could then have more discussions about shifting some of the CPU cost from natural merges to forced merges, e.g.

  • Maybe our built-in index templates / ILM policies should index with index.codec: best_speed, and only move to index.codec: best_compression at forcemerge time.
  • Maybe data streams and time-based indices could use a merge policy that is lighter on natural merges, e.g. by decreasing the max merged segment size from 5GB to 2GB (which would need to be evaluated properly due to the potential impact on search performance), and then do more merging in the forcemerge.
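For context, `index.codec` is a static setting applied at index creation time, typically via a template; `best_compression` is a supported value today, while the `best_speed` value proposed above would be new (sketch; the template name and pattern are placeholders):

```
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.codec": "best_compression"
    }
  }
}
```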

VimCommando (Contributor) commented

There is related discussion in Can we avoid force-merging all shard copies?
