
Combine ILM shrink and force merge #73499

Open
dakrone opened this issue May 27, 2021 · 8 comments
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement Team:Data Management Meta label for data/management team

Comments

dakrone (Member) commented May 27, 2021

It's a common use case for an ILM policy to have a shrink action as well as a forcemerge action in the warm phase. However, in order to reduce DTS costs, we should investigate combining these actions.
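For reference, a policy combining the two actions in the warm phase might look like this (an illustrative sketch; the policy name, `min_age`, and shard/segment counts are placeholders):

```
PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}
```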

Currently when performing a shrink, the following actions are taken by ILM (this is a subset):

  • Select a node that will perform the shrink
  • Relocate a single copy of each shard to the node
  • Perform the shrink
  • Shrink creates a new index on the same node, with the same number of replica shards
  • As the new index initializes, it is then replicated to a different node (assuming number_of_replicas=1)
  • Add the new index to the data stream or alias while removing the old index

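Outside of ILM, the first few steps can be approximated with the index APIs (a rough manual sketch; `source-index`, `target-index`, and the node name are placeholders, and waiting for relocation to finish is omitted):

```
# Pin one copy of each shard to a single node and block writes
PUT /source-index/_settings
{
  "index.routing.allocation.require._name": "shrink-node-1",
  "index.blocks.write": true
}

# Shrink into a new index on that node
POST /source-index/_shrink/target-index
```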
The forcemerge action performs a simple forcemerge of the index, but because the index has replicas, the merge work is duplicated on each shard copy, and because merging is non-deterministic, the resulting segments will likely differ between the nodes, leading to replication of segments.

There are at least two things we can do to help reduce DTS costs related to this:

Shrink into an index with zero replicas

When we shrink, ILM currently creates the shrunken index with the same replica count as the original. Since this happens transparently in the background, there is no need to create the shrunken index with a replica. Instead, we can create it with zero replicas, then increase the number of replicas to match the original index's count before deleting the original index.
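The shrink API already accepts settings for the target index, so the zero-replica shrink could be expressed as (sketch; index names are placeholders):

```
POST /source-index/_shrink/target-index
{
  "settings": {
    "index.number_of_replicas": 0
  }
}
```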

Since shrink now has ILM resiliency, it means that in the event that something goes wrong, no data loss occurs, and ILM can retry.

By itself, this doesn't reduce DTS, because the data will still have to be replicated across the zone boundary regardless. However, it pays off when combined with the next enhancement:

Perform forcemerge prior to increasing the replica count

Forcemerge also ends up causing replication across zone boundaries. However, if we perform the forcemerge at a point where the index has no replicas, it only needs to run once, and the merged data will be replicated to a different zone only a single time.

If we combine both of these behaviors, the new behavior looks like:

  • Select a node that will perform the shrink
  • Relocate a single copy of each shard to the node
  • Perform the shrink
  • Shrink creates a new index on the chosen node with 0 replicas
  • The new index is initialized
  • Force merge the shrunken index
  • Increase the number of replicas on the force-merged, shrunken index back to the original index's count (likely 1 replica)
  • Add the new index to the data stream or alias while removing the old index
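The force-merge and replica-increase steps map onto existing APIs roughly as follows (sketch; `target-index` and the replica count are placeholders):

```
# Merge down to a single segment while the index has no replicas
POST /target-index/_forcemerge?max_num_segments=1

# Then restore the original replica count
PUT /target-index/_settings
{
  "index.number_of_replicas": 1
}
```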

Here is a before picture:
[diagram: cross-zone data transfer with the current shrink + forcemerge behavior]

And here is an after picture:
[diagram: cross-zone data transfer with the proposed combined behavior]

In both examples I treated the single-node allocation step (where ILM has to get a copy of each shard onto the same node) as "smart", not sending any data across zones. Still, this step is tedious, and it would be nice if we could skip it.

@dakrone dakrone added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels May 27, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label May 27, 2021
elasticmachine (Collaborator) commented

Pinging @elastic/es-core-features (Team:Core/Features)

gaobinlong (Contributor) commented

@dakrone, can I work on this issue? I'm a heavy user of ILM and want to make more contributions to the feature.

dakrone (Member, Author) commented Jul 28, 2021

@gaobinlong I appreciate the interest! For this one though, I think we should hold off. I'm not yet sure of the best way to implement this: whether we want to put something solely in the shrink action, or whether we want to introduce the concept of a "logical plan" into ILM that can re-order or combine steps so they can be optimized.

gaobinlong (Contributor) commented

@dakrone thanks for your reply; I will keep track of this issue and follow the development of ILM.

jpountz (Contributor) commented Oct 27, 2021

whether we want to put something solely in the shrink action, or whether we want to introduce the concept of a "logical plan" into ILM that can re-order or combine steps to be optimized

Maybe one argument for the latter is that we would likely want to also optimize the forcemerge + shrink + searchable_snapshot workflow to replace the step that increases the number of replicas of the shrunken index with taking a snapshot and doing a snapshot recovery?

dakrone (Member, Author) commented Oct 27, 2021

@jpountz yes with a logical plan we could re-order, elide, or enhance actions to make more combinations of actions efficient.

jpountz (Contributor) commented Oct 28, 2021

In addition to the DTS costs, there is another aspect of this proposal that I like a lot: it would reduce the CPU cost of the forcemerge operation by 2x, since it would run on a single shard copy rather than on both the primary and its replica.

This would be a win on its own, plus we could then have more discussions about shifting some of the CPU cost from natural merges to forced merges, e.g.

  • Maybe our built-in index templates / ILM policies should index with index.codec: best_speed, and only move to index.codec: best_compression at forcemerge time.
  • Maybe data streams and time-based indices could use a merge policy that is lighter on natural merges, e.g. by decreasing the max merged segment size from 5GB to 2GB (which would need to be evaluated properly due to the potential impact on search performance), and then do more merging in the forcemerge.
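For context, `index.codec` is a static setting applied at index creation time, typically via a template; `best_compression` is a supported value today, while the `best_speed` value proposed above would be new (sketch; the template name and pattern are placeholders):

```
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.codec": "best_compression"
    }
  }
}
```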

VimCommando (Contributor) commented

There is related discussion in Can we avoid force-merging all shard copies?
