
New primary shards are all assigned to single node that is over watermark.low, if it has fewer shards #109775

Closed
EmilBode opened this issue Jun 14, 2024 · 1 comment
Labels
>bug needs:triage Requires assignment of a team area label

Comments

@EmilBode

Elasticsearch Version

7.17.15

Installed Plugins

Java Version

bundled

OS Version

Ubuntu 20.04.6 LTS

Problem Description

Technically, elasticsearch behaves as documented, but it results in undesired behavior.

Our setup

We ingest streaming data, where documents have a lifetime that is not known at ingest time. To handle this, we use an index lifecycle policy that rolls over to a new index when the old one gets too full, but we don't set an expiry date for old indices. Instead, we manually delete documents that are no longer needed, and perform manual force-merges when necessary.
The end result is that we have a fairly large number of indices that are relatively small (and shrinking as we delete data from them; we only delete an index once it contains no documents at all), alongside newly generated indices that are large (20 shards of 50 GB each, plus replicas, before rollover).
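For context, our rollover setup is roughly along these lines (the policy name and exact threshold here are illustrative, not our actual configuration):

```
PUT _ilm/policy/streaming-data-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb"
          }
        }
      }
    }
  }
}
```

There is deliberately no delete phase; old indices stick around and shrink as we remove documents from them.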

Furthermore, we have zone awareness across our 8 data nodes: 4 are in zone-0, and 4 in zone-1 (ignoring the master-eligible nodes).

I'm no longer sure exactly what caused it (we had to replace a drive in one of the nodes, and had a few network outages, but it doesn't really matter), but at some point the shard assignment became unbalanced: one of our nodes hit the low watermark, which prevented any shards from being relocated to it, even though it had fewer shards than the other nodes.

Situation when the problems started

So, we had 8 nodes, with the following number of shards assigned and disk usage:

  • Node A was in zone-0, had ~350 shards, and ~94% disk used.
  • Nodes B, C and D were in zone-0; each had ~550 shards and ~62% disk used.
  • Nodes E, F, G and H were in zone-1; each had ~500 shards and ~70% disk used.

The low watermark was set at 90%, the high watermark at 95%.
Note that since each of our indices had 1 replica, moving data between ABCD and EFGH was impossible due to zone awareness.
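For reference, the watermarks described above correspond to the standard disk-based allocation settings (percentage values denote disk used):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
```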

The problem

First, no rebalancing was done between A and BCD. While node A had fewer shards, relocation to it was suppressed because of the low watermark. At the same time, no rebalancing based on disk usage was done either, as rebalancing mainly looks at the number of shards.

Yet, at rollover, all new primary shards were assigned to node A: it had the fewest shards, and primary shards can still be allocated to nodes that are over the low watermark. Moreover, newly created shards take up almost no disk space at creation time, so nothing triggered the high watermark yet.
Only when the new index started to fill up did it at some point trigger relocations, once our high watermark was hit.

Suggested solutions

  • While I think it's good that allocation of new primary shards is not outright prevented by the low watermark, there's no reason why the allocator should place new primaries on a node that has hit it, as long as alternatives are available.
  • Perhaps there could be a rule allowing the rebalancer to "exchange" shards between nodes even when one is above the low watermark, as long as the exchange reduces disk usage on the affected node.

This is how I manually fixed the problem:

  • Disable automatic reallocation
  • Manually move some large shards from A to BCD
  • Manually move some more smaller shards from BCD to A, to balance the shard count
  • Re-enable automatic reallocation
  • Wait until the reallocation algorithm is happy, while monitoring disk usage.
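A sketch of those steps using the cluster settings and reroute APIs (index and node names here are placeholders, and "disable automatic reallocation" is expressed via the rebalance setting; depending on your needs you may prefer `cluster.routing.allocation.enable` instead):

```
PUT _cluster/settings
{ "transient": { "cluster.routing.rebalance.enable": "none" } }

POST /_cluster/reroute
{
  "commands": [
    { "move": { "index": "my-large-index", "shard": 3,
                "from_node": "node-A", "to_node": "node-B" } }
  ]
}

PUT _cluster/settings
{ "transient": { "cluster.routing.rebalance.enable": null } }
```

Setting the transient value back to `null` restores the default ("all"), after which the balancer takes over again.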

Steps to Reproduce

I think the issue can be reproduced with two smallish nodes (or setting watermarks to trigger really soon).

  • Disable automatic reallocation
  • Create a number of indices with many shards, but without replicas, and put some documents in them (indices must remain small)
  • Create a new index, with an index lifecycle policy, and also no replicas
  • Fill the new index so that it becomes large (close to rollover size)
  • Manually move (with POST /_cluster/reroute) the shards from the small indices to node 1.
  • Manually move the shards from the large (lifecycled) index to node 2.
  • Set the watermarks so that watermark.low is triggered on node 2, but watermark.high is not, nor will it be until (long) after rollover.
  • Re-enable automatic reallocation.
  • Continue writing documents to the lifecycled index, until rollover is triggered
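For the repro it may be easier to use absolute byte values for the watermarks (note the change in semantics: percentages mean disk used, byte values mean minimum free space required). The exact values below are illustrative and would need tuning to your node sizes:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "10gb",
    "cluster.routing.allocation.disk.watermark.high": "2gb"
  }
}
```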

Logs (if relevant)

No response

@EmilBode EmilBode added >bug needs:triage Requires assignment of a team area label labels Jun 14, 2024
@DaveCTurner
Contributor

On the other hand, rebalancing of disk usage was also not done, as rebalancing mainly looks at the number of shards.

I think you describe the behaviour in 7.17 accurately, but recent 8.x versions now (a) account for disk usage in the balancer rather than just focussing on shard count and (b) forecast the eventual size to which shards will grow rather than treating new shards as if they will always have near-zero size. I recommend upgrading, and I'm closing this issue since I think it's already been fixed (or at least changed beyond recognition) in newer versions.

@DaveCTurner DaveCTurner closed this as not planned Jun 14, 2024