Technically, Elasticsearch behaves as documented, but the result is undesired behavior.
Our setup
We ingest streaming data, where documents have a lifetime that is not known at ingest time. To ingest it all, we have an index lifecycle policy where we switch to a new index when the old one is getting too full, but we don't have a set expiry date for old indices. Instead, we manually delete documents that are no longer needed, and perform manual force merges when necessary.
The end result is that we have a fairly large number of indices that are relatively small (and shrinking as we delete data from them; we only delete an index once it contains no documents at all), coupled with newly generated indices that are large (20 shards of 50 GB, plus replicas, before rollover).
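For context, the rollover-based lifecycle described above can be sketched as an ILM policy along these lines (the policy name is hypothetical; the 50 GB per-shard threshold matches the sizes mentioned above):

```
PUT _ilm/policy/streaming-data-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb"
          }
        }
      }
    }
  }
}
```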
Furthermore, we have zone awareness across our 8 data nodes: 4 are in zone-0, and 4 in zone-1 (ignoring the master-eligible nodes).
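Zone awareness of this kind is typically configured with a node attribute plus an awareness setting, roughly as follows (the attribute name `zone` is an assumption):

```
# elasticsearch.yml on each data node
node.attr.zone: zone-0          # zone-1 on the other four nodes

# cluster-wide setting
cluster.routing.allocation.awareness.attributes: zone
```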
I'm not sure anymore exactly what caused it (we had to replace a drive in one of the nodes, and had a few network outages, but it doesn't really matter), but at some point the shard assignment became unbalanced, and one of our nodes hit the low watermark, preventing any shards from being relocated to that node, even though it had fewer shards than the other nodes.
Situation when the problems started
So, we had 8 nodes, with the following shard counts and disk usage:
Node A was in zone-0, had ~350 shards, and ~94% disk used.
Nodes B, C and D were in zone-0, each with ~550 shards and ~62% disk used.
Nodes E, F, G and H were in zone-1, each with ~500 shards and ~70% disk used.
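This kind of imbalance can be observed per node with the cat allocation API, for example:

```
GET _cat/allocation?v&h=node,shards,disk.percent
```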
The watermarks were set to 90% (low) and 95% (high).
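These thresholds correspond to cluster settings along these lines:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
```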
Note that as each of our indices had 1 replica, movement of data between ABCD and EFGH was impossible due to zone awareness.
The problem
First, no rebalancing was done between A and BCD. While node A had fewer shards, reallocation was suppressed because of the low watermark. On the other hand, rebalancing of disk usage was also not done, as rebalancing mainly looks at the number of shards.
Yet, at rollover, all new primary shards were assigned to node A: it had the fewest shards, and primary shards can still be allocated to nodes that are over the low watermark. Also, newly created shards don't count towards disk usage at creation time, so nothing triggered the high watermark yet.
Only when the new index started to fill up did relocations eventually get triggered, once our high watermark was hit.
Suggested solutions
While I think it's good that allocation of new primary shards isn't blocked outright by the low watermark threshold, there's no reason to allocate new primaries to a node that has already hit it, as long as alternatives are available.
Perhaps there could be some rule where the rebalancer could "exchange" shards between nodes even if one is above the low watermark, as long as the exchange results in lower disk usage for the affected node.
This is how I manually fixed our problem for now:
Disable automatic reallocation
Manually move some large shards from A to BCD
Manually move several smaller shards from BCD to A, to balance the shard count
Re-enable automatic reallocation
Wait until the reallocation algorithm is happy, while monitoring disk usage.
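The steps above map onto requests roughly like these (index and node names are placeholders):

```
# 1. Disable automatic rebalancing
PUT _cluster/settings
{ "transient": { "cluster.routing.rebalance.enable": "none" } }

# 2./3. Move shards manually
POST /_cluster/reroute
{
  "commands": [
    { "move": { "index": "big-index-000007", "shard": 3,
                "from_node": "node-A", "to_node": "node-B" } }
  ]
}

# 4. Re-enable automatic rebalancing
PUT _cluster/settings
{ "transient": { "cluster.routing.rebalance.enable": "all" } }
```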
Steps to Reproduce
I think the issue can be reproduced with two smallish nodes (or with watermarks set to trigger very early).
Disable automatic reallocation
Create a number of indices with many shards, but without replicas, and put some documents in them (indices must remain small)
Create a new index, with an index lifecycle policy, and also no replicas
Fill the new index so that it becomes large (close to rollover size)
Manually move (with POST /_cluster/reroute) the shards from the small indices to node 1.
Manually move the shards from the large (lifecycled) index to node 2.
Set the watermarks so that watermark.low is triggered on node 2, but watermark.high is not, nor will it be until (long) after rollover.
Re-enable automatic reallocation.
Continue writing documents to the lifecycled index, until rollover is triggered
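For the watermark step, note that the disk watermarks also accept absolute free-space values, which makes it easier to trigger the low watermark on node 2 while keeping the high watermark out of reach; for example (the values are illustrative and depend on the node's disk size):

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "10gb",
    "cluster.routing.allocation.disk.watermark.high": "500mb"
  }
}
```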
Logs (if relevant)
No response
On the other hand, rebalancing of disk usage was also not done, as rebalancing mainly looks at the number of shards.
I think you describe the behaviour in 7.17 accurately, but recent 8.x versions now (a) account for disk usage in the balancer rather than just focussing on shard count and (b) forecast the eventual size to which shards will grow rather than treating new shards as if they will always have near-zero size. I recommend upgrading, and I'm closing this issue since I think it's already been fixed (or at least changed beyond recognition) in newer versions.
Elasticsearch Version
7.17.15
Installed Plugins
Java Version
bundled
OS Version
Ubuntu 20.04.6 LTS