
New primary shards are all assigned to single node that is over watermark.low, if it has fewer shards #109775

Closed
EmilBode opened this issue Jun 14, 2024 · 1 comment
Labels
>bug needs:triage Requires assignment of a team area label

Comments

@EmilBode

Elasticsearch Version

7.17.15

Installed Plugins

Java Version

bundled

OS Version

Ubuntu 20.04.6 LTS

Problem Description

Technically, elasticsearch behaves as documented, but it results in undesired behavior.

Our setup

We ingest streaming data, where documents have a lifetime that is not known at ingest time. To handle this, we use an index lifecycle policy that rolls over to a new index when the old one gets too full, but we don't set an expiry date for old indices. Instead, we manually delete documents that are no longer needed, and perform manual force-merges when necessary.
The end result is that we have a fairly large number of indices that are relatively small (and shrinking as we delete data from them; we only delete an index once it contains no documents at all), alongside newly generated indices that are large (20 shards of 50 GB each, plus replicas, before rollover).
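For context, our rollover setup is roughly along these lines (the policy name and exact threshold here are illustrative, not our actual configuration):

```
PUT _ilm/policy/streaming-data-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb"
          }
        }
      }
    }
  }
}
```

There is deliberately no delete phase; old indices stick around and shrink as we remove documents from them.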

Furthermore, we have zone awareness across our 8 data nodes: 4 are in zone-0, and 4 in zone-1 (ignoring the master-eligible nodes).

I'm no longer sure exactly what caused it (we had to replace a drive in one of the nodes, and had a few network outages, but it doesn't really matter), but at some point the shard assignment became unbalanced: one of our nodes hit the low watermark, which prevented any shards from being relocated to it, even though it had fewer shards than the other nodes.

Situation when the problems started

So, we had 8 nodes, with the following number of shards assigned and disk usage:

  • Node A was in zone-0, had ~350 shards, and ~94% disk used.
  • Nodes B, C and D were in zone-0; each had ~550 shards and ~62% disk used.
  • Nodes E, F, G and H were in zone-1; each had ~500 shards and ~70% disk used.

The low watermark was set at 90%, the high watermark at 95%.
Note that since each of our indices had 1 replica, moving data between ABCD and EFGH was impossible due to zone awareness.
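For reference, the watermarks described above correspond to the standard disk-based allocation settings (percentage values denote disk used):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
```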

The problem

First, no rebalancing was done between A and BCD. While node A had fewer shards, relocation to it was suppressed because of the low watermark. At the same time, no rebalancing based on disk usage was done either, as rebalancing mainly looks at the number of shards.

Yet, at rollover, all new primary shards were assigned to node A: it had the fewest shards, and primary shards can still be allocated to nodes that are over the low watermark. Moreover, newly created shards take up almost no disk space at creation time, so nothing triggered the high watermark yet.
Only when the new index started to fill up did it at some point trigger relocations, once our high watermark was hit.

Suggested solutions

  • While I think it's good that allocation of new primary shards is not outright prevented by the low watermark, there's no reason why the allocator should place new primaries on a node that has hit it, as long as alternatives are available.
  • Perhaps there could be a rule allowing the rebalancer to "exchange" shards between nodes even when one is above the low watermark, as long as the exchange reduces disk usage on the affected node.

This is how I manually fixed the problem:

  • Disable automatic reallocation
  • Manually move some large shards from A to BCD
  • Manually move some more smaller shards from BCD to A, to balance the shard count
  • Re-enable automatic reallocation
  • Wait until the reallocation algorithm is happy, while monitoring disk usage.
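A sketch of those steps using the cluster settings and reroute APIs (index and node names here are placeholders, and "disable automatic reallocation" is expressed via the rebalance setting; depending on your needs you may prefer `cluster.routing.allocation.enable` instead):

```
PUT _cluster/settings
{ "transient": { "cluster.routing.rebalance.enable": "none" } }

POST /_cluster/reroute
{
  "commands": [
    { "move": { "index": "my-large-index", "shard": 3,
                "from_node": "node-A", "to_node": "node-B" } }
  ]
}

PUT _cluster/settings
{ "transient": { "cluster.routing.rebalance.enable": null } }
```

Setting the transient value back to `null` restores the default ("all"), after which the balancer takes over again.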

Steps to Reproduce

I think the issue can be reproduced with two smallish nodes (or setting watermarks to trigger really soon).

  • Disable automatic reallocation
  • Create a number of indices with many shards, but without replicas, and put some documents in them (indices must remain small)
  • Create a new index, with an index lifecycle policy, and also no replicas
  • Fill the new index so that it becomes large (close to rollover size)
  • Manually move (with POST /_cluster/reroute) the shards from the small indices to node 1.
  • Manually move the shards from the large (lifecycled) index to node 2.
  • Set the watermarks so that watermark.low is triggered on node 2, but watermark.high is not, nor will it be until (long) after rollover.
  • Re-enable automatic reallocation.
  • Continue writing documents to the lifecycled index, until rollover is triggered
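For the repro it may be easier to use absolute byte values for the watermarks (note the change in semantics: percentages mean disk used, byte values mean minimum free space required). The exact values below are illustrative and would need tuning to your node sizes:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "10gb",
    "cluster.routing.allocation.disk.watermark.high": "2gb"
  }
}
```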

Logs (if relevant)

No response

@EmilBode EmilBode added >bug needs:triage Requires assignment of a team area label labels Jun 14, 2024
@DaveCTurner
Contributor

On the other hand, rebalancing of disk usage was also not done, as rebalancing mainly looks at the number of shards.

I think you describe the behaviour in 7.17 accurately, but recent 8.x versions now (a) account for disk usage in the balancer rather than just focussing on shard count and (b) forecast the eventual size to which shards will grow rather than treating new shards as if they will always have near-zero size. I recommend upgrading, and I'm closing this issue since I think it's already been fixed (or at least changed beyond recognition) in newer versions.

@DaveCTurner DaveCTurner closed this as not planned Jun 14, 2024