Primary shard balancing #41543
Comments
Pinging @elastic/es-distributed
For what it's worth, https://github.com/datarank/tempest does this. Disclaimer, I am one of the authors of that project. I am hoping to port it to more recent versions of ES at some point.
The trickiest part of this feature is that today there is no way to demote a primary back to being a replica. The existing mechanism requires us to fail the primary, promote a specific replica, and then perform a recovery to replace the failed primary as a replica. Until the recovery is complete the shard is running with reduced redundancy, and I don't think it's acceptable for the balancer to detract from resilience like this. Therefore, I'm labelling this as high-hanging fruit. I've studied Tempest a little and I don't think anything in that plugin changes this. I also think it does not balance the primaries out in situations like the one described in the OP:
At least, there doesn't seem to be a test showing that it does, and I couldn't find a mechanism that would have helped here after a small amount of looking. I could be wrong; I don't know the codebase (or Kotlin) very well, so a link to some code or docs to help us understand Tempest's approach would be greatly appreciated.
Here's a real-life scenario. Take this index:
Some spot instance types are more vulnerable than others, so over time we slowly see primaries moving over to the long-lived instances as the short-lived ones are being reclaimed by AWS. With this setup we're "at risk" of experiencing both the skew in shard sizes from the custom routing (what Tempest tries to solve) as well as the skew in indexing load (what this issue is addressing). The shard sizes we can deal with on the application side, but the primaries bundling up is a bigger problem.

Given the number of replicas, it's hard to do any meaningful custom balancing. Taking a primary out of rotation by moving it would choose a new primary at random, so trying to balance this way could simply result in an endless cycle of shuffling. Never mind the performance impact and potentially the cross-AZ network costs too.

We could try the "bag of hardware" solution, but we'd need to add a new instance type to the spot pool for each additional instance to mitigate spot wipeout, and we'd consequently need to increase the replica count as well. Storage costs would increase, and it might not even solve the problem. We could also move to regular instances, but that's not great either for obvious reasons.

If, at the very least, we could choose which shard was primary, we could solve this problem with a custom solution (without the cost overhead). Having native support for balancing primaries would obviously solve this too.
Found out that ES doesn't promote a replica when moving the primary - it'll literally just move the primary. Duh. Anyway, knowing that, it was much simpler to balance the primaries, so for anyone finding this and having the same issue, I wrote this tool to solve it: https://github.com/hallh/elasticsearch-primary-balancer. I've run this on our production cluster a few times already, and balancing primaries did solve the issue we had with sudden spikes of write rejections under otherwise consistent load.
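To illustrate the mechanism described above (this is not the linked tool's actual code; the cluster URL, index name, shard number, and node names are placeholders), a shard copy can be relocated with the cluster reroute `move` command, and a moved primary stays primary on the target node:

```python
# Minimal sketch of relocating a primary shard copy via the reroute "move"
# command. The moved copy remains the primary after relocation, so moving
# primaries off an overloaded node spreads the primary role around.
import requests

ES = "http://localhost:9200"  # placeholder cluster address

def move_shard(index: str, shard: int, from_node: str, to_node: str) -> dict:
    """Ask Elasticsearch to relocate one shard copy between two nodes."""
    body = {
        "commands": [
            {"move": {"index": index, "shard": shard,
                      "from_node": from_node, "to_node": to_node}}
        ]
    }
    resp = requests.post(f"{ES}/_cluster/reroute", json=body)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # e.g. move primary shard 0 of "my-index" off a node holding too many primaries
    print(move_shard("my-index", 0, "node-1", "node-2"))
```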
Amazing work hallh. But it's funny, because I also realized that a primary and a replica could be swapped via the cluster relocation API as a way to balance primaries (inefficient, but effective). Even more, I also developed such a tool to evaluate the cluster state, generate a migration plan and execute it (I'm talking 4 days ago... so our work overlapped :-( ). So now we have two primary balancer tools... Now, as a long-term solution, we will try to push for an Elasticsearch capability to move the primary role from one shard to another. It may be slow, but being backed by paid support may help.
Posting some suggestions for elasticsearch-primary-balancer here. As an update, I will post my rebalance tool too, sooner rather than later. I've been busy supporting cluster routing allocation awareness, as some of my clusters are using it, and that further limits the ways you can try to balance shard roles.
I also encountered the same problem. When the primary shards are all on the same node, the load on that node can be very high. Shard allocation:
Task distribution is as follows:
Any good solutions?
+1, we also found this case in our production cluster.
+1, we would like to have this functionality. And if a node in the cluster gets rebooted, it loses its primary shards and makes the cluster even more unbalanced.
+1 .. same issue here. 2 data node cluster.. primaries evenly balanced. One node died.... all primaries end up on the live node (good). Bring the dead node back up, and we have all primaries on one, and all secondaries on the other. Won't rebalance the primaries. This means all writes go to one node.... plus aggregate reads all come from one node. There is no split usage across nodes. Scratch that... found the way.
Is the solution @bitonp gave above the official recommendation for rebalancing primary shards?
The biggest problem we see is that not all shards have the same size. Because the rebalancing logic of ES only looks at the number of shards on a node, this causes a lot of problems when nodes are running out of disk space. I'm doing some "manual" rebalancing, but it seems like something that ES could do a lot better. I find it strange that more people aren't running into this issue. PS: https://discuss.elastic.co/t/shard-allocation-based-on-shard-size/257817/12
We have been using it and it saves us money even though it feels like a hack. Details in https://underthehood.meltwater.com/blog/2023/05/11/promoting-replica-shards-to-primary-in-elasticsearch-and-how-it-saves-us-12k-during-rolling-restarts/
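For reference, one way such a promotion can be forced (not necessarily the exact method the linked post describes; index, shard, and node names below are placeholders) is the reroute `cancel` command with `allow_primary`, which fails the current primary so that an in-sync replica on another node is promoted. As noted earlier in the thread, the cancelled copy then has to recover as a replica, so redundancy is temporarily reduced:

```python
# Hedged sketch: cancel the allocation of a primary shard copy so an in-sync
# replica elsewhere is promoted to primary. The cancelled copy subsequently
# recovers as a replica, so the shard runs with reduced redundancy until that
# recovery completes. All names below are placeholders.
import requests

ES = "http://localhost:9200"  # placeholder cluster address

def promote_another_copy(index: str, shard: int, node_holding_primary: str) -> dict:
    body = {
        "commands": [
            {"cancel": {"index": index, "shard": shard,
                        "node": node_holding_primary, "allow_primary": True}}
        ]
    }
    resp = requests.post(f"{ES}/_cluster/reroute", json=body)
    resp.raise_for_status()
    return resp.json()
```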
I noticed that some people are putting all the primaries on a single node. This is sub-optimal (and 'relational' in concept). By splitting primaries and secondaries across the nodes, in the case of a node outage it is quick to promote secondaries to primaries to fill the gaps.
Since 8.6, Elasticsearch takes account of write load when making balancing decisions, and we have also been doing other things which separate indexing and search workloads more completely. Given these recent changes and the technical obstacles to genuinely balancing primary shards as suggested here (particularly the need for graceful demotion back to a replica), it's really rather unlikely we'll implement this feature in the foreseeable future. Therefore I'm closing this issue to indicate that it is not something we expect to pursue.
Primary shard balancing
There are use cases where we should have a mechanism to balance primary shards across all the nodes so that the number of primaries is uniformly distributed.
Problem description
There are situations where primary shards end up unevenly distributed across the nodes, for instance when doing a rolling restart: the last restarted node won't have any primary shards, as the other nodes assume the primary role while it is down.
This issue has popped up on other occasions, and the usual answer was that the primary/replica role is not an issue because the workload a primary or a replica assumes is similar. But there are important use cases where this does not apply.
For instance, in an indexing-heavy scenario where indexing must be implemented as a scripted upsert, the execution of the upsert logic falls onto the primary shards, and replicas just have to index the result.
In these cases, having unbalanced primaries exerts a bigger workload on the nodes hosting the primaries. This can even limit the cluster's capacity, as the bottleneck becomes the capacity of the nodes hosting the primaries rather than the sum of all the cluster's nodes.
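To make the asymmetry concrete, here is a hypothetical scripted upsert (the index name, field, and document id are invented for illustration): the Painless script runs on the node holding the primary shard, while replicas only index the document that results from it.

```python
# Hypothetical scripted-upsert request of the kind described above. The
# update script executes on the primary shard; replicas receive and index
# the resulting document without re-running the script.
import requests

ES = "http://localhost:9200"  # placeholder cluster address

def upsert_counter(doc_id: str, increment: int = 1) -> None:
    body = {
        "scripted_upsert": True,
        "script": {
            "lang": "painless",
            "source": (
                "if (ctx._source.count == null) { ctx._source.count = 0 } "
                "ctx._source.count += params.n"
            ),
            "params": {"n": increment},
        },
        "upsert": {"count": 0},
    }
    resp = requests.post(f"{ES}/my-index/_update/{doc_id}", json=body)
    resp.raise_for_status()
```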
Related thread in official forum
Workarounds
There are currently some workarounds for this situation, but they are not efficient:
But even when a reroute is possible, it means that shard data has to be moved from one node to another (I/O and network...).
Also, there is no automated way to detect which primaries are unbalanced and which shards can be swapped, and to execute that rerouting in small chunks so as not to overload node resources. But implementing a utility or script that does that is feasible (see possible solutions).
Possible solutions? (all of them imply new features to be implemented)
So, let's consider possible solutions (take into account that I don't know Elasticsearch internals):
If the reroute API had this functionality somehow, it would be possible to develop a script that detects primary shard imbalance and reroutes primary shard roles accordingly.
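As a rough illustration of what the detection half of such a script could look like (the cluster URL is a placeholder, and the actual role reassignment is left out because no role-transfer API exists today):

```python
# Sketch: count started primary shards per node via the _cat/shards API and
# flag nodes holding more than their share. What to do about an imbalance
# would depend on a (currently missing) way to move the primary role.
from collections import Counter
import requests

ES = "http://localhost:9200"  # placeholder cluster address

def primaries_per_node() -> Counter:
    """Return a per-node count of started primary shards."""
    rows = requests.get(
        f"{ES}/_cat/shards",
        params={"format": "json", "h": "index,shard,prirep,state,node"},
    ).json()
    return Counter(
        row["node"]
        for row in rows
        if row["prirep"] == "p" and row["state"] == "STARTED"
    )

if __name__ == "__main__":
    counts = primaries_per_node()
    average = sum(counts.values()) / max(len(counts), 1)
    for node, n in counts.most_common():
        marker = "  <-- above average" if n > average else ""
        print(f"{node}: {n} primaries{marker}")
```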