Default allocation of new shards creates hotspots #45851

Closed
ragri8 opened this issue Aug 22, 2019 · 5 comments
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes)

Comments

@ragri8

ragri8 commented Aug 22, 2019

We have a big hotspot issue on our cluster.

5 data nodes, 670 indices, 1,720 shards and 40 billion documents of log data, and growing fast. We're running Elasticsearch 7.3.0.

We have multiple daily indices, created each day, that together index 4,000 documents per second or more.
One of them is much heavier than all the others, indexing 3,000 docs/sec on its own. Because of this, we split it into 5 primary shards (with 1 replica each), each holding about 10-20 GB.

But right now the cluster struggles to keep up, because instead of placing 2 shards per node, all 5 primary shards are sent to the same node and the replicas are then split between the 4 remaining nodes, sometimes even skipping one.

The number of shards per node is balanced for 4 of the nodes (400 ± 1), but the last one has fewer (120) because it has much less disk space than the others and keeps hitting the low watermark (85%).

The problem can't just be the cluster trying to rebalance shards evenly. Yesterday morning, because the disk watermark had been hit on the 5th node, not a single new shard was assigned to it, yet all 5 primary shards of the heavy index started on the 2nd node.

As a default configuration, Elasticsearch should always try to spread new shards of the same index evenly across all nodes. More than anything else, this would guarantee a minimum sharing of network traffic, CPU and I/O, because the heaviest index would always be split up, and such a setting wouldn't prevent shards from different indices from spreading evenly too.

I could still use index.routing.allocation.total_shards_per_node and set it to something like 3, but doing so would also force older shards to move, potentially causing a reallocation storm that I want to prevent at all costs.
The balancing heuristic has already been modified, with cluster.routing.allocation.balance.shard at 0.3 and cluster.routing.allocation.balance.index at 0.7. I fear that setting the index balance factor higher would result in the same reallocation storm.
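For reference, a minimal sketch of how these settings are applied via the Python client (the client setup and the index name below are only placeholders, not our real configuration):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # assumes a local node; adjust hosts as needed

    # Cluster-wide rebalancing weights (the values we currently use).
    es.cluster.put_settings(body={
        "persistent": {
            "cluster.routing.allocation.balance.shard": 0.3,
            "cluster.routing.allocation.balance.index": 0.7,
        }
    })

    # The per-index hard limit I'm hesitant to apply, shown on an example daily index.
    es.indices.put_settings(
        index="heavy-index-2019.08.22",
        body={"index.routing.allocation.total_shards_per_node": 3},
    )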

@jakelandis jakelandis added the :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) label Aug 22, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jakelandis
Contributor

I think this is a duplicate of #17213

@ragri8 - you may want to see if there are any suggestions on that thread that may help

@ragri8
Author

ragri8 commented Aug 23, 2019

As I said, this is a different issue: when shard allocation was disabled on the unbalanced node because of the low watermark, primary shards of the same index still landed on the same node, even though all the available nodes were balanced.

That discussion focuses on an upper bound on resource use, while I'm suggesting a better heuristic for choosing where to allocate each new shard. The solution proposed there would at most guarantee that new shards are spread across all the nodes, but it wouldn't prevent a large index with multiple shards from landing on a single node.

Let me give a more detailed situation:
I have 10 different indices, rotating daily.
Index-1, the biggest one, has 10 shards (5 primaries and 5 replicas), indexes 3,200 docs/sec, and uses around 150 GB of space on weekdays.
Index-2 has 2 shards (1 primary and 1 replica), indexes 900 docs/sec and uses around 60 GB. For optimal results, we will instead use 2 primaries and 2 replicas.
Index-3 to Index-10 each have 1 primary and 1 replica, and together they index less than 200 docs/sec and use less than 10 GB per day.

So there are 30 shards to spread between 5 nodes, but 14 of them hold 95% of the traffic, and 75% of it comes from a single index.

With any hard limit on resources, at best this could cap the number of active shards per node at 6. But Index-1 might still place its 5 primaries on one node, giving it almost 40% of the traffic, and maybe more if it also gets a shard from Index-2. Also, if new shards are added while the older ones are still indexing (which does happen for me), it becomes even harder to set the resource limit efficiently without pushing the cluster to yellow or red status.

If instead we add a simple rule that always splits shards from the same index evenly across nodes, each node will get 2 shards from Index-1 and at most 1 shard from Index-2. Before adding the 16 remaining small shards, we have 4 nodes with 20% of the traffic each and one with 15%, and we didn't need a single heuristic to get this almost perfect sharing of resources.
And if the low watermark is hit on the 5th node, the remaining nodes will still get 29% or 21% of the traffic each, which is still better than the hard-limit rule and won't prevent allocation of new shards.

Of course, not everyone has the same setup as me. Say someone with as many nodes and new shards per day as me has indices with only 1 primary and 1 replica each: my rule won't help them, but it won't hurt them either. Other allocation rules could still be used alongside this one without conflict.

I think this is the easiest approach to solve my issue and help many others, and I still think it should be the default configuration.

@DaveCTurner
Contributor

This very much looks like a duplicate of #17213. You seem to be saying that your nodes cannot handle more than N shards of a particular index. If so, you can avoid allocating more than N shards of that index to any one node by setting index.routing.allocation.total_shards_per_node: N on that index. When the index no longer needs this constraint you can remove it (see #44070).

This setting applies per index, so you can set it only on the index you want to spread out more. It will not directly cause shards of older indices to move, although you may see a small amount of extra shard movement to fix any resulting imbalance.
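For illustration, a minimal sketch of applying and later clearing the setting with the Python client (the index name and the value of N below are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Allow at most N shards of this index on any one node (N=2 here as an example).
    es.indices.put_settings(
        index="heavy-index-2019.08.26",
        body={"index.routing.allocation.total_shards_per_node": 2},
    )

    # Once the index no longer needs the constraint, clear it by setting it back to null.
    es.indices.put_settings(
        index="heavy-index-2019.08.26",
        body={"index.routing.allocation.total_shards_per_node": None},
    )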

I think you might be confused about the difference between a primary and a replica in terms of their resource requirements. They both require the same resources. It does not matter whether a node has 5 primaries or 5 replicas of an index, it will see the same load. (For completeness, I should add that this is sometimes not the case, but it is true when indexing log data).

You also talk about "a reallocation storm that I want to prevent at all cost". If shard movements are damaging to your cluster stability then you may have misconfigured your cluster (e.g. set indices.recovery.max_bytes_per_sec or cluster.routing.allocation.node_concurrent_recoveries to too high a value).
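As a sketch (not a recommendation of specific values), you can inspect the current values and clear any overrides so the built-in defaults apply again:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Show effective settings, including defaults, with flattened keys.
    print(es.cluster.get_settings(include_defaults=True, flat_settings=True))

    # Clear persistent overrides of the recovery throttles (null restores the default);
    # repeat under "transient" if the overrides were set there.
    es.cluster.put_settings(body={
        "persistent": {
            "indices.recovery.max_bytes_per_sec": None,
            "cluster.routing.allocation.node_concurrent_recoveries": None,
        }
    })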

I think it would be best to continue this discussion on the forums, so I'm closing this issue. If the discussion in the forums identifies something that isn't already covered by #17213 then we can always reopen this issue.

@ragri8
Author

ragri8 commented Aug 26, 2019

This very much looks like a duplicate of #17213.

My proposed solution is quite different.
The solution in #17213 doesn't take into account separating shards of the same index, and could only mitigate my issue by using multi-dimensional resource costs, which would be harder for a less experienced user like me to configure, and a misconfiguration could even lead to data loss.
My proposed solution would work out of the box for every type of user. Hard limits always have the potential to break something if users aren't aware of every change in the cluster, and that does happen in large organizations.

If so, you can avoid allocating more than N shards of that index to any one node by setting index.routing.allocation.total_shards_per_node: N on that index.

I could, but that would cause an undesirable trade-off that I think could easily be avoided.

I think you might be confused about the difference between a primary and a replica in terms of their resource requirements.

I do understand the resource requirements of both primaries and replicas. While replicas are (most of the time?) spread evenly across the cluster, primaries aren't, and 50% of the total traffic being indexed on only 20% of the resources creates a hotspot.

It will not directly cause shards of older indices to move

If shard movements are damaging to your cluster stability then you may have misconfigured your cluster

In this case I admit I didn't properly understand how ES handles reallocations. The only reallocation problem we have is caused by the smaller node constantly moving old shards away when it hits the low watermark, but we solved that by blocking the bigger indices from writing to it.

I think a soft rule that spreads shards from the same index evenly across nodes should be the default configuration, as I don't see how it could be unwanted behaviour, except in exceptional cases which already need custom allocation rules.

I think this issue is closer to #43350, but with a simpler proposal. It's also simpler than #17213.

A new allocation algorithm could use the current one to allocate the first shard of an index to a node, but then, after applying the configured hard limits, choose the node with the smallest number of shards from that index, and fall back to the current algorithm to break ties when multiple nodes have the same lowest count.
As an option or as the default configuration, instead of a "smallest first" approach, nodes already holding shards of the same index could carry some "weight" that influences the choice and leaves room for other factors (like resource costs, which could later lead to a merge with #17213).
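To make this concrete, a rough sketch of the decision step in plain Python (hypothetical pseudocode for illustration only, not actual Elasticsearch internals; the data structures and the current_weight tie-breaker are invented names):

    def pick_node(index_name, eligible_nodes, shards_on_node, current_weight):
        """Pick a node for a new shard of index_name.

        eligible_nodes: nodes that already passed the configured hard limits
                        (disk watermarks, total_shards_per_node, filters, ...).
        shards_on_node: dict mapping node -> list of (index, shard_id) already allocated.
        current_weight: the existing balancer's weight function, used only as a
                        tie-breaker (lower weight is better, as today).
        """
        # Count how many shards of this index each eligible node already holds.
        per_index_count = {
            node: sum(1 for idx, _ in shards_on_node.get(node, []) if idx == index_name)
            for node in eligible_nodes
        }

        # "Smallest first": keep only the nodes holding the fewest shards of this index...
        fewest = min(per_index_count.values())
        candidates = [node for node, count in per_index_count.items() if count == fewest]

        # ...then let the current algorithm decide among the remaining candidates.
        return min(candidates, key=current_weight)

The weighted variant described above would, instead of filtering, simply add something like per_index_count[node] multiplied by a configurable factor to the existing weight.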
