Document total_shards_per_node as a recipe for hotspots #61306

Closed
ppf2 opened this issue Aug 19, 2020 · 5 comments · Fixed by #61942
Labels: :Distributed/Allocation, >docs, >enhancement, Team:Distributed, Team:Docs

Comments

@ppf2
Member

ppf2 commented Aug 19, 2020

index.routing.allocation.total_shards_per_node has become a very common recommendation for dealing with hotspots in a cluster that has suboptimal shard allocation. For example, shards of the most actively written indices accumulating on one or a few nodes in the hot tier, a new warm node receiving more than its fair share of shards from high-shard-count indices, etc.

Many users have hit this (I just had another user hit it yesterday). Unfortunately, they tend to find out when their clusters are already performing poorly, with specific nodes getting hammered with requests. Until the two high-hanging-fruit issues are addressed in the future, it would be a nice addition to our documentation to describe the common use of total_shards_per_node to prevent hotspots in the cluster (certainly also indicating the tradeoff).
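For reference, total_shards_per_node is a dynamic index setting, so it can be applied to an existing index. A minimal sketch (the index name my-index and the localhost:9200 cluster address are hypothetical; shown here via the Python requests library against the index settings API):

```python
# Minimal sketch: cap how many shards of one index a single node may hold.
# Index name and cluster address are hypothetical.
import requests

resp = requests.put(
    "http://localhost:9200/my-index/_settings",
    json={"index.routing.allocation.total_shards_per_node": 2},
)
print(resp.json())
```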

2 potential places to provide this guidance:

If we decide to make this doc improvement, we may want to do this around the time we enhance ILM to allow updating of total_shards_per_node in ILM phases.

@ppf2 ppf2 added >enhancement >docs General docs changes :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) needs:triage Requires assignment of a team area label labels Aug 19, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-docs (>docs)

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Allocation)

@elasticmachine elasticmachine added Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team labels Aug 19, 2020
@Leaf-Lin
Contributor

Another proposal (a temporary workaround until a permanent fix becomes available) to avoid hotspots when adding a new node is to (a rough sketch of steps 1 and 5 follows the list):

  1. Use index shard allocation filtering in the templates, preventing new indices from going to the new node (via _exclude).
  2. Add the new node.
  3. Wait until the number of shards is balanced across the hosts.
  4. Remove the exclude restriction from the templates.
  5. Remove the index allocation exclude restriction from the already created indices. We have to be careful here not to remove any other allocation exclude rules the indices may have had prior to this operation. It is usually as simple as setting the value to null, but if there were other allocation exclude rules, that is not valid and we have to keep them.
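A rough sketch of steps 1 and 5 (the template name logs-template, index pattern logs-*, node name node-new, and cluster address are all hypothetical):

```python
# Rough sketch of steps 1 and 5 above; all names and addresses are hypothetical.
import requests

BASE = "http://localhost:9200"

# Step 1: exclude the new node in the index template so newly created indices avoid it.
requests.put(
    f"{BASE}/_index_template/logs-template",
    json={
        "index_patterns": ["logs-*"],
        "template": {
            "settings": {"index.routing.allocation.exclude._name": "node-new"}
        },
    },
)

# Step 5: once shards are balanced, clear the exclusion on the existing indices.
# Setting the value to null removes it entirely, so only do this if no other
# exclude rules need to be preserved on these indices.
requests.put(
    f"{BASE}/logs-*/_settings",
    json={"index.routing.allocation.exclude._name": None},
)
```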

@VimCommando
Contributor

VimCommando commented Sep 11, 2020

From a Slack thread on this topic, let me throw this out as a part of the recipe:

File this under “Things that should be simple, but are not”. I think it ends up being:
total_shards_per_node = (primary + (replicas * primary)) / (nodes - allowed_failures)
Where:

  • primary is number of primary shards
  • replicas is number of replica sets
  • nodes is number of participating data nodes (factor in hot/warm data tier)
  • allowed_failures is how many nodes can be down at a time

So for example, if you have 14 primary shards, 1 replica set, 12 nodes, and want to allow 2 node failures:

total_shards_per_node 
  = (primary + (replicas * primary)) / (nodes - allowed_failures)
  = (14 + (14 * 1)) / (12 - 2) 
  = 28 / 10
  = 2.8

which would round up to 3
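The same arithmetic as a quick Python sketch, using the example values above:

```python
# Sketch of the formula above, with the example values from this comment.
import math

def total_shards_per_node(primaries, replicas, nodes, allowed_failures):
    total_shards = primaries + replicas * primaries  # primaries plus all replica copies
    return math.ceil(total_shards / (nodes - allowed_failures))

print(total_shards_per_node(primaries=14, replicas=1, nodes=12, allowed_failures=2))  # -> 3
```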

@kunisen
Contributor

kunisen commented Sep 11, 2020

I tend to think of it a bit more simply, like "set this number to the replica count + 1" as a starting point.
It may also be good to consider combining it with allocation awareness (e.g. rack/zone) for advanced tuning if needed.

@jrodewig jrodewig removed the needs:triage Requires assignment of a team area label label Sep 14, 2020