[8.19] Limit number of allocation explanations in `shards_availability` health indicator (#136060) #136471

nielsbauman · 2025-10-13T09:37:05Z

Backport

This will backport the following commits from main to 8.19:

Limit number of allocation explanations in shards_availability health indicator (#136060)

Questions?

Please refer to the Backport tool documentation

…th indicator (elastic#136060)

We currently compute the shard allocation explanation for every unassigned shard (primaries and replicas) in the health report API when `verbose` is `true`, which includes the periodic health logs. Computing the allocation explanation for a shard is quite expensive in large clusters. Therefore, when there are many unassigned shards, `ShardsAvailabilityHealthIndicatorService` can take a long time to complete - we've seen cases of 2 minutes with 40k unassigned shards.

To avoid the runtime of `ShardsAvailabilityHealthIndicatorService` scaling linearly with the number of unassigned shards (times the size of the cluster), we limit the number of allocation explanations we compute to `maxAffectedResourcesCount`, which comes from the `size` parameter of the `_health_report` API and currently defaults to `1000` - a follow-up PR will address the high default size. This significantly reduces the runtime of this health indicator and prevents the periodic health logs from overlapping.

A downside of this change is that the returned list of diagnoses may be incomplete. For example, if the `size` parameter is set to `10`, and the first 10 shards are unassigned due to reason `X` and the remaining shards are unassigned due to reason `Y`, only reason `X` will be returned in the health API. We accept this downside because we expect that there are generally not many different relevant diagnoses - if more than `size` shards are unassigned, they're likely all unassigned for the same reason. Users can always increase `size` and/or manually call the allocation explain API to get more detailed information.

(cherry picked from commit ede1d06)

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorService.java
#	server/src/test/java/org/elasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorServiceTests.java
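The trade-off described in the commit message can be illustrated with a small standalone sketch. This is not the code from `ShardsAvailabilityHealthIndicatorService`; the class `LimitedExplanations`, the method `diagnose`, and the string-based shard names and reasons are all hypothetical stand-ins, assuming only that explanations are computed for at most `size` unassigned shards and that the remaining shards are skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only; not the Elasticsearch implementation.
public class LimitedExplanations {

    /**
     * Collects the distinct reasons reported for unassigned shards, but calls the
     * (assumed expensive) explain function for at most maxExplanations shards.
     */
    static Set<String> diagnose(List<String> unassignedShards,
                                int maxExplanations,
                                Function<String, String> explain) {
        Set<String> reasons = new LinkedHashSet<>();
        int computed = 0;
        for (String shard : unassignedShards) {
            if (computed >= maxExplanations) {
                break; // remaining shards are not explained, so their reasons may be missed
            }
            reasons.add(explain.apply(shard));
            computed++;
        }
        return reasons;
    }

    public static void main(String[] args) {
        // 40 made-up unassigned shards: the first 10 fail for reason X, the rest for reason Y.
        List<String> shards = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            shards.add("shard-" + i);
        }
        Function<String, String> explain = shard -> {
            int n = Integer.parseInt(shard.substring("shard-".length()));
            return n < 10 ? "X" : "Y";
        };

        System.out.println(diagnose(shards, 10, explain));   // prints [X]
        System.out.println(diagnose(shards, 1000, explain)); // prints [X, Y]
    }
}
```

With a limit of 10, the explain function runs 10 times instead of 40 and only reason `X` is reported; raising the limit recovers reason `Y` at the cost of more explanations.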

nielsbauman added the auto-merge-without-approval, backport, :Data Management/Health, >bug, and Team:Data Management labels Oct 13, 2025

elasticsearchmachine added the v8.19.6 label Oct 13, 2025

nielsbauman mentioned this pull request Oct 13, 2025

Limit number of allocation explanations in shards_availability health indicator #136060

Merged

elasticsearchmachine merged commit 6f30995 into elastic:8.19 Oct 13, 2025
24 checks passed

nielsbauman deleted the backport/8.19/pr-136060 branch October 13, 2025 10:55
