Skip to content

Conversation

@nielsbauman
Copy link
Contributor

Backport

This will backport the following commits from main to 8.19:

Questions ?

Please refer to the Backport tool documentation

…th indicator (elastic#136060)

We currently compute the shard allocation explanation for every
unassigned shard (primaries and replicas) in the health report API when
`verbose` is `true`, which includes the periodic health logs. Computing
the shard allocation explanation of a shard is quite expensive in large
clusters. Therefore, when there are lots of unassigned shards,
`ShardsAvailabilityHealthIndicatorService` can take a long time to
complete - we've seen cases of 2 minutes with 40k unassigned shards.

To avoid the runtime of `ShardsAvailabilityHealthIndicatorService`
scaling linearly with the number of unassigned shards (times the size of
the cluster), we limit the number of allocation explanations we compute
to `maxAffectedResourcesCount`, which comes from the `size` parameter of
the `_health_report` API and currently defaults to `1000` - a follow-up
PR will address the high default size. This significantly reduces the
runtime of this health indicator and avoids the periodic health logs
from overlapping.

A downside of this change is that the returned list of diagnoses may be
incomplete. For example, if the `size` parameter is set to `10`, and the
first 10 shards are unassigned due to reason `X` and the remaining
unassigned shards due to reason `Y`, only reason `X` will be returned in
the health API. We accept this downside as we expect that there are
generally not many different diagnoses relevant - if more than `size`
shards are unassigned, they're likely all unassigned due to the same
reason. Users can always increase `size` and/or manually call the
allocation explain API to get more detailed information.

(cherry picked from commit ede1d06)

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorService.java
#	server/src/test/java/org/elasticsearch/cluster/routing/allocation/shards/ShardsAvailabilityHealthIndicatorServiceTests.java
@nielsbauman nielsbauman added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport :Data Management/Health >bug Team:Data Management Meta label for data/management team labels Oct 13, 2025
@elasticsearchmachine elasticsearchmachine merged commit 6f30995 into elastic:8.19 Oct 13, 2025
24 checks passed
@nielsbauman nielsbauman deleted the backport/8.19/pr-136060 branch October 13, 2025 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) backport >bug :Data Management/Health Team:Data Management Meta label for data/management team v8.19.6

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants