shards allocation health indicator services #83513

idegtiarenko · 2022-02-04T12:14:24Z

Adding implementations that will check shards status and report health
status based on their availability

Closes: #83240
Related to: #83303

elasticsearchmachine · 2022-02-04T12:14:47Z

Hi @idegtiarenko, I've created a changelog YAML for you.

idegtiarenko · 2022-02-14T13:24:58Z

server/src/main/java/org/elasticsearch/cluster/metadata/NodesShutdownMetadata.java

@@ -70,17 +71,17 @@ public static NodesShutdownMetadata fromXContent(XContentParser parser) {

    public static Optional<NodesShutdownMetadata> getShutdowns(final ClusterState state) {
        assert state != null : "cluster state should never be null";
-        return Optional.ofNullable(state).map(ClusterState::metadata).map(m -> m.custom(TYPE));
+        return Optional.of(state).map(ClusterState::metadata).map(m -> m.custom(TYPE));


According to the assertion on the line above, state is never null here.
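
A minimal sketch of the difference this change relies on, using a plain `Map` as a hypothetical stand-in for the cluster metadata (the class and method names here are illustrative, not Elasticsearch code). `Optional.of` fails fast on a null input, while `map` still turns a null lookup result into an empty `Optional`:

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in: a Map plays the role of the metadata customs.
final class OptionalSketch {
    static Optional<String> lookup(Map<String, String> customs, String type) {
        // Optional.of throws NullPointerException if customs is null (Optional.ofNullable would not).
        // map() converts a null result of customs.get(type) into Optional.empty().
        return Optional.of(customs).map(c -> c.get(type));
    }
}
```

So a missing entry still produces an empty result, and only a null input (which the assertion already rules out) would throw.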

idegtiarenko · 2022-02-14T16:39:44Z

...est/java/org/elasticsearch/cluster/routing/allocation/ShardsHealthIndicatorServiceTests.java

+                    GREEN,
+                    String.format(
+                        Locale.ROOT,
+                        "This cluster has %d shards including %d primaries and %d replicas (%s temporary unallocated due to node restarting).",


Should we explicitly highlight shards that are temporarily unavailable due to a node restart (even though they do not affect the health status), or just silently ignore them?

elasticmachine · 2022-02-14T16:40:13Z

Pinging @elastic/es-distributed (Team:Distributed)

arteam

LGTM! I've left a couple of small comments

arteam · 2022-02-15T09:55:41Z

...src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsHealthIndicatorService.java

+                    }
+                }
+
+                if (shardRouting.replicaShards().isEmpty()) {


Since this condition is exclusive I guess it's better to express it with if-else

    if (shardRouting.replicaShards().isEmpty()) {
        stats.unreplicatedPrimaries.add(primaryShard.shardId());
    } else {
        for (ShardRouting replicaShard : shardRouting.replicaShards()) {
            if (replicaShard.active()) {
                stats.allocatedReplicas++;
            } else if (isRestarting(replicaShard)) {
                stats.restartingReplicas.add(replicaShard.shardId());
            } else {
                stats.unallocatedReplicas.add(replicaShard.shardId());
            }
        }
    }

arteam · 2022-02-15T10:14:46Z

...est/java/org/elasticsearch/cluster/routing/allocation/ShardsHealthIndicatorServiceTests.java

+            primary ? RecoverySource.EmptyStoreRecoverySource.INSTANCE : RecoverySource.PeerRecoverySource.INSTANCE,
+            new UnassignedInfo(UnassignedInfo.Reason.INDEX_CREATED, null)
+        );
+        if (state == UNASSIGNED) {


WDYT about leveraging switch expressions here?

    return switch (state) {
        case UNASSIGNED -> routing;
        case STARTED -> routing.initialize(UUID.randomUUID().toString(), null, 0).moveToStarted();
        case UNASSIGNED_RESTARTING -> routing.initialize(UUID.randomUUID().toString(), null, 0)
            .moveToStarted()
            .moveToUnassigned(
                new UnassignedInfo(
                    UnassignedInfo.Reason.NODE_RESTARTING,
                    null,
                    null,
                    -1,
                    0,
                    0,
                    false,
                    UnassignedInfo.AllocationStatus.DELAYED_ALLOCATION,
                    Set.of(),
                    UUID.randomUUID().toString()
                )
            );
    };

I am not sure that is easily achievable. I would say the branches here are not exclusive; rather, a subset of them is executed.

arteam · 2022-02-15T10:15:41Z

...est/java/org/elasticsearch/cluster/routing/allocation/ShardsHealthIndicatorServiceTests.java

+    }
+
+    private static Supplier<IndexRoutingTable> indexGenerator(String prefix, ShardState primaryState, ShardState... replicaStates) {
+        var index = new AtomicInteger(0);


Nit: 0 is redundant here

arteam · 2022-02-15T10:16:32Z

...est/java/org/elasticsearch/cluster/routing/allocation/ShardsHealthIndicatorServiceTests.java

+
+    private static ClusterState createClusterStateWith(List<IndexRoutingTable> indexes) {
+        var builder = RoutingTable.builder();
+        for (IndexRoutingTable index : indexes) {


Consider using a shorter indexes.forEach(builder::add) here.

arteam · 2022-02-15T10:18:57Z

...est/java/org/elasticsearch/cluster/routing/allocation/ShardsHealthIndicatorServiceTests.java

+    }
+
+    private static IndexRoutingTable index(String name, ShardState primaryState, ShardState... replicaStates) {
+        var index = new Index(name, UUID.randomUUID().toString());


Nit: If you want to use a UUID here, I believe it needs to be Base64 encoded, so it's better to use UUIDs.randomBase64UUID(), but I also think that any random string would be sufficient.

UUIDs.randomBase64UUID(random()) gives reproducible UUIDs, which is handy in randomized tests.
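
To illustrate why a seeded source makes IDs reproducible, here is a small self-contained sketch. It is not the Elasticsearch `UUIDs` implementation; `ReproducibleUuidSketch` and its method are hypothetical and only assume that the ID is derived from 16 random bytes encoded as URL-safe Base64:

```java
import java.util.Base64;
import java.util.Random;

// Hypothetical helper, not the Elasticsearch UUIDs class.
final class ReproducibleUuidSketch {
    // Derives a Base64 ID from the given random source; the same seed yields the same ID.
    static String randomBase64Uuid(Random random) {
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }
}
```

With a test framework that controls the seed, a failing run can then be replayed with identical index UUIDs.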

tlrx · 2022-02-15T10:26:20Z

...src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsHealthIndicatorService.java

+    }
+
+    @Override
+    public HealthIndicatorResult calculate() {


This indicator should have same or similar behavior to existing health endpoint.

According to #83240 we'd like this indicator to report same health in the same situation as the existing cluster health, but the logic in this method looks different from ClusterShardHealth & co ? For example, an initializing primary would be reported here with a RED status. Is that expected?

Good point!

Tim-Brooks

I started looking at this today, but honestly need to pull it down and run some of the tests and did not have time for that. It looks like a good step.

I guess my first question is whether we should be using the terms "assigned" and "unassigned". It is true that we have the cluster allocation explain API, but even that refers to "assigned" in the response and documentation.

There is a special case where a shard is currently located on a node, but needs to be reallocated to a different node. That shard I think is technically assigned, but needs reallocation. I don't think that this PR addresses that circumstance? It might qualify under a different indicator.

Cloud banner currently uses the term "unavailable". We certainly could also generalize to that. But I think that assigned/unassigned are well known existing ES terms.

Tim-Brooks · 2022-02-16T04:56:46Z

Cluster health api also uses unassigned_shards.

idegtiarenko · 2022-02-16T07:52:04Z

...src/main/java/org/elasticsearch/cluster/routing/allocation/ShardsHealthIndicatorService.java

+                    stats.unallocatedPrimaries.add(primaryShard.shardId());
+                }
+
+                if (shardRouting.replicaShards().isEmpty()) {


I just realized I should also check whether it is a shard on the cold/frozen tier backed by a snapshot. In that case it should not be reported as yellow.

idegtiarenko · 2022-03-02T16:02:34Z

@elasticmachine please run elasticsearch-ci/part-1

henningandersen

Left a few comments, otherwise looking good.

henningandersen · 2022-03-03T12:51:57Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+
+        var state = clusterService.state();
+        var shutdown = state.getMetadata().custom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY);
+        var status = new ShardAllocationStatus();


Can we construct this with the shutdown metadata instead, such that we only pass the node shutdowns in the constructor? That would look cleaner to me. Also, can we pass just a Function<String, SingleNodeShutdownMetadata>, like:

Suggested change

var status = new ShardAllocationStatus();

var status = new ShardAllocationStatus(shutdown.getAllNodeMetadataMap()::get);

The nodes shutdown metadata is not required for output and is only used for calculation. I believe it is better to keep such dependencies as arguments.

I do not agree: having the addPrimary/Replica methods accept the shutdowns input as well dictates a specific implementation, where the information is resolved immediately rather than on output. But given the localness of the code, I can live with how it is for now.

I would still like to see us just pass around Function<String, SingleNodeShutdownMetadata> rather than the full metadata.

I agree with Henning on this
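
A minimal sketch of the narrower dependency being asked for, using simplified stand-in types (`ShutdownType` and `Counts` are hypothetical, not the Elasticsearch classes). The counter only needs a lookup from node ID to shutdown information, so a `Map::get` method reference is enough and the full metadata object never has to be passed around:

```java
import java.util.function.Function;

final class ShardCountsSketch {
    // Hypothetical stand-in for the shutdown type carried by SingleNodeShutdownMetadata.
    enum ShutdownType { RESTART, REMOVE }

    // Hypothetical counter that depends on a lookup function, not on the whole metadata.
    static final class Counts {
        private final Function<String, ShutdownType> shutdowns;
        private int unassigned = 0;
        private int unassignedRestarting = 0;

        Counts(Function<String, ShutdownType> shutdowns) {
            this.shutdowns = shutdowns;
        }

        // Records an unassigned shard, given the node it was last allocated to (may be null).
        void addUnassigned(String lastAllocatedNodeId) {
            ShutdownType type = lastAllocatedNodeId == null ? null : shutdowns.apply(lastAllocatedNodeId);
            if (type == ShutdownType.RESTART) {
                unassignedRestarting++;
            } else {
                unassigned++;
            }
        }

        int unassigned() {
            return unassigned;
        }

        int unassignedRestarting() {
            return unassignedRestarting;
        }
    }
}
```

A caller would construct it as `new ShardCountsSketch.Counts(shutdownsByNodeId::get)`, which also makes it easy to substitute a lambda in unit tests.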

henningandersen · 2022-03-03T12:55:34Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+            replicas.increment(routing, metadata);
+        }
+
+        public HealthStatus getStatus() {


We should ideally also trigger on initializing shards. This may complicate integration with restarts, though if the operator/orchestration waits for a green or yellow _cluster/health before removing the restart indication, it might work out OK.

I think deferring to a follow-up is fine, but perhaps add a comment here to aid the next reader?

Adding to the handover document

henningandersen · 2022-03-03T13:03:31Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+
+        public HealthStatus getStatus() {
+            if (primaries.unassigned > 0) {
+                return RED;


I think we need to handle new indices specially, similar to ClusterShardHealth. I think we should report green for now for shards of such new indices in this PR, since some interaction with allocation will be needed to see whether a shard is unassigned due to throttling or because it cannot be assigned at all currently.

A separate count of the "new" unassigned/initializing shards seems necessary.
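
One way the separate count could look is sketched below. This is an illustration only, with hypothetical names, and it is not the code that was merged: unassigned shards of newly created indices go into their own counter, so they do not push the status to RED or YELLOW.

```java
final class AvailabilityStatusSketch {
    enum HealthStatus { GREEN, YELLOW, RED }

    private int unassignedPrimaries = 0;
    private int unassignedReplicas = 0;
    private int unassignedNew = 0; // shards of newly created indices, tracked separately

    // Records one unassigned shard; belongsToNewIndex is an assumed input the caller derives.
    void addUnassigned(boolean primary, boolean belongsToNewIndex) {
        if (belongsToNewIndex) {
            unassignedNew++;
        } else if (primary) {
            unassignedPrimaries++;
        } else {
            unassignedReplicas++;
        }
    }

    int unassignedNew() {
        return unassignedNew;
    }

    // RED for unavailable primaries, YELLOW for unavailable replicas, GREEN otherwise.
    HealthStatus getStatus() {
        if (unassignedPrimaries > 0) {
            return HealthStatus.RED;
        }
        if (unassignedReplicas > 0) {
            return HealthStatus.YELLOW;
        }
        return HealthStatus.GREEN;
    }
}
```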

henningandersen · 2022-03-03T13:13:39Z

.../elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorServiceTests.java

+        );
+    }
+
+    public void testShouldBeRedWhenRestartingPrimariesReachedAllocationDelay() {


I think we should add this for replicas too (checking for yellow status when restart has expired).

henningandersen · 2022-03-03T13:17:51Z

.../elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorServiceTests.java

+            shardId,
+            primary,
+            primary ? RecoverySource.EmptyStoreRecoverySource.INSTANCE : RecoverySource.PeerRecoverySource.INSTANCE,
+            new UnassignedInfo(UnassignedInfo.Reason.INDEX_CREATED, null)


I suppose we do not trigger on this, but neither do we on recovery. I wonder if we should randomize between more causes here, like NODE_LEFT, PRIMARY_FAILED and more?

Maybe. Given how shards are constructed, this could be done by adding more ShardState values and extending this method with additional simulation.

tlrx · 2022-03-03T13:46:50Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+        if (info == null || info.getReason() != UnassignedInfo.Reason.NODE_RESTARTING) {
+            return false;
+        }
+        var shutdown = metadata.getAllNodeMetadataMap().get(info.getLastAllocatedNodeId());


It would be better to retrieve the Map<String, SingleNodeShutdownMetadata> once (cf. Henning's comment).

tlrx · 2022-03-03T13:48:14Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+        if (shutdown == null || shutdown.getType() != SingleNodeShutdownMetadata.Type.RESTART) {
+            return false;
+        }
+        var now = System.currentTimeMillis();


You may want to pass a LongSupplier to unit test this more easily

Currently this is used in a static method that is called from a static inner class. Passing a reference to the LongSupplier might make this code more complex. The unit test currently sets the entity's time relative to the current time and does not rely on injecting time into the class.

Sure, no need to hold on this. I did not realize it was used in a static context. I'm just advocating avoiding System.currentTimeMillis() in tests and using monotonic clocks instead, or maybe just a time supplier, which is even simpler.
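
For reference, the time-supplier approach suggested here could look roughly like the sketch below. The class name, method name, and the exact comparison are assumptions for illustration, not the PR's code:

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of injecting a clock instead of calling System.currentTimeMillis().
final class RestartDelaySketch {
    private final LongSupplier nowMillis;

    RestartDelaySketch(LongSupplier nowMillis) {
        this.nowMillis = nowMillis;
    }

    // True while the allocation delay granted to the restarting node has not yet expired.
    boolean isWithinAllocationDelay(long unassignedAtMillis, long allocationDelayMillis) {
        long expiration = unassignedAtMillis + allocationDelayMillis;
        return nowMillis.getAsLong() <= expiration;
    }
}
```

Production code would pass `System::currentTimeMillis`, while a test can pass a constant such as `() -> 1_000L` and avoid any dependence on the wall clock.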

tlrx · 2022-03-03T13:54:38Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+                || replicas.unassigned_restarting > 0) {
+                builder.append(
+                    Stream.of(
+                        createMessage(primaries.unassigned, "unavailable primary", " unavailable primaries"),


nit: did you consider implementing a getSummary() method in ShardAllocationCounts directly, and calling the methods here?

Yes, but I did not come up with a good way to merge distinct messages. It would require appending primary/primaries/replica/replicas suffixes as well as placing the commas properly.
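
The stream-based joining visible in the diff can be sketched in isolation as follows. This is a simplified, hypothetical version (only two categories, made-up class name) showing how empty categories drop out, so commas and plural forms stay correct without extra bookkeeping:

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SummarySketch {
    // Returns an empty stream for zero, so absent categories contribute no text and no separator.
    static Stream<String> createMessage(int count, String singular, String plural) {
        if (count == 0) {
            return Stream.empty();
        }
        return Stream.of(count + " " + (count == 1 ? singular : plural));
    }

    static String summary(int unavailablePrimaries, int unavailableReplicas) {
        String parts = Stream.of(
            createMessage(unavailablePrimaries, "unavailable primary", "unavailable primaries"),
            createMessage(unavailableReplicas, "unavailable replica", "unavailable replicas")
        ).flatMap(s -> s).collect(Collectors.joining(", "));
        return parts.isEmpty() ? "This cluster has all shards available." : "This cluster has " + parts + ".";
    }
}
```

For example, `SummarySketch.summary(1, 2)` yields "This cluster has 1 unavailable primary, 2 unavailable replicas."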

idegtiarenko · 2022-03-03T16:02:40Z

@elasticmachine please run elasticsearch-ci/part-1

Tim-Brooks

A few nits.

Tim-Brooks · 2022-03-04T01:19:20Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+ * <p>
+ * Indicator will report:
+ * * RED when one or more primary shards are not available
+ * * YELLOW when one or more replica shards are not replicated


"replica shards are not replicated" should this be "replica shards are not available"

Tim-Brooks · 2022-03-04T01:24:11Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+                    ).flatMap(Function.identity()).collect(joining(" , "))
+                ).append(".");
+            } else {
+                builder.append("no unavailable shards.");


I think the preference from yesterday's meeting is no double negatives. "This cluster has all shards available."

tlrx · 2022-03-04T12:37:03Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+            return false;
+        }
+        var now = System.currentTimeMillis();
+        var restartingAllocationDelayExpiration = info.getUnassignedTimeInMillis() + shutdown.getAllocationDelay().getMillis();


Can we use info.getUnassignedTimeInNanos() as this is the one used to calculate the delay for delayed shard allocation?
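
A short sketch of the monotonic comparison being suggested, with hypothetical names and plain `long` inputs rather than the Elasticsearch types. Comparing an elapsed difference of `System.nanoTime()`-style readings, instead of absolute timestamps, keeps the check correct even if the nanosecond counter wraps around:

```java
import java.util.concurrent.TimeUnit;

final class MonotonicDelaySketch {
    // Returns true once more than delayMillis has elapsed between the two monotonic readings.
    static boolean delayExpired(long unassignedAtNanos, long nowNanos, long delayMillis) {
        // Subtracting first is overflow-safe for readings taken from the same monotonic clock.
        long elapsedNanos = nowNanos - unassignedAtNanos;
        return elapsedNanos > TimeUnit.MILLISECONDS.toNanos(delayMillis);
    }
}
```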

henningandersen · 2022-03-04T13:35:43Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+        private int unassigned_restarting = 0;
+        private int initializing = 0;
+        private int started = 0;
+        private int reallocating = 0;


Can we rename this to relocating?

henningandersen · 2022-03-04T13:53:17Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+        private int started = 0;
+        private int reallocating = 0;
+
+        public void increment(ShardRouting routing, NodesShutdownMetadata metadata) {


Let us rename metadata to shutdowns.

henningandersen · 2022-03-04T13:53:35Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+        }
+    }
+
+    private static boolean isUnassignedDueToTimelyRestart(ShardRouting routing, NodesShutdownMetadata metadata) {


Let us rename metadata to shutdowns.

henningandersen · 2022-03-04T13:58:48Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+
+        var state = clusterService.state();
+        var shutdown = state.getMetadata().custom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY);
+        var status = new ShardAllocationStatus();


I do not agree: having the addPrimary/Replica methods accept the shutdowns input as well dictates a specific implementation, where the information is resolved immediately rather than on output. But given the localness of the code, I can live with how it is for now.

I would still like to see us just pass around Function<String, SingleNodeShutdownMetadata> rather than the full metadata.

henningandersen

LGTM.

idegtiarenko · 2022-03-04T14:28:44Z

@elasticmachine please run elasticsearch-ci/part-2

tlrx

LGTM

tlrx · 2022-03-04T15:14:23Z

...a/org/elasticsearch/cluster/routing/allocation/ShardsAvailabilityHealthIndicatorService.java

+
+        var state = clusterService.state();
+        var shutdown = state.getMetadata().custom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY);
+        var status = new ShardAllocationStatus();


I would still like to see us just pass around Function<String, SingleNodeShutdownMetadata> rather than the full metadata.

I agree with Henning on this

Add a health indicator implementations that checks shards status and report their health status based on availability

shards allocation health indicator services

1b73587

Adding implementations that will check shards status and report health status based on their availability

idegtiarenko added >enhancement Team:Distributed Meta label for distributed team :Data Management/Health v8.2.0 labels Feb 4, 2022

idegtiarenko and others added 6 commits February 4, 2022 13:14

Update docs/changelog/83513.yaml

65af584

Merge branch 'master' into 83240_shards_health_indicator

ef2bfaf

Merge branch 'master' into 83240_shards_health_indicator

bde7eb7

update counts

171b819

refactor assertion

5f98333

add summary string

31d698a

idegtiarenko commented Feb 14, 2022

idegtiarenko added 3 commits February 14, 2022 14:27

cleanup

6fddaac

fix forbidden api usage

75a9730

ignore temporary unallocated shards due to node restarting

e7512cb

idegtiarenko commented Feb 14, 2022

idegtiarenko requested review from arteam, Tim-Brooks, henningandersen and tlrx February 14, 2022 16:40

idegtiarenko marked this pull request as ready for review February 14, 2022 16:40

arteam reviewed Feb 15, 2022

tlrx reviewed Feb 15, 2022

idegtiarenko added 2 commits February 15, 2022 17:15

take into account initializing status

4f11792

Merge branch 'master' into 83240_shards_health_indicator

7e7aba5

# Conflicts: # server/src/main/java/org/elasticsearch/node/Node.java

Tim-Brooks reviewed Feb 16, 2022

idegtiarenko commented Feb 16, 2022

idegtiarenko commented Mar 1, 2022

idegtiarenko added 2 commits March 2, 2022 16:42

become red when AllocationDelay is reached

2a52d81

Merge branch 'master' into 83240_shards_health_indicator

a9802b5

idegtiarenko commented Mar 3, 2022

idegtiarenko added 2 commits March 3, 2022 10:36

use getUnassignedTimeInMillis

cf2f4de

Merge branch 'master' into 83240_shards_health_indicator

07013a7

henningandersen reviewed Mar 3, 2022

tlrx reviewed Mar 3, 2022

idegtiarenko added 2 commits March 3, 2022 15:02

upd

1f33992

upd

a8dc1a8

include reallocating shards

8ad8ca0

Tim-Brooks reviewed Mar 4, 2022

idegtiarenko added 3 commits March 4, 2022 11:48

update messages

be0666a

fmt

49bcc79

Merge branch 'master' into 83240_shards_health_indicator

aa0e9c6

tlrx reviewed Mar 4, 2022

use monotonic time

d2cc9d3

idegtiarenko requested review from tlrx and henningandersen March 4, 2022 13:25

henningandersen reviewed Mar 4, 2022

henningandersen approved these changes Mar 4, 2022

some renaming

86264ea

tlrx approved these changes Mar 4, 2022

Merge branch 'master' into 83240_shards_health_indicator

39703b4

idegtiarenko merged commit 8d637f5 into elastic:master Mar 7, 2022

idegtiarenko deleted the 83240_shards_health_indicator branch March 7, 2022 08:31

arteam pushed a commit to arteam/elasticsearch that referenced this pull request Mar 9, 2022

shards allocation health indicator services (elastic#83513)

840e268

Add a health indicator implementations that checks shards status and report their health status based on availability
