Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shards allocation health indicator services #83513

Merged

Conversation

idegtiarenko
Copy link
Contributor

@idegtiarenko idegtiarenko commented Feb 4, 2022

Adding implementations that will check shards status and report health
status based on their availability

Closes: #83240
Related to: #83303

Adding implementations that will check shards status and report health
status based on their availability
@elasticsearchmachine
Copy link
Collaborator

Hi @idegtiarenko, I've created a changelog YAML for you.

@@ -70,17 +71,17 @@ public static NodesShutdownMetadata fromXContent(XContentParser parser) {

public static Optional<NodesShutdownMetadata> getShutdowns(final ClusterState state) {
assert state != null : "cluster state should never be null";
return Optional.ofNullable(state).map(ClusterState::metadata).map(m -> m.custom(TYPE));
return Optional.of(state).map(ClusterState::metadata).map(m -> m.custom(TYPE));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the line above it is always not null

GREEN,
String.format(
Locale.ROOT,
"This cluster has %d shards including %d primaries and %d replicas (%s temporary unallocated due to node restarting).",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explicitly highlight shards that are temporary unavailable due to node restart (even though they are not affecting health status) or just silently ignore them?

@idegtiarenko idegtiarenko marked this pull request as ready for review February 14, 2022 16:40
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Contributor

@arteam arteam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I've left a couple of small comments

}
}

if (shardRouting.replicaShards().isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this condition is exclusive I guess it's better to express it with if-else

if (shardRouting.replicaShards().isEmpty()) {
    stats.unreplicatedPrimaries.add(primaryShard.shardId());
} else {
    for (ShardRouting replicaShard : shardRouting.replicaShards()) {
        if (replicaShard.active()) {
            stats.allocatedReplicas++;
        } else if (isRestarting(replicaShard)) {
            stats.restartingReplicas.add(replicaShard.shardId());
        } else {
            stats.unallocatedReplicas.add(replicaShard.shardId());
        }
    }
} 

primary ? RecoverySource.EmptyStoreRecoverySource.INSTANCE : RecoverySource.PeerRecoverySource.INSTANCE,
new UnassignedInfo(UnassignedInfo.Reason.INDEX_CREATED, null)
);
if (state == UNASSIGNED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about leveraging switch expressions here?

return switch (state) {
    case UNASSIGNED -> routing;
    case STARTED -> routing.initialize(UUID.randomUUID().toString(), null, 0).moveToStarted();
    case UNASSIGNED_RESTARTING -> routing.initialize(UUID.randomUUID().toString(), null, 0)
        .moveToStarted()
        .moveToUnassigned(
            new UnassignedInfo(
                UnassignedInfo.Reason.NODE_RESTARTING,
                null,
                null,
                -1,
                0,
                0,
                false,
                UnassignedInfo.AllocationStatus.DELAYED_ALLOCATION,
                Set.of(),
                UUID.randomUUID().toString()
            )
        );
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if that is easily achievable. I would say this is not having exclusive branches but rather executing subset of branches

}

private static Supplier<IndexRoutingTable> indexGenerator(String prefix, ShardState primaryState, ShardState... replicaStates) {
var index = new AtomicInteger(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: 0 is redundant here


private static ClusterState createClusterStateWith(List<IndexRoutingTable> indexes) {
var builder = RoutingTable.builder();
for (IndexRoutingTable index : indexes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using a shorter indexes.forEach(builder::add) here.

}

private static IndexRoutingTable index(String name, ShardState primaryState, ShardState... replicaStates) {
var index = new Index(name, UUID.randomUUID().toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If you want to use an UUID here, I believe it needs to Base64 encoded, so it's better to use UUIDs.randomBase64UUID(), but I also think that any random string would be sufficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUIDs.randomBase64UUID(random()) allows to have reproducible UUIDs which is handy in randomized tests.

}

@Override
public HealthIndicatorResult calculate() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This indicator should have same or similar behavior to existing health endpoint.

According to #83240 we'd like this indicator to report same health in the same situation as the existing cluster health, but the logic in this method looks different from ClusterShardHealth & co ? For example, an initializing primary would be reported here with a RED status. Is that expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point!

Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started looking at this today, but honestly need to pull it down and run some of the tests and did not have time for that. It looks like a good step.

I guess my first question is if we should be using the term "assigned" and "unassigned". It is true that we have the cluster allocation explain api. But that even refers to "assigned" in the response and documentation.

There is a special case where a shard is currently located on a node, but needs to be reallocated to a different node. That shard I think is technically assigned, but needs reallocation. I don't think that this PR addresses that circumstance? It might qualify under a different indicator.

Cloud banner currently uses the term "unavailable". We certainly could also generalize to that. But I think that assigned/unassigned are well known existing ES terms.

@Tim-Brooks
Copy link
Contributor

Cluster health api also uses unassigned_shards.

stats.unallocatedPrimaries.add(primaryShard.shardId());
}

if (shardRouting.replicaShards().isEmpty()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized I should also check if it is a shard on clod/frozen tier backed by a snapshot. In such case it should not be reported as yellow

@idegtiarenko
Copy link
Contributor Author

@elasticmachine please run elasticsearch-ci/part-1

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, otherwise looking good.


var state = clusterService.state();
var shutdown = state.getMetadata().custom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY);
var status = new ShardAllocationStatus();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we construct this with the shutdown metadata instead, such that we only pass the node shutdowns in the constructor? Would look cleaner to me. Also, can we pass just a Function<String, SingleNodeShutdownMetadata like:

Suggested change
var status = new ShardAllocationStatus();
var status = new ShardAllocationStatus(shutdown.getAllNodeMetadataMap()::get);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nodes shutdown is not required for output and only used for calculation. I believe it is better to keep such dependencies as arguments

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not agree, by having the addPrimary/Replica methods accept the shutdowns input as well, a specific implementation is dictated, where the information is resolved immediately rather than on output. But given the localness of the code, I can live with how it is for now.

I would still like to see us just pass around Function<String, SingleNodeShutdownMetadata> rather than the full metadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still like to see us just pass around Function<String, SingleNodeShutdownMetadata> rather than the full metadata.

I agree with Henning on this

replicas.increment(routing, metadata);
}

public HealthStatus getStatus() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should ideally also trigger on initializing shards. This may have some complication to integrate with restart though if the operator/orchestration waits for green or yellow _cluster/health before removing the restart indication it might work out ok.

I think deferring to a follow-up is fine, but perhaps add a comment here to aid the next reader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to the handover document


public HealthStatus getStatus() {
if (primaries.unassigned > 0) {
return RED;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to special handle new indices similar to ClusterShardHealth. I think we should report green for now for shards of such new indices in this PR, since there will need to be some interaction with allocation to see if this is unassigned due to throttling or because the shard cannot be assigned at all currently.

A separate count of the "new" unassigned/initializing shards seems necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

);
}

public void testShouldBeRedWhenRestartingPrimariesReachedAllocationDelay() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add this for replicas too (checking for yellow status when restart has expired).

shardId,
primary,
primary ? RecoverySource.EmptyStoreRecoverySource.INSTANCE : RecoverySource.PeerRecoverySource.INSTANCE,
new UnassignedInfo(UnassignedInfo.Reason.INDEX_CREATED, null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we do not trigger on this, but neither do we on recovery. I wonder if we should randomize between more causes here, like NODE_LEFT, PRIMARY_FAILED and more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be. Due to the way how shards are constructed this could be done by adding more ShardState and extending this method with additional simulation

if (info == null || info.getReason() != UnassignedInfo.Reason.NODE_RESTARTING) {
return false;
}
var shutdown = metadata.getAllNodeMetadataMap().get(info.getLastAllocatedNodeId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to retrive the Map<String, SingleNodeShutdownMetadata> once (cf Henning's comment)

if (shutdown == null || shutdown.getType() != SingleNodeShutdownMetadata.Type.RESTART) {
return false;
}
var now = System.currentTimeMillis();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to pass a LongSupplier to unit test this more easily

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this is used in a static method that is used in a static inner class. Passing the reference to the LongSupplier might make this code more complex. Currently the unit test injects the time of the entity relative to the current and does not rely on injecting time to the class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, no need to hold on this. I did not realize it was used in a static context. I'm just advocating on trying to avoid System.currentTimeMillis() in tests and use monotonic clocks instead, or maybe just some time supplier as it is even more simple.

|| replicas.unassigned_restarting > 0) {
builder.append(
Stream.of(
createMessage(primaries.unassigned, "unavailable primary", " unavailable primaries"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: did you consider implement a getSummary() method in ShardAllocationCounts directly, and call the methods here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I did not come up with a good way to merge distinct messages. It would require appending primary/primaries/replica/replicas suffixes as well as properly setting comas.

@idegtiarenko
Copy link
Contributor Author

@elasticmachine please run elasticsearch-ci/part-1

Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits.

* <p>
* Indicator will report:
* * RED when one or more primary shards are not available
* * YELLOW when one or more replica shards are not replicated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"replica shards are not replicated" should this be "replica shards are not available"

).flatMap(Function.identity()).collect(joining(" , "))
).append(".");
} else {
builder.append("no unavailable shards.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the preference from yesterday's meeting is no double negatives. "This cluster has all shards available."

return false;
}
var now = System.currentTimeMillis();
var restartingAllocationDelayExpiration = info.getUnassignedTimeInMillis() + shutdown.getAllocationDelay().getMillis();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use info.getUnassignedTimeInNanos() as this is the one used to calculate the delay for delayed shard allocation?

private int unassigned_restarting = 0;
private int initializing = 0;
private int started = 0;
private int reallocating = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this to relocating?

private int started = 0;
private int reallocating = 0;

public void increment(ShardRouting routing, NodesShutdownMetadata metadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us rename metadata to shutdowns.

}
}

private static boolean isUnassignedDueToTimelyRestart(ShardRouting routing, NodesShutdownMetadata metadata) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us rename metadata to shutdowns.


var state = clusterService.state();
var shutdown = state.getMetadata().custom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY);
var status = new ShardAllocationStatus();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not agree, by having the addPrimary/Replica methods accept the shutdowns input as well, a specific implementation is dictated, where the information is resolved immediately rather than on output. But given the localness of the code, I can live with how it is for now.

I would still like to see us just pass around Function<String, SingleNodeShutdownMetadata> rather than the full metadata.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@idegtiarenko
Copy link
Contributor Author

@elasticmachine please run elasticsearch-ci/part-2

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


var state = clusterService.state();
var shutdown = state.getMetadata().custom(NodesShutdownMetadata.TYPE, NodesShutdownMetadata.EMPTY);
var status = new ShardAllocationStatus();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still like to see us just pass around Function<String, SingleNodeShutdownMetadata> rather than the full metadata.

I agree with Henning on this

@idegtiarenko idegtiarenko merged commit 8d637f5 into elastic:master Mar 7, 2022
@idegtiarenko idegtiarenko deleted the 83240_shards_health_indicator branch March 7, 2022 08:31
arteam pushed a commit to arteam/elasticsearch that referenced this pull request Mar 9, 2022
Add a health indicator implementations that checks shards status 
and report their health status based on availability
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement health indicator for shard allocation
9 participants