HDDS-13535. Show under/over-replication in replicas verify --container-state results #9135

Merged
adoroszlai merged 33 commits into apache:master from 0lai0:HDDS-13535 on Dec 9, 2025

@0lai0
Contributor

@0lai0 0lai0 commented Oct 10, 2025

What changes were proposed in this pull request?

Enhanced container state representation to include replication health indicators. The toString() methods of ECContainerReplicaCount and RatisContainerReplicaCount now display under/over replication status with specific details without affecting the underlying container health assessment logic.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13535

How was this patch tested?

unit test

Contributor

@Gargi-jais11 Gargi-jais11 left a comment

Thanks @0lai0 for working on this.
Your toString() changes are good additions that improve debugging, but they won't automatically appear in the CLI output without connecting them to the container-state verification flow.
The main requirement of this issue is that replication health should also be reported by the --container-state check, so that when a container is under- or over-replicated, the output of ozone debug replicas verify --container-state /volume/bucket/key says so.

Contributor

@Gargi-jais11 Gargi-jais11 left a comment

You need to add logic to ContainerStateVerifier.verifyBlock().
@Tejaskriya could you also please review this?

Contributor

@sarvekshayr sarvekshayr left a comment

Thanks for picking this up, @0lai0.

The goal of this JIRA is to introduce a flag to the ozone debug replicas verify --container-state command to indicate containers that are under or over replicated.

We can also consider including under or over replicated containers in the existing failure checks if that approach is more feasible as it avoids adding a new flag.

@errose28 errose28 added the tools Tools that helps with debugging label Oct 13, 2025
@Tejaskriya
Contributor

Tejaskriya commented Oct 14, 2025

+1 to @sarvekshayr's comment. Among the other subtasks under the main JIRA, HDDS-12595 added a container state check. In the current implementation, that check prints only the health states of the replicas.
The idea of the JIRA you are working on is to detect under-replicated and over-replicated containers as part of this check. If that proves difficult, we can add a separate flag and check.

@0lai0
Contributor Author

0lai0 commented Oct 20, 2025

Thank you all for the detailed feedback and valuable suggestions.

I understand now that the main goal of this JIRA is to integrate the under/over-replication state check directly into the ContainerStateVerifier.verifyBlock() method, and have the result reflected in the ozone debug replicas verify --container-state command's output. My initial toString() changes were meant to aid in debugging, and I appreciate the clear direction on how to connect this to the core verification logic. Let me make that adjustment.

@adoroszlai
Contributor

Thanks @0lai0 for updating the patch. Please take a look at checkstyle and pmd failures:

https://github.com/0lai0/ozone/actions/runs/18685558075/job/53277131284#step:16:17

https://github.com/0lai0/ozone/actions/runs/18685558075/job/53277131255#step:16:17

@0lai0
Contributor Author

0lai0 commented Oct 22, 2025

@adoroszlai Thank you for the review. Let me fix it.

Contributor

@sarvekshayr sarvekshayr left a comment

Please remove all square brackets from the output strings.

The container-state-verifier.robot test is expected to fail due to this change. Please review the attached git diff file and apply the fix to the robot test.
container-state-verifier-test-fix.txt

@0lai0
Contributor Author

0lai0 commented Oct 22, 2025

@sarvekshayr Thank you for your suggestion!

Contributor

@sarvekshayr sarvekshayr left a comment

LGTM. Thank you @0lai0 for iterating on the feedback and patiently incorporating all the changes.

@0lai0
Contributor Author

0lai0 commented Nov 11, 2025

Thank you @sarvekshayr for your detailed guidance! Your detailed feedback was incredibly helpful. I really appreciate it.

@sarvekshayr
Contributor

@dombizita requesting your review on this.

@adoroszlai adoroszlai requested a review from dombizita November 30, 2025 08:33
Contributor

@adoroszlai adoroszlai left a comment

Thanks @0lai0 for working on this.

Comment on lines +416 to +420

@Override
public String toString() {
  return name();
}
Contributor

nit: not needed, since enums provide this by default.

Suggested change
- @Override
- public String toString() {
-   return name();
- }

Contributor Author

Sure, I will remove this. Thanks
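
The point about enums can be checked in isolation. The sketch below uses a hypothetical enum (not the one from the patch) that declares no toString() override, and shows that printing or concatenating a constant already yields its name:

```java
public class EnumToStringDemo {
  // Hypothetical enum for illustration only; it declares no toString() override.
  enum Status { UNDER_REPLICATED, OVER_REPLICATED }

  public static void main(String[] args) {
    Status s = Status.UNDER_REPLICATED;
    // Enum.toString() returns the same value as name() unless it is overridden.
    System.out.println(s);                              // UNDER_REPLICATED
    System.out.println(s.toString().equals(s.name()));  // true
    System.out.println("status=" + s);                  // status=UNDER_REPLICATED
  }
}
```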

Comment on lines +163 to +214
private static class ContainerInfoToken {
  private HddsProtos.LifeCycleState state;
  private final String encodedToken;
  private final ContainerInfo containerInfo;
Contributor

ContainerInfoToken is stored in encodedTokenCache, with default cache size of 1 million. ContainerInfo is a much larger object, so the process may run out of memory. The new logic only requires the number of nodes in addition to existing items, so storing the complete ContainerInfo object is unnecessary. But I don't think even "number of required nodes" is the right information to cache (see my comment on getContainerReplicas).

Also, expected memory requirement of the cache is mentioned in CLI help. It should be updated to reflect the increased size:

"'--container-state'. Default is 1 million containers (which takes around 43MB). " +

return ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ": no replicas found";
}

int replicationFactor = containerInfo.getReplicationFactor().getNumber();
Contributor

ReplicationFactor does not support EC. Please use getReplicationConfig().getRequiredNodes().

REPLICATION_CHECK_FAILED: Replication configuration of type EC does not have a replication factor property.

Contributor Author

Thanks for the pointer. I'll change to use ReplicationConfig.
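
A minimal sketch of why a "required nodes" accessor covers both replication types where a replication factor does not. The ReplicationConfig interface here is a hypothetical stand-in that models only the getRequiredNodes() method named in the review, and the ratis() and ec() helpers are illustrative, assuming an EC container needs data plus parity replicas:

```java
public class RequiredNodesDemo {
  /** Hypothetical stand-in for the real type; only getRequiredNodes() is modelled. */
  interface ReplicationConfig {
    int getRequiredNodes();
  }

  /** Ratis-style replication: the factor is also the number of required nodes. */
  static ReplicationConfig ratis(int factor) {
    return () -> factor;
  }

  /** EC-style replication (assumption): required nodes = data + parity. */
  static ReplicationConfig ec(int data, int parity) {
    return () -> data + parity;
  }

  /** Works for either type because it never asks for a replication factor. */
  static String describe(ReplicationConfig config, int healthyReplicas) {
    return healthyReplicas + "/" + config.getRequiredNodes() + " healthy replicas";
  }

  public static void main(String[] args) {
    System.out.println(describe(ratis(3), 2));  // 2/3 healthy replicas
    System.out.println(describe(ec(3, 2), 5));  // 5/5 healthy replicas
  }
}
```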

Comment on lines +196 to +204
if (healthyReplicas < replicationFactor) {
  return String.format("%s: %d/%d healthy replicas",
      ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED, healthyReplicas, replicationFactor);
} else if (healthyReplicas > replicationFactor) {
  return String.format("%s: %d/%d healthy replicas",
      ContainerHealthResult.ReplicationStatus.OVER_REPLICATED, healthyReplicas, replicationFactor);
} else {
  return ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
}
Contributor

nit: avoid duplication.

Suggested change
- if (healthyReplicas < replicationFactor) {
-   return String.format("%s: %d/%d healthy replicas",
-       ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED, healthyReplicas, replicationFactor);
- } else if (healthyReplicas > replicationFactor) {
-   return String.format("%s: %d/%d healthy replicas",
-       ContainerHealthResult.ReplicationStatus.OVER_REPLICATED, healthyReplicas, replicationFactor);
- } else {
-   return ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
- }
+ if (healthyReplicas == replicationFactor) {
+   return ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString();
+ }
+ ContainerHealthResult.ReplicationStatus status = healthyReplicas < replicationFactor
+     ? ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED
+     : ContainerHealthResult.ReplicationStatus.OVER_REPLICATED;
+ return String.format("%s: %d/%d healthy replicas", status, healthyReplicas, replicationFactor);
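
The deduplicated shape from the suggestion can be exercised on its own. The following is a standalone sketch, not the merged code: the Status enum and the summarize() method are hypothetical names, and the parameter is called requiredNodes on the assumption that the count comes from the replication config rather than a replication factor.

```java
public class ReplicationSummary {
  // Hypothetical labels for illustration; not the enum from the patch.
  enum Status { UNDER_REPLICATED, OVER_REPLICATED, HEALTHY }

  /** Returns the healthy label alone, or the label plus a healthy/required count. */
  static String summarize(int healthyReplicas, int requiredNodes) {
    if (healthyReplicas == requiredNodes) {
      return Status.HEALTHY.toString();
    }
    // Only the label differs between the two unhealthy cases,
    // so the format string is written once.
    Status status = healthyReplicas < requiredNodes
        ? Status.UNDER_REPLICATED
        : Status.OVER_REPLICATED;
    return String.format("%s: %d/%d healthy replicas", status, healthyReplicas, requiredNodes);
  }

  public static void main(String[] args) {
    System.out.println(summarize(2, 3));  // UNDER_REPLICATED: 2/3 healthy replicas
    System.out.println(summarize(4, 3));  // OVER_REPLICATED: 4/3 healthy replicas
    System.out.println(summarize(3, 3));  // HEALTHY
  }
}
```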

Comment on lines +180 to +181
List<ContainerReplicaInfo> replicaInfos =
    containerOperationClient.getContainerReplicas(containerId);
Contributor

verifyBlock is called for a specific datanode, but checkReplicationStatus fetches info for all replicas. So we are checking all replicas for each replica. This should be done once and stored in the cache, instead of the ContainerInfo or "number of required nodes".
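
One way to picture "done once and stored in the cache" is memoizing the status per container ID, so that the per-datanode calls reuse a single lookup. This is a sketch under assumptions: PerContainerStatusCache, statusFor() and the lookup function are hypothetical, and an unbounded map stands in for the size-limited cache the tool actually uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

public class PerContainerStatusCache {
  // Unbounded for brevity; a real cache would be size-limited.
  private final Map<Long, String> statusByContainer = new ConcurrentHashMap<>();
  // Hypothetical lookup that fetches all replicas and derives a status string.
  private final LongFunction<String> lookup;

  PerContainerStatusCache(LongFunction<String> lookup) {
    this.lookup = lookup;
  }

  /** Runs the lookup only on the first request for a given container ID. */
  String statusFor(long containerId) {
    return statusByContainer.computeIfAbsent(containerId, id -> lookup.apply(id));
  }

  public static void main(String[] args) {
    AtomicInteger lookups = new AtomicInteger();
    PerContainerStatusCache cache = new PerContainerStatusCache(id -> {
      lookups.incrementAndGet();
      return "UNDER_REPLICATED: 2/3 healthy replicas";
    });
    // One verification call per datanode, all for the same container.
    for (String datanode : new String[] {"dn1", "dn2", "dn3"}) {
      System.out.println(datanode + " -> " + cache.statusFor(42L));
    }
    System.out.println("lookups: " + lookups.get());  // lookups: 1
  }
}
```

Caching only the derived status string also keeps each entry small, which matters for the memory concern raised earlier in this review.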

Comment on lines +412 to +415
public enum ReplicationStatus {
  UNDER_REPLICATED,
  OVER_REPLICATED,
  HEALTHY_REPLICATION;
Contributor

I'm not sure we need this new enum, it seems to be a simplified version of HealthState. Since this is just for output, I think using a subset of those states is fine.

Contributor Author

Thanks for the pointer. I will remove ReplicationStatus.

Comment on lines +576 to +588

if (!isSufficientlyReplicated()) {
  List<Integer> missingIndexes = unavailableIndexes(true);
  sb.append(' ').append(ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED)
      .append(": missing indexes ").append(missingIndexes);
} else if (isOverReplicated()) {
  List<Integer> excessIndexes = overReplicatedIndexes(false);
  sb.append(' ').append(ContainerHealthResult.ReplicationStatus.OVER_REPLICATED)
      .append(": excess indexes ").append(excessIndexes);
} else {
  sb.append(' ').append(ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION);
}

Contributor

I don't see where this output is used for the new functionality. Also, I don't think toString() should be performing such logic. So if I'm missing something and it is indeed required, please move it to a separate function, otherwise please remove it.

Contributor Author

Thanks for the clarification. It seems I put this in the wrong function. I will remove the under/over/healthy logic here, and also remove the replication logic from RatisContainerReplicaCount.toString().

Comment on lines +260 to +270

if (!isSufficientlyReplicated()) {
  result += " " + ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ": need "
      + additionalReplicaNeeded() + " more";
} else if (isOverReplicated()) {
  result += " " + ContainerHealthResult.ReplicationStatus.OVER_REPLICATED + ": excess "
      + getExcessRedundancy(true) + " replica(s)";
} else {
  result += " " + ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION;
}

Contributor

Similarly, please delete or move.

Contributor Author

Sure, thanks @adoroszlai for review.

@adoroszlai adoroszlai changed the title from "HDDS-13535. Flag to indicate under and over replications in container-state check" to "HDDS-13535. Show under/over-replication in replicas verify --container-state results" on Nov 30, 2025
@0lai0
Contributor Author

0lai0 commented Nov 30, 2025

Thank you @adoroszlai for review. I'll modify the code according to the comments above.

@0lai0 0lai0 requested a review from adoroszlai December 4, 2025 09:09
@0lai0
Contributor Author

0lai0 commented Dec 4, 2025

@adoroszlai , PTAL when you have a moment. Thanks.

Contributor

@adoroszlai adoroszlai left a comment

Thanks @0lai0 for updating the patch.

I would like to suggest some changes, e.g. instead of introducing a new cache, add replicationStatus to ContainerInfoToken. Please see adoroszlai@cd6f9f6.

0lai0 and others added 3 commits December 6, 2025 09:38
…m/container/replication/ECContainerReplicaCount.java

Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
@0lai0
Contributor Author

0lai0 commented Dec 8, 2025

Thanks @sarvekshayr for review. PTAL

@adoroszlai adoroszlai merged commit 9036e1a into apache:master Dec 9, 2025
43 checks passed
@adoroszlai
Contributor

Thanks @0lai0 for the patch, @chungen0126, @Gargi-jais11, @sarvekshayr, @sodonnel, @Tejaskriya for the review.

Labels

tools Tools that helps with debugging

8 participants