HDDS-13535. Show under/over-replication in replicas verify --container-state results#9135
HDDS-13535. Show under/over-replication in replicas verify --container-state results#9135adoroszlai merged 33 commits intoapache:masterfrom
replicas verify --container-state results#9135Conversation
There was a problem hiding this comment.
Thanks @0lai0 for working on this.
Your toString() changes are good additions that improve debugging, but they won't automatically appear in the CLI output without connecting them to the container-state verification flow.
The main need of this issue is that the replication related health can also be indicated in the --container-state check.
Such that in case of under/over replication, the output on running ozone debug replicas verify --container-state /volume/bucket/key can specify the case.
Gargi-jais11
left a comment
There was a problem hiding this comment.
You need to add logic to ContainerStateVerifier.verifyBlock().
@Tejaskriya could you also please review this?
.../src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECContainerReplicaCount.java
Outdated
Show resolved
Hide resolved
sarvekshayr
left a comment
There was a problem hiding this comment.
Thanks for picking this up, @0lai0.
The goal of this JIRA is to introduce a flag to the ozone debug replicas verify --container-state command to indicate containers that are under or over replicated.
We can also consider including under or over replicated containers in the existing failure checks if that approach is more feasible as it avoids adding a new flag.
|
+1 to @sarvekshayr 's comment. If you check the other sub tasks under the main jira, there is HDDS-12595: which added a container state check. Here, as of the current implementation, only the health states of the replicas are printed. |
|
Thank you all for the detailed feedback and valuable suggestions. I understand now that the main goal of this JIRA is to integrate the under/over-replication state check directly into the ContainerStateVerifier.verifyBlock() method, and have the result reflected in the ozone debug replicas verify --container-state command's output. My initial toString() changes were meant to aid in debugging, and I appreciate the clear direction on how to connect this to the core verification logic. Let me make that adjustment. |
|
Thanks @0lai0 for updating the patch. Please take a look at checkstyle and pmd failures: https://github.com/0lai0/ozone/actions/runs/18685558075/job/53277131284#step:16:17 https://github.com/0lai0/ozone/actions/runs/18685558075/job/53277131255#step:16:17 |
|
@adoroszlai Thank you for the review. Let me fix it. |
sarvekshayr
left a comment
There was a problem hiding this comment.
Please remove all square brackets from the output strings.
container-state-verifier.robot test is expected to fail due of this change. Please review the git diff file and apply the fix to the robot test.
container-state-verifier-test-fix.txt
|
@sarvekshayr Thank you for your suggestion! |
hadoop-ozone/dist/src/main/smoketest/debug/container-state-verifier.robot
Outdated
Show resolved
Hide resolved
sarvekshayr
left a comment
There was a problem hiding this comment.
LGTM. Thank you @0lai0 for iterating on the feedback and patiently incorporating all the changes.
|
Thank you @sarvekshayr for your detailed guidance! Your detailed feedback was incredibly helpful. I really appreciate it. |
|
@dombizita requesting your review on this. |
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @0lai0 for working on this.
|
|
||
| @Override | ||
| public String toString() { | ||
| return name(); | ||
| } |
There was a problem hiding this comment.
nit: not needed, since enums provide this by default.
| @Override | |
| public String toString() { | |
| return name(); | |
| } |
There was a problem hiding this comment.
Sure, I will remove this. Thanks
| private static class ContainerInfoToken { | ||
| private HddsProtos.LifeCycleState state; | ||
| private final String encodedToken; | ||
| private final ContainerInfo containerInfo; |
There was a problem hiding this comment.
ContainerInfoToken is stored in encodedTokenCache, with default cache size of 1 million. ContainerInfo is a much larger object, so the process may run out of memory. The new logic only requires the number of nodes in addition to existing items, so storing the complete ContainerInfo object is unnecessary. But I don't think even "number of required nodes" is the right information to cache (see my comment on getContainerReplicas).
Also, expected memory requirement of the cache is mentioned in CLI help. It should be updated to reflect the increased size:
| return ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ": no replicas found"; | ||
| } | ||
|
|
||
| int replicationFactor = containerInfo.getReplicationFactor().getNumber(); |
There was a problem hiding this comment.
ReplicationFactor does not support EC. Please use getReplicationConfig().getRequiredNodes().
REPLICATION_CHECK_FAILED: Replication configuration of type EC does not have a replication factor property.
There was a problem hiding this comment.
Thanks for the pointer. I'll change to use ReplicationConfig.
| if (healthyReplicas < replicationFactor) { | ||
| return String.format("%s: %d/%d healthy replicas", | ||
| ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED, healthyReplicas, replicationFactor); | ||
| } else if (healthyReplicas > replicationFactor) { | ||
| return String.format("%s: %d/%d healthy replicas", | ||
| ContainerHealthResult.ReplicationStatus.OVER_REPLICATED, healthyReplicas, replicationFactor); | ||
| } else { | ||
| return ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString(); | ||
| } |
There was a problem hiding this comment.
nit: avoid duplication.
| if (healthyReplicas < replicationFactor) { | |
| return String.format("%s: %d/%d healthy replicas", | |
| ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED, healthyReplicas, replicationFactor); | |
| } else if (healthyReplicas > replicationFactor) { | |
| return String.format("%s: %d/%d healthy replicas", | |
| ContainerHealthResult.ReplicationStatus.OVER_REPLICATED, healthyReplicas, replicationFactor); | |
| } else { | |
| return ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString(); | |
| } | |
| if (healthyReplicas == replicationFactor) { | |
| return ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION.toString(); | |
| } | |
| ContainerHealthResult.ReplicationStatus status = healthyReplicas < replicationFactor | |
| ? ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED | |
| : ContainerHealthResult.ReplicationStatus.OVER_REPLICATED; | |
| return String.format("%s: %d/%d healthy replicas", status, healthyReplicas, replicationFactor); |
| List<ContainerReplicaInfo> replicaInfos = | ||
| containerOperationClient.getContainerReplicas(containerId); |
There was a problem hiding this comment.
verifyBlock is called for a specific datanode, but checkReplicationStatus fetches info for all replicas. So we are checking all replicas for each replica. This should be done once and stored in the cache, instead of the ContainerInfo or "number of required nodes".
| public enum ReplicationStatus { | ||
| UNDER_REPLICATED, | ||
| OVER_REPLICATED, | ||
| HEALTHY_REPLICATION; |
There was a problem hiding this comment.
I'm not sure we need this new enum, it seems to be a simplified version of HealthState. Since this is just for output, I think using a subset of those states is fine.
There was a problem hiding this comment.
Thanks for the pointer. I will remove ReplicationStatus.
|
|
||
| if (!isSufficientlyReplicated()) { | ||
| List<Integer> missingIndexes = unavailableIndexes(true); | ||
| sb.append(' ').append(ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED) | ||
| .append(": missing indexes ").append(missingIndexes); | ||
| } else if (isOverReplicated()) { | ||
| List<Integer> excessIndexes = overReplicatedIndexes(false); | ||
| sb.append(' ').append(ContainerHealthResult.ReplicationStatus.OVER_REPLICATED) | ||
| .append(": excess indexes ").append(excessIndexes); | ||
| } else { | ||
| sb.append(' ').append(ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION); | ||
| } | ||
|
|
There was a problem hiding this comment.
I don't see where this output is used for the new functionality. Also, I don't think toString() should be performing such logic. So if I'm missing something and it is indeed required, please move it to a separate function, otherwise please remove it.
There was a problem hiding this comment.
Thanks for the clarification. It seems I went in the wrong function. I will remove under/over/healthy logic, and also remove RatisContainerReplicaCount.toString() relpication logic.
|
|
||
| if (!isSufficientlyReplicated()) { | ||
| result += " " + ContainerHealthResult.ReplicationStatus.UNDER_REPLICATED + ": need " | ||
| + additionalReplicaNeeded() + " more"; | ||
| } else if (isOverReplicated()) { | ||
| result += " " + ContainerHealthResult.ReplicationStatus.OVER_REPLICATED + ": excess " | ||
| + getExcessRedundancy(true) + " replica(s)"; | ||
| } else { | ||
| result += " " + ContainerHealthResult.ReplicationStatus.HEALTHY_REPLICATION; | ||
| } | ||
|
|
There was a problem hiding this comment.
Similarly, please delete or move.
replicas verify --container-state results
|
Thank you @adoroszlai for review. I'll modify the code according to the comments above. |
|
@adoroszlai , PTAL when you have a moment. Thanks. |
adoroszlai
left a comment
There was a problem hiding this comment.
Thanks @0lai0 for updating the patch.
I would like to suggest some changes, e.g. instead of introducing a new cache, add replicationStatus to ContainerInfoToken. Please see adoroszlai@cd6f9f6.
.../src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECContainerReplicaCount.java
Outdated
Show resolved
Hide resolved
…m/container/replication/ECContainerReplicaCount.java Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
...c/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java
Show resolved
Hide resolved
|
Thanks @sarvekshayr for review. PTAL |
...c/main/java/org/apache/hadoop/hdds/scm/container/replication/RatisContainerReplicaCount.java
Outdated
Show resolved
Hide resolved
|
Thanks @0lai0 for the patch, @chungen0126, @Gargi-jais11, @sarvekshayr, @sodonnel, @Tejaskriya for the review. |
What changes were proposed in this pull request?
Enhanced container state representation to include replication health indicators. The toString() methods of ECContainerReplicaCount and RatisContainerReplicaCount now display under/over replication status with specific details without affecting the underlying container health assessment logic.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13535
How was this patch tested?
unit test