HDDS-14119. Capture all container replication status in SCM container info #9472
Conversation
sumitagrawl left a comment
@devmadhuu I have given a few comments.
    .setDeleteTransactionId(info.getDeleteTransactionId())
    .setReplicationConfig(config)
    .setSequenceId(info.getSequenceId())
    .build();
This is a nice find, but it should be fixed separately so that we can backport it.
Created HDDS-14196.
Ok sure, I have reverted to the original code and will push it separately. Thank you for your review.
sumitagrawl left a comment
LGTM
@sodonnel please take a look.
errose28 left a comment
    private short healthStateValue; // 2 bytes - memory efficient!
Looks like an AI optimization, which means it has no context about real-world clusters. I recently worked on a cluster with 32 million containers and 200 PB+ of data running with 127 GB of heap. Storing this as a short will consume 32M * 2 B = 64 MB of space. Storing it as an enum, which is much easier for devs to work with, will consume 32M * 8 B = 256 MB of space. (256 MB - 64 MB) / 127 GB ≈ 0.0015.

So all this inconvenience is for roughly a 0.1% reduction in heap usage at scale.

I do think we should track the container health state in the container info object in memory for easier queries, but based on real-world numbers we can just use an enum for this.
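A minimal sketch of the enum-based approach being suggested here; the field, constant names, and accessors are illustrative placeholders, not the PR's actual ContainerInfo code:

```java
// Illustrative sketch only: an enum-typed field on the container metadata
// object, as suggested above, instead of a packed 2-byte short value.
public class ContainerInfoSketch {

  /** Possible health states; constant names are examples, not the PR's list. */
  public enum HealthState {
    HEALTHY, UNDER_REPLICATED, OVER_REPLICATED, MIS_REPLICATED, UNHEALTHY, MISSING
  }

  // One reference per container (8 bytes on a 64-bit JVM without compressed
  // oops, 4 bytes with); the enum constants themselves are shared singletons.
  private volatile HealthState healthState = HealthState.HEALTHY;

  public HealthState getHealthState() {
    return healthState;
  }

  public void setHealthState(HealthState healthState) {
    this.healthState = healthState;
  }
}
```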
There is something about storing the health state in ContainerInfo that doesn't feel correct to me. The state is captured at a point in time and then it's stale soon afterwards; it doesn't get updated until the next run of RM. The thing about the report object was that it captured the stats of a complete run of RM, but in the new way the container infos get changed as RM runs. I guess it should be pretty fast, but it could lead to somewhat unstable numbers. I cannot really come up with a concrete reason why I think using ContainerInfo for this is wrong, aside from it only being updated by RM on its periodic runs. Perhaps I am overthinking it and it's fine. I think there are also some places that use RM in a read-only mode to check container states (decommission maybe), so that may update the ContainerInfo states between RM runs. I am not sure if that is a problem or not; probably not, as it can only make the state more current.

Aside from the above, scanning the PR quickly, the thing I am not sure about is multiplying up the states - like under_replicated, unhealthy_under_replicated, qc_under_replicated ... It leads to a lot more states that may just be more confusing than helpful. In the RM report we tried to only have a container in a single state, but a container can be unhealthy and under/over replicated, and it can be missing and under-replicated I think. Missing is kind of an extreme version of under-replicated. The only way to capture these "double states" with a single field is to multiply up the states, I guess.
I remembered why a container has 2 states. It's got a "lifecycle state" - open, closed, closing, qc, qc_stuck, unhealthy - and it has "replication states" - under, over and mis-replicated.
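For illustration, the two dimensions described above could be modelled as separate fields rather than one combined value; the names below are placeholders, not Ozone's actual types:

```java
// Placeholder sketch: keeping lifecycle state and replication health as two
// independent fields avoids multiplying them into combined states.
public class ContainerStateSketch {

  /** Lifecycle states, roughly as listed in the comment above. */
  public enum LifecycleState {
    OPEN, CLOSING, CLOSED, QUASI_CLOSED, QUASI_CLOSED_STUCK, UNHEALTHY
  }

  /** Replication health, tracked independently of the lifecycle state. */
  public enum ReplicationHealth {
    HEALTHY, UNDER_REPLICATED, OVER_REPLICATED, MIS_REPLICATED
  }

  private LifecycleState lifecycleState = LifecycleState.OPEN;
  private ReplicationHealth replicationHealth = ReplicationHealth.HEALTHY;

  // A container can then be, for example, QUASI_CLOSED_STUCK and
  // UNDER_REPLICATED at the same time without a combined enum constant.
  public boolean isQuasiClosedStuckAndUnderReplicated() {
    return lifecycleState == LifecycleState.QUASI_CLOSED_STUCK
        && replicationHealth == ReplicationHealth.UNDER_REPLICATED;
  }
}
```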
Yes, with this new approach the ContainerInfo object will hold its state (single or multiple), and it may change between two different RM runs. But in practice that may also be a good thing, as it will reflect the current state even in read-only mode. Could you please summarize your points so I can better understand how the current behavior may be an issue? As per my understanding it should not be an issue, but I am still looking for a deeper understanding based on your context.
Thanks @errose28 for providing the memory footprint computation based on a real, large-sized cluster. I agree with your explanation. We can work with the enum itself rather than a short value.
This is true of almost all in-memory state that SCM has. A lot of the metrics we are tracking to follow the cluster state are derived from container report information. The same goes for Recon, which also has an API to list all unhealthy containers. I think the container report still has merit as a point-in-time snapshot of the replication manager and we should leave it as is, but I don't think that should exclude us from adding additional functionality like querying SCM's current in-memory view of the cluster. If there's a different place to maintain this information which still supports querying containers by health state we can use that instead, but to me the in-memory …
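As a rough sketch of the kind of in-memory query this would enable; the types and names below are hypothetical, not Ozone's or Recon's actual API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of querying an in-memory container view by health
// state with a limit; not the actual SCM or Recon API.
public class HealthStateQuerySketch {

  public enum HealthState { HEALTHY, UNDER_REPLICATED, OVER_REPLICATED, UNHEALTHY, MISSING }

  public record Container(long containerId, HealthState healthState) { }

  /** Return up to {@code limit} containers currently in the given health state. */
  public static List<Container> listByHealthState(
      List<Container> containers, HealthState state, int limit) {
    return containers.stream()
        .filter(c -> c.healthState() == state)
        .limit(limit)
        .collect(Collectors.toList());
  }
}
```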
@errose28 @sodonnel, I have removed the short value, switched to the enum, and pushed the code. Kindly have another look. The current PR also keeps the sampling that exposes only the first 100 unhealthy containers, but in future PRs we can remove the sampling, since the computation above shows the memory impact is small. A CLI could then list containers in any specific unhealthy state with a limit option.
Because this PR also updates the state in ContainerInfo on every RM run, I am not sure we need both.
@errose28 can you please re-review?
What changes were proposed in this pull request?
This PR adds container health state tracking to Apache Ozone SCM, storing a health state in each ContainerInfo object. This enables operators to query all unhealthy containers programmatically, provides a foundation for REST APIs, and supports automation.
Key Metrics:
What This PR Implements
Core Feature: Health State in ContainerInfo
Every container now stores its health state:
10 Individual States (values 0-9):
6 Combination States (values 100-105):
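The full constant list is not reproduced here, but as a rough illustration of the value-coded scheme described above (constant names and codes below are hypothetical examples, not the PR's actual list):

```java
// Hypothetical illustration of a value-coded health state enum with separate
// ranges for individual and combination states; not the PR's actual constants.
public enum HealthStateCodeSketch {
  HEALTHY(0),
  UNDER_REPLICATED(1),
  OVER_REPLICATED(2),
  MIS_REPLICATED(3),
  UNHEALTHY(4),
  // ... remaining individual states in the 0-9 range ...
  UNHEALTHY_UNDER_REPLICATED(100),
  UNHEALTHY_OVER_REPLICATED(101);
  // ... remaining combination states in the 100-105 range ...

  private final int value;

  HealthStateCodeSketch(int value) {
    this.value = value;
  }

  public int getValue() {
    return value;
  }
}
```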
Memory Impact Analysis
Before This PR (Baseline)
What Was Tracked:
Limitation: Only 100 containers visible per state
After This PR
What Is Tracked:
Capability: ALL containers queryable
Also Keep:
Total Memory:
Memory Sizing by Cluster Size
Conclusion: Scales linearly, stays under 2% even for massive clusters
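As a rough cross-check using the numbers quoted earlier in this discussion (assuming one 8-byte enum reference per container on a 64-bit JVM without compressed oops): 32 million containers × 8 bytes ≈ 256 MB, which is about 0.2% of a 127 GB heap, consistent with staying well under 2% even at that scale.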
Scenario: 2 Million Unhealthy Containers
Cluster:
Memory Breakdown:
Important: Memory is fixed at 20 MB regardless of unhealthy count
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14119
How was this patch tested?
This patch is tested using new test cases:
`TestContainerHealthState` covering:
- All individual states and combinations
- Conversion from ReplicationManager states
- Protobuf serialization/deserialization
- Health state filtering and querying
- Edge cases and invalid states
Manual testing was also done in a local Docker cluster.