@devmadhuu devmadhuu commented Dec 10, 2025

What changes were proposed in this pull request?

This PR adds container health state tracking to Apache Ozone SCM, storing a 2-byte health state value in each ContainerInfo object. This enables operators to query all unhealthy containers programmatically, provides a foundation for REST APIs, and supports automation.

Key Metrics:

  • Memory overhead: 20 MB for 10M containers (0.25% of 8GB heap)
  • CPU overhead: 0.15% of ReplicationManager cycle
  • Enables: health-state queries across all containers, REST APIs, automation, real-time monitoring

What This PR Implements

Core Feature: Health State in ContainerInfo

Every container now stores its health state:

public class ContainerInfo {
  private short healthStateValue;  // 2 bytes - memory efficient!
  
  // External API - returns enum for clean usage
  public ContainerHealthState getHealthState() {
    return ContainerHealthState.fromValue(healthStateValue);
  }
  
  // Internal API - stores short value directly
  public void setHealthState(short stateValue) {
    this.healthStateValue = stateValue;
  }
}
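
For illustration, here is a hypothetical caller-side sketch. The setter and the enum lookup come from the snippet above; the getValue() accessor and the helper method names (recordHealth, isUnderReplicated) are assumptions for this example, not code from the PR:

// Hypothetical helpers, not actual ReplicationManager code from this PR.
// Assumes ContainerHealthState exposes getValue() as the inverse of fromValue().
void recordHealth(ContainerInfo container, ContainerHealthState observed) {
  // the caller stores the compact short value after evaluating replicas
  container.setHealthState(observed.getValue());
}

boolean isUnderReplicated(ContainerInfo container) {
  // callers read back a typed enum rather than the raw short
  return container.getHealthState() == ContainerHealthState.UNDER_REPLICATED;
}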

10 Individual States (values 0-9):

  • HEALTHY (0), UNDER_REPLICATED (1), MIS_REPLICATED (2), OVER_REPLICATED (3)
  • MISSING (4), UNHEALTHY (5), EMPTY (6), OPEN_UNHEALTHY (7)
  • QUASI_CLOSED_STUCK (8), OPEN_WITHOUT_PIPELINE (9)

6 Combination States (values 100-105):

  • UNHEALTHY_UNDER_REPLICATED (100)
  • UNHEALTHY_OVER_REPLICATED (101)
  • MISSING_UNDER_REPLICATED (102)
  • QUASI_CLOSED_STUCK_UNDER_REPLICATED (103)
  • QUASI_CLOSED_STUCK_OVER_REPLICATED (104)
  • QUASI_CLOSED_STUCK_MISSING (105)
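
For illustration, a minimal sketch of how these values could map onto an enum, using only the names and values listed above (the actual ContainerHealthState class in this PR may be structured differently):

// Minimal sketch; the real ContainerHealthState in the PR may differ.
public enum ContainerHealthState {
  HEALTHY(0), UNDER_REPLICATED(1), MIS_REPLICATED(2), OVER_REPLICATED(3),
  MISSING(4), UNHEALTHY(5), EMPTY(6), OPEN_UNHEALTHY(7),
  QUASI_CLOSED_STUCK(8), OPEN_WITHOUT_PIPELINE(9),
  // combination states for containers with more than one problem
  UNHEALTHY_UNDER_REPLICATED(100), UNHEALTHY_OVER_REPLICATED(101),
  MISSING_UNDER_REPLICATED(102), QUASI_CLOSED_STUCK_UNDER_REPLICATED(103),
  QUASI_CLOSED_STUCK_OVER_REPLICATED(104), QUASI_CLOSED_STUCK_MISSING(105);

  private final short value;

  ContainerHealthState(int value) {
    this.value = (short) value;
  }

  public short getValue() {
    return value;
  }

  // maps a stored numeric value back to its constant; unknown values are rejected
  public static ContainerHealthState fromValue(short value) {
    for (ContainerHealthState state : values()) {
      if (state.value == value) {
        return state;
      }
    }
    throw new IllegalArgumentException("Unknown health state value: " + value);
  }
}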

Memory Impact Analysis

Before This PR (Baseline)

What Was Tracked:

ReplicationManagerReport samples:
  - First 100 container IDs per health state
  - Maximum: 100 × 16 states = 1,600 container IDs
  
Memory: 1,600 × 8 bytes = 12.8 KB

Limitation: Only 100 containers visible per state


After This PR

What Is Tracked:

ContainerInfo.healthStateValue:
  - Health state for EVERY container
  - ALL 10 million containers tracked
  
Memory: 10M × 2 bytes = 20 MB

Capability: ALL containers queryable

Also Keep:

ReplicationManagerReport samples:
  - First 100 per state (for CLI backward compatibility)
  
Memory: 12.8 KB

Total Memory:

Health states: 20 MB
Samples: 12.8 KB
══════════════════════
Total: 20 MB

Increase: +20 MB vs baseline

Memory Sizing by Cluster Size

Cluster Size       Health State Memory   % of Heap   Total Heap
100K containers    200 KB                0.005%      4 GB
1M containers      2 MB                  0.05%       4 GB
10M containers     20 MB                 0.25%       8 GB
100M containers    200 MB                1.25%       16 GB

Conclusion: memory scales linearly with container count and stays under 2% of heap even for very large clusters
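
The sizing above is simple arithmetic; as a quick sanity check of the 10M-container row (a throwaway sketch, not code from this PR):

// Back-of-envelope check of the 10M-container row above (not PR code).
static double healthStateHeapPercent() {
  long containers = 10_000_000L;
  long healthStateBytes = containers * 2L;        // 2 bytes per container = 20 MB
  double heapBytes = 8.0 * 1024 * 1024 * 1024;    // 8 GB SCM heap
  return 100.0 * healthStateBytes / heapBytes;    // about a quarter of a percent
}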


Scenario: 2 Million Unhealthy Containers

Cluster:

  • Total containers: 10 million
  • Unhealthy: 2 million (20% - worst case)
  • SCM heap: 8 GB

Memory Breakdown:

Component                  Memory     Notes
Health states (ALL 10M)    20 MB      Every container tracked
Report samples             12.8 KB    First 100 per state
Total                      20 MB      0.25% of 8 GB heap

Important: Memory is fixed at 20 MB regardless of unhealthy count

  • 100K unhealthy: Still 20 MB
  • 2M unhealthy: Still 20 MB
  • Depends on total containers, not unhealthy count

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14119

How was this patch tested?

This patch is tested with new test cases in TestContainerHealthState, covering:
- All individual states and combinations
- Conversion from ReplicationManager states
- Protobuf serialization/deserialization
- Health state filtering and querying
- Edge cases and invalid states
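
As an illustration of the round-trip coverage described above, a JUnit-style sketch (the getValue() accessor and the exact assertions in TestContainerHealthState are assumptions, not the actual test code):

// Illustrative sketch only, not the actual TestContainerHealthState code.
// Assumes each enum constant exposes its numeric value via getValue().
@Test
public void testHealthStateValueRoundTrip() {
  for (ContainerHealthState state : ContainerHealthState.values()) {
    // fromValue(...) should map the stored value back to the same constant
    assertEquals(state, ContainerHealthState.fromValue(state.getValue()));
  }
}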

Manual testing in a local Docker environment:

bash-5.1$ ozone admin container info 2
Container id: 2
Pipeline id: 58417e48-d16e-41b2-9d4b-ac728c93ef18
Write PipelineId: 9ad07e69-3af9-44cd-b05c-fa119ea93d83
Write Pipeline State: CLOSED
Container State: CLOSING
Datanodes: []
Replicas: []

bash-5.1$ ozone admin container info 1
Container id: 1
Pipeline id: fdca751b-9417-4be5-9e32-02d8f087ad44
Write PipelineId: 02f015ca-7ffc-4480-b60b-a886330f487d
Write Pipeline State: CLOSED
Container State: CLOSING
Datanodes: []
Replicas: []

bash-5.1$ ozone admin container info 3
Container id: 3
Pipeline id: 4d061122-1319-481a-955f-b895f7e43bac
Write PipelineId: 0f9a52a6-3023-48c4-9f42-2cfcafd7a590
Write Pipeline State: CLOSED
Container State: CLOSED
Datanodes: [ea3521ce-ef01-4d4a-b308-19c4904dffaa/ozone-datanode-1.ozone_default]
Replicas: [State: CLOSED; ReplicaIndex: 0; Origin: ea3521ce-ef01-4d4a-b308-19c4904dffaa; Location: ea3521ce-ef01-4d4a-b308-19c4904dffaa/ozone-datanode-1.ozone_default]

bash-5.1$ ozone admin container info 4
Container id: 4
Pipeline id: e572b173-cbe0-4797-85c1-752f1cbb8f1b
Write PipelineId: 479e2f1a-bc09-47fd-b339-6b299b1fc61d
Write Pipeline State: CLOSED
Container State: CLOSED
Datanodes: [130e3b26-712c-487b-b1bb-677dbab05f18/ozone-datanode-4.ozone_default]
Replicas: [State: CLOSED; ReplicaIndex: 0; Origin: 130e3b26-712c-487b-b1bb-677dbab05f18; Location: 130e3b26-712c-487b-b1bb-677dbab05f18/ozone-datanode-4.ozone_default]

bash-5.1$ ozone admin container info 5
Container id: 5
Pipeline id: 6176af26-5c8f-4b59-980b-d2884375b8fd
Write PipelineId: 8242a1a1-d399-44ba-b80d-853a4e2571b3
Write Pipeline State: CLOSED
Container State: CLOSING
Datanodes: []
Replicas: []
bash-5.1$ ozone admin container info 6
Container id: 6
Pipeline id: 731777c2-3886-4d1f-8f17-6f5346f8f6b3
Write PipelineId: 3bfce619-21f6-4b54-b627-60ac960d4efc
Write Pipeline State: CLOSED
Container State: CLOSING
Datanodes: []
Replicas: []
bash-5.1$ ozone admin container report
Container Summary Report generated at 2025-12-18T11:04:43Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 4
QUASI_CLOSED: 0
CLOSED: 2
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
HEALTHY: 0
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 4
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
OPEN_WITHOUT_PIPELINE: 0
UNHEALTHY_UNDER_REPLICATED: 0
UNHEALTHY_OVER_REPLICATED: 0
MISSING_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_OVER_REPLICATED: 0
QUASI_CLOSED_STUCK_MISSING: 0

First 100 MISSING containers:
#1, #2, #5, #6

bash-5.1$ ozone admin container report
Container Summary Report generated at 2025-12-18T11:06:58Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 0
QUASI_CLOSED: 0
CLOSED: 6
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
HEALTHY: 0
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 0
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
OPEN_WITHOUT_PIPELINE: 0
UNHEALTHY_UNDER_REPLICATED: 0
UNHEALTHY_OVER_REPLICATED: 0
MISSING_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_OVER_REPLICATED: 0
QUASI_CLOSED_STUCK_MISSING: 0

@adoroszlai adoroszlai changed the title HDDS-14119. Ozone SCM - Capture all container replication status in container info. HDDS-14119. Capture all container replication status in SCM container info Dec 10, 2025
@jojochuang jojochuang requested a review from sodonnel December 15, 2025 18:35

@sumitagrawl sumitagrawl left a comment


@devmadhuu I have given a few comments.

.setDeleteTransactionId(info.getDeleteTransactionId())
.setReplicationConfig(config)
.setSequenceId(info.getSequenceId())
.build();

This is a nice find, but it should be fixed separately so that we can backport it.


Created HDDS-14196.


OK sure, I have reverted to the original code and will push it separately. Thank you for your review.


@sumitagrawl sumitagrawl left a comment


LGTM

@adoroszlai

@sodonnel please take a look

@adoroszlai adoroszlai removed their request for review December 18, 2025 13:43
@devmadhuu devmadhuu marked this pull request as ready for review December 19, 2025 04:38

@errose28 errose28 left a comment


private short healthStateValue; // 2 bytes - memory efficient!

Looks like an AI optimization, which means it has no context about real-world clusters. I recently worked on a cluster with 32 million containers and 200 PB+ of data running with 127 GB of heap. Storing this as a short will consume 32M * 2 B = 64 MB of space. Storing it as an enum, which is much easier for devs to work with, will consume 32M * 8 B = 256 MB of space. (256 MB - 64 MB) / 127 GB ≈ 0.0015.

So all this inconvenience is for roughly a 0.15% reduction in heap usage at scale.

I do think we should track the container health state in the container info object in memory for easier queries, but based on real-world numbers we can just use an enum for this.

@sodonnel
Copy link
Contributor

There is something about storing the health state in ContainerInfo that doesn't feel correct to me. The state is captured at a point in time and then it's stale soon afterwards. It doesn't get updated until the next run of RM. The thing about the report object was that it captured the stats of a complete run of RM, but in the new way the container infos get changed as RM runs. I guess it should be pretty fast, but it could lead to somewhat unstable numbers.

I cannot really come up with a concrete reason as to why I think using ContainerInfo for this is wrong, aside from it only being updated by RM with its periodic runs. Perhaps I am overthinking it and it's fine.

I think there are also some places that use RM in a read-only mode to check container states (decommission maybe), so that may update the ContainerInfo states between RM runs. I am not sure if that is a problem or not. Probably not, as it can only make the state more current.

Aside from the above, scanning the PR quickly, the thing I am not sure about is multiplying up the states - like under_replicated, unhealthy_under_replicated, qc_under_replicated ... It leads to a lot more states that may just be more confusing than helpful.

In the RM report, we tried to only have a container in a single state, but it can be unhealthy and under / over replicated. It can be missing and under-replicated I think. Missing is kind of an extreme version of under-replicated. The only way to capture these "double states" with a single field is to multiply up the states, I guess.

@sodonnel
Copy link
Contributor

> In the RM report, we tried to only have a container in a single state, but it can be unhealthy and under / over replicated

I remembered why a container has 2 states. It's got a "lifecycle state" - open, closed, closing, qc, qc_stuck, unhealthy.

And it has "replication states" - under-, over- and mis-replicated.

@devmadhuu
Copy link
Contributor Author

Thanks @errose28 and @sodonnel for the review.

@devmadhuu
Copy link
Contributor Author

> There is something about storing the health state in ContainerInfo that doesn't feel correct to me. The state is captured at a point in time and then it's stale soon afterwards. It doesn't get updated until the next run of RM. The thing about the report object was that it captured the stats of a complete run of RM, but in the new way the container infos get changed as RM runs. I guess it should be pretty fast, but it could lead to somewhat unstable numbers.
>
> I cannot really come up with a concrete reason as to why I think using ContainerInfo for this is wrong, aside from it only being updated by RM with its periodic runs. Perhaps I am overthinking it and it's fine.
>
> I think there are also some places that use RM in a read-only mode to check container states (decommission maybe), so that may update the ContainerInfo states between RM runs. I am not sure if that is a problem or not. Probably not, as it can only make the state more current.
>
> Aside from the above, scanning the PR quickly, the thing I am not sure about is multiplying up the states - like under_replicated, unhealthy_under_replicated, qc_under_replicated ... It leads to a lot more states that may just be more confusing than helpful.
>
> In the RM report, we tried to only have a container in a single state, but it can be unhealthy and under / over replicated. It can be missing and under-replicated I think. Missing is kind of an extreme version of under-replicated. The only way to capture these "double states" with a single field is to multiply up the states, I guess.

Yes, with this new approach the ContainerInfo object will hold its state (single or combined), and there is a possibility of it changing between two RM runs. But in real time that may also be good, as it will reflect the current state even in read-only mode. Could you please summarize your points so I can better understand how the current behavior may be an issue? As per my understanding it should not be, but I am still looking for a deeper understanding from your contextual thinking.

@devmadhuu
Copy link
Contributor Author

> private short healthStateValue; // 2 bytes - memory efficient!
>
> Looks like an AI optimization, which means it has no context about real-world clusters. I recently worked on a cluster with 32 million containers and 200 PB+ of data running with 127 GB of heap. Storing this as a short will consume 32M * 2 B = 64 MB of space. Storing it as an enum, which is much easier for devs to work with, will consume 32M * 8 B = 256 MB of space. (256 MB - 64 MB) / 127 GB ≈ 0.0015.
>
> So all this inconvenience is for roughly a 0.15% reduction in heap usage at scale.
>
> I do think we should track the container health state in the container info object in memory for easier queries, but based on real-world numbers we can just use an enum for this.

Thanks @errose28 for providing the memory-footprint computation based on a real, large cluster. I agree with your explanation. We can work with the enum itself rather than a short value.
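
Roughly, the switch to an enum field would look like the sketch below (illustrative only, not the exact diff pushed to this PR; the default value is an assumption):

// Illustrative sketch only; the actual change in this PR may differ.
public class ContainerInfo {
  // enum reference instead of a raw short: simpler for callers,
  // at the modest heap cost discussed above
  private ContainerHealthState healthState = ContainerHealthState.HEALTHY;

  public ContainerHealthState getHealthState() {
    return healthState;
  }

  public void setHealthState(ContainerHealthState healthState) {
    this.healthState = healthState;
  }
}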

@errose28 errose28 commented Jan 5, 2026

> There is something about storing the health state in ContainerInfo that doesn't feel correct to me. The state is captured at a point in time and then it's stale soon afterwards. It doesn't get updated until the next run of RM. The thing about the report object was that it captured the stats of a complete run of RM, but in the new way the container infos get changed as RM runs. I guess it should be pretty fast, but it could lead to somewhat unstable numbers.

This is true of almost all in-memory state that SCM has. A lot of the metrics we are tracking to follow the cluster state are derived from container report information. Same with Recon, which also has an API to list all unhealthy containers. I think the container report still has merit as a point-in-time snapshot of the replication manager and we should leave it as is, but I don't think that should exclude us from adding additional functionality like querying SCM's current in-memory view of the cluster.

If there's a different place to maintain this information which still supports querying containers by health state we can use that instead, but to me the in-memory ContainerInfo object looks like the best place to store it.

@devmadhuu devmadhuu commented Jan 7, 2026

@errose28 @sodonnel, I have removed the short value and am now using the enum; the code is pushed. Kindly have a relook. The current PR still keeps the sampling that reports only the first 100 unhealthy containers, but in future PRs we can remove sampling, given the small memory impact shown in the computation above, so that a CLI like the one below can list containers in any specific unhealthy state with a limit option:

ozone admin container report --health-state UNDER_REPLICATED --limit 100

Because this PR also updates the state in ContainerInfo with every RM run, I am not sure we need both. Based on your input, if we need to keep sampling, I will need to make code changes in PR #9258 to keep the container health data consistent with the SCM report.

@devmadhuu

@errose28 can you please re-review?
