@devmadhuu devmadhuu commented Dec 10, 2025

What changes were proposed in this pull request?

This PR adds container health state tracking to Apache Ozone SCM, storing a 2-byte health state value in each ContainerInfo object. This enables operators to query all unhealthy containers programmatically, provides a foundation for REST APIs, and supports automation.

Key Metrics:

  • Memory overhead: 20 MB for 10M containers (0.25% of 8GB heap)
  • CPU overhead: 0.15% of ReplicationManager cycle
  • Enables: health-state queries across all containers, REST APIs, automation, real-time monitoring

What This PR Implements

Core Feature: Health State in ContainerInfo

Every container now stores its health state:

public class ContainerInfo {
  private short healthStateValue;  // 2 bytes - memory efficient!
  
  // External API - returns enum for clean usage
  public ContainerHealthState getHealthState() {
    return ContainerHealthState.fromValue(healthStateValue);
  }
  
  // Internal API - stores short value directly
  public void setHealthState(short stateValue) {
    this.healthStateValue = stateValue;
  }
}
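
For illustration, here is a hypothetical caller-side sketch. The setter and the enum lookup come from the snippet above; the getValue() accessor and the helper method names (recordHealth, isUnderReplicated) are assumptions for this example, not code from the PR:

// Hypothetical helpers, not actual ReplicationManager code from this PR.
// Assumes ContainerHealthState exposes getValue() as the inverse of fromValue().
void recordHealth(ContainerInfo container, ContainerHealthState observed) {
  // the caller stores the compact short value after evaluating replicas
  container.setHealthState(observed.getValue());
}

boolean isUnderReplicated(ContainerInfo container) {
  // callers read back a typed enum rather than the raw short
  return container.getHealthState() == ContainerHealthState.UNDER_REPLICATED;
}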

10 Individual States (values 0-9):

  • HEALTHY (0), UNDER_REPLICATED (1), MIS_REPLICATED (2), OVER_REPLICATED (3)
  • MISSING (4), UNHEALTHY (5), EMPTY (6), OPEN_UNHEALTHY (7)
  • QUASI_CLOSED_STUCK (8), OPEN_WITHOUT_PIPELINE (9)

6 Combination States (values 100-105):

  • UNHEALTHY_UNDER_REPLICATED (100)
  • UNHEALTHY_OVER_REPLICATED (101)
  • MISSING_UNDER_REPLICATED (102)
  • QUASI_CLOSED_STUCK_UNDER_REPLICATED (103)
  • QUASI_CLOSED_STUCK_OVER_REPLICATED (104)
  • QUASI_CLOSED_STUCK_MISSING (105)
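
For illustration, a minimal sketch of how these values could map onto an enum, using only the names and values listed above (the actual ContainerHealthState class in this PR may be structured differently):

// Minimal sketch; the real ContainerHealthState in the PR may differ.
public enum ContainerHealthState {
  HEALTHY(0), UNDER_REPLICATED(1), MIS_REPLICATED(2), OVER_REPLICATED(3),
  MISSING(4), UNHEALTHY(5), EMPTY(6), OPEN_UNHEALTHY(7),
  QUASI_CLOSED_STUCK(8), OPEN_WITHOUT_PIPELINE(9),
  // combination states for containers with more than one problem
  UNHEALTHY_UNDER_REPLICATED(100), UNHEALTHY_OVER_REPLICATED(101),
  MISSING_UNDER_REPLICATED(102), QUASI_CLOSED_STUCK_UNDER_REPLICATED(103),
  QUASI_CLOSED_STUCK_OVER_REPLICATED(104), QUASI_CLOSED_STUCK_MISSING(105);

  private final short value;

  ContainerHealthState(int value) {
    this.value = (short) value;
  }

  public short getValue() {
    return value;
  }

  // maps a stored numeric value back to its constant; unknown values are rejected
  public static ContainerHealthState fromValue(short value) {
    for (ContainerHealthState state : values()) {
      if (state.value == value) {
        return state;
      }
    }
    throw new IllegalArgumentException("Unknown health state value: " + value);
  }
}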

Memory Impact Analysis

Before This PR (Baseline)

What Was Tracked:

ReplicationManagerReport samples:
  - First 100 container IDs per health state
  - Maximum: 100 × 16 states = 1,600 container IDs
  
Memory: 1,600 × 8 bytes = 12.8 KB

Limitation: Only 100 containers visible per state


After This PR

What Is Tracked:

ContainerInfo.healthStateValue:
  - Health state for EVERY container
  - ALL 10 million containers tracked
  
Memory: 10M × 2 bytes = 20 MB

Capability: ALL containers queryable

Also Keep:

ReplicationManagerReport samples:
  - First 100 per state (for CLI backward compatibility)
  
Memory: 12.8 KB

Total Memory:

Health states: 20 MB
Samples: 12.8 KB
══════════════════════
Total: 20 MB

Increase: +20 MB vs baseline

Memory Sizing by Cluster Size

Cluster Size       Health State Memory   % of Heap   Total Heap
100K containers    200 KB                0.005%      4 GB
1M containers      2 MB                  0.05%       4 GB
10M containers     20 MB                 0.25%       8 GB
100M containers    200 MB                1.25%       16 GB

Conclusion: memory scales linearly with container count and stays under 2% of heap even for very large clusters
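
The sizing above is simple arithmetic; as a quick sanity check of the 10M-container row (a throwaway sketch, not code from this PR):

// Back-of-envelope check of the 10M-container row above (not PR code).
static double healthStateHeapPercent() {
  long containers = 10_000_000L;
  long healthStateBytes = containers * 2L;        // 2 bytes per container = 20 MB
  double heapBytes = 8.0 * 1024 * 1024 * 1024;    // 8 GB SCM heap
  return 100.0 * healthStateBytes / heapBytes;    // about a quarter of a percent
}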


Scenario: 2 Million Unhealthy Containers

Cluster:

  • Total containers: 10 million
  • Unhealthy: 2 million (20% - worst case)
  • SCM heap: 8 GB

Memory Breakdown:

Component                  Memory     Notes
Health states (ALL 10M)    20 MB      Every container tracked
Report samples             12.8 KB    First 100 per state
Total                      20 MB      0.25% of 8 GB heap

Important: Memory is fixed at 20 MB regardless of unhealthy count

  • 100K unhealthy: Still 20 MB
  • 2M unhealthy: Still 20 MB
  • Depends on total containers, not unhealthy count

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14119

How was this patch tested?

This patch is tested with new test cases in TestContainerHealthState, covering:
- All individual states and combinations
- Conversion from ReplicationManager states
- Protobuf serialization/deserialization
- Health state filtering and querying
- Edge cases and invalid states
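
As an illustration of the round-trip coverage described above, a JUnit-style sketch (the getValue() accessor and the exact assertions in TestContainerHealthState are assumptions, not the actual test code):

// Illustrative sketch only, not the actual TestContainerHealthState code.
// Assumes each enum constant exposes its numeric value via getValue().
@Test
public void testHealthStateValueRoundTrip() {
  for (ContainerHealthState state : ContainerHealthState.values()) {
    // fromValue(...) should map the stored value back to the same constant
    assertEquals(state, ContainerHealthState.fromValue(state.getValue()));
  }
}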

Manual testing in a local Docker environment:

bash-5.1$ ozone admin container info 2
Container id: 2
Pipeline id: 58417e48-d16e-41b2-9d4b-ac728c93ef18
Write PipelineId: 9ad07e69-3af9-44cd-b05c-fa119ea93d83
Write Pipeline State: CLOSED
Container State: CLOSING
Datanodes: []
Replicas: []

bash-5.1$ ozone admin container info 1
Container id: 1
Pipeline id: fdca751b-9417-4be5-9e32-02d8f087ad44
Write PipelineId: 02f015ca-7ffc-4480-b60b-a886330f487d
Write Pipeline State: CLOSED
Container State: CLOSING
Datanodes: []
Replicas: []

bash-5.1$ ozone admin container info 3
Container id: 3
Pipeline id: 4d061122-1319-481a-955f-b895f7e43bac
Write PipelineId: 0f9a52a6-3023-48c4-9f42-2cfcafd7a590
Write Pipeline State: CLOSED
Container State: CLOSED
Datanodes: [ea3521ce-ef01-4d4a-b308-19c4904dffaa/ozone-datanode-1.ozone_default]
Replicas: [State: CLOSED; ReplicaIndex: 0; Origin: ea3521ce-ef01-4d4a-b308-19c4904dffaa; Location: ea3521ce-ef01-4d4a-b308-19c4904dffaa/ozone-datanode-1.ozone_default]

bash-5.1$ ozone admin container info 4
Container id: 4
Pipeline id: e572b173-cbe0-4797-85c1-752f1cbb8f1b
Write PipelineId: 479e2f1a-bc09-47fd-b339-6b299b1fc61d
Write Pipeline State: CLOSED
Container State: CLOSED
Datanodes: [130e3b26-712c-487b-b1bb-677dbab05f18/ozone-datanode-4.ozone_default]
Replicas: [State: CLOSED; ReplicaIndex: 0; Origin: 130e3b26-712c-487b-b1bb-677dbab05f18; Location: 130e3b26-712c-487b-b1bb-677dbab05f18/ozone-datanode-4.ozone_default]

bash-5.1$ ozone admin container info 5
Container id: 5
Pipeline id: 6176af26-5c8f-4b59-980b-d2884375b8fd
Write PipelineId: 8242a1a1-d399-44ba-b80d-853a4e2571b3
Write Pipeline State: CLOSED
Container State: CLOSING
Datanodes: []
Replicas: []
bash-5.1$ ozone admin container info 6
Container id: 6
Pipeline id: 731777c2-3886-4d1f-8f17-6f5346f8f6b3
Write PipelineId: 3bfce619-21f6-4b54-b627-60ac960d4efc
Write Pipeline State: CLOSED
Container State: CLOSING
Datanodes: []
Replicas: []
bash-5.1$ ozone admin container report
Container Summary Report generated at 2025-12-18T11:04:43Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 4
QUASI_CLOSED: 0
CLOSED: 2
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
HEALTHY: 0
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 4
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
OPEN_WITHOUT_PIPELINE: 0
UNHEALTHY_UNDER_REPLICATED: 0
UNHEALTHY_OVER_REPLICATED: 0
MISSING_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_OVER_REPLICATED: 0
QUASI_CLOSED_STUCK_MISSING: 0

First 100 MISSING containers:
#1, #2, #5, #6

bash-5.1$ ozone admin container report
Container Summary Report generated at 2025-12-18T11:06:58Z
==========================================================

Container State Summary
=======================
OPEN: 0
CLOSING: 0
QUASI_CLOSED: 0
CLOSED: 6
DELETING: 0
DELETED: 0
RECOVERING: 0

Container Health Summary
========================
HEALTHY: 0
UNDER_REPLICATED: 0
MIS_REPLICATED: 0
OVER_REPLICATED: 0
MISSING: 0
UNHEALTHY: 0
EMPTY: 0
OPEN_UNHEALTHY: 0
QUASI_CLOSED_STUCK: 0
OPEN_WITHOUT_PIPELINE: 0
UNHEALTHY_UNDER_REPLICATED: 0
UNHEALTHY_OVER_REPLICATED: 0
MISSING_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_UNDER_REPLICATED: 0
QUASI_CLOSED_STUCK_OVER_REPLICATED: 0
QUASI_CLOSED_STUCK_MISSING: 0

@adoroszlai adoroszlai changed the title HDDS-14119. Ozone SCM - Capture all container replication status in container info. HDDS-14119. Capture all container replication status in SCM container info Dec 10, 2025
@jojochuang jojochuang requested a review from sodonnel December 15, 2025 18:35

@sumitagrawl sumitagrawl left a comment


@devmadhuu I have given a few comments.

.setDeleteTransactionId(info.getDeleteTransactionId())
.setReplicationConfig(config)
.setSequenceId(info.getSequenceId())
.build();

This is a nice find, but it should be fixed separately so that we can backport it.


Created HDDS-14196.


OK sure, I have reverted to the original code and will push it separately. Thank you for your review.


@sumitagrawl sumitagrawl left a comment


LGTM

@adoroszlai

@sodonnel please take a look

@adoroszlai adoroszlai removed their request for review December 18, 2025 13:43
@devmadhuu devmadhuu marked this pull request as ready for review December 19, 2025 04:38

@errose28 errose28 left a comment


private short healthStateValue; // 2 bytes - memory efficient!

Looks like an AI optimization, which means it has no context about real-world clusters. I recently worked on a cluster with 32 million containers and 200 PB+ of data running with 127 GB of heap. Storing this as a short will consume 32M * 2 B = 64 MB of space. Storing it as an enum, which is much easier for devs to work with, will consume 32M * 8 B = 256 MB of space. (256 MB - 64 MB) / 127 GB ≈ 0.0015.

So all this inconvenience is for roughly a 0.15% reduction in heap usage at scale.

I do think we should track the container health state in the container info object in memory for easier queries, but based on real-world numbers we can just use an enum for this.

@sodonnel
Copy link
Contributor

There is something about storing the health state in ContainerInfo that doesn't feel correct to me. The state is captured at a point in time and then it's stale soon afterwards. It doesn't get updated until the next run of RM. The thing about the report object was that it captured the stats of a complete run of RM, but in the new way the container infos get changed as RM runs. I guess it should be pretty fast, but it could lead to somewhat unstable numbers.

I cannot really come up with a concrete reason as to why I think using ContainerInfo for this is wrong, aside from it only being updated by RM with its periodic runs. Perhaps I am overthinking it and it's fine.

I think there are also some places that use RM in a read-only mode to check container states (decommission maybe), so that may update the ContainerInfo states between RM runs. I am not sure if that is a problem or not. Probably not, as it can only make the state more current.

Aside from the above, scanning the PR quickly, the thing I am not sure about is multiplying up the states - like under_replicated, unhealthy_under_replicated, qc_under_replicated ... It leads to a lot more states that may just be more confusing than helpful.

In the RM report, we tried to only have a container in a single state, but it can be unhealthy and under / over replicated. It can be missing and under-replicated I think. Missing is kind of an extreme version of under-replicated. The only way to capture these "double states" with a single field is to multiply up the states, I guess.

@sodonnel
Copy link
Contributor

> In the RM report, we tried to only have a container in a single state, but it can be unhealthy and under / over replicated

I remembered why a container has 2 states. It's got a "lifecycle state" - open, closed, closing, qc, qc_stuck, unhealthy.

And it has "replication states" - under-, over- and mis-replicated.

@devmadhuu
Copy link
Contributor Author

Thanks @errose28 and @sodonnel for the review.

@devmadhuu
Copy link
Contributor Author

> There is something about storing the health state in ContainerInfo that doesn't feel correct to me. The state is captured at a point in time and then it's stale soon afterwards. It doesn't get updated until the next run of RM. The thing about the report object was that it captured the stats of a complete run of RM, but in the new way the container infos get changed as RM runs. I guess it should be pretty fast, but it could lead to somewhat unstable numbers.
>
> I cannot really come up with a concrete reason as to why I think using ContainerInfo for this is wrong, aside from it only being updated by RM with its periodic runs. Perhaps I am overthinking it and it's fine.
>
> I think there are also some places that use RM in a read-only mode to check container states (decommission maybe), so that may update the ContainerInfo states between RM runs. I am not sure if that is a problem or not. Probably not, as it can only make the state more current.
>
> Aside from the above, scanning the PR quickly, the thing I am not sure about is multiplying up the states - like under_replicated, unhealthy_under_replicated, qc_under_replicated ... It leads to a lot more states that may just be more confusing than helpful.
>
> In the RM report, we tried to only have a container in a single state, but it can be unhealthy and under / over replicated. It can be missing and under-replicated I think. Missing is kind of an extreme version of under-replicated. The only way to capture these "double states" with a single field is to multiply up the states, I guess.

Yes, with this new approach the ContainerInfo object will hold its state (single or combined), and there is a possibility of it changing between two RM runs. But in real time that may also be good, as it will reflect the current state even in read-only mode. Could you please summarize your points so I can better understand how the current behavior may be an issue? As per my understanding it should not be, but I am still looking for a deeper understanding from your contextual thinking.

@devmadhuu
Copy link
Contributor Author

> private short healthStateValue; // 2 bytes - memory efficient!
>
> Looks like an AI optimization, which means it has no context about real-world clusters. I recently worked on a cluster with 32 million containers and 200 PB+ of data running with 127 GB of heap. Storing this as a short will consume 32M * 2 B = 64 MB of space. Storing it as an enum, which is much easier for devs to work with, will consume 32M * 8 B = 256 MB of space. (256 MB - 64 MB) / 127 GB ≈ 0.0015.
>
> So all this inconvenience is for roughly a 0.15% reduction in heap usage at scale.
>
> I do think we should track the container health state in the container info object in memory for easier queries, but based on real-world numbers we can just use an enum for this.

Thanks @errose28 for providing the memory-footprint computation based on a real, large cluster. I agree with your explanation. We can work with the enum itself rather than a short value.
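
Roughly, the switch to an enum field would look like the sketch below (illustrative only, not the exact diff pushed to this PR; the default value is an assumption):

// Illustrative sketch only; the actual change in this PR may differ.
public class ContainerInfo {
  // enum reference instead of a raw short: simpler for callers,
  // at the modest heap cost discussed above
  private ContainerHealthState healthState = ContainerHealthState.HEALTHY;

  public ContainerHealthState getHealthState() {
    return healthState;
  }

  public void setHealthState(ContainerHealthState healthState) {
    this.healthState = healthState;
  }
}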

@errose28 errose28 commented Jan 5, 2026

> There is something about storing the health state in ContainerInfo that doesn't feel correct to me. The state is captured at a point in time and then it's stale soon afterwards. It doesn't get updated until the next run of RM. The thing about the report object was that it captured the stats of a complete run of RM, but in the new way the container infos get changed as RM runs. I guess it should be pretty fast, but it could lead to somewhat unstable numbers.

This is true of almost all in-memory state that SCM has. A lot of the metrics we are tracking to follow the cluster state are derived from container report information. Same with Recon, which also has an API to list all unhealthy containers. I think the container report still has merit as a point-in-time snapshot of the replication manager and we should leave it as is, but I don't think that should exclude us from adding additional functionality like querying SCM's current in-memory view of the cluster.

If there's a different place to maintain this information which still supports querying containers by health state we can use that instead, but to me the in-memory ContainerInfo object looks like the best place to store it.

@devmadhuu devmadhuu commented Jan 7, 2026

@errose28 @sodonnel, I have removed the short value and am now using the enum; the code is pushed. Kindly have a relook. The current PR still keeps the sampling that reports only the first 100 unhealthy containers, but in future PRs we can remove sampling, given the small memory impact shown in the computation above, so that a CLI like the one below can list containers in any specific unhealthy state with a limit option:

ozone admin container report --health-state UNDER_REPLICATED --limit 100

Because this PR also updates the state in ContainerInfo with every RM run, I am not sure we need both. Based on your input, if we need to keep sampling, I will need to make code changes in PR #9258 to keep the container health data consistent with the SCM report.

@devmadhuu

@errose28 can you please re-review?
