HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary #10074
devmadhuu wants to merge 17 commits into apache:master from
Conversation
…d Container Summary Totals - Initial Commit.
sumitagrawl
left a comment
@devmadhuu Thanks for working on this. A few points to consider:
- We support moving a container back from deleted to closed/quasi-closed based on the state reported by the DN.
- Open containers can be synced incrementally based on the last synced container ID, since new containers are created in increasing ID order. But closed/quasi-closed needs a full sync, so the full sync can have a gap of 3 hours or more.
- Closing may not need handling, since it is a temporary state lasting a few minutes; DN sync can cover it.
- For a stale DN or a volume failure, there can be a sudden spike of container mismatches, e.g. containers moving from open to closing. We need to consider whether a full DB sync is required for an open-container difference (probably not), while for quasi-closed/closed we are syncing container IDs anyway.
Do we really need a full DB sync for quasi-closed/closed only, or also for the OPEN container state?
Thanks @sumitagrawl for your review. Kindly have a re-look at the code. I have pushed the changes, and OPEN containers are now synced incrementally as you suggested.
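A minimal sketch of what the incremental OPEN-container sync could look like; the helper names (getLatestContainerIdSeen, listContainersFrom, addNewContainer) are hypothetical and this is only an illustration of the idea, not the PR's code:

```java
// Fetch only containers created after the highest container ID Recon has
// already seen, relying on SCM allocating container IDs in increasing order.
// All helper names here are hypothetical.
long lastSeenId = reconContainerManager.getLatestContainerIdSeen();
List<ContainerInfo> newlyCreated =
    scmClient.listContainersFrom(lastSeenId + 1, batchSize);
for (ContainerInfo container : newlyCreated) {
  if (container.getState() == LifeCycleState.OPEN) {
    reconContainerManager.addNewContainer(container);
  }
}
```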
priyeshkaratha
left a comment
Thanks @devmadhuu for working on this. Overall changes LGTM
…mplementation for getContainerCount API.
    containers.addContainer(container);
-   if (container.getState() == LifeCycleState.OPEN) {
+   if (container.getState() == LifeCycleState.OPEN
+       && container.getPipelineID() != null) {
Any specific reason for adding this check? Is it possible that a container is reported open but the pipeline does not exist (closed due to some error)? Could the check have some impact in that case?
Earlier this path would throw an NPE from PipelineStateMap because addContainerToPipelineSCMStart eventually requires a non-null PipelineID. The check is only there to avoid failing SCM/Recon startup/reinitialize
when an OPEN container record exists without a pipelineID. The container is still added to the in-memory container map; only the pipeline-to-container registration is skipped, because there is no pipeline to register against.
For normal OPEN containers with a pipelineID, behavior is unchanged. For a non-null but missing pipelineID, we still call addContainerToPipelineSCMStart and keep the existing PipelineNotFoundException handling, so Replication Manager's OpenContainerHandler can move the container forward later.
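A simplified sketch of the guarded path described above (condensed for illustration, not the exact PR code):

```java
// The container is always added to the in-memory map; pipeline registration is
// attempted only when a pipeline ID exists, and a missing pipeline is tolerated
// so Replication Manager's OpenContainerHandler can advance the container later.
containers.addContainer(container);
if (container.getState() == LifeCycleState.OPEN
    && container.getPipelineID() != null) {
  try {
    pipelineManager.addContainerToPipelineSCMStart(
        container.getPipelineID(), container.containerID());
  } catch (IOException e) {
    // PipelineNotFoundException and similar errors are logged, not fatal.
    LOG.warn("Pipeline {} not found for OPEN container {}",
        container.getPipelineID(), container.containerID(), e);
  }
}
```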
    equals(LifeCycleState.OPEN)) {
      // Pipeline should exist, but not
      throw new PipelineNotFoundException();
    if (pipelineID != null) {
A null pipelineId also represents a PipelineNotFoundException case.
The throw is for the non-null pipelineID case where containsPipeline(pipelineID) is false. That means the container references a pipeline, but the pipeline is not present in the pipeline manager, so preserving PipelineNotFoundException is intentional and matches the old behavior.
The null pipelineID case is handled separately because there is no pipeline key to look up or register against. For Recon sync we still want to record the container so it does not remain absent from Recon, but skip pipeline tracking and log a warning.
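Roughly, the branching described above looks like this (a simplified sketch of the intended behavior, not the exact PR code):

```java
// A dangling pipeline reference still fails fast, while a null pipeline ID only
// skips pipeline tracking so the container itself is not lost on the Recon side.
if (pipelineID != null) {
  if (pipelineManager.containsPipeline(pipelineID)) {
    pipelineManager.addContainerToPipeline(pipelineID, containerID);
  } else if (state == LifeCycleState.OPEN) {
    // OPEN container references a pipeline the pipeline manager does not know.
    throw new PipelineNotFoundException();
  }
} else {
  LOG.warn("Container {} has no pipeline ID; adding it without pipeline tracking.",
      containerID);
}
```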
    try {
      ContainerInfo info = scm.getContainerManager()
          .getContainer(ContainerID.valueOf(containerID));
      cpList.add(new ContainerWithPipeline(info, null));
Are we returning the pipeline as null here?
Yeah, this can be a problem, because the protobuf requires a pipeline, so the whole RPC response can fail with an NPE. If we want Recon to receive container metadata without a pipeline, we need to think about changing the API contract:
- add a new response type with ContainerInfoProto and an optional Pipeline
- or add a Recon-specific API that returns container metadata independently of the pipeline
Excluding that container means Recon will not add that specific container in this batch pass if SCM cannot resolve a pipeline for it. That is a limitation.
IMO, we should add a new response type with ContainerInfoProto and an optional Pipeline. What do you think?
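A rough sketch of the first option; ContainerWithOptionalPipelineProto is a hypothetical new message (it does not exist today), and the surrounding variables (containerIDs, clientVersion, response) are assumed context with error handling elided:

```java
// Keep the container in the batch response even when its pipeline cannot be
// resolved, instead of dropping it, by leaving the optional pipeline unset.
for (long id : containerIDs) {
  ContainerInfo info = scm.getContainerManager()
      .getContainer(ContainerID.valueOf(id));
  ContainerWithOptionalPipelineProto.Builder entry =
      ContainerWithOptionalPipelineProto.newBuilder()
          .setContainerInfo(info.getProtobuf());
  try {
    Pipeline pipeline = scm.getPipelineManager()
        .getPipeline(info.getPipelineID());
    entry.setPipeline(pipeline.getProtobufMessage(clientVersion));
  } catch (IOException e) {
    // Pipeline missing or not convertible: leave it unset; Recon can still
    // register the container.
  }
  response.addContainers(entry.build());
}
```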
Thanks a lot @devmadhuu for the detailed explanation and continuous efforts in handling this. General comments:
Something like: non_open_container_discrepancies <= 10K and the TARGETED_SYNC interval is 1 hr.
Yes, a full SCM DB snapshot download is a time-consuming, high-latency operation. As we have experienced, some large clusters can reach 500+ GB of SCM DB and continue to grow, so it is always good to avoid the full SCM DB download and keep doing incremental targeted sync.
# Recon SCM Sync Latency Projections

## Scope
This note derives projected latency, RPC count, transfer volume, and SCM heap impact for Recon syncing container IDs from SCM in batches. The math is based on an observed log sample from Recon.

## Code References
Covers the sizing assumption for one serialized container ID, the batch-size logic, the SCM RPC serving the state-filtered container ID list, the SCM-side state-map list creation, and the protobuf response building on SCM.

## Maximum Theoretical IDs Per 128 MB RPC
Considering only the serialized wire size, this gives an upper bound on IDs per RPC. It is only a wire-size upper bound, not a recommended operational batch size.

## Proposed Operational Batch Size
The proposed batch size yields a much smaller per-RPC wire payload.

## Baseline-Derived Time Per Container
Per-container RPC-only time and per-container end-to-end sync time, derived from the observed baseline, and from these the time per 500K-ID RPC.

## General Projection Formulas
For any container count: number of RPC calls, total transfer volume, projected RPC-only time, and projected end-to-end sync time. Projections for 4 million, 10 million, 50 million, and 100 million containers are summarized in a table.

## SCM Heap Estimate
SCM heap cost per RPC is mostly transient allocation while building the response; it is not equivalent to allocating the whole response in long-lived heap. For a 500K-ID RPC, the wire payload plus a reasonable transient heap estimate at SCM gives a practical planning range and a midpoint estimate.

## Caveats
These numbers are first-order projections, not guarantees. They assume near-linear scaling from the observed baseline. Real behavior may differ.
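Stated symbolically, the general projection formulas take the form below; the symbols N, B, s, t_rpc, and t_e2e (total container count, batch size, serialized bytes per container ID, and the baseline per-container times) are notation introduced here rather than figures from the note:

$$
\text{RPC calls} = \left\lceil \frac{N}{B} \right\rceil, \qquad
\text{Transfer volume} = N \cdot s, \qquad
T_{\text{rpc}} = N \cdot t_{\text{rpc}}, \qquad
T_{\text{e2e}} = N \cdot t_{\text{e2e}}
$$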
On your 3rd and last point related to alerts/notifications: currently, we don't have any alert framework inherently supported in Recon. I discussed and raised this point earlier, but since users can also rely on various external alerting mechanisms based on Hadoop metrics, we didn't carry the idea forward. So, based on existing Hadoop metrics, we can have the following metrics:
…doop.hdds.scm.server.SCMClientProtocolServer#getExistContainerWithPipelinesInBatch.
…targeted sync execution.
    ReconStorageContainerSyncHelper.SyncAction action =
        containerSyncHelper.decideSyncAction();
    switch (action) {
      case FULL_SNAPSHOT:
How about splitting the PR into two? This would help to reach consensus asap and merge TARGETED_SYNC.
PR-1 -> TARGETED_SYNC &
PR-2 -> FULL_SNAPSHOT
I think it might not be feasible to open a new PR just for the full snapshot, because both are part of the same scheduler thread and I have already removed the separate scheduler which was blindly doing the full SCM DB download (full snapshot) every 24 hours. Now, every 12 hours a check decides between full snapshot and targeted sync, and the full snapshot threshold has also been increased from 10k to 1M.
    switch (action) {
      case FULL_SNAPSHOT:
        LOG.info("Tiered sync decision: FULL_SNAPSHOT. "
            + "Replacing Recon SCM DB with fresh SCM checkpoint.");
With TARGETED_SYNC running every hour by default (the interval is configurable and can be increased or decreased) and keeping drift minimal, FULL_SNAPSHOT should rarely be needed in a steady state. The only real case that can lead to FULL_SNAPSHOT is Recon being completely down for long hours to days; when it comes back up, the drift will be larger.
Since FULL_SNAPSHOT is considerably resource-intensive, how about introducing a command line option for users?
ozone admin recon trigger-scm-snapshot
# prints:
# WARNING: This downloads the full SCM checkpoint. On large clusters this
# can be several GB and take minutes. SCM will be under I/O load.
# Are you sure? (yes/no): _
ozone admin recon scm-snapshot-status
# Output:
# Status: IN_PROGRESS
# Started: 14:28:03
# Phase: Downloading checkpoint from SCM (file transfer)
# Duration so far: 4m 32s
# Cancel: run 'ozone admin recon cancel-scm-snapshot' (safe until DB swap begins)
ozone admin recon cancel-scm-snapshot
# SCM streams the RocksDB checkpoint files over RPC/HTTP to Recon. This is interruptible:
# Sample Code showing the cancellation task.
// scmServiceProvider.getSCMDBSnapshot() is a blocking call
DBCheckpoint dbSnapshot = scmServiceProvider.getSCMDBSnapshot();
If you run this on a dedicated Future / Thread, you can call future.cancel(true) which sends a thread interrupt. The underlying socket read will throw InterruptedIOException and the download stops immediately. This is efficient and clean — no partial writes to the final DB location, the temp checkpoint dir can be deleted.
Future<?> snapshotFuture = executor.submit(() -> {
DBCheckpoint snap = scmServiceProvider.getSCMDBSnapshot();
initializeNewRdbStore(snap.getCheckpointLocation().toFile());
});
// cancel command arrives:
snapshotFuture.cancel(true); // interrupts the download thread
Based on the current logic, we have increased the threshold from 10K to 1M, and we have also added metrics for when Recon falls behind by 1M containers, which is a very rare case. Triggering it manually via CLI would bring the same impact to the user's environment, since downloading the full SCM DB may take time either way.
My thinking is not to take any auto-trigger path, which would considerably increase SCM's I/O load. What if there was an issue due to the auto trigger? How would users recover from it and understand the status/progression?
@rakeshadr One refinement I can think of is to make the automatic FULL_SNAPSHOT action configurable/opt-in. When the non-OPEN drift exceeds the configured threshold, the default behavior can be to log a warning and expose it through metrics instead of downloading the SCM checkpoint automatically. This PR already adds Recon SCM container sync metrics such as full snapshot trigger count, last non-OPEN drift, interval since last full snapshot event, per-state drift, targeted sync status, and targeted sync duration, so operators can alert on the condition.
For the explicit trigger/status/cancel flow, instead of adding CLI support directly, I think it fits better as Recon admin REST APIs because Recon already has
TriggerDBSyncEndpoint for OM sync and targeted SCM sync:
- `POST /api/v1/triggerdbsync/scm/snapshot` to trigger full SCM DB snapshot download
- `GET /api/v1/triggerdbsync/scm/snapshot/status` to return status/phase/start time/duration
- `POST /api/v1/triggerdbsync/scm/snapshot/cancel` to cancel while the checkpoint download is still in progress
A CLI command can later be a thin wrapper over these REST APIs if needed.
If you think, it makes sense, then I can update this PR to avoid automatic full snapshot by default and keep the large-drift condition visible through logs + metrics. The explicit trigger/status/cancel REST API may be better handled as a follow-up JIRA unless you feel it should be part of this change.
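A rough sketch of how such endpoints could sit next to the existing TriggerDBSyncEndpoint; the class name TriggerSCMSnapshotEndpoint and the ScmSnapshotCoordinator helper are hypothetical, not part of this PR:

```java
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/triggerdbsync/scm/snapshot")
@Produces(MediaType.APPLICATION_JSON)
public class TriggerSCMSnapshotEndpoint {

  // Hypothetical helper that owns the background download, its status, and cancellation.
  private final ScmSnapshotCoordinator coordinator;

  @Inject
  TriggerSCMSnapshotEndpoint(ScmSnapshotCoordinator coordinator) {
    this.coordinator = coordinator;
  }

  @POST
  public Response trigger() {
    // Start the full SCM DB checkpoint download on a background thread.
    coordinator.startFullSnapshot();
    return Response.ok().build();
  }

  @GET
  @Path("/status")
  public Response status() {
    // Report phase (e.g. DOWNLOADING / SWAPPING / DONE), start time, and duration.
    return Response.ok(coordinator.getStatus()).build();
  }

  @POST
  @Path("/cancel")
  public Response cancel() {
    // Only safe while the checkpoint download is still in progress.
    return Response.ok(coordinator.cancel()).build();
  }
}
```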
I have created a separate JIRA for full snapshot sync: HDDS-15165
What changes were proposed in this pull request?
This PR addresses the issue of large deviations between Recon and SCM related to container counts, container IDs, and their respective states.
Largely the PR addresses two issues:
1. Containers that exist in SCM but are missing from Recon entirely.
2. Containers stuck in a stale state such as OPEN, CLOSING, or QUASI_CLOSED after SCM has already advanced them.
The implementation now has two SCM sync mechanisms:
Full snapshot sync
This remains the safety net. Recon replaces its SCM DB view from a fresh SCM
checkpoint on the existing snapshot schedule.
Incremental targeted sync
This now runs on its own schedule and decides between:
NO_ACTION, TARGETED_SYNC, or FULL_SNAPSHOT.
The targeted sync is implemented as a four-pass workflow:
Pass 1: CLOSED
Add missing CLOSED containers and correct stale OPEN/CLOSING/QUASI_CLOSED containers to CLOSED.

Pass 2: OPEN
Add missing OPEN containers only. No downgrades and no state correction.

Pass 3: QUASI_CLOSED
Add missing QUASI_CLOSED containers and correct stale OPEN/CLOSING containers up to QUASI_CLOSED.

Pass 4: retirement for DELETING/DELETED
Start from Recon's own CLOSED and QUASI_CLOSED containers and move them forward only when SCM explicitly returns DELETING or DELETED.
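A condensed sketch of the pass ordering above; the method names are illustrative only, not the PR's actual helpers:

```java
// Four targeted-sync passes, ordered so state corrections only ever move
// containers forward in the lifecycle.
void runTargetedSync() {
  // Pass 1: add missing CLOSED containers; correct stale
  // OPEN/CLOSING/QUASI_CLOSED containers up to CLOSED.
  syncState(LifeCycleState.CLOSED, true /* allow state correction */);
  // Pass 2: add missing OPEN containers only; never downgrade or correct.
  syncState(LifeCycleState.OPEN, false /* add-only */);
  // Pass 3: add missing QUASI_CLOSED containers; correct stale
  // OPEN/CLOSING containers up to QUASI_CLOSED.
  syncState(LifeCycleState.QUASI_CLOSED, true /* allow state correction */);
  // Pass 4: starting from Recon's own CLOSED/QUASI_CLOSED containers, retire
  // them only when SCM explicitly reports DELETING or DELETED.
  retireContainersDeletedInScm();
}
```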
Root Causes Addressed
1. DN-report path could not advance beyond CLOSING.
2. Sync used to be add-only for CLOSED.
3. Recon could miss OPEN and QUASI_CLOSED containers entirely.
4. Recon never retired stale live states based on SCM deletion progress.
5. SCM batch API could drop containers when pipeline lookup failed.
6. Recon add path and SCM state manager were not null-pipeline safe.
7. Open-container count per pipeline could drift.
8. decideSyncAction() became a real tiered decision.

Current logic (see the sketch after this list):
- Non-OPEN drift greater than ozone.recon.scm.container.threshold: FULL_SNAPSHOT
- Drift greater than 0: TARGETED_SYNC, covering OPEN, QUASI_CLOSED, CLOSED, and the remainder
- Any per-state drift above ozone.recon.scm.per.state.drift.threshold: TARGETED_SYNC
- Otherwise: NO_ACTION

9. Incremental sync got its own schedule.
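As a rough illustration of item 8, the tiered decision could be shaped like this; the field names standing in for the two configuration keys and the drift helpers are illustrative, not the PR's exact code:

```java
// Tiered decision: large non-OPEN drift escalates to a full snapshot, any
// remaining drift triggers a targeted sync, otherwise do nothing.
SyncAction decideSyncAction(Map<LifeCycleState, Long> driftByState) {
  long nonOpenDrift = driftByState.entrySet().stream()
      .filter(e -> e.getKey() != LifeCycleState.OPEN)
      .mapToLong(Map.Entry::getValue)
      .sum();
  if (nonOpenDrift > containerThreshold) {          // ozone.recon.scm.container.threshold
    return SyncAction.FULL_SNAPSHOT;
  }
  boolean perStateDrift = driftByState.values().stream()
      .anyMatch(d -> d > perStateDriftThreshold);   // ozone.recon.scm.per.state.drift.threshold
  if (nonOpenDrift > 0 || perStateDrift) {
    return SyncAction.TARGETED_SYNC;
  }
  return SyncAction.NO_ACTION;
}
```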
Additional update: Recon SCM sync observability
This patch also adds Hadoop Metrics2/JMX metrics for the Recon SCM container sync decision path. These metrics help operators understand when Recon escalates to a full SCM DB snapshot, how large the drift was when that happened, and whether targeted sync is keeping up.
New metrics include:
Targeted sync status gauge values: 0 = idle, 1 = in progress, 2 = success, 3 = failure.
These metrics are intended to help tune:
- ozone.recon.scm.container.threshold
- ozone.recon.scm.container.sync.task.interval.delay
- ozone.recon.scm.per.state.drift.threshold

What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14758
How was this patch tested?
Tests added or updated
- TestReconSCMContainerSyncIntegration.java
- TestReconStorageContainerSyncHelper.java
- TestTriggerDBSyncEndpoint.java
- TestUnhealthyContainersDerbyPerformance.java
- TestReconContainerHealthSummaryEndToEnd.java