
HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary#10074

Open
devmadhuu wants to merge 17 commits into apache:master from devmadhuu:HDDS-14758

Conversation

@devmadhuu
Contributor

@devmadhuu devmadhuu commented Apr 13, 2026

What changes were proposed in this pull request?

This PR addresses large deviations between Recon and SCM in container counts, container IDs, and their respective states.

Largely, the PR addresses two issues:

  1. Recon may completely miss containers that SCM knows about.
  2. Recon may know the container, but keep it in an older lifecycle state such as
    OPEN, CLOSING, or QUASI_CLOSED after SCM has already advanced it.

The implementation now has two SCM sync mechanisms:

  1. Full snapshot sync
    This remains the safety net. Recon replaces its SCM DB view from a fresh SCM
    checkpoint on the existing snapshot schedule.

  2. Incremental targeted sync
    This now runs on its own schedule and decides between:

    • NO_ACTION
    • TARGETED_SYNC
    • FULL_SNAPSHOT

The targeted sync is implemented as a four-pass workflow:

  1. Pass 1: CLOSED
    Add missing CLOSED containers and correct stale OPEN/CLOSING/QUASI_CLOSED
    containers to CLOSED.

  2. Pass 2: OPEN
    Add missing OPEN containers only. No downgrades and no state correction.

  3. Pass 3: QUASI_CLOSED
    Add missing QUASI_CLOSED containers and correct stale OPEN/CLOSING
    containers up to QUASI_CLOSED.

  4. Pass 4: retirement for DELETING/DELETED
    Start from Recon's own CLOSED and QUASI_CLOSED containers and move them
    forward only when SCM explicitly returns DELETING or DELETED.
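A minimal sketch of how the four passes could be ordered. The helper names (syncState, retireContainers) are assumptions for illustration only, not the actual method names in the patch:

```java
import java.util.EnumSet;
import org.apache.hadoop.hdds.protocol.proto.HddsProtos.LifeCycleState;

// Hedged sketch of the four-pass targeted sync described above; the helpers are hypothetical.
public void runTargetedSync() throws Exception {
  // Pass 1: add missing CLOSED containers; correct stale OPEN/CLOSING/QUASI_CLOSED to CLOSED.
  syncState(LifeCycleState.CLOSED,
      EnumSet.of(LifeCycleState.OPEN, LifeCycleState.CLOSING, LifeCycleState.QUASI_CLOSED));

  // Pass 2: add missing OPEN containers only; no downgrades, no state correction.
  syncState(LifeCycleState.OPEN, EnumSet.noneOf(LifeCycleState.class));

  // Pass 3: add missing QUASI_CLOSED containers; correct stale OPEN/CLOSING up to QUASI_CLOSED.
  syncState(LifeCycleState.QUASI_CLOSED,
      EnumSet.of(LifeCycleState.OPEN, LifeCycleState.CLOSING));

  // Pass 4: retirement. Start from Recon's own CLOSED/QUASI_CLOSED containers and advance
  // them only when SCM explicitly reports DELETING or DELETED for the same container ID.
  retireContainers(EnumSet.of(LifeCycleState.CLOSED, LifeCycleState.QUASI_CLOSED));
}
```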

Root Causes Addressed

1. DN-report path could not advance beyond CLOSING

2. Sync used to be add-only for CLOSED

3. Recon could miss OPEN and QUASI_CLOSED containers entirely

4. Recon never retired stale live states based on SCM deletion progress

5. SCM batch API could drop containers when pipeline lookup failed

6. Recon add path and SCM state manager were not null-pipeline safe

7. Open-container count per pipeline could drift

8. decideSyncAction() became a real tiered decision

Current logic:

  1. compare total SCM and Recon container counts
  2. if total drift > ozone.recon.scm.container.threshold:
    FULL_SNAPSHOT
  3. else if total drift > 0:
    TARGETED_SYNC
  4. else compare per-state drift for:
    • OPEN
    • QUASI_CLOSED
    • derived CLOSED remainder
  5. if any per-state drift exceeds
    ozone.recon.scm.per.state.drift.threshold:
    TARGETED_SYNC
  6. otherwise:
    NO_ACTION

9. Incremental sync got its own schedule
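Below is a minimal, self-contained sketch of the tiered decideSyncAction() decision described under point 8 above. The class, method, and parameter names are placeholders for the actual counts read from SCM, Recon, and ozone-site configuration:

```java
// Hedged sketch of the tiered sync decision; identifiers are illustrative placeholders.
public final class SyncDecisionSketch {

  public enum SyncAction { NO_ACTION, TARGETED_SYNC, FULL_SNAPSHOT }

  public static SyncAction decide(long scmTotal, long reconTotal,
      long scmOpen, long reconOpen, long scmQuasiClosed, long reconQuasiClosed,
      long containerThreshold, long perStateDriftThreshold) {
    long totalDrift = Math.abs(scmTotal - reconTotal);
    if (totalDrift > containerThreshold) {        // ozone.recon.scm.container.threshold
      return SyncAction.FULL_SNAPSHOT;
    }
    if (totalDrift > 0) {
      return SyncAction.TARGETED_SYNC;
    }
    // Per-state drift for OPEN, QUASI_CLOSED, and the derived CLOSED remainder.
    long openDrift = Math.abs(scmOpen - reconOpen);
    long quasiDrift = Math.abs(scmQuasiClosed - reconQuasiClosed);
    long closedDrift = Math.abs((scmTotal - scmOpen - scmQuasiClosed)
        - (reconTotal - reconOpen - reconQuasiClosed));
    if (openDrift > perStateDriftThreshold        // ozone.recon.scm.per.state.drift.threshold
        || quasiDrift > perStateDriftThreshold
        || closedDrift > perStateDriftThreshold) {
      return SyncAction.TARGETED_SYNC;
    }
    return SyncAction.NO_ACTION;
  }
}
```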

Additional update: Recon SCM sync observability

This patch also adds Hadoop Metrics2/JMX metrics for the Recon SCM container sync decision path. These metrics help operators understand when Recon escalates to a full SCM DB snapshot, how large the drift was when that happened, and whether targeted sync is keeping up.

New metrics include:

  • Count of full SCM DB snapshot download decisions triggered by non-OPEN container drift.
  • Last non-OPEN drift value that triggered a full SCM DB snapshot.
  • Time between the last two full SCM DB snapshot decisions.
  • Last per-state drift values for OPEN, QUASI_CLOSED, and CLOSED containers when targeted sync is triggered.
  • Targeted sync status:
    • 0: idle
    • 1: in progress
    • 2: success
    • 3: failure
  • Duration of the last targeted sync run.

These metrics are intended to help tune:

  • ozone.recon.scm.container.threshold
  • ozone.recon.scm.container.sync.task.interval.delay
  • ozone.recon.scm.per.state.drift.threshold
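For reference, a hedged sketch of what such a Hadoop Metrics2 source can look like. The class and metric field names below are illustrative and may differ from the ones actually added in this patch:

```java
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

// Hedged sketch: Metrics2/JMX source for the Recon SCM container sync decision path.
@Metrics(about = "Recon SCM container sync metrics", context = "recon")
public final class ReconScmContainerSyncMetricsSketch {
  @Metric private MutableCounterLong fullSnapshotDecisionCount;    // full snapshot decisions
  @Metric private MutableGaugeLong lastNonOpenDriftAtFullSnapshot; // drift that triggered it
  @Metric private MutableGaugeLong millisBetweenLastFullSnapshots; // gap between last two decisions
  @Metric private MutableGaugeLong lastOpenDrift;
  @Metric private MutableGaugeLong lastQuasiClosedDrift;
  @Metric private MutableGaugeLong lastClosedDrift;
  @Metric private MutableGaugeLong targetedSyncStatus;             // 0 idle, 1 in progress, 2 success, 3 failure
  @Metric private MutableGaugeLong lastTargetedSyncDurationMs;

  public static ReconScmContainerSyncMetricsSketch create() {
    return DefaultMetricsSystem.instance().register(
        "ReconScmContainerSyncMetricsSketch",
        "Recon SCM container sync metrics",
        new ReconScmContainerSyncMetricsSketch());
  }
}
```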

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14758

How was this patch tested?

Tests added or updated

  • TestReconSCMContainerSyncIntegration.java
  • TestReconStorageContainerSyncHelper.java
  • TestTriggerDBSyncEndpoint.java
  • TestUnhealthyContainersDerbyPerformance.java
  • TestReconContainerHealthSummaryEndToEnd.java

@adoroszlai adoroszlai changed the title HDDS-14758. Recon - Mismatch Between Cluster State Container Count and Container Summary Totals. HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary Apr 13, 2026
@devmadhuu devmadhuu requested a review from sumitagrawl April 13, 2026 07:48
@devmadhuu devmadhuu marked this pull request as ready for review April 14, 2026 04:08
Contributor

@sumitagrawl sumitagrawl left a comment


@devmadhuu Thanks for working on this. A few points to be considered:

  1. We support moving a container back from deleted to closed/quasi-closed based on state reported by the DN.
  2. Open containers can be synced incrementally based on the last synced container ID, as new containers are created in increasing ID order. But closed/quasi-closed needs a full sync, so the full sync gap can be 3 hours or more.
  3. CLOSING may not be required, as it is a temporary state lasting a few minutes; DN sync can cover this.
  4. For a stale DN or a volume failure, there can be a sudden spike of container mismatch, e.g. containers moving from OPEN to CLOSING. Need to consider whether a full DB sync is required for an OPEN-container difference -- maybe not. And for the quasi-closed/closed states we are syncing container IDs anyway.

Do we really need a full DB sync for quasi-closed/closed only, or also for the OPEN container state?

@devmadhuu
Contributor Author

devmadhuu commented Apr 15, 2026


Thanks @sumitagrawl for your review. Kindly have another look at the code. I have pushed the changes and am now syncing OPEN containers incrementally as you suggested.

@devmadhuu devmadhuu requested a review from sumitagrawl April 16, 2026 07:14
@devmadhuu devmadhuu marked this pull request as draft April 16, 2026 07:14
@devmadhuu devmadhuu marked this pull request as ready for review April 16, 2026 09:29
Contributor

@priyeshkaratha priyeshkaratha left a comment


Thanks @devmadhuu for working on this. Overall changes LGTM

@devmadhuu devmadhuu requested a review from rakeshadr April 17, 2026 08:41
  containers.addContainer(container);
- if (container.getState() == LifeCycleState.OPEN) {
+ if (container.getState() == LifeCycleState.OPEN
+     && container.getPipelineID() != null) {
Contributor


Any specific reason for adding this check? It is possible that a container is reported OPEN but the pipeline does not exist (closed due to some error); could the check have some impact in that case?

Contributor Author


Earlier this path would throw an NPE from PipelineStateMap, because addContainerToPipelineSCMStart eventually requires a non-null PipelineID. The check only avoids failing SCM/Recon startup/reinitialize when an OPEN container record exists without a pipelineID. The container is still added to the in-memory container map; only pipeline-to-container registration is skipped, because there is no pipeline to register against.
For normal OPEN containers with a pipelineID, behavior is unchanged. For a non-null pipelineID whose pipeline is missing, we still call addContainerToPipelineSCMStart and keep the existing PipelineNotFoundException handling, so Replication Manager's OpenContainerHandler can move it forward later.

equals(LifeCycleState.OPEN)) {
// Pipeline should exist, but not
throw new PipelineNotFoundException();
if (pipelineID != null) {
Contributor


A null pipelineId also represents a PipelineNotFoundException case.

Contributor Author


The throw is for the non-null pipelineID case where containsPipeline(pipelineID) is false. That means the container references a pipeline, but the pipeline is not present in the pipeline manager, so preserving PipelineNotFoundException is intentional and matches the old behavior.

The null pipelineID case is handled separately because there is no pipeline key to look up or register against. For Recon sync we still want to record the container so it does not remain absent from Recon, but skip pipeline tracking and log a warning.
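A minimal sketch of the two branches described above. The pipelineManager/LOG identifiers are assumptions for illustration; the actual code sits in the SCM/Recon state-management path:

```java
// Hedged sketch: distinguish "container references a missing pipeline" from
// "container has no pipeline at all" when registering an OPEN container.
if (pipelineID != null) {
  if (!pipelineManager.containsPipeline(pipelineID)) {
    // Pipeline is referenced but unknown to the pipeline manager:
    // preserve the original behavior and surface the error.
    throw new PipelineNotFoundException("Pipeline " + pipelineID + " not found");
  }
  pipelineManager.addContainerToPipeline(pipelineID, containerID);
} else {
  // No pipeline key to look up or register against: keep the container record
  // so it is not absent from Recon, skip pipeline tracking, and warn.
  LOG.warn("Container {} has no pipeline ID; skipping pipeline-to-container tracking.",
      containerID);
}
```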

try {
ContainerInfo info = scm.getContainerManager()
.getContainer(ContainerID.valueOf(containerID));
cpList.add(new ContainerWithPipeline(info, null));
Contributor


Are we returning the pipeline as null here?

Contributor Author


Yeah, this can be a problem, because the protobuf requires a pipeline, so the whole RPC response can fail with an NPE. Now we need to think about whether we want Recon to receive container metadata without a pipeline; if so, we might need to change the API contract.

  • add a new response type with ContainerInfoProto and optional Pipeline
  • or add a Recon-specific API that returns existing container metadata independently from pipeline

Excluding that container means Recon will not add that specific container in this batch pass if SCM cannot resolve a pipeline for it. That is a limitation.

IMO, we should add a new response type with ContainerInfoProto and an optional Pipeline. What do you think?

@rakeshadr
Contributor

rakeshadr commented Apr 23, 2026

Thanks a lot @devmadhuu for the detailed explanation and continuous efforts in handling this.

General comments:

  1. IIUC, a full snapshot is a costly operation and the cost varies with the SCM database size. I'd suggest not triggering an automatic FULL_SNAPSHOT; instead, analyze the overhead of snapshot data transfer, snapshot loading, GC pressure, and potential system resource utilization for a known SCM database size (say 275 GB and 300M containers). Then we can think about the solution and a proposal to the users.

  2. Could you please come up with a set of meaningful metrics to understand TARGETED_SYNC? For example, per-state metrics that users/customers can use to determine acceptable nonOpenDrift discrepancies and tune the TARGETED_SYNC interval. Another one could be the time taken, etc.

  3. Can we think of a warning or alert if the discrepancies grow beyond a certain threshold or limit?

Something like: non_open_container_discrepancies <= 10K and a TARGETED_SYNC interval of 1 hr:

Apr 22nd 00:01 AM : non_open_container_discrepancies 10K
Apr 22nd 00:02 AM : non_open_container_discrepancies 5K
Apr 22nd 00:03 AM : non_open_container_discrepancies 15K WARN
Apr 22nd 00:04 AM : non_open_container_discrepancies 12K WARN
Apr 22nd 00:05 AM : non_open_container_discrepancies 2K
Apr 22nd 00:06 AM : non_open_container_discrepancies 2K
Apr 22nd 00:07 AM : non_open_container_discrepancies 2K

@devmadhuu
Contributor Author


Yes, a full SCM DB snapshot download is a time-consuming, high-latency operation. In our experience, some large clusters can reach 500+ GB of SCM DB and continue to grow, so it is always good to avoid the full SCM DB download and keep doing TARGETED_SYNC at regular intervals, with a configurable default interval of 1 hour.

How did we arrive at a 1-hour interval for the check that decides the sync action (TARGETED_SYNC vs. full SCM DB snapshot download)?
  • It is based on the math below for 100 million containers and the impact on Recon of making RPC calls from Recon to SCM every hour.
  • In highly concurrent clusters with heavy workload, if we increase the interval from 1 hour to 2 hours, we may in the worst case see the container discrepancy between SCM and Recon grow, which could force a full snapshot download that we want to avoid.

Why do we need this TARGETED_SYNC every hour when Recon already processes DN container reports (existing behavior), and why are the existing ICRs and FCRs not sufficient for SCM-only containers?

  • ICR and periodic FCR do not guarantee Recon will learn about every SCM container.
  • A container can exist in SCM but never appear in DN reports if the client fails before any replica is created or before DNs persist/report it.
  • In that case, Recon can miss an OPEN empty container entirely.
  • Later, SCM may keep that container OPEN for some time and eventually move it toward deletion, while Recon still has no signal from DN-side reporting.
  • That is exactly the class of drift this SCM-driven sync is meant to repair.

Recon SCM Sync Latency Projections

Scope

This note derives projected latency, RPC count, transfer volume, and SCM heap impact for Recon syncing container IDs from SCM using a batch size of 500,000 container IDs per RPC.

The math is based on the following observed log sample from Recon:

/var/log/hadoop-ozone/ozone-recon.log:2026-03-18 10:41:40,896 WARN [pool-53-thread-1]-org.apache.hadoop.ozone.recon.scm.ReconStorageContainerSyncHelper: BASELINE: Starting sync - totalContainers=123409, batchSize=123409
/var/log/hadoop-ozone/ozone-recon.log:2026-03-18 10:41:41,269 WARN [pool-53-thread-1]-org.apache.hadoop.ozone.recon.scm.ReconStorageContainerSyncHelper: BASELINE: Sync complete - totalTime=516ms, rpcCalls=1, avgRpcTime=283ms, totalRpcTime=283ms, newContainers=0

Code References

The sizing assumption for one serialized ContainerID comes from:

The batch-size logic comes from:

The SCM RPC serving the state-filtered container ID list comes from:

The SCM side state-map list creation comes from:

The protobuf response building on SCM comes from:

Assumptions

  1. Each serialized ContainerID on the wire is approximately 12 bytes.

     Formula:

     size_per_container_id = 12 bytes

  2. The proposed application-level batch size is:

     batch_size = 500,000 container IDs per RPC

  3. The latency sample is treated as a baseline for linear projection:

     baseline_container_count = 123,409
     baseline_rpc_calls = 1
     baseline_avg_rpc_time = 283 ms
     baseline_total_time = 516 ms

  4. Two time models are projected:
     • RPC time only
       Uses avgRpcTime=283ms from the log.
     • End-to-end sync time
       Uses totalTime=516ms from the log.

  5. These projections are for the state-list RPC that transfers container IDs,
     not full ContainerWithPipeline metadata payloads.

Maximum Theoretical IDs Per 128 MB RPC

If we only consider the serialized wire size:

max_rpc_message_size = 128 MB
                       = 128 * 1024 * 1024
                       = 134,217,728 bytes

max_container_ids_per_128mb_rpc =
    134,217,728 / 12
  = 11,184,810.67

Therefore:

max_container_ids_per_128mb_rpc ~= 11.18 million

This is only a wire-size upper bound. It is not a recommended operational
batch size because SCM serialization cost, protobuf object creation, and
transient heap pressure grow with the list size.

Proposed Operational Batch Size

Proposed batch size:

batch_size = 500,000

This yields per-RPC wire payload:

payload_per_rpc_bytes = 500,000 * 12
                      = 6,000,000 bytes

Conversions:

payload_per_rpc_decimal_mb = 6,000,000 / 1,000,000 = 6.0 MB
payload_per_rpc_binary_mib = 6,000,000 / 1,048,576 = 5.72 MiB

So:

payload_per_rpc ~= 6.0 MB ~= 5.72 MiB

Baseline-Derived Time Per Container

RPC Time Only

Using:

baseline_avg_rpc_time = 283 ms
baseline_container_count = 123,409

Per-container RPC time:

rpc_time_per_container =
    283 ms / 123,409
  = 0.0022931878 ms
  = 2.2931878 microseconds

End-to-End Sync Time

Using:

baseline_total_time = 516 ms
baseline_container_count = 123,409

Per-container end-to-end time:

total_time_per_container =
    516 ms / 123,409
  = 0.0041804077 ms
  = 4.1804077 microseconds

Time Per 500K RPC

RPC Time Only

rpc_time_per_500k =
    500,000 * 0.0022931878 ms
  = 1,146.5939 ms
  = 1.1466 s

End-to-End Sync Time

total_time_per_500k =
    500,000 * 0.0041804077 ms
  = 2,090.2039 ms
  = 2.0902 s

General Projection Formulas

For any container count N:

Number of RPC Calls

rpc_calls(N) = ceil(N / 500,000)

Total Transfer Volume

transfer_bytes(N) = N * 12

Projected RPC Time Only

projected_rpc_time_ms(N) =
    N * (283 / 123,409)

Projected End-to-End Sync Time

projected_total_time_ms(N) =
    N * (516 / 123,409)
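The same formulas, collected into a small helper so any container count N can be plugged in. This class is just a calculator reproducing the arithmetic above; it is not part of the patch:

```java
// Hedged sketch: reproduces the projection formulas using the observed baseline.
public final class ReconSyncProjection {
  private static final long BATCH_SIZE = 500_000L;
  private static final long BYTES_PER_CONTAINER_ID = 12L;
  private static final double BASELINE_CONTAINERS = 123_409d;
  private static final double BASELINE_RPC_TIME_MS = 283d;
  private static final double BASELINE_TOTAL_TIME_MS = 516d;

  static long rpcCalls(long n) {
    return (n + BATCH_SIZE - 1) / BATCH_SIZE;                   // ceil(N / 500,000)
  }

  static long transferBytes(long n) {
    return n * BYTES_PER_CONTAINER_ID;                          // N * 12
  }

  static double projectedRpcTimeMs(long n) {
    return n * (BASELINE_RPC_TIME_MS / BASELINE_CONTAINERS);    // N * (283 / 123,409)
  }

  static double projectedTotalTimeMs(long n) {
    return n * (BASELINE_TOTAL_TIME_MS / BASELINE_CONTAINERS);  // N * (516 / 123,409)
  }

  public static void main(String[] args) {
    long n = 100_000_000L;   // 100 million containers
    System.out.printf("rpcCalls=%d, transferMB=%.1f, rpcTimeS=%.1f, totalTimeS=%.1f%n",
        rpcCalls(n), transferBytes(n) / 1_000_000.0,
        projectedRpcTimeMs(n) / 1000.0, projectedTotalTimeMs(n) / 1000.0);
  }
}
```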

4 Million Containers

RPC Calls

rpc_calls =
    ceil(4,000,000 / 500,000)
  = 8

Transfer Volume

transfer_bytes =
    4,000,000 * 12
  = 48,000,000 bytes

transfer_decimal_mb = 48,000,000 / 1,000,000 = 48 MB
transfer_binary_mib = 48,000,000 / 1,048,576 = 45.78 MiB

RPC Time Only

projected_rpc_time_ms =
    4,000,000 * (283 / 123,409)
  = 9,172.7511 ms
  = 9.17 s

End-to-End Sync Time

projected_total_time_ms =
    4,000,000 * (516 / 123,409)
  = 16,721.6312 ms
  = 16.72 s

10 Million Containers

RPC Calls

rpc_calls =
    ceil(10,000,000 / 500,000)
  = 20

Transfer Volume

transfer_bytes =
    10,000,000 * 12
  = 120,000,000 bytes

transfer_decimal_mb = 120 MB
transfer_binary_mib = 114.44 MiB

RPC Time Only

projected_rpc_time_ms =
    10,000,000 * (283 / 123,409)
  = 22,931.8778 ms
  = 22.93 s

End-to-End Sync Time

projected_total_time_ms =
    10,000,000 * (516 / 123,409)
  = 41,804.0779 ms
  = 41.80 s

50 Million Containers

RPC Calls

rpc_calls =
    ceil(50,000,000 / 500,000)
  = 100

Transfer Volume

transfer_bytes =
    50,000,000 * 12
  = 600,000,000 bytes

transfer_decimal_mb = 600 MB
transfer_binary_mib = 572.20 MiB

RPC Time Only

projected_rpc_time_ms =
    50,000,000 * (283 / 123,409)
  = 114,659.3891 ms
  = 114.66 s
  = 1.91 min

End-to-End Sync Time

projected_total_time_ms =
    50,000,000 * (516 / 123,409)
  = 209,020.3893 ms
  = 209.02 s
  = 3.48 min

100 Million Containers

RPC Calls

rpc_calls =
    ceil(100,000,000 / 500,000)
  = 200

Transfer Volume

transfer_bytes =
    100,000,000 * 12
  = 1,200,000,000 bytes

transfer_decimal_gb = 1.2 GB
transfer_binary_gib = 1,200,000,000 / 1,073,741,824 = 1.118 GiB

RPC Time Only

projected_rpc_time_ms =
    100,000,000 * (283 / 123,409)
  = 229,318.7783 ms
  = 229.32 s
  = 3.82 min

End-to-End Sync Time

projected_total_time_ms =
    100,000,000 * (516 / 123,409)
  = 418,040.7788 ms
  = 418.04 s
  = 6.97 min

Summary Table

| Containers | RPC Calls | Payload / RPC | Total Payload | Est. RPC Time Only | Est. End-to-End Sync Time |
|-----------|-----------|---------------|---------------|--------------------|---------------------------|
| 4M        | 8         | ~6.0 MB       | ~48 MB        | ~9.2 s             | ~16.7 s                   |
| 10M       | 20        | ~6.0 MB       | ~120 MB       | ~22.9 s            | ~41.8 s                   |
| 50M       | 100       | ~6.0 MB       | ~600 MB       | ~114.7 s           | ~209.0 s                  |
| 100M      | 200       | ~6.0 MB       | ~1.2 GB       | ~229.3 s           | ~418.0 s                  |

SCM Heap Estimate

The 12 bytes figure is only the serialized protobuf wire size for one
ContainerID. It is not the Java heap size of all transient objects involved
in serving the RPC.

Important implementation details:

  1. SCM obtains the IDs from the state map and collects them into a List.
  2. SCM then converts each ContainerID into protobuf and adds it to the
    response builder.

This means SCM heap cost per RPC is mostly:

  • the list of references returned for the batch
  • protobuf response objects / builder state
  • transient serialization buffers

It is not equivalent to allocating 500,000 * 12 bytes on heap.

Practical Per-RPC Heap Estimate for 500K IDs

Wire payload:

500,000 * 12 = 6,000,000 bytes ~= 6 MB

Reasonable transient heap estimate at SCM:

  • list/reference overhead: ~2 MB to 4 MB
  • protobuf payload/builder/buffers: ~6 MB+
  • additional temporary object overhead: several MB

Practical planning range:

estimated_transient_heap_per_500k_rpc ~= 10 MB to 25 MB

A good midpoint estimate is:

estimated_transient_heap_per_500k_rpc ~= 15 MB

Key Takeaways

  1. A 128 MB RPC envelope could theoretically carry about 11.18 million
    container IDs if only wire size is considered.
  2. A 500K operational batch is much safer and produces only ~6 MB of
    payload per RPC.
  3. For 100 million containers:

     rpc_calls = 200
     total_transfer_volume = 1.2 GB
     estimated_rpc_time_only = 229.32 s ~= 3.82 min
     estimated_end_to_end_sync_time = 418.04 s ~= 6.97 min

  4. SCM transient heap should remain bounded per call with 500K batching and
     is expected to be in the 10 MB to 25 MB range per RPC rather than scaling
     toward the full 100 million container population.

Caveats

These numbers are first-order projections, not guarantees.

They assume near-linear scaling from the observed baseline. Real behavior may
shift due to:

  • SCM CPU saturation
  • protobuf serialization overhead
  • JVM GC behavior
  • Recon consumer-side processing
  • network bandwidth and latency variability
  • concurrent SCM load from other clients

On your 3rd and last point related to alerts/notifications:

Currently, Recon does not have an inherent alerting framework. I raised this point earlier, but since users can also build alerts externally on top of Hadoop metrics, we did not carry the idea forward.

So, based on the existing Hadoop metrics framework, we can have the following metrics:

  1. A cumulative count of how many full-snapshot decisions were hit (as opposed to actual executions of a full snapshot).
  2. The size of the drift at the last such full-snapshot event, whether it comes out as 50K or any other value.
  3. Time between the last two such full SCM DB snapshot download events.
  4. Separate per-state drift metrics (CLOSED, QUASI_CLOSED, etc.) with the size of the drift.
  5. Status of the targeted sync: in progress, success, or failure.
  6. Time taken by the last targeted sync.

Devesh Kumar Singh added 3 commits April 28, 2026 15:46
…doop.hdds.scm.server.SCMClientProtocolServer#getExistContainerWithPipelinesInBatch.
@devmadhuu devmadhuu requested a review from sumitagrawl April 28, 2026 14:13
@devmadhuu devmadhuu marked this pull request as draft April 29, 2026 05:34
@devmadhuu devmadhuu marked this pull request as ready for review April 29, 2026 07:00
ReconStorageContainerSyncHelper.SyncAction action =
containerSyncHelper.decideSyncAction();
switch (action) {
case FULL_SNAPSHOT:
Contributor

@rakeshadr rakeshadr May 4, 2026


How about splitting the PR into two? This would help to reach consensus ASAP and merge TARGETED_SYNC.

PR-1 -> TARGETED_SYNC
PR-2 -> FULL_SNAPSHOT

Contributor Author


I think it might not be feasible to open a new PR just for the full snapshot, because both are part of the same scheduler thread, and I have already removed the separate scheduler that was blindly doing a full SCM DB download (full snapshot) every 24 hours. Now, every 12 hours, a check decides between full snapshot and targeted sync, and the full snapshot threshold has also been increased from 10K to 1M.

switch (action) {
case FULL_SNAPSHOT:
LOG.info("Tiered sync decision: FULL_SNAPSHOT. "
+ "Replacing Recon SCM DB with fresh SCM checkpoint.");
Contributor


With TARGETED_SYNC running every hour by default (configurable; the interval can be increased or decreased) and keeping drift minimal, FULL_SNAPSHOT should rarely be needed in a steady state. The only real case that can lead to FULL_SNAPSHOT is when Recon was completely down for many hours or days, comes back up, and the drift is large.

Since FULL_SNAPSHOT is considerably resource-intensive, how about introducing a command-line option for users?

ozone admin recon trigger-scm-snapshot
# prints:
# WARNING: This downloads the full SCM checkpoint. On large clusters this
# can be several GB and take minutes. SCM will be under I/O load.
# Are you sure? (yes/no): _

ozone admin recon scm-snapshot-status
# Output:
# Status: IN_PROGRESS
# Started: 14:28:03
# Phase: Downloading checkpoint from SCM (file transfer)
# Duration so far: 4m 32s
# Cancel: run 'ozone admin recon cancel-scm-snapshot' (safe until DB swap begins)

ozone admin recon cancel-scm-snapshot

SCM streams the RocksDB checkpoint files over RPC/HTTP to Recon, and this transfer is interruptible. scmServiceProvider.getSCMDBSnapshot() is a blocking call:

DBCheckpoint dbSnapshot = scmServiceProvider.getSCMDBSnapshot();

If you run it on a dedicated Future / Thread, you can call future.cancel(true), which sends a thread interrupt. The underlying socket read will throw InterruptedIOException and the download stops immediately. This is efficient and clean: there are no partial writes to the final DB location, and the temp checkpoint dir can be deleted.

Sample code showing the cancellation task:

Future<?> snapshotFuture = executor.submit(() -> {
    DBCheckpoint snap = scmServiceProvider.getSCMDBSnapshot();
    initializeNewRdbStore(snap.getCheckpointLocation().toFile());
});
// cancel command arrives:
snapshotFuture.cancel(true);  // interrupts the download thread

Contributor Author


Based on the current logic, we have increased the threshold from 10K to 1M, and we have also added metrics for when Recon falls behind by 1M containers, which is a very rare case. Triggering it manually via a CLI would bring the same impact to the user's environment, where downloading the full SCM DB may take time.

Contributor


My thinking is not to take any auto-trigger path, which considerably increases SCM's I/O load. What if there was an issue due to the auto trigger? How can users recover from it and understand the status/progression?

Contributor Author

@devmadhuu devmadhuu May 4, 2026


@rakeshadr One refinement I can think of is to make the automatic FULL_SNAPSHOT action configurable/opt-in. When the non-OPEN drift exceeds the configured threshold, the default behavior can be to log a warning and expose it through metrics instead of downloading the SCM checkpoint automatically. This PR already adds Recon SCM container sync metrics such as full snapshot trigger count, last non-OPEN drift, interval since last full snapshot event, per-state drift, targeted sync status, and targeted sync duration, so operators can alert on the condition.

For the explicit trigger/status/cancel flow, instead of adding CLI support directly, I think it fits better as Recon admin REST APIs because Recon already has
TriggerDBSyncEndpoint for OM sync and targeted SCM sync:

  - `POST /api/v1/triggerdbsync/scm/snapshot` to trigger full SCM DB snapshot download
  - `GET /api/v1/triggerdbsync/scm/snapshot/status` to return status/phase/start time/duration
  - `POST /api/v1/triggerdbsync/scm/snapshot/cancel` to cancel while the checkpoint download is still in progress

A CLI command can later be a thin wrapper over these REST APIs if needed.

If you think it makes sense, I can update this PR to avoid the automatic full snapshot by default and keep the large-drift condition visible through logs + metrics. The explicit trigger/status/cancel REST API may be better handled as a follow-up JIRA unless you feel it should be part of this change.
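For illustration only, a hedged JAX-RS style sketch of what such Recon admin endpoints could look like. All class, path, and helper names here are hypothetical, and the actual API would be designed in the follow-up work:

```java
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// Hypothetical controller: a real implementation would wrap the existing blocking
// scmServiceProvider.getSCMDBSnapshot() call in a cancellable Future.
interface ScmSnapshotController {
  void startFullSnapshotDownload();
  String currentStatus();            // e.g. phase, start time, duration so far
  boolean cancelIfDownloading();     // safe until the DB swap begins
}

// Hedged sketch of the proposed trigger/status/cancel endpoints.
@Path("/triggerdbsync/scm/snapshot")   // assumed to be served under Recon's /api/v1 prefix
@Produces(MediaType.APPLICATION_JSON)
public class ScmSnapshotAdminEndpointSketch {
  private final ScmSnapshotController controller;

  public ScmSnapshotAdminEndpointSketch(ScmSnapshotController controller) {
    this.controller = controller;
  }

  @POST
  public Response trigger() {
    controller.startFullSnapshotDownload();      // kicks off the async download
    return Response.accepted().build();
  }

  @GET
  @Path("/status")
  public Response status() {
    return Response.ok(controller.currentStatus()).build();
  }

  @POST
  @Path("/cancel")
  public Response cancel() {
    return Response.ok(controller.cancelIfDownloading()).build();
  }
}
```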

Contributor Author


I have created a separate JIRA for full snapshot sync: HDDS-15165
