HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary #10074
devmadhuu wants to merge 17 commits into apache:master from
Conversation
…d Container Summary Totals - Initial Commit.
sumitagrawl
left a comment
@devmadhuu Thanks for working on this. A few points to consider:
- We support moving a container back from deleted to closed/quasi-closed based on the state reported by the DN.
- Open containers can be synced incrementally based on the last synced container ID, since new containers are created in increasing ID order. But closed/quasi-closed needs a full sync, so the full sync can have a gap of 3 hours or more.
- Closing may not need handling, since it is a temporary state lasting a few minutes; DN sync can cover it.
- For a stale DN or a volume failure, there can be a sudden spike of container mismatches, e.g. containers moving from open to closing. We need to consider whether a full DB sync is required for an open-container difference (probably not), while for quasi-closed/closed we are syncing container IDs anyway.
Do we really need a full DB sync for quasi-closed/closed only, or also for the OPEN container state?
Thanks @sumitagrawl for your review. Kindly have a re-look at the code. I have pushed the changes, and OPEN containers are now synced incrementally as you suggested.
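A minimal sketch of what the incremental OPEN-container sync could look like; the helper names (getLatestContainerIdSeen, listContainersFrom, addNewContainer) are hypothetical and this is only an illustration of the idea, not the PR's code:

```java
// Fetch only containers created after the highest container ID Recon has
// already seen, relying on SCM allocating container IDs in increasing order.
// All helper names here are hypothetical.
long lastSeenId = reconContainerManager.getLatestContainerIdSeen();
List<ContainerInfo> newlyCreated =
    scmClient.listContainersFrom(lastSeenId + 1, batchSize);
for (ContainerInfo container : newlyCreated) {
  if (container.getState() == LifeCycleState.OPEN) {
    reconContainerManager.addNewContainer(container);
  }
}
```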
priyeshkaratha
left a comment
Thanks @devmadhuu for working on this. Overall changes LGTM
…mplementation for getContainerCount API.
    containers.addContainer(container);
-   if (container.getState() == LifeCycleState.OPEN) {
+   if (container.getState() == LifeCycleState.OPEN
+       && container.getPipelineID() != null) {
Any specific reason for adding this check? Is it possible that a container is reported open but the pipeline does not exist (closed due to some error)? Could the check have some impact in that case?
Earlier this path would throw an NPE from PipelineStateMap because addContainerToPipelineSCMStart eventually requires a non-null PipelineID. The check is only there to avoid failing SCM/Recon startup/reinitialize
when an OPEN container record exists without a pipelineID. The container is still added to the in-memory container map; only the pipeline-to-container registration is skipped, because there is no pipeline to register against.
For normal OPEN containers with a pipelineID, behavior is unchanged. For a non-null but missing pipelineID, we still call addContainerToPipelineSCMStart and keep the existing PipelineNotFoundException handling, so Replication Manager's OpenContainerHandler can move the container forward later.
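A simplified sketch of the guarded path described above (condensed for illustration, not the exact PR code):

```java
// The container is always added to the in-memory map; pipeline registration is
// attempted only when a pipeline ID exists, and a missing pipeline is tolerated
// so Replication Manager's OpenContainerHandler can advance the container later.
containers.addContainer(container);
if (container.getState() == LifeCycleState.OPEN
    && container.getPipelineID() != null) {
  try {
    pipelineManager.addContainerToPipelineSCMStart(
        container.getPipelineID(), container.containerID());
  } catch (IOException e) {
    // PipelineNotFoundException and similar errors are logged, not fatal.
    LOG.warn("Pipeline {} not found for OPEN container {}",
        container.getPipelineID(), container.containerID(), e);
  }
}
```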
    equals(LifeCycleState.OPEN)) {
      // Pipeline should exist, but not
      throw new PipelineNotFoundException();
    if (pipelineID != null) {
A null pipelineId also represents a PipelineNotFoundException case.
The throw is for the non-null pipelineID case where containsPipeline(pipelineID) is false. That means the container references a pipeline, but the pipeline is not present in the pipeline manager, so preserving PipelineNotFoundException is intentional and matches the old behavior.
The null pipelineID case is handled separately because there is no pipeline key to look up or register against. For Recon sync we still want to record the container so it does not remain absent from Recon, but skip pipeline tracking and log a warning.
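Roughly, the branching described above looks like this (a simplified sketch of the intended behavior, not the exact PR code):

```java
// A dangling pipeline reference still fails fast, while a null pipeline ID only
// skips pipeline tracking so the container itself is not lost on the Recon side.
if (pipelineID != null) {
  if (pipelineManager.containsPipeline(pipelineID)) {
    pipelineManager.addContainerToPipeline(pipelineID, containerID);
  } else if (state == LifeCycleState.OPEN) {
    // OPEN container references a pipeline the pipeline manager does not know.
    throw new PipelineNotFoundException();
  }
} else {
  LOG.warn("Container {} has no pipeline ID; adding it without pipeline tracking.",
      containerID);
}
```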
    try {
      ContainerInfo info = scm.getContainerManager()
          .getContainer(ContainerID.valueOf(containerID));
      cpList.add(new ContainerWithPipeline(info, null));
Are we returning the pipeline as null here?
Yeah, this can be a problem, because the protobuf requires a pipeline, so the whole RPC response can fail with an NPE. If we want Recon to receive container metadata without a pipeline, we need to think about changing the API contract:
- add a new response type with ContainerInfoProto and an optional Pipeline
- or add a Recon-specific API that returns container metadata independently of the pipeline
Excluding that container means Recon will not add that specific container in this batch pass if SCM cannot resolve a pipeline for it. That is a limitation.
IMO, we should add a new response type with ContainerInfoProto and an optional Pipeline. What do you think?
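A rough sketch of the first option; ContainerWithOptionalPipelineProto is a hypothetical new message (it does not exist today), and the surrounding variables (containerIDs, clientVersion, response) are assumed context with error handling elided:

```java
// Keep the container in the batch response even when its pipeline cannot be
// resolved, instead of dropping it, by leaving the optional pipeline unset.
for (long id : containerIDs) {
  ContainerInfo info = scm.getContainerManager()
      .getContainer(ContainerID.valueOf(id));
  ContainerWithOptionalPipelineProto.Builder entry =
      ContainerWithOptionalPipelineProto.newBuilder()
          .setContainerInfo(info.getProtobuf());
  try {
    Pipeline pipeline = scm.getPipelineManager()
        .getPipeline(info.getPipelineID());
    entry.setPipeline(pipeline.getProtobufMessage(clientVersion));
  } catch (IOException e) {
    // Pipeline missing or not convertible: leave it unset; Recon can still
    // register the container.
  }
  response.addContainers(entry.build());
}
```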
Thanks a lot @devmadhuu for the detailed explanation and continuous efforts in handling this. General comments:
Something like: non_open_container_discrepancies <= 10K and the TARGETED_SYNC interval is 1 hr.
Yes, a full SCM DB snapshot download is a time-consuming, high-latency operation. As we have experienced, some large clusters can reach 500+ GB of SCM DB and continue to grow, so it is always good to avoid the full SCM DB download and keep doing incremental targeted sync.
# Recon SCM Sync Latency Projections

## Scope
This note derives projected latency, RPC count, transfer volume, and SCM heap impact for Recon syncing container IDs from SCM in batches. The math is based on an observed log sample from Recon.

## Code References
Covers the sizing assumption for one serialized container ID, the batch-size logic, the SCM RPC serving the state-filtered container ID list, the SCM-side state-map list creation, and the protobuf response building on SCM.

## Maximum Theoretical IDs Per 128 MB RPC
Considering only the serialized wire size, this gives an upper bound on IDs per RPC. It is only a wire-size upper bound, not a recommended operational batch size.

## Proposed Operational Batch Size
The proposed batch size yields a much smaller per-RPC wire payload.

## Baseline-Derived Time Per Container
Per-container RPC-only time and per-container end-to-end sync time, derived from the observed baseline, and from these the time per 500K-ID RPC.

## General Projection Formulas
For any container count: number of RPC calls, total transfer volume, projected RPC-only time, and projected end-to-end sync time. Projections for 4 million, 10 million, 50 million, and 100 million containers are summarized in a table.

## SCM Heap Estimate
SCM heap cost per RPC is mostly transient allocation while building the response; it is not equivalent to allocating the whole response in long-lived heap. For a 500K-ID RPC, the wire payload plus a reasonable transient heap estimate at SCM gives a practical planning range and a midpoint estimate.

## Caveats
These numbers are first-order projections, not guarantees. They assume near-linear scaling from the observed baseline. Real behavior may differ.
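Stated symbolically, the general projection formulas take the form below; the symbols N, B, s, t_rpc, and t_e2e (total container count, batch size, serialized bytes per container ID, and the baseline per-container times) are notation introduced here rather than figures from the note:

$$
\text{RPC calls} = \left\lceil \frac{N}{B} \right\rceil, \qquad
\text{Transfer volume} = N \cdot s, \qquad
T_{\text{rpc}} = N \cdot t_{\text{rpc}}, \qquad
T_{\text{e2e}} = N \cdot t_{\text{e2e}}
$$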
On your 3rd and last point related to alerts/notifications: currently, we don't have any alert framework inherently supported in Recon. I discussed and raised this point earlier, but since users can also rely on various external alerting mechanisms based on Hadoop metrics, we didn't carry the idea forward. So, based on existing Hadoop metrics, we can have the following metrics:
…doop.hdds.scm.server.SCMClientProtocolServer#getExistContainerWithPipelinesInBatch.
…targeted sync execution.
    ReconStorageContainerSyncHelper.SyncAction action =
        containerSyncHelper.decideSyncAction();
    switch (action) {
      case FULL_SNAPSHOT:
How about splitting the PR into two? This would help to reach consensus asap and merge TARGETED_SYNC.
PR-1 -> TARGETED_SYNC &
PR-2 -> FULL_SNAPSHOT
I think it might not be feasible to open a new PR just for the full snapshot, because both are part of the same scheduler thread and I have already removed the separate scheduler which was blindly doing the full SCM DB download (full snapshot) every 24 hours. Now, every 12 hours a check decides between full snapshot and targeted sync, and the full snapshot threshold has also been increased from 10k to 1M.
    switch (action) {
      case FULL_SNAPSHOT:
        LOG.info("Tiered sync decision: FULL_SNAPSHOT. "
            + "Replacing Recon SCM DB with fresh SCM checkpoint.");
With TARGETED_SYNC running every hour by default (the interval is configurable and can be increased or decreased) and keeping drift minimal, FULL_SNAPSHOT should rarely be needed in a steady state. The only real case that can lead to FULL_SNAPSHOT is Recon being completely down for long hours to days; when it comes back up, the drift will be larger.
Since FULL_SNAPSHOT is considerably resource-intensive, how about introducing a command line option for users?
ozone admin recon trigger-scm-snapshot
# prints:
# WARNING: This downloads the full SCM checkpoint. On large clusters this
# can be several GB and take minutes. SCM will be under I/O load.
# Are you sure? (yes/no): _
ozone admin recon scm-snapshot-status
# Output:
# Status: IN_PROGRESS
# Started: 14:28:03
# Phase: Downloading checkpoint from SCM (file transfer)
# Duration so far: 4m 32s
# Cancel: run 'ozone admin recon cancel-scm-snapshot' (safe until DB swap begins)
ozone admin recon cancel-scm-snapshot
# SCM streams the RocksDB checkpoint files over RPC/HTTP to Recon. This is interruptible:
# Sample Code showing the cancellation task.
// scmServiceProvider.getSCMDBSnapshot() is a blocking call
DBCheckpoint dbSnapshot = scmServiceProvider.getSCMDBSnapshot();
If you run this on a dedicated Future / Thread, you can call future.cancel(true) which sends a thread interrupt. The underlying socket read will throw InterruptedIOException and the download stops immediately. This is efficient and clean — no partial writes to the final DB location, the temp checkpoint dir can be deleted.
Future<?> snapshotFuture = executor.submit(() -> {
DBCheckpoint snap = scmServiceProvider.getSCMDBSnapshot();
initializeNewRdbStore(snap.getCheckpointLocation().toFile());
});
// cancel command arrives:
snapshotFuture.cancel(true); // interrupts the download thread
Based on the current logic, we have increased the threshold from 10K to 1M, and we have also added metrics for when Recon falls behind by 1M containers, which is a very rare case. Triggering it manually via CLI would bring the same impact to the user's environment, since downloading the full SCM DB may take time either way.
My thinking is not to take any auto-trigger path, which would considerably increase SCM's I/O load. What if there was an issue due to the auto trigger? How would users recover from it and understand the status/progression?
@rakeshadr One refinement I can think of is to make the automatic FULL_SNAPSHOT action configurable/opt-in. When the non-OPEN drift exceeds the configured threshold, the default behavior can be to log a warning and expose it through metrics instead of downloading the SCM checkpoint automatically. This PR already adds Recon SCM container sync metrics such as full snapshot trigger count, last non-OPEN drift, interval since last full snapshot event, per-state drift, targeted sync status, and targeted sync duration, so operators can alert on the condition.
For the explicit trigger/status/cancel flow, instead of adding CLI support directly, I think it fits better as Recon admin REST APIs because Recon already has
TriggerDBSyncEndpoint for OM sync and targeted SCM sync:
- `POST /api/v1/triggerdbsync/scm/snapshot` to trigger full SCM DB snapshot download
- `GET /api/v1/triggerdbsync/scm/snapshot/status` to return status/phase/start time/duration
- `POST /api/v1/triggerdbsync/scm/snapshot/cancel` to cancel while the checkpoint download is still in progress
A CLI command can later be a thin wrapper over these REST APIs if needed.
If you think, it makes sense, then I can update this PR to avoid automatic full snapshot by default and keep the large-drift condition visible through logs + metrics. The explicit trigger/status/cancel REST API may be better handled as a follow-up JIRA unless you feel it should be part of this change.
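A rough sketch of how such endpoints could sit next to the existing TriggerDBSyncEndpoint; the class name TriggerSCMSnapshotEndpoint and the ScmSnapshotCoordinator helper are hypothetical, not part of this PR:

```java
import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

@Path("/triggerdbsync/scm/snapshot")
@Produces(MediaType.APPLICATION_JSON)
public class TriggerSCMSnapshotEndpoint {

  // Hypothetical helper that owns the background download, its status, and cancellation.
  private final ScmSnapshotCoordinator coordinator;

  @Inject
  TriggerSCMSnapshotEndpoint(ScmSnapshotCoordinator coordinator) {
    this.coordinator = coordinator;
  }

  @POST
  public Response trigger() {
    // Start the full SCM DB checkpoint download on a background thread.
    coordinator.startFullSnapshot();
    return Response.ok().build();
  }

  @GET
  @Path("/status")
  public Response status() {
    // Report phase (e.g. DOWNLOADING / SWAPPING / DONE), start time, and duration.
    return Response.ok(coordinator.getStatus()).build();
  }

  @POST
  @Path("/cancel")
  public Response cancel() {
    // Only safe while the checkpoint download is still in progress.
    return Response.ok(coordinator.cancel()).build();
  }
}
```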
I have created a separate JIRA for full snapshot sync: HDDS-15165
What changes were proposed in this pull request?
This PR addresses the issue of large deviations between Recon and SCM related to container counts, container IDs, and their respective states.
Largely the PR addresses two issues:
1. Containers that exist in SCM but are missing from Recon entirely.
2. Containers stuck in a stale state such as OPEN, CLOSING, or QUASI_CLOSED after SCM has already advanced them.
The implementation now has two SCM sync mechanisms:
Full snapshot sync
This remains the safety net. Recon replaces its SCM DB view from a fresh SCM
checkpoint on the existing snapshot schedule.
Incremental targeted sync
This now runs on its own schedule and decides between:
NO_ACTION, TARGETED_SYNC, or FULL_SNAPSHOT.
The targeted sync is implemented as a four-pass workflow:
Pass 1: CLOSED
Add missing CLOSED containers and correct stale OPEN/CLOSING/QUASI_CLOSED containers to CLOSED.

Pass 2: OPEN
Add missing OPEN containers only. No downgrades and no state correction.

Pass 3: QUASI_CLOSED
Add missing QUASI_CLOSED containers and correct stale OPEN/CLOSING containers up to QUASI_CLOSED.

Pass 4: retirement for DELETING/DELETED
Start from Recon's own CLOSED and QUASI_CLOSED containers and move them forward only when SCM explicitly returns DELETING or DELETED.
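A condensed sketch of the pass ordering above; the method names are illustrative only, not the PR's actual helpers:

```java
// Four targeted-sync passes, ordered so state corrections only ever move
// containers forward in the lifecycle.
void runTargetedSync() {
  // Pass 1: add missing CLOSED containers; correct stale
  // OPEN/CLOSING/QUASI_CLOSED containers up to CLOSED.
  syncState(LifeCycleState.CLOSED, true /* allow state correction */);
  // Pass 2: add missing OPEN containers only; never downgrade or correct.
  syncState(LifeCycleState.OPEN, false /* add-only */);
  // Pass 3: add missing QUASI_CLOSED containers; correct stale
  // OPEN/CLOSING containers up to QUASI_CLOSED.
  syncState(LifeCycleState.QUASI_CLOSED, true /* allow state correction */);
  // Pass 4: starting from Recon's own CLOSED/QUASI_CLOSED containers, retire
  // them only when SCM explicitly reports DELETING or DELETED.
  retireContainersDeletedInScm();
}
```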
Root Causes Addressed
1. DN-report path could not advance beyond CLOSING.
2. Sync used to be add-only for CLOSED.
3. Recon could miss OPEN and QUASI_CLOSED containers entirely.
4. Recon never retired stale live states based on SCM deletion progress.
5. SCM batch API could drop containers when pipeline lookup failed.
6. Recon add path and SCM state manager were not null-pipeline safe.
7. Open-container count per pipeline could drift.
8. decideSyncAction() became a real tiered decision.

Current logic (see the sketch after this list):
- Non-OPEN drift greater than ozone.recon.scm.container.threshold: FULL_SNAPSHOT
- Drift greater than 0: TARGETED_SYNC, covering OPEN, QUASI_CLOSED, CLOSED, and the remainder
- Any per-state drift above ozone.recon.scm.per.state.drift.threshold: TARGETED_SYNC
- Otherwise: NO_ACTION

9. Incremental sync got its own schedule.
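As a rough illustration of item 8, the tiered decision could be shaped like this; the field names standing in for the two configuration keys and the drift helpers are illustrative, not the PR's exact code:

```java
// Tiered decision: large non-OPEN drift escalates to a full snapshot, any
// remaining drift triggers a targeted sync, otherwise do nothing.
SyncAction decideSyncAction(Map<LifeCycleState, Long> driftByState) {
  long nonOpenDrift = driftByState.entrySet().stream()
      .filter(e -> e.getKey() != LifeCycleState.OPEN)
      .mapToLong(Map.Entry::getValue)
      .sum();
  if (nonOpenDrift > containerThreshold) {          // ozone.recon.scm.container.threshold
    return SyncAction.FULL_SNAPSHOT;
  }
  boolean perStateDrift = driftByState.values().stream()
      .anyMatch(d -> d > perStateDriftThreshold);   // ozone.recon.scm.per.state.drift.threshold
  if (nonOpenDrift > 0 || perStateDrift) {
    return SyncAction.TARGETED_SYNC;
  }
  return SyncAction.NO_ACTION;
}
```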
Additional update: Recon SCM sync observability
This patch also adds Hadoop Metrics2/JMX metrics for the Recon SCM container sync decision path. These metrics help operators understand when Recon escalates to a full SCM DB snapshot, how large the drift was when that happened, and whether targeted sync is keeping up.
New metrics include:
Targeted sync status gauge values: 0 = idle, 1 = in progress, 2 = success, 3 = failure.
These metrics are intended to help tune:
- ozone.recon.scm.container.threshold
- ozone.recon.scm.container.sync.task.interval.delay
- ozone.recon.scm.per.state.drift.threshold

What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-14758
How was this patch tested?
Tests added or updated
- TestReconSCMContainerSyncIntegration.java
- TestReconStorageContainerSyncHelper.java
- TestTriggerDBSyncEndpoint.java
- TestUnhealthyContainersDerbyPerformance.java
- TestReconContainerHealthSummaryEndToEnd.java