HDDS-14829. Split snapshot diff job into separate rpc calls for submitting job and getting the report.#9985
HDDS-14829. Split snapshot diff job into separate rpc calls for submitting job and getting the report.#9985SaketaChalamchala wants to merge 3 commits intoapache:masterfrom
Conversation
…tting job and getting the report.
jojochuang
left a comment
There was a problem hiding this comment.
make sense to me from a quick look
| private boolean cancel; | ||
|
|
||
| @CommandLine.Option(names = {"-r", "--get-report"}, | ||
| description = "Get the snapshot diff report is available; " + |
There was a problem hiding this comment.
| description = "Get the snapshot diff report is available; " + | |
| description = "Get the snapshot diff report if available; " + |
| org.apache.hadoop.ozone.snapshot.SnapshotDiffResponse response; | ||
|
|
||
| if (snapshotDiffRequest.hasForceFullDiff() && snapshotDiffRequest.hasDisableNativeDiff()) { | ||
| response = impl.snapshotDiff( |
There was a problem hiding this comment.
call impl.submitSnapshotDiff() instead?
There was a problem hiding this comment.
We cannot do that because this call would be from an older client that expects to submit the snapshot diff and receive the report when the snapshot diff job is done. impl.submitSnapshotDiff() only submits the snapshot diff and will not return the result and the older client will not have the --get-report option to get the report.
| "volume can't be null or empty."); | ||
| Preconditions.checkArgument(StringUtils.isNotBlank(bucketName), | ||
| "bucket can't be null or empty."); | ||
| return ozoneManagerClient.snapshotDiff(volumeName, bucketName, |
There was a problem hiding this comment.
call ozoneManagerClient.submitSnapshotDiff() instead?
There was a problem hiding this comment.
We can't do that because of the same reason mentioned above.
| } else if (getReport) { | ||
| getSnapshotDiff(client.getObjectStore(), volumeName, bucketName); | ||
| } else { | ||
| submitSnapshotDiff(client.getObjectStore(), volumeName, bucketName); |
| cancelSnapshotDiff(client.getObjectStore(), volumeName, bucketName); | ||
| } else { | ||
| } else if (getReport) { | ||
| getSnapshotDiff(client.getObjectStore(), volumeName, bucketName); |
jojochuang
left a comment
There was a problem hiding this comment.
Compatibility:
New client vs. Old OM:
getReport: old code path, no change.
submitSnapshotDiff: not recognized, will fail.
Old client vs. New OM:
getReport: old code path, no change.
getSnapshotDiff: old code path, deprecated, but will still work as is.
Nice to have: client submitSnapshotDiff falls back to getSnapshotDiff() if OM doesn't recognize it.
| optional bool disableNativeDiff = 8; | ||
| } | ||
|
|
||
| message SubmitSnapshotDiffRequest { |
There was a problem hiding this comment.
maybe mark message SnapshotDiffRequest fields forceFullDiff, disableNativeDiff
as [deprecated = true]
There was a problem hiding this comment.
Pull request overview
This PR refactors the snapshot diff workflow to decouple job submission from report retrieval by introducing a dedicated SubmitSnapshotDiff RPC, and updating the CLI/tests/docs to follow the new async flow (submit then --get-report polling).
Changes:
- Add new OM protocol RPC + client plumbing for
submitSnapshotDiff, and a report-onlysnapshotDiffoverload (token/pageSize only). - Update server-side snapshot diff manager to support report-only retrieval and introduce
NOT_FOUNDjob status. - Update
ozone sh snapshot diffCLI, smoketests, and docs for the new--get-reportflow.
Reviewed changes
Copilot reviewed 20 out of 23 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| hadoop-ozone/s3gateway/src/test/java/org/apache/hadoop/ozone/client/ClientProtocolStub.java | Updates test stub to include new RPC method signatures. |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/protocolPB/TestOzoneManagerRequestHandler.java | Adds unit test validating routing between legacy vs report-only snapshotDiff calls. |
| hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java | Adds tests for new submit/report-only behavior and edge statuses. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerRequestHandler.java | Adds SubmitSnapshotDiff request handling and legacy/report-only routing logic. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java | Implements report-only retrieval and new submit entrypoint; adds NOT_FOUND handling. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java | Exposes new server-side submit/report-only APIs with auditing + ACL checks. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java | Adds new entrypoints for report-only retrieval and job submission. |
| hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/audit/OMAction.java | Adds new audit action for submit snapshot diff. |
| hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto | Introduces SubmitSnapshotDiff command + request/response protos and NOT_FOUND status. |
| hadoop-ozone/dist/src/main/smoketest/snapshot/snapshot-sh.robot | Updates smoketests to submit then poll with --get-report. |
| hadoop-ozone/dist/src/main/smoketest/omha/data-validation-after-om-bootstrap.robot | Updates bootstrap validation flow for new submit/get-report split. |
| hadoop-ozone/dist/src/main/smoketest/compatibility/write.robot | Adds compatibility test logic supporting both new and legacy clients. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/snapshot/SubmitSnapshotDiffResponse.java | Adds new POJO response for submit RPC user feedback. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/snapshot/SnapshotDiffResponse.java | Adds NOT_FOUND and report-only messaging behavior. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java | Categorizes new RPC as read-only and prevents follower routing. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java | Adds client-side translator plumbing for new submit RPC + report-only snapshotDiff. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java | Adds new protocol methods / marks legacy snapshotDiff deprecated. |
| hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/helpers/SnapshotDiffJob.java | Persists/restores job failure reason into proto message field. |
| hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java | Exposes new client APIs for submit + report-only diff retrieval. |
| hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/protocol/ClientProtocol.java | Adds new client protocol methods and deprecates legacy one. |
| hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/ObjectStore.java | Adds ObjectStore wrappers for new submit/report-only flow. |
| hadoop-ozone/cli-shell/src/main/java/org/apache/hadoop/ozone/shell/snapshot/SnapshotDiffHandler.java | Updates CLI to use submit vs --get-report report retrieval. |
| hadoop-hdds/docs/content/feature/Snapshot.md | Updates documentation to describe submit vs get-report usage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| case FAILED: | ||
| case REJECTED: | ||
| case CANCELLED: | ||
| LOG.info("Submitting snapshot diff generation request for" + | ||
| " volume: {}, bucket: {}, fromSnapshot: {} and toSnapshot: {}", | ||
| volumeName, bucketName, fromSnapshotName, toSnapshotName); | ||
| try { | ||
| updateJobStatus(snapDiffJobKey, prevStatus, IN_PROGRESS); | ||
| snapDiffExecutor.execute(() -> generateSnapshotDiffReport(snapDiffJobKey, snapDiffJob.getJobId(), | ||
| volumeName, bucketName, fromSnapshotName, toSnapshotName, | ||
| forceFullDiff, disableNativeDiff)); | ||
| if (prevStatus == QUEUED) { | ||
| return new SubmitSnapshotDiffResponse(defaultWaitTime, null, null); | ||
| } else { | ||
| return new SubmitSnapshotDiffResponse(defaultWaitTime, prevStatus, prevReason); | ||
| } |
There was a problem hiding this comment.
When resubmitting after a FAILED/REJECTED/CANCELLED job, this reuses the existing SnapshotDiffJob (including its jobId) and starts a new generation run with the same jobId. Since diff report keys and temp paths are jobId-based, stale report entries/temp state from the previous attempt can be mixed into the new run (eg, leftover keys beyond the new total), producing incorrect reports. Consider creating a fresh SnapshotDiffJob with a new jobId (and resetting totals/subStatus/progress/reason) on these resubmits, or explicitly deleting all prior report entries for the previous jobId before starting the new run.
| case FAILED: | |
| case REJECTED: | |
| case CANCELLED: | |
| LOG.info("Submitting snapshot diff generation request for" + | |
| " volume: {}, bucket: {}, fromSnapshot: {} and toSnapshot: {}", | |
| volumeName, bucketName, fromSnapshotName, toSnapshotName); | |
| try { | |
| updateJobStatus(snapDiffJobKey, prevStatus, IN_PROGRESS); | |
| snapDiffExecutor.execute(() -> generateSnapshotDiffReport(snapDiffJobKey, snapDiffJob.getJobId(), | |
| volumeName, bucketName, fromSnapshotName, toSnapshotName, | |
| forceFullDiff, disableNativeDiff)); | |
| if (prevStatus == QUEUED) { | |
| return new SubmitSnapshotDiffResponse(defaultWaitTime, null, null); | |
| } else { | |
| return new SubmitSnapshotDiffResponse(defaultWaitTime, prevStatus, prevReason); | |
| } | |
| LOG.info("Submitting snapshot diff generation request for" + | |
| " volume: {}, bucket: {}, fromSnapshot: {} and toSnapshot: {}", | |
| volumeName, bucketName, fromSnapshotName, toSnapshotName); | |
| try { | |
| updateJobStatus(snapDiffJobKey, prevStatus, IN_PROGRESS); | |
| snapDiffExecutor.execute(() -> generateSnapshotDiffReport( | |
| snapDiffJobKey, snapDiffJob.getJobId(), | |
| volumeName, bucketName, fromSnapshotName, toSnapshotName, | |
| forceFullDiff, disableNativeDiff)); | |
| return new SubmitSnapshotDiffResponse(defaultWaitTime, null, null); | |
| } catch (RejectedExecutionException exception) { | |
| updateJobStatus(snapDiffJobKey, IN_PROGRESS, REJECTED); | |
| LOG.warn("Exceeded the snapDiff parallel requests processing " + | |
| "limit. Rejecting the snapDiff job: {}.", snapDiffJobKey); | |
| return new SubmitSnapshotDiffResponse("Snapshot diff job was rejected because parallel processing of " + | |
| "snap diff jobs exceeded the limit. Please try again in " + defaultWaitTime + " ms."); | |
| } catch (Exception exception) { | |
| updateJobStatusToFailed(snapDiffJobKey, exception.getMessage()); | |
| LOG.error("Snapshot diff job: {} failed.", snapDiffJobKey, exception); | |
| return new SubmitSnapshotDiffResponse("Snapshot diff job submission failed. Reason: " + exception.getMessage()); | |
| } | |
| case FAILED: | |
| case REJECTED: | |
| case CANCELLED: | |
| LOG.info("Resubmitting snapshot diff generation request for" + | |
| " volume: {}, bucket: {}, fromSnapshot: {} and toSnapshot: {}", | |
| volumeName, bucketName, fromSnapshotName, toSnapshotName); | |
| try { | |
| updateJobStatus(snapDiffJobKey, prevStatus, IN_PROGRESS); | |
| final String resubmittedJobId = UUID.randomUUID().toString(); | |
| snapDiffExecutor.execute(() -> generateSnapshotDiffReport( | |
| snapDiffJobKey, resubmittedJobId, | |
| volumeName, bucketName, fromSnapshotName, toSnapshotName, | |
| forceFullDiff, disableNativeDiff)); | |
| return new SubmitSnapshotDiffResponse(defaultWaitTime, prevStatus, prevReason); |
There was a problem hiding this comment.
Addressed this comment.
| private boolean cancel; | ||
|
|
||
| @CommandLine.Option(names = {"-r", "--get-report"}, | ||
| description = "Get the snapshot diff report is available; " + |
There was a problem hiding this comment.
The --get-report option description has a grammar issue ("Get the snapshot diff report is available"). Consider changing it to "Get the snapshot diff report if available" or similar to avoid confusing CLI help output.
| description = "Get the snapshot diff report is available; " + | |
| description = "Get the snapshot diff report if available; " + |
| public SnapshotDiffResponse getSnapshotDiffResponse(final String volume, | ||
| final String bucket, | ||
| final String fromSnapshot, | ||
| final String toSnapshot, | ||
| final String token, |
There was a problem hiding this comment.
The new report-only path (getSnapshotDiffResponse) no longer validates that both snapshots exist and are ACTIVE (the deprecated getSnapshotDiffReport() did optimistic validation before and after generating the response). Without this, a report could be returned for deleted/inactive snapshots or while snapshots are being deleted concurrently. Consider re-adding validateSnapshotsExistAndActive(volume, bucket, fromSnapshot, toSnapshot) before and after calling snapshotDiffManager.getSnapshotDiffReport().
There was a problem hiding this comment.
Added the validation.
| * @deprecated Use {@link #snapshotDiff(String, String, String, String, String, int, boolean, boolean, Boolean)} | ||
| * instead |
There was a problem hiding this comment.
Javadoc @deprecated link references a snapshotDiff(...) signature that does not exist (extra boolean/Boolean params). Update the @link to point to the actual replacement APIs (eg, submitSnapshotDiff(...) for submission and snapshotDiff(..., token, pageSize) for report retrieval) so Javadoc generation and published docs are correct.
| * @deprecated Use {@link #snapshotDiff(String, String, String, String, String, int, boolean, boolean, Boolean)} | |
| * instead | |
| * @deprecated Use {@link #submitSnapshotDiff(String, String, String, String, boolean, boolean)} | |
| * to submit a snapshot diff job and | |
| * {@link #snapshotDiff(String, String, String, String, String, int)} to | |
| * retrieve the diff report instead. |
| * @deprecated Use {@link #snapshotDiff(String, String, String, String, String, int, boolean, boolean, boolean)} | ||
| * instead to submit a new snapshot diff job or just return the existing job result. |
There was a problem hiding this comment.
The @deprecated Javadoc link points to a non-existent snapshotDiff(...) overload (extra boolean parameter). Update the @link to the actual replacement APIs (submitSnapshotDiff for submission and the report-only snapshotDiff overload) to avoid broken/misleading generated docs.
| * @deprecated Use {@link #snapshotDiff(String, String, String, String, String, int, boolean, boolean, boolean)} | |
| * instead to submit a new snapshot diff job or just return the existing job result. | |
| * @deprecated Use {@link #submitSnapshotDiff(String, String, String, String, boolean, boolean)} | |
| * to submit a new snapshot diff job, or | |
| * {@link #snapshotDiff(String, String, String, String, String, int)} to | |
| * return an existing job result. |
| if (cancel) { | ||
| cancelSnapshotDiff(client.getObjectStore(), volumeName, bucketName); | ||
| } else { | ||
| } else if (getReport) { | ||
| getSnapshotDiff(client.getObjectStore(), volumeName, bucketName); | ||
| } else { | ||
| submitSnapshotDiff(client.getObjectStore(), volumeName, bucketName); |
There was a problem hiding this comment.
When --get-report is used, this code path ignores forceFullDiff/disableNativeDiff; and when submitting (no --get-report), token/page-size are ignored. Since these options are still accepted on the CLI, it’s easy for users to think they’re taking effect when they aren’t. Consider validating and rejecting incompatible option combinations (or emitting an explicit warning).
| break; | ||
| case NOT_FOUND: | ||
| str.append("No snapshot diff job found. Job may not have been submitted or was removed during cleanup.\n" + | ||
| "Submit a new snapshot diff job without using the --get-report option.\n"); |
There was a problem hiding this comment.
| "Submit a new snapshot diff job without using the --get-report option.\n"); | |
| "Submit a new snapshot diff job without using the --get-report option.\n"); | |
| if (!isReportOnly) { | |
| LOG.error("Invalid state for submit snapshot diff call"); | |
| } |
|
Same flow with this PR. Probably could be added to the PR description |
What changes were proposed in this pull request?
Currently, the snapshot diff process ties snapshot diff job cleanup with re-submission i.e., new snapshot diff cannot be submitted until old snapshot diff job is cleaned up from the DB as there is a single rpc call used for submitting snap diff and gettng the report. Hence, the cleanup has to be inline with the default wait time (1m) because,
-- When the user tries to get the report/status at T1.5 they do not know about the failure and just resubmit.
-- This will continue to go on at T2.5 and so on...
The proposal is to refactor the snapshot diff workflow by splitting the process into two distinct RPC calls and removing the dependency between cleanup and submission:
RPC Layer: Updated the Ozone Manager protocol to support separate submitSnapshotDiff and getSnapshotDiffReport requests.
Client Side: Updated the ozone sh snapshot diff command to handle the new asynchronous flow. It now submits the job and provides instructions to the user on how to poll for the report using the --get-report option.
Response Handling: Introduced SubmitSnapshotDiffResponse to provide clear feedback to the user regarding the job status (e.g., if a job is already in progress, completed, or newly submitted).
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14829
How was this patch tested?
Acceptance tests and Unit tests.