Skip to content

HDDS-14829. Split snapshot diff job into separate rpc calls for submitting job and getting the report.#9985

Draft
SaketaChalamchala wants to merge 3 commits intoapache:masterfrom
SaketaChalamchala:HDDS-14829
Draft

HDDS-14829. Split snapshot diff job into separate rpc calls for submitting job and getting the report.#9985
SaketaChalamchala wants to merge 3 commits intoapache:masterfrom
SaketaChalamchala:HDDS-14829

Conversation

@SaketaChalamchala
Copy link
Copy Markdown
Contributor

@SaketaChalamchala SaketaChalamchala commented Mar 27, 2026

What changes were proposed in this pull request?

Currently, the snapshot diff process ties snapshot diff job cleanup with re-submission i.e., new snapshot diff cannot be submitted until old snapshot diff job is cleaned up from the DB as there is a single rpc call used for submitting snap diff and gettng the report. Hence, the cleanup has to be inline with the default wait time (1m) because,

  • CLI requests are asked to wait default wait time to retrieve the report/status of the job.
  • If cleanup runs at T0, T1 and so on... snapshot diff job that is submitted at T0.5 which fails immediately is cleaned up from the DB at T1.
    -- When the user tries to get the report/status at T1.5 they do not know about the failure and just resubmit.
    -- This will continue to go on at T2.5 and so on...

The proposal is to refactor the snapshot diff workflow by splitting the process into two distinct RPC calls and removing the dependency between cleanup and submission:

  • Submit Job: Initiates the snapshot diff job and returns immediately with a status.
  • Get Report: A separate call to retrieve the results once the job is complete.

RPC Layer: Updated the Ozone Manager protocol to support separate submitSnapshotDiff and getSnapshotDiffReport requests.
Client Side: Updated the ozone sh snapshot diff command to handle the new asynchronous flow. It now submits the job and provides instructions to the user on how to poll for the report using the --get-report option.
Response Handling: Introduced SubmitSnapshotDiffResponse to provide clear feedback to the user regarding the job status (e.g., if a job is already in progress, completed, or newly submitted).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14829

How was this patch tested?

Acceptance tests and Unit tests.

@errose28 errose28 added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Mar 30, 2026
Copy link
Copy Markdown
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense to me from a quick look

private boolean cancel;

@CommandLine.Option(names = {"-r", "--get-report"},
description = "Get the snapshot diff report is available; " +
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description = "Get the snapshot diff report is available; " +
description = "Get the snapshot diff report if available; " +

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

org.apache.hadoop.ozone.snapshot.SnapshotDiffResponse response;

if (snapshotDiffRequest.hasForceFullDiff() && snapshotDiffRequest.hasDisableNativeDiff()) {
response = impl.snapshotDiff(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call impl.submitSnapshotDiff() instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot do that because this call would be from an older client that expects to submit the snapshot diff and receive the report when the snapshot diff job is done. impl.submitSnapshotDiff() only submits the snapshot diff and will not return the result and the older client will not have the --get-report option to get the report.

"volume can't be null or empty.");
Preconditions.checkArgument(StringUtils.isNotBlank(bucketName),
"bucket can't be null or empty.");
return ozoneManagerClient.snapshotDiff(volumeName, bucketName,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call ozoneManagerClient.submitSnapshotDiff() instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do that because of the same reason mentioned above.

} else if (getReport) {
getSnapshotDiff(client.getObjectStore(), volumeName, bucketName);
} else {
submitSnapshotDiff(client.getObjectStore(), volumeName, bucketName);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test coverage

cancelSnapshotDiff(client.getObjectStore(), volumeName, bucketName);
} else {
} else if (getReport) {
getSnapshotDiff(client.getObjectStore(), volumeName, bucketName);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need test coverage.

Copy link
Copy Markdown
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compatibility:

New client vs. Old OM:
getReport: old code path, no change.
submitSnapshotDiff: not recognized, will fail.

Old client vs. New OM:
getReport: old code path, no change.
getSnapshotDiff: old code path, deprecated, but will still work as is.

Nice to have: client submitSnapshotDiff falls back to getSnapshotDiff() if OM doesn't recognize it.

optional bool disableNativeDiff = 8;
}

message SubmitSnapshotDiffRequest {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe mark message SnapshotDiffRequest fields forceFullDiff, disableNativeDiff
as [deprecated = true]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@SaketaChalamchala SaketaChalamchala marked this pull request as ready for review April 9, 2026 21:38
Copilot AI review requested due to automatic review settings April 9, 2026 21:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the snapshot diff workflow to decouple job submission from report retrieval by introducing a dedicated SubmitSnapshotDiff RPC, and updating the CLI/tests/docs to follow the new async flow (submit then --get-report polling).

Changes:

  • Add new OM protocol RPC + client plumbing for submitSnapshotDiff, and a report-only snapshotDiff overload (token/pageSize only).
  • Update server-side snapshot diff manager to support report-only retrieval and introduce NOT_FOUND job status.
  • Update ozone sh snapshot diff CLI, smoketests, and docs for the new --get-report flow.

Reviewed changes

Copilot reviewed 20 out of 23 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
hadoop-ozone/s3gateway/src/test/java/org/apache/hadoop/ozone/client/ClientProtocolStub.java Updates test stub to include new RPC method signatures.
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/protocolPB/TestOzoneManagerRequestHandler.java Adds unit test validating routing between legacy vs report-only snapshotDiff calls.
hadoop-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotDiffManager.java Adds tests for new submit/report-only behavior and edge statuses.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/protocolPB/OzoneManagerRequestHandler.java Adds SubmitSnapshotDiff request handling and legacy/report-only routing logic.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotDiffManager.java Implements report-only retrieval and new submit entrypoint; adds NOT_FOUND handling.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java Exposes new server-side submit/report-only APIs with auditing + ACL checks.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OmSnapshotManager.java Adds new entrypoints for report-only retrieval and job submission.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/audit/OMAction.java Adds new audit action for submit snapshot diff.
hadoop-ozone/interface-client/src/main/proto/OmClientProtocol.proto Introduces SubmitSnapshotDiff command + request/response protos and NOT_FOUND status.
hadoop-ozone/dist/src/main/smoketest/snapshot/snapshot-sh.robot Updates smoketests to submit then poll with --get-report.
hadoop-ozone/dist/src/main/smoketest/omha/data-validation-after-om-bootstrap.robot Updates bootstrap validation flow for new submit/get-report split.
hadoop-ozone/dist/src/main/smoketest/compatibility/write.robot Adds compatibility test logic supporting both new and legacy clients.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/snapshot/SubmitSnapshotDiffResponse.java Adds new POJO response for submit RPC user feedback.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/snapshot/SnapshotDiffResponse.java Adds NOT_FOUND and report-only messaging behavior.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/OmUtils.java Categorizes new RPC as read-only and prevents follower routing.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java Adds client-side translator plumbing for new submit RPC + report-only snapshotDiff.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java Adds new protocol methods / marks legacy snapshotDiff deprecated.
hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/helpers/SnapshotDiffJob.java Persists/restores job failure reason into proto message field.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java Exposes new client APIs for submit + report-only diff retrieval.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/protocol/ClientProtocol.java Adds new client protocol methods and deprecates legacy one.
hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/ObjectStore.java Adds ObjectStore wrappers for new submit/report-only flow.
hadoop-ozone/cli-shell/src/main/java/org/apache/hadoop/ozone/shell/snapshot/SnapshotDiffHandler.java Updates CLI to use submit vs --get-report report retrieval.
hadoop-hdds/docs/content/feature/Snapshot.md Updates documentation to describe submit vs get-report usage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +646 to +661
case FAILED:
case REJECTED:
case CANCELLED:
LOG.info("Submitting snapshot diff generation request for" +
" volume: {}, bucket: {}, fromSnapshot: {} and toSnapshot: {}",
volumeName, bucketName, fromSnapshotName, toSnapshotName);
try {
updateJobStatus(snapDiffJobKey, prevStatus, IN_PROGRESS);
snapDiffExecutor.execute(() -> generateSnapshotDiffReport(snapDiffJobKey, snapDiffJob.getJobId(),
volumeName, bucketName, fromSnapshotName, toSnapshotName,
forceFullDiff, disableNativeDiff));
if (prevStatus == QUEUED) {
return new SubmitSnapshotDiffResponse(defaultWaitTime, null, null);
} else {
return new SubmitSnapshotDiffResponse(defaultWaitTime, prevStatus, prevReason);
}
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When resubmitting after a FAILED/REJECTED/CANCELLED job, this reuses the existing SnapshotDiffJob (including its jobId) and starts a new generation run with the same jobId. Since diff report keys and temp paths are jobId-based, stale report entries/temp state from the previous attempt can be mixed into the new run (eg, leftover keys beyond the new total), producing incorrect reports. Consider creating a fresh SnapshotDiffJob with a new jobId (and resetting totals/subStatus/progress/reason) on these resubmits, or explicitly deleting all prior report entries for the previous jobId before starting the new run.

Suggested change
case FAILED:
case REJECTED:
case CANCELLED:
LOG.info("Submitting snapshot diff generation request for" +
" volume: {}, bucket: {}, fromSnapshot: {} and toSnapshot: {}",
volumeName, bucketName, fromSnapshotName, toSnapshotName);
try {
updateJobStatus(snapDiffJobKey, prevStatus, IN_PROGRESS);
snapDiffExecutor.execute(() -> generateSnapshotDiffReport(snapDiffJobKey, snapDiffJob.getJobId(),
volumeName, bucketName, fromSnapshotName, toSnapshotName,
forceFullDiff, disableNativeDiff));
if (prevStatus == QUEUED) {
return new SubmitSnapshotDiffResponse(defaultWaitTime, null, null);
} else {
return new SubmitSnapshotDiffResponse(defaultWaitTime, prevStatus, prevReason);
}
LOG.info("Submitting snapshot diff generation request for" +
" volume: {}, bucket: {}, fromSnapshot: {} and toSnapshot: {}",
volumeName, bucketName, fromSnapshotName, toSnapshotName);
try {
updateJobStatus(snapDiffJobKey, prevStatus, IN_PROGRESS);
snapDiffExecutor.execute(() -> generateSnapshotDiffReport(
snapDiffJobKey, snapDiffJob.getJobId(),
volumeName, bucketName, fromSnapshotName, toSnapshotName,
forceFullDiff, disableNativeDiff));
return new SubmitSnapshotDiffResponse(defaultWaitTime, null, null);
} catch (RejectedExecutionException exception) {
updateJobStatus(snapDiffJobKey, IN_PROGRESS, REJECTED);
LOG.warn("Exceeded the snapDiff parallel requests processing " +
"limit. Rejecting the snapDiff job: {}.", snapDiffJobKey);
return new SubmitSnapshotDiffResponse("Snapshot diff job was rejected because parallel processing of " +
"snap diff jobs exceeded the limit. Please try again in " + defaultWaitTime + " ms.");
} catch (Exception exception) {
updateJobStatusToFailed(snapDiffJobKey, exception.getMessage());
LOG.error("Snapshot diff job: {} failed.", snapDiffJobKey, exception);
return new SubmitSnapshotDiffResponse("Snapshot diff job submission failed. Reason: " + exception.getMessage());
}
case FAILED:
case REJECTED:
case CANCELLED:
LOG.info("Resubmitting snapshot diff generation request for" +
" volume: {}, bucket: {}, fromSnapshot: {} and toSnapshot: {}",
volumeName, bucketName, fromSnapshotName, toSnapshotName);
try {
updateJobStatus(snapDiffJobKey, prevStatus, IN_PROGRESS);
final String resubmittedJobId = UUID.randomUUID().toString();
snapDiffExecutor.execute(() -> generateSnapshotDiffReport(
snapDiffJobKey, resubmittedJobId,
volumeName, bucketName, fromSnapshotName, toSnapshotName,
forceFullDiff, disableNativeDiff));
return new SubmitSnapshotDiffResponse(defaultWaitTime, prevStatus, prevReason);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this comment.

private boolean cancel;

@CommandLine.Option(names = {"-r", "--get-report"},
description = "Get the snapshot diff report is available; " +
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The --get-report option description has a grammar issue ("Get the snapshot diff report is available"). Consider changing it to "Get the snapshot diff report if available" or similar to avoid confusing CLI help output.

Suggested change
description = "Get the snapshot diff report is available; " +
description = "Get the snapshot diff report if available; " +

Copilot uses AI. Check for mistakes.
Comment on lines +864 to +868
public SnapshotDiffResponse getSnapshotDiffResponse(final String volume,
final String bucket,
final String fromSnapshot,
final String toSnapshot,
final String token,
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new report-only path (getSnapshotDiffResponse) no longer validates that both snapshots exist and are ACTIVE (the deprecated getSnapshotDiffReport() did optimistic validation before and after generating the response). Without this, a report could be returned for deleted/inactive snapshots or while snapshots are being deleted concurrently. Consider re-adding validateSnapshotsExistAndActive(volume, bucket, fromSnapshot, toSnapshot) before and after calling snapshotDiffManager.getSnapshotDiffReport().

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the validation.

Comment on lines +797 to +798
* @deprecated Use {@link #snapshotDiff(String, String, String, String, String, int, boolean, boolean, Boolean)}
* instead
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc @deprecated link references a snapshotDiff(...) signature that does not exist (extra boolean/Boolean params). Update the @link to point to the actual replacement APIs (eg, submitSnapshotDiff(...) for submission and snapshotDiff(..., token, pageSize) for report retrieval) so Javadoc generation and published docs are correct.

Suggested change
* @deprecated Use {@link #snapshotDiff(String, String, String, String, String, int, boolean, boolean, Boolean)}
* instead
* @deprecated Use {@link #submitSnapshotDiff(String, String, String, String, boolean, boolean)}
* to submit a snapshot diff job and
* {@link #snapshotDiff(String, String, String, String, String, int)} to
* retrieve the diff report instead.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +1251 to +1252
* @deprecated Use {@link #snapshotDiff(String, String, String, String, String, int, boolean, boolean, boolean)}
* instead to submit a new snapshot diff job or just return the existing job result.
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The @deprecated Javadoc link points to a non-existent snapshotDiff(...) overload (extra boolean parameter). Update the @link to the actual replacement APIs (submitSnapshotDiff for submission and the report-only snapshotDiff overload) to avoid broken/misleading generated docs.

Suggested change
* @deprecated Use {@link #snapshotDiff(String, String, String, String, String, int, boolean, boolean, boolean)}
* instead to submit a new snapshot diff job or just return the existing job result.
* @deprecated Use {@link #submitSnapshotDiff(String, String, String, String, boolean, boolean)}
* to submit a new snapshot diff job, or
* {@link #snapshotDiff(String, String, String, String, String, int)} to
* return an existing job result.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 116 to +121
if (cancel) {
cancelSnapshotDiff(client.getObjectStore(), volumeName, bucketName);
} else {
} else if (getReport) {
getSnapshotDiff(client.getObjectStore(), volumeName, bucketName);
} else {
submitSnapshotDiff(client.getObjectStore(), volumeName, bucketName);
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When --get-report is used, this code path ignores forceFullDiff/disableNativeDiff; and when submitting (no --get-report), token/page-size are ignored. Since these options are still accepted on the CLI, it’s easy for users to think they’re taking effect when they aren’t. Consider validating and rejecting incompatible option combinations (or emitting an explicit warning).

Copilot uses AI. Check for mistakes.
@SaketaChalamchala SaketaChalamchala marked this pull request as draft April 9, 2026 22:53
break;
case NOT_FOUND:
str.append("No snapshot diff job found. Job may not have been submitted or was removed during cleanup.\n" +
"Submit a new snapshot diff job without using the --get-report option.\n");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Submit a new snapshot diff job without using the --get-report option.\n");
"Submit a new snapshot diff job without using the --get-report option.\n");
if (!isReportOnly) {
LOG.error("Invalid state for submit snapshot diff call");
}

@sadanand48
Copy link
Copy Markdown
Contributor

Same flow with this PR. Probably could be added to the PR description

T0: Replication job submits a snapshot diff job and is asked to check back in 1min for the report. 
T0.2: The job fails immediately and the status of the job is updated as FAILED in the DB
T0.5: But immediately after that cleanup runs and removes the job from the DB
T1: After a minute the user runs the snapshot diff with get-report again as instructed in Step 1 but since the job disappears from the DB it will output "No snapshot diff job found. Job may not have been submitted or was removed during cleanup"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants