[refactor] Refactor exceptions by zhongkechen · Pull Request #45 · aws/aws-durable-execution-sdk-java

zhongkechen · 2026-02-03T00:50:39Z

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Issue Link, if available

#27

Description

Added a DurableOperationException class as the base class of all operation exceptions.
- Added Operation and ErrorObject fields to DurableOperationException so users can access the operation details and the raw error
Moved all exception utilities to ExceptionHelper.
- Added a utility, unwrapCompletableFuture, to unwrap exceptions wrapped by CompletionException
- Added a utility, buildErrorObject, to build an ErrorObject from a customer exception
Added a base class for all callback exceptions.
- Added callbackId to the base class so all callback exceptions have access to the callback ID
TestResult.getFailedOperations now returns all operations that could cause the execution to fail (not limited to operations with FAILED status).
Updated README.md to reflect the new exception hierarchy.
Encapsulated the conversion of StepInterruptedException to ErrorObject inside StepInterruptedException.
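
For readers unfamiliar with why an unwrapping utility is useful: `CompletableFuture.join()` reports a failure as a `CompletionException` whose cause is the original exception. The sketch below shows the general idea only. `UnwrapSketch` and its methods are hypothetical names for illustration; the actual signature and behavior of `ExceptionHelper.unwrapCompletableFuture` in this PR may differ.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

// Hypothetical stand-in, not the SDK's ExceptionHelper.
final class UnwrapSketch {
    private UnwrapSketch() {}

    /** Strips CompletionException/ExecutionException wrappers and returns the underlying cause. */
    static Throwable unwrap(Throwable t) {
        Throwable current = t;
        while ((current instanceof CompletionException || current instanceof ExecutionException)
                && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Runs a failing async task and returns the simple class name of the unwrapped exception. */
    static String demo() {
        CompletableFuture<String> future = CompletableFuture.supplyAsync(() -> {
            throw new IllegalStateException("step failed");
        });
        try {
            future.join();
            return "no exception";
        } catch (CompletionException e) {
            // join() wraps the supplier's exception; unwrap to get the original back.
            return unwrap(e).getClass().getSimpleName();
        }
    }
}
```

Exceptions that are not wrappers pass through `unwrap` unchanged, so callers can apply it unconditionally.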

Demo/Screenshots

Checklist

I have filled out every section of the PR template
I have thoroughly tested this change

Testing

Unit Tests

Have unit tests been written for these changes? Yes

Integration Tests

Have integration tests been written for these changes? Updated

Examples

Has a new example been added for the change? (if applicable) No change

dhegberg · 2026-02-03T18:14:09Z

README.md

+| - `DurableOperationException`          | General operation exception                            |                                                                     |
+| -- `StepException`                     | General Step exception                                 |                                                                     |
+| --- `StepFailedException`              | Step exhausted all retry attempts                      | Catch to implement fallback logic or let execution fail             |
+| --- `StepInterruptedException`         | `AT_MOST_ONCE` step was interrupted before completion  | Implement manual recovery (check if operation completed externally) |


"check if operation completed externally"

What is being suggested here? This reads to me like the exception may need to be validated against data from the APIs, is that the case?

For a step with AT_MOST_ONCE semantics, the step is not retried because it may have side effects (e.g. a non-idempotent remote call), so retries are considered unsafe. Users have to check in their own system whether the side effect already happened, or whether a retry is safe, in order to recover from the situation.

In this PR I updated only the hierarchy of these exceptions, no semantic changes. @maschnetwork @phipag might have more details to add here.

I'd recommend then that we clarify that "check if operation completed externally" means validate that the non-idempotent mutation in the customer system occurred as expected (with more user friendly wording than mine).

Yes, this explanation is correct @zhongkechen. This exception means that we failed to checkpoint a success after starting the step. For AT_MOST_ONCE semantics the idempotency guarantee would be violated if we re-executed the step logic. Hence, we throw and this manual recovery is needed to understand what we need to do in that case (see payment example below that table).

Updated to be consistent. However, how to handle these exceptions still seems unclear. We'll keep improving the documentation.
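
To make "check if operation completed externally" concrete, here is a minimal sketch of the manual recovery discussed in this thread. Every type in it is a local stand-in defined for illustration (including the `StepInterruptedException` class, which only mirrors the SDK's name); `PaymentSystem`, `wasCharged`, and `recover` are hypothetical and not part of the SDK.

```java
import java.util.HashSet;
import java.util.Set;

// Local stand-in for the SDK exception of the same name (illustration only).
class StepInterruptedException extends RuntimeException {
    StepInterruptedException(String message) {
        super(message);
    }
}

// Hypothetical customer system with a non-idempotent mutation.
class PaymentSystem {
    private final Set<String> charged = new HashSet<>();

    void charge(String orderId) {
        charged.add(orderId);
    }

    boolean wasCharged(String orderId) {
        return charged.contains(orderId);
    }
}

final class InterruptedStepRecovery {
    private InterruptedStepRecovery() {}

    /** Runs an at-most-once step; on interruption, asks the external system what actually happened. */
    static String recover(PaymentSystem payments, String orderId, Runnable atMostOnceStep) {
        try {
            atMostOnceStep.run();
            return "COMPLETED";
        } catch (StepInterruptedException e) {
            // The step started, but no success was checkpointed. Re-running it could
            // charge twice, so check the customer system instead of retrying blindly.
            return payments.wasCharged(orderId) ? "ALREADY_CHARGED" : "NOT_CHARGED";
        }
    }
}
```

What to do with each outcome (treat as success, retry, or escalate) is a decision for the application and is deliberately left out of the sketch.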

dhegberg · 2026-02-03T18:17:06Z

README.md

+| --- `InvokeTimedoutException`          | Chained invocation timed out                           | Retry the chained invocation or propagate failure                   |
+| --- `InvokeStoppedException`           | Chained invocation stopped                             | Handle the error or propagate failure                               |
+| -- `CallbackException`                 | General callback exception                             |                                                                     |
+| --- `CallbackTimeoutException`         | Callback exceeded its timeout duration                 | Implement fallback logic or escalation                              |


What is the difference between: "Catch to implement fallback logic or let execution fail", "Handle the error or propagate failure" and "Implement fallback logic or escalation" from the user's perspective?

A callback cannot be retried once it has timed out. Users have to create a new callback if they want to try again; I think that's why we call it fallback instead of retry.

InvokeStoppedException is thrown when a user stops the execution of their child Lambda function (via the StopDurableExecution API), if the invoked function is durable.

What I was getting at is that the 3 wordings I listed all sound like examples of general non-retryable exceptions. If there isn't more specific guidance on handling the exceptions, we should unify the wording.

phipag

Thanks for sending this PR @zhongkechen. Just some minor comments.

One idea: Do you think it might be worth differentiating between retryable and non-retryable SDK errors? I wonder if this might be useful for a feature in the retry strategy where users can define their own exceptions (by inheriting from something like UnrecoverableException) that are excluded by the default retry strategy presets.
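
A rough sketch of what this idea could look like, purely as an assumption for discussion: a marker base class that a retry loop refuses to retry. `UnrecoverableException` is the name floated above; `RetrySketch`, `shouldRetry`, and `runWithRetry` are hypothetical and do not describe the SDK's retry strategy API.

```java
import java.util.function.Supplier;

// Hypothetical marker base class: users would extend it for errors that must never be retried.
class UnrecoverableException extends RuntimeException {
    UnrecoverableException(String message) {
        super(message);
    }
}

final class RetrySketch {
    private RetrySketch() {}

    /** Retry only recoverable errors, and only while attempts remain. */
    static boolean shouldRetry(Throwable error, int attempt, int maxAttempts) {
        return !(error instanceof UnrecoverableException) && attempt < maxAttempts;
    }

    /** Runs the action, retrying immediately (no backoff, for brevity) until shouldRetry says stop. */
    static <T> T runWithRetry(Supplier<T> action, int maxAttempts) {
        int attempt = 1;
        while (true) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (!shouldRetry(e, attempt, maxAttempts)) {
                    throw e;
                }
                attempt++;
            }
        }
    }
}
```

With this shape, a user-defined subclass of `UnrecoverableException` fails on the first attempt, while any other runtime exception is retried up to `maxAttempts` times.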

README.md

sdk/src/main/java/com/amazonaws/lambda/durable/util/ExceptionHelper.java

sdk-testing/src/main/java/com/amazonaws/lambda/durable/testing/TestResult.java

maschnetwork

Great improvements, @zhongkechen. Some comments around consistency.

maschnetwork · 2026-02-04T13:16:01Z

sdk/src/main/java/com/amazonaws/lambda/durable/exception/CallbackException.java

+import software.amazon.awssdk.services.lambda.model.Operation;
+
+public class CallbackException extends DurableOperationException {
+    private final String callbackId;


nit: Let's add an example showing how users can catch CallbackException and use getCallbackId() for error handling (can be added later)
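
A possible shape for such an example, as a sketch only. The classes below are simplified local stand-ins that mirror the hierarchy in this PR (`CallbackTimeoutException` extends `CallbackException` extends `DurableOperationException`); their constructors are assumptions, since the real classes also carry an `Operation`, and `CallbackHandling.describe` is a hypothetical helper.

```java
// Simplified stand-ins for the SDK exception classes (constructors are assumptions).
class DurableOperationException extends RuntimeException {
    DurableOperationException(String message) {
        super(message);
    }
}

class CallbackException extends DurableOperationException {
    private final String callbackId;

    CallbackException(String callbackId, String message) {
        super(message);
        this.callbackId = callbackId;
    }

    String getCallbackId() {
        return callbackId;
    }
}

class CallbackTimeoutException extends CallbackException {
    CallbackTimeoutException(String callbackId) {
        super(callbackId, "Callback timed out: " + callbackId);
    }
}

final class CallbackHandling {
    private CallbackHandling() {}

    /** Catches the most specific callback exception first, then the shared base class. */
    static String describe(Runnable waitForCallback) {
        try {
            waitForCallback.run();
            return "callback completed";
        } catch (CallbackTimeoutException e) {
            // A timed-out callback cannot be retried; a new callback would be needed.
            return "timed out: " + e.getCallbackId();
        } catch (CallbackException e) {
            // Any other callback failure still exposes the callback ID via the base class.
            return "failed: " + e.getCallbackId();
        }
    }
}
```

Because `callbackId` lives on the base class, the generic `CallbackException` handler can log or correlate the failure without knowing the concrete subtype.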

sdk/src/main/java/com/amazonaws/lambda/durable/exception/StepFailedException.java

sdk/src/main/java/com/amazonaws/lambda/durable/operation/InvokeOperation.java

sdk/src/main/java/com/amazonaws/lambda/durable/exception/CallbackTimeoutException.java

sdk/src/main/java/com/amazonaws/lambda/durable/exception/StepInterruptedException.java

sdk/src/main/java/com/amazonaws/lambda/durable/exception/CallbackTimeoutException.java

zhongkechen added 5 commits February 2, 2026 16:32

refactor exceptions

5b1ead9

add more unit tests

e347bdb

Merge remote-tracking branch 'origin/main' into exception

6276749

add doc for exceptions

1ec6d6b

move all exception helper functions to ExceptionHelper

8427ffe

zhongkechen marked this pull request as ready for review February 3, 2026 01:57

zhongkechen requested review from maschnetwork and phipag February 3, 2026 01:57

zhongkechen self-assigned this Feb 3, 2026

phipag linked an issue Feb 3, 2026 that may be closed by this pull request

[Feature]: make exception types consistent across operations #27

Closed

dhegberg reviewed Feb 3, 2026


phipag reviewed Feb 4, 2026


maschnetwork reviewed Feb 4, 2026


zhongkechen added 3 commits February 4, 2026 12:07

remove errorObject from operation exceptions

d5e0028

update method name for operationId

7888631

Merge remote-tracking branch 'origin/main' into exception

49dfe74

maschnetwork reviewed Feb 5, 2026


phipag approved these changes Feb 5, 2026


zhongkechen merged commit 6dedec4 into main Feb 5, 2026
6 of 7 checks passed

zhongkechen deleted the refactor/exception branch February 5, 2026 18:06
