
[refactor] Refactor exceptions #45

Merged
zhongkechen merged 8 commits into main from refactor/exception
Feb 5, 2026

@zhongkechen zhongkechen commented Feb 3, 2026

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Issue Link, if available

#27

Description

  • added the DurableOperationException class as the base class of all operation exceptions
    • added Operation and ErrorObject fields to DurableOperationException so users can access the operation details and the raw error
  • moved all exception utilities to ExceptionHelper
    • added a utility, unwrapCompletableFuture, to unwrap exceptions wrapped by CompletionException
    • added a utility, buildErrorObject, to build an ErrorObject from a customer exception
  • added a base class for all callback exceptions
    • added callbackId to the base class so all callback exceptions have access to the callback ID
  • TestResult.getFailedOperations now returns all operations that could cause an execution to fail (not limited to operations with FAILED status)
  • updated README.md to reflect the new exception hierarchy
  • encapsulated the conversion of StepInterruptedException to ErrorObject inside StepInterruptedException itself
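
As a rough illustration of the hierarchy and helpers listed above, here is a minimal, self-contained sketch. Every type in it (`ErrorObjectSketch`, `DurableOperationExceptionSketch`, `ExceptionHelperSketch`, and so on) is a hypothetical stand-in, not the SDK's actual class, and the method signatures are assumptions: the PR description only names `unwrapCompletableFuture` and `buildErrorObject`, it does not show them. The operation is reduced to a plain string ID here, whereas the real exception holds an `Operation` object.

```java
import java.util.concurrent.CompletionException;

// Hypothetical stand-in for the SDK's ErrorObject; the real type has its own fields.
final class ErrorObjectSketch {
    final String errorType;
    final String errorMessage;

    ErrorObjectSketch(String errorType, String errorMessage) {
        this.errorType = errorType;
        this.errorMessage = errorMessage;
    }
}

// Hypothetical base class: carries the failed operation's ID and the raw error.
class DurableOperationExceptionSketch extends RuntimeException {
    private final String operationId;
    private final ErrorObjectSketch errorObject;

    DurableOperationExceptionSketch(String operationId, ErrorObjectSketch errorObject) {
        super(errorObject.errorMessage);
        this.operationId = operationId;
        this.errorObject = errorObject;
    }

    String getOperationId() { return operationId; }

    ErrorObjectSketch getErrorObject() { return errorObject; }
}

// Operation-specific exceptions hang off the common base class.
class StepExceptionSketch extends DurableOperationExceptionSketch {
    StepExceptionSketch(String operationId, ErrorObjectSketch errorObject) {
        super(operationId, errorObject);
    }
}

class StepFailedExceptionSketch extends StepExceptionSketch {
    StepFailedExceptionSketch(String operationId, ErrorObjectSketch errorObject) {
        super(operationId, errorObject);
    }
}

// Hypothetical helper mirroring the two utilities named in the description.
final class ExceptionHelperSketch {
    private ExceptionHelperSketch() {}

    // Peels off CompletionException layers until the underlying cause is reached.
    static Throwable unwrapCompletableFuture(Throwable throwable) {
        Throwable current = throwable;
        while (current instanceof CompletionException && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Builds an error object from an arbitrary customer exception.
    static ErrorObjectSketch buildErrorObject(Throwable throwable) {
        return new ErrorObjectSketch(throwable.getClass().getName(), throwable.getMessage());
    }
}
```

With a shared base class, a single `catch (DurableOperationExceptionSketch e)` can read the operation ID and raw error regardless of which operation failed.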

Demo/Screenshots

Checklist

  • I have filled out every section of the PR template
  • I have thoroughly tested this change

Testing

Unit Tests

Have unit tests been written for these changes? Yes

Integration Tests

Have integration tests been written for these changes? Updated

Examples

Has a new example been added for the change? (if applicable) No change

@zhongkechen zhongkechen marked this pull request as ready for review February 3, 2026 01:57
@zhongkechen zhongkechen self-assigned this Feb 3, 2026
@phipag phipag linked an issue Feb 3, 2026 that may be closed by this pull request
README.md Outdated
| - `DurableOperationException` | General operation exception | |
| -- `StepException` | General Step exception | |
| --- `StepFailedException` | Step exhausted all retry attempts | Catch to implement fallback logic or let execution fail |
| --- `StepInterruptedException` | `AT_MOST_ONCE` step was interrupted before completion | Implement manual recovery (check if operation completed externally) |
Contributor

"check if operation completed externally"

What is being suggested here? This reads to me like the exception may need to be validated against data from the APIs, is that the case?

Contributor Author

For a step with AT_MOST_ONCE semantics, the step is not retried because it may have side effects (e.g. a remote call that is not idempotent), so retries are considered unsafe. Users have to check in their own system whether the side effect has already happened, or whether a retry is safe, before deciding how to recover.

In this PR I updated only the hierarchy of these exceptions; there are no semantic changes. @maschnetwork @phipag might have more details to add here.

Contributor

I'd recommend then that we clarify that "check if operation completed externally" means validate that the non-idempotent mutation in the customer system occurred as expected (with more user friendly wording than mine).

Contributor

Yes, this explanation is correct @zhongkechen. This exception means that we failed to checkpoint a success after starting the step. For AT_MOST_ONCE semantics the idempotency guarantee would be violated if we re-executed the step logic. Hence, we throw and this manual recovery is needed to understand what we need to do in that case (see payment example below that table).
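
The manual recovery described in this thread could look roughly like the sketch below. It assumes a hypothetical payment ledger that can be queried by order ID; `StepInterruptedExceptionSketch`, `PaymentLedgerSketch`, and `chargeOnce` are illustrative names only and do not come from the SDK or the README's payment example. The fallback of charging again when no charge is found is itself an assumption that only holds if the ledger lookup is authoritative.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for the SDK's StepInterruptedException.
class StepInterruptedExceptionSketch extends RuntimeException {
    StepInterruptedExceptionSketch(String message) {
        super(message);
    }
}

// Hypothetical external system holding the non-idempotent side effect (a charge).
class PaymentLedgerSketch {
    private final Map<String, String> chargesByOrderId = new ConcurrentHashMap<>();
    private int chargeCalls = 0;

    String charge(String orderId) {
        chargeCalls++;
        String chargeId = "charge-" + orderId;
        chargesByOrderId.put(orderId, chargeId);
        return chargeId;
    }

    Optional<String> findCharge(String orderId) {
        return Optional.ofNullable(chargesByOrderId.get(orderId));
    }

    int getChargeCalls() { return chargeCalls; }
}

final class AtMostOnceRecoverySketch {
    private AtMostOnceRecoverySketch() {}

    // Runs an AT_MOST_ONCE step; on interruption, consults the external system
    // instead of blindly re-running the step.
    static String chargeOnce(String orderId, PaymentLedgerSketch ledger, Supplier<String> atMostOnceStep) {
        try {
            return atMostOnceStep.get();
        } catch (StepInterruptedExceptionSketch e) {
            // The step started, but its success was never checkpointed.
            // If the charge exists, reuse it; only charge again when it provably did not happen.
            return ledger.findCharge(orderId).orElseGet(() -> ledger.charge(orderId));
        }
    }
}
```

The key design point is that the decision to repeat the side effect is made from the external system's state, not from the exception alone.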

Contributor Author

Updated to be consistent. But how to handle these exceptions still seems unclear. We'll keep improving the documentation.

README.md Outdated
| --- `InvokeTimedoutException` | Chained invocation timed out | Retry the chained invocation or propagate failure |
| --- `InvokeStoppedException` | Chained invocation stopped | Handle the error or propagate failure |
| -- `CallbackException` | General callback exception | |
| --- `CallbackTimeoutException` | Callback exceeded its timeout duration | Implement fallback logic or escalation |
Contributor

What is the difference between: "Catch to implement fallback logic or let execution fail", "Handle the error or propagate failure" and "Implement fallback logic or escalation" from the user's perspective?

Contributor Author

@zhongkechen zhongkechen Feb 3, 2026

A callback cannot be retried once it has timed out. Users have to create a new callback if they want to try again; I think that's why we call it fallback instead of retry.

InvokeStoppedException is thrown when a user stops the execution of their child Lambda function (via the StopDurableExecution API), provided the invoked function is durable.

Contributor

What I was getting at is that the 3 wordings I listed all sound like examples of general non-retryable exceptions. If there isn't more specific guidance on handling the exceptions, we should unify the wording.

Contributor

@phipag phipag left a comment

Thanks for sending this PR @zhongkechen. Just some minor comments.

One idea: Do you think it might be worth considering to differentiate between retryable and non-retryable SDK errors? I wonder if this might be useful for a feature in the retry strategy where the user can define their own exceptions (by inheriting from something like UnrecoverableException) and those will be excluded by the default retry strategy presets.
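
One way the retryable versus non-retryable split could be expressed is sketched below. `UnrecoverableExceptionSketch` follows the name floated in this comment, but nothing here exists in the SDK: the predicate-based preset and the `runWithRetry` helper are assumptions made purely for illustration, and real retry strategies would also deal with backoff and checkpointing.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical marker base class: anything extending it is never retried.
class UnrecoverableExceptionSketch extends RuntimeException {
    UnrecoverableExceptionSketch(String message) {
        super(message);
    }
}

// A user-defined exception that opts out of retries by inheritance.
class InvalidOrderExceptionSketch extends UnrecoverableExceptionSketch {
    InvalidOrderExceptionSketch(String message) {
        super(message);
    }
}

final class RetryPolicySketch {
    private RetryPolicySketch() {}

    // Default preset: retry everything except unrecoverable exceptions.
    static final Predicate<Throwable> DEFAULT_RETRYABLE =
            t -> !(t instanceof UnrecoverableExceptionSketch);

    // Runs the step up to maxAttempts times, stopping early on non-retryable errors.
    static <T> T runWithRetry(Supplier<T> step, int maxAttempts, Predicate<Throwable> retryable) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return step.get();
            } catch (RuntimeException e) {
                last = e;
                if (!retryable.test(e)) {
                    throw e; // non-retryable: fail immediately
                }
            }
        }
        throw last; // all attempts exhausted
    }
}
```

Classifying by inheritance keeps the default preset a one-line predicate, and users can still pass their own predicate when the default does not fit.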

Contributor

@maschnetwork maschnetwork left a comment

Great improvements @zhongkechen. Some comments around consistency.

import software.amazon.awssdk.services.lambda.model.Operation;

public class CallbackException extends DurableOperationException {
    private final String callbackId;
Contributor

@maschnetwork maschnetwork Feb 4, 2026

nit: Let's add an example showing how users can catch CallbackException and use getCallbackId() for error handling (can be added later)
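
A possible shape for such an example is sketched here. The classes are simplified stand-ins: the real `CallbackException` extends `DurableOperationException` and takes other constructor arguments, and the `awaitApproval` helper and its return strings are invented for illustration. Only the `getCallbackId()` accessor is taken from this review comment.

```java
import java.util.function.Supplier;

// Simplified stand-in for the SDK's CallbackException (which extends DurableOperationException).
class CallbackExceptionSketch extends RuntimeException {
    private final String callbackId;

    CallbackExceptionSketch(String callbackId, String message) {
        super(message);
        this.callbackId = callbackId;
    }

    String getCallbackId() { return callbackId; }
}

// Simplified stand-in for CallbackTimeoutException.
class CallbackTimeoutExceptionSketch extends CallbackExceptionSketch {
    CallbackTimeoutExceptionSketch(String callbackId) {
        super(callbackId, "Callback timed out: " + callbackId);
    }
}

final class CallbackErrorHandlingSketch {
    private CallbackErrorHandlingSketch() {}

    // Waits for a callback result and maps callback failures to a fallback outcome.
    static String awaitApproval(Supplier<String> callbackResult) {
        try {
            return callbackResult.get();
        } catch (CallbackTimeoutExceptionSketch e) {
            // A timed-out callback cannot be retried; escalate using its ID.
            return "escalated:" + e.getCallbackId();
        } catch (CallbackExceptionSketch e) {
            // Any other callback failure still exposes the callback ID via the base class.
            return "failed:" + e.getCallbackId();
        }
    }
}
```

Because the ID lives on the base class, the broader `catch` block can log or correlate the failing callback without knowing the concrete subtype.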

@zhongkechen zhongkechen merged commit 6dedec4 into main Feb 5, 2026
6 of 7 checks passed
@zhongkechen zhongkechen deleted the refactor/exception branch February 5, 2026 18:06
Development

Successfully merging this pull request may close these issues.

[Feature]: make exception types consistent across operations

4 participants