Conversation
README.md
Outdated
| | - `DurableOperationException` | General operation exception | | | ||
| | -- `StepException` | General Step exception | | | ||
| | --- `StepFailedException` | Step exhausted all retry attempts | Catch to implement fallback logic or let execution fail | | ||
| | --- `StepInterruptedException` | `AT_MOST_ONCE` step was interrupted before completion | Implement manual recovery (check if operation completed externally) | |
There was a problem hiding this comment.
"check if operation completed externally"
What is being suggested here? This reads to me like the exception may need to be validated against data from the APIs, is that the case?
There was a problem hiding this comment.
For a step with AT_MOST_ONCE semantic, the step will not be retried because the step may have side effect (e.g. a remote call that is not idempotent) and retries are considered unsafe. Users have to check the side effect in their system if it's already done or safe to retries or whatever to recover from the situation.
In this PR I updated only the hierarchy of these exceptions, no semantic changes. @maschnetwork @phipag might have more details to add here.
There was a problem hiding this comment.
I'd recommend then that we clarify that "check if operation completed externally" means validate that the non-idempotent mutation in the customer system occurred as expected (with more user friendly wording than mine).
There was a problem hiding this comment.
Yes, this explanation is correct @zhongkechen. This exception means that we failed to checkpoint a success after starting the step. For AT_MOST_ONCE semantics the idempotency guarantee would be violated if we re-executed the step logic. Hence, we throw and this manual recovery is needed to understand what we need to do in that case (see payment example below that table).
There was a problem hiding this comment.
Updated to be consistent. But how to handle these exceptions seem unclear. We'll keep improving the document
README.md
Outdated
| | --- `InvokeTimedoutException` | Chained invocation timed out | Retry the chained invocation or propagate failure | | ||
| | --- `InvokeStoppedException` | Chained invocation stopped | Handle the error or propagate failure | | ||
| | -- `CallbackException` | General callback exception | | | ||
| | --- `CallbackTimeoutException` | Callback exceeded its timeout duration | Implement fallback logic or escalation | |
There was a problem hiding this comment.
What is the difference between: "Catch to implement fallback logic or let execution fail", "Handle the error or propagate failure" and "Implement fallback logic or escalation" from the user's perspective?
There was a problem hiding this comment.
A callback cannot be retried when it's timed out. Users have to create a new callback if they want to do it again, I think that's why we call it fallback instead of retry.
InvokeStoppedException is thrown when a user stopped (StopDurableExecution API) their child Lambda function execution if the invoked function is durable.
There was a problem hiding this comment.
What I was getting at is the 3 wordings I listed all sound like examples of general non-retryable exceptions. If there isn't a more specific guidance on handling the exceptions, we should unify the wording.
phipag
left a comment
There was a problem hiding this comment.
Thanks for sending this PR @zhongkechen. Just some minor comments.
One idea: Do you think it might be worth considering to differentiate between retryable and non-retryable SDK errors? I wonder if this might be useful for a feature in the retry strategy where the user can define their own exceptions (by inheriting from something like UnrecoverableException) and those will be excluded by the default retry strategy presets.
sdk/src/main/java/com/amazonaws/lambda/durable/util/ExceptionHelper.java
Show resolved
Hide resolved
sdk-testing/src/main/java/com/amazonaws/lambda/durable/testing/TestResult.java
Show resolved
Hide resolved
maschnetwork
left a comment
There was a problem hiding this comment.
Great improvements @zhongkechen . Some comments around consistency.
| import software.amazon.awssdk.services.lambda.model.Operation; | ||
|
|
||
| public class CallbackException extends DurableOperationException { | ||
| private final String callbackId; |
There was a problem hiding this comment.
nit: Let's add an example showing how users can catch CallbackException and use getCallbackId() for error handling (can be added later)
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Issue Link, if available
#27
Description
DurableOperationExceptionclass as the base class of all operation exception.OperationandErrorObjectfields to theDurableOperationExceptionto allow users to access the operation details and the raw errorsExceptionHelperunwrapCompletableFutureto unwrap exceptions wrapped byCompletionExceptionbuildErrorObjectto build an ErrorObject from a customer exceptioncallbackIdto the base class so all callback exceptions have the access to the callback idTestResult.getFailedOperationsnow returns all operations that could cause the executions to fail (not limited to FAILED operation status).StepInterruptedExceptionis converted toErrorObjectinStepInterruptedExceptionDemo/Screenshots
Checklist
Testing
Unit Tests
Have unit tests been written for these changes? Yes
Integration Tests
Have integration tests been written for these changes? Updated
Examples
Has a new example been added for the change? (if applicable) No change