I can retry a job if an error was thrown but not caught #6000
Comments
I've tried to write down the problem and solution in my own words.
Additional context
Errors can be thrown by:
Errors can be caught by:
If a thrown error is uncaught, an incident is raised on the service task or error end event.
Problem
The incident cannot be resolved, because (as part of resolving the incident) the error is rethrown and is again uncaught.
Main solution idea
Resolve the incident, reactivate the associated job.
Scoping
Error end event case
Since process version migration is not (yet) supported, it is not possible to solve the incident with the above idea for error events. Therefore, the error end event case is out of scope.
Error code expressions
An alternative idea is to allow defining error codes using FEEL expressions, for a more dynamic error code definition. The idea is that the error could then be rethrown and caught via the updated error code of an already existing error catch event. However, such an expression would be evaluated at activation time of the flow scope of that catch event (i.e. when the error subscription is created), so its value is unchanged at incident:resolve time and this would not solve the problem.
Status Quo
When an error is thrown for a job, but is uncaught:
When the incident is resolved:
Plan
Open questions
@saig0 I've updated the text above with our discussion, and also with my findings from looking into this in more detail.
I first looked at the deleted jobs case, and quickly discovered that jobs are not deleted for the uncaught error case. This actually makes the required change even simpler: we only have to change the
As far as I can tell, it would be best to make that change specifically for new versions (checking the version of the
@npepinpe as far as I can tell, this would mean that we do not yet have to look at the version that made a specific state change.
Is your feature request related to a problem? Please describe.
A job worker can throw an error to indicate that a business error occurred while processing the job. If the error is not caught in the workflow (e.g. by an error boundary event), an incident is raised.
Currently, this incident cannot be resolved. If the incident is marked as resolved, a new incident similar to the previous one is raised.
Internally, resolving the incident tries to throw the error event again. Since the workflow has not changed, this just creates a new incident.
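To make the loop concrete, here is a minimal toy model of the current behavior in Python. It is only a sketch: `ServiceTask`, `throw_error`, `resolve_incident_current` and the error codes are hypothetical names invented for illustration, not part of Zeebe's code base or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class ServiceTask:
    """Toy stand-in for an active service task (hypothetical, not Zeebe code)."""

    catchable_codes: Set[str]          # error codes that some catch event handles
    thrown_code: Optional[str] = None  # last error code thrown by the job worker
    incidents: List[str] = field(default_factory=list)  # currently open incidents

    def throw_error(self, code: str) -> bool:
        """Return True if the error is caught; otherwise raise an incident."""
        self.thrown_code = code
        if code in self.catchable_codes:
            return True
        self.incidents.append(f"unhandled error '{code}'")
        return False

    def resolve_incident_current(self) -> bool:
        """Current behavior as described above: resolving rethrows the same error."""
        self.incidents.pop()
        return self.throw_error(self.thrown_code)


# The workflow only catches "payment-failed"; the worker throws a misspelled code.
task = ServiceTask(catchable_codes={"payment-failed"})
task.throw_error("paymnt-failed")

# Resolving any number of times never helps, because the workflow is unchanged.
for _ in range(3):
    task.resolve_incident_current()

print(len(task.incidents))  # prints 1: an incident is still open
```

In this model the only way out would be to change `catchable_codes`, which corresponds to modifying the workflow itself.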
Describe the solution you'd like
Fix the business error by adding/modifying a variable, updating/fixing the job worker, or modifying an external system. Then, mark the incident as resolved. The job worker can process the job again and produce a different result.
The behavior is similar to the case when the job is marked as failed without remaining retries.
Since the retries of the job are not decreased, it might not be necessary to set/increase the retries of the job first.
Internally, the job is marked as failed. When the incident is resolved, the job can be activated again. In the meantime, the service task stays active.
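The requested behavior can be sketched as a small state machine. This is an illustrative model only, assuming the job keeps its retries and becomes activatable again once the incident is resolved; the `Job` class, the `JobState` values and all method names are hypothetical and do not mirror the engine's actual records or intents.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    ACTIVATABLE = "activatable"
    ACTIVATED = "activated"
    ERROR_THROWN = "error_thrown"  # hypothetical: failed, waiting on the incident
    COMPLETED = "completed"


@dataclass
class Job:
    """Toy job model (hypothetical, not Zeebe's job record)."""

    retries: int
    state: JobState = JobState.ACTIVATABLE
    incident_open: bool = False

    def activate(self) -> None:
        if self.state is not JobState.ACTIVATABLE:
            raise RuntimeError(f"cannot activate job in state {self.state.name}")
        self.state = JobState.ACTIVATED

    def throw_uncaught_error(self) -> None:
        """Worker threw an error that no catch event handles."""
        if self.state is not JobState.ACTIVATED:
            raise RuntimeError("only an activated job can throw an error")
        # Retries are left untouched; the job is parked and an incident is raised.
        self.state = JobState.ERROR_THROWN
        self.incident_open = True

    def resolve_incident(self) -> None:
        """Proposed behavior: resolving makes the job activatable again."""
        if not self.incident_open:
            raise RuntimeError("no open incident to resolve")
        self.incident_open = False
        self.state = JobState.ACTIVATABLE

    def complete(self) -> None:
        if self.state is not JobState.ACTIVATED:
            raise RuntimeError("only an activated job can be completed")
        self.state = JobState.COMPLETED


job = Job(retries=3)
job.activate()
job.throw_uncaught_error()  # incident raised, service task stays active
# ... fix the variable, the job worker, or the external system here ...
job.resolve_incident()      # job can be picked up again
job.activate()
job.complete()
print(job.state.name, job.retries)  # prints: COMPLETED 3
```

Because the retries are never decremented in this sketch, no separate step to set or increase them is needed before resolving the incident.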
Describe alternatives you've considered
Stay with the current behavior. Eventually, it will be possible to modify workflows and migrate workflow instances. Using the migration, it will be possible to fix a couple of these problems.
However, some of these cases will not be covered. For example, if the job worker throws an error accidentally or with the wrong error code (e.g. a typo in the error code), you don't want to modify your workflow to catch the wrong error. Instead, you want to fix the job worker and continue the workflow instance.
Additional context
Documentation about error events:
https://docs.zeebe.io/bpmn-workflows/error-events/error-events.html#unhandled-errors
https://docs.camunda.io/docs/reference/bpmn-processes/error-events/error-events#unhandled-errors
The issue was raised on the support channel.