I can retry a job if an error was thrown but not caught #6000

saig0 · 2020-12-11T05:15:07Z

Is your feature request related to a problem? Please describe.
A job worker can throw an error to indicate that a business error occurred during the processing of the job. If the error is not caught in the workflow (e.g. by an error boundary event) then an incident is raising.

Currently, this incident can not be resolved. If the incident is marked as resolved then a new incident similar to the previous one is reached.

Internally, it tries to throw the error event again. Since the workflow is not changed, it will create just a new incident.

Describe the solution you'd like
Fix the business error by adding/modifying a variable, updating/fixing the job worker, or modifying an external system. Then, mark the incident as resolved. The job worker can process the job again and produce a different result.

The behavior is similar to the case when the job is marked as failed without remaining retries.

Since the retries of the job are not decreased, it might not be necessary to set/increase the retries of the job first.

Internally, the job is marked as failed. When the incident is resolved then the job can be activated again. In the meantime, the service task stays active.

Describe alternatives you've considered
Stay with the current behavior. Eventually, it will be possible to modify workflows and migrate workflow instances. Using the migration, it will be possible to fix a couple of these problems.

However, some of these cases will not be covered. For example, if the job worker throws an error accidentally or using the wrong error code (e.g. a typo in the error code) then you don't want to modify your workflow to catch the wrong error. Instead, you want to fix the job worker and continue the workflow instance.

Additional context
Documentation about error events: ~~https://docs.zeebe.io/bpmn-workflows/error-events/error-events.html#unhandled-errors~~ https://docs.camunda.io/docs/reference/bpmn-processes/error-events/error-events#unhandled-errors

The issue was raised on the support channel.

korthout · 2021-05-28T14:22:58Z

I've tried to write down the problem and solution in my own words.

Additional context

Errors can be thrown by:

job worker as client command ThrowError (is always associated to a job, so can be seen as thrown by the service task)
error end event (no other bpmn event types)

Errors can be caught by:

error start event (start event of event sub process, no other start events)
error boundary event (no intermediate catch event)

If thrown error is uncaught, incident is raised on service task or error end event.

Problem

Incident cannot be resolved, because (as part of resolving the incident) the error is rethrown and again uncaught.

Main solution idea

Resolve the incident, reactivate the associated job

Scoping

Error end event case

Since process version migration is not (yet) supported, it is not possible to solve the incident with the above idea for error events. Therefore, the error end event case is out of scope.

Error code expressions

An alternative idea is to allow defining error codes using FEEL expressions, to allow a more dynamic error code definition. The idea is that this would allow the error to be rethrown and caught by the updated error code of an already existing error catch event. However, since such an expression would be evaluated at activation time of the flow scope of that catch event (i.e. when the error subscription is being created), this value is unchanged at the incident:resolve time and would not allow to solve this problem.

Status Quo

When an error is thrown for a job, but is uncaught:

job.elementId is set to "no catch event found" (for later reference)
ErrorThrown event is written, resulting in:
- job.state is set to ErrorThrown
- job is made unactivatable
- if (job.elementId != "no catch event found")
  - task.jobKey is set to -1L
  - job is deleted from state
incident is raised with type UNHANDLED_ERROR_EVENT and with associated job key

When incident resolve:

Incident:Resolved is written (always), resulting in:
- if (job.key > 0 && job.state == FAILED) (should also accept job.state == ErrorThrown)
  -> make job activatable again (this is a state change, no event is written for it, could be an idea to change this)
- delete the incident
rejects the Incident:Resolve command (should no longer do this, incident was clearly resolved)
This is actually a bug currently, it should not have happened, instead it should've created a new incident

Plan

create issue for reject incident:resolve command bug (should be fixed on latest stable versions) => Resolving an uncaught error incident does not lead to new incident #7160
inform operate about this change (job retries are not modified in this case) => operate does not change job retries for this incident type
make necessary changes (as described above)
document new behavior

Open questions

Can we deal with existing incidents? The job is already deleted for these cases, so are we able to create a new job for it on incident:resolve? Does this require us to keep track of version in state?
Answer: At first I thought that the job would already have been deleted for existing cases, but on further inspection of the code I seem to have made a reading mistake (I've updated the status quo above accordingly). So the job exists in state for existing incidents. We don't have to create a new job for existing incidents. Just as described, we can simply resolve the incident and write Incident:Resolved, which on writing can make the job activatable again if we change the conditional expression.

korthout · 2021-06-01T14:11:18Z

@saig0 I've updated the text above, with our discussion, but also with my findings of looking into this in more detail

I first looked at the deleted jobs case, and quickly discovered that jobs are not deleted for the uncaught error case. This actually makes the required change even simpler: we only have to the change the IncidentResolvedApplier to also reactive the job for the ErrorThrown case.

As far as I can tell, it would be best to make that change specifically for new versions (checking the version of the Incident:Resolved record). We don't want to do this for previous versions, since that would change the logical state. Do you agree?

@npepinpe as far as I can tell, this would mean that we do not yet have to look at the version that made a specific state change.

npepinpe removed the blocker/stakeholder Marks an issue as blocked, waiting for stakeholder input label May 19, 2021

npepinpe added this to Planned in Zeebe via automation May 19, 2021

korthout self-assigned this May 28, 2021

korthout added this to the Users can retry unhandled error incidents milestone Jun 1, 2021

korthout mentioned this issue Jun 1, 2021

Resolving an uncaught error incident does not lead to new incident #7160

Closed

korthout mentioned this issue Jun 15, 2021

Unhandled error incident on job can be resolved #7292

Merged

9 tasks

npepinpe moved this from Planned to Ready in Zeebe Jun 15, 2021

npepinpe moved this from Ready to Review in progress in Zeebe Jun 15, 2021

korthout mentioned this issue Jun 17, 2021

Document uncaught error incident resolution #7307

Closed

ghost closed this as completed in d09f60c Jun 18, 2021

Zeebe automation moved this from Review in progress to Done Jun 18, 2021

npepinpe added the Release: 1.1.0 label Jul 7, 2021

korthout mentioned this issue Jan 14, 2022

Job of cancelled instance can be activated if an error was thrown on it #8588

Closed

KerstinHebel removed this from Done in Zeebe Mar 23, 2022

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

I can retry a job if an error was thrown but not caught #6000

I can retry a job if an error was thrown but not caught #6000

saig0 commented Dec 11, 2020 •

edited by korthout

korthout commented May 28, 2021 •

edited

korthout commented Jun 1, 2021

I can retry a job if an error was thrown but not caught #6000

I can retry a job if an error was thrown but not caught #6000

Comments

saig0 commented Dec 11, 2020 • edited by korthout

korthout commented May 28, 2021 • edited

Additional context

Problem

Main solution idea

Scoping

Error end event case

Error code expressions

Status Quo

Plan

Open questions

korthout commented Jun 1, 2021

saig0 commented Dec 11, 2020 •

edited by korthout

korthout commented May 28, 2021 •

edited