Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I can retry a job if an error was thrown but not caught #6000

Closed
saig0 opened this issue Dec 11, 2020 · 2 comments · Fixed by #7292
Closed

I can retry a job if an error was thrown but not caught #6000

saig0 opened this issue Dec 11, 2020 · 2 comments · Fixed by #7292
Assignees
Labels
kind/feature Categorizes an issue or PR as a feature, i.e. new behavior scope/broker Marks an issue or PR to appear in the broker section of the changelog support Marks an issue as related to a customer support request

Comments

@saig0
Copy link
Member

saig0 commented Dec 11, 2020

Is your feature request related to a problem? Please describe.
A job worker can throw an error to indicate that a business error occurred during the processing of the job. If the error is not caught in the workflow (e.g. by an error boundary event) then an incident is raising.

Currently, this incident can not be resolved. If the incident is marked as resolved then a new incident similar to the previous one is reached.

Internally, it tries to throw the error event again. Since the workflow is not changed, it will create just a new incident.

Describe the solution you'd like
Fix the business error by adding/modifying a variable, updating/fixing the job worker, or modifying an external system. Then, mark the incident as resolved. The job worker can process the job again and produce a different result.

The behavior is similar to the case when the job is marked as failed without remaining retries.

Since the retries of the job are not decreased, it might not be necessary to set/increase the retries of the job first.

Internally, the job is marked as failed. When the incident is resolved then the job can be activated again. In the meantime, the service task stays active.

Describe alternatives you've considered
Stay with the current behavior. Eventually, it will be possible to modify workflows and migrate workflow instances. Using the migration, it will be possible to fix a couple of these problems.

However, some of these cases will not be covered. For example, if the job worker throws an error accidentally or using the wrong error code (e.g. a typo in the error code) then you don't want to modify your workflow to catch the wrong error. Instead, you want to fix the job worker and continue the workflow instance.

Additional context
Documentation about error events: https://docs.zeebe.io/bpmn-workflows/error-events/error-events.html#unhandled-errors https://docs.camunda.io/docs/reference/bpmn-processes/error-events/error-events#unhandled-errors

The issue was raised on the support channel.

@saig0 saig0 added kind/feature Categorizes an issue or PR as a feature, i.e. new behavior scope/broker Marks an issue or PR to appear in the broker section of the changelog blocker/stakeholder Marks an issue as blocked, waiting for stakeholder input Impact: Usability support Marks an issue as related to a customer support request labels Dec 11, 2020
@npepinpe npepinpe removed the blocker/stakeholder Marks an issue as blocked, waiting for stakeholder input label May 19, 2021
@npepinpe npepinpe added this to Planned in Zeebe via automation May 19, 2021
@korthout korthout self-assigned this May 28, 2021
@korthout
Copy link
Member

korthout commented May 28, 2021

I've tried to write down the problem and solution in my own words.

Additional context

Errors can be thrown by:

  • job worker as client command ThrowError (is always associated to a job, so can be seen as thrown by the service task)
  • error end event (no other bpmn event types)

Errors can be caught by:

  • error start event (start event of event sub process, no other start events)
  • error boundary event (no intermediate catch event)

If thrown error is uncaught, incident is raised on service task or error end event.

Problem

Incident cannot be resolved, because (as part of resolving the incident) the error is rethrown and again uncaught.

Main solution idea

Resolve the incident, reactivate the associated job

Scoping

Error end event case

Since process version migration is not (yet) supported, it is not possible to solve the incident with the above idea for error events. Therefore, the error end event case is out of scope.

Error code expressions

An alternative idea is to allow defining error codes using FEEL expressions, to allow a more dynamic error code definition. The idea is that this would allow the error to be rethrown and caught by the updated error code of an already existing error catch event. However, since such an expression would be evaluated at activation time of the flow scope of that catch event (i.e. when the error subscription is being created), this value is unchanged at the incident:resolve time and would not allow to solve this problem.

Status Quo

When an error is thrown for a job, but is uncaught:

  • job.elementId is set to "no catch event found" (for later reference)
  • ErrorThrown event is written, resulting in:
    • job.state is set to ErrorThrown
    • job is made unactivatable
    • if (job.elementId != "no catch event found")
      • task.jobKey is set to -1L
      • job is deleted from state
  • incident is raised with type UNHANDLED_ERROR_EVENT and with associated job key

When incident resolve:

  • Incident:Resolved is written (always), resulting in:
    • if (job.key > 0 && job.state == FAILED) (should also accept job.state == ErrorThrown)
      -> make job activatable again (this is a state change, no event is written for it, could be an idea to change this)
    • delete the incident
  • rejects the Incident:Resolve command (should no longer do this, incident was clearly resolved)
    This is actually a bug currently, it should not have happened, instead it should've created a new incident

Plan

  • create issue for reject incident:resolve command bug (should be fixed on latest stable versions) => Resolving an uncaught error incident does not lead to new incident #7160
  • inform operate about this change (job retries are not modified in this case) => operate does not change job retries for this incident type
  • make necessary changes (as described above)
  • document new behavior

Open questions

  • Can we deal with existing incidents? The job is already deleted for these cases, so are we able to create a new job for it on incident:resolve? Does this require us to keep track of version in state?
    Answer: At first I thought that the job would already have been deleted for existing cases, but on further inspection of the code I seem to have made a reading mistake (I've updated the status quo above accordingly). So the job exists in state for existing incidents. We don't have to create a new job for existing incidents. Just as described, we can simply resolve the incident and write Incident:Resolved, which on writing can make the job activatable again if we change the conditional expression.

@korthout
Copy link
Member

korthout commented Jun 1, 2021

@saig0 I've updated the text above, with our discussion, but also with my findings of looking into this in more detail

I first looked at the deleted jobs case, and quickly discovered that jobs are not deleted for the uncaught error case. This actually makes the required change even simpler: we only have to the change the IncidentResolvedApplier to also reactive the job for the ErrorThrown case.

As far as I can tell, it would be best to make that change specifically for new versions (checking the version of the Incident:Resolved record). We don't want to do this for previous versions, since that would change the logical state. Do you agree?

@npepinpe as far as I can tell, this would mean that we do not yet have to look at the version that made a specific state change.

@npepinpe npepinpe moved this from Planned to Ready in Zeebe Jun 15, 2021
@npepinpe npepinpe moved this from Ready to Review in progress in Zeebe Jun 15, 2021
@ghost ghost closed this as completed in d09f60c Jun 18, 2021
Zeebe automation moved this from Review in progress to Done Jun 18, 2021
@KerstinHebel KerstinHebel removed this from Done in Zeebe Mar 23, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes an issue or PR as a feature, i.e. new behavior scope/broker Marks an issue or PR to appear in the broker section of the changelog support Marks an issue as related to a customer support request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants