
Fix race condition between job cancelling and activating #6521

Closed

pihme opened this issue Mar 10, 2021 · 8 comments
Labels

  • area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
  • component/engine
  • kind/bug: Categorizes an issue or PR as a bug
  • scope/broker: Marks an issue or PR to appear in the broker section of the changelog
  • severity/mid: Marks a bug as having a noticeable impact but with a known workaround

Comments

@pihme
Contributor

pihme commented Mar 10, 2021

Description

This is a follow-up to #2816.

The fix for #2816 addressed the symptom rather than the root cause. It added some extra logic in DbJobState, including a clean-up state change for the case that the job could not be found:

    activatableColumnFamily.whileEqualPrefix(
        jobTypeKey,
        ((compositeKey, zbNil) -> {
          final long jobKey = compositeKey.getSecond().getValue();
          return visitJob(jobKey, callback, () -> activatableColumnFamily.delete(compositeKey));
        }));
  }

  boolean visitJob(
      long jobKey, BiFunction<Long, JobRecord, Boolean> callback, Runnable cleanupRunnable) {
    final JobRecord job = getJob(jobKey);
    if (job == null) {
      LOG.error("Expected to find job with key {}, but no job found", jobKey);
      cleanupRunnable.run();
      return true; // we want to continue with the iteration
    }
    return callback.apply(jobKey, job);
  }

As part of #6176, the clean-up was disabled to make the migration easier:

    activatableColumnFamily.whileEqualPrefix(
        jobTypeKey,
        ((compositeKey, zbNil) -> {
          final long jobKey = compositeKey.getSecond().getValue();
          // TODO #6521 reconsider race condition and whether or not the cleanup task is needed
          return visitJob(jobKey, callback, () -> {});
        }));

At the end of the migration, we need to check the following:

  • whether the race condition still exists and, if so, how it can be prevented at the root level
  • whether there is still a need for the double-safety clean-up and, if so, how it can be achieved (preferably outside of the call stack of a processor)
@pihme pihme added the kind/toil label Mar 10, 2021
@pihme pihme added this to the Updatable Workflow Engine milestone Mar 10, 2021
@saig0 saig0 changed the title Investigate and fix race condition between cancel and activate Investigate and fix race condition between job cancelling and activating Mar 17, 2021
@saig0 saig0 changed the title Investigate and fix race condition between job cancelling and activating Fix race condition between job cancelling and activating Mar 17, 2021
@npepinpe npepinpe added this to Planned in Zeebe Mar 24, 2021
@npepinpe npepinpe removed this from the Updatable Workflow Engine milestone Apr 8, 2021
@npepinpe npepinpe added the kind/bug, scope/broker, Impact: Data and severity/mid labels and removed the kind/toil label Apr 8, 2021
@npepinpe
Member

npepinpe commented Jul 14, 2021

@saig0 / @korthout - is this still an issue with the new engine?

@korthout korthout self-assigned this Jul 14, 2021
@korthout
Member

While writing this analysis, I changed my answer a couple of times, which shows how difficult the situation was to determine. Some parts of how job timeouts and activations work could definitely be improved.

We no longer need the clean-up, for two reasons:

  • The race condition no longer exists, because the deadline visitor is able to skip deadlines for which the job no longer exists, while simultaneously cleaning up those deadlines (see the sketch after this list).
  • I think two deadlines for one job cannot exist simultaneously: a deadline can only be created by activation, which can only happen after the job becomes activatable again, at which point I think no deadline can exist (@saig0 please tell me if I'm wrong here, because there might be interactions I missed, but I assume the job's finite state machine protects against this).
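
For illustration, a minimal sketch of such a self-cleaning deadline visitor. The names used here (deadlinesColumnFamily, whileTrue, getJob) are assumptions for the sketch, not necessarily the actual DbJobState API:

    // Sketch only, not Zeebe's actual code: walk the (deadline, jobKey) column
    // family and, when the referenced job no longer exists, remove the orphaned
    // deadline entry instead of failing, then continue the iteration.
    void forEachTimedOutEntry(final long upperBound, final BiPredicate<Long, JobRecord> visitor) {
      deadlinesColumnFamily.whileTrue(
          (deadlineAndJobKey, nil) -> {
            final long deadline = deadlineAndJobKey.getFirst().getValue();
            if (deadline > upperBound) {
              return false; // past the requested range, stop iterating
            }
            final long jobKey = deadlineAndJobKey.getSecond().getValue();
            final JobRecord job = getJob(jobKey);
            if (job == null) {
              // orphaned deadline: the job is already gone, clean it up and continue
              deadlinesColumnFamily.delete(deadlineAndJobKey);
              return true;
            }
            return visitor.test(jobKey, job);
          });
    }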

The indirection I experienced while looking into this is caused by the clean-up. It feels like a workaround for the fact that 'normal' operation can leave orphaned deadlines behind in the first place.

Consider again the race condition: a JobBatch:Activate command followed by a Job:Cancel command, both interacting with the same job. As part of processing the Job:Cancel command, the job is deleted, which includes deleting the deadline entry using the exact deadline value (not the job key). However, the job's deadline was already changed by the JobBatch:Activate, so only that 'latest' deadline is deleted; any earlier deadline entries are left behind. That begs the question whether another deadline can exist for a job when the JobBatch:Activate is processed and activates it. I think this can only happen when the job is in the activatable state, at which point I think no deadlines can exist for it, but as said before, I might be wrong here.

We should probably still do what was proposed earlier:

  • We should try to implement something which deletes all entries for a job key from the deadlines column family, no matter the deadline in the job record (a rough sketch follows below).
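
A rough sketch of what that could look like; deadlinesColumnFamily and whileTrue are assumed names, not the actual API, and because the column family is keyed by (deadline, jobKey) this naive version falls back to a full iteration:

    // Sketch only: remove every deadline entry that references the given job key,
    // regardless of which deadline value is currently stored in the job record.
    void removeAllDeadlinesForJob(final long jobKeyToRemove) {
      deadlinesColumnFamily.whileTrue(
          (deadlineAndJobKey, nil) -> {
            final long jobKey = deadlineAndJobKey.getSecond().getValue();
            if (jobKey == jobKeyToRemove) {
              deadlinesColumnFamily.delete(deadlineAndJobKey);
            }
            return true; // keep iterating, there may be multiple stale entries
          });
    }

A real implementation would probably want a (jobKey → deadline) index so the full scan can be avoided.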

I also think we should write a test to make sure we don't have this race condition and then remove this clean-up logic to reduce the indirection.
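
Sketch of the kind of regression test meant here; engine and its helper methods are hypothetical stand-ins for the engine test utilities, not an existing API:

    @Test
    public void shouldNotLeaveOrphanedDeadlineWhenJobIsCancelledAfterActivation() {
      // given: a job that has been activated, which assigns it a deadline
      final long jobKey = engine.createJobOfType("test");
      engine.activateJobs("test");

      // when: the job is cancelled right after the activation
      engine.cancelJob(jobKey);

      // then: no deadline entry for this job remains in the column family
      assertThat(engine.deadlineEntriesForJob(jobKey)).isEmpty();
    }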

@korthout korthout removed their assignment Jul 14, 2021
@domq

domq commented Mar 21, 2022

Hello, kindly consider #8949. We have hundreds of DEADLINE_JOBS entries per job (not just two), and currently no way to fix that, short of a shutdown / doctor-the-snapshot / restart cycle.

@KerstinHebel KerstinHebel removed this from Planned in Zeebe Mar 23, 2022
@npepinpe npepinpe added the area/reliability label and removed the Impact: Data label Apr 11, 2022
@Zelldon
Member

Zelldon commented Jun 8, 2023

I'm not 100% sure, but it looks like we have seen this again on SaaS.

We see several occurrences of:

Expected to find job with key 4503599627479187, but no job found

Error group https://console.cloud.google.com/errors/detail/CKe9pf-Tqe-ZZw;service=zeebe;time=P7D?project=camunda-saas-int

I think it can be reproduced with the recent game day (I think it is related to jobs that can't be activated or can't be sent to the client because of a too-large payload). That seems to cause these issues.

@Zelldon
Member

Zelldon commented Jun 8, 2023

Might also be related to #12778.

@Zelldon
Member

Zelldon commented Jun 8, 2023

Another one https://console.cloud.google.com/errors/detail/CKe9pf-Tqe-ZZw;service=zeebe;time=P7D?project=camunda-saas-int-chaos

It might be that this one is also from me, on a different cluster.

@korthout
Member

ZPA triage:

  • may be resolved by the bug fixes that @oleschoenburg and @korthout investigated
  • @korthout will investigate this in relation to those fixes
  • pulling it into the current iteration

@korthout
Member

Closed as resolved by:
