
Fix race condition between job cancelling and activating #6521

Closed

pihme opened this issue Mar 10, 2021 · 8 comments
Labels

  • area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
  • component/engine
  • kind/bug: Categorizes an issue or PR as a bug
  • scope/broker: Marks an issue or PR to appear in the broker section of the changelog
  • severity/mid: Marks a bug as having a noticeable impact but with a known workaround

Comments

@pihme
Contributor

pihme commented Mar 10, 2021

Description

This is a follow-up to #2816.

The fix for #2816 addressed the symptom rather than the root cause. It added some extra logic in DbJobState, including a clean-up state change for the case that the job could not be found:

    activatableColumnFamily.whileEqualPrefix(
        jobTypeKey,
        ((compositeKey, zbNil) -> {
          final long jobKey = compositeKey.getSecond().getValue();
          return visitJob(jobKey, callback, () -> activatableColumnFamily.delete(compositeKey));
        }));
  }

  boolean visitJob(
      long jobKey, BiFunction<Long, JobRecord, Boolean> callback, Runnable cleanupRunnable) {
    final JobRecord job = getJob(jobKey);
    if (job == null) {
      LOG.error("Expected to find job with key {}, but no job found", jobKey);
      cleanupRunnable.run();
      return true; // we want to continue with the iteration
    }
    return callback.apply(jobKey, job);
  }

As part of #6176, the clean-up was disabled to make the migration easier:

    activatableColumnFamily.whileEqualPrefix(
        jobTypeKey,
        ((compositeKey, zbNil) -> {
          final long jobKey = compositeKey.getSecond().getValue();
          // TODO #6521 reconsider race condition and whether or not the cleanup task is needed
          return visitJob(jobKey, callback, () -> {});
        }));

At the end of the migration, we need to check the following:

  • whether the race condition still exists and, if so, how it can be prevented at the root level
  • whether there is still a need for the double-safety clean-up and, if so, how it can be achieved (preferably outside of the call stack of a processor)
@pihme pihme added the kind/toil label Mar 10, 2021
@pihme pihme added this to the Updatable Workflow Engine milestone Mar 10, 2021
@saig0 saig0 changed the title Investigate and fix race condition between cancel and activate Investigate and fix race condition between job cancelling and activating Mar 17, 2021
@saig0 saig0 changed the title Investigate and fix race condition between job cancelling and activating Fix race condition between job cancelling and activating Mar 17, 2021
@npepinpe npepinpe added this to Planned in Zeebe Mar 24, 2021
@npepinpe npepinpe removed this from the Updatable Workflow Engine milestone Apr 8, 2021
@npepinpe npepinpe added the kind/bug, scope/broker, Impact: Data and severity/mid labels and removed the kind/toil label Apr 8, 2021
@npepinpe
Member

npepinpe commented Jul 14, 2021

@saig0 / @korthout - is this still an issue with the new engine?

@korthout korthout self-assigned this Jul 14, 2021
@korthout
Member

While writing this analysis, I changed my answer a couple of times, which shows how difficult the situation was to determine. Some parts of how job timeouts and activations work could definitely be improved.

We no longer need the clean-up, for two reasons:

  • The race condition no longer exists, because the deadline visitor is able to skip deadlines for which the job no longer exists, while simultaneously cleaning up those deadlines (see the sketch after this list).
  • I think two deadlines for one job cannot exist simultaneously: a deadline can only be created by activation, which can only happen after the job becomes activatable again, at which point I think no deadline can exist (@saig0 please tell me if I'm wrong here, because there might be interactions I missed, but I assume the job's finite state machine protects against this).
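
For illustration, a minimal sketch of such a self-cleaning deadline visitor. The names used here (deadlinesColumnFamily, whileTrue, getJob) are assumptions for the sketch, not necessarily the actual DbJobState API:

    // Sketch only, not Zeebe's actual code: walk the (deadline, jobKey) column
    // family and, when the referenced job no longer exists, remove the orphaned
    // deadline entry instead of failing, then continue the iteration.
    void forEachTimedOutEntry(final long upperBound, final BiPredicate<Long, JobRecord> visitor) {
      deadlinesColumnFamily.whileTrue(
          (deadlineAndJobKey, nil) -> {
            final long deadline = deadlineAndJobKey.getFirst().getValue();
            if (deadline > upperBound) {
              return false; // past the requested range, stop iterating
            }
            final long jobKey = deadlineAndJobKey.getSecond().getValue();
            final JobRecord job = getJob(jobKey);
            if (job == null) {
              // orphaned deadline: the job is already gone, clean it up and continue
              deadlinesColumnFamily.delete(deadlineAndJobKey);
              return true;
            }
            return visitor.test(jobKey, job);
          });
    }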

The indirection I experienced while looking into this is caused by the clean-up. It feels like a workaround for the fact that 'normal' operation can leave orphaned deadlines behind in the first place.

Consider again the race condition: a JobBatch:Activate command followed by a Job:Cancel command, both interacting with the same job. As part of processing the Job:Cancel command, the job is deleted, which includes deleting the deadline entry using the exact deadline value (not the job key). However, the job's deadline was already changed by the JobBatch:Activate, so only that 'latest' deadline is deleted; any earlier deadline entries are left behind. That begs the question whether another deadline can exist for a job when the JobBatch:Activate is processed and activates it. I think this can only happen when the job is in the activatable state, at which point I think no deadlines can exist for it, but as said before, I might be wrong here.

We should probably still do what was proposed earlier:

  • We should try to implement something which deletes all entries for a job key from the deadlines column family, no matter the deadline in the job record (a rough sketch follows below).
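
A rough sketch of what that could look like; deadlinesColumnFamily and whileTrue are assumed names, not the actual API, and because the column family is keyed by (deadline, jobKey) this naive version falls back to a full iteration:

    // Sketch only: remove every deadline entry that references the given job key,
    // regardless of which deadline value is currently stored in the job record.
    void removeAllDeadlinesForJob(final long jobKeyToRemove) {
      deadlinesColumnFamily.whileTrue(
          (deadlineAndJobKey, nil) -> {
            final long jobKey = deadlineAndJobKey.getSecond().getValue();
            if (jobKey == jobKeyToRemove) {
              deadlinesColumnFamily.delete(deadlineAndJobKey);
            }
            return true; // keep iterating, there may be multiple stale entries
          });
    }

A real implementation would probably want a (jobKey → deadline) index so the full scan can be avoided.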

I also think we should write a test to make sure we don't have this race condition and then remove this clean-up logic to reduce the indirection.
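
Sketch of the kind of regression test meant here; engine and its helper methods are hypothetical stand-ins for the engine test utilities, not an existing API:

    @Test
    public void shouldNotLeaveOrphanedDeadlineWhenJobIsCancelledAfterActivation() {
      // given: a job that has been activated, which assigns it a deadline
      final long jobKey = engine.createJobOfType("test");
      engine.activateJobs("test");

      // when: the job is cancelled right after the activation
      engine.cancelJob(jobKey);

      // then: no deadline entry for this job remains in the column family
      assertThat(engine.deadlineEntriesForJob(jobKey)).isEmpty();
    }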

@korthout korthout removed their assignment Jul 14, 2021
@domq

domq commented Mar 21, 2022

Hello, kindly consider #8949. We have hundreds of DEADLINE_JOBS entries per job (not just two), and currently no way to fix that, short of a shutdown / doctor-the-snapshot / restart cycle.

@KerstinHebel KerstinHebel removed this from Planned in Zeebe Mar 23, 2022
@npepinpe npepinpe added the area/reliability label and removed the Impact: Data label Apr 11, 2022
@Zelldon
Member

Zelldon commented Jun 8, 2023

I'm not 100% sure, but it looks like we have seen this again on SaaS.

We see several occurrences of:

Expected to find job with key 4503599627479187, but no job found

Error group https://console.cloud.google.com/errors/detail/CKe9pf-Tqe-ZZw;service=zeebe;time=P7D?project=camunda-saas-int

I think it can be reproduced with the recent game day (I think it is related to jobs that can't be activated or can't be sent to the client because of a too-large payload). That seems to cause these issues.

@Zelldon
Member

Zelldon commented Jun 8, 2023

Might also be related to #12778.

@Zelldon
Member

Zelldon commented Jun 8, 2023

Another one https://console.cloud.google.com/errors/detail/CKe9pf-Tqe-ZZw;service=zeebe;time=P7D?project=camunda-saas-int-chaos

It might be that this one is also from me, on a different cluster.

@korthout
Member

ZPA triage:

  • may be resolved by the bug fixes that @oleschoenburg and @korthout investigated
  • @korthout will investigate this in relation to those fixes
  • pulling it into the current iteration

@korthout
Member

Closed as resolved by:
