fix(engine): don't mutate state when checking for timed out jobs #13035

lenaschoenburg · 2023-06-09T08:23:16Z

This PR closes #12797, ensuring that job deadlines are reliably cleaned up by event appliers. This is necessary so that the job timeout checker does not have to mutate state.

With 91da1e8, we fix the job yield and recur processors to produce events with the current job state. Previously, these propagated the job record from the command into new follow-up events which may overwrite already persisted changes to a job.
With 330ce0b, it is now clear when deadlines are set and removed. This follows the rules we discussed, essentially adding a deadline on activation and removing whe the job is no longer activated or deleted.
Finally, cfdef6d is removing the illegal state modification from the checker. With the previous changes we can be sure that deadlines are cleaned up reliably and don't need to do redundant cleanup in the checker anymore.

Both command processors were using the job record from the command instead of looking up the current state. This could result in wrong processing based on outdated data and writing wrong events that further propagate the outdated job record. Co-authored-by: Nico Korthout <nico.korthout@camunda.com>

lenaschoenburg · 2023-06-09T08:41:49Z

engine/src/main/java/io/camunda/zeebe/engine/state/instance/DbJobState.java

+    // This only works because none of the events actually remove the deadline from the job record.
+    // If, say on job failure, the deadline is removed or reset to 0, then we would need to look at
+    // the current state of the job to determine what deadline to remove.
+    removeJobDeadline(updatedValue.getDeadline());


❓ I'm unsure about this part. Is it okay to assume "none of the events actually remove the deadline from the job record"?

lenaschoenburg · 2023-06-09T10:15:15Z

engine/src/main/java/io/camunda/zeebe/engine/state/instance/DbJobState.java

+    if (newState != State.ACTIVATED) {
+      // This only works because none of the events actually remove the deadline from the job record.
+      // If, say on job failure, the deadline is removed or reset to 0, then we would need to look at
+      // the current state of the job to determine what deadline to remove.
+      removeJobDeadline(key, updatedValue.getDeadline());


❓ I'm not sure about this assumption: "none of the events actually remove the deadline from the job record"

👍 This assumption is correct with the changes in this pull request.

Specifically, by reading the job record from the state and using that in the events: TIMED_OUT, RECURRED_AFTER_BACKOFF, and YIELDED. The other events were already appended with the job record read from the state.

lenaschoenburg · 2023-06-09T10:16:09Z

engine/src/main/java/io/camunda/zeebe/engine/processing/job/JobTimeOutProcessor.java

-      }
+      final var reason =
+          switch (state) {
+            case ACTIVATED -> "it has not timed out";


🙃 A bit inelegant, would be much nicer with switch guards, coming in Java 21.

Job timeouts now follow fairly simple rules: 1. When a job is ACTIVATED, the event contains a job record with a new deadline that is added to the deadline column family. It's not necessary to clean up other deadlines for this job because there are none. 2. When a job transitions away from ACTIVATED to a new state or is deleted, the deadline found in the event is removed from the state. 3. A checker process periodically scans deadlines and writes new TIME_OUT commands with the job record from the state. 4. When processing a TIME_OUT command, verify that the job is still timed out, otherwise reject the command. Write a TIMED_OUT event with the job record from the state. 5. On applying the TIMED_OUT event the job is made ACTIVATABLE again. Rule 2 applies and the deadline is removed. Check if the job is really has timed out before writing a `TIME_OUT` command. This is necessary because of the following scenario:

Scheduled tasks must not mutate state. With previous fixes, this cleanup logic is no longer necessary because throughout the job lifecycle, deadlines are reliably cleaned up and can't accidentally accumulate here.

korthout

Great job @oleschoenburg 👏

👍 I love the usage of switch expression for the rejection reason! It reads like a breeze and highlights the previously missing cases that were wrongfully caught by the default case. And, as you say, it can be improved even further when we're on Java 21.

👍 I love the change to a BiPredicate as a better-defined type

👍 I love the addition of addJobDeadline to improve the DbJobSate code

👍 LGTM

bors merge

korthout · 2023-06-12T14:02:50Z

engine/src/main/java/io/camunda/zeebe/engine/state/instance/DbJobState.java

-    if (deadline > 0) {
-      removeJobDeadline(deadline);
+    if (newState != State.ACTIVATED) {
+      // This only works because none of the events actually remove the deadline from the job
+      // record.
+      // If, say on job failure, the deadline is removed or reset to 0, then we would need to look
+      // at the current state of the job to determine what deadline to remove.
+      removeJobDeadline(key, updatedValue.getDeadline());


💭 This change might be changing the history of events written in previous versions of zeebe (don't worry, it is not).

Our events must produce the same state changes when they are applied during replay as when they were originally appended. Pretty much all job event appliers call this updateJob method. So when we make a change here, it affects all events that go through those event appliers.

Luckily, the change you made is tiny: no longer delete the job deadline when the newState is ACTIVATED. It so happens that the JobBatchActivatedApplier is the only one that does not call this method. And so, this change does not have any effect other than provide an additional safeguard against future changes. This means we can safely make the change 😌

💭 This does highlight a problem with changes to state classes. Recently we introduced #12814 for this exact reason when making changes to event appliers. But changes to state classes used by event appliers of potentially previously appended events should go through a record version bump as well.

zeebe-bors-camunda · 2023-06-12T14:53:11Z

Build succeeded:

Test summary

korthout · 2023-07-13T16:14:38Z

@oleschoenburg Maybe we should backport this 🤔

lenaschoenburg · 2023-07-14T12:08:00Z

/backport

backport-action · 2023-07-14T12:08:21Z

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-13035-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-13035-to-stable/8.0
git checkout -b backport-13035-to-stable/8.0
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f

backport-action · 2023-07-14T12:08:24Z

Backport failed for stable/8.1, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.1
git worktree add -d .worktree/backport-13035-to-stable/8.1 origin/stable/8.1
cd .worktree/backport-13035-to-stable/8.1
git checkout -b backport-13035-to-stable/8.1
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f

backport-action · 2023-07-14T12:08:27Z

Backport failed for stable/8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.2
git worktree add -d .worktree/backport-13035-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-13035-to-stable/8.2
git checkout -b backport-13035-to-stable/8.2
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f

13499: [Backport stable/8.0] Don't mutate state when checking for job timeouts and backoffs r=oleschoenburg a=oleschoenburg Manual backport of: - #13035 - #13402 - #13126 Merge conflicts because command processors use an older API and some classes were renamed or moved. Nothing major as far as I can tell. Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>

13510: [Backport stable/8.2] Don't mutate state when checking for job timeouts and backoffs r=korthout a=oleschoenburg Manual backport of: - #13035 - #13402 - #13126 - #13508 Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>

13509: [Backport stable/8.1] Don't mutate state when checking for job timeouts and backoffs r=oleschoenburg a=oleschoenburg Manual backport of: - #13035 - #13402 - #13126 - #13508 Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>

lenaschoenburg and others added 3 commits June 9, 2023 09:45

refactor(engine): simplify function types

90a3f26

refactor(engine): improve rejection message when job cannot timeout

b033992

lenaschoenburg force-pushed the os/12797-immutable-job-timeout branch from fb78661 to d329d27 Compare June 9, 2023 10:13

lenaschoenburg commented Jun 9, 2023

View reviewed changes

lenaschoenburg force-pushed the os/12797-immutable-job-timeout branch from d329d27 to 7570612 Compare June 9, 2023 10:23

lenaschoenburg added 2 commits June 9, 2023 12:38

fix(engine): don't mutate state in scheduled deadline checker job

cfdef6d

Scheduled tasks must not mutate state. With previous fixes, this cleanup logic is no longer necessary because throughout the job lifecycle, deadlines are reliably cleaned up and can't accidentally accumulate here.

lenaschoenburg force-pushed the os/12797-immutable-job-timeout branch from 7570612 to cfdef6d Compare June 9, 2023 10:39

lenaschoenburg marked this pull request as ready for review June 9, 2023 10:47

lenaschoenburg requested a review from korthout June 9, 2023 10:48

korthout approved these changes Jun 12, 2023

View reviewed changes

zeebe-bors-camunda bot merged commit a63aaa3 into main Jun 12, 2023
30 checks passed

zeebe-bors-camunda bot deleted the os/12797-immutable-job-timeout branch June 12, 2023 14:53

lenaschoenburg mentioned this pull request Jul 14, 2023

fix(engine): don't mutate state when checking for job backoffs #13402

Merged

lenaschoenburg added backport stable/8.0 backport stable/8.2 Backport a pull request to 8.2.x labels Jul 14, 2023

lenaschoenburg mentioned this pull request Jul 14, 2023

[Backport stable/8.0] Don't mutate state when checking for job timeouts and backoffs #13499

Merged

This was referenced Jul 17, 2023

[Backport stable/8.1] Don't mutate state when checking for job timeouts and backoffs #13509

Merged

[Backport stable/8.2] Don't mutate state when checking for job timeouts and backoffs #13510

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine): don't mutate state when checking for timed out jobs #13035

fix(engine): don't mutate state when checking for timed out jobs #13035

lenaschoenburg commented Jun 9, 2023 •

edited

lenaschoenburg Jun 9, 2023

lenaschoenburg Jun 9, 2023

korthout Jun 12, 2023

lenaschoenburg Jun 9, 2023

korthout left a comment

korthout Jun 12, 2023

zeebe-bors-camunda bot commented Jun 12, 2023

korthout commented Jul 13, 2023

lenaschoenburg commented Jul 14, 2023

backport-action commented Jul 14, 2023

backport-action commented Jul 14, 2023

backport-action commented Jul 14, 2023

fix(engine): don't mutate state when checking for timed out jobs #13035

fix(engine): don't mutate state when checking for timed out jobs #13035

Conversation

lenaschoenburg commented Jun 9, 2023 • edited

lenaschoenburg Jun 9, 2023

Choose a reason for hiding this comment

lenaschoenburg Jun 9, 2023

Choose a reason for hiding this comment

korthout Jun 12, 2023

Choose a reason for hiding this comment

lenaschoenburg Jun 9, 2023

Choose a reason for hiding this comment

korthout left a comment

Choose a reason for hiding this comment

korthout Jun 12, 2023

Choose a reason for hiding this comment

zeebe-bors-camunda bot commented Jun 12, 2023

korthout commented Jul 13, 2023

lenaschoenburg commented Jul 14, 2023

backport-action commented Jul 14, 2023

backport-action commented Jul 14, 2023

backport-action commented Jul 14, 2023

lenaschoenburg commented Jun 9, 2023 •

edited