Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(engine): don't mutate state when checking for timed out jobs #13035

Merged
merged 5 commits into from
Jun 12, 2023

Conversation

lenaschoenburg
Copy link
Member

@lenaschoenburg lenaschoenburg commented Jun 9, 2023

This PR closes #12797, ensuring that job deadlines are reliably cleaned up by event appliers. This is necessary so that the job timeout checker does not have to mutate state.

With 91da1e8, we fix the job yield and recur processors to produce events with the current job state. Previously, these propagated the job record from the command into new follow-up events which may overwrite already persisted changes to a job.
With 330ce0b, it is now clear when deadlines are set and removed. This follows the rules we discussed, essentially adding a deadline on activation and removing whe the job is no longer activated or deleted.
Finally, cfdef6d is removing the illegal state modification from the checker. With the previous changes we can be sure that deadlines are cleaned up reliably and don't need to do redundant cleanup in the checker anymore.

lenaschoenburg and others added 3 commits June 9, 2023 09:45
Both command processors were using the job record from the command
instead of looking up the current state. This could result in wrong
processing based on outdated data and writing wrong events that
further propagate the outdated job record.

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
@lenaschoenburg lenaschoenburg force-pushed the os/12797-immutable-job-timeout branch from fb78661 to d329d27 Compare June 9, 2023 10:13
Comment on lines 233 to 236
// This only works because none of the events actually remove the deadline from the job record.
// If, say on job failure, the deadline is removed or reset to 0, then we would need to look at
// the current state of the job to determine what deadline to remove.
removeJobDeadline(updatedValue.getDeadline());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ I'm unsure about this part. Is it okay to assume "none of the events actually remove the deadline from the job record"?

Comment on lines 233 to 238
if (newState != State.ACTIVATED) {
// This only works because none of the events actually remove the deadline from the job record.
// If, say on job failure, the deadline is removed or reset to 0, then we would need to look at
// the current state of the job to determine what deadline to remove.
removeJobDeadline(key, updatedValue.getDeadline());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ I'm not sure about this assumption: "none of the events actually remove the deadline from the job record"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 This assumption is correct with the changes in this pull request.

Specifically, by reading the job record from the state and using that in the events: TIMED_OUT, RECURRED_AFTER_BACKOFF, and YIELDED. The other events were already appended with the job record read from the state.

}
final var reason =
switch (state) {
case ACTIVATED -> "it has not timed out";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙃 A bit inelegant, would be much nicer with switch guards, coming in Java 21.

@lenaschoenburg lenaschoenburg force-pushed the os/12797-immutable-job-timeout branch from d329d27 to 7570612 Compare June 9, 2023 10:23
Job timeouts now follow fairly simple rules:
1. When a job is ACTIVATED, the event contains a job record with a new
   deadline that is added to the deadline column family. It's not
   necessary to clean up other deadlines for this job because there are
   none.
2. When a job transitions away from ACTIVATED to a new state or is
   deleted, the deadline found in the event is removed from the state.
3. A checker process periodically scans deadlines and writes new
   TIME_OUT commands with the job record from the state.
4. When processing a TIME_OUT command, verify that the job is still
   timed out, otherwise reject the command. Write a TIMED_OUT event with
   the job record from the state.
5. On applying the TIMED_OUT event the job is made ACTIVATABLE again.
   Rule 2 applies and the deadline is removed.

Check if the job is really has timed out before writing a `TIME_OUT`
command. This is necessary because of the following scenario:
Scheduled tasks must not mutate state. With previous fixes, this
cleanup logic is no longer necessary because throughout the job
lifecycle, deadlines are reliably cleaned up and can't accidentally
accumulate here.
@lenaschoenburg lenaschoenburg force-pushed the os/12797-immutable-job-timeout branch from 7570612 to cfdef6d Compare June 9, 2023 10:39
@lenaschoenburg lenaschoenburg marked this pull request as ready for review June 9, 2023 10:47
Copy link
Member

@korthout korthout left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @oleschoenburg 👏

👍 I love the usage of switch expression for the rejection reason! It reads like a breeze and highlights the previously missing cases that were wrongfully caught by the default case. And, as you say, it can be improved even further when we're on Java 21.

👍 I love the change to a BiPredicate as a better-defined type

👍 I love the addition of addJobDeadline to improve the DbJobSate code

👍 LGTM

bors merge

Comment on lines -237 to +238
if (deadline > 0) {
removeJobDeadline(deadline);
if (newState != State.ACTIVATED) {
// This only works because none of the events actually remove the deadline from the job
// record.
// If, say on job failure, the deadline is removed or reset to 0, then we would need to look
// at the current state of the job to determine what deadline to remove.
removeJobDeadline(key, updatedValue.getDeadline());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 This change might be changing the history of events written in previous versions of zeebe (don't worry, it is not).

Our events must produce the same state changes when they are applied during replay as when they were originally appended. Pretty much all job event appliers call this updateJob method. So when we make a change here, it affects all events that go through those event appliers.

Luckily, the change you made is tiny: no longer delete the job deadline when the newState is ACTIVATED. It so happens that the JobBatchActivatedApplier is the only one that does not call this method. And so, this change does not have any effect other than provide an additional safeguard against future changes. This means we can safely make the change 😌

💭 This does highlight a problem with changes to state classes. Recently we introduced #12814 for this exact reason when making changes to event appliers. But changes to state classes used by event appliers of potentially previously appended events should go through a record version bump as well.

@zeebe-bors-camunda
Copy link
Contributor

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit a63aaa3 into main Jun 12, 2023
30 checks passed
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the os/12797-immutable-job-timeout branch June 12, 2023 14:53
@korthout
Copy link
Member

@oleschoenburg Maybe we should backport this 🤔

@lenaschoenburg
Copy link
Member Author

/backport

@backport-action
Copy link
Collaborator

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-13035-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-13035-to-stable/8.0
git checkout -b backport-13035-to-stable/8.0
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f

@backport-action
Copy link
Collaborator

Backport failed for stable/8.1, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.1
git worktree add -d .worktree/backport-13035-to-stable/8.1 origin/stable/8.1
cd .worktree/backport-13035-to-stable/8.1
git checkout -b backport-13035-to-stable/8.1
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f

@backport-action
Copy link
Collaborator

Backport failed for stable/8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.2
git worktree add -d .worktree/backport-13035-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-13035-to-stable/8.2
git checkout -b backport-13035-to-stable/8.2
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f

zeebe-bors-camunda bot added a commit that referenced this pull request Jul 14, 2023
13499: [Backport stable/8.0] Don't mutate state when checking for job timeouts and backoffs r=oleschoenburg a=oleschoenburg

Manual backport of:
- #13035
- #13402
- #13126

Merge conflicts because command processors use an older API and some classes were renamed or moved. Nothing major as far as I can tell.

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Jul 17, 2023
13510: [Backport stable/8.2] Don't mutate state when checking for job timeouts and backoffs r=korthout a=oleschoenburg

Manual backport of:
- #13035
- #13402
- #13126
- #13508

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Jul 17, 2023
13509: [Backport stable/8.1] Don't mutate state when checking for job timeouts and backoffs r=oleschoenburg a=oleschoenburg

Manual backport of:
- #13035
- #13402
- #13126
- #13508

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport stable/8.2 Backport a pull request to 8.2.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't mutate state through JobTimeoutTrigger
3 participants