-
Notifications
You must be signed in to change notification settings - Fork 558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(engine): don't mutate state when checking for timed out jobs #13035
Conversation
Both command processors were using the job record from the command instead of looking up the current state. This could result in wrong processing based on outdated data and writing wrong events that further propagate the outdated job record. Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
fb78661
to
d329d27
Compare
// This only works because none of the events actually remove the deadline from the job record. | ||
// If, say on job failure, the deadline is removed or reset to 0, then we would need to look at | ||
// the current state of the job to determine what deadline to remove. | ||
removeJobDeadline(updatedValue.getDeadline()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ I'm unsure about this part. Is it okay to assume "none of the events actually remove the deadline from the job record"?
if (newState != State.ACTIVATED) { | ||
// This only works because none of the events actually remove the deadline from the job record. | ||
// If, say on job failure, the deadline is removed or reset to 0, then we would need to look at | ||
// the current state of the job to determine what deadline to remove. | ||
removeJobDeadline(key, updatedValue.getDeadline()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ I'm not sure about this assumption: "none of the events actually remove the deadline from the job record"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 This assumption is correct with the changes in this pull request.
Specifically, by reading the job record from the state and using that in the events: TIMED_OUT
, RECURRED_AFTER_BACKOFF
, and YIELDED
. The other events were already appended with the job record read from the state.
} | ||
final var reason = | ||
switch (state) { | ||
case ACTIVATED -> "it has not timed out"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙃 A bit inelegant, would be much nicer with switch guards, coming in Java 21.
d329d27
to
7570612
Compare
Job timeouts now follow fairly simple rules: 1. When a job is ACTIVATED, the event contains a job record with a new deadline that is added to the deadline column family. It's not necessary to clean up other deadlines for this job because there are none. 2. When a job transitions away from ACTIVATED to a new state or is deleted, the deadline found in the event is removed from the state. 3. A checker process periodically scans deadlines and writes new TIME_OUT commands with the job record from the state. 4. When processing a TIME_OUT command, verify that the job is still timed out, otherwise reject the command. Write a TIMED_OUT event with the job record from the state. 5. On applying the TIMED_OUT event the job is made ACTIVATABLE again. Rule 2 applies and the deadline is removed. Check if the job is really has timed out before writing a `TIME_OUT` command. This is necessary because of the following scenario:
Scheduled tasks must not mutate state. With previous fixes, this cleanup logic is no longer necessary because throughout the job lifecycle, deadlines are reliably cleaned up and can't accidentally accumulate here.
7570612
to
cfdef6d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job @oleschoenburg 👏
👍 I love the usage of switch expression for the rejection reason! It reads like a breeze and highlights the previously missing cases that were wrongfully caught by the default case. And, as you say, it can be improved even further when we're on Java 21.
👍 I love the change to a BiPredicate as a better-defined type
👍 I love the addition of addJobDeadline
to improve the DbJobSate code
👍 LGTM
bors merge
if (deadline > 0) { | ||
removeJobDeadline(deadline); | ||
if (newState != State.ACTIVATED) { | ||
// This only works because none of the events actually remove the deadline from the job | ||
// record. | ||
// If, say on job failure, the deadline is removed or reset to 0, then we would need to look | ||
// at the current state of the job to determine what deadline to remove. | ||
removeJobDeadline(key, updatedValue.getDeadline()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💭 This change might be changing the history of events written in previous versions of zeebe (don't worry, it is not).
Our events must produce the same state changes when they are applied during replay as when they were originally appended. Pretty much all job event appliers call this updateJob
method. So when we make a change here, it affects all events that go through those event appliers.
Luckily, the change you made is tiny: no longer delete the job deadline when the newState
is ACTIVATED
. It so happens that the JobBatchActivatedApplier is the only one that does not call this method. And so, this change does not have any effect other than provide an additional safeguard against future changes. This means we can safely make the change 😌
💭 This does highlight a problem with changes to state classes. Recently we introduced #12814 for this exact reason when making changes to event appliers. But changes to state classes used by event appliers of potentially previously appended events should go through a record version bump as well.
Build succeeded: |
@oleschoenburg Maybe we should backport this 🤔 |
/backport |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.0
git worktree add -d .worktree/backport-13035-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-13035-to-stable/8.0
git checkout -b backport-13035-to-stable/8.0
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.1
git worktree add -d .worktree/backport-13035-to-stable/8.1 origin/stable/8.1
cd .worktree/backport-13035-to-stable/8.1
git checkout -b backport-13035-to-stable/8.1
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.2
git worktree add -d .worktree/backport-13035-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-13035-to-stable/8.2
git checkout -b backport-13035-to-stable/8.2
ancref=$(git merge-base d166007d8fee3fa6f112367ea595d35199807f4f cfdef6d037b78f5c78505049aa74bab00260035f)
git cherry-pick -x $ancref..cfdef6d037b78f5c78505049aa74bab00260035f |
13499: [Backport stable/8.0] Don't mutate state when checking for job timeouts and backoffs r=oleschoenburg a=oleschoenburg Manual backport of: - #13035 - #13402 - #13126 Merge conflicts because command processors use an older API and some classes were renamed or moved. Nothing major as far as I can tell. Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
This PR closes #12797, ensuring that job deadlines are reliably cleaned up by event appliers. This is necessary so that the job timeout checker does not have to mutate state.
With 91da1e8, we fix the job yield and recur processors to produce events with the current job state. Previously, these propagated the job record from the command into new follow-up events which may overwrite already persisted changes to a job.
With 330ce0b, it is now clear when deadlines are set and removed. This follows the rules we discussed, essentially adding a deadline on activation and removing whe the job is no longer activated or deleted.
Finally, cfdef6d is removing the illegal state modification from the checker. With the previous changes we can be sure that deadlines are cleaned up reliably and don't need to do redundant cleanup in the checker anymore.