
Don't mutate state through JobTimeoutTrigger #12797

Closed
Tracked by #12302
korthout opened this issue May 17, 2023 · 4 comments · Fixed by #13035


korthout commented May 17, 2023

We need to make sure that JobTimeoutTrigger does not have the ability to mutate state.

JobTimeoutTrigger.scheduleDeactivateTimedOutJobsTask is able to mutate state, although it should not be allowed to do so. The state may only be mutated by event appliers.

Once this is resolved, we could add an ArchUnit test to verify that only event appliers use the mutable state.
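
As a rough illustration, such a rule could look like the sketch below (the package names and the "Mutable" naming convention are assumptions about the current engine layout, not a finished test):

import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

public class MutableStateAccessTest {

  public static void main(final String[] args) {
    // Import the engine classes (package assumed from the Zeebe module layout).
    final JavaClasses engineClasses =
        new ClassFileImporter().importPackages("io.camunda.zeebe.engine");

    // Only event appliers may depend on the mutable state interfaces,
    // identified here by the assumed "Mutable" prefix (e.g. MutableJobState).
    final ArchRule onlyAppliersMutateState =
        noClasses()
            .that().resideOutsideOfPackage("..state.appliers..")
            .should().dependOnClassesThat().haveSimpleNameStartingWith("Mutable")
            .because("state may only be mutated by event appliers");

    onlyAppliersMutateState.check(engineClasses);
  }
}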

This is a blocker for:

korthout added the kind/toil and component/engine labels on May 17, 2023
oleschoenburg commented Jun 1, 2023

This is related to #6521

Also, job backoffs appear to have the same issue. The JobBackoffChecker calls DbJobState#findBackedOffJobs, which removes entries from the backoff column family:

@Override
public long findBackedOffJobs(final long timestamp, final BiPredicate<Long, JobRecord> callback) {
  nextBackOffDueDate = -1L;
  backoffColumnFamily.whileTrue(
      (key, value) -> {
        final long deadline = key.first().getValue();
        boolean consumed = false;
        if (deadline <= timestamp) {
          final long jobKey = key.second().inner().getValue();
          consumed = visitJob(jobKey, callback, () -> backoffColumnFamily.deleteExisting(key));
        }
        if (!consumed) {
          nextBackOffDueDate = deadline;
        }
        return consumed;
      });
  return nextBackOffDueDate;
}

oleschoenburg commented:
When @korthout and I discussed this initially, we thought that the solution could be straightforward:

Don't remove the deadline entry when the checker iterates through deadlines to find timed-out jobs; instead, remove it when applying the TIMED_OUT event. Additionally, reject TIME_OUT commands if the deadline has changed in the meantime; this would fix an already existing bug!

Problem

However, I think it is currently not possible to do this correctly. The problem is that we don't have enough information to clean up deadlines reliably. See #6521 for reference; this is tricky to get right!

So let's think about when we add or remove deadlines.

Adding a deadline is easy: Whenever the job is activated, a new deadline is computed and added to the column family.

Removing a deadline needs to happen whenever the job transitions away from activated to a different state (completed, failed etc.) or when the job is simply deleted.

The problem when removing a deadline is that we don't know which deadline to remove! The deadline from the record might be outdated and there is no mapping that provides all deadlines for a given job.

Solution

I think there are two solutions:

  1. Support finding one or multiple deadlines by job key.
  2. Always rely on TIME_OUT commands.

The first option would require a new column family or inefficient scanning of the entire deadline column family. It would allow us to always have a one-to-one mapping where a job has exactly one deadline. TIME_OUT commands for a deadline that no longer exists would be rejected.

The second option would require that we write TIME_OUT commands with just the job key and deadline, without any of the other job data. This is because by the time the checker finds a deadline entry, the corresponding job might already have been deleted. Ultimately, the checker is responsible for cleaning up all deadlines: eventually it will find all expired deadlines and write TIME_OUT commands for them, even when a deadline is just a left-over from deleting the job. The drawback is that I'm not sure whether it would be okay to remove the job data from the TIME_OUT commands, and that we would more often get into situations where new commands are written for process instances that have already finished, potentially long ago!

Proposal

So in conclusion, I'd lean towards the first option: introducing a new column family that ensures there is always exactly one deadline per job, which reliably gets cleaned up when the job state changes or the job is deleted. In other words, in addition to our deadline column family, which tracks (deadline, job), we'd add a new one for (job, deadline).
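
To make the bookkeeping concrete, here is a minimal in-memory sketch of the idea (the class and method names are illustrative, not the actual DbJobState API): both directions are updated together, so removing a job's deadline only needs the job key.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative model of the proposed double bookkeeping for job deadlines. */
public final class JobDeadlinesSketch {

  // Existing direction, keyed by (deadline, jobKey): ordered so the checker can scan
  // expired deadlines first. Modeled here as deadline -> job keys with that deadline.
  private final NavigableMap<Long, Set<Long>> jobsByDeadline = new TreeMap<>();

  // Proposed direction: jobKey -> current deadline, so cleanup never has to guess
  // which deadline entry belongs to the job.
  private final Map<Long, Long> deadlineByJob = new HashMap<>();

  /** On applying the job ACTIVATED event: afterwards the job has exactly one deadline. */
  public void onJobActivated(final long jobKey, final long deadline) {
    deadlineByJob.put(jobKey, deadline);
    jobsByDeadline.computeIfAbsent(deadline, d -> new HashSet<>()).add(jobKey);
  }

  /** On any transition away from ACTIVATED (completed, failed, timed out, deleted, ...). */
  public void onJobDeactivated(final long jobKey) {
    final Long deadline = deadlineByJob.remove(jobKey);
    if (deadline != null) {
      final Set<Long> jobs = jobsByDeadline.get(deadline);
      jobs.remove(jobKey);
      if (jobs.isEmpty()) {
        jobsByDeadline.remove(deadline);
      }
    }
  }
}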

I'd like to get a second opinion on this again though, it's possible that I'm missing something or that a different solution would be better.

oleschoenburg commented Jun 2, 2023

The important invariant is that there is always a one-to-one mapping between deadlines and jobs. As long as we have that, it is clear when to remove or update a deadline, and we don't have to rely on the checker to clean up deadlines.

What I missed previously were two things that @korthout kindly cleared up for me:

  1. We already have a way to map jobs to deadlines. I thought we'd need a new column family but the information is already available in the column family that maps job keys to job records. It's just not specialized for deadlines.
  2. When processing job related events that update deadlines, the event should already include the current job record with the correct deadline.

Note the "should" in there. As of now, this is not the case for all processors, and some events are written with the record taken from the command, not the state. And so the TIMED_OUT, RECURRED_AFTER_BACKOFF and YIELDED events contain deadlines that might not match what's in the state. This is already a bug 🐞 that is (at least partially) compensated for by the cleanup of deadlines when the checker runs.

After we fix this bug, the entire job deadline behavior can be described with the following rules:

  1. When a job is ACTIVATED, the event contains a job record with a new deadline that is added to the deadline column family. It's not necessary to clean up other deadlines for this job because there are none.
  2. When a job transitions away from ACTIVATED to a new state or is deleted, the deadline found in the event is removed from the state.
  3. A checker process periodically scans deadlines and writes new TIME_OUT commands with the job record from the state.
  4. When processing a TIME_OUT command, verify that command and state deadline match, otherwise reject the command. If they match, write a TIMED_OUT event with the job record from the state.
  5. On applying the TIMED_OUT event the job is made ACTIVATABLE again. Rule 2 applies and the deadline is removed.
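
A sketch of what rules 3-5 could look like on the processing side (all names here are illustrative and not the actual engine API; in particular, the state view that returns the current deadline per job key is an assumption):

/** Illustrative sketch of rule 4 with the strict (exact-match) interpretation. */
final class JobTimeOutProcessorSketch {

  private final JobStateView jobState;     // read-only state access, assumed to expose the deadline
  private final ResultWriter resultWriter; // hypothetical event/rejection writer

  JobTimeOutProcessorSketch(final JobStateView jobState, final ResultWriter resultWriter) {
    this.jobState = jobState;
    this.resultWriter = resultWriter;
  }

  void processTimeOutCommand(final long jobKey, final long commandDeadline) {
    final Long stateDeadline = jobState.findDeadline(jobKey);

    // Rule 4: reject if the job is gone or its deadline changed after the checker
    // wrote the TIME_OUT command (rule 3).
    if (stateDeadline == null || stateDeadline != commandDeadline) {
      resultWriter.rejectTimeOut(jobKey, "job no longer has the expected deadline");
      return;
    }

    // Accepted: the TIMED_OUT event carries the job record from the state, not the command.
    // Rule 5 then happens in the event applier: the job becomes ACTIVATABLE again and,
    // per rule 2, its deadline entry is removed from the state.
    resultWriter.appendTimedOutEvent(jobKey, jobState.getJobRecord(jobKey));
  }

  interface JobStateView {
    Long findDeadline(long jobKey);

    Object getJobRecord(long jobKey);
  }

  interface ResultWriter {
    void rejectTimeOut(long jobKey, String reason);

    void appendTimedOutEvent(long jobKey, Object jobRecord);
  }
}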

Rule 4 is a bit open for interpretation. Instead of requiring an exact match between the state and command deadline, we could also process the command as long as the state deadline is not in the future. This is more lenient: jobs would time out sooner, but still not too soon. It might help with long processing queues, where a new deadline is set before the first TIME_OUT command is processed.
However, I don't think it is necessary for correctness, and it feels like cheating the stream processing rules a little bit. After all, before reaching the TIME_OUT command, the job could have transitioned to a (hypothetical!) state where it no longer times out automatically.
The risk of not being lenient and requiring an exact match between the command and state deadline is that, with a permanently long processing queue, the timeout mechanism would stop working if the job's deadline were updated again and again before the TIME_OUT command is processed. I don't see this as a big problem though; as far as I understand, it does not break any correctness guarantees.
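
For comparison, the lenient interpretation would only change the check in the sketch above (same illustrative names): accept the command as long as the deadline currently stored in the state has already passed.

  // Lenient variant of rule 4: accept the command whenever the job's current
  // deadline is no longer in the future at processing time.
  void processTimeOutCommandLenient(final long jobKey, final long now) {
    final Long stateDeadline = jobState.findDeadline(jobKey);
    if (stateDeadline == null || stateDeadline > now) {
      resultWriter.rejectTimeOut(jobKey, "job has no expired deadline");
      return;
    }
    resultWriter.appendTimedOutEvent(jobKey, jobState.getJobRecord(jobKey));
  }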

korthout commented Jun 2, 2023

Thanks for another amazing write-up @oleschoenburg! 👏 Let's fix this 🐞

👍 Regarding rule 4, I also lean towards the more lenient alternative.

  1. When processing a TIME_OUT command, verify that command and state deadline match; otherwise, reject the command. If they match, write a TIMED_OUT event with the job record from the state

As you've said: Although this is correct, it is already enough to check that the deadline has passed. IMO, this fits better semantically in the timeout processor:

  • The command is a request to time out the job, and the engine verifies whether the job's deadline has actually already passed before deciding to reject this command or accept it by writing a job TIMED_OUT event

💭 That not all data in the command is up to date with the current state is not so important. In fact, it might be okay to write an empty command (only the key and intent are actually relevant). Because of the asynchronous nature of stream processing, commands may always be written while the state differs from the state at processing time. If we rejected all commands containing out-of-date data, then we should also not allow a job worker to COMPLETE a job after it has timed out and was ACTIVATED for another worker. However, we say that when a job worker completes the job, it does not matter which one; the job was completed either way. We can think about job timeouts in the same way.

⚖️ But, to be fair, both forms of rule 4 work, and I'd be fine with either choice.

oleschoenburg added the kind/bug label on Jun 9, 2023
oleschoenburg added a commit that referenced this issue Aug 14, 2023
After an update from a previous version, both backoff and deadline
column families might contain entries without a corresponding job or
multiple entries for a single job. Before fixing #12797 and #13041,
these were cleaned up ad-hoc whenever they were found. This is no longer
the case because we now prevent the creation of duplicated entries and
always cleanup properly.
This adds two necessary migrations that remove orphaned entries that
were left by a previous version. The migrations run once and walk
through all deadline and backoff entries, removing those without a job
and duplicates which don't match the current job state.

zeebe-bors-camunda bot added a commit that referenced this issue Aug 15, 2023
13886: fix(engine): cleanup orphaned job timeouts and backoffs on migration r=koevskinikola a=oleschoenburg

closes #13881

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>