
Don't mutate state through JobTimeoutTrigger #12797

Closed
Tracked by #12302
korthout opened this issue May 17, 2023 · 4 comments · Fixed by #13035


korthout commented May 17, 2023

We need to make sure that JobTimeoutTrigger does not have the ability to mutate state.

JobTimeoutTrigger.scheduleDeactivateTimedOutJobsTask is able to mutate state, although it should not be allowed to do so. The state may only be mutated by event appliers.

Once this is resolved, we could add an ArchUnit test to verify that only event appliers use the mutable state.
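
As a rough illustration, such a rule could look like the sketch below (the package names and the "Mutable" naming convention are assumptions about the current engine layout, not a finished test):

import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.noClasses;

public class MutableStateAccessTest {

  public static void main(final String[] args) {
    // Import the engine classes (package assumed from the Zeebe module layout).
    final JavaClasses engineClasses =
        new ClassFileImporter().importPackages("io.camunda.zeebe.engine");

    // Only event appliers may depend on the mutable state interfaces,
    // identified here by the assumed "Mutable" prefix (e.g. MutableJobState).
    final ArchRule onlyAppliersMutateState =
        noClasses()
            .that().resideOutsideOfPackage("..state.appliers..")
            .should().dependOnClassesThat().haveSimpleNameStartingWith("Mutable")
            .because("state may only be mutated by event appliers");

    onlyAppliersMutateState.check(engineClasses);
  }
}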

This is a blocker for:

korthout added the kind/toil and component/engine labels on May 17, 2023
oleschoenburg commented Jun 1, 2023

This is related to #6521

Also, job backoffs appear to have the same issue. The JobBackoffChecker calls DbJobState#findBackedOffJobs, which removes entries from the backoff column family:

@Override
public long findBackedOffJobs(final long timestamp, final BiPredicate<Long, JobRecord> callback) {
  nextBackOffDueDate = -1L;
  backoffColumnFamily.whileTrue(
      (key, value) -> {
        final long deadline = key.first().getValue();
        boolean consumed = false;
        if (deadline <= timestamp) {
          final long jobKey = key.second().inner().getValue();
          consumed = visitJob(jobKey, callback, () -> backoffColumnFamily.deleteExisting(key));
        }
        if (!consumed) {
          nextBackOffDueDate = deadline;
        }
        return consumed;
      });
  return nextBackOffDueDate;
}

oleschoenburg commented:
When @korthout and I discussed this initially, we thought that the solution could be straightforward:

Don't remove the deadline entry when the checker iterates through deadlines to find timed-out jobs; instead, remove it when applying the TIMED_OUT event. Additionally, reject TIME_OUT commands if the deadline has changed in the meantime; this would fix an already existing bug!

Problem

However, I think it is currently not possible to do this correctly. The problem is that we don't have enough information to clean up deadlines reliably. See #6521 for reference; this is tricky to get right!

So let's think about when we add or remove deadlines.

Adding a deadline is easy: Whenever the job is activated, a new deadline is computed and added to the column family.

Removing a deadline needs to happen whenever the job transitions away from activated to a different state (completed, failed etc.) or when the job is simply deleted.

The problem when removing a deadline is that we don't know which deadline to remove! The deadline from the record might be outdated and there is no mapping that provides all deadlines for a given job.

Solution

I think there are two solutions:

  1. Support finding one or multiple deadlines by job key.
  2. Always rely on TIME_OUT commands.

The first option would require a new column family or inefficient scanning of the entire deadline column family. It would allow us to always have a one-to-one mapping where a job has exactly one deadline. TIME_OUT commands for a deadline that no longer exists would be rejected.

The second option would require that we write TIME_OUT commands with just the job key and deadline, without any of the other job data. This is because by the time the checker finds a deadline entry, the corresponding job might already have been deleted. Ultimately, the checker is responsible for cleaning up all deadlines: eventually it will find all expired deadlines and write TIME_OUT commands for them, even when a deadline is just a left-over from deleting the job. The drawback is that I'm not sure whether it would be okay to remove the job data from the TIME_OUT commands, and that we would more often get into situations where new commands are written for process instances that have already finished, potentially long ago!

Proposal

So in conclusion, I'd lean towards the first option: introducing a new column family that ensures there is always exactly one deadline per job, which reliably gets cleaned up when the job state changes or the job is deleted. In other words, in addition to our deadline column family, which tracks (deadline, job), we'd add a new one for (job, deadline).
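
To make the bookkeeping concrete, here is a minimal in-memory sketch of the idea (the class and method names are illustrative, not the actual DbJobState API): both directions are updated together, so removing a job's deadline only needs the job key.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative model of the proposed double bookkeeping for job deadlines. */
public final class JobDeadlinesSketch {

  // Existing direction, keyed by (deadline, jobKey): ordered so the checker can scan
  // expired deadlines first. Modeled here as deadline -> job keys with that deadline.
  private final NavigableMap<Long, Set<Long>> jobsByDeadline = new TreeMap<>();

  // Proposed direction: jobKey -> current deadline, so cleanup never has to guess
  // which deadline entry belongs to the job.
  private final Map<Long, Long> deadlineByJob = new HashMap<>();

  /** On applying the job ACTIVATED event: afterwards the job has exactly one deadline. */
  public void onJobActivated(final long jobKey, final long deadline) {
    deadlineByJob.put(jobKey, deadline);
    jobsByDeadline.computeIfAbsent(deadline, d -> new HashSet<>()).add(jobKey);
  }

  /** On any transition away from ACTIVATED (completed, failed, timed out, deleted, ...). */
  public void onJobDeactivated(final long jobKey) {
    final Long deadline = deadlineByJob.remove(jobKey);
    if (deadline != null) {
      final Set<Long> jobs = jobsByDeadline.get(deadline);
      jobs.remove(jobKey);
      if (jobs.isEmpty()) {
        jobsByDeadline.remove(deadline);
      }
    }
  }
}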

I'd like to get a second opinion on this again though, it's possible that I'm missing something or that a different solution would be better.

oleschoenburg commented Jun 2, 2023

The important invariant is that there is always a one-to-one mapping between deadlines and jobs. As long as we have that, it is clear when to remove or update a deadline, and we don't have to rely on the checker to clean up deadlines.

What I missed previously were two things that @korthout kindly cleared up for me:

  1. We already have a way to map jobs to deadlines. I thought we'd need a new column family but the information is already available in the column family that maps job keys to job records. It's just not specialized for deadlines.
  2. When processing job related events that update deadlines, the event should already include the current job record with the correct deadline.

Note the "should" in there. As of now, this is not the case for all processors, and some events are written with the record taken from the command, not the state. And so the TIMED_OUT, RECURRED_AFTER_BACKOFF and YIELDED events contain deadlines that might not match what's in the state. This is already a bug 🐞 that is (at least partially) compensated for by the cleanup of deadlines when the checker runs.

After we fix this bug, the entire job deadline behavior can be described with the following rules:

  1. When a job is ACTIVATED, the event contains a job record with a new deadline that is added to the deadline column family. It's not necessary to clean up other deadlines for this job because there are none.
  2. When a job transitions away from ACTIVATED to a new state or is deleted, the deadline found in the event is removed from the state.
  3. A checker process periodically scans deadlines and writes new TIME_OUT commands with the job record from the state.
  4. When processing a TIME_OUT command, verify that command and state deadline match, otherwise reject the command. If they match, write a TIMED_OUT event with the job record from the state.
  5. On applying the TIMED_OUT event the job is made ACTIVATABLE again. Rule 2 applies and the deadline is removed.
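
A sketch of what rules 3-5 could look like on the processing side (all names here are illustrative and not the actual engine API; in particular, the state view that returns the current deadline per job key is an assumption):

/** Illustrative sketch of rule 4 with the strict (exact-match) interpretation. */
final class JobTimeOutProcessorSketch {

  private final JobStateView jobState;     // read-only state access, assumed to expose the deadline
  private final ResultWriter resultWriter; // hypothetical event/rejection writer

  JobTimeOutProcessorSketch(final JobStateView jobState, final ResultWriter resultWriter) {
    this.jobState = jobState;
    this.resultWriter = resultWriter;
  }

  void processTimeOutCommand(final long jobKey, final long commandDeadline) {
    final Long stateDeadline = jobState.findDeadline(jobKey);

    // Rule 4: reject if the job is gone or its deadline changed after the checker
    // wrote the TIME_OUT command (rule 3).
    if (stateDeadline == null || stateDeadline != commandDeadline) {
      resultWriter.rejectTimeOut(jobKey, "job no longer has the expected deadline");
      return;
    }

    // Accepted: the TIMED_OUT event carries the job record from the state, not the command.
    // Rule 5 then happens in the event applier: the job becomes ACTIVATABLE again and,
    // per rule 2, its deadline entry is removed from the state.
    resultWriter.appendTimedOutEvent(jobKey, jobState.getJobRecord(jobKey));
  }

  interface JobStateView {
    Long findDeadline(long jobKey);

    Object getJobRecord(long jobKey);
  }

  interface ResultWriter {
    void rejectTimeOut(long jobKey, String reason);

    void appendTimedOutEvent(long jobKey, Object jobRecord);
  }
}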

Rule 4 is a bit open for interpretation. Instead of requiring an exact match between the state and command deadline, we could also process the command as long as the state deadline is not in the future. This is more lenient: jobs would time out sooner, but still not too soon. It might help with long processing queues, where a new deadline is set before the first TIME_OUT command is processed.
However, I don't think it is necessary for correctness, and it feels like cheating the stream processing rules a little bit. After all, before reaching the TIME_OUT command, the job could have transitioned to a (hypothetical!) state where it no longer times out automatically.
The risk of not being lenient and requiring an exact match between the command and state deadline is that, with a permanently long processing queue, the timeout mechanism would stop working if the job's deadline were updated again and again before the TIME_OUT command is processed. I don't see this as a big problem though; as far as I understand, it does not break any correctness guarantees.
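
For comparison, the lenient interpretation would only change the check in the sketch above (same illustrative names): accept the command as long as the deadline currently stored in the state has already passed.

  // Lenient variant of rule 4: accept the command whenever the job's current
  // deadline is no longer in the future at processing time.
  void processTimeOutCommandLenient(final long jobKey, final long now) {
    final Long stateDeadline = jobState.findDeadline(jobKey);
    if (stateDeadline == null || stateDeadline > now) {
      resultWriter.rejectTimeOut(jobKey, "job has no expired deadline");
      return;
    }
    resultWriter.appendTimedOutEvent(jobKey, jobState.getJobRecord(jobKey));
  }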

korthout commented Jun 2, 2023

Thanks for another amazing write-up @oleschoenburg! 👏 Let's fix this 🐞

👍 Regarding rule 4, I also lean towards the more lenient alternative.

  1. When processing a TIME_OUT command, verify that command and state deadline match; otherwise, reject the command. If they match, write a TIMED_OUT event with the job record from the state

As you've said: Although this is correct, it is already enough to check that the deadline has passed. IMO, this fits better semantically in the timeout processor:

  • The command is a request to time out the job, and the engine verifies whether the job's deadline has actually already passed before deciding to reject this command or accept it by writing a job TIMED_OUT event

💭 That not all data in the command is up to date with the current state is not so important. In fact, it might be okay to write an empty command (only the key and intent are actually relevant). Because of the asynchronous nature of stream processing, commands may always be written while the state differs from the state at processing time. If we rejected all commands containing out-of-date data, then we should also not allow a job worker to COMPLETE a job after it has timed out and was ACTIVATED for another worker. However, we say that when a job worker completes the job, it does not matter which one; the job was completed either way. We can think about job timeouts in the same way.

⚖️ But, to be fair, both forms of rule 4 work, and I'd be fine with either choice.

oleschoenburg added the kind/bug label on Jun 9, 2023
oleschoenburg added a commit that referenced this issue Aug 14, 2023
After an update from a previous version, both backoff and deadline
column families might contain entries without a corresponding job or
multiple entries for a single job. Before fixing #12797 and #13041,
these were cleaned up ad-hoc whenever they were found. This is no longer
the case because we now prevent the creation of duplicated entries and
always cleanup properly.
This adds two necessary migrations that remove orphaned entries that
were left by a previous version. The migrations run once and walk
through all deadline and backoff entries, removing those without a job
and duplicates which don't match the current job state.

zeebe-bors-camunda bot added a commit that referenced this issue Aug 15, 2023
13886: fix(engine): cleanup orphaned job timeouts and backoffs on migration r=koevskinikola a=oleschoenburg

closes #13881

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>