Don't mutate state through JobTimeoutTrigger
#12797
This is related to #6521. Also, job backoffs appear to have the same issues. See zeebe/engine/src/main/java/io/camunda/zeebe/engine/state/instance/DbJobState.java, lines 307 to 324 in 52e8d96.
When @korthout and I discussed this initially, we thought that the solution could be straightforward: Don't remove the deadline entry when the checker iterates through deadlines to find timed-out jobs; instead, remove them when applying the

Problem

However, I think right now it is not possible to do this correctly. The problem is that we don't have enough information to clean up deadlines reliably. See #6521 for reference; this is tricky to get right!

So let's think about when we add or remove deadlines. Adding a deadline is easy: Whenever the job is activated, a new deadline is computed and added to the column family. Removing a deadline needs to happen whenever the job transitions away from activated to a different state (completed, failed, etc.) or when the job is simply deleted.

The problem when removing a deadline is that we don't know which deadline to remove! The deadline from the record might be outdated, and there is no mapping that provides all deadlines for a given job.

Solution

I think there are two solutions:
The first option would require a new column family or inefficient scanning of the entire deadline column family. It would allow us to always have a one-to-one mapping where a job has exactly one deadline. The second option would require that we write

Proposal

So in conclusion, I'd lean towards the first option: introducing a new column family that can ensure that there is always exactly one deadline per job, which reliably gets cleaned up when the job state changes or the job is deleted. So in addition to our deadline column family which tracks

I'd like to get a second opinion on this again though. It's possible that I'm missing something or that a different solution would be better.
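To make the first option more concrete, here is a minimal in-memory sketch of how a second mapping could keep exactly one deadline per job. It is not Zeebe code: the class `JobDeadlineIndex`, its method names, and the choice of a job-key-to-deadline map as the additional "column family" are all assumptions made for illustration, and plain Java collections stand in for the real column families.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for the two column families. */
final class JobDeadlineIndex {
  // deadline -> job keys, sorted so a checker can scan the earliest deadlines first
  private final TreeMap<Long, TreeSet<Long>> jobsByDeadline = new TreeMap<>();
  // job key -> its single current deadline (the assumed additional column family)
  private final Map<Long, Long> deadlineByJob = new HashMap<>();

  /** On activation: replaces any previous deadline so a job never has two entries. */
  void setDeadline(long jobKey, long deadline) {
    removeDeadline(jobKey);
    deadlineByJob.put(jobKey, deadline);
    jobsByDeadline.computeIfAbsent(deadline, d -> new TreeSet<>()).add(jobKey);
  }

  /** On completion, failure, time-out, or deletion: the reverse lookup tells us which entry to drop. */
  void removeDeadline(long jobKey) {
    final Long previous = deadlineByJob.remove(jobKey);
    if (previous == null) {
      return; // the job had no deadline, nothing to clean up
    }
    final TreeSet<Long> jobs = jobsByDeadline.get(previous);
    if (jobs != null) {
      jobs.remove(jobKey);
      if (jobs.isEmpty()) {
        jobsByDeadline.remove(previous);
      }
    }
  }

  /** Read-only scan for jobs whose deadline lies strictly before {@code now}. */
  List<Long> timedOutJobs(long now) {
    final List<Long> result = new ArrayList<>();
    for (final TreeSet<Long> jobs : jobsByDeadline.headMap(now, false).values()) {
      result.addAll(jobs);
    }
    return result;
  }

  /** Total number of entries in the deadline mapping. */
  int deadlineEntryCount() {
    return jobsByDeadline.values().stream().mapToInt(TreeSet::size).sum();
  }
}
```

With this shape, re-activating a job with a new deadline leaves a single entry behind, and the scan for timed-out jobs never has to modify anything.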
The important invariant is that there is always a one-to-one mapping between deadlines and jobs. As long as we have that, it is clear when to remove or update a deadline, and we don't have to rely on the checker to clean up deadlines.

What I missed previously were two things that @korthout kindly cleared up for me:
Note the "should" in there. As of now, this is not the case for all processors, and some events are written with the record taken from the command, not the state. And so the

After we fix this bug, the entire job deadline behavior can be described with the following rules:
Rule 4 is a bit open for interpretation. Instead of requiring an exact match between state and command deadline, we could also process the command as long as the state deadline is not in the future. This is more lenient, and jobs would time out sooner but still not too soon. This might help with long processing queues though, where a new deadline is set before the first
Thanks for another amazing write-up @oleschoenburg! 👏 Let's fix this 🐞 👍 Regarding rule 4, I also lean towards the more lenient alternative.
As you've said: Although this is correct, it is already enough to check that the deadline has passed. IMO, this fits better semantically in the timeout processor:
💭 It is not so important that not all data in the command is up to date with the current state. In fact, it might be okay to write an empty command (just key and intent are actually relevant). Because stream processing is asynchronous, the state may always have changed between the time a command is written and the time it is processed. If we rejected all commands containing out-of-date data, then we should also not allow a job worker to COMPLETE a job after it has timed out and was ACTIVATED for another job worker. However, we say that when a job worker completes the job, it does not matter which one. The job was completed either way. We can think about job timeouts in the same way.

⚖️ But, to be fair, both forms of rule 4 work, and I'd be fine with either choice.
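The two forms of rule 4 discussed above can be compared side by side in a small sketch. The class `JobTimeoutCheck` and both method names are hypothetical and exist only for illustration; deadlines and the current time are assumed to be plain epoch-millisecond values.

```java
/** Hypothetical helper contrasting the strict and lenient readings of rule 4. */
final class JobTimeoutCheck {
  private JobTimeoutCheck() {}

  /** Strict form: the time-out command must carry exactly the deadline currently in state. */
  static boolean acceptStrict(long stateDeadline, long commandDeadline) {
    return stateDeadline == commandDeadline;
  }

  /** Lenient form: the command's deadline is ignored; the state deadline must not be in the future. */
  static boolean acceptLenient(long stateDeadline, long now) {
    return stateDeadline <= now;
  }
}
```

As an example, assume a job is activated with deadline 100, a time-out command is written for it, and the job is then re-activated with deadline 120 before that command is processed at time 150. The strict check rejects the command (120 does not equal 100), while the lenient check accepts it because the current deadline of 120 has already passed.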
After an update from a previous version, both backoff and deadline column families might contain entries without a corresponding job or multiple entries for a single job. Before fixing #12797 and #13041, these were cleaned up ad-hoc whenever they were found. This is no longer the case because we now prevent the creation of duplicated entries and always cleanup properly. This adds two necessary migrations that remove orphaned entries that were left by a previous version. The migrations run once and walk through all deadline and backoff entries, removing those without a job and duplicates which don't match the current job state. The fixes for #13041 and #13041 ensured that deadlines and backoffs are not removed ad-hoc whenever the job no longer exists. These two new migrations ensure that
13886: fix(engine): cleanup orphaned job timeouts and backoffs on migration r=koevskinikola a=oleschoenburg After an update from a previous version, both backoff and deadline column families might contain entries without a corresponding job or multiple entries for a single job. Before fixing #12797 and #13041, these were cleaned up ad-hoc whenever they were found. This is no longer the case because we now prevent the creation of duplicated entries and always cleanup properly. This adds two necessary migrations that remove orphaned entries that were left by a previous version. The migrations run once and walk through all deadline and backoff entries, removing those without a job and duplicates which don't match the current job state. closes #13881 Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
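The cleanup described in this commit message can be sketched roughly as follows for the deadline entries. This is not the actual migration: `OrphanedDeadlineMigration`, `DeadlineEntry`, and the `currentDeadlineByJob` map are hypothetical names, and ordinary collections stand in for the column families. The map is assumed to hold, for every existing job, the deadline recorded in its current state.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical one-off cleanup of deadline entries left behind by a previous version. */
final class OrphanedDeadlineMigration {
  private OrphanedDeadlineMigration() {}

  /** One entry of the deadline mapping: a deadline and the job key it points to. */
  static final class DeadlineEntry {
    final long deadline;
    final long jobKey;

    DeadlineEntry(long deadline, long jobKey) {
      this.deadline = deadline;
      this.jobKey = jobKey;
    }
  }

  /**
   * Walks through all deadline entries once and removes those without a job as well as
   * duplicates that don't match the job's current deadline. Returns the number of removed entries.
   */
  static int run(List<DeadlineEntry> deadlineEntries, Map<Long, Long> currentDeadlineByJob) {
    int removed = 0;
    final Iterator<DeadlineEntry> iterator = deadlineEntries.iterator();
    while (iterator.hasNext()) {
      final DeadlineEntry entry = iterator.next();
      final Long current = currentDeadlineByJob.get(entry.jobKey);
      final boolean orphaned = current == null; // the job no longer exists
      final boolean stale = current != null && current.longValue() != entry.deadline;
      if (orphaned || stale) {
        iterator.remove();
        removed++;
      }
    }
    return removed;
  }
}
```

The same walk would apply to backoff entries, with the job's current backoff taking the place of its deadline.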
After an update from a previous version, both backoff and deadline column families might contain entries without a corresponding job or multiple entries for a single job. Before fixing #12797 and #13041, these were cleaned up ad-hoc whenever they were found. This is no longer the case because we now prevent the creation of duplicated entries and always cleanup properly. This adds two necessary migrations that remove orphaned entries that were left by a previous version. The migrations run once and walk through all deadline and backoff entries, removing those without a job and duplicates which don't match the current job state. (cherry picked from commit 1d82e6e)
We need to make sure that JobTimeoutTrigger does not have the ability to mutate state.

JobTimeoutTrigger.scheduleDeactivateTimedOutJobsTask is able to mutate state while it should not be allowed to do so. The state may only be mutated by event appliers.

Once this is resolved, we could add an arch unit test to verify that only Event Appliers are using the mutable state.
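One way to picture the goal is to give the trigger only a read-only view of the state, so that mutation is impossible at compile time and only an applier holds the mutable type. The sketch below is an assumption-laden illustration, not the engine's code: `JobStateView`, `MutableJobStateSketch`, `InMemoryJobState`, `TimeoutTriggerSketch`, and `TimedOutApplierSketch` are all hypothetical names, and it is not the arch unit test mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical read-only view: everything a time-out trigger needs. */
interface JobStateView {
  List<Long> findTimedOutJobs(long now);
}

/** Hypothetical mutable view: meant to be handed to event appliers only. */
interface MutableJobStateSketch extends JobStateView {
  void setDeadline(long jobKey, long deadline);

  void removeDeadline(long jobKey);
}

/** Simple in-memory implementation, keyed by job key for deterministic iteration. */
final class InMemoryJobState implements MutableJobStateSketch {
  private final Map<Long, Long> deadlineByJob = new TreeMap<>();

  @Override
  public List<Long> findTimedOutJobs(long now) {
    final List<Long> timedOut = new ArrayList<>();
    for (final Map.Entry<Long, Long> entry : deadlineByJob.entrySet()) {
      if (entry.getValue() < now) {
        timedOut.add(entry.getKey());
      }
    }
    return timedOut;
  }

  @Override
  public void setDeadline(long jobKey, long deadline) {
    deadlineByJob.put(jobKey, deadline);
  }

  @Override
  public void removeDeadline(long jobKey) {
    deadlineByJob.remove(jobKey);
  }
}

/** The trigger only sees the read-only view, so it cannot remove deadlines while scanning. */
final class TimeoutTriggerSketch {
  private final JobStateView state;

  TimeoutTriggerSketch(JobStateView state) {
    this.state = state;
  }

  /** Returns the job keys for which a time-out command would be written. */
  List<Long> collectTimedOutJobs(long now) {
    return state.findTimedOutJobs(now);
  }
}

/** The applier is the only place that holds the mutable type. */
final class TimedOutApplierSketch {
  private final MutableJobStateSketch state;

  TimedOutApplierSketch(MutableJobStateSketch state) {
    this.state = state;
  }

  void applyTimedOut(long jobKey) {
    state.removeDeadline(jobKey);
  }
}
```

Under this split, scanning for timed-out jobs twice yields the same result, and the deadline entry only disappears once the applier has processed the resulting event.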
This is a blocker for: