-
Notifications
You must be signed in to change notification settings - Fork 558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(engine): cleanup orphaned job timeouts and backoffs on migration #13886
Conversation
After an update from a previous version, both backoff and deadline column families might contain entries without a corresponding job or multiple entries for a single job. Before fixing #12797 and #13041, these were cleaned up ad-hoc whenever they were found. This is no longer the case because we now prevent the creation of duplicated entries and always cleanup properly. This adds two necessary migrations that remove orphaned entries that were left by a previous version. The migrations run once and walk through all deadline and backoff entries, removing those without a job and duplicates which don't match the current job state.
final var jobKey = key.second().inner(); | ||
final var deadline = key.first().getValue(); | ||
final var job = jobsColumnFamily.get(jobKey); | ||
if (job == null || job.getRecord().getDeadline() != deadline) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ I'm not entirely sure if this is safe. Could there be a case where the deadline of this job is not in the deadline column family and we would end up deleting all deadlines for this job?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need this case as there could be multiple entries from the previous behavior and the new behavior only deletes one exact entry, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safest thing to do would be to completely clear the deadline column family and recreate it based on deadlines found for each job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK there isn't a case where an activated job wouldn't have an entry in the deadline column family. Every time a job is activated, its deadline and key are added to the deadline column family.
The safest thing to do would be to completely clear the deadline column family and recreate it based on deadlines found for each job.
It's safest, but it's probably more expensive since we would be going through all the available jobs instread of the activated jobs subset.
final var jobKey = key.second().inner(); | ||
final var backoff = key.first().getValue(); | ||
final var job = jobsColumnFamily.get(jobKey); | ||
if (job == null || job.getRecord().getRetryBackoff() != backoff) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ I'm not entirely sure if this is safe. Could there be a case where the backoff of this job is not in the backoff column family and we would end up deleting all backoffs for this job?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job @oleschoenburg 🚀
I left some naming-related comments only. Otherwise, the changes look good.
engine/src/main/java/io/camunda/zeebe/engine/state/mutable/MutableJobState.java
Outdated
Show resolved
Hide resolved
engine/src/main/java/io/camunda/zeebe/engine/state/migration/JobBackoffCleanup.java
Outdated
Show resolved
Hide resolved
engine/src/main/java/io/camunda/zeebe/engine/state/migration/JobTimeoutCleanup.java
Outdated
Show resolved
Hide resolved
final var jobKey = key.second().inner(); | ||
final var deadline = key.first().getValue(); | ||
final var job = jobsColumnFamily.get(jobKey); | ||
if (job == null || job.getRecord().getDeadline() != deadline) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK there isn't a case where an activated job wouldn't have an entry in the deadline column family. Every time a job is activated, its deadline and key are added to the deadline column family.
The safest thing to do would be to completely clear the deadline column family and recreate it based on deadlines found for each job.
It's safest, but it's probably more expensive since we would be going through all the available jobs instread of the activated jobs subset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
bors merge
FYI, the backport of this will most likely fail because of changes I made to migrations recently. Let me know if you need help resolving this. |
Seems like #13537 is popping up again. |
Build succeeded: |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.0
git worktree add -d .worktree/backport-13886-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-13886-to-stable/8.0
git checkout -b backport-13886-to-stable/8.0
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.1
git worktree add -d .worktree/backport-13886-to-stable/8.1 origin/stable/8.1
cd .worktree/backport-13886-to-stable/8.1
git checkout -b backport-13886-to-stable/8.1
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.2
git worktree add -d .worktree/backport-13886-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-13886-to-stable/8.2
git checkout -b backport-13886-to-stable/8.2
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea |
@remcowesterhoud I was hoping that #13780 would be backported too 😅 |
I can give it a go. It's gonna be a manual one though, because I don't want all the commits. |
/backport |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.0
git worktree add -d .worktree/backport-13886-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-13886-to-stable/8.0
git checkout -b backport-13886-to-stable/8.0
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.1
git worktree add -d .worktree/backport-13886-to-stable/8.1 origin/stable/8.1
cd .worktree/backport-13886-to-stable/8.1
git checkout -b backport-13886-to-stable/8.1
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.2
git worktree add -d .worktree/backport-13886-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-13886-to-stable/8.2
git checkout -b backport-13886-to-stable/8.2
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea |
13909: [Backport stable/8.2] fix(engine): cleanup orphaned job timeouts and backoffs on migration r=megglos a=oleschoenburg Manual backport of #13886 Minor merge conflicts because the list of migrations changed. Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
13908: [Backport stable/8.1] fix(engine): cleanup orphaned job timeouts and backoffs on migration r=megglos a=oleschoenburg Manual backport of #13886 Minor merge conflicts because the list of migrations changed. Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
13907: [Backport stable/8.0] fix(engine): cleanup orphaned job timeouts and backoffs on migration r=megglos a=oleschoenburg Manual backport of #13886 Minor merge conflicts because the list of migrations changed and because some classes were renamed. Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
After an update from a previous version, both backoff and deadline column families might contain entries without a corresponding job or multiple entries for a single job.
Before fixing #12797 and #13041, these were cleaned up ad-hoc whenever they were found. This is no longer the case because we now prevent the creation of duplicated entries and always cleanup properly.
This adds two necessary migrations that remove orphaned entries that were left by a previous version. The migrations run once and walk through all deadline and backoff entries, removing those without a job and duplicates which don't match the current job state.
closes #13881