Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(engine): cleanup orphaned job timeouts and backoffs on migration #13886

Merged
merged 3 commits into from
Aug 15, 2023

Conversation

lenaschoenburg
Copy link
Member

After an update from a previous version, both backoff and deadline column families might contain entries without a corresponding job or multiple entries for a single job.
Before fixing #12797 and #13041, these were cleaned up ad-hoc whenever they were found. This is no longer the case because we now prevent the creation of duplicated entries and always cleanup properly.

This adds two necessary migrations that remove orphaned entries that were left by a previous version. The migrations run once and walk through all deadline and backoff entries, removing those without a job and duplicates which don't match the current job state.

closes #13881

lenaschoenburg and others added 2 commits August 14, 2023 18:54
After an update from a previous version, both backoff and deadline
column families might contain entries without a corresponding job or
multiple entries for a single job. Before fixing #12797 and #13041,
these were cleaned up ad-hoc whenever they were found. This is no longer
the case because we now prevent the creation of duplicated entries and
always cleanup properly.
This adds two necessary migrations that remove orphaned entries that
were left by a previous version. The migrations run once and walk
through all deadline and backoff entries, removing those without a job
and duplicates which don't match the current job state.
@lenaschoenburg lenaschoenburg marked this pull request as ready for review August 15, 2023 06:45
@megglos megglos added backport stable/8.0 backport stable/8.2 Backport a pull request to 8.2.x labels Aug 15, 2023
final var jobKey = key.second().inner();
final var deadline = key.first().getValue();
final var job = jobsColumnFamily.get(jobKey);
if (job == null || job.getRecord().getDeadline() != deadline) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ I'm not entirely sure if this is safe. Could there be a case where the deadline of this job is not in the deadline column family and we would end up deleting all deadlines for this job?

Copy link
Contributor

@megglos megglos Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need this case as there could be multiple entries from the previous behavior and the new behavior only deletes one exact entry, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The safest thing to do would be to completely clear the deadline column family and recreate it based on deadlines found for each job.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK there isn't a case where an activated job wouldn't have an entry in the deadline column family. Every time a job is activated, its deadline and key are added to the deadline column family.

The safest thing to do would be to completely clear the deadline column family and recreate it based on deadlines found for each job.

It's safest, but it's probably more expensive since we would be going through all the available jobs instread of the activated jobs subset.

final var jobKey = key.second().inner();
final var backoff = key.first().getValue();
final var job = jobsColumnFamily.get(jobKey);
if (job == null || job.getRecord().getRetryBackoff() != backoff) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ I'm not entirely sure if this is safe. Could there be a case where the backoff of this job is not in the backoff column family and we would end up deleting all backoffs for this job?

Copy link
Member

@koevskinikola koevskinikola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job @oleschoenburg 🚀

I left some naming-related comments only. Otherwise, the changes look good.

final var jobKey = key.second().inner();
final var deadline = key.first().getValue();
final var job = jobsColumnFamily.get(jobKey);
if (job == null || job.getRecord().getDeadline() != deadline) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK there isn't a case where an activated job wouldn't have an entry in the deadline column family. Every time a job is activated, its deadline and key are added to the deadline column family.

The safest thing to do would be to completely clear the deadline column family and recreate it based on deadlines found for each job.

It's safest, but it's probably more expensive since we would be going through all the available jobs instread of the activated jobs subset.

Copy link
Member

@koevskinikola koevskinikola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

bors merge

@remcowesterhoud
Copy link
Contributor

FYI, the backport of this will most likely fail because of changes I made to migrations recently. Let me know if you need help resolving this.

@koevskinikola
Copy link
Member

Seems like #13537 is popping up again.

@zeebe-bors-camunda
Copy link
Contributor

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit fe0117e into main Aug 15, 2023
30 of 32 checks passed
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the os/fix-job-entry-leak branch August 15, 2023 08:48
@backport-action
Copy link
Collaborator

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-13886-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-13886-to-stable/8.0
git checkout -b backport-13886-to-stable/8.0
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea

@backport-action
Copy link
Collaborator

Backport failed for stable/8.1, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.1
git worktree add -d .worktree/backport-13886-to-stable/8.1 origin/stable/8.1
cd .worktree/backport-13886-to-stable/8.1
git checkout -b backport-13886-to-stable/8.1
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea

@backport-action
Copy link
Collaborator

Backport failed for stable/8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.2
git worktree add -d .worktree/backport-13886-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-13886-to-stable/8.2
git checkout -b backport-13886-to-stable/8.2
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea

@lenaschoenburg
Copy link
Member Author

@remcowesterhoud I was hoping that #13780 would be backported too 😅
The tracking would be really useful here because we want to run the migration only once and we don't have a good indicator on whether it needs to run or not. Do you think you could backport #13780?

@remcowesterhoud
Copy link
Contributor

remcowesterhoud commented Aug 15, 2023

I can give it a go. It's gonna be a manual one though, because I don't want all the commits.

@lenaschoenburg
Copy link
Member Author

/backport

@backport-action
Copy link
Collaborator

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-13886-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-13886-to-stable/8.0
git checkout -b backport-13886-to-stable/8.0
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea

@backport-action
Copy link
Collaborator

Backport failed for stable/8.1, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.1
git worktree add -d .worktree/backport-13886-to-stable/8.1 origin/stable/8.1
cd .worktree/backport-13886-to-stable/8.1
git checkout -b backport-13886-to-stable/8.1
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea

@backport-action
Copy link
Collaborator

Backport failed for stable/8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.2
git worktree add -d .worktree/backport-13886-to-stable/8.2 origin/stable/8.2
cd .worktree/backport-13886-to-stable/8.2
git checkout -b backport-13886-to-stable/8.2
ancref=$(git merge-base 416f5cd2bfc0aac1daeb85a31089204b79792db9 dcb132b8e7f1c967dc7df77ed183b05133273dea)
git cherry-pick -x $ancref..dcb132b8e7f1c967dc7df77ed183b05133273dea

zeebe-bors-camunda bot added a commit that referenced this pull request Aug 15, 2023
13909: [Backport stable/8.2] fix(engine): cleanup orphaned job timeouts and backoffs on migration r=megglos a=oleschoenburg

Manual backport of #13886 

Minor merge conflicts because the list of migrations changed.

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Aug 15, 2023
13908: [Backport stable/8.1] fix(engine): cleanup orphaned job timeouts and backoffs on migration r=megglos a=oleschoenburg

Manual backport of #13886 

Minor merge conflicts because the list of migrations changed.

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
zeebe-bors-camunda bot added a commit that referenced this pull request Aug 15, 2023
13907: [Backport stable/8.0] fix(engine): cleanup orphaned job timeouts and backoffs on migration r=megglos a=oleschoenburg

Manual backport of #13886 

Minor merge conflicts because the list of migrations changed and because some classes were renamed.

Co-authored-by: Ole Schönburg <ole.schoenburg@gmail.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport stable/8.2 Backport a pull request to 8.2.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Upgrading leaves deadline entries without jobs
5 participants