Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yield control if too many timers due #9249

Merged
merged 8 commits into from
May 2, 2022

Conversation

pihme
Copy link
Contributor

@pihme pihme commented Apr 28, 2022

Description

Adds a mechanism for the DueDateTimeChecker to yield control after some time. This is to stop it from iterating over an unknown number of due timer events and blocking execution while doing so.

Overall, this change should work well in cases where there is a huge backlog of timers. This backlog would then be reduced bit by bit.

The change is potentially bad for cases in which there is a constant and high load with many timers being created all the time. In this case, the change of this PR can lead to due timers continuously growing and the timers triggered will fall more and more behind real time.

Overall, this tradeoff was deemed advantageous. At least it removes that dangers that the iteration blocks the execution for so long that the node is marked as unhealthy. When this situation is reached there is currently no practical recovery possible.

Even before this point is reached, execution will be blocked for long stretches of time, and no progress can be made on that partition. So one faulty process can block all others from executing.

Both issues are addressed by this PR. With this PR it should be always possible to make some progress, albeit small. This would allow users to cancel or change any faulty process, or to reduce the load if needed.

Further work will be needed to figure out a way how to trigger timers without potentially falling further and further behind real time.

Review Hints

This PR has duplicate commits from #9237

Related issues

closes #9238

Definition of Done

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually to be done
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • New content is added to the release announcement
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

Copy link
Member

@saig0 saig0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pihme LGTM 👍 I like the changes 🚀

There are two notes from the Code scanning. Please have a look.

@pihme
Copy link
Contributor Author

pihme commented Apr 29, 2022

Pending work:

  • look at code scanning warnings
  • rebase and replace faulty first commit
  • add feature flag to enable/disable the change.

@pihme pihme marked this pull request as draft April 29, 2022 13:09
@pihme pihme force-pushed the 9238-yield-if-too-many-timers-due branch 2 times, most recently from 3dd5dca to 00b110b Compare May 2, 2022 09:59
@pihme pihme force-pushed the 9238-yield-if-too-many-timers-due branch from 00b110b to 81c1b3e Compare May 2, 2022 12:30
@pihme pihme marked this pull request as ready for review May 2, 2022 12:33
@pihme pihme requested a review from saig0 May 2, 2022 12:33
@pihme
Copy link
Contributor Author

pihme commented May 2, 2022

@saig0 Please have another look. The last commit is new.

Regarding the code check warnings: Yes, the inner classes could be static. But if I make them static then code formatting moves them to the top of the class, which in my mind reduced readability of the tests. Happy to change it if you want.

Copy link
Member

@saig0 saig0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pihme looks good 👍

Please see my comment about enabling the feature by default.

* Be careful with parameter order in constructor calls!
*/

// protected static final boolean FOO_DEFAULT = false;
//

private static final boolean YIELDING_DUE_DATE_CHECKER = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 Why do we not enable the yielding by default?

If enabled then it avoids that the partition is unhealthy. For me, it sounds more important that the partition is healthy than the timers are far behind.

If the feature is disabled then we don't get feedback if it is causing troubles.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@npepinpe This is one of the feature flags we want to test in dev/int and bring into production as soon as possible.

I agree with Philipp that setting it to on is probably the lesser of two evils. I wanted to err on the side of caution, though.
Also, my manual tests revealed that this change alone does not mitigate the issue that @Zelldon complained about.

I still think it's a step in the right direction, though.

@pihme
Copy link
Contributor Author

pihme commented May 2, 2022

bors merge

@zeebe-bors-camunda
Copy link
Contributor

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit 66a9fe8 into main May 2, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the 9238-yield-if-too-many-timers-due branch May 2, 2022 15:22
@github-actions
Copy link
Contributor

github-actions bot commented May 2, 2022

Backport failed for stable/1.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/1.3
git worktree add -d .worktree/backport-9249-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-9249-to-stable/1.3
git checkout -b backport-9249-to-stable/1.3
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b

@github-actions
Copy link
Contributor

github-actions bot commented May 2, 2022

Backport failed for stable/8.0, because it was unable to create a new branch.

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-9249-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-9249-to-stable/8.0
git checkout -b backport-9249-to-stable/8.0
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b

@pihme
Copy link
Contributor Author

pihme commented May 3, 2022

/backport

@github-actions
Copy link
Contributor

github-actions bot commented May 3, 2022

Backport failed for stable/1.3, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/1.3
git worktree add -d .worktree/backport-9249-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-9249-to-stable/1.3
git checkout -b backport-9249-to-stable/1.3
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b

@github-actions
Copy link
Contributor

github-actions bot commented May 3, 2022

Backport failed for stable/8.0, because it was unable to create a new branch.

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-9249-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-9249-to-stable/8.0
git checkout -b backport-9249-to-stable/8.0
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b

@pihme
Copy link
Contributor Author

pihme commented May 3, 2022

/backport

@github-actions
Copy link
Contributor

github-actions bot commented May 3, 2022

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-9249-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-9249-to-stable/8.0
git checkout -b backport-9249-to-stable/8.0
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b

zeebe-bors-camunda bot added a commit that referenced this pull request May 5, 2022
9288: [Backport stable/1.3] make due date checker yield control r=saig0 a=pihme

## Description

This commit is a squash of several commits on main. They were squashed because the codebase had diverted which let to various merge conflicts. Squashing all changes into one was easier than fixing each merge conflict for each commit individually.

## Related issues

backport of #9249

Co-authored-by: pihme <pihme@users.noreply.github.com>
zeebe-bors-camunda bot added a commit that referenced this pull request May 5, 2022
9289: [Backport stable/8.0] Let DueDateTimerChecker yield control r=saig0 a=pihme

backport of #9249

Co-authored-by: pihme <pihme@users.noreply.github.com>
Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DueDateTimeChecker will block progress if many timers are due
3 participants