-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Yield control if too many timers due #9249
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pihme LGTM 👍 I like the changes 🚀
There are two notes from the Code scanning. Please have a look.
Pending work:
|
will later become important in tests
3dd5dca
to
00b110b
Compare
00b110b
to
81c1b3e
Compare
@saig0 Please have another look. The last commit is new. Regarding the code check warnings: Yes, the inner classes could be static. But if I make them static then code formatting moves them to the top of the class, which in my mind reduced readability of the tests. Happy to change it if you want. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pihme looks good 👍
Please see my comment about enabling the feature by default.
* Be careful with parameter order in constructor calls! | ||
*/ | ||
|
||
// protected static final boolean FOO_DEFAULT = false; | ||
// | ||
|
||
private static final boolean YIELDING_DUE_DATE_CHECKER = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💭 Why do we not enable the yielding by default?
If enabled then it avoids that the partition is unhealthy. For me, it sounds more important that the partition is healthy than the timers are far behind.
If the feature is disabled then we don't get feedback if it is causing troubles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@npepinpe This is one of the feature flags we want to test in dev/int and bring into production as soon as possible.
I agree with Philipp that setting it to on is probably the lesser of two evils. I wanted to err on the side of caution, though.
Also, my manual tests revealed that this change alone does not mitigate the issue that @Zelldon complained about.
I still think it's a step in the right direction, though.
bors merge |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/1.3
git worktree add -d .worktree/backport-9249-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-9249-to-stable/1.3
git checkout -b backport-9249-to-stable/1.3
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.0
git worktree add -d .worktree/backport-9249-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-9249-to-stable/8.0
git checkout -b backport-9249-to-stable/8.0
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b |
/backport |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/1.3
git worktree add -d .worktree/backport-9249-to-stable/1.3 origin/stable/1.3
cd .worktree/backport-9249-to-stable/1.3
git checkout -b backport-9249-to-stable/1.3
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.0
git worktree add -d .worktree/backport-9249-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-9249-to-stable/8.0
git checkout -b backport-9249-to-stable/8.0
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b |
/backport |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.0
git worktree add -d .worktree/backport-9249-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-9249-to-stable/8.0
git checkout -b backport-9249-to-stable/8.0
ancref=$(git merge-base 9392721bdc57581437d6ab2d46df1f6f2251f4bd 81c1b3ec61e3310c0d4fe89578a0ccc87c85242b)
git cherry-pick -x $ancref..81c1b3ec61e3310c0d4fe89578a0ccc87c85242b |
9288: [Backport stable/1.3] make due date checker yield control r=saig0 a=pihme ## Description This commit is a squash of several commits on main. They were squashed because the codebase had diverted which let to various merge conflicts. Squashing all changes into one was easier than fixing each merge conflict for each commit individually. ## Related issues backport of #9249 Co-authored-by: pihme <pihme@users.noreply.github.com>
Description
Adds a mechanism for the
DueDateTimeChecker
to yield control after some time. This is to stop it from iterating over an unknown number of due timer events and blocking execution while doing so.Overall, this change should work well in cases where there is a huge backlog of timers. This backlog would then be reduced bit by bit.
The change is potentially bad for cases in which there is a constant and high load with many timers being created all the time. In this case, the change of this PR can lead to due timers continuously growing and the timers triggered will fall more and more behind real time.
Overall, this tradeoff was deemed advantageous. At least it removes that dangers that the iteration blocks the execution for so long that the node is marked as unhealthy. When this situation is reached there is currently no practical recovery possible.
Even before this point is reached, execution will be blocked for long stretches of time, and no progress can be made on that partition. So one faulty process can block all others from executing.
Both issues are addressed by this PR. With this PR it should be always possible to make some progress, albeit small. This would allow users to cancel or change any faulty process, or to reduce the load if needed.
Further work will be needed to figure out a way how to trigger timers without potentially falling further and further behind real time.
Review Hints
This PR has duplicate commits from #9237
Related issues
closes #9238
Definition of Done
Code changes:
backport stable/1.3
) to the PR, in case that fails you need to create backports manually.Testing:
Documentation:
Please refer to our review guidelines.