DueDateTimeChecker will block progress if many timers are due #9238
Labels
area/reliability
Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
kind/bug
Categorizes an issue or PR as a bug
scope/broker
Marks an issue or PR to appear in the broker section of the changelog
severity/high
Marks a bug as having a noticeable impact on the user with no known workaround
version:1.3.8
version:8.1.0-alpha2
version:8.1.0
Marks an issue as being completely or in parts released in 8.1.0
Comments
pihme
changed the title
DueDateTimeChecker will block progress if many due timers
DueDateTimeChecker will block progress if many timers are due
Apr 27, 2022
This was referenced Apr 27, 2022
npepinpe
added
area/reliability
Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
severity/high
Marks a bug as having a noticeable impact on the user with no known workaround
scope/broker
Marks an issue or PR to appear in the broker section of the changelog
team/process-automation
labels
Apr 27, 2022
zeebe-bors-camunda bot
added a commit
that referenced
this issue
Apr 29, 2022
9237: refactor(engine): prevent instant rescheduling r=pihme a=pihme

## Description

Before this change, the delay calculated to reschedule a task could be negative or close to 0. This led to the checker being rescheduled immediately, which is bad because it leaves no room for other tasks to run. With this change, a lower bound (floor) is applied to the delay when the task is rescheduled.

## Related issues

closes #9236
preparation for #9238

<!---
## Definition of Ready

* [X] I've reviewed my own code
* [X] I've written a clear changelist description
* [X] I've narrowly scoped my changes
* [X] I've separated structural from behavioural changes
-->

## Definition of Done

Code changes:
* [X] The changes are backwards compatible with previous versions
* [ ] If it fixes a bug then PRs are created to [backport](https://github.com/camunda/zeebe/compare/stable/0.24...main?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/1.3`) to the PR; if that fails, you need to create backports manually.

Testing:
* [ ] There are unit/integration tests that verify all acceptance criteria of the issue
* [ ] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The change has been verified by a QA run
* [ ] The impact of the changes is verified by a benchmark

Documentation:
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)
* [ ] If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our [review guidelines](https://github.com/camunda/zeebe/wiki/Pull-Requests-and-Code-Reviews#code-review-guidelines).
Co-authored-by: pihme <pihme@users.noreply.github.com>
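The floor described in the commit above can be illustrated with a minimal sketch. This is not the code from the PR: the class name `RescheduleDelay`, the method `untilNextRun`, and the 100 ms value of `MIN_DELAY` are all hypothetical, chosen only to show how clamping a computed delay avoids an immediate reschedule.

```java
import java.time.Duration;

final class RescheduleDelay {
  // Hypothetical floor; the value actually used in Zeebe is not stated in this issue.
  static final Duration MIN_DELAY = Duration.ofMillis(100);

  /**
   * Returns how long to wait before the checker runs again. If the next due date is already in
   * the past (negative delay) or very close (near-zero delay), the floor is returned instead, so
   * other tasks get a chance to run before the checker is scheduled again.
   */
  static Duration untilNextRun(final long nextDueDateMillis, final long nowMillis) {
    final Duration raw = Duration.ofMillis(nextDueDateMillis - nowMillis);
    return raw.compareTo(MIN_DELAY) < 0 ? MIN_DELAY : raw;
  }
}
```

With these assumed values, a timer that was due four seconds ago yields a delay of `MIN_DELAY` rather than a negative duration, while a timer due in five seconds keeps its five-second delay.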
This was referenced Apr 29, 2022
zeebe-bors-camunda bot
added a commit
that referenced
this issue
May 2, 2022
9249: Yield control if too many timers due r=pihme a=pihme

## Description

Adds a mechanism for the `DueDateTimeChecker` to yield control after some time. This stops it from iterating over an unknown number of due timer events and blocking execution while doing so.

Overall, this change should work well in cases where there is a huge backlog of timers. This backlog would then be reduced bit by bit.

The change is potentially bad for cases in which there is a constant and high load with many timers being created all the time. In this case, the change in this PR can lead to the number of due timers growing continuously, and the triggered timers will fall further and further behind real time.

Overall, this tradeoff was deemed advantageous. At the least, it removes the danger that the iteration blocks execution for so long that the node is marked as unhealthy. Once that situation is reached, there is currently no practical way to recover. Even before this point is reached, execution is blocked for long stretches of time, and no progress can be made on that partition. So one faulty process can block all others from executing.

Both issues are addressed by this PR. With this PR it should always be possible to make some progress, albeit small. This would allow users to cancel or change any faulty process, or to reduce the load if needed.

Further work will be needed to figure out how to trigger timers without potentially falling further and further behind real time.

## Review Hints

This PR has duplicate commits from #9237

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #9238

Co-authored-by: pihme <pihme@users.noreply.github.com>
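The yielding mechanism described in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `DueDateTimeChecker` code: the class `YieldingTimerSweep`, the method `sweep`, the queue of due dates sorted in ascending order, and the injected clock are all hypothetical.

```java
import java.util.List;
import java.util.Queue;
import java.util.function.LongSupplier;

final class YieldingTimerSweep {
  /**
   * Triggers due timers until none are left or the time budget is used up.
   *
   * @param sortedDueDates due dates in ascending order (assumption of this sketch)
   * @param clock source of the current time in milliseconds
   * @param budgetMillis how long one sweep may run before it yields
   * @param triggered collects the due dates of the timers triggered in this sweep
   * @return true if the sweep yielded early, so the caller should reschedule it
   */
  static boolean sweep(
      final Queue<Long> sortedDueDates,
      final LongSupplier clock,
      final long budgetMillis,
      final List<Long> triggered) {
    final long now = clock.getAsLong();
    final long deadline = now + budgetMillis;
    while (!sortedDueDates.isEmpty() && sortedDueDates.peek() <= now) {
      if (clock.getAsLong() >= deadline) {
        // Budget exhausted: stop here so other work on the partition can run.
        return true;
      }
      triggered.add(sortedDueDates.poll());
    }
    return false;
  }
}
```

A caller would reschedule the sweep whenever it returns `true`, so a large backlog is worked off in slices. As the commit message notes, if new timers become due faster than a slice can trigger them, the backlog in this scheme keeps growing.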
Zelldon
added
the
version:8.1.0
Marks an issue as being completely or in parts released in 8.1.0
label
Oct 4, 2022
Describe the bug
If there are many due timers to be triggered, `DueDateTimeChecker` will iterate over them. During this time, all progress is blocked for this partition.