-
Notifications
You must be signed in to change notification settings - Fork 571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unflake timer start event test #10867
Conversation
Pretty much all timer start event tests increase the engine time to test the timer scheduling. These tests can all suffer from unwanted race conditions becaue the timer is scheduled in a post-commit task. There is no clear way to await for this post-commit task to be executed, except for awaiting the engine to reach the end of the log. Since we generally only want to time travel in tests when there is something related to the timer scheduling, it makes sense to hide this in the `increaseTime` method of the EngineRule. Also note that this only matters when the streamprocessor is currently processing. If the engine is paused, the hasReachedEnd method cannot be called because the relevant actor is also paused. Co-authored-by: Remco Westerhoud <remco@westerhoud.nl> Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
If you want to increase the time, we're now using the streamprocessor directly. But in the shouldAvoidTriggeringMultipleTimes test case, the engine was stopped before increasing the time. When the engine is stopped, the stream processor is removed. However, this test just wants to verify wht happens if the scheduled timer triggers and the upcoming due date would also be in the past. That can be achieved as well by pausing the engine instead of stopping it completely. Note that when the engine is paused, we don't have to reset the recording exporter (on restart it would re-export all records again). Co-authored-by: Remco Westerhoud <remco@westerhoud.nl> Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl> Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
Now that the EngineRule.increaseTime can guarantee that timers are scheduled before time traveling, we no longer need Awaitility to repeatedly check that the timer triggered. Co-authored-by: Remco Westerhoud <remco@westerhoud.nl> Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
bors merge
10867: Unflake timer start event test r=remcowesterhoud a=korthout ## Description <!-- Please explain the changes you made here. --> Most cases in the timer start event test were suffering from flakiness. This stabilizes these tests again. The cause of the flakiness was a race condition between time traveling and scheduling an upcoming timer: - the time would be increased after the deployment is completed - that includes writing all the deployment events (up to and including fully_distributed) - it does not await post-commit task execution - upcoming timers are scheduled in a post-commit task - this post-commit task could be executed after the time traveling - in that case the timer does not trigger after the time travel, as it is scheduled in the future This changes the time travel to guarantee that the post-commit tasks have been executed. It does that by awaiting that the engine reaches the end of the log. ## Related issues <!-- Which issues are closed by this PR or are related --> closes #10272 Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Build failed: |
There was another timer related test that flaked. We should investigate it before merging this, to avoid this creating a bigger problem than it tries to resolve. |
I had a peek at the test. It is flaky because the test case was flawed to begin with. It relied upon the engine increasing its time before the timer was scheduled. |
engine/src/test/java/io/camunda/zeebe/engine/processing/timer/TimerCatchEventTest.java
Show resolved
Hide resolved
This test was flawed as it creates a timer with a duration of 100ms. Then it continues to verify it has been triggered after 1 second. This worked because the engine increased it's time. This happened before the timer was scheduled so in reality the timer would trigger after 1.1 seconds. With the change in the increase time method to wait until the engine is idle before increasing the time this broke, as the timer would now be scheduled for the 100ms mark as expected.
ab1232a
to
5c467e4
Compare
bors merge |
Build succeeded: |
Backport failed for Please cherry-pick the changes locally. git fetch origin stable/8.0
git worktree add -d .worktree/backport-10867-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-10867-to-stable/8.0
git checkout -b backport-10867-to-stable/8.0
ancref=$(git merge-base 858ee7460b931bd325305e6a730ec3c77b4a7e78 5c467e439fefd179cd86aa0ad8112f1a059a3305)
git cherry-pick -x $ancref..5c467e439fefd179cd86aa0ad8112f1a059a3305 |
Successfully created backport PR #10891 for |
10893: [Backport stable/8.0] Unflake timer start event test r=remcowesterhoud a=korthout # Description Backport of #10867 to `stable/8.0`. relates to #10272 --- Reviewer, I had to resolve a few conflicts: - some import changes - new tests in TimerStartEventTest were changed, but these tests don't exist in 8.0 Co-authored-by: Nico Korthout <nico.korthout@camunda.com> Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Description
Most cases in the timer start event test were suffering from flakiness. This stabilizes these tests again.
The cause of the flakiness was a race condition between time traveling and scheduling an upcoming timer:
This changes the time travel to guarantee that the post-commit tasks have been executed. It does that by awaiting that the engine reaches the end of the log.
Related issues
closes #10272
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
backport stable/1.3
) to the PR, in case that fails you need to create backports manually.Testing:
Documentation:
Please refer to our review guidelines.