Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unflake timer start event test #10867

Merged
merged 5 commits into from
Nov 2, 2022

Conversation

korthout
Copy link
Member

@korthout korthout commented Oct 31, 2022

Description

Most cases in the timer start event test were suffering from flakiness. This stabilizes these tests again.

The cause of the flakiness was a race condition between time traveling and scheduling an upcoming timer:

  • the time would be increased after the deployment is completed
  • that includes writing all the deployment events (up to and including fully_distributed)
  • it does not await post-commit task execution
  • upcoming timers are scheduled in a post-commit task
  • this post-commit task could be executed after the time traveling
  • in that case the timer does not trigger after the time travel, as it is scheduled in the future

This changes the time travel to guarantee that the post-commit tasks have been executed. It does that by awaiting that the engine reaches the end of the log.

Related issues

closes #10272

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Please refer to our review guidelines.

korthout and others added 4 commits October 31, 2022 15:53
Pretty much all timer start event tests increase the engine time to test
the timer scheduling. These tests can all suffer from unwanted race
conditions becaue the timer is scheduled in a post-commit task. There is
no clear way to await for this post-commit task to be executed, except
for awaiting the engine to reach the end of the log.

Since we generally only want to time travel in tests when there is
something related to the timer scheduling, it makes sense to hide this
in the `increaseTime` method of the EngineRule.

Also note that this only matters when the streamprocessor is currently
processing. If the engine is paused, the hasReachedEnd method cannot be
called because the relevant actor is also paused.

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
If you want to increase the time, we're now using the streamprocessor
directly. But in the shouldAvoidTriggeringMultipleTimes test case, the
engine was stopped before increasing the time. When the engine is
stopped, the stream processor is removed.

However, this test just wants to verify wht happens if the scheduled
timer triggers and the upcoming due date would also be in the past. That
can be achieved as well by pausing the engine instead of stopping it
completely.

Note that when the engine is paused, we don't have to reset the
recording exporter (on restart it would re-export all records again).

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
Now that the EngineRule.increaseTime can guarantee that timers are
scheduled before time traveling, we no longer need Awaitility to
repeatedly check that the timer triggered.

Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
Co-authored-by: Philipp Ossler <philipp.ossler@gmail.com>
@korthout korthout marked this pull request as ready for review October 31, 2022 15:29
@github-actions
Copy link
Contributor

github-actions bot commented Oct 31, 2022

Test Results

   951 files  ±    0     951 suites  ±0   1h 44m 17s ⏱️ +33s
7 599 tests  - 237  7 592 ✔️  - 237  7 💤 ±0  0 ±0 
7 797 runs   - 237  7 788 ✔️  - 237  9 💤 ±0  0 ±0 

Results for commit 5c467e4. ± Comparison against base commit 858ee74.

♻️ This comment has been updated with latest results.

Copy link
Member

@saig0 saig0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@korthout 👍

Copy link
Contributor

@remcowesterhoud remcowesterhoud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

bors merge

zeebe-bors-camunda bot added a commit that referenced this pull request Nov 1, 2022
10867: Unflake timer start event test r=remcowesterhoud a=korthout

## Description

<!-- Please explain the changes you made here. -->

Most cases in the timer start event test were suffering from flakiness. This stabilizes these tests again.

The cause of the flakiness was a race condition between time traveling and scheduling an upcoming timer:
 - the time would be increased after the deployment is completed
 - that includes writing all the deployment events (up to and including fully_distributed)
 - it does not await post-commit task execution
 - upcoming timers are scheduled in a post-commit task
 - this post-commit task could be executed after the time traveling
 - in that case the timer does not trigger after the time travel, as it is scheduled in the future

This changes the time travel to guarantee that the post-commit tasks have been executed. It does that by awaiting that the engine reaches the end of the log.

## Related issues

<!-- Which issues are closed by this PR or are related -->

closes #10272 



Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
@zeebe-bors-camunda
Copy link
Contributor

Build failed:

@korthout
Copy link
Member Author

korthout commented Nov 1, 2022

There was another timer related test that flaked. We should investigate it before merging this, to avoid this creating a bigger problem than it tries to resolve.

@remcowesterhoud
Copy link
Contributor

remcowesterhoud commented Nov 2, 2022

I had a peek at the test. It is flaky because the test case was flawed to begin with. It relied upon the engine increasing its time before the timer was scheduled.
I have pushed a new commit to fix it. @korthout I can't assign you as reviewer as you created the PR, but please have a look 👀

This test was flawed as it creates a timer with a duration of 100ms. Then it continues to verify it has been triggered after 1 second.

This worked because the engine increased it's time. This happened before the timer was scheduled so in reality the timer would trigger after 1.1 seconds. With the change in the increase time method to wait until the engine is idle before increasing the time this broke, as the timer would now be scheduled for the 100ms mark as expected.
@remcowesterhoud remcowesterhoud force-pushed the flaky-start-timer-event-test-10272 branch from ab1232a to 5c467e4 Compare November 2, 2022 13:07
@remcowesterhoud
Copy link
Contributor

bors merge

@zeebe-bors-camunda
Copy link
Contributor

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit f93a7e1 into main Nov 2, 2022
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the flaky-start-timer-event-test-10272 branch November 2, 2022 14:00
@backport-action
Copy link
Collaborator

Backport failed for stable/8.0, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin stable/8.0
git worktree add -d .worktree/backport-10867-to-stable/8.0 origin/stable/8.0
cd .worktree/backport-10867-to-stable/8.0
git checkout -b backport-10867-to-stable/8.0
ancref=$(git merge-base 858ee7460b931bd325305e6a730ec3c77b4a7e78 5c467e439fefd179cd86aa0ad8112f1a059a3305)
git cherry-pick -x $ancref..5c467e439fefd179cd86aa0ad8112f1a059a3305

@backport-action
Copy link
Collaborator

Successfully created backport PR #10891 for stable/8.1.

zeebe-bors-camunda bot added a commit that referenced this pull request Nov 2, 2022
10891: [Backport stable/8.1] Unflake timer start event test r=korthout a=backport-action

# Description
Backport of #10867 to `stable/8.1`.

relates to #10272

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
zeebe-bors-camunda bot added a commit that referenced this pull request Nov 2, 2022
10893: [Backport stable/8.0] Unflake timer start event test r=remcowesterhoud a=korthout

# Description
Backport of #10867 to `stable/8.0`.

relates to #10272

---

Reviewer, I had to resolve a few conflicts:
- some import changes
- new tests in TimerStartEventTest were changed, but these tests don't exist in 8.0

Co-authored-by: Nico Korthout <nico.korthout@camunda.com>
Co-authored-by: Remco Westerhoud <remco@westerhoud.nl>
@remcowesterhoud remcowesterhoud added the version:8.1.4 Marks an issue as being completely or in parts released in 8.1.4 label Nov 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
version:8.1.4 Marks an issue as being completely or in parts released in 8.1.4
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky TimerStartEventTest (multiple cases)
4 participants