
Re-retry failing flaky tests from CI pipeline #2367

Merged
NealSun96 merged 3 commits into apache:master from rahulrane50:ut_fix_STTimeOut
Feb 14, 2023

Conversation

@rahulrane50
Contributor

@rahulrane50 rahulrane50 commented Feb 3, 2023

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Fixes #2375

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Problem:
Currently there are many flaky tests (~10) that fail at least once in 10 runs. Analysis of the failures shows that many of them are caused by uncontrolled situations, such as a previous test's stale ZK client interrupting the run, or leftover callbacks. These failures can be avoided if the tests are re-run independently. Today, every contributor has to manually re-run the failing tests locally and present a passing run as proof.

Solution:
This fix proposes using the Maven Surefire plugin to re-run failing tests. The plugin re-runs each failing test independently within the same CI pipeline run.
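As a sketch of what this change typically looks like (the exact pom.xml edits live in the PR diff, not in this thread, and the property value shown here is only an assumption), Surefire's rerun behavior is driven by the `rerunFailingTestsCount` option:

```xml
<!-- Hypothetical sketch: enable Maven Surefire's rerun of failing tests.
     The actual plugin version and placement in Helix's pom.xml may differ. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Re-run each failing test up to 3 times before reporting it as failed -->
    <rerunFailingTestsCount>3</rerunFailingTestsCount>
  </configuration>
</plugin>
```

The same option can be passed on the command line, e.g. `mvn test -Dsurefire.rerunFailingTestsCount=3`. Note that support for this option depends on the test provider (JUnit vs. TestNG) and the Surefire version in use; tests that only pass on rerun are reported as flaky rather than silently passing, which helps keep the unstable-test signal visible.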

Tests

  • The following tests are written for this issue:

I verified this option on my repo 4 times, and all 4 CI runs were successful. While this doesn't guarantee that CI will never fail due to flaky tests, it gives us enough confidence to try this option in our CI pipelines.
(Screenshot: successful CI run, 2023-02-13)

Validated the YAML files.

@jiajunwang
Contributor

Thanks for working on improving the Helix tests. But overall I don't like this kind of change.

  1. We tried adding retries before, and it did not work well. In many cases, if a test fails, the retry keeps failing too.
  2. Retries should be added to individual test cases whenever we determine that a retry fits the test logic. Blindly adding retries to everything just hides problems. It would be much worse if the problem happened in production (where a retry won't help).
  3. Based on all the tests I have tried to stabilize, the main issue is most likely problematic testing logic, such as a missing signal after triggering an async operation, so the test checks conditions prematurely.

@rahulrane50
Contributor Author

rahulrane50 commented Feb 3, 2023

Thanks a lot @jiajunwang for the review and valuable feedback! I totally agree with your thought process of retrying tests only when it makes sense, but for now we have an unstable CI, and it's a pain for contributors to submit a PR, wait for results, and, if any tests fail, verify them locally (where they pass most of the time). This is an attempt to reduce that pain. I understand that it might not fully resolve the issue, but I can try :)

Hence I have not yet published this PR; first let me try this CI a couple of times, and if I have enough confidence that it's giving positive results, I can submit it for review :)
BTW, FYI, I have also picked up a few of the top failing UTs and am digging deeper to understand the fundamental cause of their failures, so I will continue that effort in parallel to this patch.
Let me know if that makes sense.

@jiajunwang
Contributor

> Thanks a lot @jiajunwang for the review and valuable feedback! I totally agree with your thought process of retrying tests only when it makes sense, but for now we have an unstable CI, and it's a pain for contributors to submit a PR, wait for results, and, if any tests fail, verify them locally (where they pass most of the time). This is an attempt to reduce that pain. I understand that it might not fully resolve the issue, but I can try :)
>
> Hence I have not yet published this PR; first let me try this CI a couple of times, and if I have enough confidence that it's giving positive results, I can submit it for review :) BTW, FYI, I have also picked up a few of the top failing UTs and am digging deeper to understand the fundamental cause of their failures, so I will continue that effort in parallel to this patch. Let me know if that makes sense.

I guess it makes sense: if a test fails today, people just retry it manually, so this helps avoid that part of the human toil. That said, one thing I believe is necessary is to count the retried tests as unstable tests as well. We don't want to lose track of the unstable-test signal, even as we make people's lives easier.

rahulrane50 and others added 2 commits February 10, 2023 16:50
The current approach to retrying failed tests is not working, and a few sources suggest passing this flag at the mvn clean stage as well. Trying that solution.
@rahulrane50 rahulrane50 changed the title Adding retry for flaky tests. Re-retry failing flaky tests from CI pipeline Feb 13, 2023
@rahulrane50 rahulrane50 marked this pull request as ready for review February 13, 2023 19:01
@rahulrane50
Contributor Author

@qqu0127 can I get your feedback/review on this please?

Contributor

@qqu0127 qqu0127 left a comment


One question, overall LGTM.

@rahulrane50
Contributor Author

Ready to merge, approved by @qqu0127!
Commit message:
Add rerun option for failing flaky tests from CI pipeline.

@NealSun96 NealSun96 merged commit b7c62b5 into apache:master Feb 14, 2023
rahulrane50 added a commit to rahulrane50/helix that referenced this pull request May 31, 2023
Add rerun option for failing flaky tests from CI pipeline.
rahulrane50 added a commit to rahulrane50/helix that referenced this pull request May 31, 2023
Add rerun option for failing flaky tests from CI pipeline.

Development

Successfully merging this pull request may close these issues.

Re-retry failing flaky tests in CI

4 participants