
Re-retry failing flaky tests from CI pipeline #2367

Merged
NealSun96 merged 3 commits into apache:master from rahulrane50:ut_fix_STTimeOut
Feb 14, 2023

Conversation

@rahulrane50
Contributor

@rahulrane50 rahulrane50 commented Feb 3, 2023

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Fixes #2375

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Problem:
Currently there are many flaky tests (~10) that fail at least once in 10 runs. Analysis of the failures shows that many of them are caused by uncontrolled situations, such as a previous test's stale ZK client interrupting the run, or leftover callbacks. These failures can be avoided if the tests are re-run independently. Today, every contributor has to manually re-run the failing tests locally and present a passing run as proof.

Solution:
This fix proposes using the Maven Surefire plugin to re-run failing tests. The plugin re-runs each failing test independently within the same CI pipeline run.
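As a sketch of what this change typically looks like (the exact pom.xml edits live in the PR diff, not in this thread, and the property value shown here is only an assumption), Surefire's rerun behavior is driven by the `rerunFailingTestsCount` option:

```xml
<!-- Hypothetical sketch: enable Maven Surefire's rerun of failing tests.
     The actual plugin version and placement in Helix's pom.xml may differ. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Re-run each failing test up to 3 times before reporting it as failed -->
    <rerunFailingTestsCount>3</rerunFailingTestsCount>
  </configuration>
</plugin>
```

The same option can be passed on the command line, e.g. `mvn test -Dsurefire.rerunFailingTestsCount=3`. Note that support for this option depends on the test provider (JUnit vs. TestNG) and the Surefire version in use; tests that only pass on rerun are reported as flaky rather than silently passing, which helps keep the unstable-test signal visible.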

Tests

  • The following tests are written for this issue:

I verified this option on my repo 4 times, and all 4 CI runs were successful. While this doesn't guarantee that CI will never fail due to flaky tests, it gives us enough confidence to try this option in our CI pipelines.
(Screenshot: successful CI run, 2023-02-13)

Validated the YAML files.

@jiajunwang
Contributor

Thanks for working on improving the Helix tests. But overall I don't like this kind of change.

  1. We tried adding retries before, and it did not work well. In many cases, if a test fails, the retry keeps failing too.
  2. Retries should be added to individual test cases whenever we determine that a retry fits the test logic. Blindly adding retries to everything just hides problems. It would be much worse if the problem happened in production (where a retry won't help).
  3. Based on all the tests I have tried to stabilize, the main issue is most likely problematic testing logic, such as a missing signal after triggering an async operation, so the test checks conditions prematurely.

@rahulrane50
Contributor Author

rahulrane50 commented Feb 3, 2023

Thanks a lot @jiajunwang for the review and valuable feedback! I totally agree with your thought process of retrying tests only when it makes sense, but for now we have an unstable CI, and it's a pain for contributors to submit a PR, wait for results, and, if any tests fail, verify them locally (where they pass most of the time). This is an attempt to reduce that pain. I understand that it might not fully resolve the issue, but I can try :)

Hence I have not yet published this PR; first let me try this CI a couple of times, and if I have enough confidence that it's giving positive results, I can submit it for review :)
BTW, FYI, I have also picked up a few of the top failing UTs and am digging deeper to understand the fundamental cause of their failures, so I will continue that effort in parallel to this patch.
Let me know if that makes sense.

@jiajunwang
Contributor

> Thanks a lot @jiajunwang for the review and valuable feedback! I totally agree with your thought process of retrying tests only when it makes sense, but for now we have an unstable CI, and it's a pain for contributors to submit a PR, wait for results, and, if any tests fail, verify them locally (where they pass most of the time). This is an attempt to reduce that pain. I understand that it might not fully resolve the issue, but I can try :)
>
> Hence I have not yet published this PR; first let me try this CI a couple of times, and if I have enough confidence that it's giving positive results, I can submit it for review :) BTW, FYI, I have also picked up a few of the top failing UTs and am digging deeper to understand the fundamental cause of their failures, so I will continue that effort in parallel to this patch. Let me know if that makes sense.

I guess it makes sense: if a test fails today, people just retry it manually, so this helps avoid that part of the human toil. That said, one thing I believe is necessary is to count the retried tests as unstable tests as well. We don't want to lose track of the unstable-test signal, even as we make people's lives easier.

rahulrane50 and others added 2 commits February 10, 2023 16:50
The current approach to retrying failed tests is not working, and a few sources suggest passing this flag at the mvn clean stage as well. Trying that solution.
@rahulrane50 rahulrane50 changed the title Adding retry for flaky tests. Re-retry failing flaky tests from CI pipeline Feb 13, 2023
@rahulrane50 rahulrane50 marked this pull request as ready for review February 13, 2023 19:01
@rahulrane50
Contributor Author

@qqu0127 can I get your feedback/review on this please?

Contributor

@qqu0127 qqu0127 left a comment


One question, overall LGTM.

@rahulrane50
Contributor Author

Ready to merge, approved by @qqu0127!
Commit message:
Add rerun option for failing flaky tests from CI pipeline.

@NealSun96 NealSun96 merged commit b7c62b5 into apache:master Feb 14, 2023
rahulrane50 added a commit to rahulrane50/helix that referenced this pull request May 31, 2023
Add rerun option for failing flaky tests from CI pipeline.
rahulrane50 added a commit to rahulrane50/helix that referenced this pull request May 31, 2023
Add rerun option for failing flaky tests from CI pipeline.

Development

Successfully merging this pull request may close these issues.

Re-retry failing flaky tests in CI

4 participants