Skip to content

Build: Add test-retry-gradle-plugin to retry flaky tests#3030

Closed
nastra wants to merge 1 commit intoapache:masterfrom
nastra:retry-on-test-flakiness
Closed

Build: Add test-retry-gradle-plugin to retry flaky tests#3030
nastra wants to merge 1 commit intoapache:masterfrom
nastra:retry-on-test-flakiness

Conversation

@nastra
Copy link
Contributor

@nastra nastra commented Aug 26, 2021

I was curious whether this would bring any value to the project as it seems we occasionally run into flaky tests. The latest flaky failure that happened in one of my PRs is reported in #3033 (which passes locally).

The test-retry-gradle-plugin causes failed tests to be retried within the same task. After executing all tests, any failed tests are retried. The process repeats with tests that continue to fail until the maximum specified number of retries has been attempted, or there are no more failing tests.

By default, all failed tests passing on retry prevents the test task from failing. This mode prevents flaky tests from causing build failure. This setting can be changed so that flaky tests cause build failure, which can be used to identify flaky tests.

When something goes badly wrong and all tests start failing, it can be preferable to not keep retrying tests. The plugin can be configured to stop retrying after a certain number of total test failures.

In case we don't want to retry everything in general, there's also the option to only retry a given set of tests, which can be specified via includeClasses / includeAnnotationClasses.

@github-actions github-actions bot added the build label Aug 26, 2021
@nastra nastra force-pushed the retry-on-test-flakiness branch from dcea03d to 202f75a Compare August 26, 2021 08:48
@nastra nastra requested a review from rdblue August 26, 2021 08:53
@pvary
Copy link
Contributor

pvary commented Aug 26, 2021

I would recommend against retrying flaky tests.
In Hive we spent some time on cleaning flaky tests several years ago, and in almost all of the cases it turned out to be a bug which only manifested in some edge cases.
IMHO we should fix, or disable the flaky tests and not waste resources on running them and creating a false security that the code is tested.

@rdblue
Copy link
Contributor

rdblue commented Aug 26, 2021

I strongly agree with @pvary. Flaky tests should be considered bugs in themselves.

@nastra
Copy link
Contributor Author

nastra commented Aug 27, 2021

I absolutely agree with you that flaky tests should be treated like bugs. The reason I brought up this retrying is that non-committers need to seek help from committers to restart CI when it fails due to potential flakiness in other parts that they haven't touched. Therefore it seemed reasonable to me to let CI run through and retry those flaky ones (and report them).

@nastra
Copy link
Contributor Author

nastra commented Aug 27, 2021

How do we want to treat test failures in the future? Do we want to open a ticket when somebody runs into a non-related failing test + disable the test in the meantime until it's fixed?

@pvary
Copy link
Contributor

pvary commented Aug 27, 2021

How do we want to treat test failures in the future? Do we want to open a ticket when somebody runs into a non-related failing test + disable the test in the meantime until it's fixed?

We have a smaller, more tighter knit community here, and we usually were able to discuss and fix the issues as they were surfaced. Usually if someone found an issue, then they tried to fix it, or wrote to the dev list / created an issue and we were able to fix those without a formal process.

In Hive we have much bigger codebase and bigger number of contributors, so we have a more formal process:

  • We have a jenkins job which runs a specific test case 100 times, and if any of the runs fail we call the test flaky - we use it to confirm a flaky tests
  • If a test is confirmed flaky, we disable it (even without test runs)
  • We create a jira "Reenable flaky test: ...." and try to ping users who were working on the tests lately

OTOH if we have a better codebase, like we have here in Iceberg, we might be better off identifying the commit which caused the flakiness and reverting it (or if the issue is not caused by the test, then fixing the bug).

@kbendick
Copy link
Contributor

How do we want to treat test failures in the future? Do we want to open a ticket when somebody runs into a non-related failing test + disable the test in the meantime until it's fixed?

Wanted to point out that I believe CI can be forced to rerun by closing and reopening the PR.

@rdblue
Copy link
Contributor

rdblue commented Aug 27, 2021

Yes, reopening a PR will retrigger tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants