[CI] Move flaky tests that fail very often to "quarantine" test group #10148

lhotari · 2021-04-06T08:32:13Z

Motivation

There are a few tests that fail very often. This is blocking the merging of PRs currently.

Modifications

Move the problematic tests to "quarantine" test group. This test group will be run, but the test failures will be ignored.

Configure "excludedGroups" property with "quarantine" default value
Make modifications so that tests in the quarantine group can be run. It is required to make changes to "BeforeMethod/BeforeClass" annotations so that the method will get run for the quarantine group. alwaysRun=true should be used instead of listing individual groups in the Before* annotations.
Add reporting for quarantined tests so that the PR reviewer can check if there's a sudden change in Quarantined test results
Test result will be visible directly in GitHub Actions UI, example.

Detailed quarantined test results:

Issue reports for quarantined tests

eolivelli

LGTM

@codelipenghui @aahmed-se can you please take a look ?

this patch will unblock CI

codelipenghui · 2021-04-06T08:56:36Z

@lhotari How do we get the test result of the "quarantine" test group?

lhotari · 2021-04-06T09:10:18Z

@lhotari How do we get the test result of the "quarantine" test group?

@codelipenghui There's no special reporting since we don't have a proper solution for reporting test reports.
A warning will get added to the build summary page if one of tests in the "quarantine" group fails.
This uses the GitHub Actions built-in feature. When there is ::warning:: as the prefix in console output, GitHub Actions adds a warning to the build summary page.

One would have to check the logs to see what quarantined tests failed. The reporting can be improved later. Currently the CI pipeline is almost completely clogged because a few flaky tests make most builds to fail.

codelipenghui · 2021-04-06T09:45:04Z

@lhotari How about adding a workflow to run the "quarantine" test group? By default, the new workflow is not required and the author and reviewers can get the error from the workflow, they can run locally first to make sure the failed test is not introduced by the new change.

lhotari · 2021-04-06T10:20:42Z

@lhotari How about adding a workflow to run the "quarantine" test group? By default, the new workflow is not required and the author and reviewers can get the error from the workflow, they can run locally first to make sure the failed test is not introduced by the new change.

That's possible, but it adds yet another workflow. Each new workflow requires more build resources.
I'll first focus on making this PR pass without adding a new workflow. Perhaps the new workflow could be added in a separate PR?

lhotari · 2021-04-06T10:52:44Z

There are still some problems with the way TestNG works. Passing "quarantine" in excludedGroups will prevent setup methods with annotation @BeforeMethod(groups = {"broker", "flaky", "quarantine"}) from running. The alternative seems to be to use @BeforeMethod(alwaysRun = true) or use some other group in BeforeMethod ("setup"?)

codelipenghui · 2021-04-06T12:06:41Z

That's possible, but it adds yet another workflow. Each new workflow requires more build resources.
I'll first focus on making this PR pass without adding a new workflow. Perhaps the new workflow could be added in a separate PR?

I'm worried if we can't easily see the failed test, the PR will get merged, but the PR might introduce more failed tests (the change breaks some tests under the quarantine group).

lhotari · 2021-04-06T12:16:36Z

I'm worried if we can't easily see the failed test, the PR will get merged, but the PR might introduce more failed tests (the change breaks some tests under the quarantine group).

That's a valid concern. It's something that could be addressed after we have the initial solution in place.

lhotari · 2021-04-06T12:30:00Z

There are still some problems with the way TestNG works. Passing "quarantine" in excludedGroups will prevent setup methods with annotation @BeforeMethod(groups = {"broker", "flaky", "quarantine"}) from running. The alternative seems to be to use @BeforeMethod(alwaysRun = true) or use some other group in BeforeMethod ("setup"?)

I resolved this issue by using @BeforeMethod(alwaysRun = true). That seems to be the proper solution in this case.

eolivelli · 2021-04-06T12:34:10Z

@codelipenghui the problem is that currently those tests are very flaky and we are ignoring them anyway.
Having a test that is failing and if it fails we simply ignore it it is useless.
In order to not lose them I suggest to create "blocker" issues for the next release and to fix them one at a time.

If we have a panel that tells that a test failed with a disclaimer "do not worry about this test", then what is the meaning of that test ? it is simply a waste of resources to run it and more noise for the reviewers and for the contributors.

codelipenghui · 2021-04-06T12:47:27Z

the problem is that currently those tests are very flaky and we are ignoring them anyway.
Having a test that is failing and if it fails we simply ignore it it is useless.
In order to not lose them I suggest to create "blocker" issues for the next release and to fix them one at a time.

If we have a panel that tells that a test failed with a disclaimer "do not worry about this test", then what is the meaning of that test ? it is simply a waste of resources to run it and more noise for the reviewers and for the contributors.

The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the flaky. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.

And if the test is very flaky, my point is we need to fix it. If it has not failed frequently before, but frequently now, this is most likely caused by some concurrent merge. A similar situation happened before, most of them are caused by the concurrent merge.

lhotari · 2021-04-06T13:10:00Z

The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the flaky. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.

I agree with you. There's just a special urgency now all PRs are blocked because of the very high flakiness of a few tests.

And if the test is very flaky, my point is we need to fix it. If it has not failed frequently before, but frequently now, this is most likely caused by some concurrent merge. A similar situation happened before, most of them are caused by the concurrent merge.

Yes, the tests have to be fixed as top priority. However we need a way to unblock the CI. That is by moving the most flaky tests to a quarantine test group.

The tests have been very flaky for a longer period of time. One fairly recent change in GitHub Actions has been the change in ubuntu-latest image. It now points to ubuntu-20.04 instead of ubuntu-18.04. Another change that has increased the flakiness is simply the regrouping of tests. Since some tests are very fragile and flaky, a small change in the environment, such as regrouping the tests, can increase the flakiness. These 2 major changes seems to be the reason for the sudden increase in flakiness. However, the tests have been flaky even before the changes, so it's the tests that should be fixed instead of rolling back to the old environmental setup (test grouping and ubuntu-18.04 image).

lhotari · 2021-04-06T16:00:19Z

This seems to be pretty hard to get right. TestNG has this particular detail: https://testng.org/doc/documentation-main.html#partial-groups
If the test group is defined at the test class level, the group gets assigned to all test methods. It's not possible to override the group of a single test method, it will just add additional groups to the method. This is one reason why the current solution doesn't work. Tests get executed as part of the "quarantine" group in addition.
I found some answer on Stackoverflow where adding a custom test listener can solve this problem.

lhotari · 2021-04-06T16:05:30Z

Another problem seems to be the slowness of test execution, this is an issue at least when using -DtestReuseFork=false. Each test class takes remarkable time to process even when there aren't any tests for a a specific test group in the class.
This is one of the worst examples:

[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.102 s - in org.apache.pulsar.common.naming.NamespaceBundleTest

Usually the reported times are <0.5 s, but it's possible that it doesn't add the JVM startup time to the duration.

aahmed-se · 2021-04-06T20:17:46Z

This seems wrong, I already added this functionality, there is group "extra" available to segregate tests, we can just change the one we wish to exclude to that , they will not be run among the broker tests. We don't need all these changes.

lhotari · 2021-04-07T05:38:51Z

This seems wrong, I already added this functionality, there is group "extra" available to segregate tests, we can just change the one we wish to exclude to that , they will not be run among the broker tests. We don't need all these changes.

It might be wrong currently. I hope we get the changes eventually "right" together.

The "extra" group didn't really exist. It was only mentioned in the group parameter of a few @BeforeMethod annotations.
This isn't a sufficient solution. I learned by experimenting, that to properly exclude a group in TestNG, the group has to be listed in the pom.xml's default value for excludedGroups. This will resolve the issue with TestNG where a single test method is added to another group:

for example:

@Test(groups = "flaky")
public class DeadLetterTopicTest extends ProducerConsumerBase {
...
    @Test
    public void testDeadLetterTopic() throws Exception {
    ....
    }
...
    @Test(groups = "quarantine")
    public void testDeadLetterTopicByCustomTopicName() throws Exception {
    ...
    }
...
...
}

In the above example, testDeadLetterTopicByCustomTopicName test case will belong to groups flaky and quarantine at runtime (explanation in TestNG documentation). If there's no excludedGroups setting, this method will also get run for the flaky group. This problem is resolved by adding quarantine to the excludedGroups.

Since it's necessary to add the quarantine group to the excludedGroups, that is the reason why it doesn't work as a solution to list all possible groups in @BeforeMethod annotations.

For example, when excludedGroups contains quarantine, this BeforeMethod doesn't get run at all, for any group:

    @BeforeMethod(groups = { "broker", "websocket", "broker-api", "broker-discovery", "broker-impl", "extra", "flaky", "quarantine" })
     public void beforeMethod(Method m) throws Exception {
         methodName = m.getName();
     }

This problem is resolved by using

    @BeforeMethod(alwaysRun = true)
     public void beforeMethod(Method m) throws Exception {
         methodName = m.getName();
     }

I hope this explanation clarifies why this PR contains necessary changes if we want to add a new TestNG group for quarantined tests. If our goal is something else, this PR isn't necessary at all. In that case, we could simply use TestNG's org.testng.annotations.Ignore annotation or @Test(enabled=false).

aahmed-se · 2021-04-07T05:49:09Z

We don't need the "excludedGroups" we already have group separation of broker tests running in gitlab workflows. They do no conflict with each other. I have used the extra group successfully in our internal branch to exclude tests from all runs, it does not require any changes in pom.xml. In regards to the "BeforeMethod" more most of the them have extra in already in place if it's missing we can add them.

aahmed-se · 2021-04-07T06:08:40Z

This pr shows the general direction that should be taken
#10158

lhotari · 2021-04-07T07:41:20Z

This pr shows the general direction that should be taken
#10158

@aahmed-se Please elaborate since there is no explanation.

lhotari · 2021-04-07T07:45:17Z

We do need the "excludedGroups" we already have group separation of broker tests running in gitlab workflows. They do no conflict with each other. I have used the extra group successfully in our internal branch to exclude tests from all runs, it does not require any changes in pom.xml. In regards to the "BeforeMethod" more most of the them extra in already in place if it's missing we can add them.

When mixing class level @Test(group="...") definitions and method level @Test(group="...") definitions, it's necessary. My previous comment explains it. Perhaps you didn't have that type of use case?

aahmed-se · 2021-04-07T18:58:33Z

Here is the updated pr #10158
I have isolated all the unstable tests with extra group annotation, also there is template jos to run those tests on nightly cadence. This is temporary but suitable solution to unblock ci issues.

When mixing class level @Test(group="...") definitions and method level @Test(group="...") definitions, it's necessary. My previous comment explains it. Perhaps you didn't have that type of use case?

I don't have that issue , we shouldn't be filtering things at the test method level, grouping should stay at the class level, mixing both is not a good idea.

lhotari · 2021-04-07T19:40:08Z

The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the flaky. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.

@codelipenghui I have added test reporting for quarantined tests in this PR. The PR description contains examples.

lhotari · 2021-04-07T19:44:58Z

I don't have that issue , we shouldn't be filtering things at the test method level, grouping should stay at the class level, mixing both is not a good idea.

@aahmed-se It seems to be pretty useful to be able to quarantine just a single test method instead of moving all tests of the test class to the quarantine. Why would it be a bad idea?

aahmed-se · 2021-04-07T19:53:16Z

It will create confusion, user context in java is at a class level, this is enforced at file per class convention. Saying there are two executions of test one under a standard context one under a quarantine context will only confuse individuals.

To run them you will need to create boolean set evaluation rules to isolate methods, testng does not have a clean abstraction to do so.

lhotari · 2021-04-07T20:11:47Z

It will create confusion, user context in java is at a class level, this is enforced at file per class convention. Saying there are two executions of test one under a standard context one under a quarantine context will only confuse individuals.

Things seem to work fine with the changes in this PR. Can you provide an example of a possible confusion?

To run them you will need to create boolean set evaluation rules to isolate methods, testng does not have a clean abstraction to do so.

The only requirement I found was to use alwaysRun = true for @Before* methods instead of listing groups. Besides that, the solution passes quarantine in excludedGroups by default and when running quarantine tests, it passes quarantine in groups and an empty string to excludedGroups. Isn't this fine?

aahmed-se · 2021-04-07T20:52:34Z

The test failures are github specific issues, we don't want to exclude things be default when devs are running things locally or in a separate ci env.

lhotari · 2021-04-11T16:54:53Z

This PR is now ready for final review. Please review again @eolivelli @codelipenghui @merlimat

eolivelli

Thank you very much for this great work

I did another pass and the patch looks good to me, I left one question, not a blocker

eolivelli · 2021-04-11T20:01:00Z

buildtools/pom.xml

@@ -26,6 +26,7 @@
    <groupId>org.apache</groupId>
    <artifactId>apache</artifactId>
    <version>23</version>
+    <relativePath></relativePath>


There has been a similar problem (maven warning) as explained in this question: https://stackoverflow.com/questions/6003831/parent-relativepath-points-at-my-com-mycompanymyproject-instead-of-org-apache

lhotari · 2021-04-12T06:40:31Z

I'm thinking of splitting this PR into 15 or 16 PRs since it contains a lot of unrelated changes. They have a common goal to improve test stability, but the changes are unrelated technically. It will lead to better commit history although the merging of the PRs will be some overhead comparing to the way of merging this large PR at once.

eolivelli · 2021-04-12T06:48:05Z

@lhotari good idea to split the patch

you can set status to "draft" on this patch in order to keep the comments

lhotari · 2021-04-12T09:20:40Z

I have removed all changes that aren't related to moving the most flaky tests to the "quarantine" test group. I'll create separate PRs for fixing resource leaks (memory leaks, thread leaks) and for other improvements to tests stability.
@eolivelli @codelipenghui please review this PR again, hopefully we could get it merged since the scope is reduced a lot.

These are the PRs that have been split out of this PR:

[Tests] Enable test retries for integration tests, shade tests and backwards compat tests #10191
[Tests] Unregister JettyStatisticsCollector to fix memory leak in Broker shutdown #10192
[Tests] Fix Prometheus metrics parsing for "+Inf" in PrometheusMetricsTest and Pulsar Function tests #10193
[Build] Fix warning about parent.relativePath in buildtools/pom.xml #10194
[Tests] Add TestNG listeners for resource cleanup, thread leak detection and "fail fast" #10195
[Broker] Close namespace clients when PulsarService is closed #10196
[Tests] Cleanup PulsarClient instances #10197
Fix some ExecutorService leaks #10198
Shutdown Broker gracefully, but forcefully after brokerShutdownTimeoutMs #10199
I kept back a few remaining changes until these PRs have been processed.

- tests are executed, but the failures are ignored for this group

…estBase - makes retries work

…rTest - SequenceIdWithErrorTest was using the wrong base class

- broker url parameters must be lazily evaluated, otherwise the retry attempt will get the old url and not the new url of the new broker

- it helps investigate issues when the JVM hangs

codelipenghui · 2021-04-13T03:22:26Z

Sorry @lhotari, I want to check the test state of the tests under the quarantine group. But I'm not sure how to check them, Could you please point me how can I find the failed tests in the quarantine group for this PR?

lhotari · 2021-04-13T06:26:18Z

/pulsarbot run-failure-checks

lhotari · 2021-04-13T07:12:01Z

Sorry @lhotari, I want to check the test state of the tests under the quarantine group. But I'm not sure how to check them, Could you please point me how can I find the failed tests in the quarantine group for this PR?

@codelipenghui The quarantined test reporting isn't fully working, but it's possible to navigate to the logs. Here are the quarantined test failures as part of "CI - Unit - Brokers - Other / unit-tests" check: https://github.com/apache/pulsar/runs/2327040599?check_suite_focus=true#step:8:133

I know that it would be easier to see the results if it was a separate build job. That would add more overhead to our builds and I'd like to avoid that since the resource consumption is already really high for our builds.

It seems that the intended reporting solution broke when I moved ReplicatorTests to run separately from other Quarantined tests. There are a few reasons why I had to do this, mainly because there are so many resource leakages that are fixed in PRs #10192 , #10195 , #10196 , #10197, #10198 and #10199 . I'd like to propose that we get these PRs merged and I can improve the quarantined test reporting after that. @codelipenghui Are you fine with that?

It should always be a blocker issue to fix quarantined tests. Therefore the need for good reporting might not be so relevant if fixing quarantined tests is taken seriously.

Some of the problems with quarantined tests will get resolved after the PRs to fix resource leakages have been merged. For example, the stability of ReplicatorTests will improve significantly and it should be possible to move the test out of the quarantine group. There are 1 or 2 flaky test methods which need slight fixes, but the root cause has been the resource leaks and asynchronous shutdown of broker instances in tests (fixed by #10199).

codelipenghui · 2021-04-13T09:42:40Z

@lhotari Sounds good to me. We can improve the quarantined test reporting later. It's useful for the reviewers to check if the new change break some tests which is under the quarantined group.

eolivelli approved these changes Apr 6, 2021

View reviewed changes

merlimat approved these changes Apr 6, 2021

View reviewed changes

lhotari force-pushed the lh-quarantine-tests branch from 4b80fa5 to bab356b Compare April 7, 2021 16:53

lhotari changed the title ~~[CI] Move flaky tests that fail very often to "quarantine" test group~~ [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability Apr 7, 2021

eolivelli approved these changes Apr 7, 2021

View reviewed changes

lhotari force-pushed the lh-quarantine-tests branch 2 times, most recently from 8f833b1 to 0f6dbee Compare April 11, 2021 15:32

eolivelli approved these changes Apr 11, 2021

View reviewed changes

lhotari marked this pull request as draft April 12, 2021 07:02

lhotari force-pushed the lh-quarantine-tests branch from 0f6dbee to a2d7656 Compare April 12, 2021 09:14

lhotari changed the title ~~[CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability~~ [CI] Move flaky tests that fail very often to "quarantine" test group Apr 12, 2021

lhotari marked this pull request as ready for review April 12, 2021 09:17

lhotari added 6 commits April 12, 2021 15:13

Move really flaky tests to "quarantine" test group

db0b9f0

- tests are executed, but the failures are ignored for this group

Add TestRetrySupport to TransactionTestBase and TransactionMetaStoreT…

d57be18

…estBase - makes retries work

Add TestRetrySupport for BkEnsemblesTestBase & fix SequenceIdWithErro…

c2265a0

…rTest - SequenceIdWithErrorTest was using the wrong base class

Fix retry support for ClientGetSchemaTest

d1c4f06

- broker url parameters must be lazily evaluated, otherwise the retry attempt will get the old url and not the new url of the new broker

Print JVM information if job gets cancelled (because of timeout)

86fd432

- it helps investigate issues when the JVM hangs

Fix cleanup in BacklogQuotaManagerTest

3a838f1

lhotari force-pushed the lh-quarantine-tests branch from a2d7656 to 3a838f1 Compare April 12, 2021 12:14

codelipenghui approved these changes Apr 13, 2021

View reviewed changes

codelipenghui assigned lhotari Apr 13, 2021

codelipenghui added this to the 2.8.0 milestone Apr 13, 2021

codelipenghui added the area/test label Apr 13, 2021

codelipenghui merged commit a7933ac into apache:master Apr 13, 2021

equanz mentioned this pull request Jan 6, 2022

Rename test file name from *Test2.java to *Test.java to run all tests correctly #13644

Merged

2 tasks

equanz mentioned this pull request Aug 10, 2022

[branch-2.9] Rename test file name from *Test2.java to *Test.java to run all tests correctly #17048

Merged

2 tasks

equanz mentioned this pull request Nov 7, 2023

[fix][test] Fix that some test cases are not running as expected on CI #21532

Merged

15 tasks

[CI] Move flaky tests that fail very often to "quarantine" test group #10148

[CI] Move flaky tests that fail very often to "quarantine" test group #10148

Conversation

lhotari commented Apr 6, 2021 • edited Loading

Motivation

Modifications

Issue reports for quarantined tests

eolivelli left a comment

Choose a reason for hiding this comment

codelipenghui commented Apr 6, 2021

lhotari commented Apr 6, 2021 • edited Loading

codelipenghui commented Apr 6, 2021

lhotari commented Apr 6, 2021

lhotari commented Apr 6, 2021

codelipenghui commented Apr 6, 2021

lhotari commented Apr 6, 2021

lhotari commented Apr 6, 2021

eolivelli commented Apr 6, 2021

codelipenghui commented Apr 6, 2021 • edited Loading

lhotari commented Apr 6, 2021

lhotari commented Apr 6, 2021

lhotari commented Apr 6, 2021

aahmed-se commented Apr 6, 2021

lhotari commented Apr 7, 2021

aahmed-se commented Apr 7, 2021 • edited Loading

aahmed-se commented Apr 7, 2021

lhotari commented Apr 7, 2021

lhotari commented Apr 7, 2021 • edited Loading

aahmed-se commented Apr 7, 2021

lhotari commented Apr 7, 2021 • edited Loading

lhotari commented Apr 7, 2021

aahmed-se commented Apr 7, 2021

lhotari commented Apr 7, 2021

aahmed-se commented Apr 7, 2021

lhotari commented Apr 11, 2021 • edited Loading

eolivelli left a comment

Choose a reason for hiding this comment

eolivelli Apr 11, 2021

Choose a reason for hiding this comment

lhotari Apr 12, 2021

Choose a reason for hiding this comment

eolivelli Apr 12, 2021

Choose a reason for hiding this comment

lhotari commented Apr 12, 2021

eolivelli commented Apr 12, 2021

lhotari commented Apr 12, 2021 • edited Loading

codelipenghui commented Apr 13, 2021

lhotari commented Apr 13, 2021

lhotari commented Apr 13, 2021

codelipenghui commented Apr 13, 2021

lhotari commented Apr 6, 2021 •

edited

Loading

lhotari commented Apr 6, 2021 •

edited

Loading

codelipenghui commented Apr 6, 2021 •

edited

Loading

aahmed-se commented Apr 7, 2021 •

edited

Loading

lhotari commented Apr 7, 2021 •

edited

Loading

lhotari commented Apr 7, 2021 •

edited

Loading

lhotari commented Apr 11, 2021 •

edited

Loading

lhotari commented Apr 12, 2021 •

edited

Loading