Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Move flaky tests that fail very often to "quarantine" test group #10148

Merged
merged 6 commits into from
Apr 13, 2021

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Apr 6, 2021

Motivation

There are a few tests that fail very often. This is blocking the merging of PRs currently.

Modifications

Move the problematic tests to "quarantine" test group. This test group will be run, but the test failures will be ignored.

  • Configure "excludedGroups" property with "quarantine" default value
  • Make modifications so that tests in the quarantine group can be run. It is required to make changes to "BeforeMethod/BeforeClass" annotations so that the method will get run for the quarantine group. alwaysRun=true should be used instead of listing individual groups in the Before* annotations.
  • Add reporting for quarantined tests so that the PR reviewer can check if there's a sudden change in Quarantined test results
    Test result will be visible directly in GitHub Actions UI, example.
    image
    Detailed quarantined test results:
    image

Issue reports for quarantined tests

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codelipenghui @aahmed-se can you please take a look ?

this patch will unblock CI

@codelipenghui
Copy link
Contributor

@lhotari How do we get the test result of the "quarantine" test group?

@lhotari
Copy link
Member Author

lhotari commented Apr 6, 2021

@lhotari How do we get the test result of the "quarantine" test group?

@codelipenghui There's no special reporting since we don't have a proper solution for reporting test reports.
A warning will get added to the build summary page if one of tests in the "quarantine" group fails.
This uses the GitHub Actions built-in feature. When there is ::warning:: as the prefix in console output, GitHub Actions adds a warning to the build summary page.

One would have to check the logs to see what quarantined tests failed. The reporting can be improved later. Currently the CI pipeline is almost completely clogged because a few flaky tests make most builds to fail.

@codelipenghui
Copy link
Contributor

@lhotari How about adding a workflow to run the "quarantine" test group? By default, the new workflow is not required and the author and reviewers can get the error from the workflow, they can run locally first to make sure the failed test is not introduced by the new change.

@lhotari
Copy link
Member Author

lhotari commented Apr 6, 2021

@lhotari How about adding a workflow to run the "quarantine" test group? By default, the new workflow is not required and the author and reviewers can get the error from the workflow, they can run locally first to make sure the failed test is not introduced by the new change.

That's possible, but it adds yet another workflow. Each new workflow requires more build resources.
I'll first focus on making this PR pass without adding a new workflow. Perhaps the new workflow could be added in a separate PR?

@lhotari
Copy link
Member Author

lhotari commented Apr 6, 2021

There are still some problems with the way TestNG works. Passing "quarantine" in excludedGroups will prevent setup methods with annotation @BeforeMethod(groups = {"broker", "flaky", "quarantine"}) from running. The alternative seems to be to use @BeforeMethod(alwaysRun = true) or use some other group in BeforeMethod ("setup"?)

@codelipenghui
Copy link
Contributor

That's possible, but it adds yet another workflow. Each new workflow requires more build resources.
I'll first focus on making this PR pass without adding a new workflow. Perhaps the new workflow could be added in a separate PR?

I'm worried if we can't easily see the failed test, the PR will get merged, but the PR might introduce more failed tests (the change breaks some tests under the quarantine group).

@lhotari
Copy link
Member Author

lhotari commented Apr 6, 2021

I'm worried if we can't easily see the failed test, the PR will get merged, but the PR might introduce more failed tests (the change breaks some tests under the quarantine group).

That's a valid concern. It's something that could be addressed after we have the initial solution in place.

@lhotari
Copy link
Member Author

lhotari commented Apr 6, 2021

There are still some problems with the way TestNG works. Passing "quarantine" in excludedGroups will prevent setup methods with annotation @BeforeMethod(groups = {"broker", "flaky", "quarantine"}) from running. The alternative seems to be to use @BeforeMethod(alwaysRun = true) or use some other group in BeforeMethod ("setup"?)

I resolved this issue by using @BeforeMethod(alwaysRun = true). That seems to be the proper solution in this case.

@eolivelli
Copy link
Contributor

@codelipenghui the problem is that currently those tests are very flaky and we are ignoring them anyway.
Having a test that is failing and if it fails we simply ignore it it is useless.
In order to not lose them I suggest to create "blocker" issues for the next release and to fix them one at a time.

If we have a panel that tells that a test failed with a disclaimer "do not worry about this test", then what is the meaning of that test ? it is simply a waste of resources to run it and more noise for the reviewers and for the contributors.

@codelipenghui
Copy link
Contributor

codelipenghui commented Apr 6, 2021

the problem is that currently those tests are very flaky and we are ignoring them anyway.
Having a test that is failing and if it fails we simply ignore it it is useless.
In order to not lose them I suggest to create "blocker" issues for the next release and to fix them one at a time.

If we have a panel that tells that a test failed with a disclaimer "do not worry about this test", then what is the meaning of that test ? it is simply a waste of resources to run it and more noise for the reviewers and for the contributors.

The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the flaky. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.

And if the test is very flaky, my point is we need to fix it. If it has not failed frequently before, but frequently now, this is most likely caused by some concurrent merge. A similar situation happened before, most of them are caused by the concurrent merge.

@lhotari
Copy link
Member Author

lhotari commented Apr 6, 2021

The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the flaky. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.

I agree with you. There's just a special urgency now all PRs are blocked because of the very high flakiness of a few tests.

And if the test is very flaky, my point is we need to fix it. If it has not failed frequently before, but frequently now, this is most likely caused by some concurrent merge. A similar situation happened before, most of them are caused by the concurrent merge.

Yes, the tests have to be fixed as top priority. However we need a way to unblock the CI. That is by moving the most flaky tests to a quarantine test group.

The tests have been very flaky for a longer period of time. One fairly recent change in GitHub Actions has been the change in ubuntu-latest image. It now points to ubuntu-20.04 instead of ubuntu-18.04. Another change that has increased the flakiness is simply the regrouping of tests. Since some tests are very fragile and flaky, a small change in the environment, such as regrouping the tests, can increase the flakiness. These 2 major changes seems to be the reason for the sudden increase in flakiness. However, the tests have been flaky even before the changes, so it's the tests that should be fixed instead of rolling back to the old environmental setup (test grouping and ubuntu-18.04 image).

@lhotari
Copy link
Member Author

lhotari commented Apr 6, 2021

This seems to be pretty hard to get right. TestNG has this particular detail: https://testng.org/doc/documentation-main.html#partial-groups
If the test group is defined at the test class level, the group gets assigned to all test methods. It's not possible to override the group of a single test method, it will just add additional groups to the method. This is one reason why the current solution doesn't work. Tests get executed as part of the "quarantine" group in addition.
I found some answer on Stackoverflow where adding a custom test listener can solve this problem.

@lhotari
Copy link
Member Author

lhotari commented Apr 6, 2021

Another problem seems to be the slowness of test execution, this is an issue at least when using -DtestReuseFork=false. Each test class takes remarkable time to process even when there aren't any tests for a a specific test group in the class.
This is one of the worst examples:

[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.102 s - in org.apache.pulsar.common.naming.NamespaceBundleTest

Usually the reported times are <0.5 s, but it's possible that it doesn't add the JVM startup time to the duration.

@aahmed-se
Copy link
Contributor

This seems wrong, I already added this functionality, there is group "extra" available to segregate tests, we can just change the one we wish to exclude to that , they will not be run among the broker tests. We don't need all these changes.

@lhotari
Copy link
Member Author

lhotari commented Apr 7, 2021

This seems wrong, I already added this functionality, there is group "extra" available to segregate tests, we can just change the one we wish to exclude to that , they will not be run among the broker tests. We don't need all these changes.

It might be wrong currently. I hope we get the changes eventually "right" together.

The "extra" group didn't really exist. It was only mentioned in the group parameter of a few @BeforeMethod annotations.
This isn't a sufficient solution. I learned by experimenting, that to properly exclude a group in TestNG, the group has to be listed in the pom.xml's default value for excludedGroups. This will resolve the issue with TestNG where a single test method is added to another group:

for example:

@Test(groups = "flaky")
public class DeadLetterTopicTest extends ProducerConsumerBase {
...
    @Test
    public void testDeadLetterTopic() throws Exception {
    ....
    }
...
    @Test(groups = "quarantine")
    public void testDeadLetterTopicByCustomTopicName() throws Exception {
    ...
    }
...
...
}

In the above example, testDeadLetterTopicByCustomTopicName test case will belong to groups flaky and quarantine at runtime (explanation in TestNG documentation). If there's no excludedGroups setting, this method will also get run for the flaky group. This problem is resolved by adding quarantine to the excludedGroups.

Since it's necessary to add the quarantine group to the excludedGroups, that is the reason why it doesn't work as a solution to list all possible groups in @BeforeMethod annotations.

For example, when excludedGroups contains quarantine, this BeforeMethod doesn't get run at all, for any group:

    @BeforeMethod(groups = { "broker", "websocket", "broker-api", "broker-discovery", "broker-impl", "extra", "flaky", "quarantine" })
     public void beforeMethod(Method m) throws Exception {
         methodName = m.getName();
     }

This problem is resolved by using

    @BeforeMethod(alwaysRun = true)
     public void beforeMethod(Method m) throws Exception {
         methodName = m.getName();
     }

I hope this explanation clarifies why this PR contains necessary changes if we want to add a new TestNG group for quarantined tests. If our goal is something else, this PR isn't necessary at all. In that case, we could simply use TestNG's org.testng.annotations.Ignore annotation or @Test(enabled=false).

@aahmed-se
Copy link
Contributor

aahmed-se commented Apr 7, 2021

We don't need the "excludedGroups" we already have group separation of broker tests running in gitlab workflows. They do no conflict with each other. I have used the extra group successfully in our internal branch to exclude tests from all runs, it does not require any changes in pom.xml. In regards to the "BeforeMethod" more most of the them have extra in already in place if it's missing we can add them.

@aahmed-se
Copy link
Contributor

This pr shows the general direction that should be taken
#10158

@lhotari
Copy link
Member Author

lhotari commented Apr 7, 2021

This pr shows the general direction that should be taken
#10158

@aahmed-se Please elaborate since there is no explanation.

@lhotari
Copy link
Member Author

lhotari commented Apr 7, 2021

We do need the "excludedGroups" we already have group separation of broker tests running in gitlab workflows. They do no conflict with each other. I have used the extra group successfully in our internal branch to exclude tests from all runs, it does not require any changes in pom.xml. In regards to the "BeforeMethod" more most of the them extra in already in place if it's missing we can add them.

When mixing class level @Test(group="...") definitions and method level @Test(group="...") definitions, it's necessary. My previous comment explains it. Perhaps you didn't have that type of use case?

@aahmed-se
Copy link
Contributor

Here is the updated pr #10158
I have isolated all the unstable tests with extra group annotation, also there is template jos to run those tests on nightly cadence. This is temporary but suitable solution to unblock ci issues.

When mixing class level @Test(group="...") definitions and method level @Test(group="...") definitions, it's necessary. My previous comment explains it. Perhaps you didn't have that type of use case?

I don't have that issue , we shouldn't be filtering things at the test method level, grouping should stay at the class level, mixing both is not a good idea.

@lhotari
Copy link
Member Author

lhotari commented Apr 7, 2021

The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the flaky. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.

@codelipenghui I have added test reporting for quarantined tests in this PR. The PR description contains examples.

@lhotari
Copy link
Member Author

lhotari commented Apr 7, 2021

I don't have that issue , we shouldn't be filtering things at the test method level, grouping should stay at the class level, mixing both is not a good idea.

@aahmed-se It seems to be pretty useful to be able to quarantine just a single test method instead of moving all tests of the test class to the quarantine. Why would it be a bad idea?

@lhotari lhotari changed the title [CI] Move flaky tests that fail very often to "quarantine" test group [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability Apr 7, 2021
@aahmed-se
Copy link
Contributor

It will create confusion, user context in java is at a class level, this is enforced at file per class convention. Saying there are two executions of test one under a standard context one under a quarantine context will only confuse individuals.

To run them you will need to create boolean set evaluation rules to isolate methods, testng does not have a clean abstraction to do so.

@lhotari
Copy link
Member Author

lhotari commented Apr 7, 2021

It will create confusion, user context in java is at a class level, this is enforced at file per class convention. Saying there are two executions of test one under a standard context one under a quarantine context will only confuse individuals.

Things seem to work fine with the changes in this PR. Can you provide an example of a possible confusion?

To run them you will need to create boolean set evaluation rules to isolate methods, testng does not have a clean abstraction to do so.

The only requirement I found was to use alwaysRun = true for @Before* methods instead of listing groups. Besides that, the solution passes quarantine in excludedGroups by default and when running quarantine tests, it passes quarantine in groups and an empty string to excludedGroups. Isn't this fine?

@aahmed-se
Copy link
Contributor

The test failures are github specific issues, we don't want to exclude things be default when devs are running things locally or in a separate ci env.

@lhotari lhotari force-pushed the lh-quarantine-tests branch 2 times, most recently from 8f833b1 to 0f6dbee Compare April 11, 2021 15:32
@lhotari
Copy link
Member Author

lhotari commented Apr 11, 2021

This PR is now ready for final review. Please review again @eolivelli @codelipenghui @merlimat

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this great work

I did another pass and the patch looks good to me, I left one question, not a blocker

@@ -26,6 +26,7 @@
<groupId>org.apache</groupId>
<artifactId>apache</artifactId>
<version>23</version>
<relativePath></relativePath>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There has been a similar problem (maven warning) as explained in this question: https://stackoverflow.com/questions/6003831/parent-relativepath-points-at-my-com-mycompanymyproject-instead-of-org-apache

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@lhotari
Copy link
Member Author

lhotari commented Apr 12, 2021

I'm thinking of splitting this PR into 15 or 16 PRs since it contains a lot of unrelated changes. They have a common goal to improve test stability, but the changes are unrelated technically. It will lead to better commit history although the merging of the PRs will be some overhead comparing to the way of merging this large PR at once.

@eolivelli
Copy link
Contributor

@lhotari good idea to split the patch

you can set status to "draft" on this patch in order to keep the comments

@lhotari lhotari marked this pull request as draft April 12, 2021 07:02
@lhotari lhotari changed the title [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability [CI] Move flaky tests that fail very often to "quarantine" test group Apr 12, 2021
@lhotari lhotari marked this pull request as ready for review April 12, 2021 09:17
@lhotari
Copy link
Member Author

lhotari commented Apr 12, 2021

I have removed all changes that aren't related to moving the most flaky tests to the "quarantine" test group. I'll create separate PRs for fixing resource leaks (memory leaks, thread leaks) and for other improvements to tests stability.
@eolivelli @codelipenghui please review this PR again, hopefully we could get it merged since the scope is reduced a lot.

These are the PRs that have been split out of this PR:

- tests are executed, but the failures are ignored for this group
…rTest

- SequenceIdWithErrorTest was using the wrong base class
- broker url parameters must be lazily evaluated, otherwise the
  retry attempt will get the old url and not the new url of the
  new broker
- it helps investigate issues when the JVM hangs
@codelipenghui
Copy link
Contributor

Sorry @lhotari, I want to check the test state of the tests under the quarantine group. But I'm not sure how to check them, Could you please point me how can I find the failed tests in the quarantine group for this PR?

@lhotari
Copy link
Member Author

lhotari commented Apr 13, 2021

/pulsarbot run-failure-checks

@lhotari
Copy link
Member Author

lhotari commented Apr 13, 2021

Sorry @lhotari, I want to check the test state of the tests under the quarantine group. But I'm not sure how to check them, Could you please point me how can I find the failed tests in the quarantine group for this PR?

@codelipenghui The quarantined test reporting isn't fully working, but it's possible to navigate to the logs. Here are the quarantined test failures as part of "CI - Unit - Brokers - Other / unit-tests" check: https://github.com/apache/pulsar/runs/2327040599?check_suite_focus=true#step:8:133

I know that it would be easier to see the results if it was a separate build job. That would add more overhead to our builds and I'd like to avoid that since the resource consumption is already really high for our builds.

It seems that the intended reporting solution broke when I moved ReplicatorTests to run separately from other Quarantined tests. There are a few reasons why I had to do this, mainly because there are so many resource leakages that are fixed in PRs #10192 , #10195 , #10196 , #10197, #10198 and #10199 . I'd like to propose that we get these PRs merged and I can improve the quarantined test reporting after that. @codelipenghui Are you fine with that?

It should always be a blocker issue to fix quarantined tests. Therefore the need for good reporting might not be so relevant if fixing quarantined tests is taken seriously.

Some of the problems with quarantined tests will get resolved after the PRs to fix resource leakages have been merged. For example, the stability of ReplicatorTests will improve significantly and it should be possible to move the test out of the quarantine group. There are 1 or 2 flaky test methods which need slight fixes, but the root cause has been the resource leaks and asynchronous shutdown of broker instances in tests (fixed by #10199).

@codelipenghui
Copy link
Contributor

@lhotari Sounds good to me. We can improve the quarantined test reporting later. It's useful for the reviewers to check if the new change break some tests which is under the quarantined group.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants