[CI] Move flaky tests that fail very often to "quarantine" test group #10148
@lhotari How do we get the test results of the "quarantine" test group?
@codelipenghui There's no special reporting, since we don't yet have a proper solution for publishing test reports. One would have to check the logs to see which quarantined tests failed. The reporting can be improved later. Currently the CI pipeline is almost completely clogged because a few flaky tests cause most builds to fail.
@lhotari How about adding a workflow to run the "quarantine" test group? The new workflow would not be a required check by default, and the author and reviewers could see the errors from the workflow. They could first run the tests locally to make sure a failure was not introduced by the new change.
That's possible, but it adds yet another workflow. Each new workflow requires more build resources.
There are still some problems with the way TestNG works. Passing "quarantine" in excludedGroups will prevent setup methods with annotation |
I'm worried that if we can't easily see the failed tests, a PR will get merged even though it introduces more failing tests (the change breaks some tests in the quarantine group).
That's a valid concern. It's something that could be addressed after we have the initial solution in place. |
I resolved this issue by using |
@codelipenghui The problem is that those tests are currently very flaky and we are ignoring them anyway. If we have a panel that reports a failed test with the disclaimer "do not worry about this test", what is the point of that test? Running it is simply a waste of resources, and it adds noise for reviewers and contributors.
A flaky test is not a useless test. It can fail with problem A and problem B, A and B introduced the And if the test is very flaky, my point is that we need to fix it. If it did not fail frequently before but fails frequently now, that is most likely caused by a concurrent merge. A similar situation happened before, and most of those failures were caused by concurrent merges.
I agree with you. There's just a special urgency now that all PRs are blocked by the very high flakiness of a few tests.
Yes, fixing the tests has to be the top priority. However, we need a way to unblock the CI, and that way is to move the flakiest tests to a quarantine test group. The tests have been very flaky for a long time. One fairly recent change in GitHub Actions has been the change in |
This seems to be pretty hard to get right. TestNG has this particular detail: https://testng.org/doc/documentation-main.html#partial-groups |
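The partial-groups behavior linked above can be illustrated with a toy model. The sketch below is not TestNG code and does not use TestNG's API: the class and method names are hypothetical, and the selection rule is a simplified reading of the documentation, assuming that a method's effective groups are the union of its class-level and method-level groups and that an excluded group always wins. The group names "broker" and "quarantine" are taken from this discussion.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of group-based test selection; a simplification, not TestNG's implementation. */
final class PartialGroupsModel {

    /** Effective groups of a test method: class-level groups plus method-level groups. */
    static Set<String> effectiveGroups(Set<String> classGroups, Set<String> methodGroups) {
        Set<String> all = new HashSet<>(classGroups);
        all.addAll(methodGroups);
        return all;
    }

    /** Selected if no group is excluded and either nothing is included or one group matches. */
    static boolean isSelected(Set<String> groups, Set<String> included, Set<String> excluded) {
        for (String group : groups) {
            if (excluded.contains(group)) {
                return false; // exclusion wins over inclusion
            }
        }
        if (included.isEmpty()) {
            return true;
        }
        for (String group : groups) {
            if (included.contains(group)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Class-level group "broker"; one method additionally tagged "quarantine".
        Set<String> stable = effectiveGroups(Set.of("broker"), Set.of());
        Set<String> flaky = effectiveGroups(Set.of("broker"), Set.of("quarantine"));

        // Regular run: include "broker", exclude "quarantine".
        System.out.println(isSelected(stable, Set.of("broker"), Set.of("quarantine"))); // true
        System.out.println(isSelected(flaky, Set.of("broker"), Set.of("quarantine")));  // false

        // Quarantine run: include only "quarantine".
        System.out.println(isSelected(flaky, Set.of("quarantine"), Set.of()));  // true
        System.out.println(isSelected(stable, Set.of("quarantine"), Set.of())); // false
    }
}
```

In this model a single method can be moved to quarantine without touching the rest of its class, which is the use case argued for later in the thread.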
Another problem seems to be the slowness of test execution, this is an issue at least when using
Usually the reported times are <0.5 s, but it's possible that it doesn't add the JVM startup time to the duration. |
This seems wrong. I already added this functionality: there is an "extra" group available to segregate tests. We can just move the tests we wish to exclude to that group, and they will not be run among the broker tests. We don't need all these changes.
It might be wrong currently. I hope we get the changes eventually "right" together. The "extra" group didn't really exist. It was only mentioned in the for example:
In the above example, Since it's necessary to add the For example, when
This problem is resolved by using
I hope this explanation clarifies why this PR contains necessary changes if we want to add a new TestNG group for quarantined tests. If our goal is something else, this PR isn't necessary at all. In that case, we could simply use TestNG's |
We don't need "excludedGroups"; we already have group separation of the broker tests running in the GitHub workflows. They do not conflict with each other. I have used the "extra" group successfully in our internal branch to exclude tests from all runs, and it does not require any changes in pom.xml. Regarding "BeforeMethod", most of them already have "extra" in place; where it's missing, we can add it.
This pr shows the general direction that should be taken |
@aahmed-se Please elaborate since there is no explanation. |
When mixing class level |
4b80fa5 to bab356b
Here is the updated pr #10158
I don't have that issue. We shouldn't be filtering at the test method level; grouping should stay at the class level. Mixing both is not a good idea.
@codelipenghui I have added test reporting for quarantined tests in this PR. The PR description contains examples. |
@aahmed-se It seems to be pretty useful to be able to quarantine just a single test method instead of moving all tests of the test class to the quarantine. Why would it be a bad idea? |
It will create confusion. The user's context in Java is at the class level, which is reinforced by the one-class-per-file convention. Saying that there are two executions of a test class, one under a standard context and one under a quarantine context, will only confuse people. To run them, you would need to write boolean set-evaluation rules to isolate methods, and TestNG does not have a clean abstraction for that.
Things seem to work fine with the changes in this PR. Can you provide an example of a possible confusion?
The only requirement I found was to use |
The test failures are GitHub-specific issues. We don't want to exclude things by default when developers run tests locally or in a separate CI environment.
8f833b1 to 0f6dbee
This PR is now ready for final review. Please review again @eolivelli @codelipenghui @merlimat |
Thank you very much for this great work.
I did another pass and the patch looks good to me. I left one question; it is not a blocker.
buildtools/pom.xml
Outdated
@@ -26,6 +26,7 @@
<groupId>org.apache</groupId>
<artifactId>apache</artifactId>
<version>23</version>
<relativePath></relativePath>
Why?
There has been a similar problem (maven warning) as explained in this question: https://stackoverflow.com/questions/6003831/parent-relativepath-points-at-my-com-mycompanymyproject-instead-of-org-apache
Thanks
I'm thinking of splitting this PR into 15 or 16 PRs, since it contains a lot of unrelated changes. They share the goal of improving test stability, but the changes are technically unrelated. Splitting will lead to a better commit history, although merging that many PRs adds some overhead compared to merging this large PR at once.
@lhotari Good idea to split the patch. You can set this PR's status to "draft" in order to keep the comments.
0f6dbee to a2d7656
- tests are executed, but the failures are ignored for this group
…estBase - makes retries work
…rTest - SequenceIdWithErrorTest was using the wrong base class
- broker url parameters must be lazily evaluated, otherwise the retry attempt will get the old url and not the new url of the new broker
- it helps investigate issues when the JVM hangs
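The note above about lazily evaluating broker URL parameters can be illustrated with a small sketch. This is not code from the PR: `FakeBroker`, its port number, and `urlsSeenAcrossAttempts` are hypothetical names invented for the illustration, assuming only that a restarted broker gets a new URL and that a retry should use it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class LazyBrokerUrlExample {

    /** Hypothetical stand-in for a test broker whose URL changes on restart. */
    static final class FakeBroker {
        private int port = 6650;

        String url() {
            return "pulsar://localhost:" + port;
        }

        void restart() {
            port++; // a new broker instance listens on a new port
        }
    }

    /** Asks the supplier for the URL on every attempt; the broker restarts between attempts. */
    static List<String> urlsSeenAcrossAttempts(FakeBroker broker, Supplier<String> urlSupplier, int attempts) {
        List<String> seen = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            seen.add(urlSupplier.get());
            broker.restart();
        }
        return seen;
    }

    public static void main(String[] args) {
        // Eager: the URL is captured once, so the retry still sees the old broker's URL.
        FakeBroker eagerBroker = new FakeBroker();
        String captured = eagerBroker.url();
        System.out.println(urlsSeenAcrossAttempts(eagerBroker, () -> captured, 2));
        // [pulsar://localhost:6650, pulsar://localhost:6650]

        // Lazy: the URL is looked up per attempt, so the retry sees the new broker's URL.
        FakeBroker lazyBroker = new FakeBroker();
        System.out.println(urlsSeenAcrossAttempts(lazyBroker, lazyBroker::url, 2));
        // [pulsar://localhost:6650, pulsar://localhost:6651]
    }
}
```

Passing a `Supplier<String>` instead of a `String` is one common way to defer the lookup; the PR may implement the lazy evaluation differently.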
a2d7656 to 3a838f1
Sorry @lhotari, I want to check the state of the tests in the quarantine group, but I'm not sure how. Could you please point me to where I can find the failed quarantined tests for this PR?
/pulsarbot run-failure-checks |
@codelipenghui The quarantined test reporting isn't fully working, but it's possible to navigate to the logs. Here are the quarantined test failures from the "CI - Unit - Brokers - Other / unit-tests" check: https://github.com/apache/pulsar/runs/2327040599?check_suite_focus=true#step:8:133
I know it would be easier to see the results if this were a separate build job. That would add more overhead to our builds, and I'd like to avoid it since their resource consumption is already very high.
It seems that the intended reporting solution broke when I moved ReplicatorTests to run separately from the other quarantined tests. There are a few reasons why I had to do this, mainly the many resource leaks that are fixed in PRs #10192, #10195, #10196, #10197, #10198 and #10199. I'd like to propose that we get these PRs merged, and I can improve the quarantined test reporting after that. @codelipenghui Are you fine with that?
Fixing quarantined tests should always be treated as a blocker. If that is taken seriously, good reporting becomes less important. Some of the problems with quarantined tests will be resolved once the PRs that fix the resource leaks have been merged. For example, the stability of ReplicatorTests will improve significantly, and it should be possible to move the test out of the quarantine group. There are 1 or 2 flaky test methods that need slight fixes, but the root cause has been the resource leaks and the asynchronous shutdown of broker instances in tests (fixed by #10199).
@lhotari Sounds good to me. We can improve the quarantined test reporting later. It's useful for reviewers to be able to check whether a new change breaks tests in the quarantined group.
Motivation
There are a few tests that fail very often. This is blocking the merging of PRs currently.
Modifications
Move the problematic tests to "quarantine" test group. This test group will be run, but the test failures will be ignored.
alwaysRun=true should be used instead of listing individual groups in the Before* annotations.
Test results will be visible directly in the GitHub Actions UI (example).
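Why alwaysRun=true is preferred over listing groups on the Before* methods can be sketched with a toy model. This is not TestNG code: `ConfigMethod` and `runsFor` are hypothetical names, and the rule is a simplified assumption that a configuration method runs when it is marked alwaysRun, when no groups are included, or when it belongs to one of the included groups.

```java
import java.util.Set;

/** Toy model of when a Before* configuration method runs; not TestNG's implementation. */
final class AlwaysRunModel {

    /** Hypothetical description of a Before* method: its groups and its alwaysRun flag. */
    static final class ConfigMethod {
        final Set<String> groups;
        final boolean alwaysRun;

        ConfigMethod(Set<String> groups, boolean alwaysRun) {
            this.groups = groups;
            this.alwaysRun = alwaysRun;
        }
    }

    /** Runs if alwaysRun is set, if no groups are included, or if one of its groups is included. */
    static boolean runsFor(ConfigMethod method, Set<String> includedGroups) {
        if (method.alwaysRun || includedGroups.isEmpty()) {
            return true;
        }
        for (String group : method.groups) {
            if (includedGroups.contains(group)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> quarantineRun = Set.of("quarantine");

        // Setup lists only "broker": it is skipped when only "quarantine" is included.
        ConfigMethod listedGroups = new ConfigMethod(Set.of("broker"), false);
        System.out.println(runsFor(listedGroups, quarantineRun)); // false

        // Setup marked alwaysRun: it runs no matter which groups are included.
        ConfigMethod always = new ConfigMethod(Set.of(), true);
        System.out.println(runsFor(always, quarantineRun)); // true
    }
}
```

Under this model, listing groups works only as long as every Before* method is updated whenever a new group such as "quarantine" is introduced, whereas alwaysRun needs no such upkeep.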
Detailed quarantined test results:
Issue reports for quarantined tests