Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exile job for unreliable tests. #3781

Merged
merged 1 commit into from
Mar 24, 2022
Merged

exile job for unreliable tests. #3781

merged 1 commit into from
Mar 24, 2022

Conversation

mbien
Copy link
Member

@mbien mbien commented Mar 15, 2022

lets give this a try to make job 13 more reliable

  1. identify unreliable tests
  2. isolate them into their own job (easier restartable without having to run well behaving tests multiple times)
  3. maybe use travis_retry

logs from a different PR where i had to restart job 13 four times in a row:

https://app.travis-ci.com/github/apache/netbeans/jobs/563140526
org.netbeans.modules.java.lsp.server.explorer.ProjectViewTest

      failed: testOnceExistedDeletedItemInfo

org.netbeans.modules.java.lsp.server.project.LspBrokenReferencesImplTest

      failed: testIgnoredErrorsTimeout

org.netbeans.modules.java.source.parsing.PartialReparseTest

      failed: testIntroduceParseError2
---
https://app.travis-ci.com/github/apache/netbeans/jobs/563138079
org.netbeans.modules.java.lsp.server.explorer.ProjectViewTest

      failed: testEmptyPackageRemains
---
https://app.travis-ci.com/github/apache/netbeans/jobs/563294404
org.netbeans.modules.java.lsp.server.project.LspBrokenReferencesImplTest

      failed: testIgnoredErrorsTimeout
---
https://app.travis-ci.com/github/apache/netbeans/jobs/563294404
org.netbeans.modules.java.lsp.server.explorer.ProjectViewTest

      failed: testProjectSourcePackages

@mbien mbien added the CI Continuous Integration label Mar 15, 2022
Copy link
Contributor

@vieiro vieiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me (I've requested a small clarification, though).

- nbbuild/travis/ant.sh $OPTS build
script:
- travis_retry ant $OPTS -f java/java.lsp.server test
- travis_retry hide-logs.sh ant $OPTS -f java/gradle.java test
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "hide-logs.sh" filter in the second line will also show the failed tests of the previous "java/java.lsp.server" test, right? I understand this is on purpose, not a typo, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know actually. hide-logs.sh was there before, i just moved the line to a different job. I assumed it being a streaming filter which removes some spammy log lines to make CI not error - but I haven't actually looked at the script.

Part of this test is me figuring out how practical travis_retry is. If you think this is a problem i can remove it. Main intention was to isolate bad behaving tests so that they would be in a short(er) restartable job, separated from reliable tests.

This is also somewhat of a contingency plan in case the tests can't be stabilized despite all efforts. I didn't plan to merge this if it turns out that the pending reliability PRs fix the problem.

But if you want I can merge it anyway (after your PRs are integrated).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. "hide-logs.sh" is quite useful, I think.
Regarding retries: if tests are reliable there's no need to retry them, right? :-).
I think we should try to achieve a "Green Travis" ASAP, no need to wait por PRs. There're system-wide refactorings taking place, and we have no safety net.
But maybe we want to discuss in the dev mailing list before merging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding retries: if tests are reliable there's no need to retry them, right? :-).

right. Thats why I wanted to wait till your PR is integrated to check how the improved test behaves.

Btw travis_retry is used in 14 other places in that travis.yml.

@mbien mbien marked this pull request as ready for review March 23, 2022 17:07
@mbien
Copy link
Member Author

mbien commented Mar 24, 2022

@matthiasblaesing @JaroslavTulach @sdedic should we give this a try?

@matthiasblaesing
Copy link
Contributor

I would say lets try it. If it helps great, if not it was an idea.

@JaroslavTulach
Copy link

Apache NetBeans project needs to do something with the unreliable tests. Adding @RandomlyFails (e.g. excluding the tests from the CI runs) would be my preferred choice. However I am ready to support any solution that does something with the failures.

Ideally the owners of the tests (@sdedic & other OracleLabs guys as ServerTest and Gradle is observed by the team) should try their best to address the instability. However I haven't noticed any activity in this respect recently. CCing @MartinBalin

@JaroslavTulach JaroslavTulach self-requested a review March 24, 2022 08:14
@mbien
Copy link
Member Author

mbien commented Mar 24, 2022

awesome. I wanted to make sure everyone knows that some tests might be in a different job now.

@mbien mbien merged commit 0876224 into apache:master Mar 24, 2022
@sdedic
Copy link
Member

sdedic commented Mar 24, 2022

@JaroslavTulach re our schedule, the bugfixing period should start next week. I hope I'll be able to address at least the Gradle/java one. I've already found some race - but apparently not all of them.

@JaroslavTulach
Copy link

the bugfixing period should start next week. I hope I'll be able to address at least the Gradle/java one. I've already found some race - but apparently not all of them.

+1, but:

The purpose of pre-commit CI validation is: prevent others to cause regressions. Keeping unreliable tests in the shared CI pool undermines the main purpose - it only annoys others as they are being punished by your own incapability to stabilize your tests.

Set a private job (like vscode did with java/java.lsp.server tests) to run your tests including @RandomlyFails ones. Execute often and perform stabilization of your code & tests there until the code is stable. Meanwhile just annotate every randomly failing test in pre-commit CI as @RandomlyFails ones.

I fully understand why engineers don't want to disable tests, but consider the harm caused by keeping annoying randomly failing tests in the shared CI pool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Continuous Integration
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants