New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
testsuite: fix some test races and improve debugging #5609
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
garlick
approved these changes
Dec 7, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for working through all these! Everyone will suffer a little less.
Problem: One of the tests in t1102-cmddriver.t assumes that the `flux` executable will not be found first in /bin or /usr/bin, but this may not be the case when FLUX_TEST_INSTALLED_PATH is being used. Skip the test if flux is found first in /bin or /usr/bin.
Problem: The queue output tests in t2801-top-cmd.t often fail in CI. The issue seems to be related to the checks for 0 matching lines at the end of some of the tests. As a guess, sometimes one or more lines of output match due to timing issues, and so the test fails. Just remove these overly punctilious checks.
Problem: A test in t0015-job-output.py attempts to ensure that a FileNotFoundError is returned from a job output watch when the job guest namespace has been created, but the output eventlog does not yet exist. But this test has a race if the job is released before the output watch RPC is recieved by the job-info module, and there is no good way to avoid the race during the test since the `output_event_watch` call is synchronous and called from a thread in the test Just remove the racy test so that occasional failures in CI are avoided.
Problem: There are a couple tests that occasionally fail in ci without any clues as to the source of the failure. Add debugging to these tests for future use.
Problem: The multiple job kill tests in t2201-job-cmd.t occasionally fail in ci. These tests will be more reliable if the test waits for the shell.init event, so add that synchronization to affected tests.
Problem: The job shell input tests in t2607-job-shell-input.t redirect stderr to files that are then never used, making debugging the tests in ci impossible. Do not suppress stderr so errors appear in ci logs.
Problem: t3307-system-leafcrash.t has been seen to hang when used with --chain-lint. The assumption is that this series of tests become racy with --chain-lint because some tests end up being skipped. Just skip the entire test when --chain-lint is used.
Problem: The test `flux alloc --bg can be interrupted` fails sporadically in ci. This may be due to an inherent race condition handling SIGINT. Send SIGINT a second time in the test to hopefully increase reliability.
Problem: The 'signal forwarding works' test occasionally fails in ci. A potential issue is that the test_signal.sh test is run twice, once with SIGINT and once with SIGTERM, and the sleepready.out logfile is not cleaned up between the two runs. If the subsequent waitfail invocation starts before the background process overwrites the existing sleepready.out, then the test could prematurely send SIGTERM before flux-exec is ready. Remove the output file in the test script to avoid this race.
Problem: There's a small race in the LSF uri resolver if a flux-broker process for the calling user exits between when it is listed by ps(1) and the resolver attempts to open the /proc/PID/environ file. In this case, instead of continuing to try other brokers `flux-uri` aborts with flux-uri: ERROR: No such file or directory: '/proc/PID/environ' Catch the FileNotFound exception when trying to read /proc/PID/environ and treat this the same as if the broker did not match the LSF jobid.
grondo
force-pushed
the
testsuite-fixes
branch
from
December 7, 2023 01:59
3701299
to
146f1ff
Compare
Thanks! Setting MWP. |
Thanks! I set MWP |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #5609 +/- ##
=======================================
Coverage 83.44% 83.44%
=======================================
Files 487 487
Lines 82794 82797 +3
=======================================
+ Hits 69087 69093 +6
+ Misses 13707 13704 -3
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Many tests in the testsuite seem to have become a bit more unreliable lately. I disabled the
make recheck
step on a branch, at which point 2 or 3 builds failed every time. I then worked through the failing tests, trying to fix some and adding more debug info for others where there was nothing apparent in the logs for the cause of the failure.This doesn't fix everything, but it should make CI at least slightly more reliable.