Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testsuite: fix some test races and improve debugging #5609

Merged
merged 10 commits into from Dec 7, 2023

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Dec 6, 2023

Many tests in the testsuite seem to have become a bit more unreliable lately. I disabled the make recheck step on a branch, at which point 2 or 3 builds failed every time. I then worked through the failing tests, trying to fix some and adding more debug info for others where there was nothing apparent in the logs for the cause of the failure.

This doesn't fix everything, but it should make CI at least slightly more reliable.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for working through all these! Everyone will suffer a little less.

Problem: One of the tests in t1102-cmddriver.t assumes that the `flux`
executable will not be found first in /bin or /usr/bin, but this may
not be the case when FLUX_TEST_INSTALLED_PATH is being used.

Skip the test if flux is found first in /bin or /usr/bin.
Problem: The queue output tests in t2801-top-cmd.t often fail in
CI.

The issue seems to be related to the checks for 0 matching lines
at the end of some of the tests. As a guess, sometimes one or more
lines of output match due to timing issues, and so the test fails.

Just remove these overly punctilious checks.
Problem: A test in t0015-job-output.py attempts to ensure that a
FileNotFoundError is returned from a job output watch when the job
guest namespace has been created, but the output eventlog does
not yet exist. But this test has a race if the job is released
before the output watch RPC is recieved by the job-info module,
and there is no good way to avoid the race during the test since the
`output_event_watch` call is synchronous and called from a thread in
the test

Just remove the racy test so that occasional failures in CI are
avoided.
Problem: There are a couple tests that occasionally fail in ci without
any clues as to the source of the failure.

Add debugging to these tests for future use.
Problem: The multiple job kill tests in t2201-job-cmd.t occasionally
fail in ci.

These tests will be more reliable if the test waits for the shell.init
event, so add that synchronization to affected tests.
Problem: The job shell input tests in t2607-job-shell-input.t redirect
stderr to files that are then never used, making debugging the tests
in ci impossible.

Do not suppress stderr so errors appear in ci logs.
Problem: t3307-system-leafcrash.t has been seen to hang when used
with --chain-lint.

The assumption is that this series of tests become racy with
--chain-lint because some tests end up being skipped. Just skip the
entire test when --chain-lint is used.
Problem: The test `flux alloc --bg can be interrupted` fails
sporadically in ci.

This may be due to an inherent race condition handling SIGINT.  Send
SIGINT a second time in the test to hopefully increase reliability.
Problem: The 'signal forwarding works' test occasionally fails in ci.

A potential issue is that the test_signal.sh test is run twice,
once with SIGINT and once with SIGTERM, and the sleepready.out
logfile is not cleaned up between the two runs. If the subsequent
waitfail invocation starts before the background process overwrites
the existing sleepready.out, then the test could prematurely send
SIGTERM before flux-exec is ready.

Remove the output file in the test script to avoid this race.
Problem: There's a small race in the LSF uri resolver if a flux-broker
process for the calling user exits between when it is listed by ps(1)
and the resolver attempts to open the /proc/PID/environ file. In this
case, instead of continuing to try other brokers `flux-uri` aborts with

 flux-uri: ERROR: No such file or directory: '/proc/PID/environ'

Catch the FileNotFound exception when trying to read /proc/PID/environ
and treat this the same as if the broker did not match the LSF jobid.
@grondo
Copy link
Contributor Author

grondo commented Dec 7, 2023

Thanks! Setting MWP.

@grondo
Copy link
Contributor Author

grondo commented Dec 7, 2023

Thanks! I set MWP

Copy link

codecov bot commented Dec 7, 2023

Codecov Report

Merging #5609 (146f1ff) into master (a96f7fa) will increase coverage by 0.00%.
The diff coverage is 66.66%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5609   +/-   ##
=======================================
  Coverage   83.44%   83.44%           
=======================================
  Files         487      487           
  Lines       82794    82797    +3     
=======================================
+ Hits        69087    69093    +6     
+ Misses      13707    13704    -3     
Files Coverage Δ
src/bindings/python/flux/uri/resolvers/lsf.py 86.27% <66.66%> (-3.31%) ⬇️

... and 8 files with indirect coverage changes

@mergify mergify bot merged commit 49287b5 into flux-framework:master Dec 7, 2023
33 of 34 checks passed
@grondo grondo deleted the testsuite-fixes branch December 7, 2023 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants