testsuite: fix some test races and improve debugging #5609

grondo · 2023-12-06T22:58:57Z

Many tests in the testsuite seem to have become a bit more unreliable lately. I disabled the make recheck step on a branch, at which point 2 or 3 builds failed every time. I then worked through the failing tests, trying to fix some and adding more debug info for others where there was nothing apparent in the logs for the cause of the failure.

This doesn't fix everything, but it should make CI at least slightly more reliable.

garlick

LGTM! Thanks for working through all these! Everyone will suffer a little less.

Problem: One of the tests in t1102-cmddriver.t assumes that the `flux` executable will not be found first in /bin or /usr/bin, but this may not be the case when FLUX_TEST_INSTALLED_PATH is being used. Skip the test if flux is found first in /bin or /usr/bin.

Problem: The queue output tests in t2801-top-cmd.t often fail in CI. The issue seems to be related to the checks for 0 matching lines at the end of some of the tests. As a guess, sometimes one or more lines of output match due to timing issues, and so the test fails. Just remove these overly punctilious checks.

Problem: A test in t0015-job-output.py attempts to ensure that a FileNotFoundError is returned from a job output watch when the job guest namespace has been created, but the output eventlog does not yet exist. But this test has a race if the job is released before the output watch RPC is recieved by the job-info module, and there is no good way to avoid the race during the test since the `output_event_watch` call is synchronous and called from a thread in the test Just remove the racy test so that occasional failures in CI are avoided.

Problem: There are a couple tests that occasionally fail in ci without any clues as to the source of the failure. Add debugging to these tests for future use.

Problem: The multiple job kill tests in t2201-job-cmd.t occasionally fail in ci. These tests will be more reliable if the test waits for the shell.init event, so add that synchronization to affected tests.

Problem: The job shell input tests in t2607-job-shell-input.t redirect stderr to files that are then never used, making debugging the tests in ci impossible. Do not suppress stderr so errors appear in ci logs.

Problem: t3307-system-leafcrash.t has been seen to hang when used with --chain-lint. The assumption is that this series of tests become racy with --chain-lint because some tests end up being skipped. Just skip the entire test when --chain-lint is used.

Problem: The test `flux alloc --bg can be interrupted` fails sporadically in ci. This may be due to an inherent race condition handling SIGINT. Send SIGINT a second time in the test to hopefully increase reliability.

Problem: The 'signal forwarding works' test occasionally fails in ci. A potential issue is that the test_signal.sh test is run twice, once with SIGINT and once with SIGTERM, and the sleepready.out logfile is not cleaned up between the two runs. If the subsequent waitfail invocation starts before the background process overwrites the existing sleepready.out, then the test could prematurely send SIGTERM before flux-exec is ready. Remove the output file in the test script to avoid this race.

Problem: There's a small race in the LSF uri resolver if a flux-broker process for the calling user exits between when it is listed by ps(1) and the resolver attempts to open the /proc/PID/environ file. In this case, instead of continuing to try other brokers `flux-uri` aborts with flux-uri: ERROR: No such file or directory: '/proc/PID/environ' Catch the FileNotFound exception when trying to read /proc/PID/environ and treat this the same as if the broker did not match the LSF jobid.

grondo · 2023-12-07T01:59:09Z

Thanks! Setting MWP.

grondo · 2023-12-07T16:56:56Z

Thanks! I set MWP

codecov · 2023-12-07T16:57:16Z

Codecov Report

Merging #5609 (146f1ff) into master (a96f7fa) will increase coverage by 0.00%.
The diff coverage is 66.66%.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #5609   +/-   ##
=======================================
  Coverage   83.44%   83.44%           
=======================================
  Files         487      487           
  Lines       82794    82797    +3     
=======================================
+ Hits        69087    69093    +6     
+ Misses      13707    13704    -3

Files	Coverage Δ
src/bindings/python/flux/uri/resolvers/lsf.py	`86.27% <66.66%> (-3.31%)`	⬇️

... and 8 files with indirect coverage changes

grondo mentioned this pull request Dec 7, 2023

NEWS.md: add release notes for v0.57.0 #5610

Merged

garlick approved these changes Dec 7, 2023

View reviewed changes

grondo added 10 commits December 7, 2023 01:59

testsuite: add some debugging in sporadically failing tests

ad1d894

Problem: There are a couple tests that occasionally fail in ci without any clues as to the source of the failure. Add debugging to these tests for future use.

testsuite: fix potential races in t2201-job-cmd.t

5d1d742

Problem: The multiple job kill tests in t2201-job-cmd.t occasionally fail in ci. These tests will be more reliable if the test waits for the shell.init event, so add that synchronization to affected tests.

testsuite: do not suppress stderr in t2607-job-shell-input.t

177eb0a

Problem: The job shell input tests in t2607-job-shell-input.t redirect stderr to files that are then never used, making debugging the tests in ci impossible. Do not suppress stderr so errors appear in ci logs.

testsuite: improve reliability of t2712-python-cli-alloc.t

9d7fc42

Problem: The test `flux alloc --bg can be interrupted` fails sporadically in ci. This may be due to an inherent race condition handling SIGINT. Send SIGINT a second time in the test to hopefully increase reliability.

grondo force-pushed the testsuite-fixes branch from 3701299 to 146f1ff Compare December 7, 2023 01:59

grondo mentioned this pull request Dec 7, 2023

ci: alpine build errors #5596

Closed

grondo added the merge-when-passing label Dec 7, 2023

mergify bot merged commit 49287b5 into flux-framework:master Dec 7, 2023
33 of 34 checks passed

grondo deleted the testsuite-fixes branch December 7, 2023 17:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

testsuite: fix some test races and improve debugging #5609

testsuite: fix some test races and improve debugging #5609

grondo commented Dec 6, 2023

garlick left a comment

grondo commented Dec 7, 2023

grondo commented Dec 7, 2023

codecov bot commented Dec 7, 2023

testsuite: fix some test races and improve debugging #5609

testsuite: fix some test races and improve debugging #5609

Conversation

grondo commented Dec 6, 2023

garlick left a comment

Choose a reason for hiding this comment

grondo commented Dec 7, 2023

grondo commented Dec 7, 2023

codecov bot commented Dec 7, 2023

Codecov Report