testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR #4212

grondo · 2022-03-12T17:55:40Z

A good stress test for the testsuite is to run multiple copies of every test across a flux instance via

$ flux mini bulksubmit --quiet --shuffle --watch --progress -o pty --cc=1-4  --job-name={./%} \
   sh -c './{} -d -v --root=$FLUX_JOB_TMPDIR' ::: t*.t python/t*.py lua/t*.t

This command submits 4 copies of every test in the testsuite as a job to the current instance, watching progress and output as the tests progress.

However, when attempting this on fluke, I ran into several issues, including:

Because the testsuite runs with LANG=C and FLUX_JOB_TMPDIR contains the jobid in f58 encoding, a few test scripts and one core flux Python module broke due to Unicode errors.
at least one test broke when FLUX_JOB_CC was set in the current environment
a couple tests used static files or ports outside of the trash directory, so could not tolerate being run concurrently
the python tests don't accept or ignore options like -d, --debug or --root=PATH and instead abort with unknown option errors

This PR fixes all the little issues above.

I was then able to grab a large allocation on fluke and run the above command (even using --cc=1-10 to run 10 copies of every test) successfully.

PD:0    R:0    CD:1699 F:1    │██████████████████████████████████│100.0% 0:01:48
$ flux jobs -f failed -o {name}
NAME
t0000-sharness

There are still a couple issues I hope to run down:

t0000-sharness.t intermittently fails in this mode
Some Lua tests need to get the ability to run in a unique directory
Some tests emit special pty characters when running, which messes up the top-level --progress meter

Therefore this is still a WIP

grondo · 2022-03-13T18:45:02Z

I just pushed a few more fixes and now all tests appear to run successfully when concurrently, at least with the flux mini bulksubmit command above and --cc=1-12 in a large instance on fluke.

garlick

This all looks great! Very cool way to shake out the flakey tests!

I just noted one commit message typo.

garlick · 2022-03-13T18:30:56Z

src/bindings/python/flux/util.py

@@ -88,9 +88,11 @@ def encode_payload(payload):
    if payload is None or payload == ffi.NULL:


commit message
s/non-UTF-9/non-UTF-8/

Problem: flux.util.encode_payload() fails with unicode errors if valid utf-8 characters are in the payload and a non-UTF-8 locale such as LANG=C is set (as is in the case in the testsuite). Add errors=surrogateescape to the encode() methods used in encode_payload() to avoid these errors.

Problem: A couple Python scripts used in the testsuite fail with Unicode exceptions in sqlite3.connect() when they are run in a path with unicode characters, such as $FLUX_JOB_TMPDIR. Work around this problem by using <path>.encode("utf-8", errors="surrogateescape").decode() before passing <path> to sqlite3.open().

Problem: The t0013-config-file.t test fails if more than one test is run on the same system, even from different working trees, since the test uses fixed ports and paths in config files. Generate ports and paths dynamically so they are unlikely to conflict with other tests on the same node.

Problem: Some tests make use of FLUX_JOB_CC, so we should not allow this variable to percolate down into sharness tests when they are running as a job in Flux. Unset FLUX_JOB_CC for all sharness tests in sharness.d/flux-sharness.sh.

Problem: The system personality tests already use the --root=DIR option to relaunch test instances in a subdirectory of the existing trashdir, but if --root is also passed on the command line of the test, then the option gets added twice to the test_under_flux() re-invocation of the test script. This confuses sharness and causes tests to fail. Only propagate the current value of $root as --root=$root when it isn't already set by the system personality.

Problem: Many python tests abort if passed the common sharness test options --debug, -d and --root=PATH since unittest doesn't support those arguments. Consume and ignore -d, --debug and --root in "subflux" module and pass the remaining arguments down to unittest. This allows python tests to run alongside sharness tests with things like `flux mini bulksubmit` when the --root or --debug options are needed, without having to jump through hoops trying to pass different options to different tests.

Problem: Running tests with --root=FLUX_JOB_TMPDIR often causes strange issues, but nothing in the ci testsuite runs in this way. Adjust t/test-incenption.sh to run with --root=$FLUX_JOB_TMPDIR.

Problem: Until issue flux-framework#4047 (cancallation of job hieararchy doesn't fully clean up) is resolved, job processes may be left over after running the testsuite. Use of `sleep inf` causes these particular processes to be left over forever, until manually killed. Replace uses of `sleep inf` with `sleep 300` so that the proceses eventually exit.

Problem: The --progress option is used in a couple places in the testsuite, but this can mangle a user's tty when they are also running the testsuite under bulksubmit with the --progress option. Since there is no good reason to use --progress inside of the testsuite, remove all uses that do not test the option itself.

Problem: The tests in t/lua/t*.t do not support the --root=PATH option like sharness, which can be used to ensure the same test run multiple times on the same node doesn't conflict with itself. Add support for the --root option in fluxometer.lua so that Lua tests can be run with the same command line as all other tests.

Problem: t0000-sharness.t fails when run multiple times per node because the sub-sharness may run in an original trash direcotry not one specified to the --root=PATH directory. Check for $root set in the environment of run_sub_test_lib_test() and pass it along to the sharness invocation in the function.

grondo · 2022-03-13T20:22:24Z

Thanks! I fixed the typo and set MWP.

codecov · 2022-03-13T20:59:12Z

Codecov Report

Merging #4212 (d61f511) into master (5473965) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4212      +/-   ##
==========================================
- Coverage   83.45%   83.44%   -0.01%     
==========================================
  Files         380      380              
  Lines       63673    63673              
==========================================
- Hits        53139    53135       -4     
- Misses      10534    10538       +4

Impacted Files	Coverage Δ
src/bindings/python/flux/util.py	`94.02% <100.00%> (ø)`
src/broker/state_machine.c	`80.64% <0.00%> (-1.30%)`	⬇️
src/common/libsubprocess/server.c	`74.80% <0.00%> (-0.52%)`	⬇️
src/common/libterminus/terminus.c	`85.82% <0.00%> (-0.25%)`	⬇️
src/broker/overlay.c	`88.50% <0.00%> (-0.11%)`	⬇️
src/cmd/flux-module.c	`83.96% <0.00%> (ø)`
src/common/libsubprocess/subprocess.c	`89.00% <0.00%> (+0.30%)`	⬆️
src/modules/cron/cron.c	`82.92% <0.00%> (+0.44%)`	⬆️
src/modules/job-manager/drain.c	`74.35% <0.00%> (+2.56%)`	⬆️

garlick mentioned this pull request Mar 13, 2022

add flux-dump and flux-restore #4208

Merged

grondo changed the title ~~WIP: testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR~~ testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR Mar 13, 2022

garlick approved these changes Mar 13, 2022

View reviewed changes

grondo added 11 commits March 13, 2022 13:21

testsuite: unset FLUX_JOB_CC for all tests

489415f

Problem: Some tests make use of FLUX_JOB_CC, so we should not allow this variable to percolate down into sharness tests when they are running as a job in Flux. Unset FLUX_JOB_CC for all sharness tests in sharness.d/flux-sharness.sh.

testsuite: run inception tests in FLUX_JOB_TMPDIR

5c509ae

Problem: Running tests with --root=FLUX_JOB_TMPDIR often causes strange issues, but nothing in the ci testsuite runs in this way. Adjust t/test-incenption.sh to run with --root=$FLUX_JOB_TMPDIR.

grondo force-pushed the multi-tests branch from dc90413 to d61f511 Compare March 13, 2022 20:21

grondo added the merge-when-passing label Mar 13, 2022

mergify bot merged commit 02a130d into flux-framework:master Mar 13, 2022

grondo deleted the multi-tests branch September 12, 2022 17:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR #4212

testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR #4212

grondo commented Mar 12, 2022

grondo commented Mar 13, 2022

garlick left a comment

garlick Mar 13, 2022

grondo commented Mar 13, 2022

codecov bot commented Mar 13, 2022

		@@ -88,9 +88,11 @@ def encode_payload(payload):
		if payload is None or payload == ffi.NULL:

testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR #4212

testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR #4212

Conversation

grondo commented Mar 12, 2022

grondo commented Mar 13, 2022

garlick left a comment

Choose a reason for hiding this comment

garlick Mar 13, 2022

Choose a reason for hiding this comment

grondo commented Mar 13, 2022

codecov bot commented Mar 13, 2022

Codecov Report