Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR #4212

Merged
merged 11 commits into from
Mar 13, 2022

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Mar 12, 2022

A good stress test for the testsuite is to run multiple copies of every test across a flux instance via

$ flux mini bulksubmit --quiet --shuffle --watch --progress -o pty --cc=1-4  --job-name={./%} \
   sh -c './{} -d -v --root=$FLUX_JOB_TMPDIR' ::: t*.t python/t*.py lua/t*.t

This command submits 4 copies of every test in the testsuite as a job to the current instance, watching progress and output as the tests progress.

However, when attempting this on fluke, I ran into several issues, including:

  • Because the testsuite runs with LANG=C and FLUX_JOB_TMPDIR contains the jobid in f58 encoding, a few test scripts and one core flux Python module broke due to Unicode errors.
  • at least one test broke when FLUX_JOB_CC was set in the current environment
  • a couple tests used static files or ports outside of the trash directory, so could not tolerate being run concurrently
  • the python tests don't accept or ignore options like -d, --debug or --root=PATH and instead abort with unknown option errors

This PR fixes all the little issues above.

I was then able to grab a large allocation on fluke and run the above command (even using --cc=1-10 to run 10 copies of every test) successfully.

PD:0    R:0    CD:1699 F:1    │██████████████████████████████████│100.0% 0:01:48
$ flux jobs -f failed -o {name}
NAME
t0000-sharness

There are still a couple issues I hope to run down:

  • t0000-sharness.t intermittently fails in this mode
  • Some Lua tests need to get the ability to run in a unique directory
  • Some tests emit special pty characters when running, which messes up the top-level --progress meter

Therefore this is still a WIP

@grondo grondo changed the title WIP: testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR Mar 13, 2022
@grondo
Copy link
Contributor Author

grondo commented Mar 13, 2022

I just pushed a few more fixes and now all tests appear to run successfully when concurrently, at least with the flux mini bulksubmit command above and --cc=1-12 in a large instance on fluke.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great! Very cool way to shake out the flakey tests!

I just noted one commit message typo.

@@ -88,9 +88,11 @@ def encode_payload(payload):
if payload is None or payload == ffi.NULL:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit message
s/non-UTF-9/non-UTF-8/

Problem: flux.util.encode_payload() fails with unicode errors if
valid utf-8 characters are in the payload and a non-UTF-8 locale
such as LANG=C is set (as is in the case in the testsuite).

Add errors=surrogateescape to the encode() methods used in
encode_payload() to avoid these errors.
Problem: A couple Python scripts used in the testsuite fail with
Unicode exceptions in sqlite3.connect() when they are run in a path
with unicode characters, such as $FLUX_JOB_TMPDIR.

Work around this problem by using

  <path>.encode("utf-8", errors="surrogateescape").decode()

before passing <path> to sqlite3.open().
Problem: The t0013-config-file.t test fails if more than one test is
run on the same system, even from different working trees, since the
test uses fixed ports and paths in config files.

Generate ports and paths dynamically so they are unlikely to conflict
with other tests on the same node.
Problem: Some tests make use of FLUX_JOB_CC, so we should not allow
this variable to percolate down into sharness tests when they are running
as a job in Flux.

Unset FLUX_JOB_CC for all sharness tests in sharness.d/flux-sharness.sh.
Problem: The system personality tests already use the --root=DIR option
to relaunch test instances in a subdirectory of the existing trashdir,
but if --root is also passed on the command line of the test, then
the option gets added twice to the test_under_flux() re-invocation
of the test script. This confuses sharness and causes tests to fail.

Only propagate the current value of $root as --root=$root when it
isn't already set by the system personality.
Problem: Many python tests abort if passed the common sharness test
options --debug, -d and --root=PATH since unittest doesn't support
those arguments.

Consume and ignore -d, --debug and --root in "subflux" module and
pass the remaining arguments down to unittest.

This allows python tests to run alongside sharness tests with things
like `flux mini bulksubmit` when the --root or --debug options are
needed, without having to jump through hoops trying to pass different
options to different tests.
Problem: Running tests with --root=FLUX_JOB_TMPDIR often causes strange
issues, but nothing in the ci testsuite runs in this way.

Adjust t/test-incenption.sh to run with --root=$FLUX_JOB_TMPDIR.
Problem: Until issue flux-framework#4047 (cancallation of job hieararchy doesn't
fully clean up) is resolved, job processes may be left over after
running the testsuite. Use of `sleep inf` causes these particular
processes to be left over forever, until manually killed.

Replace uses of `sleep inf` with `sleep 300` so that the proceses
eventually exit.
Problem: The --progress option is used in a couple places in the
testsuite, but this can mangle a user's tty when they are also running
the testsuite under bulksubmit with the --progress option.

Since there is no good reason to use --progress inside of the
testsuite, remove all uses that do not test the option itself.
Problem: The tests in t/lua/t*.t do not support the --root=PATH option
like sharness, which can be used to ensure the same test run multiple
times on the same node doesn't conflict with itself.

Add support for the --root option in fluxometer.lua so that Lua tests
can be run with the same command line as all other tests.
Problem: t0000-sharness.t fails when run multiple times per node
because the sub-sharness may run in an original trash direcotry
not one specified to the --root=PATH directory.

Check for $root set in the environment of run_sub_test_lib_test()
and pass it along to the sharness invocation in the function.
@grondo
Copy link
Contributor Author

grondo commented Mar 13, 2022

Thanks! I fixed the typo and set MWP.

@codecov
Copy link

codecov bot commented Mar 13, 2022

Codecov Report

Merging #4212 (d61f511) into master (5473965) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4212      +/-   ##
==========================================
- Coverage   83.45%   83.44%   -0.01%     
==========================================
  Files         380      380              
  Lines       63673    63673              
==========================================
- Hits        53139    53135       -4     
- Misses      10534    10538       +4     
Impacted Files Coverage Δ
src/bindings/python/flux/util.py 94.02% <100.00%> (ø)
src/broker/state_machine.c 80.64% <0.00%> (-1.30%) ⬇️
src/common/libsubprocess/server.c 74.80% <0.00%> (-0.52%) ⬇️
src/common/libterminus/terminus.c 85.82% <0.00%> (-0.25%) ⬇️
src/broker/overlay.c 88.50% <0.00%> (-0.11%) ⬇️
src/cmd/flux-module.c 83.96% <0.00%> (ø)
src/common/libsubprocess/subprocess.c 89.00% <0.00%> (+0.30%) ⬆️
src/modules/cron/cron.c 82.92% <0.00%> (+0.44%) ⬆️
src/modules/job-manager/drain.c 74.35% <0.00%> (+2.56%) ⬆️

@mergify mergify bot merged commit 02a130d into flux-framework:master Mar 13, 2022
@grondo grondo deleted the multi-tests branch September 12, 2022 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants