testsuite: ensure tests can run concurrently with --root=$FLUX_JOB_TMPDIR #4212
I just pushed a few more fixes and now all tests appear to run successfully when run concurrently, at least with the
This all looks great! Very cool way to shake out the flakey tests!
I just noted one commit message typo.
@@ -88,9 +88,11 @@ def encode_payload(payload):
     if payload is None or payload == ffi.NULL:
commit message
s/non-UTF-9/non-UTF-8/
Problem: flux.util.encode_payload() fails with unicode errors if valid utf-8 characters are in the payload and a non-UTF-8 locale such as LANG=C is set (as is the case in the testsuite). Add errors=surrogateescape to the encode() methods used in encode_payload() to avoid these errors.
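A minimal sketch of the behavior this commit relies on, not the actual flux.util code: `to_bytes` is a hypothetical stand-in for the string branch of encode_payload(), and the surrogate-escaped string is built by hand to imitate text that was decoded under an ASCII-only locale. A strict encode() rejects such a string, while errors="surrogateescape" restores the original bytes.

```python
def to_bytes(payload, errors="strict"):
    """Hypothetical stand-in for the str branch of encode_payload()."""
    if isinstance(payload, str):
        return payload.encode("utf-8", errors=errors)
    return payload


# Valid UTF-8 bytes for a string containing a non-ASCII character.
raw = "\u0192uzzy".encode("utf-8")

# Imitate decoding under an ASCII locale: bytes >= 0x80 become lone surrogates.
smuggled = raw.decode("ascii", errors="surrogateescape")

# A strict encode refuses lone surrogates.
try:
    to_bytes(smuggled)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogateescape maps each lone surrogate back to its original byte.
restored = to_bytes(smuggled, errors="surrogateescape")
```

The point of the sketch is only that the error handler makes the encode step lossless for such strings; how encode_payload() handles other payload types is not shown here.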
Problem: A couple of Python scripts used in the testsuite fail with Unicode exceptions in sqlite3.connect() when they are run in a path with unicode characters, such as $FLUX_JOB_TMPDIR. Work around this problem by using <path>.encode("utf-8", errors="surrogateescape").decode() before passing <path> to sqlite3.connect().
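The workaround can be sketched roughly as follows. The helper name `normalize_path`, the directory name, and the database filename are all hypothetical, and the "mangled" path is constructed by hand to imitate a path that was decoded under an ASCII locale; the testsuite scripts themselves may differ.

```python
import os
import sqlite3
import tempfile


def normalize_path(path):
    """Turn surrogate-escaped bytes in `path` back into real UTF-8 text."""
    return path.encode("utf-8", errors="surrogateescape").decode()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical job directory whose name contains a non-ASCII character.
    jobdir = os.path.join(tmp, "\u0192job")
    os.mkdir(jobdir)
    dbpath = os.path.join(jobdir, "test.db")

    # Imitate how the path looks after being decoded under an ASCII locale.
    mangled = dbpath.encode("utf-8").decode("ascii", errors="surrogateescape")

    # Apply the workaround, then open the database with the repaired path.
    fixed = normalize_path(mangled)
    conn = sqlite3.connect(fixed)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.close()
    created = os.path.exists(dbpath)
```

Whether the unrepaired path actually raises in sqlite3.connect() depends on the Python version and locale, so the sketch only shows that the repaired path is a well-formed string that opens the intended file.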
Problem: The t0013-config-file.t test fails if more than one test is run on the same system, even from different working trees, since the test uses fixed ports and paths in config files. Generate ports and paths dynamically so they are unlikely to conflict with other tests on the same node.
Problem: Some tests make use of FLUX_JOB_CC, so we should not allow this variable to percolate down into sharness tests when they are running as a job in Flux. Unset FLUX_JOB_CC for all sharness tests in sharness.d/flux-sharness.sh.
Problem: The system personality tests already use the --root=DIR option to relaunch test instances in a subdirectory of the existing trashdir, but if --root is also passed on the command line of the test, then the option gets added twice to the test_under_flux() re-invocation of the test script. This confuses sharness and causes tests to fail. Only propagate the current value of $root as --root=$root when it isn't already set by the system personality.
Problem: Many python tests abort if passed the common sharness test options --debug, -d and --root=PATH since unittest doesn't support those arguments. Consume and ignore -d, --debug and --root in "subflux" module and pass the remaining arguments down to unittest. This allows python tests to run alongside sharness tests with things like `flux mini bulksubmit` when the --root or --debug options are needed, without having to jump through hoops trying to pass different options to different tests.
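As a rough illustration of the approach, the filter below drops the sharness-only options and returns everything else untouched. The function name `strip_sharness_args` is an assumption made for this sketch; the real "subflux" module may structure this differently.

```python
import sys
import unittest


def strip_sharness_args(argv):
    """Return argv without the sharness options -d, --debug and --root=PATH."""
    remaining = []
    for arg in argv:
        if arg in ("-d", "--debug"):
            continue  # debug flags mean nothing to unittest
        if arg.startswith("--root="):
            continue  # trash directory root is a sharness concept
        remaining.append(arg)
    return remaining


# Hypothetical use at the bottom of a Python test script:
#     unittest.main(argv=strip_sharness_args(sys.argv))
```

With this in place, a command line such as `test.py -d --root=/tmp/x -v` would reach unittest as `test.py -v`, so the same options can be passed to Python and sharness tests alike.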
Problem: Running tests with --root=$FLUX_JOB_TMPDIR often causes strange issues, but nothing in the CI testsuite runs in this way. Adjust t/test-inception.sh to run with --root=$FLUX_JOB_TMPDIR.
Problem: Until issue flux-framework#4047 (cancellation of job hierarchy doesn't fully clean up) is resolved, job processes may be left over after running the testsuite. Use of `sleep inf` causes these particular processes to be left over forever, until manually killed. Replace uses of `sleep inf` with `sleep 300` so that the processes eventually exit.
Problem: The --progress option is used in a couple places in the testsuite, but this can mangle a user's tty when they are also running the testsuite under bulksubmit with the --progress option. Since there is no good reason to use --progress inside of the testsuite, remove all uses that do not test the option itself.
Problem: The tests in t/lua/t*.t do not support the --root=PATH option like sharness, which can be used to ensure the same test run multiple times on the same node doesn't conflict with itself. Add support for the --root option in fluxometer.lua so that Lua tests can be run with the same command line as all other tests.
Problem: t0000-sharness.t fails when run multiple times per node because the sub-sharness may run in the original trash directory, not the one specified by --root=PATH. Check for $root set in the environment of run_sub_test_lib_test() and pass it along to the sharness invocation in the function.
Thanks! I fixed the typo and set MWP.
Codecov Report
@@            Coverage Diff             @@
##           master    #4212      +/-   ##
==========================================
- Coverage   83.45%   83.44%   -0.01%
==========================================
  Files         380      380
  Lines       63673    63673
==========================================
- Hits        53139    53135       -4
- Misses      10534    10538       +4
A good stress test for the testsuite is to run multiple copies of every test across a flux instance via
This command submits 4 copies of every test in the testsuite as a job to the current instance, watching progress and output as the tests progress.
However, when attempting this on fluke, I ran into several issues, including:
- With `LANG=C` set and `FLUX_JOB_TMPDIR` containing the jobid in f58 encoding, a few test scripts and one core flux Python module broke due to Unicode errors.
- `FLUX_JOB_CC` was set in the current environment, which affects tests that make use of that variable.
- Many Python tests do not accept `-d, --debug` or `--root=PATH` and instead abort with unknown option errors.

This PR fixes all the little issues above.
I was then able to grab a large allocation on fluke and run the above command (even using `--cc=1-10` to run 10 copies of every test) successfully.

There are still a couple of issues I hope to run down:
- `t0000-sharness.t` intermittently fails in this mode
- an unresolved problem involving the `--progress` meter

Therefore this is still a WIP