Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testsuite: improve reliability of some sharness tests #2058

merged 6 commits into from Mar 1, 2019


Copy link

commented Feb 28, 2019

I got tired of some common test failures, so I worked through some of the more unreliable tests. My test case was to run make -j 24 check on a quad core node. These fixes are fairly minor and mostly related to extending timeouts and grace period.

I think I also finally worked around a race in the t0015-cron.t "min-interval" test. I punted on trying to catch the state before the interval expired and instead just check for an expected log message if a cron task was delayed due to min-interval. (There still might be a rare race here if the two events generated are delivered after the interval, but this change makes the test much more reliable)

Closes #1687.

grondo added 6 commits Feb 28, 2019
Problem: there is an inherent race in the cron "min-interval" test,
which ensures that a cron "event" entry will not run more than
once per specified minimum interval. In the test, there is no way to
guarantee that the check before the sleep runs before the interval has
expired, and thus on busy or slow systems the test may fail because
the stats.count is >1 instead of 1.

Since there is no way to guarantee that our test command runs within
the minimum interval specified, dump the idea of testing this way,
and instead ensure the cron job was delayed by checking for a log
message from the cron service.
For reasons shrouded in mystery, the backgroupd flux-kvs command
in the job_wait_event() helper function was killed with signal 11
(segv) instead of 2 (sigint).

Fix the argument to kill(1) to avoid dropping corefiles each time
this test is run.
On slow or heavily loaded machines, some of the flux-aggregate cmds
may take more than the 2s allowed by run_timeout. Since the timeout
won't occur in the normal (non failing) case, there is no harm in
increasing the timeout to 5s.
The wreck rc "personality" scripts were missed during the big
wreck cleanout of '19. Remove these last remnants.
The t0001-basic.t 'test_under_flux' test is unreliable on slow or
loaded machines. It might be that the 5s timeout is not quite long
enough to get the underlying flux instance being spawned by
test_under_flux instantiated and terminated.

Double the timeout to 10s in hopes the test becomes more reliable.
For heavily loaded or slow machines, the bootstrap and shutdown of the
flux instances used in the t2010-kvs-snapshot-restore test may take
longer than even 3s. There is no downside to increasing the grace
timeout even further, since under normal circumstances the instance
will shutdown normally, so increase the grace time to 15s to handle
pathological cases.

This comment has been minimized.

Copy link

commented Feb 28, 2019

Codecov Report

Merging #2058 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2058      +/-   ##
- Coverage   80.43%   80.42%   -0.01%     
  Files         191      191              
  Lines       30252    30252              
- Hits        24333    24331       -2     
- Misses       5919     5921       +2
Impacted Files Coverage Δ
src/modules/connector-local/local.c 73.62% <0%> (-1.04%) ⬇️
src/common/libflux/message.c 81.15% <0%> (-0.25%) ⬇️
src/cmd/flux-module.c 83.72% <0%> (-0.24%) ⬇️
src/broker/broker.c 77.92% <0%> (+0.45%) ⬆️
src/common/libflux/reactor.c 92.08% <0%> (+0.9%) ⬆️

This comment has been minimized.

Copy link

commented Mar 1, 2019

These all look like great improvements to me. Even the one that is "shrouded in mystery".

@garlick garlick merged commit af26b4f into flux-framework:master Mar 1, 2019
4 checks passed
4 checks passed
Mergify — Summary 1 potential rule
codecov/patch Coverage not affected when comparing 52cb275...df04f75
codecov/project 80.42% (-0.01%) compared to 52cb275
continuous-integration/travis-ci/pr The Travis CI build passed
@grondo grondo deleted the grondo:tests-reliability branch Mar 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
3 participants
You can’t perform that action at this time.