Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testsuite: improve test reliability and prepare for asan testing #1733

Merged
merged 9 commits into from Oct 15, 2018

Conversation

Projects
None yet
5 participants
@trws
Copy link
Member

trws commented Oct 12, 2018

This is a collection of changes to reduce the non-determinism in the tests and to prepare for using the address sanitizer in an automated way. Mainly meant to address #1687 and the various issues related to it. As predicted, turning on asan produces a non-trivial number of errors, I've filed issues for all of the ones that crop up before the sharness tests start, hopefully squashing some of those will help with the sharness ones.

@trws trws force-pushed the trws:test-determinism branch from df79d38 to b26c17f Oct 12, 2018

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Oct 12, 2018

To hopefully save you some time. The bionic errors are:

FAIL: t0001-basic.t 24 - test_under_flux works
FAIL: t0001-basic.t 25 - test_under_flux fails if loaded modules are not unloaded
             t0001-basic.t:  FAIL: N=57  PASS=55  FAIL=2 SKIP=0 XPASS=0 XFAIL=0
ERROR: t0001-basic.t - exited with status 1

The centos 7 errors are:

make: unrecognized option '--output-sync=line'

Looks like --output-sync requires make v4.0+ and centos7 ships with make v3.82:

$ make --version
GNU Make 3.82
Built for x86_64-redhat-linux-gnu
Copyright (C) 2010  Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Oct 12, 2018

Stupid question inbound: I get (or at least I think I get) why asan doesn't work with RTLD_DEEPBIND (the deepbind prevents asan from intercepting free and malloc in shared libraries that have their own definitions). The thing I'm uncertain about is, without RTLD_DEEPBIND (like with musl or on BSD), can we make any guarantees that arbitrary flux modules will work correctly? Assuming the module defines a function with the same name as a function in the global scope of the broker, incorrect/unintended behavior will occur, right?

@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 12, 2018

Thanks for that, should be a couple extra_dist lines and using serial make with distcheck on centos (ugh...).

As to deep bind, it only matters for globally bound symbols resolved at runtime. Since all modules are loaded local, the only symbols that will cause issues are the ones loaded globally by core, so if a module uses a jansson symbol to mean something else, or flux create, then it is a problem, but two modules using the same symbol not globally exposed by core will be just fine. Deepbind is also a recent glibc specific addition, as in it didn't work reliably 2 years ago, so it's best to offer it where possible but not count on it if we can.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Oct 12, 2018

Deepbind is also a recent glibc specific addition, as in it didn't work reliably 2 years ago, so it's best to offer it where possible but not count on it if we can.

Agreed. DEEPBIND was introduced to get around json-c/jansson symbol collision -- now that we have removed json-c we could probably get away without RTLD_DEEPBIND (as most projects do), though providing it on our main platform still may be a good idea...

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Oct 12, 2018

should be a couple extra_dist lines and using serial make with distcheck on centos (ugh...).

Maybe only enable output-sync optionally (could even check make version first)?

@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 12, 2018

Ok, I think I have this basically straightened out, but travis is choking on almost all of the builders on the version string. I'm getting versions like "flux-core-0.10.0-244-gb26c17f" out of distcheck, and the test expects that it must be flux-core-\d+\.\d+\.\d+, should I fix the test to allow for the extended version string or is this a problem in version generation?

@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 12, 2018

On the theory that the generation is correct, I'm amending the test to check for this, and switching the grep to using extended syntax.

flux-core-[0-9]+\.[0-9]+\.[0-9]+(-[0-9]+(-[a-z0-9]+))?
@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 12, 2018

With that, they all passed but the caliper build, which hung in either kvs-internals or the sighandler lua test, because of the way it died, I have no idea which.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Oct 12, 2018

Codecov Report

Merging #1733 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #1733      +/-   ##
==========================================
+ Coverage   79.31%   79.32%   +<.01%     
==========================================
  Files         184      184              
  Lines       34552    34552              
==========================================
+ Hits        27405    27407       +2     
+ Misses       7147     7145       -2
Impacted Files Coverage Δ
src/common/libflux/module.c 77.18% <100%> (ø) ⬆️
src/common/libflux/handle.c 83.68% <100%> (ø) ⬆️
src/broker/module.c 83.78% <100%> (ø) ⬆️
src/cmd/flux-module.c 85.28% <0%> (-0.31%) ⬇️
src/common/libflux/message.c 81.06% <0%> (+0.11%) ⬆️
src/bindings/lua/lua-hostlist/hostlist.c 59% <0%> (+0.22%) ⬆️
@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 12, 2018

All green lights.

I'm going to add one more thing here, it seems the write error bug may crop up elsewhere and is related to stdout and stderr being set non-blocking by travis for some reason, see here.

@trws trws force-pushed the trws:test-determinism branch from d7a3e0b to 15d95ff Oct 12, 2018

@trws trws requested review from grondo and garlick Oct 12, 2018

@grondo

grondo approved these changes Oct 12, 2018

Copy link
Contributor

grondo left a comment

Looks good to me, Thanks!

Just one comment to perhaps make the DEEPBIND define name more obviously coming from our configure.

Nice find on the non-blocking stdout...

Edit: one other thing. I hate to make you change your commit messages, but we've kind of been using a system of ad-hoc section: or subject: prefixes to our commit messages, which makes it nice to search for kvs: commits for example (I know there are other ways to do that)

It would be nice if these commits were tagged with travis-ci: docker: or test:/testsuite: as appropriate. (Sorry for the extra work!)

@@ -59,6 +59,15 @@ AS_CASE($ax_cv_cxx_compiler_vendor,
AX_CXX_COMPILE_STDCXX([11], [noext], [mandatory])

X_AC_ENABLE_SANITIZER
if test "x$san_enabled" != "xno" ; then
AC_DEFINE([DEEPBIND], [0],
[deepbind is unsupported with asan, musl and so-forth])

This comment has been minimized.

Copy link
@grondo

grondo Oct 12, 2018

Contributor

I might suggest a different name for the DEEPBIND config.h macro. Perhaps FLUX_DEEPBIND to give a hint in the code that we're using a locally provided define.

This comment has been minimized.

Copy link
@trws

trws Oct 14, 2018

Author Member

Good point @grondo. It also made me notice I had forgotten to add the check for whether RTLD_DEEPBIND even exists, so it's renamed and the check has been added as well.

@trws trws force-pushed the trws:test-determinism branch from 7a44468 to d7597a9 Oct 14, 2018

@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 14, 2018

Rebased.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Oct 14, 2018

Nice, thanks @trws! I'd say please fix up the commit messages per @grondo's request and this can go in. Also since we're deriving release notes from merge commits (roughly), you might rename this PR to something like: "testsuite: improve test reliability and prepare for asan testing"

trws added some commits Oct 11, 2018

travis-ci: add output sync to test make
Output synchronization on make should prevent the race condition in
parallel builds where a file descriptor in use may be closed by a parent
process in an unexpected way.  It should also prevent in-line
overlapping output, which may help with grepping output.
test: add @garlick's fix, touch filesystem less
This changes almost nothing functionally, but makes the two
test-under-flux tests run faster when building on slow filesystems,
reducing the chance of timeout failures.
test: reduce delay in unreliable cron event test
I can't be sure this will always solve it, but running eight less execs
can't hurt, and using greater-than on the end should help with longer
than intended sleep durations.
test: avoid RTLD_DEEPBIND when asan is turned on
Apparently RTLD_DEEPBIND prevents asan from working correctly, since
this flag also isn't supported on non-glibc platforms having a start on
a way to swap it out seems reasonable.
test: check for output sync support, add to centos
The devtoolset-7-make package has been added to our centos-7 image so we
can count on output sync for travis, and the travis_run script checks
for a recent make, if it's not available it tries to load the SCL for
the devtoolset, if that fails it just goes ahead without sync.
test: add extracted test sources to extra_dist
The extracted test scripts and data have been added to extra dist to
placate distcheck.

@trws trws force-pushed the trws:test-determinism branch from d7597a9 to 4646ef4 Oct 15, 2018

@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 15, 2018

Rebased, and commits renamed, I'll rename the PR in a moment.

@trws trws changed the title Test determinism testsuite: improve test reliability and prepare for asan testing Oct 15, 2018

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Oct 15, 2018

Nice! Thanks!

@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 15, 2018

Just had to restart another one because of the valgrind hang. Really not sure why that keeps happening with the run_timeout on there. Anyone have an idea?

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Oct 15, 2018

OK I think this can go in.

@garlick garlick merged commit 74fa7c4 into flux-framework:master Oct 15, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Oct 15, 2018

Just had to restart another one because of the valgrind hang. Really not sure why that keeps happening with the run_timeout on there. Anyone have an idea?

Yeah, I'm mystified. I just tried to set the timeout way down and it did seem to kill the session (at first I was worried valgrind was just ignoring the signal...)

Maybe what is hanging isn't what we think is hanging?

@trws

This comment has been minimized.

Copy link
Member Author

trws commented Oct 15, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.