Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimental: add --wrap=arg0,arg1,... option to flux-start #1542

Merged
merged 7 commits into from Jun 1, 2018

Conversation

Projects
None yet
4 participants
@grondo
Copy link
Contributor

grondo commented May 31, 2018

This is an experimental PR that adds a --wrap option to flux-start that wraps flux-broker execution in arbitrary arguments. Posting here for any comments. If we move forward with this, I'd just need to add some tests and cleanup a bit.

The main use case for --wrap is to run valgrind on flux brokers in a test session without all the --trace-children=yes --trace-children-skip=pattern shenanigans.

e.g.

t$ src/cmd/flux start -s4 -vX --wrap=libtool,--mode=execute,valgrind,--leak-check=full,--suppressions=t/valgrind/valgrind.supp flux getattr rank
flux-start: warning: setting --bootstrap=selfpmi due to --size option
flux-start: 0: libtool --mode=execute valgrind --leak-check=full --suppressions=t/valgrind/valgrind.supp /g/g0/grondo/git/flux-core.git/src/broker/flux-broker --setattr=broker.rundir=/var/tmp/grondo/flux-190781-T5W4Wy/0 --setattr=tbon.endpoint=ipc://%B/req flux getattr rank
flux-start: 1: libtool --mode=execute valgrind --leak-check=full --suppressions=t/valgrind/valgrind.supp /g/g0/grondo/git/flux-core.git/src/broker/flux-broker --setattr=broker.rundir=/var/tmp/grondo/flux-190781-T5W4Wy/1 --setattr=tbon.endpoint=ipc://%B/req
flux-start: 2: libtool --mode=execute valgrind --leak-check=full --suppressions=t/valgrind/valgrind.supp /g/g0/grondo/git/flux-core.git/src/broker/flux-broker --setattr=broker.rundir=/var/tmp/grondo/flux-190781-T5W4Wy/2 --setattr=tbon.endpoint=ipc://%B/req
flux-start: 3: libtool --mode=execute valgrind --leak-check=full --suppressions=t/valgrind/valgrind.supp /g/g0/grondo/git/flux-core.git/src/broker/flux-broker --setattr=broker.rundir=/var/tmp/grondo/flux-190781-T5W4Wy/3 --setattr=tbon.endpoint=ipc://%B/req

In addition, this PR adds an enhancement to sharness test_under_flux to run the session under valgrind if FLUX_TEST_VALGRIND is set. Therefore, an entire sharness test can be executed under valgrind by running

$ FLUX_TEST_VALGRIND=t ./t2100-aggregate.t
flux-start: 0: libtool e valgrind --leak-check=full --trace-children=no --child-silent-after-fork=yes --leak-resolution=med --error-exitcode=1 --suppressions=/g/g0/grondo/git/flux-core.git/t/valgrind/valgrind.supp /g/g0/grondo/git/flux-core.git/src/broker/flux-broker --setattr=broker.rundir=/var/tmp/grondo/flux-190811-uHVOPd/0 --setattr=tbon.endpoint=ipc://%B/req -Slog-filename=t2100-aggregate.broker.log -Slog-forward-level=7 -Sinit.rc2_timeout=300 sh ./t2100-aggregate.t 
[snp]
==190883== Command: /g/g0/grondo/git/flux-core.git/src/broker/.libs/lt-flux-broker --setattr=broker.rundir=/var/tmp/grondo/flux-190811-uHVOPd/7 --setattr=tbon.endpoint=ipc://%B/req -Slog-filename=t2100-aggregate.broker.log -Slog-forward-level=7 -Sinit.rc2_timeout=300
==190883== 
ok 1 - have aggregator module
ok 2 - flux-aggreagate: works
ok 3 - flux-aggreagate: works for floating-point numbers
ok 4 - flux-aggreagate: works for strings
ok 5 - flux-aggreagate: works for arrays
ok 6 - flux-aggreagate: works for objects
2018-05-31T14:37:12.792982Z aggregator.err[0]: sink: refusing to sink to rootdir
2018-05-31T14:37:12.804185Z aggregator.err[0]: sink: refusing to sink to rootdir
2018-05-31T14:37:12.804649Z aggregator.err[0]: sink: aborting aggregate .
ok 7 - flux-aggregate: abort works
ok 8 - flux-aggregate: different value per rank
ok 9 - flux-aggregate: different fp value per rank
ok 10 - flux-aggregate: --timeout=0. - immediate forward
ok 11 - flux-aggregate: --fwd-count works
# passed all 11 test(s)
1..11
==190865== 
==190865== HEAP SUMMARY:
==190865==     in use at exit: 19,755 bytes in 53 blocks
==190865==   total heap usage: 203,266 allocs, 203,213 frees, 37,181,484 bytes allocated
==190865== 
==190876== 
==190876== HEAP SUMMARY:
==190876==     in use at exit: 16,567 bytes in 43 blocks
==190876==   total heap usage: 44,456 allocs, 44,413 frees, 21,059,084 bytes allocated
==190876== 
==190883== 
==190883== HEAP SUMMARY:
==190883==     in use at exit: 16,567 bytes in 43 blocks
==190883==   total heap usage: 45,011 allocs, 44,968 frees, 21,083,525 bytes allocated
==190883== 
==190865== 24 bytes in 1 blocks are definitely lost in loss record 1 of 13
==190865==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==190865==    by 0x52BF77D: json_integer (in /usr/lib64/libjansson.so.4.10.0)
==190865==    by 0x52BC370: ??? (in /usr/lib64/libjansson.so.4.10.0)
==190865==    by 0x52BC169: ??? (in /usr/lib64/libjansson.so.4.10.0)
==190865==    by 0x52BC169: ??? (in /usr/lib64/libjansson.so.4.10.0)
==190865==    by 0x52BC4F5: ??? (in /usr/lib64/libjansson.so.4.10.0)
==190865==    by 0x52BC692: json_loads (in /usr/lib64/libjansson.so.4.10.0)
==190865==    by 0x4E55A4C: flux_msg_vunpack (message.c:1169)
==190865==    by 0x4E55B26: flux_msg_unpack (message.c:1190)
==190865==    by 0xC22FF96: push_cb (aggregator.c:604)
==190865==    by 0x4E51639: call_handler (msg_handler.c:302)
==190865==    by 0x4E51785: dispatch_message (msg_handler.c:326)
==190865==    by 0x4E51785: handle_cb (msg_handler.c:392)
==190865== 
==190865== LEAK SUMMARY:
==190865==    definitely lost: 24 bytes in 1 blocks
==190865==    indirectly lost: 0 bytes in 0 blocks
==190865==      possibly lost: 0 bytes in 0 blocks
==190865==    still reachable: 19,731 bytes in 52 blocks
==190865==         suppressed: 0 bytes in 0 blocks
==190865== Reachable blocks (those to which a pointer was found) are not shown.
==190865== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190865== 
==190865== For counts of detected and suppressed errors, rerun with: -v
==190865== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==190868== 
==190868== HEAP SUMMARY:
==190868==     in use at exit: 16,567 bytes in 43 blocks
==190868==   total heap usage: 63,130 allocs, 63,087 frees, 23,235,826 bytes allocated
==190868== 
==190872== 
==190872== HEAP SUMMARY:
==190872==     in use at exit: 16,567 bytes in 43 blocks
==190872==   total heap usage: 44,452 allocs, 44,409 frees, 21,058,761 bytes allocated
==190872== 
==190876== LEAK SUMMARY:
==190876==    definitely lost: 0 bytes in 0 blocks
==190876==    indirectly lost: 0 bytes in 0 blocks
==190876==      possibly lost: 0 bytes in 0 blocks
==190876==    still reachable: 16,567 bytes in 43 blocks
==190876==         suppressed: 0 bytes in 0 blocks
==190876== Reachable blocks (those to which a pointer was found) are not shown.
==190876== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190876== 
==190876== For counts of detected and suppressed errors, rerun with: -v
==190876== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==190867== 
==190867== HEAP SUMMARY:
==190867==     in use at exit: 16,567 bytes in 43 blocks
==190867==   total heap usage: 77,112 allocs, 77,069 frees, 25,116,820 bytes allocated
==190867== 
==190868== LEAK SUMMARY:
==190868==    definitely lost: 0 bytes in 0 blocks
==190868==    indirectly lost: 0 bytes in 0 blocks
==190868==      possibly lost: 0 bytes in 0 blocks
==190868==    still reachable: 16,567 bytes in 43 blocks
==190868==         suppressed: 0 bytes in 0 blocks
==190868== Reachable blocks (those to which a pointer was found) are not shown.
==190868== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190868== 
==190868== For counts of detected and suppressed errors, rerun with: -v
==190868== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==190883== LEAK SUMMARY:
==190883==    definitely lost: 0 bytes in 0 blocks
==190883==    indirectly lost: 0 bytes in 0 blocks
==190883==      possibly lost: 0 bytes in 0 blocks
==190883==    still reachable: 16,567 bytes in 43 blocks
==190883==         suppressed: 0 bytes in 0 blocks
==190883== Reachable blocks (those to which a pointer was found) are not shown.
==190883== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190883== 
==190883== For counts of detected and suppressed errors, rerun with: -v
==190883== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==190872== LEAK SUMMARY:
==190872==    definitely lost: 0 bytes in 0 blocks
==190872==    indirectly lost: 0 bytes in 0 blocks
==190872==      possibly lost: 0 bytes in 0 blocks
==190872==    still reachable: 16,567 bytes in 43 blocks
==190872==         suppressed: 0 bytes in 0 blocks
==190872== Reachable blocks (those to which a pointer was found) are not shown.
==190872== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190872== 
==190872== For counts of detected and suppressed errors, rerun with: -v
==190872== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==190866== 
==190866== HEAP SUMMARY:
==190866==     in use at exit: 16,567 bytes in 43 blocks
==190866==   total heap usage: 93,147 allocs, 93,104 frees, 26,961,555 bytes allocated
==190866== 
==190867== LEAK SUMMARY:
==190867==    definitely lost: 0 bytes in 0 blocks
==190867==    indirectly lost: 0 bytes in 0 blocks
==190867==      possibly lost: 0 bytes in 0 blocks
==190867==    still reachable: 16,567 bytes in 43 blocks
==190867==         suppressed: 0 bytes in 0 blocks
==190867== Reachable blocks (those to which a pointer was found) are not shown.
==190867== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190867== 
==190867== For counts of detected and suppressed errors, rerun with: -v
==190867== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==190866== LEAK SUMMARY:
==190866==    definitely lost: 0 bytes in 0 blocks
==190866==    indirectly lost: 0 bytes in 0 blocks
==190866==      possibly lost: 0 bytes in 0 blocks
==190866==    still reachable: 16,567 bytes in 43 blocks
==190866==         suppressed: 0 bytes in 0 blocks
==190866== Reachable blocks (those to which a pointer was found) are not shown.
==190866== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190866== 
==190866== For counts of detected and suppressed errors, rerun with: -v
==190866== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
flux-start: 0 (pid 190865) exited with rc=1
==190870== 
==190870== HEAP SUMMARY:
==190870==     in use at exit: 16,567 bytes in 43 blocks
==190870==   total heap usage: 44,439 allocs, 44,396 frees, 21,059,836 bytes allocated
==190870== 
==190870== LEAK SUMMARY:
==190870==    definitely lost: 0 bytes in 0 blocks
==190870==    indirectly lost: 0 bytes in 0 blocks
==190870==      possibly lost: 0 bytes in 0 blocks
==190870==    still reachable: 16,567 bytes in 43 blocks
==190870==         suppressed: 0 bytes in 0 blocks
==190870== Reachable blocks (those to which a pointer was found) are not shown.
==190870== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==190870== 
==190870== For counts of detected and suppressed errors, rerun with: -v
==190870== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

grondo added some commits May 31, 2018

flux-start: use add_args_list in exec_broker
Problem: Code to push --broker-opts option list onto the broker cmdline
in exec_broker() duplicates add_args_list() as used in client_create().

Delete the duplicated code and use the common function instead.
flux-start: do not pass broker_path to execvp_argz
The broker_path passed to execvp_argz() is unnecessary, since this
same path has already been pushed onto the head of the argz vector.

Remove the unnecessary argument and pass av[0] as the first arg to
execvp().
flux-start: add --wrap option
Add a --wrap=arg0,arg1,.. option to flux-start to allow wrapping
broker processes with another command (e.g. valgrind)
@garlick

This comment has been minimized.

Copy link
Member

garlick commented May 31, 2018

Oh wow, that is super useful! I definitely vote we go ahead with this approach.

sharness: allow valgrind to be used with test_under_flux
If FLUX_TEST_VALGRIND is set in test_under_flux(), set up flux-start's
--wrap option so that it runs all brokers under valgrind, pulling in
suppressions from the valgrind test.

@grondo grondo force-pushed the grondo:flux-start--wrap branch from 4661f1b to 20a60bf May 31, 2018

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented May 31, 2018

Codecov Report

Merging #1542 into master will increase coverage by 0.05%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master    #1542      +/-   ##
==========================================
+ Coverage   79.03%   79.08%   +0.05%     
==========================================
  Files         169      169              
  Lines       30983    30983              
==========================================
+ Hits        24486    24502      +16     
+ Misses       6497     6481      -16
Impacted Files Coverage Δ
src/cmd/flux-start.c 81% <83.33%> (+6.81%) ⬆️
src/broker/modservice.c 79.61% <0%> (-1.95%) ⬇️
src/common/libflux/request.c 88.46% <0%> (-1.29%) ⬇️
src/common/libflux/response.c 82.89% <0%> (-0.66%) ⬇️
src/common/libflux/future.c 88.98% <0%> (-0.45%) ⬇️
src/modules/connector-local/local.c 74.18% <0%> (-0.21%) ⬇️
src/bindings/lua/flux-lua.c 82.15% <0%> (-0.09%) ⬇️
src/common/libflux/handle.c 83.91% <0%> (+0.24%) ⬆️
src/common/libflux/message.c 81.26% <0%> (+0.35%) ⬆️
@coveralls

This comment has been minimized.

Copy link

coveralls commented May 31, 2018

Coverage Status

Coverage increased (+0.05%) to 79.371% when pulling 9d04c89 on grondo:flux-start--wrap into 48670b5 on flux-framework:master.

@grondo grondo force-pushed the grondo:flux-start--wrap branch from e0d3dd4 to 1fb26f5 May 31, 2018

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jun 1, 2018

Looks like the two failures in travis are due to a spellcheck error on flux-start.adoc but I can't tell which word. Odd that those two builders are failing but not the others...

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jun 1, 2018

Maybe they didn't rerun. I fixed the spellcheck error by adding valgrind to the dictionary.

Last night I was seeing a failure from one of the new tests but just in one builder. Have to investigate further because there was no indication why the test failed.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jun 1, 2018

That one builder that just failed was a strange error. Looks like t0001-basic.t exited right after this test passed:

test_expect_success 'flux-start init.rc2_timeout attribute works' "
        test_expect_code 143 flux start ${ARGS} -o,-Sinit.rc2_timeout=0.1 sleep 5
"

the next test is

test_expect_success 'test_under_flux works' '
        echo >&2 "$(pwd)" &&
        mkdir -p test-under-flux && (
                cd test-under-flux &&
                cat >.test.t <<-EOF &&
                #!/bin/sh
                pwd
                test_description="test_under_flux (in sub sharness)"
                . "\$SHARNESS_TEST_SRCDIR"/sharness.sh
                test_under_flux 2
                test_expect_success "flux comms info" "
                        flux comms info
                "
                test_done
                EOF
        chmod +x .test.t &&
        SHARNESS_TEST_DIRECTORY=`pwd` &&
        export SHARNESS_TEST_SRCDIR SHARNESS_TEST_DIRECTORY FLUX_BUILD_DIR debug &&
        run_timeout 5 ./.test.t --verbose --debug >out 2>err
        ) &&
        grep "size=2" test-under-flux/out
'

grondo added some commits May 31, 2018

t0001-basic.t: add tests for flux-start --wrap
Add test for flux-start --wrap option to the t0001-basic tests.
doc: document --wrap in flux-start(1)
Add a short entry for flux-start's new --wrap option.

@grondo grondo force-pushed the grondo:flux-start--wrap branch from 1fb26f5 to 9d04c89 Jun 1, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jun 1, 2018

the next test is

Actually I had added two tests right after the last successful test.

The problem was that my new test failed --chain-lint (which is only enabled in that one builder). I have no idea why no error messages went to the travis logs about failing chain-lint test. Should have looked like:

error: bug in the test script: broken &&-chain: 
        broker_path=$(flux start -vX 2>&1 | sed "s/^flux-start: *//g") &&
        echo broker_path=${broker_path}
        test -n "${broker_path}" &&
        flux start --wrap=/bin/echo,start: arg0 arg1 arg2 > wrap.output &&
        test_debug "cat wrap.output" &&
        cat >wrap.expected <<-EOF &&
        start: ${broker_path} arg0 arg1 arg2
        EOF
        test_cmp wrap.expected wrap.output

I'm guessing that tap-driver.sh is eating this error 😠

Anyhow, I fixed the test and resubmitted so we're hopefully good now -- after restarting spurious travis failures N times...

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jun 1, 2018

I'm guessing that tap-driver.sh is eating this error 

It appears the error goes into t/test-suite.log, which we should probably be displaying in travis' after-failure output. I'll post a separate PR for that.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jun 1, 2018

Great, thanks!

@garlick garlick merged commit 8196c53 into flux-framework:master Jun 1, 2018

4 checks passed

codecov/patch 83.33% of diff hit (target 79.03%)
Details
codecov/project 79.08% (+0.05%) compared to 48670b5
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.05%) to 79.371%
Details
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jun 1, 2018

Thanks!

@grondo grondo deleted the grondo:flux-start--wrap branch Jun 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.