Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue385: Support tbon.endpoint and mcast.endpoint attributes #1030

Merged
merged 6 commits into from Apr 12, 2017

Conversation

Projects
None yet
5 participants
@chu11
Copy link
Contributor

chu11 commented Apr 6, 2017

Support new attributes that allow the end user to configure endpoints for the flux broker.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 6, 2017

Some notes. I kept the --enable-epgm and --shared-ipc-namespace options, although they could be removed. Effectively, they now are convenience options that set mcast.endpoint and tbon.endpoint to known endpoints. I think code could be cleaner if we remove one or more of these options (lots of create string, free string, cleanup file, kind of code paths).

There were no EPGM tests before, so I didn't add any. I assume this is b/c we cannot assume we're ever on a network that has multicast available (travis?).

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 6, 2017

Coverage Status

Coverage decreased (-0.3%) to 77.941% when pulling 34cb7c5 on chu11:issue385 into a9f4cd5 on flux-framework:master.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 6, 2017

Hmmm, got an error ERROR: lua/t1002-kvs.t - exited with status 139 (terminated by signal 11?) under clang. Can't see how anything I did effected this. Re-running

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 6, 2017

Hmmm, got errors again but only on clang again.

PASS: lua/t1002-kvs.t 125 - but key value has been updated
ERROR: lua/t1002-kvs.t - exited with status 139 (terminated by signal 11?)
PASS: lua/t1003-iowatcher.t 1 - require 'flux'

hmmm, perhaps something racy and I've just tickled it the right way? 11 is SIGSEGV, hmmm

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 6, 2017

yeah, too bad corefiles stopped working in Travis.

I wonder if it is something in the shutdown path? I wonder if the Lua test script leaves something in a bad state due to lazy cleanup or something.

Have you tried compiling with clang on your workstation?

@chu11 chu11 force-pushed the chu11:issue385 branch from 34cb7c5 to 0b1e144 Apr 6, 2017

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Apr 6, 2017

Codecov Report

Merging #1030 into master will decrease coverage by 0.06%.
The diff coverage is 61.64%.

@@            Coverage Diff            @@
##           master   #1030      +/-   ##
=========================================
- Coverage   77.96%   77.9%   -0.07%     
=========================================
  Files         150     150              
  Lines       25603   25648      +45     
=========================================
+ Hits        19962   19980      +18     
- Misses       5641    5668      +27
Impacted Files Coverage Δ
src/cmd/flux-start.c 74% <100%> (ø) ⬆️
src/broker/broker.c 72.34% <61.11%> (-0.36%) ⬇️
src/broker/overlay.c 67.3% <0%> (-3.53%) ⬇️
src/common/libflux/tagpool.c 95.12% <0%> (-1.22%) ⬇️
src/common/libflux/rpc.c 91.2% <0%> (-1.1%) ⬇️
src/broker/module.c 82.15% <0%> (-0.29%) ⬇️
src/common/libflux/dispatch.c 83.19% <0%> (-0.28%) ⬇️
src/common/libflux/handle.c 85.86% <0%> (-0.26%) ⬇️
src/modules/kvs/kvs.c 80.88% <0%> (-0.25%) ⬇️
src/modules/connector-local/local.c 70.14% <0%> (+0.2%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f915009...f557c24. Read the comment docs.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 6, 2017

yeah, I've tried clang 3.9 & 4.0 on TOSS3 systems, wasn't able to reproduce with them. Gonna try and do some "git bisect" here via travis, see if I can get anything out of it.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 6, 2017

Coverage Status

Coverage decreased (-0.06%) to 78.156% when pulling 0b1e144 on chu11:issue385 into a9f4cd5 on flux-framework:master.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 6, 2017

Coverage Status

Coverage decreased (-0.3%) to 77.938% when pulling cd6c33a on chu11:issue385 into a9f4cd5 on flux-framework:master.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 6, 2017

Well, it appears somehow third time was a charm.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 7, 2017

Coverage Status

Coverage decreased (-0.3%) to 77.945% when pulling 97e56c1 on chu11:issue385 into 6fbe380 on flux-framework:master.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 7, 2017

rebased and it appears fourth time was a charm too. low code coverage b/c lots of missed error paths and --enable-epgm paths.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 7, 2017

Sorry for taking a long time to comment!

Yes, testing epgm in travis is hard since (as I recall) opening the same epgm endpoint multiple times on the same host doesn't work, which is why there is the "event relay" code in there when there is more than one broker per node and epgm is active. I was about to refresh my memory on the actual epgm limitation, but I need to go soon so I'll just bookmark this zeromq-dev thread.

I would welcome the the further cleanup to remove redundant broker options, but it can happen in another PR.

This should definitely be tested by hand under slurm on a real system, with and without epgm. I can give that a try later if that's not already been done. Once we're sure that still works, I'm for merging this.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 10, 2017

Ran by hand under slurm. All I did was run flux under slurm, make sure I could run flux wreckrun across nodes, and verify that children in the overlay were muted/not-muted based on the config I set.

As for --enable-epgm and --shared-ipc-namespace, is the feeling to axe both of them? I was sort of leaning to keeping --shared-ipc-namespace at first, but now I see how little it is used (mostly just by flux-start). So perhaps it's best to just axe both options.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 10, 2017

Thanks for doing the testing.

Yeah, the main point of --shared-ipc-namespace was just for flux-start to influence whether tcp:// or ipc:// endpoints were used for the tbon wireup. If setting tbon.endpoint to ipc://%B/req accomplishes the same thing then let's give it the axe!

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 11, 2017

Possibly %P should be dropped here, due to #1034. Especially since we have * (although I'm not sure that works with epgm)

@chu11 chu11 force-pushed the chu11:issue385 branch from 97e56c1 to c305d45 Apr 11, 2017

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 11, 2017

@garlick Yup, dropping %P was exactly what I was thinking. You are correct the wildcard doesn't work with epgm, which is probably why the pseudo-random-ish 5000 + 1024 % sessionid was used before.

Just pushed new changes, axeing %P and the old --enable-epgm and --shared-ipc-namespace.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 11, 2017

There is a fallout of this patch series that I've been trying to hunt down, but perhaps it's for a different issue. In the below I just used a wildcard port, which is invalid for epgm. But it works with any invalid input to tbon.endpoint or mcast.endpoint

$ srun -N4 -n4 --pty src/cmd/flux start -o,--setattr=log-filename=/tmp/achu/mylog -o,--setattr=log-level=7 -o,--setattr=mcast.endpoint='epgm://%h;239.192.1.1:*'
lt-flux-broker: bind_event_pub: epgm://192.168.128.108;239.192.1.1:*: Resource temporarily unavailable
E: (flux-broker) 17-04-11 11:52:38 dangling 'PAIR' socket created at src/zsys.c:379
E: (flux-broker) 17-04-11 11:52:38 dangling 'PAIR' socket created at src/zsys.c:380
E: (flux-broker) 17-04-11 11:52:38 dangling 'REP' socket created at src/zauth.c:77
E: (flux-broker) 17-04-11 11:52:38 dangling 'ROUTER' socket created at overlay.c:467
E: (flux-broker) 17-04-11 11:52:38 dangling 'PUB' socket created at overlay.c:486
E: (flux-broker) 17-04-11 11:52:38 dangling sockets: cannot terminate ZMQ safely
lt-flux-broker: src/zsock_option.c:1347: zsock_set_sndtimeo: Assertion `rc == 0 || zmq_errno () == (156384712 + 53)' failed.
srun: error: opal110: task 2: Exited with exit code 1
srun: error: opal109: task 1: Exited with exit code 1
srun: error: opal111: task 3: Aborted (core dumped)
srun: error: opal108: task 0: Aborted (core dumped)

Don't know the exact code path at the moment, but basically a bad bind call leads down some path that eventually asserts inside czmq. Of course, it should exit cleanly on the bad bind.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 11, 2017

Coverage Status

Coverage decreased (-0.2%) to 78.032% when pulling c305d45 on chu11:issue385 into 6fbe380 on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 11, 2017

My guess is a stack trace would lead to czmq's atexit handler, which is emitting the E: prefixed messages, and will try to close all open zmq sockets. Ideally we would fix overlay.c to clean up after itself and return failure rather than call exit(), but that's probably not for this PR. Would it avoid the segfault if this socket were closed in the failure path of zsock_bind() before the call to log_err_exit()?

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 11, 2017

Nah, destroying socket in the zsock_bind() path didn't help. I'll open up a new issue regarding this.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2017

@chu11 are you still working here or is this ready for merging?

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 12, 2017

Now that we've got #1036, this one is all done now.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2017

Thanks - nice cleanup.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2017

I'm hesitating on the merge because of the overall project coverage drop of -0.2% which seems kind of high. Looking at the raw travis output I do see some errors:

make[1]: Entering directory `/home/travis/build/flux-framework/flux-core'
lcov --quiet --directory . --capture --output-file "flux-core-0.7.0-29-g8a26b5b-coverage.info.tmp1" --test-name "flux-core-0.7.0-29-g8a26b5b" --no-checksum --compat-libtool 
geninfo: WARNING: invalid characters removed from testname!
<built-in>:cannot open source file
geninfo: WARNING: no data found for /home/travis/build/flux-framework/flux-core/t/<built-in>
<built-in>:cannot open source file
geninfo: WARNING: no data found for /home/travis/build/flux-framework/flux-core/t/<built-in>
<built-in>:cannot open source file
geninfo: WARNING: no data found for /home/travis/build/flux-framework/flux-core/t/<built-in>
lcov --quiet -a "flux-core-0.7.0-29-g8a26b5b-coverage.info.initial" -a "flux-core-0.7.0-29-g8a26b5b-coverage.info.tmp1" -o "flux-core-0.7.0-29-g8a26b5b-coverage.info.tmp" 
lcov: WARNING: negative counts found in tracefile flux-core-0.7.0-29-g8a26b5b-coverage.info.tmp1
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_kvs.c:5739
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_kvs.c:5745
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_jsc.c:1354
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_jsc.c:1360
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/sds.c:933
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_kz.c:1587
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_kz.c:1593
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/shortjson.h:113
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/modules/libjsc/jstatctl.c:966
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/modules/libjsc/jstatctl.c:662
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/modules/libjsc/jstatctl.c:635
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/kary.c:61
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/kary.c:52
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/kary.c:89
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/kary.c:74
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/tstat.c:72
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libflux/reactor.c:87
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libflux/reactor.c:243
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libsubprocess/zio.c:807
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libsubprocess/zio.c:519
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libsubprocess/zio.c:306
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libsubprocess/zio.c:311
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libsubprocess/zio.c:715
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libsubprocess/zio.c:300
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/cronodate.c:352
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/cronodate.c:226
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/cronodate.c:305
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/cronodate.c:289
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/cronodate.c:312
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_barrier.c:1039
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_barrier.c:1045
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libminilzo/minilzo.c:4057
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/common/libutil/environment.c:212
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_core.c:13301
lcov: WARNING: function data mismatch at /home/travis/build/flux-framework/flux-core/src/bindings/python/flux/_core.c:13307

and a bit later

Running gcov in . �[0;90m(disable via -X gcov)�[0m
treduce.gcno:cannot open graph file
treduce.gcno:cannot open graph file
topen.gcno:cannot open graph file
tsend.gcno:cannot open graph file
trecv.gcno:cannot open graph file
tevent.gcno:cannot open graph file
trpc.gcno:cannot open graph file
trpc_then.gcno:cannot open graph file
trpc_then_multi.gcno:cannot open graph file
tinfo.gcno:cannot open graph file
lua-affinity.gcno:cannot open graph file
cpuset-str.gcno:cannot open graph file
lua-affinity.gcno:cannot open graph file
cpuset-str.gcno:cannot open graph file
hostlist.gcno:cannot open graph file
lua-hostlist.gcno:cannot open graph file
[snip] lots of these

I'll look at some other runs to see if these are just normal noise or something new.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 12, 2017

I did figure it'd be a bit lower b/c we are going to miss anything I touched related to epgm. But you're right -.2% is a bit high. Looking through the codecov, the biggest loss not related to my patch was we no longer hit some overlay code related to multicast (bind_event_pub, connect_event_pub, overlay_set_event).

The reason is ---shared-ipc-namespace would always create an ipc event file (ipc://rundir/event), regardless if the user requested epgm or not. I simplified it to just use the tbon.

So I should probably add a test for setting mcast.endpoint to an ipc endpoint.

Looking at the logs, the stuff above appears to be normal noise.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2017

If you can quickly add a test for that, it would be interesting to see how that affects coverage.

It would appear that the noise I reported above occurs in other PR's so that's probably not it.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Apr 12, 2017

The missing .gcno warnings for the test executables and lua modules is probably nothing to worry about. Likely they are built without -fprofile-arcs or whatever, and they aren't included in coverage anyway.

The WARNING: no data found for /home/travis/build/flux-framework/flux-core/t/<built-in> is probably fine, some symbol is perhaps resolving to <built-in> but I doubt it is one of ours? I'll plumb google for that one.

I'll also try to research the function data mismatch errors.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2017

I'm kicking the tires on this one to recreate #1036 and did notice one thing:

Maybe the old event-uri and tbon-request-uri could go away now that we have mcast.endpoint and tbon.endpoint if, once the endpoints have been opened, we substituted the "expanded" endpoint for the symbolic one?

Then for consistency maybe event-relay-uri could be come mcast.relay-endpoint?

And tbon-parent-uri could become tbon.parent-endpoint?

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 12, 2017

@garlick I think those are some good ideas. I was thinking of something similar, although was gonna do it in another PR. How about we just start an issue for it.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2017

Sounds good!

chu11 added some commits Mar 21, 2017

broker: Support tbon.endpoint attribute
Support new tbon.endpoint attribute to support configuration of
a TBON endpoint rather than the previously hardcoded "tcp://<ipaddr>:*".

Default is now "tcp://%h:*", where %h is a format specifier indicating
ip address of the host's hostname.
broker: Refactor --shared-ipc-namespace
Refactor --shared-ipc-namespace option to configure via the
tbon.endpoint attribute.  Utilize a newly added %B format
specifier for tbon.endpoint that indicates the broker.rundir.
broker: Support mcast.endpoint attribute
Support new mcast.endpoint attribute to support configuration of a
multicast endpoint for event distribution.

Default is now "tbon" to use to the tree based overlay network.  If
--enable-epgm is used, it now defaults to "epgm://%h;239.192.1.1:5000".

Fixes #385
broker: Remove --enable-epgm
With new mcast.endpoint attribute, --enable-epgm is no longer
needed.  Users should set mcast.endpoint appropriately via
--setattr option.
cmd/flux-start: Remove use of --shared-ipc-namespace
When setting up broker, use --setattr to set tbon.endpoint instead
of calling --shared-ipc-namespace.
broker: Remove --shared-ipc-namespace
With new tbon.endpoint attribute, --shared-ipc-namespace is no longer
needed.  Users should set tbon.endpoint appropriately via --setattr
option.

@chu11 chu11 force-pushed the chu11:issue385 branch from c305d45 to f557c24 Apr 12, 2017

@coveralls

This comment has been minimized.

Copy link

coveralls commented Apr 12, 2017

Coverage Status

Coverage decreased (-0.07%) to 78.15% when pulling f557c24 on chu11:issue385 into f915009 on flux-framework:master.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Apr 12, 2017

woo-hoo, only -0.07% now with one additional test. Coverage on diff target is expectedly low w/ lots of error paths and corner cases missed.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Apr 12, 2017

Great! Merging.

@garlick garlick merged commit 632c92f into flux-framework:master Apr 12, 2017

2 of 4 checks passed

codecov/patch 61.64% of diff hit (target 77.96%)
Details
codecov/project 77.9% (-0.07%) compared to f915009
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.07%) to 78.15%
Details

@grondo grondo referenced this pull request Aug 23, 2017

Closed

0.8.0 Release #1160

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.