Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: various cleanup + refactoring #2194

Merged
merged 13 commits into from Jun 15, 2019

Conversation

@chu11
Copy link
Contributor

commented Jun 15, 2019

While PR #2181 is still half fresh in my mind, wanted to do these small cleanups & refactorings.

Obviously, way more refactoring can be done as I've listed things in #2182, but this is more of a refactoring "setup" for bigger things later.

Mostly small changes to slowly decouple objects from each other or clarify lack of dependency between objects.

Also, removed a few more log_err_exit() and similar paths, forcing callers to go through the main broker cleanup path.

chu11 added 13 commits Jun 14, 2019
The init.mode string of "normal" was checked twice.  Likely a merge
conflict from before.
Move estimation of shutdown grace period into shutdown object with
new function shutdown_set_grace().  With this setting, grace period
no longer needs to be set via shutdown_arm(), thus the grace parameter
is no longer needed in that function.  Update tests appropriately.
Remove function heartbeat_set_ratestr().  Instead, broker will
parse command-line input from user and set via heartbeat_set_rate().

This change removes the heartbeat object as a dependency for
parsing commandline arguments.

Update tests accordingly.
Move the zsecurity context from the broker context to the
overlay object.  It is only used inside the overlay and does
not need to be depended on by the broker ctx.
Move sigwatchers list into broker context.  Instead of calling
broker_unhandle_signals() on cleanup, simply set free function
on each element of sigwatchers list.
Do not "manually" cleanup up subscriptions list.  Simply set a
free function after each append to the subscriptions list.
Have overlay init cb return an int for success/fail.  Have this value
be returnable by overlay_init(), which can be returned by boot_pmi()
and boot_config().
Move all "register_attrs" calls to a location closer to the
initialization / setup of other parts of the object.  With this
change, the register_attrs calls are more in line with their
object and removes the impression that the registering of the
attrs needs to be done in a specific order except for when
commented.

The movement of content_cache_register_attrs() effectively
reverses commit

3d59fbf

which was a fix for #1063.  The fix of issue #1036 no longer
requires the content_cache_register_attrs() call to be made
far earlier in the code before zsys_init().
Return int for success / error, do not call log_err_exit() or oom()
on error.  Requires update to register_request(), register_event(),
and getctx() as well.

Update callers accordingly.
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 15, 2019

Hmm, one builder hit a valgrind issue, no stack available. Was also hit in #2137 & #2178 (keying off the byte count of 27,632).

==768== Memcheck, a memory error detector
==768== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==768== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==768== Command: /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-766-MPYdtA --setattr=tbon.endpoint=ipc://%B/req --shutdown-grace=16
==768== 
==767== Memcheck, a memory error detector
==767== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==767== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==767== Command: /usr/libexec/flux/cmd/flux-broker --setattr=rundir=/tmp/flux-766-MPYdtA --setattr=tbon.endpoint=ipc://%B/req --shutdown-grace=16 /usr/src/t/valgrind/valgrind-workload.sh
==767== 
FLUX_URI=local:///tmp/flux-766-MPYdtA/0
Running job
id=28991029248
id=32883343360
id=39493566464
id=47009759232
id=53351546880
id=63870861312
id=75749130240
id=84322287616
id=95613353984
id=105998450688
++ flux jobspec srun -t 1 -n 1 /bin/true
++ flux job submit
+ id=117390180352
+ flux job wait-event 117390180352 start
1560559386.705316 start
+ flux job cancel 117390180352
+ flux job wait-event 117390180352 clean
1560559387.700731 clean
+ flux job info 117390180352 eventlog jobspec R
{"timestamp":1560559384.4427228,"name":"submit","context":{"userid":2000,"priority":16,"flags":0}}
{"timestamp":1560559384.8781703,"name":"depend"}
{"timestamp":1560559386.1868644,"name":"alloc","context":{"note":"rank0/core1"}}
{"timestamp":1560559386.7053158,"name":"start"}
{"timestamp":1560559387.0576141,"name":"exception","context":{"type":"cancel","severity":0,"userid":2000,"note":""}}
{"timestamp":1560559387.0888948,"name":"finish","context":{"status":9}}
{"timestamp":1560559387.6782327,"name":"release","context":{"ranks":"all","final":true}}
{"timestamp":1560559387.700299,"name":"free"}
{"timestamp":1560559387.7007306,"name":"clean"}

{"tasks": [{"slot": "task", "count": {"per_slot": 1}, "command": ["/bin/true"], "attributes": {}}], "attributes": {"system": {"duration": 60.0}}, "version": 1, "resources": [{"count": 1, "with": [{"count": 1, "type": "core"}], "type": "slot", "label": "task"}]}

{"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"1"}}]}}
+ flux job drain
==768== 
==768== HEAP SUMMARY:
==768==     in use at exit: 27,632 bytes in 70 blocks
==768==   total heap usage: 25,714 allocs, 25,644 frees, 17,211,323 bytes allocated
==768== 
==768== LEAK SUMMARY:
==768==    definitely lost: 0 bytes in 0 blocks
==768==    indirectly lost: 0 bytes in 0 blocks
==768==      possibly lost: 0 bytes in 0 blocks
==768==    still reachable: 27,632 bytes in 70 blocks
==768==         suppressed: 0 bytes in 0 blocks
==768== Reachable blocks (those to which a pointer was found) are not shown.
==768== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==768== 
==768== For counts of detected and suppressed errors, rerun with: -v
==768== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
flux-start: 0 (pid 767) Killed
not ok 1 - valgrind reports no new errors on 2 broker run
@grondo

This comment has been minimized.

Copy link
Contributor

commented Jun 15, 2019

Hm, in the valgrind test we adjust the broker's "shutdown-grace" period, but not flux-start's --killer-timeout which defaults to 2s. I wonder if the rank 0 broker is slow to exit and hits this timeout? I didn't look in any detail but we should probably add --killer-timeout (of at least VALGRIND_SHUTDOWN_GRACE?) to the flux-start command line in the test.

@garlick

This comment has been minimized.

Copy link
Member

commented Jun 15, 2019

These cleanups look good to me!

Good catch on that flux start timeout @grondo. I opened #2195 to track that as a separate issue. Meanwhile I think this can go in.

@garlick garlick merged commit 4c93728 into flux-framework:master Jun 15, 2019
2 checks passed
2 checks passed
Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.