-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add flux-shutdown command #4250
Conversation
Re-pushed with some tests, and some fixes that came out of writing the tests. |
Repushed with test race fix. |
@@ -1446,16 +1446,14 @@ int flux_match_asprintf (struct flux_match *m, const char *fmt, ...) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit message typo "Modfiy" -> "Modify"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed that fix.
a6d5c4c
to
b5f2498
Compare
Codecov Report
@@ Coverage Diff @@
## master #4250 +/- ##
==========================================
- Coverage 83.55% 83.52% -0.03%
==========================================
Files 386 388 +2
Lines 64786 64921 +135
==========================================
+ Hits 54131 54226 +95
- Misses 10655 10695 +40
|
Cool! Tested this out on fluke and it works to remotely shut down a job. Nice! $ src/cmd/flux shutdown ƒSk7twdWVfm
broker.err[0]: rc2.0: /bin/bash Exited (rc=130) 36.7s
broker.info[0]: rc2-fail: run->cleanup 36.6892s
broker.info[0]: cleanup.0: /bin/bash -c flux queue stop --quiet Exited (rc=0) 0.1s
broker.info[0]: cleanup.1: /bin/bash -c flux job cancelall --user=all --quiet -f --states RUN Exited (rc=0) 0.1s
broker.info[0]: cleanup.2: /bin/bash -c flux queue idle --quiet Exited (rc=0) 0.1s
broker.info[0]: cleanup-success: cleanup->shutdown 0.230415s
broker.info[0]: children-complete: shutdown->finalize 0.491639s
sched-simple.info[0]: resource update: {"down":"1"}
broker.info[0]: rc3.0: /bin/bash -c /g/g0/grondo/git/flux-core.git/etc/rc3 Exited (rc=0) 1.0s
broker.info[0]: rc3-success: finalize->goodbye 1.03809s However, when I do try to "accidentally" run
|
Uh oh, let me see if I can reproduce that. I do have a test for it but the test is rather simple minded. |
Problem: the broker publishes a "shutdown" event message on entry to the SHUTDOWN state, but it is not used. The shutdown message is no longer used to drive the state machine, and the state-machine.monitor streaming RPC is available for any subsystems that need to be notified when shutdown is occurring. Remove publishing code. Update markdown broker state machine documentation.
Problem: when flux-shutdown is running, the broker needs to ensure that it receives all of its messages before unloading the local connector. Add a new state, GOODBYE, between FINALIZE (rc3) and EXIT. For now just post the "goodbye" event immediately to exit the state. In the future, the shutdown handler will do this. Update sharness test. Update markdown broker state documentation.
Problem: the state-machine.monitor streaming RPC stops streaming responses after STATE_SHUTDOWN, but does not send an ENODATA or error response after that to terminate the stream, and therefore does not conform to RFC 6. Send an ENODATA response after the final non-error response is sent. Also, instead of hardwiring STATE_SHUTDOWN as the final state transition to report, add an optional "final" parameter which lets the caller set the desired final state. If unspecified, the final state is STATE_EXIT. The internal state machine user sets state=STATE_SHUTDOWN, as before.
Problem: flux_msg_route_match_first() returns false when either or both messges lack routing frames. Consequently, flux_msglist_cancel() does not work on requests sent by the broker to itself. Modify the logic in this function so that it returns true when neither message has routing frames, which is consistent with the intent of the function: to match requests that have the same sender, despite the perhaps poorly chosen function name. Add unit tests.
Problem: instance shutdown currently occurs upon completion of the initial program or receipt of signals, but sometimes it may be useful to have manual control over the shutdown process. Implement a shutdown.start RPC, for rank 0 only, that interfaces with a front end command to initiate and monitor instance shutdown. The same procedure is followed as when the broker catches a terminating signal: by killing the intial program with SIGTERM, or if there is no initial program, posting the "rc2-abort" event. The broker transitions out of RUN state and works through its teardown process. If the shutdown.start RPC has the streaming flag set, log messages from that point are forwarded to the client for display during the shutdown process. After rc3 completes and the broker enters the new GOODBYE state, if there is an active shutdown client, the stream of shutdown.start responses are terminated with ENODATA. Upon the client's disconnect, the "goodbye" event is posted, causing the broker to transition to EXIT state, which unloads the connector-local module and eventually exits. If the instance is running under systemd, the broker exit code will be the special one that inhibits systemd from restarting the broker.
Problem: there is no command front end to drive instance shutdown. Add flux-shutdown, a command that can trigger shutdown and monitor log activity during the shutdown process. If a TARGET option is present, it is resolved to the URI of a Flux instance that will be shut down. Otherwise the current instance is the target. Logs are displayed until the GOODBYE state is reached. The logging level may be adjusted with the --quiet and --verbose options. Alternatively, the --background option may be use to to skip the monitoring and exit immediately. Fixes flux-framework#3895.
Problem: the new flux-shutdown command has no man page. Add a man page.
Problem: broker runat unit test names the wrong function in output when testing it for bad parameters. Fix function name.
Just pushed a fix for that hang. The problem is that we are using SIGTERM generally to abort rc scripts spawned by the broker. Most of them are shell scripts, but the shell (at least SIGHUP has the same default action (terminate process) as SIGTERM, but the shell distributes SIGHUP to any children, so it seems like a much better choice. The fix is to switch over to that signal. Possibly in the future we may find that in some extreme cases this signal needs to be followed with a SIGKILL after a timeout, but for now I think this is fine.
|
Problem: the broker sends SIGTERM to the commands it spawns for rc1, rc2, cleanup, and rc3, but most of these are shell scripts and SIGTERM is ignored by bash when a subcommand is running. SIGHUP has the same default action as SIGTERM but is forwarded by bash to subprocesses. Change runat_abort() to use SIGHUP instead of SIGTERM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! I think the manpage needs to be updated to indicate the switch to SIGHUP
, but I may have just missed that change.
I tested some corner cases, but everything seems to work as advertised.
The only thing I wonder, and I wouldn't argue that it needs to be added now, is if we want a TIME
argument like shutdown(8)
, with a default of waiting at least a few seconds and an option to cancel the shutdown with a Ctrl-C?
Honestly, we can probably wait until someone accidentally says "What's this command do?" and types flux shutdown
as root, but I just wanted to bring it up (so I can later declare I once had forethought)
doc/man1/flux-shutdown.rst
Outdated
with SIGTERM. Note that the broker exit value normally reflects the | ||
exit code of the initial program, so if it is terminated by this signal, | ||
the broker exits with 128 + 15 = 143. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be updated to SIGHUP
and 128 + 1 = 129 (Hangup)?
Problem: there is no test coverage for flux-shutdown. Add t2808-shutdown-cmd.t.
OK, I pushed with a fix for the man page. Yes, I agree we may want some options like shutdown(8) in the future. Just wanted to get the basics in first. I'll go ahead and set MWP. |
This adds
flux-shutdown(1)
and a mechanism for synchronizing shutdown activities so the command can monitor logs on the way down, and not exit until afterrc3
completes.Example, shutting down my system instance:
Yeah, some of those logs could use some fine tuning, but that's a matter for another PR.
I envisioned adding other options to the shutdown command such as:
flux-dump
for later restartcontent.sqlite
to remove unreferenced blobrefsThere may be other shutdown-time options that would be useful to have. But for now, just the bare bones.
This is WIP pending the addition of tests, but I wanted to get any feedback on the overall design.