-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add flux start --recovery #4783
Conversation
This seems useful! If possible, a message printed just before dropping to the prompt with some terse instructions might be helpful. (I'm thinking like the recovery/single user mode shell in linux) |
Here's what I came up with for a message. I reworked a couple of bits including reducing the stderr log level to "warn". $ sudo -u flux flux start -r
2022-11-28T19:53:44.970486Z job-manager.err[0]: restart: checkpoint.job-manager: Invalid argument
2022-11-28T19:53:44.971955Z job-manager.crit[0]: module exiting abnormally
2022-11-28T19:53:44.974528Z broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
2022-11-28T19:53:44.976154Z broker.err[0]: rc1.0: /usr/local/etc/flux/rc1 Exited (rc=1) 0.7s
+-----------------------------------------------------
| Entering Flux system instance recovery mode.
| All resources will be offline during recovery.
| Any rc1 failures noted above may result in
| reduced functionality until manually corrected.
| KVS changes made during recovery are persitent.
|
| broker.rc1_path /usr/local/etc/flux/rc1
| broker.rc3_path /usr/local/etc/flux/rc3
| config.path /usr/local/etc/flux/system/conf.d
| statedir /var/lib/flux
|
| Exit this shell when finished.
+-----------------------------------------------------
$ |
Oh nice! I wasn't thinking of that much detail but that is pretty nice 👍 |
a8d8677
to
49f0939
Compare
I think this will work now for a non-system instance as well. At least it works superficially: $ mkdir /tmp/statedir
$ src/cmd/flux start -o,-Sstatedir=/tmp/statedir
ƒ(s=1,d=0,builddir) $ flux mini run hostnamesystem76-pc
ƒ(s=1,d=0,builddir) $
exit
$ src/cmd/flux start --recovery=/tmp/statedir
+-----------------------------------------------------
| Entering Flux recovery mode.
| All resources will be offline during recovery.
| Any rc1 failures noted above may result in
| reduced functionality until manually corrected.
| KVS changes made during recovery are persistent.
|
| broker.rc1_path /home/garlick/proj/flux-core/etc/rc1
| broker.rc3_path /home/garlick/proj/flux-core/etc/rc3
| config.path -
| statedir /tmp/statedir
|
| Exit this shell when finished.
+-----------------------------------------------------
ƒ(s=1,d=0,builddir) $ flux jobs -a
JOBID USER NAME ST NTASKS NNODES TIME INFO
ƒ2QkBQDm garlick hostname CD 1 1 0.019s system76-pc |
Oh, cool! Can this also be used as an easier method to solve #4466? |
Ah thanks for finding that issue! (I was looking for it) Yes, we'll just need to set recovery mode but restore to an instance that puts Edit: |
58ad5f1
to
debbab8
Compare
I tentatively just made the I hope I'm not making this too confusing by conflating two use cases with one option. If ARG is a statedir (or is missing and the statedir is presumed to be Furthermore, if Confusing? I did modify the startup banner so if you're running from a dump, you get something like
|
0804469
to
194a228
Compare
This is probably ready for a review when someone has a moment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing really caught my eye, just one side comment
@@ -482,6 +503,21 @@ int execvp_argz (char *argz, size_t argz_len) | |||
return -1; | |||
} | |||
|
|||
bool system_instance_is_running (void) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but i couldn't help but wonder if there's a better way or "correct way" to do this, like perhaps calling systemd directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, maybe, although I don't have the first idea of how to do that, and I would imagine it would be more complex...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this seems fine to me since it would work even if the system instance is started without systemd or via some other process management scheme (e.g. in cloud scenarios if that is ever a thing?)
What struck me here is that this duplicates code that is already present in flux_open(3)
-- would it be simpler to just unset FLUX_URI
and then call flux_open (NULL, 0)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done, just had one extra suggestion (besides the other comment about flux_open()
)
t/t2315-resource-system.t
Outdated
@@ -163,7 +163,7 @@ test_expect_success 'resources can be configured in TOML' ' | |||
hosts = "foo[3-10]" | |||
properties = ["batch"] | |||
EOF | |||
flux start -s 1 \ | |||
flux start -s 11 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change will make this test take significantly longer, especially in CI and with coverage enabled. My suggestion would be this instead:
diff --git a/t/t2315-resource-system.t b/t/t2315-resource-system.t
index b3a0c665f..10ab4c0d1 100755
--- a/t/t2315-resource-system.t
+++ b/t/t2315-resource-system.t
@@ -165,12 +165,12 @@ test_expect_success 'resources can be configured in TOML' '
EOF
flux start -s 1 \
-o,--config-path=$(pwd)/${name},-Slog-filename=${name}/logfile \
- flux resource list -s up > ${name}/output &&
+ flux resource list -s all > ${name}/output &&
test_debug "cat ${name}/output" &&
cat <<-EOF >${name}/expected &&
STATE PROPERTIES NNODES NCORES NGPUS NODELIST
- up debug 3 12 0 foo[0-2]
- up batch 8 32 4 foo[3-10]
+ all debug 3 12 0 foo[0-2]
+ all batch 8 32 4 foo[3-10]
EOF
test_cmp ${name}/expected ${name}/output
Fixed those things, and I guess I'll set MWP! Thanks guys. |
Problem: a test of TOML-configured resources expects "extra" execution targets to be marked up, but this will soon change. Change the test to eliminate this expecation.
Problem: sometimes a Flux system instance will refuse to start, and then debugging is tricky because the system cannot be interactively probed. Add flux start --recovery[=ARG], which starts a singleton instance using state from a previous instance. If ARG is unspecifed, recover the system instance, e.g. sudo -u flux flux start --recovery If ARG is a directory, recover a persistent 'statedir', e.g. flux start --recovery=/tmp/statedir If ARG is a file, recover a flux-dump(1) archive, e.g. flux start --recovery=/tmp/mydump.tar or for a system dump: sudo -u flux flux start \ --sysconfig --recovery=/var/lib/flux/dump/20221127_065818.tgz Fixes flux-framework#4466
Problem: if rank 0 is not excluded from the resource configuration, jobs could potentially start in recovery mode. If the broker.recovery-mode attribute is set, bypass resource monitoring and let the resource acquisition protocol return all nodes offline (including rank 0).
Problem: in recovery mode, if R contains more execution targets than the instance size, as may be the case if the instance was launched with PMI, the extra targets are marked "up" and the scheduler thinks it can allocate them to pending jobs. In the resource module, use the execution target count in R, if available, instead of the instance size in calculating the set of "up" and "down" rank sets.
Problem: flux start --recovery=statedir fails resource validation if the rank 0 node running recovery is different than the one it ran on before. Bypass resource verification in recovery mode.
Problem: recovery mode aborts if something goes wrong in rc1, such as a module failing to load. This is the primary reason why a system instance wouldn't start, so if rc1 fails in recovery mode, post a new "rc1-ignorefail" event that transitions to the quorum state instead of shutdown.
Problem: downstream peers could complicate repair of problems in recovery mode. In boot_config.c, skip over the socket initialization when the broker.recovery-mode attribute is set.
Problem: in recovery mode, the state machine doesn't know whether rc2 is an interactive shell or not, so we cannot conditionally print a message geared towards interactive users. Add the runat_is_interactive() accessor.
Problem: users running Flux in interactive recovery mode might benefit from a short message before the shell is started. Print some recovery hints and paths that are useful to know. Suppress the message if the initial program is not an interactive shell.
Problem: flux-start(1) does not document the --recovery or --sysconfig options. Update man page.
Problem: recovery mode has no tests. Add a new sharness script for testing flux start --recovery.
Problem: there are no tests for system instance recovery. Add system/0004-recovery.t.
Codecov Report
@@ Coverage Diff @@
## master #4783 +/- ##
==========================================
- Coverage 84.34% 83.48% -0.86%
==========================================
Files 413 416 +3
Lines 62482 70756 +8274
==========================================
+ Hits 52698 59071 +6373
- Misses 9784 11685 +1901
|
Problem: when a system instance won't start, debugging is a chore.
This adds a
flux start --recovery
mode that should let rank 0 of a system instance be started interactively when something goes wrong with systemd startup, e.g.It only works for the system instance right now, since it hardwires the system instance configuration path and doesn't work with PMI bootstrap.
Thought I'd post it to see what people thought before digging in deeper. It seems like it's happened several times that we try to start Flux after a configuration update or new release, and the first problem we encounter is that the instance won't start. If there's something in the KVS (like the
checkpoint.job-manager
problem I noted in #4776) then in theory it could be manually corrected using this mode.