add flux start --recovery #4783

garlick · 2022-11-28T15:41:55Z

Problem: when a system instance won't start, debugging is a chore.

This adds a flux start --recovery mode that should let rank 0 of a system instance be started interactively when something goes wrong with systemd startup, e.g.

$ sudo -u flux ./flux start --recovery
flux@picl0:/nfshome/garlick/proj/flux-core/src/cmd$

It only works for the system instance right now, since it hardwires the system instance configuration path and doesn't work with PMI bootstrap.

Thought I'd post it to see what people thought before digging in deeper. It seems like it's happened several times that we try to start Flux after a configuration update or new release, and the first problem we encounter is that the instance won't start. If there's something in the KVS (like the checkpoint.job-manager problem I noted in #4776) then in theory it could be manually corrected using this mode.

grondo · 2022-11-28T16:32:29Z

This seems useful!

If possible, a message printed just before dropping to the prompt with some terse instructions might be helpful. (I'm thinking like the recovery/single user mode shell in linux)

garlick · 2022-11-28T19:56:28Z

Here's what I came up with for a message. I reworked a couple of bits including reducing the stderr log level to "warn".

$ sudo -u flux flux start -r
2022-11-28T19:53:44.970486Z job-manager.err[0]: restart: checkpoint.job-manager: Invalid argument
2022-11-28T19:53:44.971955Z job-manager.crit[0]: module exiting abnormally
2022-11-28T19:53:44.974528Z broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
2022-11-28T19:53:44.976154Z broker.err[0]: rc1.0: /usr/local/etc/flux/rc1 Exited (rc=1) 0.7s
+-----------------------------------------------------
|  Entering Flux system instance recovery mode.
|  All resources will be offline during recovery.
|  Any rc1 failures noted above may result in
|   reduced functionality until manually corrected.
|  KVS changes made during recovery are persitent.
|
|  broker.rc1_path    /usr/local/etc/flux/rc1
|  broker.rc3_path    /usr/local/etc/flux/rc3
|  config.path        /usr/local/etc/flux/system/conf.d
|  statedir           /var/lib/flux
|
|  Exit this shell when finished.
+-----------------------------------------------------
$

grondo · 2022-11-28T20:26:01Z

Oh nice! I wasn't thinking of that much detail but that is pretty nice 👍

garlick · 2022-11-29T00:03:16Z

I think this will work now for a non-system instance as well. At least it works superficially:

$ mkdir /tmp/statedir
$ src/cmd/flux start -o,-Sstatedir=/tmp/statedir
ƒ(s=1,d=0,builddir) $ flux mini run hostnamesystem76-pc
ƒ(s=1,d=0,builddir) $ 
exit
$ src/cmd/flux start --recovery=/tmp/statedir
+-----------------------------------------------------
|  Entering Flux recovery mode.
|  All resources will be offline during recovery.
|  Any rc1 failures noted above may result in
|   reduced functionality until manually corrected.
|  KVS changes made during recovery are persistent.
|
|  broker.rc1_path    /home/garlick/proj/flux-core/etc/rc1
|  broker.rc3_path    /home/garlick/proj/flux-core/etc/rc3
|  config.path        -
|  statedir           /tmp/statedir
|
|  Exit this shell when finished.
+-----------------------------------------------------
ƒ(s=1,d=0,builddir) $ flux jobs -a
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
    ƒ2QkBQDm garlick  hostname   CD      1      1   0.019s system76-pc

grondo · 2022-11-29T04:25:10Z

Oh, cool! Can this also be used as an easier method to solve #4466?

garlick · 2022-11-29T04:54:02Z

Ah thanks for finding that issue! (I was looking for it) Yes, we'll just need to set recovery mode but restore to an instance that puts content.sqlite in rundir not statedir. I guess a different flux start option would be needed. Thinking up a name will be the hardest part :-)

Edit: --recovery is not set in stone. We just need some option names that convey starting with existing instance state - one where the state is read-only, and one where the state is modified.

garlick · 2022-11-29T18:03:56Z

I tentatively just made the --recovery[=ARG] optional argument accept a dump file.

I hope I'm not making this too confusing by conflating two use cases with one option. If ARG is a statedir (or is missing and the statedir is presumed to be /var/lib/flux), then changes are persistent. For example, you can change the KVS or submit jobs (which won't run until the resources are online when the instance is running normally). However, if ARG is a dump file, nothing you do gets preserved - there is no statedir, and content.sqlite is created in rundir instead, which is cleaned up on exit.

Furthermore, if --recovery with no ARG is specified, the system instance configuration is loaded, but if you load from a system instance dump file, you must add the --sysconfig option (or -o,--config-path=PATH for a path other than the default).

Confusing?

I did modify the startup banner so if you're running from a dump, you get something like

+-----------------------------------------------------
| Entering Flux recovery mode.
| All resources will be offline during recovery.
| Any rc1 failures noted above may result in
|  reduced functionality until manually corrected.
|
| broker.rc1_path    -
| broker.rc3_path    -
| config.path        -
| statedir           changes will not be preserved
|
| Exit this shell when finished.
+-----------------------------------------------------

garlick · 2022-11-29T23:18:10Z

This is probably ready for a review when someone has a moment.

chu11

nothing really caught my eye, just one side comment

chu11 · 2022-11-30T04:28:34Z

src/cmd/flux-start.c

@@ -482,6 +503,21 @@ int execvp_argz (char *argz, size_t argz_len)
    return -1;
 }

+bool system_instance_is_running (void)


This is fine, but i couldn't help but wonder if there's a better way or "correct way" to do this, like perhaps calling systemd directly?

Heh, maybe, although I don't have the first idea of how to do that, and I would imagine it would be more complex...

FWIW, this seems fine to me since it would work even if the system instance is started without systemd or via some other process management scheme (e.g. in cloud scenarios if that is ever a thing?)

What struck me here is that this duplicates code that is already present in flux_open(3) -- would it be simpler to just unset FLUX_URI and then call flux_open (NULL, 0)?

Good point! Will fix.

grondo

Nicely done, just had one extra suggestion (besides the other comment about flux_open())

grondo · 2022-11-30T15:32:48Z

t/t2315-resource-system.t

@@ -163,7 +163,7 @@ test_expect_success 'resources can be configured in TOML' '
 	hosts = "foo[3-10]"
 	properties = ["batch"]
 	EOF
-	flux start -s 1 \
+	flux start -s 11 \


This change will make this test take significantly longer, especially in CI and with coverage enabled. My suggestion would be this instead:

diff --git a/t/t2315-resource-system.t b/t/t2315-resource-system.t index b3a0c665f..10ab4c0d1 100755 --- a/t/t2315-resource-system.t +++ b/t/t2315-resource-system.t @@ -165,12 +165,12 @@ test_expect_success 'resources can be configured in TOML' ' EOF flux start -s 1 \ -o,--config-path=$(pwd)/${name},-Slog-filename=${name}/logfile \ - flux resource list -s up > ${name}/output && + flux resource list -s all > ${name}/output && test_debug "cat ${name}/output" && cat <<-EOF >${name}/expected && STATE PROPERTIES NNODES NCORES NGPUS NODELIST - up debug 3 12 0 foo[0-2] - up batch 8 32 4 foo[3-10] + all debug 3 12 0 foo[0-2] + all batch 8 32 4 foo[3-10] EOF test_cmp ${name}/expected ${name}/output

garlick · 2022-11-30T23:54:56Z

Fixed those things, and I guess I'll set MWP! Thanks guys.

Problem: a test of TOML-configured resources expects "extra" execution targets to be marked up, but this will soon change. Change the test to eliminate this expecation.

Problem: sometimes a Flux system instance will refuse to start, and then debugging is tricky because the system cannot be interactively probed. Add flux start --recovery[=ARG], which starts a singleton instance using state from a previous instance. If ARG is unspecifed, recover the system instance, e.g. sudo -u flux flux start --recovery If ARG is a directory, recover a persistent 'statedir', e.g. flux start --recovery=/tmp/statedir If ARG is a file, recover a flux-dump(1) archive, e.g. flux start --recovery=/tmp/mydump.tar or for a system dump: sudo -u flux flux start \ --sysconfig --recovery=/var/lib/flux/dump/20221127_065818.tgz Fixes flux-framework#4466

Problem: if rank 0 is not excluded from the resource configuration, jobs could potentially start in recovery mode. If the broker.recovery-mode attribute is set, bypass resource monitoring and let the resource acquisition protocol return all nodes offline (including rank 0).

Problem: in recovery mode, if R contains more execution targets than the instance size, as may be the case if the instance was launched with PMI, the extra targets are marked "up" and the scheduler thinks it can allocate them to pending jobs. In the resource module, use the execution target count in R, if available, instead of the instance size in calculating the set of "up" and "down" rank sets.

Problem: flux start --recovery=statedir fails resource validation if the rank 0 node running recovery is different than the one it ran on before. Bypass resource verification in recovery mode.

Problem: recovery mode aborts if something goes wrong in rc1, such as a module failing to load. This is the primary reason why a system instance wouldn't start, so if rc1 fails in recovery mode, post a new "rc1-ignorefail" event that transitions to the quorum state instead of shutdown.

Problem: downstream peers could complicate repair of problems in recovery mode. In boot_config.c, skip over the socket initialization when the broker.recovery-mode attribute is set.

Problem: in recovery mode, the state machine doesn't know whether rc2 is an interactive shell or not, so we cannot conditionally print a message geared towards interactive users. Add the runat_is_interactive() accessor.

Problem: users running Flux in interactive recovery mode might benefit from a short message before the shell is started. Print some recovery hints and paths that are useful to know. Suppress the message if the initial program is not an interactive shell.

Problem: flux-start(1) does not document the --recovery or --sysconfig options. Update man page.

Problem: recovery mode has no tests. Add a new sharness script for testing flux start --recovery.

Problem: there are no tests for system instance recovery. Add system/0004-recovery.t.

codecov · 2022-12-01T04:38:21Z

Codecov Report

Merging #4783 (bd81c43) into master (22828cd) will decrease coverage by 0.85%.
The diff coverage is 85.57%.

@@            Coverage Diff             @@
##           master    #4783      +/-   ##
==========================================
- Coverage   84.34%   83.48%   -0.86%     
==========================================
  Files         413      416       +3     
  Lines       62482    70756    +8274     
==========================================
+ Hits        52698    59071    +6373     
- Misses       9784    11685    +1901

Impacted Files	Coverage Δ
src/cmd/flux-start.c	`84.25% <70.21%> (-1.34%)`	⬇️
src/broker/runat.c	`84.07% <90.00%> (-1.11%)`	⬇️
src/broker/boot_config.c	`81.78% <100.00%> (+0.56%)`	⬆️
src/broker/state_machine.c	`83.36% <100.00%> (-1.47%)`	⬇️
src/modules/resource/inventory.c	`83.66% <100.00%> (+1.71%)`	⬆️
src/modules/resource/monitor.c	`71.79% <100.00%> (-0.59%)`	⬇️
src/modules/resource/resource.c	`83.33% <100.00%> (-0.10%)`	⬇️
src/cmd/top/ucache.c	`86.66% <0.00%> (-13.34%)`	⬇️
src/modules/job-info/allow.c	`76.66% <0.00%> (-12.70%)`	⬇️
src/common/libzmqutil/monitor.c	`80.00% <0.00%> (-11.03%)`	⬇️
... and 339 more

garlick force-pushed the recovery branch 2 times, most recently from a8d8677 to 49f0939 Compare November 28, 2022 23:55

garlick force-pushed the recovery branch from d193e97 to 2c73931 Compare November 29, 2022 03:04

garlick force-pushed the recovery branch 2 times, most recently from 58ad5f1 to debbab8 Compare November 29, 2022 17:53

garlick force-pushed the recovery branch 2 times, most recently from 0804469 to 194a228 Compare November 29, 2022 21:51

garlick changed the title ~~WIP: add flux start --recovery-shell~~ add flux start --recovery Nov 29, 2022

chu11 approved these changes Nov 30, 2022

View reviewed changes

grondo approved these changes Nov 30, 2022

View reviewed changes

garlick force-pushed the recovery branch from 194a228 to a4f61e3 Compare November 30, 2022 23:54

garlick added the merge-when-passing label Nov 30, 2022

garlick added 9 commits December 1, 2022 03:38

testsuite: adjust node up expectations in test

5fd9e84

Problem: a test of TOML-configured resources expects "extra" execution targets to be marked up, but this will soon change. Change the test to eliminate this expecation.

resource: bypass resource verification in recovery

7d33dc2

Problem: flux start --recovery=statedir fails resource validation if the rank 0 node running recovery is different than the one it ran on before. Bypass resource verification in recovery mode.

broker: don't bind to socket in recovery mode

599d377

Problem: downstream peers could complicate repair of problems in recovery mode. In boot_config.c, skip over the socket initialization when the broker.recovery-mode attribute is set.

broker: add runat_is_interactive()

ab2808b

Problem: in recovery mode, the state machine doesn't know whether rc2 is an interactive shell or not, so we cannot conditionally print a message geared towards interactive users. Add the runat_is_interactive() accessor.

garlick added 3 commits December 1, 2022 03:38

flux-start(1): document recovery options

f60c5fe

Problem: flux-start(1) does not document the --recovery or --sysconfig options. Update man page.

testsuite: cover recovery mode

6432a43

Problem: recovery mode has no tests. Add a new sharness script for testing flux start --recovery.

testsuite: add system recovery test

bd81c43

Problem: there are no tests for system instance recovery. Add system/0004-recovery.t.

garlick force-pushed the recovery branch from a4f61e3 to bd81c43 Compare December 1, 2022 03:38

mergify bot merged commit a13ee4c into flux-framework:master Dec 1, 2022

garlick deleted the recovery branch December 4, 2022 14:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add flux start --recovery #4783

add flux start --recovery #4783

garlick commented Nov 28, 2022 •

edited

Loading

grondo commented Nov 28, 2022

garlick commented Nov 28, 2022

grondo commented Nov 28, 2022

garlick commented Nov 29, 2022

grondo commented Nov 29, 2022

garlick commented Nov 29, 2022 •

edited

Loading

garlick commented Nov 29, 2022 •

edited

Loading

garlick commented Nov 29, 2022

chu11 left a comment

chu11 Nov 30, 2022

garlick Nov 30, 2022

grondo Nov 30, 2022

garlick Nov 30, 2022

grondo left a comment

grondo Nov 30, 2022

garlick commented Nov 30, 2022

codecov bot commented Dec 1, 2022

add flux start --recovery #4783

add flux start --recovery #4783

Conversation

garlick commented Nov 28, 2022 • edited Loading

grondo commented Nov 28, 2022

garlick commented Nov 28, 2022

grondo commented Nov 28, 2022

garlick commented Nov 29, 2022

grondo commented Nov 29, 2022

garlick commented Nov 29, 2022 • edited Loading

garlick commented Nov 29, 2022 • edited Loading

garlick commented Nov 29, 2022

chu11 left a comment

Choose a reason for hiding this comment

chu11 Nov 30, 2022

Choose a reason for hiding this comment

garlick Nov 30, 2022

Choose a reason for hiding this comment

grondo Nov 30, 2022

Choose a reason for hiding this comment

garlick Nov 30, 2022

Choose a reason for hiding this comment

grondo left a comment

Choose a reason for hiding this comment

grondo Nov 30, 2022

Choose a reason for hiding this comment

garlick commented Nov 30, 2022

codecov bot commented Dec 1, 2022

Codecov Report

garlick commented Nov 28, 2022 •

edited

Loading

garlick commented Nov 29, 2022 •

edited

Loading

garlick commented Nov 29, 2022 •

edited

Loading