Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add flux start --recovery #4783

Merged
merged 12 commits into from
Dec 1, 2022
Merged

add flux start --recovery #4783

merged 12 commits into from
Dec 1, 2022

Conversation

garlick
Copy link
Member

@garlick garlick commented Nov 28, 2022

Problem: when a system instance won't start, debugging is a chore.

This adds a flux start --recovery mode that should let rank 0 of a system instance be started interactively when something goes wrong with systemd startup, e.g.

$ sudo -u flux ./flux start --recovery
flux@picl0:/nfshome/garlick/proj/flux-core/src/cmd$ 

It only works for the system instance right now, since it hardwires the system instance configuration path and doesn't work with PMI bootstrap.

Thought I'd post it to see what people thought before digging in deeper. It seems like it's happened several times that we try to start Flux after a configuration update or new release, and the first problem we encounter is that the instance won't start. If there's something in the KVS (like the checkpoint.job-manager problem I noted in #4776) then in theory it could be manually corrected using this mode.

@grondo
Copy link
Contributor

grondo commented Nov 28, 2022

This seems useful!

If possible, a message printed just before dropping to the prompt with some terse instructions might be helpful. (I'm thinking like the recovery/single user mode shell in linux)

@garlick
Copy link
Member Author

garlick commented Nov 28, 2022

Here's what I came up with for a message. I reworked a couple of bits including reducing the stderr log level to "warn".

$ sudo -u flux flux start -r
2022-11-28T19:53:44.970486Z job-manager.err[0]: restart: checkpoint.job-manager: Invalid argument
2022-11-28T19:53:44.971955Z job-manager.crit[0]: module exiting abnormally
2022-11-28T19:53:44.974528Z broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
2022-11-28T19:53:44.976154Z broker.err[0]: rc1.0: /usr/local/etc/flux/rc1 Exited (rc=1) 0.7s
+-----------------------------------------------------
|  Entering Flux system instance recovery mode.
|  All resources will be offline during recovery.
|  Any rc1 failures noted above may result in
|   reduced functionality until manually corrected.
|  KVS changes made during recovery are persitent.
|
|  broker.rc1_path    /usr/local/etc/flux/rc1
|  broker.rc3_path    /usr/local/etc/flux/rc3
|  config.path        /usr/local/etc/flux/system/conf.d
|  statedir           /var/lib/flux
|
|  Exit this shell when finished.
+-----------------------------------------------------
$ 

@grondo
Copy link
Contributor

grondo commented Nov 28, 2022

Oh nice! I wasn't thinking of that much detail but that is pretty nice 👍

@garlick garlick force-pushed the recovery branch 2 times, most recently from a8d8677 to 49f0939 Compare November 28, 2022 23:55
@garlick
Copy link
Member Author

garlick commented Nov 29, 2022

I think this will work now for a non-system instance as well. At least it works superficially:

$ mkdir /tmp/statedir
$ src/cmd/flux start -o,-Sstatedir=/tmp/statedir
ƒ(s=1,d=0,builddir) $ flux mini run hostnamesystem76-pc
ƒ(s=1,d=0,builddir) $ 
exit
$ src/cmd/flux start --recovery=/tmp/statedir
+-----------------------------------------------------
|  Entering Flux recovery mode.
|  All resources will be offline during recovery.
|  Any rc1 failures noted above may result in
|   reduced functionality until manually corrected.
|  KVS changes made during recovery are persistent.
|
|  broker.rc1_path    /home/garlick/proj/flux-core/etc/rc1
|  broker.rc3_path    /home/garlick/proj/flux-core/etc/rc3
|  config.path        -
|  statedir           /tmp/statedir
|
|  Exit this shell when finished.
+-----------------------------------------------------
ƒ(s=1,d=0,builddir) $ flux jobs -a
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
    ƒ2QkBQDm garlick  hostname   CD      1      1   0.019s system76-pc

@grondo
Copy link
Contributor

grondo commented Nov 29, 2022

Oh, cool! Can this also be used as an easier method to solve #4466?

@garlick
Copy link
Member Author

garlick commented Nov 29, 2022

Ah thanks for finding that issue! (I was looking for it) Yes, we'll just need to set recovery mode but restore to an instance that puts content.sqlite in rundir not statedir. I guess a different flux start option would be needed. Thinking up a name will be the hardest part :-)

Edit: --recovery is not set in stone. We just need some option names that convey starting with existing instance state - one where the state is read-only, and one where the state is modified.

@garlick garlick force-pushed the recovery branch 2 times, most recently from 58ad5f1 to debbab8 Compare November 29, 2022 17:53
@garlick
Copy link
Member Author

garlick commented Nov 29, 2022

I tentatively just made the --recovery[=ARG] optional argument accept a dump file.

I hope I'm not making this too confusing by conflating two use cases with one option. If ARG is a statedir (or is missing and the statedir is presumed to be /var/lib/flux), then changes are persistent. For example, you can change the KVS or submit jobs (which won't run until the resources are online when the instance is running normally). However, if ARG is a dump file, nothing you do gets preserved - there is no statedir, and content.sqlite is created in rundir instead, which is cleaned up on exit.

Furthermore, if --recovery with no ARG is specified, the system instance configuration is loaded, but if you load from a system instance dump file, you must add the --sysconfig option (or -o,--config-path=PATH for a path other than the default).

Confusing?

I did modify the startup banner so if you're running from a dump, you get something like

+-----------------------------------------------------
| Entering Flux recovery mode.
| All resources will be offline during recovery.
| Any rc1 failures noted above may result in
|  reduced functionality until manually corrected.
|
| broker.rc1_path    -
| broker.rc3_path    -
| config.path        -
| statedir           changes will not be preserved
|
| Exit this shell when finished.
+-----------------------------------------------------

@garlick garlick force-pushed the recovery branch 2 times, most recently from 0804469 to 194a228 Compare November 29, 2022 21:51
@garlick garlick changed the title WIP: add flux start --recovery-shell add flux start --recovery Nov 29, 2022
@garlick
Copy link
Member Author

garlick commented Nov 29, 2022

This is probably ready for a review when someone has a moment.

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing really caught my eye, just one side comment

@@ -482,6 +503,21 @@ int execvp_argz (char *argz, size_t argz_len)
return -1;
}

bool system_instance_is_running (void)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but i couldn't help but wonder if there's a better way or "correct way" to do this, like perhaps calling systemd directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, maybe, although I don't have the first idea of how to do that, and I would imagine it would be more complex...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this seems fine to me since it would work even if the system instance is started without systemd or via some other process management scheme (e.g. in cloud scenarios if that is ever a thing?)

What struck me here is that this duplicates code that is already present in flux_open(3) -- would it be simpler to just unset FLUX_URI and then call flux_open (NULL, 0)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Will fix.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done, just had one extra suggestion (besides the other comment about flux_open())

@@ -163,7 +163,7 @@ test_expect_success 'resources can be configured in TOML' '
hosts = "foo[3-10]"
properties = ["batch"]
EOF
flux start -s 1 \
flux start -s 11 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will make this test take significantly longer, especially in CI and with coverage enabled. My suggestion would be this instead:

diff --git a/t/t2315-resource-system.t b/t/t2315-resource-system.t
index b3a0c665f..10ab4c0d1 100755
--- a/t/t2315-resource-system.t
+++ b/t/t2315-resource-system.t
@@ -165,12 +165,12 @@ test_expect_success 'resources can be configured in TOML' '
        EOF
        flux start -s 1 \
                -o,--config-path=$(pwd)/${name},-Slog-filename=${name}/logfile \
-               flux resource list -s up > ${name}/output &&
+               flux resource list -s all > ${name}/output &&
        test_debug "cat ${name}/output" &&
        cat <<-EOF >${name}/expected &&
             STATE PROPERTIES NNODES   NCORES    NGPUS NODELIST
-               up debug           3       12        0 foo[0-2]
-               up batch           8       32        4 foo[3-10]
+              all debug           3       12        0 foo[0-2]
+              all batch           8       32        4 foo[3-10]
        EOF
        test_cmp ${name}/expected ${name}/output

@garlick
Copy link
Member Author

garlick commented Nov 30, 2022

Fixed those things, and I guess I'll set MWP! Thanks guys.

Problem: a test of TOML-configured resources expects "extra"
execution targets to be marked up, but this will soon change.

Change the test to eliminate this expecation.
Problem: sometimes a Flux system instance will refuse to start, and
then debugging is tricky because the system cannot be interactively
probed.

Add flux start --recovery[=ARG], which starts a singleton instance
using state from a previous instance.

If ARG is unspecifed, recover the system instance, e.g.
  sudo -u flux flux start --recovery

If ARG is a directory, recover a persistent 'statedir', e.g.
  flux start --recovery=/tmp/statedir

If ARG is a file, recover a flux-dump(1) archive, e.g.
  flux start --recovery=/tmp/mydump.tar
or for a system dump:
  sudo -u flux flux start \
    --sysconfig --recovery=/var/lib/flux/dump/20221127_065818.tgz

Fixes flux-framework#4466
Problem: if rank 0 is not excluded from the resource configuration,
jobs could potentially start in recovery mode.

If the broker.recovery-mode attribute is set, bypass resource
monitoring and let the resource acquisition protocol return all nodes
offline (including rank 0).
Problem: in recovery mode, if R contains more execution targets
than the instance size, as may be the case if the instance was
launched with PMI, the extra targets are marked "up" and the
scheduler thinks it can allocate them to pending jobs.

In the resource module, use the execution target count in R, if
available, instead of the instance size in calculating the set of "up"
and "down" rank sets.
Problem: flux start --recovery=statedir fails resource validation
if the rank 0 node running recovery is different than the one
it ran on before.

Bypass resource verification in recovery mode.
Problem: recovery mode aborts if something goes wrong in rc1,
such as a module failing to load.

This is the primary reason why a system instance wouldn't start,
so if rc1 fails in recovery mode, post a new "rc1-ignorefail" event
that transitions to the quorum state instead of shutdown.
Problem: downstream peers could complicate repair of problems in
recovery mode.

In boot_config.c, skip over the socket initialization when the
broker.recovery-mode attribute is set.
Problem: in recovery mode, the state machine doesn't know whether
rc2 is an interactive shell or not, so we cannot conditionally print
a message geared towards interactive users.

Add the runat_is_interactive() accessor.
Problem: users running Flux in interactive recovery mode might
benefit from a short message before the shell is started.

Print some recovery hints and paths that are useful to know.
Suppress the message if the initial program is not an interactive shell.
Problem: flux-start(1) does not document the --recovery or
--sysconfig options.

Update man page.
Problem: recovery mode has no tests.

Add a new sharness script for testing flux start --recovery.
Problem: there are no tests for system instance recovery.

Add system/0004-recovery.t.
@codecov
Copy link

codecov bot commented Dec 1, 2022

Codecov Report

Merging #4783 (bd81c43) into master (22828cd) will decrease coverage by 0.85%.
The diff coverage is 85.57%.

@@            Coverage Diff             @@
##           master    #4783      +/-   ##
==========================================
- Coverage   84.34%   83.48%   -0.86%     
==========================================
  Files         413      416       +3     
  Lines       62482    70756    +8274     
==========================================
+ Hits        52698    59071    +6373     
- Misses       9784    11685    +1901     
Impacted Files Coverage Δ
src/cmd/flux-start.c 84.25% <70.21%> (-1.34%) ⬇️
src/broker/runat.c 84.07% <90.00%> (-1.11%) ⬇️
src/broker/boot_config.c 81.78% <100.00%> (+0.56%) ⬆️
src/broker/state_machine.c 83.36% <100.00%> (-1.47%) ⬇️
src/modules/resource/inventory.c 83.66% <100.00%> (+1.71%) ⬆️
src/modules/resource/monitor.c 71.79% <100.00%> (-0.59%) ⬇️
src/modules/resource/resource.c 83.33% <100.00%> (-0.10%) ⬇️
src/cmd/top/ucache.c 86.66% <0.00%> (-13.34%) ⬇️
src/modules/job-info/allow.c 76.66% <0.00%> (-12.70%) ⬇️
src/common/libzmqutil/monitor.c 80.00% <0.00%> (-11.03%) ⬇️
... and 339 more

@mergify mergify bot merged commit a13ee4c into flux-framework:master Dec 1, 2022
@garlick garlick deleted the recovery branch December 4, 2022 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants