need option for restarting from kvs dump for debug #4466

Closed
garlick opened this issue Aug 4, 2022 · 1 comment

garlick commented Aug 4, 2022

Problem: it's currently possible to run flux kvs dump on a running instance to obtain a dump of KVS content and load it in a test instance for offline debugging. However, rc1 has to be hand-edited in the test instance to avoid loading modules like resource, which will cause the instance to fail to start:

$ flux start -o,-Scontent.restore=/g/g0/garlick/kvs.tgz
[garlick@fluke1:flux-core]$ src/cmd/flux start -o,-Scontent.restore=kvs.tgz
2022-08-04T16:04:45.543257Z resource.err[0]: problem replaying eventlog drain state: Invalid argument
2022-08-04T16:04:45.543268Z resource.crit[0]: module exiting abnormally
2022-08-04T16:04:45.543841Z broker.err[0]: rc1.0: flux-module: broker.insmod: Invalid argument
2022-08-04T16:04:45.544863Z broker.err[0]: rc1.0: /g/g0/garlick/proj/flux-core/etc/rc1 Exited (rc=1) 2.1s
[garlick@fluke1:flux-core]$ 

garlick commented Aug 4, 2022

The current procedure is to apply this diff to rc1 (assuming you're running the test instance from the source tree):

diff --git a/etc/rc1 b/etc/rc1
index dfb6b615d..72c5351e0 100755
--- a/etc/rc1
+++ b/etc/rc1
@@ -48,11 +48,12 @@ if test $RANK -eq 0; then
     flux startlog --post-start-event
 fi
 
-modload all resource
+#modload all resource
 modload 0 cron sync=heartbeat.pulse
 modload 0 job-manager
 modload all job-info
 modload 0 job-list
+exit 0
 period=`flux config get --default= archive.period`
 if test $RANK -eq 0 -a -n "${period}"; then
     flux module load job-archive

Then start as above. FWIW, the script I was using to start the test instance under valgrind to chase #4465 is:

#!/bin/bash
src/cmd/flux start \
        --wrap=libtool,e,valgrind \
        --wrap=--log-file=valgrind.out \
        --wrap=--tool=memcheck \
        --wrap=--leak-check=full \
        --wrap=--gen-suppressions=all \
        --wrap=--trace-children=no \
        --wrap=--child-silent-after-fork=yes \
        --wrap=--num-callers=30 \
        --wrap=--leak-resolution=med \
        --wrap=--error-exitcode=1 \
        -o,-Scontent.restore=/g/g0/garlick/bug_state/kvs2.tgz
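
The rc1 edit from the diff above could also be scripted rather than done by hand. The following is only a sketch: `make_debug_rc1` is a hypothetical helper name, it assumes the two `modload` lines appear in rc1 exactly as they do in the diff, and it writes a modified copy instead of changing rc1 in place.

```shell
#!/bin/bash
# Hypothetical helper (not part of flux-core): write a debug copy of rc1
# that skips the resource module and stops after job-list is loaded,
# mirroring the hand edit shown in the diff.
make_debug_rc1() {
    local src=$1 dst=$2
    awk '
        # Comment out the resource module load.
        $0 == "modload all resource" { print "#" $0; next }
        # End the script right after job-list is loaded.
        $0 == "modload 0 job-list"   { print; print "exit 0"; next }
        # Pass every other line through unchanged.
        { print }
    ' "$src" > "$dst" && chmod +x "$dst"
}

# Example (paths are assumptions):
#   make_debug_rc1 etc/rc1 /tmp/rc1.debug
```

Working on a copy keeps the source tree clean, but the test instance then has to be pointed at that copy, or the copy has to replace etc/rc1 for the duration of the debug session.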

garlick added a commit to garlick/flux-core that referenced this issue Nov 29, 2022
Problem: sometimes a Flux system instance will refuse to start, and
then debugging is tricky because the system cannot be interactively
probed.

Add flux start --recovery[=ARG], which starts a singleton instance
using state from a previous instance.

If ARG is unspecified, recover the system instance, e.g.
  sudo -u flux flux start --recovery

If ARG is a directory, recover a persistent 'statedir', e.g.
  flux start --recovery=/tmp/statedir

If ARG is a file, recover a flux-dump(1) archive, e.g.
  flux start --recovery=/tmp/mydump.tar
or for a system dump:
  sudo -u flux flux start \
    --sysconfig --recovery=/var/lib/flux/dump/20221127_065818.tgz

Fixes flux-framework#4466
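
The three ARG cases in the commit message can be pictured with a small shell sketch. This is an illustration of the described behavior only, not the actual flux start implementation; `recovery_mode` is a hypothetical name.

```shell
#!/bin/bash
# Illustration only: classify a --recovery argument the way the commit
# message describes. recovery_mode is a hypothetical name, not flux-core code.
recovery_mode() {
    local arg=${1:-}
    if [ -z "$arg" ]; then
        # No ARG: recover the system instance.
        echo "system instance"
    elif [ -d "$arg" ]; then
        # ARG is a directory: treat it as a persistent statedir.
        echo "statedir"
    elif [ -f "$arg" ]; then
        # ARG is a regular file: treat it as a flux-dump(1) archive.
        echo "dump archive"
    else
        echo "unknown: $arg" >&2
        return 1
    fi
}

# Example:
#   recovery_mode /tmp/statedir
```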
garlick added a commit to garlick/flux-core that referenced this issue Nov 30, 2022
@mergify mergify bot closed this as completed in 2c2247c Dec 1, 2022