need option for restarting from kvs dump for debug #4466
Comments
Current procedure is to apply this diff to rc1 (assuming you're running the test instance from the source tree):

diff --git a/etc/rc1 b/etc/rc1
index dfb6b615d..72c5351e0 100755
--- a/etc/rc1
+++ b/etc/rc1
@@ -48,11 +48,12 @@ if test $RANK -eq 0; then
flux startlog --post-start-event
fi
-modload all resource
+#modload all resource
modload 0 cron sync=heartbeat.pulse
modload 0 job-manager
modload all job-info
modload 0 job-list
+exit 0
period=`flux config get --default= archive.period`
if test $RANK -eq 0 -a -n "${period}"; then
flux module load job-archive

Then start as above. FWIW, the script I was using to start the test instance under valgrind to chase #4465 is:

#!/bin/bash
src/cmd/flux start \
--wrap=libtool,e,valgrind \
--wrap=--log-file=valgrind.out \
--wrap=--tool=memcheck \
--wrap=--leak-check=full \
--wrap=--gen-suppressions=all \
--wrap=--trace-children=no \
--wrap=--child-silent-after-fork=yes \
--wrap=--num-callers=30 \
--wrap=--leak-resolution=med \
--wrap=--error-exitcode=1 \
-o,-Scontent.restore=/g/g0/garlick/bug_state/kvs2.tgz
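The repeated --wrap= options in the script above could also be generated rather than typed out by hand. The following is only a sketch: wrap_args is a hypothetical helper, not part of flux-core, it passes only a subset of the valgrind options, and it prints the command line instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of flux-core): print one --wrap= argument
# per line for each option passed in.
wrap_args() {
    for opt in "$@"; do
        printf '%s%s\n' '--wrap=' "$opt"
    done
}

# Dry run: print the resulting command line instead of executing it.
# Word splitting of the command substitution is intentional here; none of
# the options contain spaces.
echo src/cmd/flux start $(wrap_args \
    libtool,e,valgrind \
    --log-file=valgrind.out \
    --tool=memcheck \
    --leak-check=full \
    --num-callers=30) \
    -o,-Scontent.restore=/g/g0/garlick/bug_state/kvs2.tgz
```

Replacing echo with the real invocation would start the instance; the dry run makes it easy to check the generated options first.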
garlick added a commit to garlick/flux-core that referenced this issue Nov 29, 2022
Problem: sometimes a Flux system instance will refuse to start, and then debugging is tricky because the system cannot be interactively probed.

Add flux start --recovery[=ARG], which starts a singleton instance using state from a previous instance.

If ARG is unspecified, recover the system instance, e.g.

  sudo -u flux flux start --recovery

If ARG is a directory, recover a persistent 'statedir', e.g.

  flux start --recovery=/tmp/statedir

If ARG is a file, recover a flux-dump(1) archive, e.g.

  flux start --recovery=/tmp/mydump.tar

or for a system dump:

  sudo -u flux flux start \
    --sysconfig --recovery=/var/lib/flux/dump/20221127_065818.tgz

Fixes flux-framework#4466
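The three cases for ARG described in the commit message amount to a dispatch on what the argument points at. The sketch below illustrates that decision in shell; recovery_mode is a hypothetical name, and this is not how flux start itself implements the option.

```shell
#!/bin/sh
# Hypothetical helper (not part of flux-core): classify a --recovery
# argument following the cases listed in the commit message.
recovery_mode() {
    arg=$1
    if [ -z "$arg" ]; then
        echo "system"      # no ARG: recover the system instance
    elif [ -d "$arg" ]; then
        echo "statedir"    # directory: recover a persistent statedir
    elif [ -f "$arg" ]; then
        echo "dump"        # regular file: recover a flux-dump(1) archive
    else
        echo "recovery_mode: $arg: not a file or directory" >&2
        return 1
    fi
}
```

For example, recovery_mode /tmp/statedir prints statedir when that path is an existing directory, and the function returns non-zero for a path that does not exist.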
garlick added a commit to garlick/flux-core that referenced this issue Nov 30, 2022
Problem: it's currently possible to run flux kvs dump on a running instance to obtain a dump of kvs content, and load that in a test instance for offline debug, but rc1 has to be hand edited in the test instance to avoid loading modules like resource that will cause the instance to fail to start.