Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make KVS garbage collection automatic #5840

Merged
merged 9 commits into from Mar 28, 2024

Conversation

garlick
Copy link
Member

@garlick garlick commented Mar 28, 2024

This enables KVS garbage collection when flux is started with systemd, and wires flux-dump(1) and flux-restore(1) into sd_notify(3) so that systemd won't kill the dump process if it takes too long, and systemctl status flux will show regular updates about how many keys have been dumped/restored.

I've tested this on my home system by beefing up the content store and reducing the TimeoutStopSec=90 in the unit file to just a few seconds. Definitely kills flux-dump without this PR, and does not with it.

I probably need to add a little more testing but thought I'd post this as is since dinner just arrived!

@garlick garlick force-pushed the issue#5778 branch 3 times, most recently from 74a20e7 to 9e6a0ff Compare March 28, 2024 03:26
@garlick
Copy link
Member Author

garlick commented Mar 28, 2024

Strange: I'm getting a failure here, but I don't see how this PR could have affected this test. Maybe I'm not thinking hard enough.

2024-03-28T14:09:15.5379023Z FAIL: t9000-system.t 18 - 0005-exec.t: start a long-running guest job
2024-03-28T14:09:15.5379678Z #	
2024-03-28T14:09:15.5380026Z #	        flux submit -n1 --wait-event=start sleep inf &&
2024-03-28T14:09:15.5380412Z #		jobid=$(flux job last)
2024-03-28T14:09:15.5380663Z #	
2024-03-28T14:09:15.5380981Z ok 19 - 0005-exec.t: flux exec --jobid fails as guest
2024-03-28T14:09:15.5381559Z PASS: t9000-system.t 19 - 0005-exec.t: flux exec --jobid fails as guest
2024-03-28T14:09:15.5382148Z ok 20 - 0005-exec.t: flux exec --jobid fails as instance owner
2024-03-28T14:09:15.5382776Z PASS: t9000-system.t 20 - 0005-exec.t: flux exec --jobid fails as instance owner
2024-03-28T14:09:15.5383968Z ok 21 - 0005-exec.t: flux exec without --jobid works as instance owner
2024-03-28T14:09:15.5385257Z PASS: t9000-system.t 21 - 0005-exec.t: flux exec without --jobid works as instance owner
2024-03-28T14:09:15.5386605Z ok 22 - 0005-exec.t: flux exec without --jobid fails as guest
2024-03-28T14:09:15.5387741Z PASS: t9000-system.t 22 - 0005-exec.t: flux exec without --jobid fails as guest
2024-03-28T14:09:15.5388768Z not ok 23 - 0005-exec.t: cancel long-running job
2024-03-28T14:09:15.5389612Z FAIL: t9000-system.t 23 - 0005-exec.t: cancel long-running job
2024-03-28T14:09:15.5390016Z #	
2024-03-28T14:09:15.5390220Z #		flux cancel $jobid
2024-03-28T14:09:15.5390457Z #	
2024-03-28T14:09:15.5390666Z # failed 2 among 23 test(s)

@grondo
Copy link
Contributor

grondo commented Mar 28, 2024

Oops, it looks like this is missing from t9000-system.t and is preventing the .output logfile which would give us more information:

diff --git a/t/t9000-system.t b/t/t9000-system.t
index e982b8d3b..f81168d37 100755
--- a/t/t9000-system.t
+++ b/t/t9000-system.t
@@ -13,6 +13,8 @@ if test -n "$FLUX_ENABLE_SYSTEM_TESTS"; then
                FLUX_TEST_INSTALLED_PATH=${FLUX_TEST_INSTALLED_PATH:-/usr/bin}
        fi
 fi
+# Append --logfile option if FLUX_TESTS_LOGFILE is set in environment:
+test -n "$FLUX_TESTS_LOGFILE" && set -- "$@" --logfile
 . `dirname $0`/sharness.sh
 
 #  Do not run system tests by default unless FLUX_ENABLE_SYSTEM_TESTS

@garlick
Copy link
Member Author

garlick commented Mar 28, 2024

Ah thanks, I'll toss that in.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement! LGTM!

@@ -4,7 +4,7 @@ Wants=munge.service

[Service]
Type=notify
NotifyAccess=main
NotifyAccess=all
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a missing "to" in the commit message "Change to NotifyAccess=all (from =main)..."?

if (sd_notify_flag
&& (keycount % 100 == 0 || keycount < 10)) {
sd_notifyf (0, "EXTEND_TIMEOUT_USEC=%d", 10000000); // 10s
sd_notifyf (0, "STATUS=flux-restore(1) has restored %d keys", keycount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does flux-restore(1) know the total number of keys being restored? If so it might be useful to supply that here as a kind of progress indicator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't, unfortunately. AFAICT, like tar, libarchive doesn't know the full table of contents until it has streamed the entire archive, which it doesn't do up front because it might be a tape :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Perhaps in the future flux-dump could write a final item to the archive in a well known path that gives the total number of keys if it becomes an issue.

Actually probably flux dump is the bigger issue. On a recent system the dump took >6m -- it might be useful for admins to know how far along it is. Not sure if that one's even possible though.

Anyway, this isn't really applicable to this PR, so nevermind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah delaying shutdown for an open ended period of time is really not ok. But I'm not sure what else to do here. I hope that when it's taking too long, the sys admins will peek at systemctl status flux and see that apparently something is actively happening before they start turning things off.

@garlick
Copy link
Member Author

garlick commented Mar 28, 2024

Well that produced a clue!

[Errno 38] No service matching job-ingest.submit is registered

Ah the test is right after 0004-recovery.t which restarts the system instance, so perhaps it is now restoring a dump and not properly up when the job is submitted.

@garlick
Copy link
Member Author

garlick commented Mar 28, 2024

Well that sorted the tests. I probably should add a test for the --skip-gc option of flux-shutdown though.

@garlick
Copy link
Member Author

garlick commented Mar 28, 2024

I guess I'll set MWP on this one and likely play with it some more on my test system before we tag.

Copy link

codecov bot commented Mar 28, 2024

Codecov Report

Merging #5840 (9c247cf) into master (aca18f5) will decrease coverage by 0.04%.
Report is 4 commits behind head on master.
The diff coverage is 54.16%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5840      +/-   ##
==========================================
- Coverage   83.30%   83.27%   -0.04%     
==========================================
  Files         510      510              
  Lines       82880    82903      +23     
==========================================
- Hits        69047    69040       -7     
- Misses      13833    13863      +30     
Files Coverage Δ
src/cmd/builtin/shutdown.c 86.11% <83.33%> (-0.30%) ⬇️
src/cmd/builtin/dump.c 83.53% <44.44%> (-1.51%) ⬇️
src/cmd/builtin/restore.c 87.91% <44.44%> (-1.49%) ⬇️

... and 14 files with indirect coverage changes

Problem: processes other than the main one cannot send systemd
status updates, but flux-dump(1) would like to do so.

Set NotifyAccess=all instead of NotifyAccess=main to allow all
members of the service's control group to send updates.
Problem: garbage collecting the KVS is a manual process that
is often overlooked.

Set content.dump=auto in the systemd unit file, so 'systemctl stop
flux' will perform garbage collection.
Problem: when garbage collection is enabled by default in systemd,
it may be handy to have a way to avoid it, for example if the system
has been garbage collected recently, but it is still expected to
take a long time.

Add a flux shutdown --skip-gc option to avoid garbage collection.
Problem: if an instance that is configured to dump is shutdown
via 'systemctl stop flux', it could run out the shutdown timer and
be killed, interrupting the dump.

Have flux-dump check for the broker attribute 'broker.sd-notify'.
If it is set to something other than zero, then use sd_notify(3) to
send systemd EXTEND_TIMEOUT_USEC messages as the dump progresses.
In addition, send progress updates that are displayed under "Status:"
in 'systemctl status flux'.
Problem: flux-dump(1) now sends updates to systemd via sd_notify()
but flux-restore(1), which runs during rc1 does not provide similar
updates.

Do the same thing in flux-restore(1).
Problem: the --skip-gc option has no documentation.

Add it to the man page.
Problem: t9000-system.t doesn't handle FLUX_TESTS_LOGFILE so
logs are not shown on failure in CI.

Add handling for that environment variable.

Co-authoried-by: Mark A. Grondona <mark.grondona@gmail.com>
Problem: the 0004-recovery system test restarts the system
instance, which can cause the next test to fail if the
instance doesn't enter run state quickly enough.

Add a test that waits for the instance to reach run state.
Problem: no tests show that flux shutdown --skip-gc prevent
a scheduled dump from happening.

Add a test.
@mergify mergify bot merged commit 69711aa into flux-framework:master Mar 28, 2024
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants