make KVS garbage collection automatic #5840

garlick · 2024-03-28T01:08:14Z

This enables KVS garbage collection when flux is started with systemd, and wires flux-dump(1) and flux-restore(1) into sd_notify(3) so that systemd won't kill the dump process if it takes too long, and systemctl status flux will show regular updates about how many keys have been dumped/restored.

I've tested this on my home system by beefing up the content store and reducing the TimeoutStopSec=90 in the unit file to just a few seconds. Definitely kills flux-dump without this PR, and does not with it.

I probably need to add a little more testing but thought I'd post this as is since dinner just arrived!

garlick · 2024-03-28T14:17:02Z

Strange: I'm getting a failure here, but I don't see how this PR could have affected this test. Maybe I'm not thinking hard enough.

2024-03-28T14:09:15.5379023Z FAIL: t9000-system.t 18 - 0005-exec.t: start a long-running guest job
2024-03-28T14:09:15.5379678Z #	
2024-03-28T14:09:15.5380026Z #	        flux submit -n1 --wait-event=start sleep inf &&
2024-03-28T14:09:15.5380412Z #		jobid=$(flux job last)
2024-03-28T14:09:15.5380663Z #	
2024-03-28T14:09:15.5380981Z ok 19 - 0005-exec.t: flux exec --jobid fails as guest
2024-03-28T14:09:15.5381559Z PASS: t9000-system.t 19 - 0005-exec.t: flux exec --jobid fails as guest
2024-03-28T14:09:15.5382148Z ok 20 - 0005-exec.t: flux exec --jobid fails as instance owner
2024-03-28T14:09:15.5382776Z PASS: t9000-system.t 20 - 0005-exec.t: flux exec --jobid fails as instance owner
2024-03-28T14:09:15.5383968Z ok 21 - 0005-exec.t: flux exec without --jobid works as instance owner
2024-03-28T14:09:15.5385257Z PASS: t9000-system.t 21 - 0005-exec.t: flux exec without --jobid works as instance owner
2024-03-28T14:09:15.5386605Z ok 22 - 0005-exec.t: flux exec without --jobid fails as guest
2024-03-28T14:09:15.5387741Z PASS: t9000-system.t 22 - 0005-exec.t: flux exec without --jobid fails as guest
2024-03-28T14:09:15.5388768Z not ok 23 - 0005-exec.t: cancel long-running job
2024-03-28T14:09:15.5389612Z FAIL: t9000-system.t 23 - 0005-exec.t: cancel long-running job
2024-03-28T14:09:15.5390016Z #	
2024-03-28T14:09:15.5390220Z #		flux cancel $jobid
2024-03-28T14:09:15.5390457Z #	
2024-03-28T14:09:15.5390666Z # failed 2 among 23 test(s)

grondo · 2024-03-28T15:32:13Z

Oops, it looks like this is missing from t9000-system.t and is preventing the .output logfile which would give us more information:

diff --git a/t/t9000-system.t b/t/t9000-system.t
index e982b8d3b..f81168d37 100755
--- a/t/t9000-system.t
+++ b/t/t9000-system.t
@@ -13,6 +13,8 @@ if test -n "$FLUX_ENABLE_SYSTEM_TESTS"; then
                FLUX_TEST_INSTALLED_PATH=${FLUX_TEST_INSTALLED_PATH:-/usr/bin}
        fi
 fi
+# Append --logfile option if FLUX_TESTS_LOGFILE is set in environment:
+test -n "$FLUX_TESTS_LOGFILE" && set -- "$@" --logfile
 . `dirname $0`/sharness.sh
 
 #  Do not run system tests by default unless FLUX_ENABLE_SYSTEM_TESTS

garlick · 2024-03-28T15:56:19Z

Ah thanks, I'll toss that in.

grondo

Great improvement! LGTM!

grondo · 2024-03-28T15:49:00Z

etc/flux.service.in

@@ -4,7 +4,7 @@ Wants=munge.service

 [Service]
 Type=notify
-NotifyAccess=main
+NotifyAccess=all


Is there a missing "to" in the commit message "Change to NotifyAccess=all (from =main)..."?

grondo · 2024-03-28T15:55:30Z

src/cmd/builtin/restore.c

+    if (sd_notify_flag
+        && (keycount % 100 == 0 || keycount < 10)) {
+        sd_notifyf (0, "EXTEND_TIMEOUT_USEC=%d", 10000000); // 10s
+        sd_notifyf (0, "STATUS=flux-restore(1) has restored %d keys", keycount);


Does flux-restore(1) know the total number of keys being restored? If so it might be useful to supply that here as a kind of progress indicator.

I think we don't, unfortunately. AFAICT, like tar, libarchive doesn't know the full table of contents until it has streamed the entire archive, which it doesn't do up front because it might be a tape :-)

That makes sense. Perhaps in the future flux-dump could write a final item to the archive in a well known path that gives the total number of keys if it becomes an issue.

Actually probably flux dump is the bigger issue. On a recent system the dump took >6m -- it might be useful for admins to know how far along it is. Not sure if that one's even possible though.

Anyway, this isn't really applicable to this PR, so nevermind.

Yeah delaying shutdown for an open ended period of time is really not ok. But I'm not sure what else to do here. I hope that when it's taking too long, the sys admins will peek at systemctl status flux and see that apparently something is actively happening before they start turning things off.

garlick · 2024-03-28T16:33:00Z

Well that produced a clue!

[Errno 38] No service matching job-ingest.submit is registered

Ah the test is right after 0004-recovery.t which restarts the system instance, so perhaps it is now restoring a dump and not properly up when the job is submitted.

garlick · 2024-03-28T17:59:12Z

Well that sorted the tests. I probably should add a test for the --skip-gc option of flux-shutdown though.

garlick · 2024-03-28T23:36:57Z

I guess I'll set MWP on this one and likely play with it some more on my test system before we tag.

codecov · 2024-03-28T23:37:42Z

Codecov Report

Merging #5840 (9c247cf) into master (aca18f5) will decrease coverage by 0.04%.
Report is 4 commits behind head on master.
The diff coverage is 54.16%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5840      +/-   ##
==========================================
- Coverage   83.30%   83.27%   -0.04%     
==========================================
  Files         510      510              
  Lines       82880    82903      +23     
==========================================
- Hits        69047    69040       -7     
- Misses      13833    13863      +30

Files	Coverage Δ
src/cmd/builtin/shutdown.c	`86.11% <83.33%> (-0.30%)`	⬇️
src/cmd/builtin/dump.c	`83.53% <44.44%> (-1.51%)`	⬇️
src/cmd/builtin/restore.c	`87.91% <44.44%> (-1.49%)`	⬇️

... and 14 files with indirect coverage changes

Problem: processes other than the main one cannot send systemd status updates, but flux-dump(1) would like to do so. Set NotifyAccess=all instead of NotifyAccess=main to allow all members of the service's control group to send updates.

Problem: garbage collecting the KVS is a manual process that is often overlooked. Set content.dump=auto in the systemd unit file, so 'systemctl stop flux' will perform garbage collection.

Problem: when garbage collection is enabled by default in systemd, it may be handy to have a way to avoid it, for example if the system has been garbage collected recently, but it is still expected to take a long time. Add a flux shutdown --skip-gc option to avoid garbage collection.

Problem: if an instance that is configured to dump is shutdown via 'systemctl stop flux', it could run out the shutdown timer and be killed, interrupting the dump. Have flux-dump check for the broker attribute 'broker.sd-notify'. If it is set to something other than zero, then use sd_notify(3) to send systemd EXTEND_TIMEOUT_USEC messages as the dump progresses. In addition, send progress updates that are displayed under "Status:" in 'systemctl status flux'.

Problem: flux-dump(1) now sends updates to systemd via sd_notify() but flux-restore(1), which runs during rc1 does not provide similar updates. Do the same thing in flux-restore(1).

Problem: the --skip-gc option has no documentation. Add it to the man page.

Problem: t9000-system.t doesn't handle FLUX_TESTS_LOGFILE so logs are not shown on failure in CI. Add handling for that environment variable. Co-authoried-by: Mark A. Grondona <mark.grondona@gmail.com>

Problem: the 0004-recovery system test restarts the system instance, which can cause the next test to fail if the instance doesn't enter run state quickly enough. Add a test that waits for the instance to reach run state.

Problem: no tests show that flux shutdown --skip-gc prevent a scheduled dump from happening. Add a test.

garlick force-pushed the issue#5778 branch 3 times, most recently from 74a20e7 to 9e6a0ff Compare March 28, 2024 03:26

grondo approved these changes Mar 28, 2024

View reviewed changes

garlick force-pushed the issue#5778 branch from 9e6a0ff to f11ba44 Compare March 28, 2024 16:07

garlick added the merge-when-passing label Mar 28, 2024

garlick added 9 commits March 28, 2024 23:37

systemd: set NotifyAccess=all

95b3e24

Problem: processes other than the main one cannot send systemd status updates, but flux-dump(1) would like to do so. Set NotifyAccess=all instead of NotifyAccess=main to allow all members of the service's control group to send updates.

systemd: set content.dump=auto

cb40e07

Problem: garbage collecting the KVS is a manual process that is often overlooked. Set content.dump=auto in the systemd unit file, so 'systemctl stop flux' will perform garbage collection.

flux-restore: call sd_notify()

a91f4ec

Problem: flux-dump(1) now sends updates to systemd via sd_notify() but flux-restore(1), which runs during rc1 does not provide similar updates. Do the same thing in flux-restore(1).

flux-shutdown(1): document --skip-gc option

0a5d88a

Problem: the --skip-gc option has no documentation. Add it to the man page.

testsuite: handle FLUX_TESTS_LOGFILE in t9000

24f9e6c

Problem: t9000-system.t doesn't handle FLUX_TESTS_LOGFILE so logs are not shown on failure in CI. Add handling for that environment variable. Co-authoried-by: Mark A. Grondona <mark.grondona@gmail.com>

testsuite: ensure instance reaches run state

fe5e897

Problem: the 0004-recovery system test restarts the system instance, which can cause the next test to fail if the instance doesn't enter run state quickly enough. Add a test that waits for the instance to reach run state.

testsuite: cover --skip-gc

f113df8

Problem: no tests show that flux shutdown --skip-gc prevent a scheduled dump from happening. Add a test.

garlick force-pushed the issue#5778 branch from 9c247cf to f113df8 Compare March 28, 2024 23:37

mergify bot merged commit 69711aa into flux-framework:master Mar 28, 2024
33 checks passed

garlick mentioned this pull request Apr 2, 2024

make KVS garbage collection automatic at shutdown #5778

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make KVS garbage collection automatic #5840

make KVS garbage collection automatic #5840

garlick commented Mar 28, 2024 •

edited

garlick commented Mar 28, 2024

grondo commented Mar 28, 2024

garlick commented Mar 28, 2024

grondo left a comment

grondo Mar 28, 2024

grondo Mar 28, 2024

garlick Mar 28, 2024

grondo Mar 28, 2024

garlick Mar 28, 2024

garlick commented Mar 28, 2024

garlick commented Mar 28, 2024

garlick commented Mar 28, 2024

codecov bot commented Mar 28, 2024

make KVS garbage collection automatic #5840

make KVS garbage collection automatic #5840

Conversation

garlick commented Mar 28, 2024 • edited

garlick commented Mar 28, 2024

grondo commented Mar 28, 2024

garlick commented Mar 28, 2024

grondo left a comment

Choose a reason for hiding this comment

grondo Mar 28, 2024

Choose a reason for hiding this comment

grondo Mar 28, 2024

Choose a reason for hiding this comment

garlick Mar 28, 2024

Choose a reason for hiding this comment

grondo Mar 28, 2024

Choose a reason for hiding this comment

garlick Mar 28, 2024

Choose a reason for hiding this comment

garlick commented Mar 28, 2024

garlick commented Mar 28, 2024

garlick commented Mar 28, 2024

codecov bot commented Mar 28, 2024

Codecov Report

garlick commented Mar 28, 2024 •

edited