
systemd: add prolog/epilog service units #6040

Merged Jul 8, 2024 (1 commit)

Conversation

@garlick (Member) commented Jun 13, 2024

Problem: future housekeeping scripts should be run in a systemd cgroup for reliable termination, logging to the local systemd journal, and debugging using well-known systemd tools.

Add a systemd "oneshot" service unit for housekeeping. The service unit runs a user-provided housekeeping script. It is configured so that systemctl start housekeeping blocks until the run is complete and its exit code reflects the exit code of the housekeeping script.
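
For illustration, a minimal sketch of what such a oneshot unit could look like; the option values are assumptions based on the behavior described here, not the exact file from this PR:

[Unit]
Description=Node Maintenance for Flux

[Service]
Type=oneshot
# Type=oneshot makes "systemctl start housekeeping" block until ExecStart
# exits, and the unit's result reflects the script's exit code.
ExecStart=/etc/flux/system/housekeeping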

Add a helper script that can be configured as an IMP run command. It runs systemctl start housekeeping and traps SIGTERM, which can be sent to enforce a timeout. Upon receipt of SIGTERM, it stops the unit and exits with a nonzero code.

To enable environment variables such as FLUX_JOB_ID to be passed into the user-provided housekeeping script via the systemd unit, the helper script dumps its environment into /run/housekeeping.env, which is read in by the unit.
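
A sketch of that helper behavior in shell; this is illustrative, not the script shipped in the PR, and assumes the unit reads the file back with EnvironmentFile=/run/housekeeping.env:

#!/bin/sh
# Dump the IMP-provided environment (FLUX_JOB_ID etc.) where the
# unit's EnvironmentFile= directive can pick it up.
env >/run/housekeeping.env
# On a timeout SIGTERM, stop the unit and exit nonzero.
trap 'systemctl stop housekeeping; exit 1' TERM
# Start in the background so the trap can run while we wait.
systemctl start housekeeping &
wait $!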

Also of note: the user-provided scripts are automatically idempotent when run this way. systemd never starts multiple instances of the unit. If one is running when a start request is received, the second start blocks until the existing run finishes and reports its status.
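
In shell terms:

$ systemctl start housekeeping &   # first request starts the script
$ systemctl start housekeeping     # blocks until the run above finishes,
                                   # then reports that run's status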

I split this out to a separate PR because I think it can be used to improve the existing epilog even before housekeeping, as proposed in #5818, is available. For example, on my test system, I renamed /etc/flux/system/epilog to housekeeping and configured the IMP with

[run.housekeeping]
allowed-users = [ "flux" ]
allowed-environment = [ "FLUX_*" ]
path = "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-run-housekeeping"

and the job manager with

[job-manager.epilog]
command = [
   "flux", "perilog-run", "epilog",
   "--timeout=15s",
   "--with-imp",
   "-e", "housekeeping"
]

And now my epilog runs in the housekeeping unit. After a job runs, journalctl -u housekeeping says:

Jun 13 11:33:49 picl7 systemd[1]: Starting Node Maintenance for Flux...
Jun 13 11:33:49 picl7 housekeeping[55656]: running stress.sh
Jun 13 11:33:49 picl7 housekeeping[55663]: stress: info: [55663] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
Jun 13 11:33:59 picl7 housekeeping[55663]: stress: info: [55663] successful run completed in 10s
Jun 13 11:33:59 picl7 systemd[1]: housekeeping.service: Succeeded.
Jun 13 11:33:59 picl7 systemd[1]: Finished Node Maintenance for Flux.
Jun 13 11:33:59 picl7 systemd[1]: housekeeping.service: Consumed 39.989s CPU time.

(My epilog just burns CPU for 10s.)

systemctl status housekeeping provides the usual informative output:

● housekeeping.service - Node Maintenance for Flux
     Loaded: loaded (/lib/systemd/system/housekeeping.service; static)
     Active: activating (start) since Thu 2024-06-13 11:51:21 PDT; 3s ago
   Main PID: 55746 (housekeeping)
      Tasks: 7 (limit: 1599)
     Memory: 856.0K
        CPU: 15.185s
     CGroup: /system.slice/housekeeping.service
             ├─55746 /bin/sh /etc/flux/system/housekeeping
             ├─55749 /bin/sh /etc/flux/system/housekeeping.d/stress.sh
             ├─55750 stress -c 4 -t 10
             ├─55751 stress -c 4 -t 10
             ├─55752 stress -c 4 -t 10
             ├─55753 stress -c 4 -t 10
             └─55754 stress -c 4 -t 10

Jun 13 11:51:21 picl7 systemd[1]: Starting Node Maintenance for Flux...
Jun 13 11:51:21 picl7 housekeeping[55746]: running stress.sh
Jun 13 11:51:21 picl7 housekeeping[55750]: stress: info: [55750] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd

@grondo (Contributor) commented Jun 13, 2024

What happens on systems where nodes are shared? Can the housekeeping unit be run for two different jobs at the same time?

@grondo (Contributor) commented Jun 13, 2024

I forgot most of what I know about systemd unit file templates, but perhaps if the housekeeping unit file was made into a template, it could be started per job with systemctl start housekeeping@jobid, and perhaps the jobid could be passed to the housekeeping environment via the template argument instead of writing to a file in /run?

(Even in the exclusive-node case, this would allow a sysadmin to quickly determine which job a housekeeping script was running for.)

Just an idea.

@garlick (Member, Author) commented Jun 13, 2024

What happens on systems where nodes are shared? Can the housekeeping unit be run for two different jobs at the same time?

If multiple jobs sharing a node start the epilog at once, the first one in will start the unit and the rest will block until that one is done. The script would run with environment variables associated with the first job.

Yeah, I see your point. This is maybe not the best idea for the epilog on those systems (or maybe on any systems) where the prolog and epilog contain matched do/undo actions that apply to a single job. Perhaps this should just be part of the housekeeping PR.

@garlick (Member, Author) commented Jun 13, 2024

perhaps if the housekeeping unit file was made into a template, it could be started per job with systemctl start housekeeping@jobid

That would solve that problem I think.

If we're thinking of housekeeping as doing stuff that is relatively independent of a job then I like the single unit and the idempotency for that.

We could go all in with this approach and add prolog@jobid and epilog@jobid units I suppose! :-)

@grondo (Contributor) commented Jun 13, 2024

If we're thinking of housekeeping as doing stuff that is relatively independent of a job then I like the single unit and the idempotency for that.

Yeah, I had kind of forgotten about that.

However, the risk there is that not all per-job prolog/epilog scripts will be guaranteed not to hang or get stuck, and if housekeeping is the only way that partial release can work, then the tendency will be to place even per-job scripts into housekeeping, especially on systems allocating exclusive nodes. If they can't be, then we might be stuck like we are now, holding back all nodes of a job when one node has a stuck epilog.

@garlick (Member, Author) commented Jun 13, 2024

Well, that is a compelling point.

One nice thing I just learned is that both journalctl -u and systemctl status accept glob patterns for the unit name.

Maybe I'll just play with templated units a bit and see what I can learn.

@garlick (Member, Author) commented Jun 14, 2024

Re-pushed with a templated unit file, housekeeping@jobid. It sets FLUX_JOB_ID based on the template argument. I was able to configure it to automatically garbage collect, so when you do systemctl status housekeeping@* you only see what's currently running (not past failed units). journalctl -u housekeeping@* works too.
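
The unit file itself isn't shown in this thread, but a sketch of the described behavior might look like this (the option choices are assumptions):

[Unit]
Description=Node Maintenance for Flux job %i
# Garbage-collect failed instances so "systemctl status housekeeping@*"
# shows only what's currently running.
CollectMode=inactive-or-failed

[Service]
Type=oneshot
# %i expands to the instance string, i.e. the jobid in housekeeping@jobid.
Environment=FLUX_JOB_ID=%i
ExecStart=/etc/flux/system/housekeeping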

This doesn't pass the other documented perilog environment variables FLUX_OWNER_USERID, FLUX_JOB_USERID, HOME, and USER, so maybe we need some way to do that. (Aside: those should probably be added to flux-environment(7).)

@garlick (Member, Author) commented Jun 14, 2024

Pushed another update that reinstates the environment copy via /run.

Would there be interest in having separate units for prolog and epilog? Then the housekeeping unit could just be added later.

It seems like the benefit of running these site-provided scripts under systemd is high. I'm pretty sure systemd could reliably stop a fork bomb with its freeze/signal strategy, or, on later kernels and systemd versions, the cgroup.kill button. Plus, having output go to the local journal means logs for the sequence of runs on a particular node can be viewed together, and the ability to run systemctl status and see at a glance what processes are still running when something is stuck is pretty handy.
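
For context, cgroup.kill is the cgroup v2 control file added in kernel 5.14; writing 1 to it kills every process in the cgroup in one shot, e.g.:

# kill everything in the unit's cgroup (cgroup v2, kernel >= 5.14)
echo 1 > /sys/fs/cgroup/system.slice/housekeeping.service/cgroup.kill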

@grondo (Contributor) commented Jun 14, 2024

Yes, that seems like a good idea, though prolog and epilog are currently just arbitrary, site-configured scripts that are run via the IMP, so should the systemd unit files also be site configuration? (Just thinking out loud here.)

Plus, having output go to the local journal means logs for the sequence of runs on a particular node can be viewed together,

We should check with the admins - I think currently output from the prolog and epilog is (somehow) forwarded to the management node so that it can be searched in one place. As long as there's still a way to do that, this should be fine.

the ability to run systemctl status and see at a glance what processes are still running when something is stuck is pretty handy.

FYI, pstree -Tplu flux works great for this now (and doesn't have the weird auto-pager behavior that systemctl status has). I presume this change would break that approach; however, that's probably fine as long as there is some way to do it. systemctl status would also be able to show orphan processes, so that's a clear win.

Another thought, should these service units perhaps have flux in the name? There could be another housekeeping or prolog/epilog service offered by another project, or even a site local one.

@garlick (Member, Author) commented Jun 14, 2024

Yes, that seems like a good idea, though prolog and epilog are currently just arbitrary, site-configured scripts that are run via the IMP, so should the systemd unit files also be site configuration? (Just thinking out loud here.)

If we're encouraging this approach (which I think we should), then there is probably a support advantage in having it work uniformly at all sites rather than letting each site cook up its own unit files. Plus, providing the unit files reduces the inertia required to get it done.

I guess it's obvious, but if we implement a "real" node prolog/epilog in place of overloading the job manager perilog, then this approach of having the IMP start a systemd unit could be retained.

I think currently output from the prolog and epilog is (somehow) forwarded to the management node so that it can be searched in one place. As long as there's still a way to do that, this should be fine.

This change would make it so that only an indication of failure is logged to flux dmesg, e.g.

[ +11.295160] job-manager[0]: ƒ4rs2fRJWN3: epilog: stderr: ƒ4rs2fRJWN3: rank 0 failed epilog, draining

(Hey, that log message should include the hostname.)

The detailed output/errors would go to the journal. Any thoughts on that @ryanday36?

I assumed that logs that go to the systemd journal are collectable in ways already familiar to the sysadmins (like rsyslog).
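
For instance, a site could have journald hand messages to a local rsyslog instance for forwarding off-node (an assumption about site setup, not something established in this thread):

# /etc/systemd/journald.conf
[Journal]
ForwardToSyslog=yes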

pstree -Tplu flux works great for this now

Yeah, this process tree starts out as root, so that won't work :-( I mean something like this, but it kinda sucks to type:

$ pstree -Tpl $(pgrep housekeeping)
housekeeping(139410)───stress.sh(139417)───stress(139418)─┬─stress(139419)
                                                          ├─stress(139420)
                                                          ├─stress(139421)
                                                          └─stress(139422)

Another thought, should these service units perhaps have flux in the name? There could be another housekeeping or prolog/epilog service offered by another project, or even a site local one.

Agreed.

@grondo (Contributor) commented Jun 14, 2024

If we're encouraging this approach (which I think we should), then there is probably a support advantage in having it work uniformly at all sites rather than letting each site cook up its own unit files. Plus, providing the unit files reduces the inertia required to get it done.

OK, that sounds great. I'm all for this approach. It also gives us a chance to somehow recover the state of prolog/epilog/housekeeping workloads after a restart of the local broker or system instance.

@garlick changed the title from "systemd: add housekeeping service unit" to "systemd: add prolog/epilog service units" on Jun 14, 2024
@garlick (Member, Author) commented Jun 14, 2024

OK, this now adds prolog/epilog service units, not housekeeping.

I added an update to the admin guide also.

@garlick (Member, Author) commented Jun 14, 2024

Just pushed some tweaks to the admin guide description to fix my poor word choices and also add the --with-imp option to the perilog-run example.

@garlick (Member, Author) commented Jul 2, 2024

I fixed this up based on what we learned wrapping the housekeeping scripts in a systemd unit.

@grondo (Contributor) left a comment

This LGTM!

@garlick (Member, Author) commented Jul 8, 2024

I didn't notice the approval - thanks! Setting MWP.

Problem: prolog/epilog scripts should be run in a systemd cgroup
for reliable termination, logging to the local systemd journal, and
debugging using well known systemd tools.

Add systemd oneshot service units for prolog and epilog, patterned
after the housekeeping one, and helper scripts that can be configured
as IMP run commands.
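
By analogy with the housekeeping configuration shown earlier, the IMP would be pointed at the new helpers; the section and helper names below follow that pattern but are guesses, not taken from the PR:

[run.prolog]
allowed-users = [ "flux" ]
allowed-environment = [ "FLUX_*" ]
# helper path/name assumed by analogy with flux-run-housekeeping above
path = "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-run-prolog"
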
mergify bot merged commit e2c21ee into flux-framework:master on Jul 8, 2024
35 checks passed
@garlick deleted the hk_unit branch on August 7, 2024
codecov bot commented Oct 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.32%. Comparing base (cc505fd) to head (3fdae54).
Report is 562 commits behind head on master.

@@            Coverage Diff             @@
##           master    #6040      +/-   ##
==========================================
- Coverage   83.33%   83.32%   -0.02%     
==========================================
  Files         521      521              
  Lines       84391    84391              
==========================================
- Hits        70325    70316       -9     
- Misses      14066    14075       +9     

see 8 files with indirect coverage changes
