
systemd: add prolog/epilog service units #6040

Merged Jul 8, 2024 (1 commit)

Conversation

@garlick (Member) commented Jun 13, 2024

Problem: future housekeeping scripts should be run in a systemd cgroup for reliable termination, logging to the local systemd journal, and debugging using well-known systemd tools.

Add a systemd "oneshot" service unit for housekeeping. The service unit runs a user-provided housekeeping script. It is configured so that systemctl start housekeeping blocks until the run is complete and its exit code reflects the exit code of the housekeeping script.
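
For illustration, a minimal sketch of what such a oneshot unit could look like; the option values are assumptions based on the behavior described here, not the exact file from this PR:

[Unit]
Description=Node Maintenance for Flux

[Service]
Type=oneshot
# Type=oneshot makes "systemctl start housekeeping" block until ExecStart
# exits, and the unit's result reflects the script's exit code.
ExecStart=/etc/flux/system/housekeeping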

Add a helper script that can be configured as an IMP run command. It runs systemctl start housekeeping and traps SIGTERM, which can be sent to enforce a timeout. Upon receipt of SIGTERM, it stops the unit and exits with a nonzero code.

To enable environment variables such as FLUX_JOB_ID to be passed into the user-provided housekeeping script via the systemd unit, the helper script dumps its environment into /run/housekeeping.env, which is read in by the unit.
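
A sketch of that helper behavior in shell; this is illustrative, not the script shipped in the PR, and assumes the unit reads the file back with EnvironmentFile=/run/housekeeping.env:

#!/bin/sh
# Dump the IMP-provided environment (FLUX_JOB_ID etc.) where the
# unit's EnvironmentFile= directive can pick it up.
env >/run/housekeeping.env
# On a timeout SIGTERM, stop the unit and exit nonzero.
trap 'systemctl stop housekeeping; exit 1' TERM
# Start in the background so the trap can run while we wait.
systemctl start housekeeping &
wait $!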

Also of note: the user-provided scripts are automatically idempotent when run this way. systemd never starts multiple instances of the unit. If one is running when a start request is received, the second start blocks until the existing run finishes and reports its status.
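
In shell terms:

$ systemctl start housekeeping &   # first request starts the script
$ systemctl start housekeeping     # blocks until the run above finishes,
                                   # then reports that run's status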

I split this out to a separate PR because I think it can be used to improve the existing epilog even before housekeeping, as proposed in #5818, is available. For example, on my test system, I renamed /etc/flux/system/epilog to housekeeping and configured the IMP with

[run.housekeeping]
allowed-users = [ "flux" ]
allowed-environment = [ "FLUX_*" ]
path = "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-run-housekeeping"

and the job manager with

[job-manager.epilog]
command = [
   "flux", "perilog-run", "epilog",
   "--timeout=15s",
   "--with-imp",
   "-e", "housekeeping"
]

And now my epilog runs in the housekeeping unit. After a job runs, journalctl -u housekeeping says:

Jun 13 11:33:49 picl7 systemd[1]: Starting Node Maintenance for Flux...
Jun 13 11:33:49 picl7 housekeeping[55656]: running stress.sh
Jun 13 11:33:49 picl7 housekeeping[55663]: stress: info: [55663] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
Jun 13 11:33:59 picl7 housekeeping[55663]: stress: info: [55663] successful run completed in 10s
Jun 13 11:33:59 picl7 systemd[1]: housekeeping.service: Succeeded.
Jun 13 11:33:59 picl7 systemd[1]: Finished Node Maintenance for Flux.
Jun 13 11:33:59 picl7 systemd[1]: housekeeping.service: Consumed 39.989s CPU time.

(My epilog just burns CPU for 10s.)

systemctl status housekeeping provides the usual informative output:

● housekeeping.service - Node Maintenance for Flux
     Loaded: loaded (/lib/systemd/system/housekeeping.service; static)
     Active: activating (start) since Thu 2024-06-13 11:51:21 PDT; 3s ago
   Main PID: 55746 (housekeeping)
      Tasks: 7 (limit: 1599)
     Memory: 856.0K
        CPU: 15.185s
     CGroup: /system.slice/housekeeping.service
             ├─55746 /bin/sh /etc/flux/system/housekeeping
             ├─55749 /bin/sh /etc/flux/system/housekeeping.d/stress.sh
             ├─55750 stress -c 4 -t 10
             ├─55751 stress -c 4 -t 10
             ├─55752 stress -c 4 -t 10
             ├─55753 stress -c 4 -t 10
             └─55754 stress -c 4 -t 10

Jun 13 11:51:21 picl7 systemd[1]: Starting Node Maintenance for Flux...
Jun 13 11:51:21 picl7 housekeeping[55746]: running stress.sh
Jun 13 11:51:21 picl7 housekeeping[55750]: stress: info: [55750] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd

@grondo (Contributor) commented Jun 13, 2024

What happens on systems where nodes are shared? Can the housekeeping unit be run for two different jobs at the same time?

@grondo (Contributor) commented Jun 13, 2024

I forgot most of what I know about systemd unit file templates, but perhaps if the housekeeping unit file was made into a template, it could be started per job with systemctl start housekeeping@jobid, and perhaps the jobid could be passed to the housekeeping environment via the template argument instead of writing to a file in /run?

(Even in the exclusive-node case, this would allow a sysadmin to quickly determine which job a housekeeping script was running for.)

Just an idea.

@garlick (Member, Author) commented Jun 13, 2024

What happens on systems where nodes are shared? Can the housekeeping unit be run for two different jobs at the same time?

If multiple jobs sharing a node start the epilog at once, the first one in will start the unit and the rest will block until that one is done. The script would run with environment variables associated with the first job.

Yeah, I see your point. This is maybe not the best idea for the epilog on those systems (or maybe on any systems) where the prolog and epilog contain matched do/undo actions that apply to a single job. Perhaps this should just be part of the housekeeping PR.

@garlick (Member, Author) commented Jun 13, 2024

perhaps if the housekeeping unit file was made into a template, it could be started per job with systemctl start housekeeping@jobid

That would solve that problem I think.

If we're thinking of housekeeping as doing stuff that is relatively independent of a job then I like the single unit and the idempotency for that.

We could go all in with this approach and add prolog@jobid and epilog@jobid units I suppose! :-)

@grondo (Contributor) commented Jun 13, 2024

If we're thinking of housekeeping as doing stuff that is relatively independent of a job then I like the single unit and the idempotency for that.

Yeah, I had kind of forgotten about that.

However, the risk there is that not all per-job prolog/epilog scripts will be guaranteed not to hang or get stuck, and if housekeeping is the only way that partial release can work, then the tendency will be to place even per-job scripts into housekeeping, especially on systems allocating exclusive nodes. If they can't be, then we might be stuck like we are now, holding back all nodes of a job when one node has a stuck epilog.

@garlick (Member, Author) commented Jun 13, 2024

Well, that is a compelling point.

One nice thing I just learned is that both journalctl -u and systemctl status accept glob patterns for the unit name.

Maybe I'll just play with templated units a bit and see what I can learn.

@garlick (Member, Author) commented Jun 14, 2024

Re-pushed with a templated unit file, housekeeping@jobid. It sets FLUX_JOB_ID based on the template argument. I was able to configure it to automatically garbage collect, so when you do systemctl status housekeeping@* you only see what's currently running (not past failed units). journalctl -u housekeeping@* works too.
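
The unit file itself isn't shown in this thread, but a sketch of the described behavior might look like this (the option choices are assumptions):

[Unit]
Description=Node Maintenance for Flux job %i
# Garbage-collect failed instances so "systemctl status housekeeping@*"
# shows only what's currently running.
CollectMode=inactive-or-failed

[Service]
Type=oneshot
# %i expands to the instance string, i.e. the jobid in housekeeping@jobid.
Environment=FLUX_JOB_ID=%i
ExecStart=/etc/flux/system/housekeeping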

This doesn't pass the other documented perilog environment variables FLUX_OWNER_USERID, FLUX_JOB_USERID, HOME, and USER, so maybe we need some way to do that. (Aside: those should probably be added to flux-environment(7).)

@garlick (Member, Author) commented Jun 14, 2024

Pushed another update that reinstates the environment copy via /run.

Would there be interest in having separate units for prolog and epilog? Then the housekeeping unit could just be added later.

It seems like the benefit of running these site-provided scripts under systemd is high. I'm pretty sure systemd could reliably stop a fork bomb with its freeze/signal strategy, or, on later kernels and systemd versions, the cgroup.kill button. Plus, having output go to the local journal means logs for the sequence of runs on a particular node can be viewed together, and the ability to run systemctl status and see at a glance what processes are still running when something is stuck is pretty handy.
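
For context, cgroup.kill is the cgroup v2 control file added in kernel 5.14; writing 1 to it kills every process in the cgroup in one shot, e.g.:

# kill everything in the unit's cgroup (cgroup v2, kernel >= 5.14)
echo 1 > /sys/fs/cgroup/system.slice/housekeeping.service/cgroup.kill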

@grondo (Contributor) commented Jun 14, 2024

Yes, that seems like a good idea, though prolog and epilog are currently just arbitrary, site-configured scripts that are run via the IMP, so should the systemd unit files also be site configuration? (Just thinking out loud here.)

Plus, having output go to the local journal means logs for the sequence of runs on a particular node can be viewed together,

We should check with the admins - I think currently output from the prolog and epilog is (somehow) forwarded to the management node so that it can be searched in one place. As long as there's still a way to do that, this should be fine.

the ability to run systemctl status and see at a glance what processes are still running when something is stuck is pretty handy.

FYI, pstree -Tplu flux works great for this now (and doesn't have the weird auto-pager behavior that systemctl status has). I presume this change would break that approach; however, that's probably fine as long as there is some way to do it. systemctl status would also be able to show orphan processes, so that's a clear win.

Another thought, should these service units perhaps have flux in the name? There could be another housekeeping or prolog/epilog service offered by another project, or even a site local one.

@garlick (Member, Author) commented Jun 14, 2024

Yes, that seems like a good idea, though prolog and epilog are currently just arbitrary, site-configured scripts that are run via the IMP, so should the systemd unit files also be site configuration? (Just thinking out loud here.)

If we're encouraging this approach (which I think we should), then there is probably a support advantage in having it work uniformly at all sites rather than letting each site cook up its own unit files. Plus, providing the unit files reduces the inertia required to get it done.

I guess it's obvious, but if we implement a "real" node prolog/epilog in place of overloading the job manager perilog, then this approach of having the IMP start a systemd unit could be retained.

I think currently output from the prolog and epilog is (somehow) forwarded to the management node so that it can be searched in one place. As long as there's still a way to do that, this should be fine.

This change would make it so that only an indication of failure is logged to flux dmesg, e.g.

[ +11.295160] job-manager[0]: ƒ4rs2fRJWN3: epilog: stderr: ƒ4rs2fRJWN3: rank 0 failed epilog, draining

(Hey, that log message should include the hostname.)

The detailed output/errors would go to the journal. Any thoughts on that @ryanday36?

I assumed that logs that go to the systemd journal are collectable in ways already familiar to the sysadmins (like rsyslog).
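
For instance, a site could have journald hand messages to a local rsyslog instance for forwarding off-node (an assumption about site setup, not something established in this thread):

# /etc/systemd/journald.conf
[Journal]
ForwardToSyslog=yes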

pstree -Tplu flux works great for this now

Yeah, this process tree starts out as root, so that won't work :-( I mean something like this, but it kinda sucks to type:

$ pstree -Tpl $(pgrep housekeeping)
housekeeping(139410)───stress.sh(139417)───stress(139418)─┬─stress(139419)
                                                          ├─stress(139420)
                                                          ├─stress(139421)
                                                          └─stress(139422)

Another thought, should these service units perhaps have flux in the name? There could be another housekeeping or prolog/epilog service offered by another project, or even a site local one.

Agreed.

@grondo (Contributor) commented Jun 14, 2024

If we're encouraging this approach (which I think we should), then there is probably a support advantage in having it work uniformly at all sites rather than letting each site cook up its own unit files. Plus, providing the unit files reduces the inertia required to get it done.

OK, that sounds great. I'm all for this approach. It also gives us a chance to somehow recover the state of prolog/epilog/housekeeping workloads after a restart of the local broker or system instance.

@garlick changed the title from "systemd: add housekeeping service unit" to "systemd: add prolog/epilog service units" on Jun 14, 2024
@garlick (Member, Author) commented Jun 14, 2024

OK, this now adds prolog/epilog service units, not housekeeping.

I added an update to the admin guide also.

@garlick (Member, Author) commented Jun 14, 2024

Just pushed some tweaks to the admin guide description to fix my poor word choices and also add the --with-imp option to the perilog-run example.

@garlick (Member, Author) commented Jul 2, 2024

I fixed this up based on what we learned wrapping the housekeeping scripts in a systemd unit.

@grondo (Contributor) left a comment

This LGTM!

@garlick (Member, Author) commented Jul 8, 2024

I didn't notice the approval - thanks! Setting MWP.

Problem: prolog/epilog scripts should be run in a systemd cgroup
for reliable termination, logging to the local systemd journal, and
debugging using well known systemd tools.

Add systemd oneshot service units for prolog and epilog, patterned
after the housekeeping one, and helper scripts that can be configured
as IMP run commands.
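
By analogy with the housekeeping configuration shown earlier, the IMP would be pointed at the new helpers; the section and helper names below follow that pattern but are guesses, not taken from the PR:

[run.prolog]
allowed-users = [ "flux" ]
allowed-environment = [ "FLUX_*" ]
# helper path/name assumed by analogy with flux-run-housekeeping above
path = "/usr/lib/aarch64-linux-gnu/flux/cmd/flux-run-prolog"
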
mergify bot merged commit e2c21ee into flux-framework:master on Jul 8, 2024
35 checks passed
@garlick deleted the hk_unit branch on August 7, 2024
codecov bot commented Oct 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.32%. Comparing base (cc505fd) to head (3fdae54).
Report is 562 commits behind head on master.

@@            Coverage Diff             @@
##           master    #6040      +/-   ##
==========================================
- Coverage   83.33%   83.32%   -0.02%     
==========================================
  Files         521      521              
  Lines       84391    84391              
==========================================
- Hits        70325    70316       -9     
- Misses      14066    14075       +9     

see 8 files with indirect coverage changes
