Support all kinds of systemd units #941

Open · wants to merge 1 commit into base: stretch

Conversation


@autra autra commented Apr 19, 2020

The problem

It is impossible to add systemd units of types other than services when installing an app.

This is because yunohost always appends '.service' to the unit name.

I wanted this change because in this PR for diaspora I want to add only the .target unit file to yunohost services, and not the 2 or 3 systemd services diaspora uses. And then I thought it'd be cool to be able to track all sorts of units (mounts could be useful, for instance).

Fixes YunoHost/issues#1519

Solution

Make yunohost behave like systemd: only append .service if the unit name does not already end with a known unit suffix.
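As a sketch (hypothetical function name, not the actual yunohost code), using the unit suffixes systemd recognizes:

```python
# Unit suffixes systemd knows about; a name already carrying one of
# these should be left untouched.
UNIT_SUFFIXES = (
    ".service", ".socket", ".device", ".mount", ".automount", ".swap",
    ".target", ".path", ".timer", ".slice", ".scope",
)

def normalize_unit_name(name):
    """Append '.service' only when `name` has no known unit suffix,
    mimicking systemctl's own behaviour."""
    if name.endswith(UNIT_SUFFIXES):
        return name
    return name + ".service"
```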

As only service units can be "running", we need to accept both "running" and "active" as successful states. Therefore, some changes are needed in yunohost-admin too (I'll open a PR soon).

PR Status

I tested this on my instance; it needs a companion PR in yunohost-admin too.

How to test

Use `sudo yunohost service add` on a socket or device unit, for instance (you can list all units with `systemctl list-units`).

Validation

  • Principle agreement: 0/2
  • Quick review: 0/1
  • Simple test: 0/1
  • Deep review: 0/1

```diff
@@ -2790,7 +2790,7 @@ def replace_alias(service):
 services.append("fail2ban")

 # List services currently down and raise an exception if any are found
-faulty_services = [s for s in services if service_status(s)["status"] != "running"]
+faulty_services = [s for s in services if service_status(s)["status"] not in ["running", "active"]]
```
@alexAubin (Member) Apr 22, 2020

Running systemctl to see the status of all units, I see that some are also:

  • plugged (for .device)
  • mounted (for .mount)
  • listening (some .socket but not all ..)
  • waiting (for .timer)

I don't know if you have more insight on this and/or if we should consider all "okay" states ... We could also flatten the whole thing when we're fetching the status, like:

```python
if status in ["running", "active", "listening", "mounted", ...]:
    status = "running"
    # or something like "up" or idk
```

... or we could also assume nobody's gonna use any funky service type :D but the point is to support many service types, and now that I read this PR, I think for example borg_ynh uses two timer services... (which are probably "waiting") though they are not registered in yunohost

@autra (Contributor, author) Apr 22, 2020

Oh, you're right, there's a bunch of other possible states. Reading the doc, we actually have 3 sets of states (each with different possible values):

  • LOAD state: basically whether or not systemd found a valid unit file
  • ACTIVE state: general state, common to every unit type
  • SUB state: substate, specific to each unit type

The possible values (given by systemctl --state=help) can change from release to release, although I expect those for LOAD and ACTIVE to remain quite stable. Actually, between stretch and buster, only the ACTIVE states didn't change: LOAD has a new value (bad-setting), and SUB has several changes.
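The three sets can be read for a given unit with `systemctl show -p LoadState,ActiveState,SubState <unit>`, which prints `Key=Value` lines. A minimal parsing sketch (the function name is mine, not actual yunohost code):

```python
def parse_unit_states(show_output):
    """Parse the `Key=Value` lines printed by
    `systemctl show -p LoadState,ActiveState,SubState <unit>`."""
    states = {}
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            states[key] = value
    return states

# What the output looks like for a healthy service unit:
example = "LoadState=loaded\nActiveState=active\nSubState=running"
```

In real code the output would come from something like `subprocess.check_output(["systemctl", "show", ...])`; the parsing is the stable part.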

So from an admin user's POV, I propose to:

  • use ACTIVE state to decide whether it's green or red (and the little associated picto)
  • use SUB state for the displayed text in all cases

For the mapping between ACTIVE state and the color + picto, it'd be nice to actually have 3 states:

  • red for "failed" and maybe "inactive" (but not sure, see below)
  • green for "active",
  • orange for all the -ing states: reloading, activating, deactivating.
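A sketch of that mapping (hypothetical names; the exact lists are my reading of the proposal above, with "inactive" tentatively treated as red):

```python
# Hypothetical mapping from systemd's ActiveState to a display color;
# SubState would provide the displayed text separately.
ACTIVE_STATE_COLORS = {
    "failed": "red",
    "inactive": "red",  # debatable, see the "inactive" question below
    "active": "green",
    "reloading": "orange",
    "activating": "orange",
    "deactivating": "orange",
}

def color_for(active_state):
    # Unknown states default to red rather than green, to be safe.
    return ACTIVE_STATE_COLORS.get(active_state, "red")
```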

For this PR, I could:

  • (have a look at 30-services.py and)
  • use both the active state AND the sub-state. I think I'd put ActiveState into status, and SubState into statusText or something like that.

One question I still have: do we really want to treat "inactive" as an error? Put another way, one-shot service units that are activated by a condition will be "inactive" most of the time. Do we expect/allow app packagers to add them to yunohost services? I would say that doesn't really make sense, so I'd vote for treating "inactive" as an error.

WDYT @alexAubin ?

@alexAubin (Member) Apr 22, 2020

To give you the full story: I think previously we were using the ACTIVE state to know whether or not a service is running ... or maybe we were in fact displaying both ACTIVE and SUB. That's good most of the time, but unfortunately there are stupid edge cases to consider, and the bottom line imho is that relying on systemd's data alone is not enough to get a reliable answer about the status of a service. Typically, some services are active/exited. Many people are confused by the fact that 'exited' may not be bad, yet it's not clear what/why that happens... (We had several users asking if it was okay that postfix or yunohost-firewall was exited.) The most notable examples are:

  • postfix, which actually launches postfix@-, which is the real service. In fact I saw a situation where, if the actual postfix process got killed, postfix would still appear as a happy active/exited service, and you'd spend an additional 3 hours looking for wtf is going on before finally restarting the service out of desperation, and it turns out that works because it restarted the "real" service postfix@-. So naively you absolutely want to avoid displaying those services as "happy green" when they are in fact broken, even though systemctl tells you "active/exited" in green.
  • similar thing with postgresql and postgresql@9.6-main
  • I recently understood that it's all related to the RemainAfterExit and ConsistsOf / PartOf stuff (c.f. systemctl show postfix | grep ConsistsOf and similar for postfix@- and PartOf)
  • yunohost-firewall is a similar service, though different (is it Type=oneshot? idk) ... it just feeds a bunch of rules to iptables then exits. So definitely not a daemon. For this kind of stuff, I created in 3.8 a test_status key in services.yml which allows specifying a test command to know if the service is "running" or not. And for the firewall, I just do a stupid grep on iptables-save, which is probably not really accurate but good enough for a first iteration...
  • same kind of funky stuff with slapd, which is by default configured with RemainAfterExit=yes (it seems to be the case for all legacy init.d services ... to be confirmed). So if slapd was killed because of OOM, systemd would still display the service as green active/exited. This specific slapd case was fixed recently because we override the setting, but it could be the case for other legacy init.d services...
  • today somebody on 3.8.1 reports that the diagnosis shows Lufi as failed, because it's also active/exited ... (c.f. the service config, which shows RemainAfterExit=yes; my understanding is that the carton command just creates a child then exits). In fact, for this kind of scenario, imho we would need the packager to specify the test_status command for us to have accurate info (or could there be some trick to do with the .pid thing? idk...)

Sooo sorry for the wall of text, but to summarize: I basically agree with you, BUT imho we can't trust this 'active' state from systemd, and we should either design a strategy based on the values of RemainAfterExit / ConsistsOf or other available info, AND/OR have some custom tests for very specific services...

@alexAubin (Member) Apr 22, 2020

But note that we're getting carried away from the original problem this PR tries to address ... We can also decide to just forget about all this and go for a solution that hard-codes the usual good values for non-.service units and flattens them into a simple running (or ok or whatever), and that could be good enough 😅

@alexAubin (Member) Apr 22, 2020

Annnnd yeah (sorry for flooding) I realize what I describe is a priori only about .service units, and maybe we can just unify the two problems:

```python
if not unit.endswith(".service"):
    # We can trust the active / substate data
else:
    # We have to do some trickery about the "exited" state
```

@autra (Contributor, author) Apr 25, 2020

Thanks for the detailed answer!

Yes, we are a bit off-topic for this PR, but let's continue the discussion anyway, because I feel like it's gonna lead us somewhere :-) Another wall of text ahead.

TL;DR:

  • for this PR, let's do as you said in your last comment.
  • for a more long-term vision, please read ahead

> I recently understood that it's all related to the RemainAfterExit and ConsistsOf / PartOf stuff (c.f. systemctl show postfix | grep ConsistsOf and similar for postfix@- and PartOf)

How would you feel if we followed the ConsistsOf relationship and also tested whether the "child" units are ok before considering a ynh service to be green? I feel like the spirit of yunohost services is more like "pieces of functionality constantly running" than real OS services or daemons anyway. From a user's POV, this seems more relevant. If I take the diaspora example, the admin does not care whether sidekiq or unicorn is ok; all they want to know is whether or not diaspora as a whole is ok, and if not, to try to restart it (and in this case, restarting the target will restart everything). This behaviour should also cover the postfix and postgresql cases, what do you think?

For more advanced users, either they can ssh in and restart things individually, or in the future we could maybe display subservices in the service detail page? Then admins would be able to see exactly why a unit is considered failed.
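A sketch of what "following ConsistsOf" could look like (hypothetical names; in reality the states and the ConsistsOf lists would be fetched from systemd, here they are plain dicts):

```python
def aggregate_status(unit, active_states, consists_of):
    """Consider `unit` ok only if it *and* every unit listed in its
    ConsistsOf= are in ActiveState "active". Sketch of the proposal
    in this thread, not actual yunohost code."""
    children = consists_of.get(unit, [])
    return all(active_states.get(u) == "active" for u in [unit] + children)

# The postfix example from this thread: the parent looks "active"
# even though the real service postfix@- is dead.
states = {"postfix.service": "active", "postfix@-.service": "failed"}
children = {"postfix.service": ["postfix@-.service"]}
```

With these inputs, `aggregate_status("postfix.service", states, children)` is False, i.e. exactly the "don't show happy green for a broken postfix" behaviour discussed earlier.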

For yunohost-firewall: I'd personally be ok with it being green if the initial invocation succeeded. I also like what you have done (comparing the actual content of the iptables rules to the ones that should be there, if I understand correctly). I feel like it's really what I was talking about above: the admin just wants to know whether or not the firewall is ok and doesn't really care about running processes...

> same kind of funky stuff with slapd which is by default configured with RemainAfterExit=yes (seems like it's the case for all legacy init.d services ... to be confirmed). So if slapd was killed because of OOM, systemd would still display the service as green active/exited. This specific slapd case was fixed recently because we override the setting, but could be the case for other legacy init.d services...

Yes, and that's strange... because from what I've tested, old init.d scripts are created as Type=forking, RemainAfterExit=yes, and, still from what I've tested, if the forked process exits with a code != 0, the service is correctly marked as failed even with RemainAfterExit=yes. That being said, all those tests weren't done with the same systemd version, and I can indeed reproduce this funky stuff with slapd on my ynh instance, so maybe it's something to do with systemd versions... I'm gonna run some more tests.

> today somebody in 3.8.1 reports that the diagnosis shows Lufi as failed, because it's also active/exited ... (c.f. [the service config](https://github.com/YunoHost-Apps/lufi_ynh/blob/master/conf/systemd.service#L11))

I think that's because Lufi's systemd file should actually be of Type=forking.

Maybe it has been done that way (Type=simple instead of forking) because of the confusing way the systemd doc is phrased:

> If set to simple (the default if ExecStart= is specified but neither Type= nor BusName= are), the service manager will consider the unit started immediately after the main service process has been forked off.

The fork of the main service process is actually not the fork of the carton process. My understanding is that the startup of a service unit goes like this:

  • systemd forks itself
  • the forked process invokes the ExecStart command
  • this command may fork again (then the unit should be Type=forking, the carton case) or not (then it should be Type=simple, and systemd will display it as failed if the ExecStart command exits with a return code != 0).

So I guess we could trust systemd for units of Type=simple. If such a unit exits with code 0 when it's not supposed to, I guess that's more a bug in the script itself, right?
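For illustration (a made-up daemon, not any real app's unit file), the two unit shapes being contrasted here:

```ini
# A command that daemonizes (forks and exits, like carton here) needs
# Type=forking, so systemd tracks the real child via the PID file:
[Service]
Type=forking
ExecStart=/usr/bin/some-daemon --daemonize
PIDFile=/run/some-daemon.pid

# A command that stays in the foreground should instead use:
#   Type=simple
#   ExecStart=/usr/bin/some-daemon --foreground
# and systemd marks the unit failed if ExecStart exits non-zero.
```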

@alexAubin (Member) Apr 25, 2020

> How would you feel if we followed the ConsistOf relationship and also test if "child" units are ok before considering a ynh service to be green?

Yes, that sounds like the logical thing to do ... During some previous work I tried exactly this, and I'm not sure what happened but I ended up not doing it, and I don't remember why ... I ended up just having a key "actual_systemd_service" that is used when checking the status and logs ... Maybe that's because it raised too many questions, like "what if there are multiple ConsistsOf?" or "when displaying the logs, what should we display, the parent or the child(s)?" @_@

> I feel like the spirit of yunohost services is more like "pieces of functionality constantly running" than like real OS services or daemons anyway. From a user POV, this seems more relevant. If I take the diaspora example, the admin does not care if sidekiq or unicorn is ok, all they want to know is whether or not diaspora as a whole is ok

Yes indeed, but are you suggesting that we create some service like $appname.service that would ConsistOf the subservices added by the app? That would be pretty fancy indeed, but the complexity cost of actually implementing it makes me think it's not worth what it brings?

> I think that's because Lufi systemd file should actually be of forking Type.

Created a PR here: YunoHost-Apps/lufi_ynh#36, though I'm not 100% sure about all the implications it can have (maybe there's some tight coupling with ynh_systemd_action here, idk; I should run a package check to be sure).

> So I guess we could trust systemd for units of type "simple". If it exits with code =0 when it's not supposed to, I guess it's more a bug in the script itself right?

Uuuuuh I don't know haha I can't think properly enough right now, maybe too much coffee xD


@alexAubin alexAubin commented Apr 22, 2020

(As a side note, some service status check also happens in data/hooks/diagnosis/30-services.py ... we gotta fix that before merging ideally, once we've clarified what check to perform exactly)

@zamentur zamentur added this to the 4.3.x milestone Jan 3, 2021
@zamentur zamentur removed this from the 4.x milestone Jan 4, 2021
@zamentur zamentur added this to 4.x in Pending Jan 4, 2021

@autra autra commented Mar 15, 2022

All right, it's been a long time, so let me summarize what I intend to do/try:

  • PR 1 (this PR?): add support for the .target unit type only for now, not .mount, .socket, etc. That's actually what I need for diaspora, and it would make things simpler by limiting the number of edge cases we have to consider. Also change the "running"/"En cours d'exécution" phrasing to "Active" or "ok", just to be agnostic about whether the unit is really running or not.
  • PR 2: always follow the "ConsistsOf" relationship, because I don't see a case where we shouldn't in the context of yunohost services. It would be useful for .target as well as for .service units like postfix and postgresql.service, for instance (actually, seeing that both of them have ExecStart=/bin/true, they should be targets, but that's something to tell the debian maintainers). To be extra clear, I think our use case is entirely covered by using the ActiveState of both the considered unit and the units in its ConsistsOf. Additional benefit: units with RemainAfterExit=false will also be considered ok when their children are ok (which they should be, imo).
  • PR 2 bis: add a piece of interface in the service page listing subservices and their state ? I think it'd be a nice addition.
  • wait for the systemd version in bullseye to fix the slapd case: in bullseye, the state is correctly reported when the subprocess is down (systemd improved there, it tracks them now). The override file ynh installs is completely fine by me (in bullseye I think the RemainAfterExit is not really useful, but not problematic either).

And I think we'll be good. @alexAubin what do you think?


@autra autra commented Mar 15, 2022

Note: I've just seen the "actual_systemd_service" mechanism. I think it could be removed by supporting ConsistsOf, if we register mariadb instead of mysql. If I understand correctly, it's entirely for core services, so no strong opinion about it, except that I generally like to delete code :-)
