Support all kinds of systemd units #941
base: stretch
Conversation
```diff
@@ -2790,7 +2790,7 @@ def replace_alias(service):
     services.append("fail2ban")

     # List services currently down and raise an exception if any are found
-    faulty_services = [s for s in services if service_status(s)["status"] != "running"]
+    faulty_services = [s for s in services if service_status(s)["status"] not in ["running", "active"]]
```
Running `systemctl` to see the status of all units, I see that some are also:
- plugged (for .device)
- mounted (for .mount)
- listening (some .socket but not all...)
- waiting (for .timer)

I don't know if you have more insight on this and/or if we should consider all "okay" states... We could also flatten the whole thing when we're fetching the status, like:

```python
if status in ["running", "active", "listening", "mounted", ...]:
    status = "running"
    # or something like "up" or idk
```

...or we could also assume nobody's gonna use any funky service :D but the point is to support many unit types, and now that I read this PR, I think for example borg_ynh uses timer services... (which are probably "waiting") though they are not registered in yunohost.
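For illustration, a minimal sketch of that flattening idea (the set of "okay" sub-states and the helper name are hypothetical, not existing YunoHost code):

```python
# Hypothetical set of sub-states considered healthy; to be completed after
# reviewing the output of `systemctl --state=help`.
OK_SUBSTATES = {"running", "active", "listening", "mounted", "plugged", "waiting"}

def flatten_status(substate):
    # Collapse every healthy sub-state into a single canonical value,
    # so the rest of the code only has to know about "running".
    return "running" if substate in OK_SUBSTATES else substate
```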
Oh you're right, there's a bunch of other possible states. Reading the doc, we actually have three sets of states (each with different possible values):
- LOAD state: basically whether or not systemd found a valid unit file
- ACTIVE state: general state, common to all unit types
- SUB state: substate, specific to each unit type

The possible values (given by `systemctl --state=help`) can change from release to release, although I expect those for LOAD and ACTIVE to remain quite stable. Actually, between stretch and buster, only the ACTIVE states didn't change: LOAD has a new value (`bad-setting`), and SUB has several changes.
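To make the three sets concrete, here is a small sketch (not part of this PR) of how they can be fetched for a given unit via `systemctl show`:

```python
import subprocess

def unit_states(unit):
    # Fetch the LOAD / ACTIVE / SUB states of a unit.
    out = subprocess.check_output(
        ["systemctl", "show", unit, "--property=LoadState,ActiveState,SubState"]
    ).decode()
    # Output looks like "LoadState=loaded\nActiveState=active\nSubState=running"
    return dict(line.split("=", 1) for line in out.strip().splitlines())
```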
So from an admin user's POV, I propose to:
- use the ACTIVE state to decide whether it's green or red (and the little associated picto)
- use the SUB state for the displayed text in all cases

For the mapping between ACTIVE and the color + picto, it'd be nice to actually have 3 states:
- red for "failed" and maybe "inactive" (but not sure, see below)
- green for "active"
- orange for all the -ing states: reloading, activating, deactivating
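As a sketch, that mapping could look like this (hypothetical names, not actual code from this PR):

```python
# Hypothetical ACTIVE state -> color mapping implementing the proposal above.
ACTIVE_STATE_COLOR = {
    "active": "green",
    "reloading": "orange",
    "activating": "orange",
    "deactivating": "orange",
    "failed": "red",
    "inactive": "red",  # debatable, see the question below
}

def state_color(active_state):
    # Default to orange for unknown/transient states rather than lying green.
    return ACTIVE_STATE_COLOR.get(active_state, "orange")
```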
For this PR, I could:
- (have a look at 30-services.py and)
- use the ACTIVE AND SUB states. I think I'd put the ACTIVE state into status, and the substate into statusText or something like that.

One question I still have: do we really want to treat "inactive" as an error? Putting it another way, one-shot service units that are activated from a condition will be "inactive" most of the time. Do we expect/allow app packagers to add them to yunohost services? I would say that this does not really make sense, so I'd vote for treating "inactive" as an error.
WDYT @alexAubin ?
To give you the full story: I think previously we were using the ACTIVE state to know whether or not a service is running... Or maybe we were in fact displaying both the ACTIVE and SUB states. That's good most of the time, but unfortunately there are stupid edge cases to consider, and the bottom line imho is that relying on systemd's data alone is not enough to get a reliable answer regarding the status of a service. Typically, some services are active/exited. Many people are confused by the fact that 'exited' may not be bad, yet it's not clear what/why that happens... (We had several users asking if it was okay that postfix or yunohost-firewall was exited.) The most notable examples are:

- postfix, which actually launches postfix@-, which is the real service. In fact I saw a situation where, if the actual postfix process got killed, postfix would still appear as a happy active/exited service, and you'd spend an additional 3 hours looking for wtf is going on before finally restarting the service out of desperation, and it turns out that works because it restarted the "real" service postfix@-. So naively you want to absolutely avoid displaying those services as "happy green" if they are in fact broken, even though systemctl tells you "active/exited" in green.
- similar thing with postgresql and postgresql@9.6-main
- I recently understood that it's all related to the `RemainAfterExit` and `ConsistsOf`/`PartOf` stuff (c.f. `systemctl show postfix | grep ConsistsOf`, and similar for postfix@- and PartOf)
- yunohost-firewall is also a similar service, though different (is it a oneshot? idk)... it just feeds a bunch of rules to iptables then exits, so it's definitely not a daemon. For this kind of stuff, I created in 3.8 a `test_status` key in services.yml which allows specifying a test command to know whether the service is "running" or not. And for the firewall, I just do a stupid grep on iptables-save, which is probably not really accurate but good enough for a first iteration...
- same kind of funky stuff with `slapd`, which is by default configured with `RemainAfterExit=yes` (seems like it's the case for all legacy init.d services... to be confirmed). So if slapd was killed because of OOM, systemd would still display the service as green active/exited. This specific slapd case was fixed recently because we override the setting, but it could be the case for other legacy init.d services...
- today somebody in 3.8.1 reports that the diagnosis shows Lufi as failed, because it's also active/exited... (c.f. the service config, which shows RemainAfterExit=yes; my understanding is that the carton command just creates a child then exits). For this kind of scenario, imho we would need the packager to specify the test_status command for us to be able to have accurate info (or could it be that there's some trick to do with the .pid thing? idk...)

Sooo sorry for the wall of text, but to summarize: I basically agree with you, BUT imho we can't trust this 'active' state from systemd, and we should either design a strategy based on the values of RemainAfterExit / ConsistsOf / other available info, AND/OR have some custom tests for very specific services...
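For illustration, that combined strategy could look roughly like this (the `test_status` key in services.yml exists since 3.8 as described above; the function and dict shapes here are hypothetical, not the actual YunoHost API):

```python
import subprocess

def service_is_ok(name, conf):
    # If the service defines a custom test command (e.g. yunohost-firewall's
    # grep on iptables-save), trust that over systemd's state.
    test_cmd = conf.get("test_status")
    if test_cmd:
        return subprocess.call(test_cmd, shell=True) == 0
    # Otherwise fall back to systemd's ACTIVE state, with the caveats above.
    out = subprocess.check_output(
        ["systemctl", "show", name, "--property=ActiveState"]
    ).decode().strip()
    return out.split("=", 1)[1] == "active"
```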
But note that we're getting carried away from the original problem this PR tries to address... We can also decide to just forget about all this and go for a solution that hard-codes the usual good values for non-`.service` units and flattens them into a simple `running` (or `ok` or whatever), and that could be good enough.
Annnnd yeah (sorry for flooding), I realize that what I describe is a priori only about .service units, and maybe we can just factorize the two problematics:

```python
if not unit.endswith(".service"):
    # We can trust the active / substate data
    ...
else:
    # We have to do some trickeries about the "exited" state
    ...
```
Thanks for the detailed answer!

Yes, we are a bit off-topic for this PR, but let's continue the discussion anyway, because I feel like it's gonna lead us somewhere :-) Another wall of text ahead.

TL;DR:
- for this PR, let's do as you said in your last comment.
- for a more long-term vision, please read ahead.

> I recently understood that it's all related to the RemainAfterExit and ConsistsOf / PartOf stuff (c.f. systemctl show postfix | grep ConsistsOf and similar for postfix@- and PartOf)

How would you feel if we followed the ConsistsOf relationship and also tested whether "child" units are ok before considering a ynh service to be green? I feel like the spirit of yunohost services is more like "pieces of functionality constantly running" than real OS services or daemons anyway. From a user POV, this seems more relevant. If I take the diaspora example, the admin does not care whether sidekiq or unicorn is ok; all they want to know is whether or not diaspora as a whole is ok, and if not: try to restart it (and in this case, restarting the target will restart everything). This behaviour should also cover the cases of postfix and postgresql, what do you think?

For more advanced users, either they can ssh and restart things individually, or in the future we could maybe display subservices in the service detail page? Then admins would be able to see exactly why a unit is considered failed.
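For illustration, following ConsistsOf could be sketched like this (`show_prop` is a hypothetical helper; real code would also need to handle recursion and missing units):

```python
import subprocess

def show_prop(unit, prop):
    # Thin wrapper around `systemctl show <unit> --property=<prop>`.
    out = subprocess.check_output(["systemctl", "show", unit, "--property=" + prop])
    return out.decode().strip().split("=", 1)[1]

def unit_and_children_ok(unit):
    # A ynh service is green only if the unit itself and every unit it
    # ConsistsOf (e.g. postfix -> postfix@-.service) are active.
    children = show_prop(unit, "ConsistsOf").split()
    return all(show_prop(u, "ActiveState") == "active" for u in [unit] + children)
```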
For yunohost-firewall: I'd personally be ok with it being green if the initial invocation succeeded. I also like what you have done (comparing the actual content of the iptables rules to the ones that should be there, if I understand correctly). I feel like it is really what I was talking about above: the admin just wants to know whether or not the firewall is ok and doesn't really care about running processes...
> same kind of funky stuff with slapd which is by default configured with RemainAfterExit=yes (seems like it's the case for all legacy init.d services ... to be confirmed). So if slapd was killed because of OOM, systemd would still display the service as green active/exited. This specific slapd case was fixed recently because we override the setting, but could be the case for other legacy init.d services...

Yes, and that's strange... because from what I've tested, old init.d scripts are created as `Type=forking`, `RemainAfterExit=yes`, and, still from what I've tested, if the forked process exits with a code != 0, the service is correctly marked as failed even with RemainAfterExit=yes. That being said, all those tests weren't done with the same systemd version, and I can indeed reproduce this funky stuff with slapd on my ynh instance, so maybe there's something with systemd versions... I'm gonna run some more tests.
> today somebody in 3.8.1 reports that the diagnosis shows Lufi as failed, because it's also active/exited ... (c.f. [the service config](https://github.com/YunoHost-Apps/lufi_ynh/blob/master/conf/systemd.service#L11))
I think that's because the Lufi systemd file should actually be of the forking Type. Maybe it has been done that way (Type=simple instead of forking) because of the confusing way the systemd doc is phrased:

> If set to simple (the default if ExecStart= is specified but neither Type= nor BusName= are), the service manager will consider the unit started immediately after the main service process has been forked off.

The fork of the main service process is actually not the fork of the carton process. My understanding is that the startup process of a service unit goes like this:
- fork of the main systemd process
- invocation of the ExecStart command
- this command may itself fork (then it should be of the forking Type, the case of carton) or not (in this case it should be simple, and systemd will display it as failed if the ExecStart command exits with a return code != 0)

So I guess we could trust systemd for units of type "simple". If such a unit exits with code 0 when it's not supposed to, I guess it's more a bug in the script itself, right?
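Following that reasoning, a hedged sketch of a "can we trust the ACTIVE state" check (the property names are systemd's; the policy itself is just the idea discussed here, not settled code):

```python
import subprocess

def show_prop(unit, prop):
    out = subprocess.check_output(["systemctl", "show", unit, "--property=" + prop])
    return out.decode().strip().split("=", 1)[1]

def can_trust_active_state(unit):
    # For Type=simple services without RemainAfterExit, systemd's
    # active/sub state should reflect the actual process state.
    return (show_prop(unit, "Type") == "simple"
            and show_prop(unit, "RemainAfterExit") == "no")
```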
> How would you feel if we followed the ConsistsOf relationship and also tested whether "child" units are ok before considering a ynh service to be green?

Yes, that sounds like the logical thing to do... During some previous work I tried exactly this, and I'm not sure what happened, but I ended up not doing it and I don't remember why... I ended up just having a key "actual_systemd_service" that is used when checking the status and logs... Maybe that's because it brought too many questions, like "What if there are multiple ConsistsOf?" or "When displaying the logs, what should we display, the parent or the child(s)?" @_@

> I feel like the spirit of yunohost services is more like "pieces of functionality constantly running" than real OS services or daemons anyway. From a user POV, this seems more relevant. If I take the diaspora example, the admin does not care whether sidekiq or unicorn is ok; all they want to know is whether or not diaspora as a whole is ok

Yes indeed, but are you suggesting that we create some service like $appname.service that would ConsistsOf subservices added by the app? That would be pretty fancy indeed, but the complexity cost of actually implementing it makes me think it's not worth it compared to what it brings?
> I think that's because the Lufi systemd file should actually be of the forking Type.

Created a PR here: YunoHost-Apps/lufi_ynh#36, though I'm not 100% sure about all the implications it can have (maybe there's some tight coupling with ynh_systemd_action here, idk; I should run a package check to be sure).

> So I guess we could trust systemd for units of type "simple". If such a unit exits with code 0 when it's not supposed to, I guess it's more a bug in the script itself, right?

Uuuuuh I don't know haha, I can't think properly enough right now, maybe too much coffee xD
(As a side note, some service status checking also happens in ...)
All right, it's been a long time, so maybe I can summarize what I intend to do/try:

And I think we'll be good. @alexAubin what do you think?

Note: I've just seen the "actual_systemd_service" mechanism. I think it could be removed by supporting ConsistsOf, if we register mariadb instead of mysql. If I understand correctly, it's entirely for core services, so no strong opinion about it, except that I generally like to delete code :-)
The problem

It is impossible to add systemd units of types other than service when installing an app. This is due to the fact that yunohost always appends '.service' to the unit name.

I wanted this change because in this PR for diaspora I want to add only the `.target` unit file to yunohost services, and not the 2 or 3 systemd services for diaspora. And then I thought it'd be cool to be able to track all sorts of units (mounts could be useful, for instance).

Fixes YunoHost/issues#1519
Solution

Make yunohost behave like systemd: only append .service if the unit name does not end with a possible unit suffix.

As only service units can be "running", we need to test for both "running" and "active" to consider a unit to be in the successful state. Therefore, some changes are needed in yunohost-admin too (I'm opening a PR soon).
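A minimal sketch of that suffix rule (the suffix list mirrors systemd's unit types; the names are illustrative, not necessarily the exact code of this PR):

```python
# All unit suffixes systemd knows about; anything else gets ".service".
UNIT_SUFFIXES = (".service", ".socket", ".device", ".mount", ".automount",
                 ".swap", ".target", ".path", ".timer", ".slice", ".scope")

def normalize_unit_name(name):
    # Behave like systemd: only assume .service when no known suffix is given.
    return name if name.endswith(UNIT_SUFFIXES) else name + ".service"
```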
PR Status

Tested on my instance; needs a PR in yunohost-admin too.

How to test

Use `sudo yunohost service add` on a socket or device unit, for instance (find all the units with `systemctl list-units`).

Validation