systemd: improved startup failure detection for simple services #24778
@@ -457,6 +487,21 @@ def main():
(rc, out, err) = module.run_command("%s %s '%s'" % (systemctl, action, unit))
if rc != 0:
    module.fail_json(msg="Unable to %s service %s: %s" % (action, unit, err))

if module.params['detect_early_failure']:
does this make sense with a stop? in most cases 'reload' is just sending a signal to the daemon ... not sure it applies to that action either.
You can reproduce it for "stop" as well. If ExecStop is defined and fails, the service goes into the "failed" state (the daemon will still be terminated). Question is, do we want to report this as a failure or not.
I agree that it makes little sense for "reload", nor for non-service units (e.g. mount units). So maybe restrict it to just starting and stopping of service units?
and restart, of course
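A rough sketch of the restriction discussed above (only start, stop and restart, and only for service units). The helper name should_check_for_failure and the suffix list are assumptions made for illustration, not code from this PR; the sketch relies on systemctl treating a unit name without a recognised type suffix as a .service unit.

```python
# Hypothetical helper, not part of the PR: decide whether the
# early-failure check applies to a given unit and action.

# Unit type suffixes other than ".service" (illustrative list).
NON_SERVICE_SUFFIXES = (
    ".socket", ".device", ".mount", ".automount", ".swap",
    ".target", ".path", ".timer", ".slice", ".scope",
)

# Actions for which the check was agreed to make sense in this thread.
CHECKED_ACTIONS = ("start", "stop", "restart")


def is_service_unit(unit):
    """Treat anything without a non-service suffix as a service unit."""
    return not unit.endswith(NON_SERVICE_SUFFIXES)


def should_check_for_failure(unit, action):
    """True only for start/stop/restart of service units."""
    return is_service_unit(unit) and action in CHECKED_ACTIONS
```

With this guard, a reload or a mount unit would skip the check entirely.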
…art,stop,restart of service units
SUMMARY
The current systemd module doesn't detect early startup failures in services with type=simple, i.e. services whose ExecStart command should be a long-running foreground process. If that process exits with an error immediately, e.g. because of a faulty command line, the systemd module won't detect this and will report success. This is because systemctl start <service>, for type=simple services, just asks systemd to fire off the ExecStart command and then exits successfully (which is reasonable, since systemd cannot know how the started binary will behave or when and how it will exit, so it only starts it and reports success). This is problematic because your playbook may run successfully and still leave the machine with services that failed to start.
Something similar can happen when stopping a service with a failing ExecStop command (if defined): the service will enter the "failed" state instead of "inactive", but the systemd module will report success.
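To make the gap concrete, here is a minimal standalone sketch of a post-start check: start the unit, wait briefly, then ask systemd for ActiveState. It is not the module code from this PR; the function names, the injectable runner and sleep arguments, and the messages are assumptions chosen so the logic can be exercised without systemd. The only commands used are systemctl start and systemctl show --property=ActiveState.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return (rc, stdout, stderr) as text."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def parse_active_state(show_output):
    """Extract the ActiveState value from `systemctl show` KEY=VALUE output."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "ActiveState":
            return value.strip()
    return None


def start_and_check(unit, timeout=1.0, runner=run, sleep=time.sleep):
    """Start a unit, wait `timeout` seconds, then check for the failed state.

    Returns (ok, message). This is a heuristic: a service that fails
    after the wait has elapsed is not caught.
    """
    rc, _out, err = runner(["systemctl", "start", unit])
    if rc != 0:
        return False, "Unable to start service %s: %s" % (unit, err)

    # Give a type=simple service a moment to exit if it is going to.
    sleep(timeout)

    _rc, out, _err = runner(["systemctl", "show", "--property=ActiveState", unit])
    state = parse_active_state(out)
    if state == "failed":
        return False, "Service %s entered the failed state after start" % unit
    return True, "Service %s is %s" % (unit, state)
```

Calling start_and_check("myservice.service") on a real host runs the two systemctl commands; passing a fake runner makes the same logic testable offline.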
This PR introduces a module parameter detect_early_failure (bool, default false) and an associated early_failure_timeout (float, default 1.0). If detect_early_failure is set to true, the systemd module, after successfully performing the requested state change (start/stop), waits early_failure_timeout seconds and then checks whether the service has entered the failed state, and if so, reports an error.
This is obviously a heuristic approach, but fwiw it's the best we can do to detect startup failures of simple services. In an ideal world, all services would be of type=notify, and would notify systemd (via sd_notify(3)) when they've initialized successfully, or otherwise exit without notifying systemd. For those services, systemd waits until either the notification arrives or the service exits without notifying, and thus the Ansible systemd module works correctly without detect_early_failure. But as long as not all services have been modified to use sd_notify, this flag is useful.
ISSUE TYPE
COMPONENT NAME
systemd
ANSIBLE VERSION
ADDITIONAL INFORMATION
Sample session to reproduce the behaviour:
Test service, type=simple, ExecStart is a command line that fails immediately:
Start it:
=> reports success. (the fact that the returned status.ActiveState is inactive is arguably an upstream bug -- it returns the service status from immediately before the change)
However, the service has failed immediately due to the wrong ExecStart command line:
With detect_early_failure enabled, the startup failure is detected correctly: