
systemd: improved startup failure detection for simple services #24778

Open · wants to merge 2 commits into devel
Conversation

@multi-io commented May 18, 2017

SUMMARY

The current systemd module doesn't detect early startup failures in services with type=simple, i.e. services whose ExecStart command is expected to be a long-running foreground process. If that process exits with an error immediately, e.g. because of a faulty command line, the systemd module won't detect this and will report success. The reason is that systemctl start <service>, for type=simple services, merely asks systemd to fire off the ExecStart command and then exits successfully (which is reasonable, given that systemd cannot know how the started binary will behave or when and how it will exit, so it only starts it and reports success). This is obviously problematic, as your playbook may run successfully and still leave the machine behind with services that failed to start.

Something similar can happen when stopping a service with a failing ExecStop command (if defined) -- the service will enter the "failed" state instead of the "inactive" state, but the systemd module will report success.

This PR introduces a module parameter detect_early_failure (bool, default false) and an associated early_failure_timeout (float, default 1.0). If detect_early_failure is set to true, the systemd module, after successfully performing the requested state change (start/stop), waits early_failure_timeout seconds, then checks whether the service has entered the failed state, and if so reports an error.

This is obviously a heuristic approach, but FWIW it's the best we can do to detect startup failures of simple services. In an ideal world, all services would be of type=notify and would notify systemd (via sd_notify(3)) once they've initialized successfully, or otherwise exit without notifying systemd. For such services, systemd waits until either the notification arrives or the service exits without it, so the Ansible systemd module works correctly without detect_early_failure. But as long as not all services have been modified to use sd_notify, this flag is useful.
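For illustration, a minimal sketch of what this check could look like inside the module (this is not the PR's actual diff; the exact parameter handling and systemctl invocation here are assumptions based on the description above):

import time

# Hypothetical excerpt; assumes `module`, `systemctl`, `action` and `unit`
# exist as in the surrounding module code, and that the requested state
# change has already been issued successfully.
if module.params['detect_early_failure']:
    # Give the service a moment to fail early (heuristic, see above).
    time.sleep(module.params['early_failure_timeout'])
    # Query the *current* ActiveState rather than the status captured
    # before the state change.
    (rc, out, err) = module.run_command("%s show -p ActiveState '%s'" % (systemctl, unit))
    if rc == 0 and out.strip() == 'ActiveState=failed':
        module.fail_json(msg="Failed to %s service %s: Service went into failed state early" % (action, unit))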

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME

systemd

ANSIBLE VERSION
ansible 2.4.0 (devel 859729fb01) last updated 2017/05/18 15:23:55 (GMT +200)
  config file = /Users/oklischat/src/ansible/ansible.cfg
  configured module search path = [u'/Users/oklischat/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /Users/oklischat/src/ansible/lib/ansible
  executable location = /Users/oklischat/src/ansible/bin/ansible
  python version = 2.7.10 (default, Oct 23 2015, 19:19:21) [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)]
ADDITIONAL INFORMATION

Sample session to reproduce the behaviour:

Test service with type=simple, whose ExecStart command line fails immediately:

root@ubuntu-xenial:~# cat /etc/systemd/system/earlyfailtest.service
[Unit]
Description=earlyfailtest service
Before=multi-user.target
Conflicts=shutdown.target

[Service]
Type=simple
IgnoreSIGPIPE=no
ExecStart=/bin/ls -wrongparams

[Install]
WantedBy=multi-user.target

root@ubuntu-xenial:~#

Start it:

$ ansible -i inventory -u ubuntu -b testhost -m systemd -a 'name=earlyfailtest.service state=started'
testhost | SUCCESS => {
    "changed": true,
    "name": "earlyfailtest.service",
    "state": "started",
    "status": {
        "ActiveEnterTimestampMonotonic": "0",
        "ActiveExitTimestampMonotonic": "0",
        "ActiveState": "inactive",
....
    }
}
$ 

=> reports success. (The fact that the returned status.ActiveState is "inactive" is arguably an upstream bug -- it returns the service status from immediately before the change.)

However, the service has failed immediately due to the wrong ExecStart command line:

root@ubuntu-xenial:~# systemctl status earlyfailtest.service
● earlyfailtest.service - earlyfailtest service
   Loaded: loaded (/etc/systemd/system/earlyfailtest.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2017-05-18 13:56:42 UTC; 1min 27s ago
  Process: 29887 ExecStart=/bin/ls -wrongparams (code=exited, status=2)
 Main PID: 29887 (code=exited, status=2)

May 18 13:56:42 ubuntu-xenial systemd[1]: Started earlyfailtest service.
May 18 13:56:42 ubuntu-xenial ls[29887]: /bin/ls: invalid line width: ‘rongparams’
May 18 13:56:42 ubuntu-xenial systemd[1]: earlyfailtest.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
May 18 13:56:42 ubuntu-xenial systemd[1]: earlyfailtest.service: Unit entered failed state.
May 18 13:56:42 ubuntu-xenial systemd[1]: earlyfailtest.service: Failed with result 'exit-code'.
root@ubuntu-xenial:~#

With detect_early_failure enabled, the startup failure is detected correctly:

$ ansible -i inventory -u ubuntu -b testhost -m systemd -a 'name=earlyfailtest.service state=started detect_early_failure=yes'
testhost | FAILED! => {
    "changed": false,
    "failed": true,
    "msg": "Failed to start service earlyfailtest.service: Service went into failed state early"
}
(stack2)oklischat@oklischat:~/src/ansible$

@ansibot ansibot added affects_2.4 core_review feature_pull_request module needs_triage labels May 18, 2017
@bcoca bcoca self-assigned this May 18, 2017
@bcoca bcoca removed the needs_triage label May 18, 2017
@@ -457,6 +487,21 @@ def main():
        (rc, out, err) = module.run_command("%s %s '%s'" % (systemctl, action, unit))
        if rc != 0:
            module.fail_json(msg="Unable to %s service %s: %s" % (action, unit, err))

        if module.params['detect_early_failure']:
@bcoca (Member) commented May 18, 2017
does this make sense with a stop? in most cases 'reload' is just sending a signal to the daemon ... not sure it applies to that action either.

@multi-io (Author) commented May 18, 2017

You can reproduce it for "stop" as well. If ExecStop is defined and fails, the service goes into the "failed" state (the daemon will still be terminated). The question is: do we want to report this as a failure or not?

I agree that it makes little sense for "restart", and likewise for non-service units (e.g. mount units). So maybe restrict it to just starting and stopping of service units?
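For illustration, such a restriction could be a small guard along these lines (a hypothetical sketch; `action` and `unit` are assumed to be the module's local variables):

def early_failure_check_applies(action, unit):
    # Heuristic early-failure detection only makes sense when starting or
    # stopping a .service unit -- not for reload/restart, and not for
    # non-service units such as mount units.
    return action in ('start', 'stop') and unit.endswith('.service')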

@bcoca (Member) commented May 18, 2017

and restart, of course

@ansibot ansibot added stale_ci support:core labels Jun 23, 2017
@ansibot ansibot added the new_contributor label Oct 18, 2017
@ansibot ansibot added needs_rebase needs_revision and removed core_review new_contributor labels Nov 3, 2017
@ansibot ansibot added the new_contributor label Feb 6, 2018
@ansibot ansibot added feature and removed feature_pull_request labels Mar 2, 2018
@bcoca bcoca added the P3 label Nov 16, 2018
@ansibot ansibot added the system label Feb 17, 2019
@ansibot ansibot added collection collection:community.general needs_collection_redirect support:community and removed support:core labels May 17, 2020
@ansibot ansibot added support:core and removed collection:community.general needs_collection_redirect support:community labels May 27, 2020
@ansibot ansibot added pre_azp and removed stale_ci labels Dec 7, 2020
Labels
affects_2.4 collection feature has_issue module needs_rebase needs_revision new_contributor P3 pre_azp support:core system