Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flaky multipathd service #803

Closed
jlebon opened this issue Apr 28, 2021 · 9 comments · Fixed by coreos/fedora-coreos-config#1233
Closed

flaky multipathd service #803

jlebon opened this issue Apr 28, 2021 · 9 comments · Fixed by coreos/fedora-coreos-config#1233

Comments

@jlebon
Copy link
Member

jlebon commented Apr 28, 2021

Sometimes, the generic kola check for failed systemd services trips on multipathd:

--- FAIL: coreos.tls.fetch-urls (120.75s)
        harness.go:976: Cluster failed starting machines: machine "5cba2624-8aa0-4fbb-9023-cc0287c3f1e7" failed basic checks: some systemd units failed:
● multipathd.service loaded failed failed Device-Mapper Multipath Device Controller

This one was hit during this week's next release:
coreos/fedora-coreos-streams#296

kola.zip

@dustymabe
Copy link
Member

Failed in the coreos.ignition.security.tls test in jenkins / kola-openstack / #297

Jul 31 23:59:52.678273 udevadm[471]: systemd-udev-settle.service is deprecated. Please fix multipathd-configure.service, multipathd.service not to pull it in.
Jul 31 23:59:52.769000 audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=multipathd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Jul 31 23:59:52.779152 multipathd[497]: --------start up--------
Jul 31 23:59:52.779152 multipathd[497]: read /etc/multipath.conf
Jul 31 23:59:52.782074 multipathd[497]: /etc/multipath.conf does not exist, blacklisting all devices.
Jul 31 23:59:52.782074 multipathd[497]: You can run "/sbin/mpathconf --enable" to create
Jul 31 23:59:52.782074 multipathd[497]: /etc/multipath.conf. See man mpathconf(8) for more details
Jul 31 23:59:52.782074 multipathd[497]: /etc/multipath.conf does not exist, blacklisting all devices.
Jul 31 23:59:52.782074 multipathd[497]: You can run "/sbin/mpathconf --enable" to create
Jul 31 23:59:52.782074 multipathd[497]: /etc/multipath.conf. See man mpathconf(8) for more details
Jul 31 23:59:52.794873 multipathd[497]: path checkers start up
Jul 31 23:59:52.794873 multipathd[497]: failed to increase buffer size
Aug  1 00:00:00.433545 multipathd[838]: 9.535643 | /etc/multipath.conf does not exist, blacklisting all devices.
Aug  1 00:00:00.433545 multipathd[838]: 9.535689 | You can run "/sbin/mpathconf --enable" to create
Aug  1 00:00:00.433545 multipathd[838]: 9.535697 | /etc/multipath.conf. See man mpathconf(8) for more details
Aug  1 00:00:00.441325 multipathd[838]: error -5 receiving packet
Aug  1 00:00:00.444174 multipathd[497]: --------shut down-------
Aug  1 00:00:00.437429 systemd[1]: multipathd.service: Control process exited, code=exited, status=1/FAILURE
Aug  1 00:00:00.475000 audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=multipathd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Aug  1 00:00:00.474215 systemd[1]: multipathd.service: Failed with result 'exit-code'.
Aug  1 00:00:01.594815 multipathd[838]: 9.535643 | /etc/multipath.conf does not exist, blacklisting all devices.
Aug  1 00:00:01.594825 multipathd[838]: 9.535689 | You can run "/sbin/mpathconf --enable" to create
Aug  1 00:00:01.594834 multipathd[838]: 9.535697 | /etc/multipath.conf. See man mpathconf(8) for more details
Aug  1 00:00:01.594843 multipathd[838]: error -5 receiving packet
Aug  1 00:00:01.594863 multipathd[497]: --------shut down-------
Aug  1 00:00:01.594881 systemd[1]: multipathd.service: Control process exited, code=exited, status=1/FAILURE
Aug  1 00:00:01.594938 systemd[1]: multipathd.service: Failed with result 'exit-code'.
Aug  1 00:00:01.597306 systemd[1]: Listening on multipathd control socket.
Aug  1 00:00:02.260813 udevadm[979]: systemd-udev-settle.service is deprecated. Please fix multipathd.service not to pull it in.

@dustymabe
Copy link
Member

In https://jenkins-fedora-coreos.apps.ocp.ci.centos.org/blue/organizations/jenkins/kola-openstack/detail/kola-openstack/317/pipeline

21:22:30      --- FAIL: ostree.hotfix/rollback (33.74s)
21:22:30              unlock.go:248: Failed to reboot machine: machine "26587a9f-63fd-4ec3-9f71-7c485ab78c55" failed basic checks: some systemd units failed:
21:22:30      multipathd.service

@dustymabe
Copy link
Member

in https://jenkins-fedora-coreos.apps.ocp.ci.centos.org/blue/organizations/jenkins/kola-aws/detail/kola-aws/578/pipeline

[2021-09-14T19:10:54.706Z] --- FAIL: rhcos.selinux.boolean.persist (425.31s)
[2021-09-14T19:10:54.706Z]         selinux.go:83: failed to reboot machine: machine "i-0e8d2ea4dd5dc2fbb" failed basic checks: some systemd units failed:
[2021-09-14T19:10:54.706Z] multipathd.service

Can someone dig into this?

@lucab
Copy link
Contributor

lucab commented Sep 17, 2021

I had a look at this today and I think this is a race between the ExecStop=/sbin/multipathd shutdown in the dracut unit and the daemon actually shutting down.
I managed to reproduce it outside of initrd by having two shells tight-looping through multipathd shutdown on one side and systemctl restart multipathd.service in parallel on the other side. It seems a quite rare race as I only observed it a couple of time while leaving these parallel-shutdown loops running for ~30 minutes.

The multipath codebase is a rough C one with concurrency baked on bare pthread synchronization, so I'll just postulate there is a "socket I/O race on exit" hiding somewhere in it.
The upstream real-root service unit does not have any ExecStop= setting, so I think a good fix here is just to re-align the unit in dracut by dropping the stop command.

@lucab
Copy link
Contributor

lucab commented Sep 17, 2021

PR at dracutdevs/dracut#1606.

lucab added a commit to lucab/dracut that referenced this issue Sep 23, 2021
This removes the 'ExecStop=' field from `multipathd.service`.
Sometimes CI runs do encounter a failure related to this
service in initrd, which seems to be stemming from a socket
I/O race between the client and the server on shutdown.
It looks like the client (`multipathd shutdown`) can lose the race,
hit an I/O error, and cause the whole unit to fail (even if the server
managed to shutdown properly already).

Notably, the upstream unit does not have such stop command
as the daemon can already perform a graceful exit through
its signal handler.

As such, this commit partially re-aligns the two units,
trying to sidestep any of the existing races.

Refs:
 * coreos/fedora-coreos-tracker#803
 * https://github.com/opensvc/multipath-tools/blob/0.8.7/multipathd/multipathd.service
johannbg pushed a commit to dracutdevs/dracut that referenced this issue Sep 23, 2021
This removes the 'ExecStop=' field from `multipathd.service`.
Sometimes CI runs do encounter a failure related to this
service in initrd, which seems to be stemming from a socket
I/O race between the client and the server on shutdown.
It looks like the client (`multipathd shutdown`) can lose the race,
hit an I/O error, and cause the whole unit to fail (even if the server
managed to shutdown properly already).

Notably, the upstream unit does not have such stop command
as the daemon can already perform a graceful exit through
its signal handler.

As such, this commit partially re-aligns the two units,
trying to sidestep any of the existing races.

Refs:
 * coreos/fedora-coreos-tracker#803
 * https://github.com/opensvc/multipath-tools/blob/0.8.7/multipathd/multipathd.service
@lucab
Copy link
Contributor

lucab commented Sep 24, 2021

The PR got merged upstream. As it's going to take some time to land in a new dracut packaged release, coreos/fedora-coreos-config#1233 carries an equivalent workaround through our config overrides.

@dustymabe
Copy link
Member

The fix for this went into testing stream release 34.20211004.2.0. Please try out the new release and report issues.

@dustymabe dustymabe added status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. and removed status/pending-testing-release Fixed upstream. Waiting on a testing release. labels Oct 18, 2021
@dustymabe
Copy link
Member

The fix for this went into stable stream release 34.20211004.3.1

@dustymabe dustymabe removed the status/pending-stable-release Fixed upstream and in testing. Waiting on stable release. label Oct 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants