
jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service #12147

Merged
4 commits merged into ceph:jewel on Jan 20, 2017

Conversation

@ghost ghost commented Nov 23, 2016

@ghost ghost self-assigned this Nov 23, 2016
@ghost ghost added this to the jewel milestone Nov 23, 2016
@ghost ghost added bug-fix core labels Nov 23, 2016
@ghost ghost changed the base branch from jewel to jewel-next November 23, 2016 07:31
ghost commented Nov 23, 2016

jenkins test this please (changing base)

ghost pushed a commit that referenced this pull request Nov 23, 2016
…m at boot time

Reviewed-by: Loic Dachary <ldachary@redhat.com>
@ghost ghost changed the title jewel: OSD udev / systemd may race with lvm at boot time jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service Dec 1, 2016
@ghost ghost changed the title jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service DNM: jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service Dec 1, 2016
ghost commented Dec 1, 2016

DNM because the backport is incomplete; it is missing #12241

@ghost ghost changed the title DNM: jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service jewel: ceph-disk: ceph-disk@.service races with ceph-osd@.service Dec 5, 2016
A ceph udev action may be triggered before the local file systems are
mounted because there is no ordering in udev. The ceph udev action
delegates asynchronously to systemd via ceph-disk@.service which will
fail if (for instance) the LVM partition required to mount /var/lib/ceph
is not available yet. The systemd unit will retry a few times but will
eventually fail permanently. The sysadmin can run systemctl reset-failed at a
later time and it will succeed.

Add a dependency to ceph-disk@.service so that it waits until the local
file systems are mounted:

After=local-fs.target

Since local-fs.target depends on lvm, it will wait until the lvm
partition (as well as any dm devices) is ready and mounted before
attempting to activate the OSD. It may still fail because the
corresponding journal/data partition is not ready yet (which is
expected) but it will no longer fail because the lvm/filesystems/dm are
not ready.

Fixes: http://tracker.ceph.com/issues/17889

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit d954de5)
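
For reference, the ordering described above can be expressed as a systemd unit
directive. A minimal sketch, assuming a drop-in override rather than a direct
edit of the packaged ceph-disk@.service (the drop-in path is hypothetical):

   # /etc/systemd/system/ceph-disk@.service.d/local-fs.conf (hypothetical drop-in)
   [Unit]
   # do not trigger OSD activation before local file systems (including the
   # LVM-backed mount of /var/lib/ceph) are available
   After=local-fs.target
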
The udev rules that set the owner/group of the OSD devices are racing
with 50-udev-default.rules and depending on which udev event fires last,
ownership may not be as expected.

Since ceph-disk trigger --sync runs as root and always happens after the
dm/lvm/filesystem units are complete and before activation, it is a good
time to set the ownership of the device.

It does not eliminate all races: a script running after systemd
local-fs.target and firing a udev event may create a situation where the
permissions of the device are temporarily reverted while the activation
is running.

Fixes: http://tracker.ceph.com/issues/17813

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 72f0b2a)
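
Conceptually, the ownership fix performed by ceph-disk trigger --sync before
activation boils down to something like the following. A minimal sketch,
assuming the default ceph:ceph owner and the /dev/sdb1 example partition:

   # hypothetical illustration: make sure the OSD partition is owned by the
   # ceph user/group before the OSD daemon tries to open it
   chown ceph:ceph /dev/sdb1
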
Instead of the default 100ms pause before trying to restart an OSD, wait
20 seconds and retry 30 times instead of 3. There is no scenario
in which restarting an OSD almost immediately after it failed would get
a better result.

It is possible that a failure to start is due to a race with another
systemd unit at boot time. For instance if ceph-disk@.service is
delayed, it may start after the OSD that needs it. A long pause may give
the racing service enough time to complete and the next attempt to start
the OSD may succeed.

This is not a sound way to resolve a race; it only makes the OSD boot
process less sensitive to it. In the example above, the proper fix is to
enable ceph-osd@.service with --runtime so that it cannot race at boot time.

The wait delay should not be on the order of minutes, to preserve the
current runtime behavior. For instance, if an OSD is killed or fails and
only restarts after 10 minutes, it will be marked down by the ceph cluster.
Such a change would not break anything, but it is significant and should be
avoided.

Refs: http://tracker.ceph.com/issues/17889

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit b388737)
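
In systemd terms, the retry policy described above corresponds to settings
along these lines. A minimal sketch using the standard directive names; the
exact layout of the shipped ceph-osd@.service may differ:

   [Service]
   Restart=on-failure    # restart the OSD when it exits with a failure
   RestartSec=20s        # wait 20 seconds instead of the 100ms default
   StartLimitBurst=30    # allow 30 restart attempts instead of 3
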
If ceph-osd@.service is enabled for a given device (say /dev/sdb1 for
osd.3) the ceph-osd@3.service will race with ceph-disk@dev-sdb1.service
at boot time.

Enabling ceph-osd@3.service is not necessary at boot time because

   ceph-disk@dev-sdb1.service

calls

   ceph-disk activate /dev/sdb1

which calls

   systemctl start ceph-osd@3

The systemctl enable/disable ceph-osd@.service called by ceph-disk
activate is changed to add the --runtime option so that the ceph-osd
enablement is lost after a reboot. It is recreated when ceph-disk activate
is called at boot time so that:

   systemctl stop ceph

knows which ceph-osd@.service to stop when a script or sysadmin wants
to stop all ceph services.

Before enabling ceph-osd@.service (which happens at every boot), make sure
the permanent enablement in /etc/systemd is removed so that only the one
added by systemctl enable --runtime in /run/systemd remains. This matters
when upgrading an existing cluster: otherwise the situation would be even
worse than before, with ceph-disk@.service racing against two
ceph-osd@.service units (one in /etc/systemd and one in /run/systemd).

Fixes: http://tracker.ceph.com/issues/17889

Signed-off-by: Loic Dachary <loic@dachary.org>
(cherry picked from commit 539385b)
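
The resulting enable/start sequence might look like this. A minimal sketch
using the osd.3 / /dev/sdb1 example above:

   # drop any permanent enablement left over from a previous version
   systemctl disable ceph-osd@3
   # enable only for the current boot; the symlink is created under /run/systemd
   systemctl enable --runtime ceph-osd@3
   systemctl start ceph-osd@3
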
ghost pushed a commit that referenced this pull request Dec 5, 2016
… with ceph-osd@.service

Reviewed-by: Loic Dachary <ldachary@redhat.com>
@ghost ghost changed the base branch from jewel-next to jewel December 21, 2016 23:28
ghost commented Dec 21, 2016

check this please

wido commented Jan 20, 2017

I would really like to see this one merged and backported into Jewel 10.2.6. This issue is hitting multiple users I know of.

@smithfarm

@dachary This PR was included in the integration branch [1], which already passed a ceph-disk suite at [2].

[1] http://tracker.ceph.com/issues/17851#note-17
[2] http://tracker.ceph.com/issues/17851#note-18

OK to merge?

@ghost ghost merged commit 174ed80 into ceph:jewel Jan 20, 2017