Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do not udevadm trigger, ceph-disk does it #365

Merged
merged 1 commit into from Oct 1, 2015
Merged

Conversation

ghost
Copy link

@ghost ghost commented Sep 30, 2015

There is no need for osd create to call udevadm trigger action=add. By
running all actions associated with add for all block devices, it may
also trigger some race conditions. For instance, the following happens
about 10% of the time, when using dmcrypt:

  • ceph-disk prepare --dmcrypt /dev/vdb
  • /dev/vdb1 starts activating via udev
  • ceph-deploy triggers action=add
  • /dev/vdb2 udev chown root /dev/vdb2
  • /dev/vdb1 activate fails because of permission denied on /dev/vdb2
  • /dev/vdb2 udev action chown ceph /dev/vdb2

http://tracker.ceph.com/issues/13299 Fixes: #13299

Signed-off-by: Loic Dachary loic@dachary.org

There is no need for osd create to call udevadm trigger action=add. By
running all actions associated with add for all block devices, it may
also trigger some race conditions. For instance, the following happens
about 10% of the time, when using dmcrypt:

 * ceph-disk prepare --dmcrypt /dev/vdb
 * /dev/vdb1 starts activating via udev
 * ceph-deploy triggers action=add
 * /dev/vdb2 udev chown root /dev/vdb2
 * /dev/vdb1 activate fails because of permission denied on /dev/vdb2
 * /dev/vdb2 udev action chown ceph /dev/vdb2

http://tracker.ceph.com/issues/13299 Fixes: #13299

Signed-off-by: Loic Dachary <loic@dachary.org>
@ceph-jenkins
Copy link
Collaborator

Refer to this link for build results (access rights to CI server needed):
http://jenkins.ceph.com//job/ceph-deploy-pull-requests/1798/
Test PASSed.

@ghost ghost assigned liewegas Sep 30, 2015
@liewegas
Copy link
Member

Nice catch! Can you retarget this at infernalis branch?

Which udev rule is doing the root chown? Is that anything that is in our control? It seems like we avoid the situation most of the time now but the possibliity is always there that we'll get unlucky...

@liewegas
Copy link
Member

in any case, Reviewed-by:

@ghost
Copy link
Author

ghost commented Sep 30, 2015

Which udev rule is doing the root chown? Is that anything that is in our control? It seems like we avoid the situation most of the time now but the possibliity is always there that we'll get unlucky...

It is my understanding that when a udev rule starts, it chown root (or recreate the file maybe, with default owner ?) and then the chown happens shortly afterwards but that leaves a window of opportunity.

@liewegas
Copy link
Member

liewegas commented Oct 1, 2015

Perhaps the "fix" is to not blindly trigger events on all devices, but only on the specific device that changes?

liewegas added a commit that referenced this pull request Oct 1, 2015
do not udevadm trigger, ceph-disk does it

Reviewed-by: Sage Weil <sage@redhat.com>
@liewegas liewegas merged commit fce4154 into master Oct 1, 2015
@liewegas liewegas deleted the wip-13000-activate branch October 1, 2015 00:21
@codenrhoden
Copy link
Contributor

When looking at some of the ceph-disk changes recently, I saw that it seemed to do all the necessary udevadm calls needed, and that made me think of places I had seen it in ceph-deploy and wonder if those were still necessary.

Glad to see this cleanup! Thanks a bunch.

@ghost
Copy link
Author

ghost commented Oct 1, 2015

Perhaps the "fix" is to not blindly trigger events on all devices, but only on the specific device that changes?

I think that's what ceph-disk does now.

@ghost
Copy link
Author

ghost commented Oct 1, 2015

When looking at some of the ceph-disk changes recently, I saw that it seemed to do all the necessary udevadm calls needed

It always did the necessary udevadm / partprobe. The bugs were when it did too many and it created race conditions. For instance calling partprobe actually removes devices before adding them again. I let you imagine what that can do to an ongoing osd activation :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants