
ceph-volume: cache the return value of util.disk.get_devices #37319

Closed
wants to merge 1 commit into master from wip-udev-fix

Conversation

guits
Contributor

@guits guits commented Sep 22, 2020

Calls to `get_devices` result in udev messages being
created because of the opening and closing of devices
in `util.disk.is_locked_raw_device`. In systems with large
numbers of devices, race conditions can occur with udev
that cause unintended consequences. This effect is magnified
when many devices are passed to a command like `lvm batch`, because
`get_devices` will be called for every device given, which then runs
`is_locked_raw_device` against every device on the system. These issues have
been observed in both https://bugzilla.redhat.com/show_bug.cgi?id=1822134
and https://bugzilla.redhat.com/show_bug.cgi?id=1878500.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1878500
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1822134

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
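
For illustration, a minimal sketch of the caching idea described above: scan the system once per process and reuse the result, so a command handling N devices does not re-open every device N times. The names (e.g. `_scan_devices`) are hypothetical; this is not the actual diff in this PR.

```python
# Hypothetical sketch of caching get_devices(); not the actual patch.
_device_cache = None


def _scan_devices(sys_block_path):
    """Placeholder for the expensive scan that opens and closes every
    block device (that open/close is what emits udev events)."""
    return {}


def get_devices(_sys_block_path='/sys/block'):
    """Return the device report, computing it only once per process."""
    global _device_cache
    if _device_cache is None:
        _device_cache = _scan_devices(_sys_block_path)
    return _device_cache
```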

@guits guits requested a review from a team as a code owner September 22, 2020 16:49
@guits
Contributor Author

guits commented Sep 22, 2020

jenkins test ceph-volume tox

@guits
Contributor Author

guits commented Sep 22, 2020

jenkins test ceph-volume all

Calls to `get_devices` result in udev messages being
created because of the opening and closing of devices
in `util.disk.is_locked_raw_device`. In systems with large
numbers of devices, race conditions can occur with udev
that cause unintended consequences. This effect is magnified
when many devices are passed to a command like `lvm batch`, because
`get_devices` will be called for every device given, which then runs
`is_locked_raw_device` against every device on the system. These issues have
been observed in both https://bugzilla.redhat.com/show_bug.cgi?id=1822134
and https://bugzilla.redhat.com/show_bug.cgi?id=1878500.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1822134
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1878500
Closes: https://tracker.ceph.com/issues/47502
Closes: https://tracker.ceph.com/issues/47585

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
@guits
Contributor Author

guits commented Sep 22, 2020

jenkins test ceph-volume all

@guits
Contributor Author

guits commented Sep 22, 2020

jenkins test ceph-volume tox

@guits
Contributor Author

guits commented Sep 22, 2020

Because it's a race condition I can't guarantee it 100%, but in all the tests I've run lately with this patch I couldn't reproduce the issue, whereas I could without it.

@jan--f
Contributor

jan--f commented Sep 23, 2020

Let's use https://docs.python.org/3/library/functools.html#functools.lru_cache where we can.

I'm aware that we need to add our own implementation (exactly like memoize) to nautilus, but I think this is preferable to having nautilus' py2 dependency influence what we implement on master.
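
For illustration, that would look something like this on master (Python 3 only), assuming `get_devices` is only ever called with hashable arguments:

```python
import functools


@functools.lru_cache(maxsize=None)
def get_devices(_sys_block_path='/sys/block'):
    """Expensive scan of /sys/block; cached so repeated calls within
    one ceph-volume invocation do not re-open every device."""
    return {}
```

One caveat with `lru_cache` is that every caller gets the same cached object back, so callers would need to treat the returned device report as read-only.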

@jan--f
Contributor

jan--f commented Sep 23, 2020

Also I fear that this and #37274 will not play well together.

@andrewschoen
Contributor

> Let's use https://docs.python.org/3/library/functools.html#functools.lru_cache where we can.
>
> I'm aware that we need to add our own implementation (exactly like memoize) to nautilus, but I think this is preferable to having nautilus' py2 dependency influence what we implement on master.

I definitely need a fix for this in nautilus and luminous that works with py2. Are you suggesting that ceph-volume@master should quit supporting py2 and that we should target this PR for nautilus instead? If that's the case I guess it's no longer necessary to backport all of the ceph-volume code from master to previous versions.

@andrewschoen
Contributor

I've also been wondering about the value of d5de958; do we really even need this check? Or is there a different way to see if a device is locked? In testing with a customer, they were able to return 0 at the very beginning of `util.disk.is_locked_raw_device` and their deployment worked as expected.
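
For reference, the check in question is essentially an exclusive-open probe of the raw device, roughly like the simplified sketch below (an illustration of the pattern, not necessarily the exact ceph-volume code); the open/close cycle is what generates the udev events discussed above.

```python
import os


def is_locked_raw_device(dev_path):
    """Return 1 if the raw device appears to be held open exclusively
    by another process, 0 otherwise (simplified illustration)."""
    try:
        fd = os.open(dev_path, os.O_RDONLY | os.O_EXCL)
    except OSError:
        return 1
    os.close(fd)  # this open/close cycle is what triggers udev events
    return 0
```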

@ktdreyer
Member

Can we use https://pypi.org/project/backports.functools-lru-cache/ for py2 compatibility? I've used this library in other projects.
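
If we went that route, the usual pattern is a guarded import so Python 3 uses the standard library and Python 2 falls back to the PyPI package. A sketch (it assumes the backport would be added to ceph-volume's dependencies):

```python
try:
    from functools import lru_cache
except ImportError:
    # Python 2: provided by the backports.functools-lru-cache package
    from backports.functools_lru_cache import lru_cache


@lru_cache(maxsize=None)
def get_devices(_sys_block_path='/sys/block'):
    """Device scan, cached for the lifetime of the process."""
    return {}
```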

@andrewschoen
Contributor

> Can we use https://pypi.org/project/backports.functools-lru-cache/ for py2 compatibility? I've used this library in other projects.

There's also this https://pypi.org/project/functools32/, but it looks like backports.functools-lru-cache has a newer version.

@ktdreyer
Member

I think that "ceph-volume in master should drop py2" is a pretty big change, and I think we should take this PR as-is, since Guillaume's already spent time testing it.

If we want master to drop py2 support, that makes it a lot harder for us to backport ceph-volume changes to nautilus upstream.

@ktdreyer ktdreyer self-requested a review September 23, 2020 17:22
@ktdreyer
Member

We could add a comment to the new "memoize" decorator, like "TODO: Replace this with functools.lru_cache once we drop Python < 3.3 support".
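
Concretely, that comment could sit right above the decorator, e.g. (a hypothetical memoize helper, not code from this PR):

```python
# TODO: Replace this with functools.lru_cache once we drop
# support for Python < 3.3.
def memoize(func):
    """Cache a function's return value per positional-argument tuple."""
    cache = {}

    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper
```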

@jan--f
Contributor

jan--f commented Sep 24, 2020

I think that "ceph-volume in master should drop py2" is a pretty big change, and I think we should take this PR as-is, since Guillaume's already spent time testing it.

If we want master to drop py2 support, that makes it a lot harder for us to backport ceph-volume changes to nautilus upstream.

I'm aware, but ceph-volume in master and octopus dropped Python 2 support a while ago. I don't see a problem with using Python 3 code here and replacing it with our own implementation in the nautilus backport.

@guits guits changed the base branch from master to nautilus September 24, 2020 15:59
@guits guits changed the base branch from nautilus to master September 24, 2020 15:59
@andrewschoen
Contributor

@jan--f should we retarget this PR to nautilus then and make a new PR to master using lru_cache?

@jan--f
Contributor

jan--f commented Sep 24, 2020

I would prefer to fix it on master using lru_cache and then go through the usual backport routine. This preserves the history better, imho.

Also, there is a push to merge parallel device retrieval in #37013 or, preferably, #37274, which I think will not combine well with this PR and will diminish those performance improvements. So either this or the parallel retrieval should take precedence; probably this PR.

@guits
Contributor Author

guits commented Sep 24, 2020

> @jan--f should we retarget this PR to nautilus then and make a new PR to master using lru_cache?

In that case, if we are going to use lru_cache on master/octopus, why not use https://pypi.org/project/backports.functools-lru-cache/ for nautilus/luminous?

@andrewschoen
Contributor

> > @jan--f should we retarget this PR to nautilus then and make a new PR to master using lru_cache?
>
> In that case, if we are going to use lru_cache on master/octopus, why not use https://pypi.org/project/backports.functools-lru-cache/ for nautilus/luminous?

I've been working on a branch that does just that: https://github.com/ceph/ceph/compare/wip-lru-cache

@guits
Contributor Author

guits commented Sep 24, 2020

@andrewschoen @jan--f I think we should close this one in favor of #37398

@guits
Contributor Author

guits commented Sep 29, 2020

closing this in favor of #37398

@guits guits closed this Sep 29, 2020
@guits guits deleted the wip-udev-fix branch September 29, 2020 12:13