ceph-volume: cache the return value of util.disk.get_devices #37319
Conversation
jenkins test ceph-volume tox

jenkins test ceph-volume all
Calls to `get_devices` result in udev messages being created because of the opening and closing of devices in `util.disk.is_locked_raw_device`. In systems with large numbers of devices, race conditions can occur with udev that cause unintended consequences. This effect is magnified when many devices are passed to a command like `lvm batch`, because `get_devices` will be called for every device given, which then runs `is_locked_raw_device` against every device on the system. These issues have been observed in both https://bugzilla.redhat.com/show_bug.cgi?id=1822134 and https://bugzilla.redhat.com/show_bug.cgi?id=1878500.

Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1822134
Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1878500
Closes: https://tracker.ceph.com/issues/47502
Closes: https://tracker.ceph.com/issues/47585

Signed-off-by: Andrew Schoen <aschoen@redhat.com>
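For context, a minimal, hypothetical sketch of the call pattern described above; the helper bodies are simplified stand-ins, not the actual `util.disk` implementation:

```python
# Hypothetical, simplified stand-ins for the helpers named above; the real
# util.disk code is more involved, this only illustrates the call pattern.
import glob


def is_locked_raw_device(dev_path):
    # Opening and then closing the device node is what generates the udev
    # events described in this PR.
    try:
        with open(dev_path, 'rb'):
            pass
    except OSError:
        return True
    return False


def get_devices():
    # Every call rescans all block devices and probes each one of them.
    devices = {}
    for dev_path in glob.glob('/dev/sd*'):
        devices[dev_path] = {'locked': is_locked_raw_device(dev_path)}
    return devices


# A command like `lvm batch /dev/sda /dev/sdb ...` then ends up doing roughly:
for requested in ['/dev/sda', '/dev/sdb']:
    all_devices = get_devices()  # probes every device on the system again
    report = all_devices.get(requested)
```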
jenkins test ceph-volume all

jenkins test ceph-volume tox
Because it's a race condition I couldn't guarantee it 100%, but all the tests I've run lately that include this patch could not reproduce the issue, whereas I could reproduce it without this patch.
Let's use https://docs.python.org/3/library/functools.html#functools.lru_cache where we can. I'm aware that we need to add our own implementation (exactly like memoize) to nautilus, but I think this is preferable to having nautilus' py2 dependency influence what we implement on master.

Also, I fear that this and #37274 will not play well together.
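A minimal sketch of what the `lru_cache` suggestion could look like (python3-only; the signature and body here are placeholders, not ceph-volume's real `util.disk.get_devices`):

```python
import functools


@functools.lru_cache(maxsize=None)
def get_devices(_sys_block_path='/sys/block'):
    # Placeholder body: the point is that repeated calls with the same
    # argument return the cached result instead of re-probing every device.
    print('scanning devices under %s' % _sys_block_path)
    return {}


get_devices()  # performs the scan
get_devices()  # served from the cache, no further device opens
```

One caveat is that `lru_cache` keys the cache on the call arguments, so the cache is only hit when callers pass the same (hashable) arguments.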
I definitely need a fix for this in nautilus and luminous that works with py2. Are you suggesting that ceph-volume@master should quit supporting py2 and that we should target this PR for nautilus instead? If that's the case, I guess it's no longer necessary to backport all of the ceph-volume code from master to previous versions.
I've also been wondering about the value of d5de958; do we really even need this check? Or is there a different way to see if a device is locked? In testing with a customer they were able to
Can we use https://pypi.org/project/backports.functools-lru-cache/ for py2 compatibility? I've used this library in other projects.

There's also https://pypi.org/project/functools32/, but it looks like backports.functools-lru-cache has a newer version.
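If the py2 backport route were taken, the usual pattern is an import fallback. A rough sketch, assuming backports.functools-lru-cache were added as a dependency (this thread does not settle that):

```python
try:
    # Python >= 3.2 ships lru_cache in the standard library.
    from functools import lru_cache
except ImportError:
    # On python2 this needs the backports.functools-lru-cache package,
    # which is imported as backports.functools_lru_cache.
    from backports.functools_lru_cache import lru_cache


@lru_cache(maxsize=None)
def get_devices():
    return {}  # placeholder for the real device scan
```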
I think that "ceph-volume in master should drop py2" is a pretty big change, and we should take this PR as-is, since Guillaume has already spent time testing it. If we want master to drop py2 support, that makes it a lot harder for us to backport ceph-volume changes to nautilus upstream.
We could add a comment to the new "memoize" decorator, like "TODO: Replace this with functools.lru_cache once we drop Python < 3.3 support".
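For illustration, a small memoize-style decorator carrying that TODO; this is only a sketch, not necessarily what the nautilus backport would ship:

```python
import functools


def memoize(func):
    # TODO: Replace this with functools.lru_cache once we drop Python < 3.3 support
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


@memoize
def get_devices():
    return {}  # placeholder for the real device scan in util.disk
```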
I'm aware, but ceph-volume in master and octopus dropped python2 support a while ago. I don't see a problem with using python3 code here and replacing it with our own implementation in the nautilus backport.
@jan--f should we retarget this PR to nautilus then and make a new PR to master using `lru_cache`?
I would prefer to fix it on master using `lru_cache`. Also, there is a push to merge parallel device retrieval in #37013 or preferably #37274, which I think, in conjunction with this PR, will negatively affect the performance improvements. So either this or the parallel retrieval should take precedence, probably this PR here.
In that case, if we are going to use `lru_cache`, I've been working on a branch that does just that: https://github.com/ceph/ceph/compare/wip-lru-cache
@andrewschoen @jan--f I think we should close this one in favor of #37398 |
closing this in favor of #37398 |