New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
common: lockdep fixes #17738
common: lockdep fixes #17738
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Larry Grasped Text Magnificently
retest this please |
Jenkins retest this please |
a48d189
to
f97f163
Compare
Rebased onto current master and re-pushed. I suspect the make check failure is not related to these changes. |
retest this please |
If anyone has insight into the test failure here, I'd appreciate it. It looks like TestObjectRecorder test is failing with a lock inversion problem detected by lockdep. I wouldn't have expected this patchset to change any behaviour under normal circumstances, so I'm wondering if this failure is coincidence? |
https://jenkins.ceph.com/job/ceph-pull-requests/33042/ retest this please |
I tried to run this test vs. a vstart rig and it fails at an earlier point in the test, so I don't have a good way to reproduce it here. Undoubtedly I'm doing something wrong but since I'm not at all familiar with the actual code here, I can't be sure. The lockdep output is not particularly helpful, as the stack traces all look identical (unless I'm missing something). I'll keep poking at this, but I may need to do a bisect via jenkins to figure out whether this is something I broke. |
The build says: src/client/Client.cc: In member function ‘void Client::trim_caps(MetaSession*, int)’: src/client/Client.cc:4121:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] if (s->caps.size() > max) ~~~~~~~~~~~~~~~^~~~~ Signed-off-by: Jeff Layton <jlayton@redhat.com>
We can do it under the same mutex, which should be more efficient. Signed-off-by: Jeff Layton <jlayton@redhat.com>
If the cct is unregistered while other threads are flogging mutexes, then we can hit all sorts of bugs. Ensure that we handle that situation sanely, by checking that g_lockdep is still set after we take the lockdep_mutex. Also, remove an assertion from lockdep_unregister, and just turn it into an immediate return. It's possible to have a call to lockdep_unregister_ceph_context, and then a call to lockdep_register_ceph_context while a mutex is being held by another task. In that case, it's possible the lock does not exist in the map when we go to unregister it. That's not a bug though, just a natural consequence of that series of actions. Tracker: http://tracker.ceph.com/issues/20988 Signed-off-by: Jeff Layton <jlayton@redhat.com>
f97f163
to
01863bb
Compare
...and make g_lockdep a bool. Signed-off-by: Jeff Layton <jlayton@redhat.com>
Spawn threads that bring up a bunch of ceph_mounts with individual CephContext objects, and then tear them down in parallel. Tracker: http://tracker.ceph.com/issues/20988 Signed-off-by: Jeff Layton <jlayton@redhat.com>
Ahh, I think I see the bug. It was a wrong condition in the patch to close up the races. I have a buggy series building now and am not sure how to tell Jenkins to cancel it. I assume it'll pick up the correct series after those make check runs fail. |
retest this please |
866778e
to
8252f31
Compare
Ok, latest set looks good! |
@@ -24,14 +24,23 @@ namespace ceph | |||
} | |||
} | |||
|
|||
void do_init() { | |||
static CephContext* cct = nullptr; | |||
static CephContext* cct; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still needs initialization to null, as far as I can see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Global variables are always initialized to 0, according to C11 standard. I'm fairly sure that's the case in c++ too. Handy stackoverflow article here:
https://stackoverflow.com/questions/16015656/are-global-variables-always-initialized-to-zero-in-c
ISTR being told years ago that explicitly initializing them like this ends up being extra instructions when the program is started up. Maybe compliers are smart enough to optimize that away now, but there really is no reason to initialize it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I have learned something new!
All looks reasonable to me minus the inline comment in test_mutex.cc |
@jtlayton iiuc, the issue fixed by this PR also impacts luminous. what do you think? |
Yes. I think we want to backport this to all shipping releases (maybe even to jewel if it's still limping along). It should be fairly simple to do as it's mostly lockdep internal changes. |
thanks @jtlayton! |
This is mostly intended as a fix for this tracker bug:
http://tracker.ceph.com/issues/20988
...though there are a couple of other small fixes in here too, and a new testcase.
The fix for 20988 is basically to ensure that when we find that we've raced with the lockdep cct being deregistered that we handle it sanely. There's also a small optimization in here, where we were taking a mutex twice in succession instead of doing all of the work under the same lock, a compile-time fix for a warning, and a cleanup of a testcase that was manipulating g_lockdep unsafely.
This should be backportable as well.