Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

common: lockdep fixes #17738

Merged
merged 5 commits into from Sep 20, 2017
Merged

common: lockdep fixes #17738

merged 5 commits into from Sep 20, 2017

Conversation

jtlayton
Copy link
Contributor

This is mostly intended as a fix for this tracker bug:

http://tracker.ceph.com/issues/20988

...though there are a couple of other small fixes in here too, and a new testcase.

The fix for 20988 is basically to ensure that when we find that we've raced with the lockdep cct being deregistered that we handle it sanely. There's also a small optimization in here, where we were taking a mutex twice in succession instead of doing all of the work under the same lock, a compile-time fix for a warning, and a cleanup of a testcase that was manipulating g_lockdep unsafely.

This should be backportable as well.

Copy link
Member

@liewegas liewegas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@liewegas liewegas changed the title lockdep fixes common:lockdep fixes Sep 14, 2017
@liewegas liewegas changed the title common:lockdep fixes common: lockdep fixes Sep 14, 2017
Copy link
Contributor

@adamemerson adamemerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Larry Grasped Text Magnificently

@jtlayton
Copy link
Contributor Author

retest this please

@joscollin
Copy link
Member

Jenkins retest this please

@jtlayton
Copy link
Contributor Author

jtlayton commented Sep 15, 2017

Rebased onto current master and re-pushed. I suspect the make check failure is not related to these changes.

@joscollin
Copy link
Member

retest this please

@jtlayton
Copy link
Contributor Author

If anyone has insight into the test failure here, I'd appreciate it. It looks like TestObjectRecorder test is failing with a lock inversion problem detected by lockdep. I wouldn't have expected this patchset to change any behaviour under normal circumstances, so I'm wondering if this failure is coincidence?

@tchaikov
Copy link
Contributor

@jtlayton
Copy link
Contributor Author

I tried to run this test vs. a vstart rig and it fails at an earlier point in the test, so I don't have a good way to reproduce it here. Undoubtedly I'm doing something wrong but since I'm not at all familiar with the actual code here, I can't be sure.

The lockdep output is not particularly helpful, as the stack traces all look identical (unless I'm missing something). I'll keep poking at this, but I may need to do a bisect via jenkins to figure out whether this is something I broke.

The build says:

src/client/Client.cc: In member function ‘void Client::trim_caps(MetaSession*, int)’:
src/client/Client.cc:4121:22: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   if (s->caps.size() > max)
       ~~~~~~~~~~~~~~~^~~~~

Signed-off-by: Jeff Layton <jlayton@redhat.com>
We can do it under the same mutex, which should be more efficient.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
If the cct is unregistered while other threads are flogging mutexes,
then we can hit all sorts of bugs. Ensure that we handle that
situation sanely, by checking that g_lockdep is still set after
we take the lockdep_mutex.

Also, remove an assertion from lockdep_unregister, and just turn it into
an immediate return. It's possible to have a call to
lockdep_unregister_ceph_context, and then a call to
lockdep_register_ceph_context while a mutex is being held by another
task.

In that case, it's possible the lock does not exist in the map
when we go to unregister it. That's not a bug though, just a natural
consequence of that series of actions.

Tracker: http://tracker.ceph.com/issues/20988
Signed-off-by: Jeff Layton <jlayton@redhat.com>
...and make g_lockdep a bool.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Spawn threads that bring up a bunch of ceph_mounts with individual
CephContext objects, and then tear them down in parallel.

Tracker: http://tracker.ceph.com/issues/20988
Signed-off-by: Jeff Layton <jlayton@redhat.com>
@jtlayton
Copy link
Contributor Author

jtlayton commented Sep 16, 2017

Ahh, I think I see the bug. It was a wrong condition in the patch to close up the races. I have a buggy series building now and am not sure how to tell Jenkins to cancel it. I assume it'll pick up the correct series after those make check runs fail.

@jtlayton
Copy link
Contributor Author

retest this please

@jtlayton
Copy link
Contributor Author

Ok, latest set looks good!

@@ -24,14 +24,23 @@ namespace ceph
}
}

void do_init() {
static CephContext* cct = nullptr;
static CephContext* cct;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still needs initialization to null, as far as I can see

Copy link
Contributor Author

@jtlayton jtlayton Sep 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Global variables are always initialized to 0, according to C11 standard. I'm fairly sure that's the case in c++ too. Handy stackoverflow article here:

https://stackoverflow.com/questions/16015656/are-global-variables-always-initialized-to-zero-in-c

ISTR being told years ago that explicitly initializing them like this ends up being extra instructions when the program is started up. Maybe compliers are smart enough to optimize that away now, but there really is no reason to initialize it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have learned something new!

@jcsp
Copy link
Contributor

jcsp commented Sep 18, 2017

All looks reasonable to me minus the inline comment in test_mutex.cc

@tchaikov tchaikov merged commit 497c845 into ceph:master Sep 20, 2017
@tchaikov
Copy link
Contributor

@jtlayton iiuc, the issue fixed by this PR also impacts luminous. what do you think?

@jtlayton jtlayton deleted the wip-jlayton-20988 branch September 20, 2017 11:03
@jtlayton
Copy link
Contributor Author

Yes. I think we want to backport this to all shipping releases (maybe even to jewel if it's still limping along). It should be fairly simple to do as it's mostly lockdep internal changes.

@tchaikov
Copy link
Contributor

thanks @jtlayton!

@jtlayton jtlayton requested a review from a user October 5, 2017 22:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants