Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: don't leak pgrefs or reservations in SnapTrimmer #15214

Merged
merged 7 commits into from May 26, 2017

Conversation

gregsfortytwo
Copy link
Member

No description provided.

@gregsfortytwo
Copy link
Member Author

Just compile-tested so far, will do some vstart testing and if that works out this should get rolled into somebody's integration run asap. :)

@gregsfortytwo
Copy link
Member Author

Actually, hold up on this. I can't seem to reproduce the crash the way I expected to (give the OSD a lot to trim, shut it down while there are PGs pending ).

void finish(int) override {
pg->lock();
PrimaryLogPG *pgp = pg.get();
if (service->is_stopping() || !pgp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder why pgp could be nullptr? as pgp is a smart ptr and is holding a strong reference to the underlying PrimaryLogPG pointer, so even after pg_map.clear(), the pg is still valid, i think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I added code that zaps the pg reference...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregsfortytwo could you please point me to the place where you zap the pg ref?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pg.reset() immediately below in cancel()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be confused by the way "cancel"ing works on something we give the AsyncReserver. It does not guarantee the Context won't be finish()ed. :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, that explains it! i need to look at this part closer before sending out more noise. sorry and thanks !

}
void cancel() {
assert(pg->is_locked());
assert(!canceled);
canceled = true;
pg->osd->snap_reserver.cancel_reservation(pg->get_pgid());
pg.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pg will call pg->put("intptr") in its dtor, so it's not necessary to reset it explicitly, am i right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of this code is clearing out the pg reference so that it goes away before the Context is deleted. That's because we can't do a reliable deletion when we cancel a Context that's been given to an AsyncReserver as happens with this one .

That said, the race is pretty small and I don't think it's actually responsible for the bug report(s) I've been given so I'm trying to figure out other ref counting issues we may have.

if (!canceled)
pg->snap_trimmer_machine.process_event(SnapTrimReserved());
pg->unlock();
pgp->snap_trimmer_machine.process_event(SnapTrimReserved());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, could early return if (canceled).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to unlock and there aren't any Locker semantics around the extra PG lock interfaces that I'm aware of (not that it would be any shorter in this case).

Copy link
Contributor

@tchaikov tchaikov May 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am thinking about something like this:

  if (canceled)
    return;
  pgp->lock();
  pgp->snap_trimmer_machine.process_event(SnapTrimReserved());
  pgp->unlock();
}

not shorter, but it's clearer this way that cancel is not protected by pg lock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is protected by pg lock.

@liewegas liewegas changed the title Don't leak pgrefs or reservations in SnapTrimmer osd: don't leak pgrefs or reservations in SnapTrimmer May 23, 2017
@gregsfortytwo gregsfortytwo changed the title osd: don't leak pgrefs or reservations in SnapTrimmer DNM: osd: don't leak pgrefs or reservations in SnapTrimmer May 24, 2017
@gregsfortytwo
Copy link
Member Author

Anybody have any idea why this build is failing? It's happening on my rex box but I cannot for the life of me figure out why the compiler is using a const-pointer-returning operator->() instead of the regular pointer...

/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/OSD.cc: In member function ‘virtual void C_CompleteSplits::finish(int)’:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/OSD.cc:9132:22: error: invalid conversion from ‘const PG*’ to ‘PG*’ [-fpermissive]
PG *pg = i->get();

@tchaikov
Copy link
Contributor

tchaikov commented May 24, 2017

i am able to compile the OSD.cc after rebasing against master in my ubuntu 16.04 docker env. gcc version is 4:5.3.1-1ubuntu1.

ahh, i didn't enable PG_DEBUG_REFS.

@tchaikov
Copy link
Contributor

retest this please. (i don't understand this either.)

@tchaikov
Copy link
Contributor

i am able to reproduce this.

@tchaikov
Copy link
Contributor

std::set
...
Constant BidirectionalIterator (since C++11)

i think this is why we have a const PG* here

@gregsfortytwo
Copy link
Member Author

I thought it might be, but the typing is explicit (not auto) and it's set::iterator, not set::const_iterator!

And actually, there is another failure at PrimaryPGLog.cc:13818 where it is deref'ing a PGRef to get hte snap_trimmer_machine and it also expect that to be const, it isn't, and it's failing. So the issue is definitely something to do with tracked_int_ptr and const-resolving.

@gregsfortytwo
Copy link
Member Author

gregsfortytwo commented May 24, 2017

and indeed PGRef pgr = *i; PG *pg = pgr.get(); works fine. But it still doesn't like PG *pg = i->get(). So it's definitely a type resolution issue...

@gregsfortytwo
Copy link
Member Author

Okay, Yehuda says that iterators on sets always give you const elements back because you can't change them (it breaks sorting). That makes sense.
The problem here is that tracked_int_ptr is returning const PG* from its member-const functions, and that's not how we need the pointers to work. Easy enough to fix but I initially thought that was not an acceptable way to do it.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Fixes: http://tracker.ceph.com/issues/19931

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo
Copy link
Member Author

This is ready to be looked at again.

assert(!canceled);
canceled = true;
pg.reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this doesn't reliably make the ref go away. The asyncreserver could grab pgp, drop the local lock, then on_shutdown runs under pg lock and cancels. What ensures that asyncreserver is not still blocked on pg->lock (or scheduled away) when we assert there are no refs? Do we flush the asyncserver thread or something? (And if we do flush the asyncreserver, can't we make sure the Context goes away that way and avoid this change entirely?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OSD::shutdown() loops through and locks the PGs twice. The first time it calls PrimaryPG::on_shutdown(), which invokes this cancel.
The second time, after doing some other work, is when we need the remaining PGRefs to be gone. So we're okay if the finish() function is actually racing live, since it will get the PG lock and drop the ref before we need it to be gone.

@liewegas
Copy link
Member

liewegas commented May 25, 2017 via email

@jdurgin
Copy link
Member

jdurgin commented May 25, 2017

why doesn't QueuePeeringEvt's PGRef cause issues? since asyncreserver doesn't flush callbacks when you cancel them, can't it still be live after the pg is shutdown?

@gregsfortytwo
Copy link
Member Author

@liewegas, AsyncReserver cancellation doesn't actually guarantee the "finish" doesn't get triggered. There are internal races with taking stuff off of the queue and actually invoking them in the finisher thread. So we actually need this dance to prevent it racing to grab the PG lock while the PG is being destroyed anyway.

@gregsfortytwo
Copy link
Member Author

@jdurgin I don't warrant against the presence of all PGRef leaks, but in the QueuePeeringEvt case we do drain its threadpool and clear the peering_wq early enough that it cleans up correctly.

I'm not sure how peering_wq is related to AsyncReserver, but it deletes any callbacks it has in its local set of waiting things.

@jdurgin
Copy link
Member

jdurgin commented May 25, 2017

ah, the reserver_finisher is emptied and stopped in OSDService::shutdown(), prior to clearing the peering_wq, so that prevents any QueuePeeringEvt callbacks from being in flight at the end of OSD::shutdown(). That same finisher is used for the snap_reserver, so the PGRefs will be cleaned up too. The problem is that this all happens after checking for PGRefs in the middle of OSD::shutdown().

Is moving the reserver_finisher cleanup to OSDService::start_shutdown() (called before individual pg shutdown) and making your last commit just Reset() the snap_trim_machine sufficient to cleanup the extra references?

@gregsfortytwo
Copy link
Member Author

It might be, but I was nervous about changing the shutdown order that much. (Keep in mind we have to backport the fix to Jewel with high confidence; I'm a lot happier changing the code we just backported rather than something central to OSD function.) If we were going to do such a thing I'd like to take the time to move most all of OSDService::shutdown() up into start_shutdown. And I'm not sure we actually can adjust things that way (start_shutdown() precedes our draining the op wq); I briefly considered that but it seemed like documenting the pg ref counting interfaces and the ways every queue interacts on shutdown were going to be a much longer project than we're interested in doing pre-luminous.

Note that peering_wq is actually cleared much earlier than that; I'm not sure if the second clear you're looking at is redundant[1] or to deal with extra stuff that might come in via other operations being run. Certainly it's drained after the PGs have been notified about shutdown so I presume they won't be giving away any new refs to themselves.
[1]: actually, looking at the git history, I see Sam introduced both peering_wq.clear() calls without comment in the same commit he added the pg lock ref counting. I kinda think he was trying out a few locations and just missed.

@jdurgin
Copy link
Member

jdurgin commented May 26, 2017

For the short term issue for backport here, let's just disable that ceph_abort() during shutdown. It's not necessary outside of testing and it does seem to be a rabbit hole to fix all the causes.
It can hide behind a osd_debug_pg_refs config option enabled by teuthology/vstart.

@liewegas
Copy link
Member

liewegas commented May 26, 2017 via email

@gregsfortytwo
Copy link
Member Author

  1. is what the Reset() signal does.
  2. I guess maybe? More change to "external" code I didn't want to get into right now though.
  3. I don't know what you mean other than what we already do.

@gregsfortytwo
Copy link
Member Author

Is there actually a problem with the changes I've made? Feels like I'm being bikeshedded as we all rediscover that shutdown is annoying. If you'd like I can pull out all the patches that fixed up the ref counting so the PR is smaller; this is a pretty minimal change to fix a specific bug we introduced in the code being changed...

@liewegas
Copy link
Member

liewegas commented May 26, 2017 via email

@gregsfortytwo
Copy link
Member Author

Oh, I did lose a check for is_canceled prior to grabbing the pg lock. I can put that back.

Even without that though, I really don't think it's a practical race. You'd have to have threads A and B

  1. A holds lock X (pg lock)
  2. B requests lock X
  3. A drops lock X
  4. A acquires and drops many other locks
  5. A requests lock X and gets it before B does

for anything to break.

I'll look at flushing the work queues but it's a much more dangerous change to make on shutdown; we need to do everything in the right order to prevent putting pieces of work back into places they shouldn't be. And even that isn't enough. The Reset() message is needed to "shut down" the snap trimmer state machine and is probably sufficient on its own to prevent all the crashes users have seen. The cancel dance is just me being cautious because it's technically possible.

@liewegas
Copy link
Member

liewegas commented May 26, 2017

Yeah, I get it's the conservative thing to do, but I'm not sure it's worth doing a conservative fix for what amounts to leak debug code. On jewel can we disable the ceph_abort() like @jdurgin suggested? And on master, we can do the simple and more robust/correct thing and not bloat a trivial callback with a weird lock dance.

It seems like something like this would do it? I can't think of a reason why shutting down the finisher used only for the reserver earlier (right before we assert that pg refs are gone) would break something. And if we don't see this again after a while we can backport this too.

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 87177a3..e82f620 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -480,10 +480,14 @@ void OSDService::start_shutdown()
   }
 }
 
-void OSDService::shutdown()
+void OSDService::shutdown_reserver()
 {
   reserver_finisher.wait_for_empty();
   reserver_finisher.stop();
+}
+
+void OSDService::shutdown()
+{
   {
     Mutex::Locker l(watch_lock);
     watch_timer.shutdown();
@@ -3175,6 +3179,8 @@ int OSD::shutdown()
     assert(pg_stat_queue.empty());
   }
 
+  service.shutdown_reserver();
+
   // Remove PGs
 #ifdef PG_DEBUG_REFS
   service.dump_live_pgids();
diff --git a/src/osd/OSD.h b/src/osd/OSD.h
index cad51ed..5da0b8d 100644
--- a/src/osd/OSD.h
+++ b/src/osd/OSD.h
@@ -1100,6 +1100,7 @@ public:
   void init();
   void final_init();  
   void start_shutdown();
+  void shutdown_reserver();
   void shutdown();
 
 private:

We were failing to exit various wait states which held PGRefs. Error!

Fixes: http://tracker.ceph.com/issues/19931

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo
Copy link
Member Author

Yeah okay. Pushing fresh patches shortly.

@gregsfortytwo gregsfortytwo force-pushed the wip-19931-snaptrim-pgref branch 2 times, most recently from 3300304 to fc68e70 Compare May 26, 2017 06:45
@jdurgin
Copy link
Member

jdurgin commented May 26, 2017

lgtm - no need to include 6ab960c and its revert though

This finisher thread has a lot of callbacks which can hold PGRefs. Make
sure we drain them out before checking that all the PGs have finished
and have no outstanding references.

Moving this should be safe; we've already stopped the op thread et al
and the only things still running are the OSDService's objecter_finisher,
recovery_request_timer, and snap_sleep_timer (which has definitely been emptied
by the time we get here as it's synchronously cleared out on PG shutdown).

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo
Copy link
Member Author

http://pulpito.ceph.com/gregf-2017-05-26_06:45:56-rados-wip-19931-snaptrim-pgs---basic-smithi/ shows 5 rados failures of 100 (none seem related).

http://pulpito.ceph.com/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/ shows rgw-multisite failing in ways that are not a part of this patch series, and @cbodley said it used to reproduce them reliably. Calling it good!

@gregsfortytwo gregsfortytwo changed the title DNM: osd: don't leak pgrefs or reservations in SnapTrimmer osd: don't leak pgrefs or reservations in SnapTrimmer May 26, 2017
@gregsfortytwo gregsfortytwo merged commit 032466d into ceph:master May 26, 2017
@gregsfortytwo gregsfortytwo deleted the wip-19931-snaptrim-pgref branch May 26, 2017 21:53
@liewegas
Copy link
Member

liewegas commented May 26, 2017 via email

@cbodley
Copy link
Contributor

cbodley commented May 27, 2017

this osd segfault was in both of those rgw:multisite runs:

src/common/shared_cache.hpp: In function 'SharedLRU<K, V, C, H>::~SharedLRU() [with K = unsigned int; V = const OSDMap; C = std::less; H = std::hash]' thread 9618080 time 2017-05-26 07:03:31.013021
src/common/shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
ceph version 12.0.2-1359-g3300304 (3300304)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaef2f0]
2: (()+0x4fe97a) [0x60697a]
3: (OSDService::~OSDService()+0x17c) [0x59157c]
4: (OSD::~OSD()+0x133) [0x5df193]
5: (OSD::~OSD()+0x9) [0x5df7d9]
6: (main()+0x2fca) [0x4d8aba]
7: (__libc_start_main()+0xf5) [0xd7adb35]
8: (()+0x469c39) [0x571c39]
NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
2017-05-26 07:03:31.011502 9618080 -1 leaked refs:
dump_weak_refs 0xf69e910 weak_refs: 78 = 0x40fe8da0 with 98 refs
dump_weak_refs 0xf69e910 weak_refs: 84 = 0x32e8a9d0 with 98 refs

and this one was in http://qa-proxy.ceph.com/teuthology/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/1231015/teuthology.log:

src/mon/MonClient.cc: In function 'int MonClient::wait_auth_rotating(double)' thread 10008700 time 2017-05-26 07:03:04.381430
src/mon/MonClient.cc: 901: FAILED assert(auth != nullptr)
ceph version 12.0.2-1359-g3300304 (3300304)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaef2f0]
2: (MonClient::wait_auth_rotating(double)+0x63c) [0xb0015c]
3: (OSD::ms_get_authorizer(int, AuthAuthorizer**, bool)+0xbc) [0x57887c]
4: (AsyncConnection::handle_connect_reply(ceph_msg_connect&, ceph_msg_connect_reply&)+0x853) [0xcf7c73]
5: (AsyncConnection::_process_connection()+0x19d6) [0xcfd8e6]
6: (AsyncConnection::process()+0x708) [0xd01608]
7: (EventCenter::process_events(int)+0x814) [0xb7ea94]
8: (()+0xa790aa) [0xb810aa]
9: (()+0xb5230) [0xd020230]
10: (()+0x7dc5) [0xc737dc5]
11: (clone()+0x6d) [0xd88373d]
NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.

@gregsfortytwo
Copy link
Member Author

the first assert is the osd map cache, which doesn't seem like it could be related. The second is presumably http://tracker.ceph.com/issues/19900#change-91110

@cbodley
Copy link
Contributor

cbodley commented May 30, 2017

@gregsfortytwo thanks. looks like the second one is fixed in master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants