osd: don't leak pgrefs or reservations in SnapTrimmer #15214

gregsfortytwo · 2017-05-23T00:13:09Z

No description provided.

gregsfortytwo · 2017-05-23T00:13:35Z

Just compile-tested so far, will do some vstart testing and if that works out this should get rolled into somebody's integration run asap. :)

gregsfortytwo · 2017-05-23T01:26:47Z

Actually, hold up on this. I can't seem to reproduce the crash the way I expected to (give the OSD a lot to trim, shut it down while there are PGs pending ).

tchaikov · 2017-05-23T09:41:10Z

src/osd/PrimaryLogPG.h

      void finish(int) override {
-	pg->lock();
+	PrimaryLogPG *pgp = pg.get();
+	if (service->is_stopping() || !pgp)


i wonder why pgp could be nullptr? as pgp is a smart ptr and is holding a strong reference to the underlying PrimaryLogPG pointer, so even after pg_map.clear(), the pg is still valid, i think.

Because I added code that zaps the pg reference...

@gregsfortytwo could you please point me to the place where you zap the pg ref?

pg.reset() immediately below in cancel()?

You may be confused by the way "cancel"ing works on something we give the AsyncReserver. It does not guarantee the Context won't be finish()ed. :(

ahh, that explains it! i need to look at this part closer before sending out more noise. sorry and thanks !

tchaikov · 2017-05-23T09:44:48Z

src/osd/PrimaryLogPG.h

      }
      void cancel() {
 	assert(pg->is_locked());
 	assert(!canceled);
 	canceled = true;
+	pg->osd->snap_reserver.cancel_reservation(pg->get_pgid());
+	pg.reset();


pg will call pg->put("intptr") in its dtor, so it's not necessary to reset it explicitly, am i right?

The whole point of this code is clearing out the pg reference so that it goes away before the Context is deleted. That's because we can't do a reliable deletion when we cancel a Context that's been given to an AsyncReserver as happens with this one .

That said, the race is pretty small and I don't think it's actually responsible for the bug report(s) I've been given so I'm trying to figure out other ref counting issues we may have.

tchaikov · 2017-05-23T09:47:15Z

src/osd/PrimaryLogPG.h

 	if (!canceled)
-	  pg->snap_trimmer_machine.process_event(SnapTrimReserved());
-	pg->unlock();
+	  pgp->snap_trimmer_machine.process_event(SnapTrimReserved());


nit, could early return if (canceled).

We need to unlock and there aren't any Locker semantics around the extra PG lock interfaces that I'm aware of (not that it would be any shorter in this case).

i am thinking about something like this:

if (canceled) return; pgp->lock(); pgp->snap_trimmer_machine.process_event(SnapTrimReserved()); pgp->unlock(); }

not shorter, but it's clearer this way that cancel is not protected by pg lock.

It is protected by pg lock.

gregsfortytwo · 2017-05-24T08:40:11Z

Anybody have any idea why this build is failing? It's happening on my rex box but I cannot for the life of me figure out why the compiler is using a const-pointer-returning operator->() instead of the regular pointer...

/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/OSD.cc: In member function ‘virtual void C_CompleteSplits::finish(int)’:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/OSD.cc:9132:22: error: invalid conversion from ‘const PG*’ to ‘PG*’ [-fpermissive]
PG *pg = i->get();

tchaikov · 2017-05-24T10:50:54Z

~~i am able to compile the OSD.cc after rebasing against master in my ubuntu 16.04 docker env. gcc version is 4:5.3.1-1ubuntu1.~~

ahh, i didn't enable PG_DEBUG_REFS.

tchaikov · 2017-05-24T11:22:35Z

retest this please. (i don't understand this either.)

tchaikov · 2017-05-24T11:49:38Z

i am able to reproduce this.

tchaikov · 2017-05-24T11:54:25Z

std::set
...
Constant BidirectionalIterator (since C++11)

i think this is why we have a const PG* here

gregsfortytwo · 2017-05-24T20:16:33Z

I thought it might be, but the typing is explicit (not auto) and it's set::iterator, not set::const_iterator!

And actually, there is another failure at PrimaryPGLog.cc:13818 where it is deref'ing a PGRef to get hte snap_trimmer_machine and it also expect that to be const, it isn't, and it's failing. So the issue is definitely something to do with tracked_int_ptr and const-resolving.

gregsfortytwo · 2017-05-24T20:34:35Z

and indeed PGRef pgr = *i; PG *pg = pgr.get(); works fine. But it still doesn't like PG *pg = i->get(). So it's definitely a type resolution issue...

gregsfortytwo · 2017-05-24T23:29:55Z

Okay, Yehuda says that iterators on sets always give you const elements back because you can't change them (it breaks sorting). That makes sense.
The problem here is that tracked_int_ptr is returning const PG* from its member-const functions, and that's not how we need the pointers to work. Easy enough to fix but I initially thought that was not an acceptable way to do it.

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

Fixes: http://tracker.ceph.com/issues/19931 Signed-off-by: Greg Farnum <gfarnum@redhat.com>

gregsfortytwo · 2017-05-25T18:55:09Z

This is ready to be looked at again.

liewegas · 2017-05-25T19:04:27Z

src/osd/PrimaryLogPG.h

 	assert(!canceled);
 	canceled = true;
+	pg.reset();


It seems like this doesn't reliably make the ref go away. The asyncreserver could grab pgp, drop the local lock, then on_shutdown runs under pg lock and cancels. What ensures that asyncreserver is not still blocked on pg->lock (or scheduled away) when we assert there are no refs? Do we flush the asyncserver thread or something? (And if we do flush the asyncreserver, can't we make sure the Context goes away that way and avoid this change entirely?)

OSD::shutdown() loops through and locks the PGs twice. The first time it calls PrimaryPG::on_shutdown(), which invokes this cancel.
The second time, after doing some other work, is when we need the remaining PGRefs to be gone. So we're okay if the finish() function is actually racing live, since it will get the PG lock and drop the ref before we need it to be gone.

liewegas · 2017-05-25T21:05:28Z

As part of that "other stuff" can't it cancel all the AsyncReserver events? (And avoid this dance?)

jdurgin · 2017-05-25T22:21:57Z

why doesn't QueuePeeringEvt's PGRef cause issues? since asyncreserver doesn't flush callbacks when you cancel them, can't it still be live after the pg is shutdown?

gregsfortytwo · 2017-05-25T22:33:23Z

@liewegas, AsyncReserver cancellation doesn't actually guarantee the "finish" doesn't get triggered. There are internal races with taking stuff off of the queue and actually invoking them in the finisher thread. So we actually need this dance to prevent it racing to grab the PG lock while the PG is being destroyed anyway.

gregsfortytwo · 2017-05-25T22:44:44Z

@jdurgin I don't warrant against the presence of all PGRef leaks, but in the QueuePeeringEvt case we do drain its threadpool and clear the peering_wq early enough that it cleans up correctly.

I'm not sure how peering_wq is related to AsyncReserver, but it deletes any callbacks it has in its local set of waiting things.

jdurgin · 2017-05-25T23:25:56Z

ah, the reserver_finisher is emptied and stopped in OSDService::shutdown(), prior to clearing the peering_wq, so that prevents any QueuePeeringEvt callbacks from being in flight at the end of OSD::shutdown(). That same finisher is used for the snap_reserver, so the PGRefs will be cleaned up too. The problem is that this all happens after checking for PGRefs in the middle of OSD::shutdown().

Is moving the reserver_finisher cleanup to OSDService::start_shutdown() (called before individual pg shutdown) and making your last commit just Reset() the snap_trim_machine sufficient to cleanup the extra references?

gregsfortytwo · 2017-05-25T23:53:13Z

It might be, but I was nervous about changing the shutdown order that much. (Keep in mind we have to backport the fix to Jewel with high confidence; I'm a lot happier changing the code we just backported rather than something central to OSD function.) If we were going to do such a thing I'd like to take the time to move most all of OSDService::shutdown() up into start_shutdown. And I'm not sure we actually can adjust things that way (start_shutdown() precedes our draining the op wq); I briefly considered that but it seemed like documenting the pg ref counting interfaces and the ways every queue interacts on shutdown were going to be a much longer project than we're interested in doing pre-luminous.

Note that peering_wq is actually cleared much earlier than that; I'm not sure if the second clear you're looking at is redundant[1] or to deal with extra stuff that might come in via other operations being run. Certainly it's drained after the PGs have been notified about shutdown so I presume they won't be giving away any new refs to themselves.
[1]: actually, looking at the git history, I see Sam introduced both peering_wq.clear() calls without comment in the same commit he added the pg lock ref counting. I kinda think he was trying out a few locations and just missed.

jdurgin · 2017-05-26T00:23:33Z

For the short term issue for backport here, let's just disable that ceph_abort() during shutdown. It's not necessary outside of testing and it does seem to be a rabbit hole to fix all the causes.
It can hide behind a osd_debug_pg_refs config option enabled by teuthology/vstart.

liewegas · 2017-05-26T02:25:40Z

right, but it has a ref, so it'll be able to get the lock eventually. why not 1- cancel (or try) in on_shutdown 2- flush the asyncreserver finisher thread (so any uncancelled-but-queued contexts do their thing) 3- verify pgrefs are gone

gregsfortytwo · 2017-05-26T02:36:35Z

is what the Reset() signal does.
I guess maybe? More change to "external" code I didn't want to get into right now though.
I don't know what you mean other than what we already do.

gregsfortytwo · 2017-05-26T02:39:19Z

Is there actually a problem with the changes I've made? Feels like I'm being bikeshedded as we all rediscover that shutdown is annoying. If you'd like I can pull out all the patches that fixed up the ref counting so the PR is smaller; this is a pretty minimal change to fix a specific bug we introduced in the code being changed...

liewegas · 2017-05-26T02:49:40Z

If I'm reading your change right this narrows the race but doesn't remove it because the other thread might be waiting on pg->lock and we can't guarantee it will get flushed before we get around to the ref check. And it seems so much simpler to just flush the wq with the context+ref before verifying refs are gone... I like the idea of hiding the abort behind the config option; this is purely a code hygiene thing and doesn't affect end users at all (as long as we don't assert :).

gregsfortytwo · 2017-05-26T03:11:48Z

Oh, I did lose a check for is_canceled prior to grabbing the pg lock. I can put that back.

Even without that though, I really don't think it's a practical race. You'd have to have threads A and B

A holds lock X (pg lock)
B requests lock X
A drops lock X
A acquires and drops many other locks
A requests lock X and gets it before B does

for anything to break.

I'll look at flushing the work queues but it's a much more dangerous change to make on shutdown; we need to do everything in the right order to prevent putting pieces of work back into places they shouldn't be. And even that isn't enough. The Reset() message is needed to "shut down" the snap trimmer state machine and is probably sufficient on its own to prevent all the crashes users have seen. The cancel dance is just me being cautious because it's technically possible.

liewegas · 2017-05-26T03:22:27Z

Yeah, I get it's the conservative thing to do, but I'm not sure it's worth doing a conservative fix for what amounts to leak debug code. On jewel can we disable the ceph_abort() like @jdurgin suggested? And on master, we can do the simple and more robust/correct thing and not bloat a trivial callback with a weird lock dance.

It seems like something like this would do it? I can't think of a reason why shutting down the finisher used only for the reserver earlier (right before we assert that pg refs are gone) would break something. And if we don't see this again after a while we can backport this too.

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 87177a3..e82f620 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -480,10 +480,14 @@ void OSDService::start_shutdown()
   }
 }
 
-void OSDService::shutdown()
+void OSDService::shutdown_reserver()
 {
   reserver_finisher.wait_for_empty();
   reserver_finisher.stop();
+}
+
+void OSDService::shutdown()
+{
   {
     Mutex::Locker l(watch_lock);
     watch_timer.shutdown();
@@ -3175,6 +3179,8 @@ int OSD::shutdown()
     assert(pg_stat_queue.empty());
   }
 
+  service.shutdown_reserver();
+
   // Remove PGs
 #ifdef PG_DEBUG_REFS
   service.dump_live_pgids();
diff --git a/src/osd/OSD.h b/src/osd/OSD.h
index cad51ed..5da0b8d 100644
--- a/src/osd/OSD.h
+++ b/src/osd/OSD.h
@@ -1100,6 +1100,7 @@ public:
   void init();
   void final_init();  
   void start_shutdown();
+  void shutdown_reserver();
   void shutdown();
 
 private:

We were failing to exit various wait states which held PGRefs. Error! Fixes: http://tracker.ceph.com/issues/19931 Signed-off-by: Greg Farnum <gfarnum@redhat.com>

gregsfortytwo · 2017-05-26T05:39:17Z

Yeah okay. Pushing fresh patches shortly.

jdurgin · 2017-05-26T15:21:17Z

lgtm - no need to include 6ab960c and its revert though

This finisher thread has a lot of callbacks which can hold PGRefs. Make sure we drain them out before checking that all the PGs have finished and have no outstanding references. Moving this should be safe; we've already stopped the op thread et al and the only things still running are the OSDService's objecter_finisher, recovery_request_timer, and snap_sleep_timer (which has definitely been emptied by the time we get here as it's synchronously cleared out on PG shutdown). Signed-off-by: Greg Farnum <gfarnum@redhat.com>

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

gregsfortytwo · 2017-05-26T21:48:05Z

http://pulpito.ceph.com/gregf-2017-05-26_06:45:56-rados-wip-19931-snaptrim-pgs---basic-smithi/ shows 5 rados failures of 100 (none seem related).

http://pulpito.ceph.com/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/ shows rgw-multisite failing in ways that are not a part of this patch series, and @cbodley said it used to reproduce them reliably. Calling it good!

liewegas · 2017-05-26T21:56:53Z

\o/

cbodley · 2017-05-27T00:32:34Z

this osd segfault was in both of those rgw:multisite runs:

src/common/shared_cache.hpp: In function 'SharedLRU<K, V, C, H>::~SharedLRU() [with K = unsigned int; V = const OSDMap; C = std::less; H = std::hash]' thread 9618080 time 2017-05-26 07:03:31.013021
src/common/shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
ceph version 12.0.2-1359-g3300304 (3300304)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaef2f0]
2: (()+0x4fe97a) [0x60697a]
3: (OSDService::~OSDService()+0x17c) [0x59157c]
4: (OSD::~OSD()+0x133) [0x5df193]
5: (OSD::~OSD()+0x9) [0x5df7d9]
6: (main()+0x2fca) [0x4d8aba]
7: (__libc_start_main()+0xf5) [0xd7adb35]
8: (()+0x469c39) [0x571c39]
NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
2017-05-26 07:03:31.011502 9618080 -1 leaked refs:
dump_weak_refs 0xf69e910 weak_refs: 78 = 0x40fe8da0 with 98 refs
dump_weak_refs 0xf69e910 weak_refs: 84 = 0x32e8a9d0 with 98 refs

and this one was in http://qa-proxy.ceph.com/teuthology/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/1231015/teuthology.log:

src/mon/MonClient.cc: In function 'int MonClient::wait_auth_rotating(double)' thread 10008700 time 2017-05-26 07:03:04.381430
src/mon/MonClient.cc: 901: FAILED assert(auth != nullptr)
ceph version 12.0.2-1359-g3300304 (3300304)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaef2f0]
2: (MonClient::wait_auth_rotating(double)+0x63c) [0xb0015c]
3: (OSD::ms_get_authorizer(int, AuthAuthorizer**, bool)+0xbc) [0x57887c]
4: (AsyncConnection::handle_connect_reply(ceph_msg_connect&, ceph_msg_connect_reply&)+0x853) [0xcf7c73]
5: (AsyncConnection::_process_connection()+0x19d6) [0xcfd8e6]
6: (AsyncConnection::process()+0x708) [0xd01608]
7: (EventCenter::process_events(int)+0x814) [0xb7ea94]
8: (()+0xa790aa) [0xb810aa]
9: (()+0xb5230) [0xd020230]
10: (()+0x7dc5) [0xc737dc5]
11: (clone()+0x6d) [0xd88373d]
NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.

gregsfortytwo · 2017-05-30T17:01:58Z

the first assert is the osd map cache, which doesn't seem like it could be related. The second is presumably http://tracker.ceph.com/issues/19900#change-91110

cbodley · 2017-05-30T18:06:42Z

@gregsfortytwo thanks. looks like the second one is fixed in master

gregsfortytwo added bug-fix core needs-qa needs-review labels May 23, 2017

gregsfortytwo requested a review from jdurgin May 23, 2017 00:13

tchaikov reviewed May 23, 2017

View reviewed changes

liewegas changed the title ~~Don't leak pgrefs or reservations in SnapTrimmer~~ osd: don't leak pgrefs or reservations in SnapTrimmer May 23, 2017

gregsfortytwo removed needs-qa needs-review labels May 23, 2017

gregsfortytwo changed the title ~~osd: don't leak pgrefs or reservations in SnapTrimmer~~ DNM: osd: don't leak pgrefs or reservations in SnapTrimmer May 24, 2017

gregsfortytwo force-pushed the wip-19931-snaptrim-pgref branch from 2fb03cd to cedd411 Compare May 25, 2017 01:11

gregsfortytwo added 4 commits May 25, 2017 11:39

osd: use PGRef instead of boost::intrusive_ptr<PG> everywhere

306ad85

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

osd: repair the PG_DEBUG_REFS build

ec4185d

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

osd: print out pg debug ref counts on acquire/drop

a3b028a

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

osd: make sure we drop Backoff PGRefs on shutdown

0bce2c6

Fixes: http://tracker.ceph.com/issues/19931 Signed-off-by: Greg Farnum <gfarnum@redhat.com>

gregsfortytwo force-pushed the wip-19931-snaptrim-pgref branch from cedd411 to 138e246 Compare May 25, 2017 18:53

gregsfortytwo added needs-qa needs-review labels May 25, 2017

liewegas reviewed May 25, 2017

View reviewed changes

osd: Reset() the snaptrimmer on shutdown

b0e9dee

We were failing to exit various wait states which held PGRefs. Error! Fixes: http://tracker.ceph.com/issues/19931 Signed-off-by: Greg Farnum <gfarnum@redhat.com>

gregsfortytwo force-pushed the wip-19931-snaptrim-pgref branch 2 times, most recently from 3300304 to fc68e70 Compare May 26, 2017 06:45

gregsfortytwo added 2 commits May 26, 2017 14:43

osd: do not default-abort on leaked pg refs

4caf2df

Signed-off-by: Greg Farnum <gfarnum@redhat.com>

gregsfortytwo force-pushed the wip-19931-snaptrim-pgref branch from fc68e70 to 4caf2df Compare May 26, 2017 21:48

gregsfortytwo changed the title ~~DNM: osd: don't leak pgrefs or reservations in SnapTrimmer~~ osd: don't leak pgrefs or reservations in SnapTrimmer May 26, 2017

gregsfortytwo merged commit 032466d into ceph:master May 26, 2017

gregsfortytwo deleted the wip-19931-snaptrim-pgref branch May 26, 2017 21:53

Navigation Menu

osd: don't leak pgrefs or reservations in SnapTrimmer #15214

osd: don't leak pgrefs or reservations in SnapTrimmer #15214

Conversation

gregsfortytwo commented May 23, 2017

gregsfortytwo commented May 23, 2017

gregsfortytwo commented May 23, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tchaikov May 23, 2017 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gregsfortytwo commented May 24, 2017

tchaikov commented May 24, 2017 • edited

tchaikov commented May 24, 2017

tchaikov commented May 24, 2017

tchaikov commented May 24, 2017

gregsfortytwo commented May 24, 2017

gregsfortytwo commented May 24, 2017 • edited

gregsfortytwo commented May 24, 2017

gregsfortytwo commented May 25, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

liewegas commented May 25, 2017 via email

jdurgin commented May 25, 2017

gregsfortytwo commented May 25, 2017

gregsfortytwo commented May 25, 2017

jdurgin commented May 25, 2017

gregsfortytwo commented May 25, 2017

jdurgin commented May 26, 2017

liewegas commented May 26, 2017 via email

gregsfortytwo commented May 26, 2017

gregsfortytwo commented May 26, 2017

liewegas commented May 26, 2017 via email

gregsfortytwo commented May 26, 2017

liewegas commented May 26, 2017 • edited

gregsfortytwo commented May 26, 2017

jdurgin commented May 26, 2017

gregsfortytwo commented May 26, 2017

liewegas commented May 26, 2017 via email

cbodley commented May 27, 2017

gregsfortytwo commented May 30, 2017

cbodley commented May 30, 2017

tchaikov May 23, 2017 •

edited

tchaikov commented May 24, 2017 •

edited

gregsfortytwo commented May 24, 2017 •

edited

liewegas commented May 26, 2017 •

edited