New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

osd: don't leak pgrefs or reservations in SnapTrimmer #15214

Merged
merged 7 commits into from May 26, 2017

Conversation

Projects
None yet
5 participants
@gregsfortytwo
Member

gregsfortytwo commented May 23, 2017

No description provided.

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 23, 2017

Just compile-tested so far, will do some vstart testing and if that works out this should get rolled into somebody's integration run asap. :)

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 23, 2017

Actually, hold up on this. I can't seem to reproduce the crash the way I expected to (give the OSD a lot to trim, shut it down while there are PGs pending ).

void finish(int) override {
pg->lock();
PrimaryLogPG *pgp = pg.get();
if (service->is_stopping() || !pgp)

This comment has been minimized.

@tchaikov

tchaikov May 23, 2017

Contributor

i wonder why pgp could be nullptr? as pgp is a smart ptr and is holding a strong reference to the underlying PrimaryLogPG pointer, so even after pg_map.clear(), the pg is still valid, i think.

This comment has been minimized.

@gregsfortytwo

gregsfortytwo May 23, 2017

Member

Because I added code that zaps the pg reference...

This comment has been minimized.

@tchaikov

tchaikov May 23, 2017

Contributor

@gregsfortytwo could you please point me to the place where you zap the pg ref?

This comment has been minimized.

@gregsfortytwo

gregsfortytwo May 23, 2017

Member

pg.reset() immediately below in cancel()?

This comment has been minimized.

@gregsfortytwo

gregsfortytwo May 23, 2017

Member

You may be confused by the way "cancel"ing works on something we give the AsyncReserver. It does not guarantee the Context won't be finish()ed. :(

This comment has been minimized.

@tchaikov

tchaikov May 23, 2017

Contributor

ahh, that explains it! i need to look at this part closer before sending out more noise. sorry and thanks !

}
void cancel() {
assert(pg->is_locked());
assert(!canceled);
canceled = true;
pg->osd->snap_reserver.cancel_reservation(pg->get_pgid());
pg.reset();

This comment has been minimized.

@tchaikov

tchaikov May 23, 2017

Contributor

pg will call pg->put("intptr") in its dtor, so it's not necessary to reset it explicitly, am i right?

This comment has been minimized.

@gregsfortytwo

gregsfortytwo May 23, 2017

Member

The whole point of this code is clearing out the pg reference so that it goes away before the Context is deleted. That's because we can't do a reliable deletion when we cancel a Context that's been given to an AsyncReserver as happens with this one .

That said, the race is pretty small and I don't think it's actually responsible for the bug report(s) I've been given so I'm trying to figure out other ref counting issues we may have.

if (!canceled)
pg->snap_trimmer_machine.process_event(SnapTrimReserved());
pg->unlock();
pgp->snap_trimmer_machine.process_event(SnapTrimReserved());

This comment has been minimized.

@tchaikov

tchaikov May 23, 2017

Contributor

nit, could early return if (canceled).

This comment has been minimized.

@gregsfortytwo

gregsfortytwo May 23, 2017

Member

We need to unlock and there aren't any Locker semantics around the extra PG lock interfaces that I'm aware of (not that it would be any shorter in this case).

This comment has been minimized.

@tchaikov

tchaikov May 23, 2017

Contributor

i am thinking about something like this:

  if (canceled)
    return;
  pgp->lock();
  pgp->snap_trimmer_machine.process_event(SnapTrimReserved());
  pgp->unlock();
}

not shorter, but it's clearer this way that cancel is not protected by pg lock.

This comment has been minimized.

@gregsfortytwo

gregsfortytwo May 23, 2017

Member

It is protected by pg lock.

@liewegas liewegas changed the title from Don't leak pgrefs or reservations in SnapTrimmer to osd: don't leak pgrefs or reservations in SnapTrimmer May 23, 2017

@gregsfortytwo gregsfortytwo changed the title from osd: don't leak pgrefs or reservations in SnapTrimmer to DNM: osd: don't leak pgrefs or reservations in SnapTrimmer May 24, 2017

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 24, 2017

Anybody have any idea why this build is failing? It's happening on my rex box but I cannot for the life of me figure out why the compiler is using a const-pointer-returning operator->() instead of the regular pointer...

/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/OSD.cc: In member function ‘virtual void C_CompleteSplits::finish(int)’:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/osd/OSD.cc:9132:22: error: invalid conversion from ‘const PG*’ to ‘PG*’ [-fpermissive]
PG *pg = i->get();

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented May 24, 2017

i am able to compile the OSD.cc after rebasing against master in my ubuntu 16.04 docker env. gcc version is 4:5.3.1-1ubuntu1.

ahh, i didn't enable PG_DEBUG_REFS.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented May 24, 2017

retest this please. (i don't understand this either.)

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented May 24, 2017

i am able to reproduce this.

@tchaikov

This comment has been minimized.

Contributor

tchaikov commented May 24, 2017

std::set
...
Constant BidirectionalIterator (since C++11)

i think this is why we have a const PG* here

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 24, 2017

I thought it might be, but the typing is explicit (not auto) and it's set::iterator, not set::const_iterator!

And actually, there is another failure at PrimaryPGLog.cc:13818 where it is deref'ing a PGRef to get hte snap_trimmer_machine and it also expect that to be const, it isn't, and it's failing. So the issue is definitely something to do with tracked_int_ptr and const-resolving.

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 24, 2017

and indeed PGRef pgr = *i; PG *pg = pgr.get(); works fine. But it still doesn't like PG *pg = i->get(). So it's definitely a type resolution issue...

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 24, 2017

Okay, Yehuda says that iterators on sets always give you const elements back because you can't change them (it breaks sorting). That makes sense.
The problem here is that tracked_int_ptr is returning const PG* from its member-const functions, and that's not how we need the pointers to work. Easy enough to fix but I initially thought that was not an acceptable way to do it.

gregsfortytwo added some commits May 24, 2017

osd: use PGRef instead of boost::intrusive_ptr<PG> everywhere
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
osd: repair the PG_DEBUG_REFS build
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
osd: make sure we drop Backoff PGRefs on shutdown
Fixes: http://tracker.ceph.com/issues/19931

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
osd: print out pg debug ref counts on acquire/drop
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 25, 2017

This is ready to be looked at again.

assert(!canceled);
canceled = true;
pg.reset();

This comment has been minimized.

@liewegas

liewegas May 25, 2017

Member

It seems like this doesn't reliably make the ref go away. The asyncreserver could grab pgp, drop the local lock, then on_shutdown runs under pg lock and cancels. What ensures that asyncreserver is not still blocked on pg->lock (or scheduled away) when we assert there are no refs? Do we flush the asyncserver thread or something? (And if we do flush the asyncreserver, can't we make sure the Context goes away that way and avoid this change entirely?)

This comment has been minimized.

@gregsfortytwo

gregsfortytwo May 25, 2017

Member

OSD::shutdown() loops through and locks the PGs twice. The first time it calls PrimaryPG::on_shutdown(), which invokes this cancel.
The second time, after doing some other work, is when we need the remaining PGRefs to be gone. So we're okay if the finish() function is actually racing live, since it will get the PG lock and drop the ref before we need it to be gone.

@liewegas

This comment has been minimized.

Member

liewegas commented May 25, 2017

@jdurgin

This comment has been minimized.

Member

jdurgin commented May 25, 2017

why doesn't QueuePeeringEvt's PGRef cause issues? since asyncreserver doesn't flush callbacks when you cancel them, can't it still be live after the pg is shutdown?

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 25, 2017

@liewegas, AsyncReserver cancellation doesn't actually guarantee the "finish" doesn't get triggered. There are internal races with taking stuff off of the queue and actually invoking them in the finisher thread. So we actually need this dance to prevent it racing to grab the PG lock while the PG is being destroyed anyway.

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 25, 2017

@jdurgin I don't warrant against the presence of all PGRef leaks, but in the QueuePeeringEvt case we do drain its threadpool and clear the peering_wq early enough that it cleans up correctly.

I'm not sure how peering_wq is related to AsyncReserver, but it deletes any callbacks it has in its local set of waiting things.

@jdurgin

This comment has been minimized.

Member

jdurgin commented May 25, 2017

ah, the reserver_finisher is emptied and stopped in OSDService::shutdown(), prior to clearing the peering_wq, so that prevents any QueuePeeringEvt callbacks from being in flight at the end of OSD::shutdown(). That same finisher is used for the snap_reserver, so the PGRefs will be cleaned up too. The problem is that this all happens after checking for PGRefs in the middle of OSD::shutdown().

Is moving the reserver_finisher cleanup to OSDService::start_shutdown() (called before individual pg shutdown) and making your last commit just Reset() the snap_trim_machine sufficient to cleanup the extra references?

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 25, 2017

It might be, but I was nervous about changing the shutdown order that much. (Keep in mind we have to backport the fix to Jewel with high confidence; I'm a lot happier changing the code we just backported rather than something central to OSD function.) If we were going to do such a thing I'd like to take the time to move most all of OSDService::shutdown() up into start_shutdown. And I'm not sure we actually can adjust things that way (start_shutdown() precedes our draining the op wq); I briefly considered that but it seemed like documenting the pg ref counting interfaces and the ways every queue interacts on shutdown were going to be a much longer project than we're interested in doing pre-luminous.

Note that peering_wq is actually cleared much earlier than that; I'm not sure if the second clear you're looking at is redundant[1] or to deal with extra stuff that might come in via other operations being run. Certainly it's drained after the PGs have been notified about shutdown so I presume they won't be giving away any new refs to themselves.
[1]: actually, looking at the git history, I see Sam introduced both peering_wq.clear() calls without comment in the same commit he added the pg lock ref counting. I kinda think he was trying out a few locations and just missed.

@jdurgin

This comment has been minimized.

Member

jdurgin commented May 26, 2017

For the short term issue for backport here, let's just disable that ceph_abort() during shutdown. It's not necessary outside of testing and it does seem to be a rabbit hole to fix all the causes.
It can hide behind a osd_debug_pg_refs config option enabled by teuthology/vstart.

@liewegas

This comment has been minimized.

Member

liewegas commented May 26, 2017

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 26, 2017

  1. is what the Reset() signal does.
  2. I guess maybe? More change to "external" code I didn't want to get into right now though.
  3. I don't know what you mean other than what we already do.
@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 26, 2017

Is there actually a problem with the changes I've made? Feels like I'm being bikeshedded as we all rediscover that shutdown is annoying. If you'd like I can pull out all the patches that fixed up the ref counting so the PR is smaller; this is a pretty minimal change to fix a specific bug we introduced in the code being changed...

@liewegas

This comment has been minimized.

Member

liewegas commented May 26, 2017

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 26, 2017

Oh, I did lose a check for is_canceled prior to grabbing the pg lock. I can put that back.

Even without that though, I really don't think it's a practical race. You'd have to have threads A and B

  1. A holds lock X (pg lock)
  2. B requests lock X
  3. A drops lock X
  4. A acquires and drops many other locks
  5. A requests lock X and gets it before B does

for anything to break.

I'll look at flushing the work queues but it's a much more dangerous change to make on shutdown; we need to do everything in the right order to prevent putting pieces of work back into places they shouldn't be. And even that isn't enough. The Reset() message is needed to "shut down" the snap trimmer state machine and is probably sufficient on its own to prevent all the crashes users have seen. The cancel dance is just me being cautious because it's technically possible.

@liewegas

This comment has been minimized.

Member

liewegas commented May 26, 2017

Yeah, I get it's the conservative thing to do, but I'm not sure it's worth doing a conservative fix for what amounts to leak debug code. On jewel can we disable the ceph_abort() like @jdurgin suggested? And on master, we can do the simple and more robust/correct thing and not bloat a trivial callback with a weird lock dance.

It seems like something like this would do it? I can't think of a reason why shutting down the finisher used only for the reserver earlier (right before we assert that pg refs are gone) would break something. And if we don't see this again after a while we can backport this too.

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 87177a3..e82f620 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -480,10 +480,14 @@ void OSDService::start_shutdown()
   }
 }
 
-void OSDService::shutdown()
+void OSDService::shutdown_reserver()
 {
   reserver_finisher.wait_for_empty();
   reserver_finisher.stop();
+}
+
+void OSDService::shutdown()
+{
   {
     Mutex::Locker l(watch_lock);
     watch_timer.shutdown();
@@ -3175,6 +3179,8 @@ int OSD::shutdown()
     assert(pg_stat_queue.empty());
   }
 
+  service.shutdown_reserver();
+
   // Remove PGs
 #ifdef PG_DEBUG_REFS
   service.dump_live_pgids();
diff --git a/src/osd/OSD.h b/src/osd/OSD.h
index cad51ed..5da0b8d 100644
--- a/src/osd/OSD.h
+++ b/src/osd/OSD.h
@@ -1100,6 +1100,7 @@ public:
   void init();
   void final_init();  
   void start_shutdown();
+  void shutdown_reserver();
   void shutdown();
 
 private:
osd: Reset() the snaptrimmer on shutdown
We were failing to exit various wait states which held PGRefs. Error!

Fixes: http://tracker.ceph.com/issues/19931

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 26, 2017

Yeah okay. Pushing fresh patches shortly.

@jdurgin

This comment has been minimized.

Member

jdurgin commented May 26, 2017

lgtm - no need to include 6ab960c and its revert though

gregsfortytwo added some commits May 26, 2017

osd: shutdown our reserver_finisher earlier in the process
This finisher thread has a lot of callbacks which can hold PGRefs. Make
sure we drain them out before checking that all the PGs have finished
and have no outstanding references.

Moving this should be safe; we've already stopped the op thread et al
and the only things still running are the OSDService's objecter_finisher,
recovery_request_timer, and snap_sleep_timer (which has definitely been emptied
by the time we get here as it's synchronously cleared out on PG shutdown).

Signed-off-by: Greg Farnum <gfarnum@redhat.com>
osd: do not default-abort on leaked pg refs
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 26, 2017

http://pulpito.ceph.com/gregf-2017-05-26_06:45:56-rados-wip-19931-snaptrim-pgs---basic-smithi/ shows 5 rados failures of 100 (none seem related).

http://pulpito.ceph.com/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/ shows rgw-multisite failing in ways that are not a part of this patch series, and @cbodley said it used to reproduce them reliably. Calling it good!

@gregsfortytwo gregsfortytwo changed the title from DNM: osd: don't leak pgrefs or reservations in SnapTrimmer to osd: don't leak pgrefs or reservations in SnapTrimmer May 26, 2017

@gregsfortytwo gregsfortytwo merged commit 032466d into ceph:master May 26, 2017

2 of 3 checks passed

default Build triggered. sha1 is merged.
Details
Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details

@gregsfortytwo gregsfortytwo deleted the gregsfortytwo:wip-19931-snaptrim-pgref branch May 26, 2017

@liewegas

This comment has been minimized.

Member

liewegas commented May 26, 2017

@cbodley

This comment has been minimized.

Contributor

cbodley commented May 27, 2017

this osd segfault was in both of those rgw:multisite runs:

src/common/shared_cache.hpp: In function 'SharedLRU<K, V, C, H>::~SharedLRU() [with K = unsigned int; V = const OSDMap; C = std::less; H = std::hash]' thread 9618080 time 2017-05-26 07:03:31.013021
src/common/shared_cache.hpp: 107: FAILED assert(weak_refs.empty())
ceph version 12.0.2-1359-g3300304 (3300304)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaef2f0]
2: (()+0x4fe97a) [0x60697a]
3: (OSDService::~OSDService()+0x17c) [0x59157c]
4: (OSD::~OSD()+0x133) [0x5df193]
5: (OSD::~OSD()+0x9) [0x5df7d9]
6: (main()+0x2fca) [0x4d8aba]
7: (__libc_start_main()+0xf5) [0xd7adb35]
8: (()+0x469c39) [0x571c39]
NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.
2017-05-26 07:03:31.011502 9618080 -1 leaked refs:
dump_weak_refs 0xf69e910 weak_refs: 78 = 0x40fe8da0 with 98 refs
dump_weak_refs 0xf69e910 weak_refs: 84 = 0x32e8a9d0 with 98 refs

and this one was in http://qa-proxy.ceph.com/teuthology/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/1231015/teuthology.log:

src/mon/MonClient.cc: In function 'int MonClient::wait_auth_rotating(double)' thread 10008700 time 2017-05-26 07:03:04.381430
src/mon/MonClient.cc: 901: FAILED assert(auth != nullptr)
ceph version 12.0.2-1359-g3300304 (3300304)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0xaef2f0]
2: (MonClient::wait_auth_rotating(double)+0x63c) [0xb0015c]
3: (OSD::ms_get_authorizer(int, AuthAuthorizer**, bool)+0xbc) [0x57887c]
4: (AsyncConnection::handle_connect_reply(ceph_msg_connect&, ceph_msg_connect_reply&)+0x853) [0xcf7c73]
5: (AsyncConnection::_process_connection()+0x19d6) [0xcfd8e6]
6: (AsyncConnection::process()+0x708) [0xd01608]
7: (EventCenter::process_events(int)+0x814) [0xb7ea94]
8: (()+0xa790aa) [0xb810aa]
9: (()+0xb5230) [0xd020230]
10: (()+0x7dc5) [0xc737dc5]
11: (clone()+0x6d) [0xd88373d]
NOTE: a copy of the executable, or objdump -rdS <executable> is needed to interpret this.

@gregsfortytwo

This comment has been minimized.

Member

gregsfortytwo commented May 30, 2017

the first assert is the osd map cache, which doesn't seem like it could be related. The second is presumably http://tracker.ceph.com/issues/19900#change-91110

@cbodley

This comment has been minimized.

Contributor

cbodley commented May 30, 2017

@gregsfortytwo thanks. looks like the second one is fixed in master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment