New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osd: don't leak pgrefs or reservations in SnapTrimmer #15214
osd: don't leak pgrefs or reservations in SnapTrimmer #15214
Conversation
Just compile-tested so far, will do some vstart testing and if that works out this should get rolled into somebody's integration run asap. :) |
Actually, hold up on this. I can't seem to reproduce the crash the way I expected to (give the OSD a lot to trim, shut it down while there are PGs pending ). |
src/osd/PrimaryLogPG.h
Outdated
void finish(int) override { | ||
pg->lock(); | ||
PrimaryLogPG *pgp = pg.get(); | ||
if (service->is_stopping() || !pgp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder why pgp could be nullptr
? as pgp is a smart ptr and is holding a strong reference to the underlying PrimaryLogPG
pointer, so even after pg_map.clear()
, the pg
is still valid, i think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I added code that zaps the pg reference...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gregsfortytwo could you please point me to the place where you zap the pg ref?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pg.reset() immediately below in cancel()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may be confused by the way "cancel"ing works on something we give the AsyncReserver. It does not guarantee the Context won't be finish()ed. :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, that explains it! i need to look at this part closer before sending out more noise. sorry and thanks !
src/osd/PrimaryLogPG.h
Outdated
} | ||
void cancel() { | ||
assert(pg->is_locked()); | ||
assert(!canceled); | ||
canceled = true; | ||
pg->osd->snap_reserver.cancel_reservation(pg->get_pgid()); | ||
pg.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pg
will call pg->put("intptr")
in its dtor, so it's not necessary to reset it explicitly, am i right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point of this code is clearing out the pg reference so that it goes away before the Context is deleted. That's because we can't do a reliable deletion when we cancel a Context that's been given to an AsyncReserver as happens with this one .
That said, the race is pretty small and I don't think it's actually responsible for the bug report(s) I've been given so I'm trying to figure out other ref counting issues we may have.
src/osd/PrimaryLogPG.h
Outdated
if (!canceled) | ||
pg->snap_trimmer_machine.process_event(SnapTrimReserved()); | ||
pg->unlock(); | ||
pgp->snap_trimmer_machine.process_event(SnapTrimReserved()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could early return if (canceled)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to unlock and there aren't any Locker semantics around the extra PG lock interfaces that I'm aware of (not that it would be any shorter in this case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am thinking about something like this:
if (canceled)
return;
pgp->lock();
pgp->snap_trimmer_machine.process_event(SnapTrimReserved());
pgp->unlock();
}
not shorter, but it's clearer this way that cancel is not protected by pg lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is protected by pg lock.
Anybody have any idea why this build is failing? It's happening on my rex box but I cannot for the life of me figure out why the compiler is using a const-pointer-returning operator->() instead of the regular pointer...
|
ahh, i didn't enable PG_DEBUG_REFS. |
retest this please. (i don't understand this either.) |
i am able to reproduce this. |
i think this is why we have a |
I thought it might be, but the typing is explicit (not auto) and it's set::iterator, not set::const_iterator! And actually, there is another failure at PrimaryPGLog.cc:13818 where it is deref'ing a PGRef to get hte snap_trimmer_machine and it also expect that to be const, it isn't, and it's failing. So the issue is definitely something to do with tracked_int_ptr and const-resolving. |
and indeed |
Okay, Yehuda says that iterators on sets always give you const elements back because you can't change them (it breaks sorting). That makes sense. |
2fb03cd
to
cedd411
Compare
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Fixes: http://tracker.ceph.com/issues/19931 Signed-off-by: Greg Farnum <gfarnum@redhat.com>
cedd411
to
138e246
Compare
This is ready to be looked at again. |
src/osd/PrimaryLogPG.h
Outdated
assert(!canceled); | ||
canceled = true; | ||
pg.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this doesn't reliably make the ref go away. The asyncreserver could grab pgp, drop the local lock, then on_shutdown runs under pg lock and cancels. What ensures that asyncreserver is not still blocked on pg->lock (or scheduled away) when we assert there are no refs? Do we flush the asyncserver thread or something? (And if we do flush the asyncreserver, can't we make sure the Context goes away that way and avoid this change entirely?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OSD::shutdown() loops through and locks the PGs twice. The first time it calls PrimaryPG::on_shutdown(), which invokes this cancel.
The second time, after doing some other work, is when we need the remaining PGRefs to be gone. So we're okay if the finish() function is actually racing live, since it will get the PG lock and drop the ref before we need it to be gone.
As part of that "other stuff" can't it cancel all the AsyncReserver events? (And avoid this dance?)
|
why doesn't QueuePeeringEvt's PGRef cause issues? since asyncreserver doesn't flush callbacks when you cancel them, can't it still be live after the pg is shutdown? |
@liewegas, AsyncReserver cancellation doesn't actually guarantee the "finish" doesn't get triggered. There are internal races with taking stuff off of the queue and actually invoking them in the finisher thread. So we actually need this dance to prevent it racing to grab the PG lock while the PG is being destroyed anyway. |
@jdurgin I don't warrant against the presence of all PGRef leaks, but in the QueuePeeringEvt case we do drain its threadpool and clear the peering_wq early enough that it cleans up correctly. I'm not sure how peering_wq is related to AsyncReserver, but it deletes any callbacks it has in its local set of waiting things. |
ah, the reserver_finisher is emptied and stopped in OSDService::shutdown(), prior to clearing the peering_wq, so that prevents any QueuePeeringEvt callbacks from being in flight at the end of OSD::shutdown(). That same finisher is used for the snap_reserver, so the PGRefs will be cleaned up too. The problem is that this all happens after checking for PGRefs in the middle of OSD::shutdown(). Is moving the reserver_finisher cleanup to OSDService::start_shutdown() (called before individual pg shutdown) and making your last commit just Reset() the snap_trim_machine sufficient to cleanup the extra references? |
It might be, but I was nervous about changing the shutdown order that much. (Keep in mind we have to backport the fix to Jewel with high confidence; I'm a lot happier changing the code we just backported rather than something central to OSD function.) If we were going to do such a thing I'd like to take the time to move most all of OSDService::shutdown() up into start_shutdown. And I'm not sure we actually can adjust things that way (start_shutdown() precedes our draining the op wq); I briefly considered that but it seemed like documenting the pg ref counting interfaces and the ways every queue interacts on shutdown were going to be a much longer project than we're interested in doing pre-luminous. Note that peering_wq is actually cleared much earlier than that; I'm not sure if the second clear you're looking at is redundant[1] or to deal with extra stuff that might come in via other operations being run. Certainly it's drained after the PGs have been notified about shutdown so I presume they won't be giving away any new refs to themselves. |
For the short term issue for backport here, let's just disable that ceph_abort() during shutdown. It's not necessary outside of testing and it does seem to be a rabbit hole to fix all the causes. |
right, but it has a ref, so it'll be able to get the lock eventually. why not
1- cancel (or try) in on_shutdown
2- flush the asyncreserver finisher thread (so any uncancelled-but-queued contexts do their thing)
3- verify pgrefs are gone
|
|
Is there actually a problem with the changes I've made? Feels like I'm being bikeshedded as we all rediscover that shutdown is annoying. If you'd like I can pull out all the patches that fixed up the ref counting so the PR is smaller; this is a pretty minimal change to fix a specific bug we introduced in the code being changed... |
If I'm reading your change right this narrows the race but doesn't remove it because the other thread might be waiting on pg->lock and we can't guarantee it will get flushed before we get around to the ref check.
And it seems so much simpler to just flush the wq with the context+ref before verifying refs are gone...
I like the idea of hiding the abort behind the config option; this is purely a code hygiene thing and doesn't affect end users at all (as long as we don't assert :).
|
Oh, I did lose a check for is_canceled prior to grabbing the pg lock. I can put that back. Even without that though, I really don't think it's a practical race. You'd have to have threads A and B
for anything to break. I'll look at flushing the work queues but it's a much more dangerous change to make on shutdown; we need to do everything in the right order to prevent putting pieces of work back into places they shouldn't be. And even that isn't enough. The Reset() message is needed to "shut down" the snap trimmer state machine and is probably sufficient on its own to prevent all the crashes users have seen. The cancel dance is just me being cautious because it's technically possible. |
Yeah, I get it's the conservative thing to do, but I'm not sure it's worth doing a conservative fix for what amounts to leak debug code. On jewel can we disable the ceph_abort() like @jdurgin suggested? And on master, we can do the simple and more robust/correct thing and not bloat a trivial callback with a weird lock dance. It seems like something like this would do it? I can't think of a reason why shutting down the finisher used only for the reserver earlier (right before we assert that pg refs are gone) would break something. And if we don't see this again after a while we can backport this too. diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc index 87177a3..e82f620 100644 --- a/src/osd/OSD.cc +++ b/src/osd/OSD.cc @@ -480,10 +480,14 @@ void OSDService::start_shutdown() } } -void OSDService::shutdown() +void OSDService::shutdown_reserver() { reserver_finisher.wait_for_empty(); reserver_finisher.stop(); +} + +void OSDService::shutdown() +{ { Mutex::Locker l(watch_lock); watch_timer.shutdown(); @@ -3175,6 +3179,8 @@ int OSD::shutdown() assert(pg_stat_queue.empty()); } + service.shutdown_reserver(); + // Remove PGs #ifdef PG_DEBUG_REFS service.dump_live_pgids(); diff --git a/src/osd/OSD.h b/src/osd/OSD.h index cad51ed..5da0b8d 100644 --- a/src/osd/OSD.h +++ b/src/osd/OSD.h @@ -1100,6 +1100,7 @@ public: void init(); void final_init(); void start_shutdown(); + void shutdown_reserver(); void shutdown(); private: |
We were failing to exit various wait states which held PGRefs. Error! Fixes: http://tracker.ceph.com/issues/19931 Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Yeah okay. Pushing fresh patches shortly. |
3300304
to
fc68e70
Compare
lgtm - no need to include 6ab960c and its revert though |
This finisher thread has a lot of callbacks which can hold PGRefs. Make sure we drain them out before checking that all the PGs have finished and have no outstanding references. Moving this should be safe; we've already stopped the op thread et al and the only things still running are the OSDService's objecter_finisher, recovery_request_timer, and snap_sleep_timer (which has definitely been emptied by the time we get here as it's synchronously cleared out on PG shutdown). Signed-off-by: Greg Farnum <gfarnum@redhat.com>
Signed-off-by: Greg Farnum <gfarnum@redhat.com>
http://pulpito.ceph.com/gregf-2017-05-26_06:45:56-rados-wip-19931-snaptrim-pgs---basic-smithi/ shows 5 rados failures of 100 (none seem related). http://pulpito.ceph.com/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/ shows rgw-multisite failing in ways that are not a part of this patch series, and @cbodley said it used to reproduce them reliably. Calling it good! |
fc68e70
to
4caf2df
Compare
\o/
|
this osd segfault was in both of those rgw:multisite runs:
and this one was in http://qa-proxy.ceph.com/teuthology/gregf-2017-05-26_06:42:33-rgw:multisite-wip-19931-snaptrim-pgs---basic-smithi/1231015/teuthology.log:
|
the first assert is the osd map cache, which doesn't seem like it could be related. The second is presumably http://tracker.ceph.com/issues/19900#change-91110 |
@gregsfortytwo thanks. looks like the second one is fixed in master |
No description provided.