osd: fast dispatch of peering events and pg_map + osd sharded wq refactor #19973

Merged: 126 commits merged into ceph:master on Apr 6, 2018

Conversation

liewegas (Member) commented Jan 16, 2018

  • put everything in the fast dispatch path except for osd map processing and legacy pg
    create messages
  • move most of the guts of ShardedOpWQ into a top-level OSDShard
  • kill the global pg_map and instead rely on the pg_slots in each OSDShard
  • several cleanups along the way
  • some bug fixes that seem related to pg removal but didn't surface on master (since this will merge before mimic, I'm not bothering to pull them out separately)

TODO

  • write doc for op and peering event ordering constraints

Review guide: This is a long patch series that represents a transition between several different approaches, eventually settling on a more significant rewrite of the sharded wq than I had originally planned. Sorry! That means that the intermediate solutions aren't worth reviewing carefully for correctness, but are still useful to see separately to understand the evolution of the code.

This series goes roughly like this:

  • fast dispatch of peering events. The first several commits, up through "osd: fast dispatch peering events" (parts 1 and 2) simplify the OSD.cc code so that peering messages translate directly into one or more events that are fed into the normal sharded work queue. This simplifies the number of paths through the code, making it easier to reason about.
  • Several patches change the pg creation behavior. The overall takeaway here is that there is an event property that determines whether the event can trigger a pg creation. Several events do this, including notify, log, and of course the create message from the mon. We have a new MOSDPGCreate2 message that specifies those creations in terms of spg_t instead of pg_t. Unfortunately we have to keep around the old creation path (handle_pg_create) to deal with old monitors. That will eventually go away, but we have to keep it for a couple of releases.
  • 'osd: prime pg_slots for to-be-split children' is notable for introducing a pg_slot-based way to deal with splits: instantiate the slot for the child pgs and mark it so that we queue up events but don't process them yet. This eventually avoids the special checks for "is this pg splitting?" that have to look things up, although those changes come a bit later.
  • Move scrub requests into the work queue.
  • Kill the disk_tp! Yay! Now we only have two work queues: the big one, and one for commands (asok or tell). The latter is totally unrelated to PGs so for now let's leave it separate.
  • Here's where things get messy: several patches adjust (and adjust again) how the PG waiting works. This isn't worth looking at in detail; instead, look at the final result and the comments in OSD.h above _process that describe the rules around queueing events, peering vs op events, and how the requeuing behaves.
  • Several patches around "osd: store ec profile with final pool" resolve an issue with the idea that a PG survives until its data is deleted, even if the pool is gone. That means we have to reinstantiate the PG on an OSD restart even if the pool info is missing from the OSDMap. To that end, we stash info about deleted pools... namely the pg_num and the erasure-code-profile.
  • 'osd: do not release recovery_ops_reserved on requeue' is notable because the old behavior made no sense to me, so I "fixed" it... but that old behavior is still present (I think) in luminous. Please see if I'm missing something!
  • Lots of the lookup methods in OSD.h are simplified here, and even further later on as the number of ways we look up pgs continually shrinks (yay!).
  • 'osd: kill pg_map' is the next big shift: we remove pg_map and rely entirely on each OSDShard's pg_slots table (see the sketch after this list). This simplifies everything because we have less stuff to keep consistent in memory and one less mutex to deal with. It's a huge patch, and a lot of the code it touches gets changed later, so don't worry too much about correctness yet.
  • 'restructure consume_map' is the next big one. The concept is that the map comes into handle_osd_map, is written to disk, and gets to consume_map (same as before). At this point we push it down into each shard. It's at this point that we make sure the shard's OSDMapRef and the pg_slots are in sync in terms of priming splits: once the shard's ref is updated we know that any resulting split is already primed. Later, the advance_pg() bit that pulls PGs up to the shard map can thus be sure that any splits that result then are primed and the children can be installed as soon as they're ready. The old split tracking machinery (a lock and a bunch of random maps) is cleaned out, yay!
  • _attach_pg and _detach_pg unify the pg linkage to pg_slots in one place. Also we now have a PG pointer back to its shard and its pg_slot.
  • pg_slots switch to unique_ptr... they're getting big so we don't want them to get copied around on hash inserts etc. (Also, we want the pg backpointer!)
  • The pg_epoch tracking (to figure out the min epoch for all pgs) is moved to be per-shard. Old machinery is removed.
  • RecoveryCtx completions handling is simplified (attach directly to txn, yay)
  • The pg counts are adjusted on PG removal. This is good enough for now, I think, but I'm definitely looking for ways to improve this later.
  • 'osd: kill broken _process optimization; simplify null pg flow' is the last biggish rewrite. The all-important _process flow is simplified with (I think) all remnants of complexity from the pg_map days removed. The checks that are needed after dropping and retaking the shard lock are also grouped and only applied when the pg is non-null; if it's null we skip it all.
  • The remaining patches are small fixups.

At the end, I would review the _process() code itself along with the description of ordering in OSD.h. This is the heart of it!
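
For orientation, here is a minimal, self-contained sketch of the per-shard slot structure the series converges on. All types here are simplified stand-ins for the real Ceph classes (PG, OSDMap, spg_t, OpQueueItem), not the actual definitions:

#include <cstdint>
#include <deque>
#include <map>
#include <memory>
#include <mutex>
#include <tuple>

struct PG;                                      // stand-in for the real PG
using PGRef = std::shared_ptr<PG>;
struct OSDMap;                                  // stand-in for the real OSDMap
using OSDMapRef = std::shared_ptr<const OSDMap>;
struct OpQueueItem {};                          // queued op or peering event

struct spg_t {                                  // pool + seed + (EC) shard id
  uint64_t pool; uint32_t seed; int8_t shard;
  bool operator<(const spg_t& o) const {
    return std::tie(pool, seed, shard) < std::tie(o.pool, o.seed, o.shard);
  }
};

// One slot per PG (or future PG): events can queue here before the PG exists.
struct PGSlot {
  PGRef pg;                          // null until _attach_pg() links the PG
  std::deque<OpQueueItem> waiting;   // events held until the PG is ready
  bool waiting_for_split = false;    // primed child slot: queue, don't process
};

// Each shard owns its slice of what used to be the global pg_map.
struct OSDShard {
  std::mutex shard_lock;
  OSDMapRef shard_osdmap;            // advanced in consume_map(); splits are
                                     // primed before this ref moves forward
  std::map<spg_t, std::unique_ptr<PGSlot>> pg_slots;
};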

liewegas (Member Author):
@gregsfortytwo killed off the disk_tp and recovery_gen_wq at the end, too!

void PG::update_store_on_load()
{
  if (osd->store->get_type() == "filestore") {
    // legacy filestore didn't store collection bit width; fix.
Member:

This has probably already been done, but I'm not sure which feature check prevents us from booting a pre-luminous store?

Member Author:

_preboot() has

  } else if (osdmap->require_osd_release < CEPH_RELEASE_LUMINOUS) {
    derr << "osdmap require_osd_release < luminous; please upgrade to luminous"
	 << dendl;

@@ -557,7 +557,7 @@ class PG : public DoutPrefixProvider {
return get_pgbackend()->get_is_recoverable_predicate();
}
protected:
OSDMapRef last_persisted_osdmap_ref;
epoch_t last_persisted_osdmap;
Member:

I don't understand how swapping these out helps us. If it's the latest persisted map we are certainly going to need the intervening maps when we start updating?

Member Author:

osd/PG: keep epoch, not map ref, of last osdmap for last persisted epoch

No need to pin the map in memory!

Member:

I know the commit says that, but don't we need to refer to all the following maps?
...no, no we don't, because we've already done the processing and just haven't persisted that we've done so. Okay.
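
To spell out the benefit (illustrative fragment, not the actual patch): an OSDMapRef is a shared_ptr that keeps the whole map pinned in the OSD's map cache, while a bare epoch is just an integer, and the map can be re-fetched on demand if it is ever needed again.

OSDMapRef last_persisted_osdmap_ref;   // before: pins an entire OSDMap in memory
epoch_t   last_persisted_osdmap;       // after: remembers only the epoch; look
                                       // the map up again if it is ever needed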

@@ -6421,6 +6397,50 @@ void OSD::ms_fast_dispatch(Message *m)
m->put();
return;
}

// peering event?
switch (m->get_type()) {
Member:

Why are none of the PG ops becoming proper OpRequests for tracking purposes?

src/osd/OSD.cc Outdated
MOSDPeeringOp *pm = static_cast<MOSDPeeringOp*>(m);
return enqueue_peering_evt(
  pm->get_spg(),
  PGPeeringEventRef(pm->get_event()));
Member:

This appears to have lost the security check that our peer is actually an OSD, along with a few other subtler things I'm not sure we need (OSD::require_self_aliveness should be replaced by the PG::old_peering_msg check, but I didn't trace through the is_active() check that does).
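
A hedged sketch of the kind of guard being discussed (require_osd_peer() is the pre-existing helper; whether and where to reinstate it is exactly the open question here):

if (!require_osd_peer(m)) {
  // sender is not actually an OSD; drop peering messages from non-OSD peers
  return;
}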


// trim log when the pg is recovered
pg->calc_min_last_complete_ondisk();
return discard_event();
Member:

Now that we've replaced PG::handle_pg_trim() we need to actually remove that code.

src/osd/OSD.cc Outdated
  handle_command(static_cast<MMonCommand*>(m));
  return;
case MSG_COMMAND:
  handle_command(static_cast<MCommand*>(m));
Member:

This can immediately invoke con->send_message(). Embarrassingly, I've forgotten if we're allowed to do that inside of fast dispatch, but I don't think so?

Member Author:

it's safe, we do it all over the place

Member:

We definitely don't do it all over the place, and I'm quite sure we couldn't safely send messages out of fast_dispatch in the initial implementation, which was part of why we had all the elaborate tooling around scheduling map sharing to happen later on (struct send_map_on_destruct put a C_SendMap in the work queue in Jewel, for instance). It might have become safe since then (especially with the Connection-based functions or AsyncMessenger), but those races were subtle and unpleasant so we'll need to either avoid doing so or work through the locking again to prove it.

The only place we've ever done so that I can find is an error condition we wouldn't have tested: we return ENXIO on misdirected ops in-line. That in fact was something I kept around, but I think I just missed it.

(Really kicking myself for not documenting what was allowed here, since I thought we'd done such a good job dispatching it. But I think I just assumed sending messages fell under "long-held locks", and there were a bunch of cyclical deadlocks that involved a PG lock held while sending off messages while we were trying to dispatch a message to them...)

Member Author:

Hmm, yeah I remember it being weird, but I also can't remember why. The first case I remember reviewing or doing this is the fast dispatch for osd heartbeats which went in a while back.. see handle_osd_ping(), which will also send replies.

Stepping back, I don't see any reason why it shouldn't be possible. If there are remaining reasons (like loopback causing deadlock/recursion) then IMO we should fix that instead!

Member:

The global SimpleMessenger::lock is the obvious reason it would have caused trouble. The Connection-based interfaces are a lot safer, but still not guaranteed since they have to grab a bunch of extra locks if the pipe failed and needs to reconnect.

Handling reconnect means that guaranteeing a message dispatch is non-blocking is basically impossible unless we switch that model around completely; even if that doesn't cause a deadlock it's still distinctly undesirable — and it makes hard-to-identify deadlocks a lot more likely.

I'm not really sure I see a path that lets us generically guarantee that sending messages from fast_dispatch is safe. We could probably do it (by which I mean, it probably already works) if we restrict it to lossy connections. But I hate making use of interfaces which are that subtle and easy to break.

Fast dispatch works reasonably well because we made a simple interface: you're not allowed to do anything that might block. No matter how good we are, we can't send messages without doing something that might block. Therefore, we can't send messages in fast dispatch. It's the locking boundary that makes an interface like this between the OSD and Messenger possible.

Member:

I will note that in a seastar world this is much easier because if we have to do something blocking like establishing a new connection, it's easy to queue it to happen later. We just don't have that infrastructure right now.

Member:

And oh yeah, pipe replacement requires us to stop the previous Pipe's fast dispatch. There we go, deadlock.

Since we already have an easy model for queuing messages to go out not in the fast dispatch path, that seems a better idea right now than a lot of work that will destabilize our existing code and interfaces.

Member Author:

Oh bother.. I remember those deadlocks now. Yeesh.

Okay... so I think the primary value here is that it is funneling everything through the queue, and the message handlers are just generating queue events. I'll move these back to slow dispatch but leave the rest the same. I don't think that will make things any easier or harder in the eventual seastar transition. The hard part there is going to be handle_osd_map and the various bits of OSDService that have their own special locks (and aren't sharded).

Member:

That works too, and yes, I don't expect it to have any real impact on our future work!

Member Author:

...and yeah MSG_COMMAND looks like the only one, so no biggie!
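
To summarize the shape the thread settles on (hedged sketch using names from this PR): fast dispatch never sends a message or does anything that might block; it only translates the incoming message into a queue event, and all real work, including any replies, happens later in the work queue.

void fast_dispatch_peering_msg(MOSDPeeringOp *pm) {
  // no con->send_message() here; just turn the message into a queue event
  enqueue_peering_evt(pm->get_spg(),
                      PGPeeringEventRef(pm->get_event()));
  pm->put();   // the event now owns whatever it needs from the message
}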

src/osd/OSD.h Outdated
@@ -1529,7 +1528,6 @@ class OSD : public Dispatcher,
assert(osd_lock.is_locked());
finished.splice(finished.end(), ls);
}
void do_waiters();
Member:

If this machinery is dead, we need to remove the finished and take_waiters() members as well. But we're still pushing stuff on in OSD::activate_map(), from when OSD::handle_pg_create() pushes it on via require_same_or_newer_map(). So I don't think we can kill it quite yet.

Member Author:

yeah, the legacy pg create handler is still using it!

Member:

So that means we still need to do_waiters() then.

src/osd/OSD.cc Outdated
ShardData* sdata = shard_list[shard_index];
Mutex::Locker l(sdata->sdata_op_ordering_lock);
ShardData::pg_slot& slot = sdata->pg_slots[pgid];
slot.waiting_for_pg = true;
Member:

It's not wrong, but the ShardedOpWQ sure is grossly aware of what's going on with PGs at this point. :(

Member Author:

yes. the shardedwq becomes the complicated box that everything funnels through. my hope is that this will continue to be consolidated into an overall OSDShard-type concept (not just the queue), and ultimately be simpler to understand and replace in a seastar world than the hodge-podge of threads and locks we had before.

return;
}
header.version = HEAD_VERSION;
header.compat_version = COMPAT_VERSION;
Member:

You're setting the HEAD/COMPAT_VERSION to their default values. I think you meant to set them to an old value if we don't have SERVER_MIMIC?

Member Author:

yep fixed
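
The agreed fix, sketched (HAVE_FEATURE and SERVER_MIMIC are the existing Ceph feature-test macros; the old version numbers here are illustrative):

void encode_payload(uint64_t features) {
  if (!HAVE_FEATURE(features, SERVER_MIMIC)) {
    header.version = 1;            // pre-mimic peer: old encoding, and old
    header.compat_version = 1;     // header versions to match it
    // ... legacy encode ...
    return;
  }
  header.version = HEAD_VERSION;   // mimic+ peer: current encoding
  header.compat_version = COMPAT_VERSION;
  // ... current encode ...
}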

decode(options, p);
for (auto pg : pgs) {
// note: this only works with replicated pools. if a pre-mimic mon
// tries to force a mimic+ osd on an ec pool it will not work.
Member:

Sorry, I don't really understand how the forced recovery happens. Did it not previously work in EC pools at all? Or is this a regression?

Member Author:

previously it encoded pg_t, which the OSD had to go and look up in the pg_map. Fast dispatch needs a full spg_t to queue the event. For replicated pools spg_t::shard is NO_SHARD so it's easy, but for EC pools we don't know which shard to send it to. Since forced recovery is a rare operator event I'm opting to simply break it for the mid-upgrade old mon + new osd case.

Member:

Mmmm, that's probably okay but will need to include documentation. People occasionally get into a lot of trouble during upgrades (or upgrade because of trouble!) and we need to make it clear that they can't do this.

...actually, we need to figure out a way to block this sensibly. Probably throw a clog error?
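
For context on why the shard matters (simplified sketch; shard_id_t::NO_SHARD is the real constant): converting a pg_t to an spg_t is trivial for replicated pools, but for EC pools the OSD may hold any shard of the pg, so a bare pg_t no longer identifies a unique queue target once the pg_map lookup is gone.

// replicated pool: the shard is always NO_SHARD, so this is unambiguous
spg_t queue_target(pg_t pgid) {
  return spg_t(pgid, shard_id_t::NO_SHARD);
}
// EC pool: pg_t -> spg_t needs to know which shard(s) this OSD hosts,
// which is exactly the pg_map lookup that fast dispatch no longer has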

src/osd/OSD.cc Outdated
  PG *pg,
  GenContext<ThreadPool::TPHandle&> *c)
{
  epoch_t e = get_osdmap()->get_epoch();
Contributor:

Maybe use OSDService::get_osdmap_epoch() instead? The main benefit would be no shared_ptr dance.

       │     _ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EED4Ev():
       │           // Does not throw if __r._M_get_use_count() == 0, caller must
       │           explicit __shared_count(const __weak_count<_Lp>& __r, std::no
       │
       │           ~__shared_count() noexcept
       │           {
       │             if (_M_pi != nullptr)
       │       mov    -0x68(%rbp),%rdi
  1,59 │       mov    -0x70(%rbp),%rax
       │       test   %rdi,%rdi
       │       mov    0x10(%rax),%r13d
 26,29 │    ┌──je     4ba45e <OSDService::queue_recovery_context(PG*, GenContext
       │    │          _M_pi->_M_release();
       │    │→ callq  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2
       │    │_ZN10OSDService22queue_recovery_contextEP2PGP10GenContextIRN10Threa
       │    │  epoch_t e = get_osdmap()->get_epoch();
       │    │  enqueue_back(
       │    │    OpQueueItem(
  6,37 │ 6e:└─→callq  ceph_clock_now()

Member Author:

fixed!
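
The suggestion, spelled out (fragment; both accessors are named in the comment above):

// before: builds and immediately destroys an OSDMapRef (the shared_ptr
// refcount traffic visible as _M_release in the profile above)
epoch_t e = get_osdmap()->get_epoch();
// after: returns a cached epoch_t directly, no shared_ptr dance
epoch_t e = get_osdmap_epoch();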

src/osd/OSD.cc Outdated
enqueue_back(
  OpQueueItem(
    unique_ptr<OpQueueItem::OpQueueable>(
      new PGRecoveryContext(pg->get_pgid(), c, e)),
Contributor:

std::make_unique<PGRecoveryContext>?
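
The suggested tidy-up, for reference (same behavior, avoids the explicit new; std::make_unique is C++14 and converts implicitly to the base-class unique_ptr):

enqueue_back(
  OpQueueItem(
    std::make_unique<PGRecoveryContext>(pg->get_pgid(), c, e),
    // ... remaining OpQueueItem arguments unchanged ...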

@liewegas force-pushed the wip-peering-fast-dispatch branch 2 times, most recently from 7db1a13 to 7cc290b on January 18, 2018 22:29
@liewegas force-pushed the wip-peering-fast-dispatch branch 4 times, most recently from 1fbea7c to 080e43e on January 29, 2018 22:54
@liewegas force-pushed the wip-peering-fast-dispatch branch 3 times, most recently from d639f1e to 32ce304 on February 12, 2018 21:51
@liewegas force-pushed the wip-peering-fast-dispatch branch 5 times, most recently from 2b57d4f to 43336f7 on February 22, 2018 03:19
@liewegas mentioned this pull request Feb 23, 2018
liewegas (Member Author):

@gregsfortytwo this is getting pretty close to mergeable; want to review the new bits? last time you looked only the first half of the commits were here. the second half kills the global pg_map, moves most of the ShardedOpWQ bits into a top-level OSDShard class, and reimplements the split handling (much simpler, yay!).

@liewegas force-pushed the wip-peering-fast-dispatch branch 3 times, most recently from facf139 to 79e753d on February 26, 2018 20:51
@liewegas changed the title osd: fast dispatch of peering events osd: fast dispatch of peering events and pg_map + osd sharded wq refactor Feb 27, 2018
New commits (each Signed-off-by: Sage Weil <sage@redhat.com>):

So that comment makes sense

This is only used for waiting.

Comment rules and take lock during init/shutdown (nice but not necessary).

This is cheaper and more explicit.

If the entry already exists, we don't want to construct the PGSlot
and then throw it out.

Output as a proper array this time.

These are generally things to fix aspirationally someday or things that
we'll need to clean up as part of the move to seastar.
liewegas (Member Author) commented Apr 4, 2018

Clarified all of the FIXME comments

gregsfortytwo (Member) commented Apr 4, 2018 via email

… is created"

This reverts commit 38319f8.

I want to reproduce the race to see if there is a better way to avoid it.

This breaks tests that set pg_num, and generally makes interacting with
the cluster awkward for anything non-human.

# Conflicts:
#	src/mon/CreatingPGs.h

Signed-off-by: Sage Weil <sage@redhat.com>
For the legacy create path, we pregenerate the history and instantiate
the PG at the current epoch.

Signed-off-by: Sage Weil <sage@redhat.com>
liewegas (Member Author) commented Apr 6, 2018

gregsfortytwo merged commit 6d2e4c9 into ceph:master on Apr 6, 2018