osd: fast dispatch of peering events and pg_map + osd sharded wq refactor #19973
Conversation
@gregsfortytwo killed off the disk_tp and recovery_gen_wq at the end, too!

Force-pushed from a0c1e49 to e243ba9.
void PG::update_store_on_load()
{
  if (osd->store->get_type() == "filestore") {
    // legacy filestore didn't store collection bit width; fix.
This has probably already been done, but I'm not sure which feature check prevents us from booting a pre-luminous store?
_preboot() has
} else if (osdmap->require_osd_release < CEPH_RELEASE_LUMINOUS) {
derr << "osdmap require_osd_release < luminous; please upgrade to luminous"
<< dendl;
@@ -557,7 +557,7 @@ class PG : public DoutPrefixProvider {
     return get_pgbackend()->get_is_recoverable_predicate();
   }
 protected:
-  OSDMapRef last_persisted_osdmap_ref;
+  epoch_t last_persisted_osdmap;
I don't understand how swapping these out helps us. If it's the latest persisted map we are certainly going to need the intervening maps when we start updating?
osd/PG: keep epoch, not map ref, of last osdmap for last persisted epoch
No need to pin the map in memory!
I know the commit says that, but don't we need to refer to all the following maps?
...no, no we don't, because we've already done the processing and just haven't persisted that we've done so. Okay.
@@ -6421,6 +6397,50 @@ void OSD::ms_fast_dispatch(Message *m)
     m->put();
     return;
   }

+  // peering event?
+  switch (m->get_type()) {
Why are none of the PG ops becoming proper OpRequests for tracking purposes?
src/osd/OSD.cc (outdated)
MOSDPeeringOp *pm = static_cast<MOSDPeeringOp*>(m);
return enqueue_peering_evt(
  pm->get_spg(),
  PGPeeringEventRef(pm->get_event()));
This appears to have lost the security check that our peer is actually an OSD, along with a few other subtler things I'm not sure we need (OSD::require_self_aliveness should be replaced by the PG::old_peering_msg check, but I didn't trace through the is_active() check that does).
// trim log when the pg is recovered
pg->calc_min_last_complete_ondisk();
return discard_event();
Now that we've replaced PG::handle_pg_trim() we need to actually remove that code.
src/osd/OSD.cc (outdated)
handle_command(static_cast<MMonCommand*>(m));
return;
case MSG_COMMAND:
  handle_command(static_cast<MCommand*>(m));
This can immediately invoke con->send_message(). Embarrassingly, I've forgotten if we're allowed to do that inside of fast dispatch, but I don't think so?
it's safe, we do it all over the place
We definitely don't do it all over the place, and I'm quite sure we couldn't safely send messages out of fast_dispatch in the initial implementation, which was part of why we had all the elaborate tooling around scheduling map sharing to happen later on (struct send_map_on_destruct put a C_SendMap in the work queue in Jewel, for instance). It might have become safe since then (especially with the Connection-based functions or AsyncMessenger), but those races were subtle and unpleasant, so we'll need to either avoid doing so or work through the locking again to prove it.
The only place we've ever done so that I can find is an error condition we wouldn't have tested: we return ENXIO on misdirected ops in-line. That in fact was something I kept around, but I think I just missed it.
(Really kicking myself for not documenting what was allowed here, since I thought we'd done such a good job dispatching it. But I think I just assumed sending messages fell under "long-held locks", and there were a bunch of cyclical deadlocks that involved a PG lock held while sending off messages while we were trying to dispatch a message to them...)
Hmm, yeah I remember it being weird, but I also can't remember why. The first case I remember reviewing or doing this is the fast dispatch for osd heartbeats which went in a while back.. see handle_osd_ping(), which will also send replies.
Stepping back, I don't see any reason why it shouldn't be possible. If there are remaining reasons (like loopback causing deadlock/recursion) then IMO we should fix that instead!
The global SimpleMessenger::lock is the obvious reason it would have caused trouble. The Connection-based interfaces are a lot safer, but still not guaranteed since they have to grab a bunch of extra locks if the pipe failed and needs to reconnect.
Handling reconnect means that guaranteeing a message dispatch is non-blocking is basically impossible unless we switch that model around completely; even if that doesn't cause a deadlock it's still distinctly undesirable — and it makes hard-to-identify deadlocks a lot more likely.
I'm not really sure I see a path that lets us generically guarantee that sending messages from fast_dispatch is safe. We could probably do it (by which I mean, it probably already works) if we restrict it to lossy connections. But I hate making use of interfaces which are that subtle and easy to break.
Fast dispatch works reasonably well because we made a simple interface: you're not allowed to do anything that might block. No matter how good we are, we can't send messages without doing something that might block. Therefore, we can't send messages in fast dispatch. It's the locking boundary that makes an interface like this between the OSD and Messenger possible.
I will note that in a seastar world this is much easier because if we have to do something blocking like establishing a new connection, it's easy to queue it to happen later. We just don't have that infrastructure right now.
And oh yeah, pipe replacement requires us to stop the previous Pipe's fast dispatch. There we go, deadlock.
Since we already have an easy model for queuing messages to go out not in the fast dispatch path, that seems a better idea right now than a lot of work that will destabilize our existing code and interfaces.
Oh bother.. I remember those deadlocks now. Yeesh.
Okay... so I think the primary value here is that it is funneling everything through the queue, and the message handlers are just generating queue events. I'll move these back to slow dispatch but leave the rest the same. I don't think that will make things any easier or harder in the eventual seastar transition. The hard part there is going to be handle_osd_map and the various bits of OSDService that have their own special locks (and aren't sharded).
That works too, and yes, I don't expect it to have any real impact on our future work!
...and yeah MSG_COMMAND looks like the only one, so no biggie!
src/osd/OSD.h (outdated)
@@ -1529,7 +1528,6 @@ class OSD : public Dispatcher,
     assert(osd_lock.is_locked());
     finished.splice(finished.end(), ls);
   }
-  void do_waiters();
If this machinery is dead, we need to remove the finished and take_waiters() members as well. But we're still pushing stuff on in OSD::activate_map(), from when OSD::handle_pg_create() pushes it on via require_same_or_newer_map(). So I don't think we can kill it quite yet.
yeah, the legacy pg create handler is still using it!
So that means we still need to do_waiters() then.
src/osd/OSD.cc (outdated)
ShardData* sdata = shard_list[shard_index];
Mutex::Locker l(sdata->sdata_op_ordering_lock);
ShardData::pg_slot& slot = sdata->pg_slots[pgid];
slot.waiting_for_pg = true;
It's not wrong, but the ShardedOpWQ sure is grossly aware of what's going on with PGs at this point. :(
yes. the shardedwq becomes the complicated box that everything funnels through. my hope is that this will continue to be consolidated into an overall OSDShard-type concept (not just the queue), and ultimately be simpler to understand and replace in a seastar world than the hodge-podge of threads and locks we had before.
  return;
}
header.version = HEAD_VERSION;
header.compat_version = COMPAT_VERSION;
You're setting the HEAD/COMPAT_VERSION to their default values. I think you meant to set them to an old value if we don't have SERVER_MIMIC?
yep fixed
decode(options, p);
for (auto pg : pgs) {
  // note: this only works with replicated pools. if a pre-mimic mon
  // tries to force a mimic+ osd on an ec pool it will not work.
Sorry, I don't really understand how the forced recovery happens. Did it not previously work in EC pools at all? Or is this a regression?
previously it encoded pg_t, which the OSD had to go and look up in the pg_map. Fast dispatch needs a full spg_t to queue the event. For replicated pools spg_t::shard is NO_SHARD so it's easy, but for EC pools we don't know which shard to send it to. Since forced recovery is a rare operator event I'm opting to simply break it for the mid-upgrade old mon + new osd case.
Mmmm, that's probably okay but will need to include documentation. People occasionally get in to a lot of trouble during upgrades (or upgrade because of trouble!) and we need to make it clear that they can't do this.
...actually, we need to figure out a way to block this sensibly. Probably throw a clog error?
src/osd/OSD.cc (outdated)
PG *pg,
GenContext<ThreadPool::TPHandle&> *c)
{
  epoch_t e = get_osdmap()->get_epoch();
Maybe use OSDService::get_osdmap_epoch() instead? The main benefit would be no shared_ptr dance.
Profile (perf annotate) of OSDService::queue_recovery_context showing the shared_ptr refcount overhead:

  _ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EED4Ev():
        // Does not throw if __r._M_get_use_count() == 0, caller must
        explicit __shared_count(const __weak_count<_Lp>& __r, std::no

        ~__shared_count() noexcept
        {
          if (_M_pi != nullptr)
            mov    -0x68(%rbp),%rdi
   1,59     mov    -0x70(%rbp),%rax
            test   %rdi,%rdi
            mov    0x10(%rax),%r13d
  26,29  ┌──je     4ba45e <OSDService::queue_recovery_context(PG*, GenContext
         │    _M_pi->_M_release();
         │ →  callq  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2
         │ _ZN10OSDService22queue_recovery_contextEP2PGP10GenContextIRN10Threa
         │    epoch_t e = get_osdmap()->get_epoch();
         │    enqueue_back(
         │      OpQueueItem(
   6,37  └─→ callq  ceph_clock_now()
fixed!
src/osd/OSD.cc (outdated)
enqueue_back(
  OpQueueItem(
    unique_ptr<OpQueueItem::OpQueueable>(
      new PGRecoveryContext(pg->get_pgid(), c, e)),
std::make_unique<PGRecoveryContext>?
Force-pushed from 7db1a13 to 7cc290b, from 1fbea7c to 080e43e, from d639f1e to 32ce304, from 2b57d4f to 43336f7, and from d23dee6 to af24310.
@gregsfortytwo this is getting pretty close to mergeable; want to review the new bits? last time you looked only the first half of the commits were here. the second half kills the global pg_map, moves most of the ShardedOpWQ bits into a top-level OSDShard class, and reimplements the split handling (much simpler, yay!).
Force-pushed from facf139 to 79e753d.
Signed-off-by: Sage Weil <sage@redhat.com>
So that comment makes sense. Signed-off-by: Sage Weil <sage@redhat.com>
This is only used for waiting. Signed-off-by: Sage Weil <sage@redhat.com>
Comment rules and take lock during init/shutdown (nice but not necessary). Signed-off-by: Sage Weil <sage@redhat.com>
This is cheaper and more explicit. Signed-off-by: Sage Weil <sage@redhat.com>
If the entry already exists, we don't want to construct the PGSlot and then throw it out. Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from 116ea62 to d14221d.
Output as a proper array this time. Signed-off-by: Sage Weil <sage@redhat.com>
These are generally things to fix aspirationally someday or things that we'll need to clean up as part of the move to seastar. Signed-off-by: Sage Weil <sage@redhat.com>
Clarified all of the FIXME comments
Ugh, sorry, my brain's getting mushy around these.
…On Wed, Apr 4, 2018 at 8:18 AM Sage Weil wrote, in src/osd/OSD.cc:
> +  }
> +  if (!slot.waiting.empty()) {
> +    if (osdmap->is_up_acting_osd_shard(pgid, osd->get_nodeid())) {
> +      dout(20) << __func__ << " " << pgid << " maps to us, keeping"
> +               << dendl;
> +      ++p;
> +      continue;
> +    }
> +    while (!slot.waiting.empty() &&
> +           slot.waiting.front().get_map_epoch() <= osdmap->get_epoch()) {
> +      auto& qi = slot.waiting.front();
> +      dout(20) << __func__ << " " << pgid
> +               << " waiting item " << qi
> +               << " epoch " << qi.get_map_epoch()
> +               << " <= " << osdmap->get_epoch()
> +               << ", stale, dropping" << dendl;

That second loop is discarding waiting items for this same pgid slot (slot.waiting is a deque iirc)
Revert "… is created"

This reverts commit 38319f8. I want to reproduce the race to see if there is a better way to avoid it. This breaks tests that set pg_num, and generally makes interacting with the cluster awkward for anything non-human.

Conflicts: src/mon/CreatingPGs.h

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
For the legacy create path, we pregenerate the history and instantiate the PG at the current epoch. Signed-off-by: Sage Weil <sage@redhat.com>
create messages
TODO
Review guide: This is a long patch series that represents a transition between several different approaches, eventually settling on a more significant rewrite of the sharded wq than I had originally planned. Sorry! That means that the intermediate solutions aren't worth reviewing carefully for correctness, but they are still useful to see separately to understand the evolution of the code.
This series goes roughly like this:
At the end, I would review the _process() code itself along with the description of ordering in OSD.h. This is the heart of it!