
osd: faster dispatch #13343

Merged
merged 2 commits into ceph:master from liewegas:wip-faster-dispatch on Mar 1, 2017

Conversation

@liewegas (Member) commented Feb 9, 2017

No description provided.

@liewegas (Member) commented Feb 9, 2017

@jdurgin @gregsfortytwo i think this is ready for review! still doing some testing but it's pretty much complete, modulo conflicts with the backoff stuff (depends which we merge first.. probably backoff!).

@yuyuyu101 (Member) commented Feb 10, 2017

It's a little hard to understand: do we make the "resend op" behavior special so that we can remove the map reserve in the last commit? A brief description would be helpful!

@gregsfortytwo self-requested a review on Feb 11, 2017

@gregsfortytwo

I'm still working my way through the _process() function and want to ponder the higher-level stuff a bit more, but here are some comments to be getting on with.

And in case it's not obvious, please make sure any changes get pushed as follow-on commits instead of a replacement one or I'm liable to run my head through my monitor trying to check them. ;)

pg->unlock();
wake_pg_waiters(pgid);

@gregsfortytwo (Member) commented Feb 16, 2017

Moving these locks around makes a comment on PG *_create_lock_pg() incorrect; need to update it.

std::unique_ptr<OpQueue< pair<spg_t, PGQueueable>, entity_inst_t>> pqueue;
void _enqueue_front(pair<spg_t, PGQueueable> item, unsigned cutoff) {

@gregsfortytwo (Member) commented Feb 16, 2017

Why are we passing the cutoff as a param? I don't see it being set to anything but osd->op_prio_cutoff and making it a param just seems weird.

@liewegas (Member) commented Feb 16, 2017

we just don't have a local pointer back to osd in ShardData. an arg here was cleaner than adding an extra field
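For context, a minimal sketch of the trade-off being discussed, assuming the OpQueue/PGQueueable accessors behave as named here; the body is illustrative rather than the PR's exact code:

// Option A (what the PR does): the caller passes osd->op_prio_cutoff in.
void _enqueue_front(std::pair<spg_t, PGQueueable> item, unsigned cutoff) {
  unsigned pri = item.second.get_priority();
  if (pri >= cutoff)
    pqueue->enqueue_strict_front(item.second.get_owner(), pri, item);
  else
    pqueue->enqueue_front(item.second.get_owner(), pri,
                          item.second.get_cost(), item);
}
// Option B (not taken): store an OSD* back-pointer in ShardData and read
// osd->op_prio_cutoff directly, at the cost of an extra field per shard.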

@@ -2744,7 +2744,8 @@ int OSD::shutdown()
service.start_shutdown();
clear_waiting_sessions();
// stop sending work to pgs
op_shardedwq.drain();

@gregsfortytwo (Member) commented Feb 16, 2017

So we drain here, signal shutdown, and then drain again. Why not just signal shutdown and then drain? Was this an attempt to replicate the way clear_waiting_sessions() used to stop new IO from getting into the queue?

@gregsfortytwo (Member) commented Feb 16, 2017

Yeah, it looks to me like the call to op_shardedwq.clear_pg_waiters() below should just come up here? There's no ordering to preserve between ops which are in that list and those already in the work queue.

* 2. When an obc rwlock is released, we check for a scrub block and requeue
* the op there if it applies. We ignore the unreadable/degraded/blocked
* queues because we assume they cannot apply at that time (this is
* probably mostly true).

@gregsfortytwo (Member) commented Feb 16, 2017

"probably most true" is a pretty distressing thing to read when discussing requeueing order. Can you expand on that, or are you just hoping?
In particular, I have a feeling this will run us into trouble with the upcoming object repair stuff if it isn't already an issue.

* These three behaviors are generally sufficient to maintain ordering, with
* the possible exception of cases where we make an object degraded or
* unreadable that was previously okay, e.g. when scrub or op processing
* encounter an unexpected error. FIXME.

@gregsfortytwo (Member) commented Feb 16, 2017

I actually don't see how that causes trouble, unless you're saying we can have ops in the waitlists past unreadable_object and don't waitlist them at the same time.
In which case, why not? Seems like it's not so complicated to string those together.
(I'm also not sure yet if these are new behaviors or just you codifying ones which already exist.)

@liewegas (Member) commented Feb 16, 2017

it's not that complicated, it's just not implemented. right now this is handled in release_object_locks; we need to add additional checks there for unreadable and unrecoverable. and then figure out if there are other scenarios (i.e. was readable but degraded, but later becomes unreadable, which means the degraded requeue needs to check for unreadable). in general, at any requeue point we need to check whether we qualify for any prior wait list, which is sort of begging for a more general solution than hand-coding checks for each requeue case.

we should obviously fix this, especially since we're going to be adding the improved EIO handling which will trigger this all the time, but it's not a new bug, so i'm just documenting it for now.
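A hedged sketch of the more general requeue check being described; the helper name is invented, and the pg type and predicates only approximate Ceph's existing wait-list helpers, so treat them as assumptions:

// hypothetical: at any requeue point, check the earlier wait lists first
// and park the op there instead of requeueing it normally.
bool requeue_on_earlier_waitlist(PG *pg, OpRequestRef op, const hobject_t& oid)
{
  if (pg->is_unreadable_object(oid)) {
    pg->wait_for_unreadable_object(oid, op);
    return true;
  }
  if (pg->is_degraded_or_backfilling_object(oid)) {
    pg->wait_for_degraded_object(oid, op);
    return true;
  }
  return false;  // no earlier list applies; requeue as usual
}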

OSDMapRef waiting_for_pg_osdmap;
struct pg_slot {
PGRef pg;
list<PGQueueable> ls;

@gregsfortytwo (Member) commented Feb 16, 2017

"ls" is a terrible name for something we reference in other scopes. I think you tried to keep names around it consistent at pg_slot, but it's still totally ungrepable since it's used in the depths of iterator dereferencing. "slot_ops_list" or something is probably fine?

assert(ms_can_fast_dispatch(op->get_req()));
MOSDFastDispatchOp *m = static_cast<MOSDFastDispatchOp*>(op->get_req());
if (m->get_map_epoch() < osdmap->get_epoch()) {
maybe_share_map(session, op, osdmap);

@gregsfortytwo (Member) commented Feb 16, 2017

There's no reason to invoke this on every op in sequence; the maybe_share_map() call is kind of expensive with the spinlock twiddling.

service.maybe_inject_dispatch_delay();
if (m->get_type() != CEPH_MSG_OSD_OP ||

@gregsfortytwo (Member) commented Feb 16, 2017

Hmm, I'm worried about letting us split the streams with this logic here.

  1. I suspect CEPH_MSG_OSD_BACKOFF should not go through the slow path?
  2. Well, I guess OSDs use a separate messenger for their OSD_OP (so, cache op) messages and their OSD protocol messages. But that was another concern that popped into my head.

@liewegas (Member) commented Feb 16, 2017

MOSDOp is the only legacy message that didn't have an explicit spg_t, and also the only message that comes from clients. so all osds (peer or non-peer) will take the fast path, except the Objecter, which is ordered independently.

assert(NULL != sdata);
// peek at spg_t
sdata->sdata_op_ordering_lock.Lock();

@gregsfortytwo (Member) commented Feb 16, 2017

Is there a reason the lock handling isn't done with a Locker? The return cases are legion and all preceded by an Unlock() call.
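Roughly what a Locker-based version would look like; Mutex::Locker is Ceph's scoped lock guard, while the function and its body here are invented for illustration:

void peek_example(ShardData *sdata)
{
  Mutex::Locker l(sdata->sdata_op_ordering_lock);  // released on every return
  if (sdata->pqueue->empty())
    return;                 // no explicit Unlock() needed on this path
  // ... peek at the spg_t, consult pg_slots, dequeue, etc. ...
}                           // ... or here, at normal scope exit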

pg->lock();
}
osd->service.maybe_inject_dispatch_delay();

@gregsfortytwo (Member) commented Feb 16, 2017

Are we deliberately giving a chance to inject delays both before and after locking the PG?

@liewegas (Member) commented Feb 16, 2017

Thanks for the thorough review! see new commits

@gregsfortytwo

A few more code comments and some more general ones to follow as individual things.

}
enqueue_op(pgid, op, m->get_map_epoch());

@gregsfortytwo (Member) commented Feb 17, 2017

I think we still need to send a map before queuing -- if they only send outdated messages on a removed PG, they just get tossed and nothing updates the client, right?
But we can track that the client messages are outdated and just invoke maybe_share_map() once rather than per-message.

@liewegas (Member) commented Feb 17, 2017

good point. i think we can still do that in _process instead of here, though, to keep the dispatch path fast.
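One possible shape for that, sketched under the assumption of a per-Session last_sent_epoch field (or something behaving like it); placement and naming are illustrative:

// share the map at most once per new epoch for this session, rather than
// once per stale message
if (m->get_map_epoch() < osdmap->get_epoch() &&
    session->last_sent_epoch < osdmap->get_epoch()) {
  maybe_share_map(session, op, osdmap);
  session->last_sent_epoch = osdmap->get_epoch();
}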

dout(20) << __func__ << " " << item.first << " no pg, item epoch is "
<< qi->get_map_epoch() << " > " << osdmap->get_epoch()
<< ", will wait on " << *qi << dendl;
slot.to_process.push_front(*qi);

@gregsfortytwo (Member) commented Feb 17, 2017

Shouldn't any ops for missing PGs here get pushed onto the back of the list to preserve ordering when they're dequeued from the front?

@liewegas (Member) commented Feb 17, 2017

Nope.. things come from pqueue -> to_process (back). we just pulled this off the front, realized it was too new for us, so we need to shove it back on the front. to_process doubles as a waiting_for_pg. once the pg appears, wake_pg_waiters will shove it back up into pqueue (from back to front) to preserve ordering.
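A sketch of that requeue direction, assuming the _enqueue_front() signature quoted earlier; the function name and body are illustrative, not the PR's wake_pg_waiters():

// walk to_process from back to front, pushing each item onto the *front*
// of pqueue, so the original front item ends up at the very front again.
void requeue_slot(ShardData *sdata, spg_t pgid, unsigned cutoff)
{
  auto& slot = sdata->pg_slots[pgid];
  while (!slot.to_process.empty()) {
    sdata->_enqueue_front(std::make_pair(pgid, slot.to_process.back()), cutoff);
    slot.to_process.pop_back();
  }
}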

if (p != sdata->pg_slots.end() && !p->second.to_process.empty()) {
// we may be racing with _process, which has dequeued an item from
// pqueue, put it on to_process, and is now busy taking the pg
// lock. ensure this op is ordered before it/them.

@gregsfortytwo (Member) commented Feb 17, 2017

Your referents are ambiguous here. You're trying to guarantee that an already-dequeued op orders ahead of us even though we're pushing something on the front, right?

This logic only works if to_process has either 0 or 1 items, which is not the case if it's for a missing PG.

@liewegas (Member) commented Feb 17, 2017

is this clearer?

// we may be racing with _process, which has dequeued an item from
// pqueue, put it on to_process, and is now busy taking the pg
// lock.  ensure this op is ordered before any such op in to_process.

The ordering chain is:

fast dispatch -> pqueue back ... pqueue front <-> to_process back ... to_process front <-> PG::do_request() or PG::requeue_op()

I'll put this in the header.
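A possible form of that header comment, with the wording expanded from the chain above rather than taken from the PR:

// Ordering for a given pg_slot:
//
//   fast dispatch -> pqueue (back)
//   pqueue (front) <-> to_process (back)
//   to_process (front) <-> PG::do_request() / PG::requeue_op()
//
// i.e., anything pulled off pqueue but not yet handed to the PG sits in
// to_process, and a racing enqueue must order itself behind it.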

@gregsfortytwo (Member) commented Feb 17, 2017

"this op" is the problem I'm having.

}
// take next item
qi = slot.to_process.front();

@gregsfortytwo (Member) commented Feb 17, 2017

We don't seem to have any guarantee here that it's the same item we pushed on the list earlier. Shouldn't we be checking that?

@liewegas (Member) commented Feb 17, 2017

It doesn't have to be.. we may have multiple threads working on this shard. That's one of the main reasons why to_process exists in the first place.

@gregsfortytwo (Member) commented Feb 17, 2017

Okay, so that's a way for client messages to the same PG/object to get mis-ordered then, isn't it? :/

@gregsfortytwo (Member) commented Feb 17, 2017

Oh, we are protected from that by the sdata_op_ordering_lock, aren't we.

@gregsfortytwo (Member) commented Feb 17, 2017

We've got a problem here with the way we're handling missing PGs. Best I can tell, the way we queue them up as a list inside of the ShardedWorkQueue, and then pass it back through, means that every time we wake them up we force the Queue to pay to pass them through, even though no real work is done. That's going to be a problem keeping our op weighting correct, isn't it?

@gregsfortytwo (Member) commented Feb 17, 2017

Assuming I'm correct about the bugs in _process(), I submit that it's too complicated to maintain in its current form.
If I'm wrong, it's definitely too complicated to have no documentation about its invariants and locking design.

@@ -9180,7 +9180,11 @@ void OSD::ShardedOpWQ::_process(uint32_t thread_index, heartbeat_handle_d *hb)
osd->service.maybe_inject_dispatch_delay();
boost::optional<PGQueueable> qi;
// we don't use a Mutex::Locker here because of the
// osd->service.release_reserved_pushes() call below

@gregsfortytwo (Member) commented Feb 17, 2017

Hmm, I guess this works — if somebody forgets to do an unlock it ought to become apparent pretty quickly. But if I ever have to touch this code I'll be setting up an RAII structure to call release_reserved_pushes when we need it. ;)
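For the record, the RAII structure being alluded to might look like the following; the guard type is hypothetical, while OSDService::release_reserved_pushes() is the call it wraps:

struct ReservedPushesGuard {
  OSDService *service;
  uint64_t pushes = 0;   // set once we know how many pushes were reserved
  explicit ReservedPushesGuard(OSDService *s) : service(s) {}
  ~ReservedPushesGuard() {
    if (pushes)
      service->release_reserved_pushes(pushes);  // runs on every exit path
  }
};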

@liewegas (Member) commented Feb 17, 2017

I don't follow the waiting_for_pgs question. There is only a single wake-up.. when the PG is created and registered in pg_map. This does bypass the prioritization stuff in pqueue (anything we push back is put at the highest priority to preserve the previously-established ordering). That's not new, though, and I'm not sure how it would be fixed. In any case, it only affects just-created PGs, so I'm not too concerned.

@gregsfortytwo (Member) commented Feb 17, 2017

I don't think we want to pay the cost of the operation if we're not actually going to do it now, since it will throw the correct proportions of client/recovery/etc IO out of whack.
Perhaps we do since once the PG appears we're going to process it immediately, though?

This is a new problem since we used to block ops for missing PGs well before they reached the priority queue.

@@ -1657,6 +1657,7 @@ class OSD : public Dispatcher,
PGRef pg;
list<PGQueueable> to_process;
bool waiting_for_pg = false;
int num_running = 0;

@gregsfortytwo (Member) commented Feb 17, 2017

Document its purpose, please :)

@gregsfortytwo (Member) commented Feb 17, 2017

N/m, later patch does that.
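For reference, an annotated reading of the slot fields shown above; the comments are one interpretation of this thread (e.g. num_running per the later exchange), not the PR's actual doc comments:

struct pg_slot {
  PGRef pg;                      // cached ref to the PG for this spg_t, if any
  list<PGQueueable> to_process;  // items pulled off pqueue, awaiting the PG lock
  bool waiting_for_pg = false;   // ops are parked here until the PG exists
  int num_running = 0;           // threads between "dequeued" and "PG locked",
                                 // so the slot isn't cleaned up under them
};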

<< " no pg_slot or nothing pending" << dendl;
assert(q != sdata->pg_slots.end());
auto& slot = q->second;
--slot.num_running;

@gregsfortytwo (Member) commented Feb 17, 2017

This and the increment are placed bizarrely. Why not just do it right next to the lock twiddling, preferably within a scope that makes it really obvious?

@@ -9081,6 +9081,7 @@ void OSD::ShardedOpWQ::prune_pg_waiters(OSDMapRef osdmap, int whoami)
}
}
if (slot.to_process.empty() &&
slot.num_running == 0 &&

@gregsfortytwo (Member) commented Feb 17, 2017

Don't we need to somehow invoke prune_pg_waiters() at a later time? I don't see a mechanism that actually cleans up.

// we may be racing with _process, which has dequeued a new item
// from pqueue, put it on to_process, and is now busy taking the
// pg lock. ensure this old enqueued item is ordered before any
// such newer item in to_process.

@gregsfortytwo (Member) commented Feb 17, 2017

Yeah, I still think this is broken with respect to a to_process list of >1 entry. Need to take a reference prior to pushing front, or traverse it, or something?

@gregsfortytwo (Member) commented Feb 17, 2017

It's 3 lines down from retaking the lock, right after our sanity check and convenience ref. Where else would it go?

Incrementing the value is the more confusing one, since it's in a separate block from dropping the lock. But since the sole purpose of num_running is to indicate there's a thread in the in-between state, I'd expect it to be literally adjacent — mostly, so that nobody accidentally pushes it farther away from the lock in future.

I dunno, maybe it doesn't matter much as long as it goes away while the lock is held, but something about it seemed disjoint to me.

@gregsfortytwo (Member) commented Feb 17, 2017

I don't think I can associate your other comments with the code — they seem to be losing the comment threads when you reply by email or something. :(

@liewegas (Member) commented Feb 20, 2017

Okay, I think this is close. Last run (http://pulpito.ceph.com/sage-2017-02-20_02:50:49-rados-wip-faster-dispatch---basic-smithi/) only turned up failures from the monc hunting bug, fixed in master but not in the tested branch. I'm rebasing for another run through teuthology; is there anything else I should address before squashing this down?

@gregsfortytwo (Member) commented Feb 20, 2017

As discussed verbally
Reviewed-by: Greg Farnum gfarnum@redhat.com

(Once squashed, obviously)

msg/Dispatcher: pass const Message* to ms_can_fast_dispatch
Signed-off-by: Sage Weil <sage@redhat.com>
osd: restructure op_shardedwq
This is difficult to break into pieces, so one big fat commit it is.

A few trivial bits

- include epoch in PGQueueable.
- PGQueueable operator<<
- remove op_wq ref from OSDService; use simple set of queue methods instead

The big stuff:

- Fast dispatch now passes messages directly to the queue based on an
spg_t.  The exception is MOSDOp's from legacy clients.  We add a
waiting_for_map mechanism on the front-side that is similar to but simpler
than the previous one so that we can map those legacy requests to an
accurate spg_t.
- The dequeue path now has a waiting_for_pg mechanism.  It also uses a
much simpler set of data structures that should make it much faster than
the previous incarnation.
- Shutdown works a bit differently; we drain the queue instead of trying
to remove work for individual PGs.  This lets us remove the dequeue_pg
machinery.

Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas changed the title from "DNM: osd: faster dispatch" to "osd: faster dispatch" on Feb 25, 2017

@liewegas (Member) commented Feb 26, 2017

rebased again, and

http://pulpito.ceph.com/sage-2017-02-25_20:12:55-rados-wip-faster-dispatch---basic-smithi/

lots of noise but it's all in master too. will hold off until @markhpc shows there isn't any regression.

@markhpc (Member) commented Mar 1, 2017

Cursory test results of wip-faster-dispatch are... strange. There appears to be some movement in single-OSD tests, but everything normalizes out in 16 OSD cluster tests (at least in the 4MB and some of the 128KB tests we may be network limited). I would have expected to see gains with small 4K random reads but those didn't really move either way.
wip-faster-dispatch_rbd_tests.pdf

@liewegas merged commit e6e8365 into ceph:master on Mar 1, 2017

3 checks passed:

- Signed-off-by: all commits in this PR are signed
- Unmodified Submodules: submodules for project are unmodified
- default: Build finished.

@liewegas deleted the liewegas:wip-faster-dispatch branch on Mar 1, 2017

@jjhuo (Contributor) commented Mar 1, 2017

@markhpc Thanks for sharing your results, did you see any difference in terms of IO latency?
