
osd,osdc: pg and osd-based backoff #12342

Merged
merged 14 commits into ceph:master from liewegas:wip-backoff Feb 11, 2017

@liewegas
Member

liewegas commented Dec 6, 2016

If a PG is blocked (down, incomplete) or an object is blocked (unfound), the OSD
can send a MOSDBackoff message to the client to ask it to stop sending requests
for that PG or object. Later, when the OSD recovers (PG peers or object is found)
it can send an unblock message, at which point the client will resend all blocked
messages.

The primary benefit is that the OSD doesn't need to keep blocked messages in
memory (the client will resend them later). This prevents a blocked PG or object
from accumulating queued requests until they exhaust the OSD's memory budget for
client requests.
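
For illustration, here is a minimal, self-contained sketch of the client-side behavior described above. The names and types are invented for the example (the real logic lives in osdc/Objecter and uses MOSDBackoff): a block makes the client park ops that fall inside the blocked range instead of sending them, and the matching unblock triggers a resend of everything it held back.

```cpp
// Toy model of the backoff flow -- illustrative only, not the actual Ceph types.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

enum class BackoffEvent { BLOCK, UNBLOCK };

struct ClientOp { uint64_t tid; uint64_t object_hash; };

struct ToyClient {
  std::map<uint64_t, uint64_t> backoffs;  // blocked ranges: begin -> end
  std::vector<ClientOp> held;             // ops waiting for an UNBLOCK

  bool blocked(uint64_t h) const {
    for (const auto& [b, e] : backoffs)
      if (h >= b && h < e) return true;
    return false;
  }

  void submit(const ClientOp& op) {
    if (blocked(op.object_hash))
      held.push_back(op);                 // OSD asked us not to send these
    else
      std::cout << "send tid " << op.tid << "\n";
  }

  void handle_backoff(BackoffEvent ev, uint64_t begin, uint64_t end) {
    if (ev == BackoffEvent::BLOCK) {
      backoffs[begin] = end;              // start holding ops in [begin, end)
      return;
    }
    backoffs.erase(begin);                // range is unblocked again
    std::vector<ClientOp> still_held;
    for (const auto& op : held) {
      if (op.object_hash >= begin && op.object_hash < end)
        std::cout << "resend tid " << op.tid << "\n";  // client-side resend
      else
        still_held.push_back(op);
    }
    held.swap(still_held);
  }
};

int main() {
  ToyClient c;
  c.handle_backoff(BackoffEvent::BLOCK, 100, 200);   // PG/object range blocked
  c.submit({1, 150});                                 // held, not sent
  c.submit({2, 300});                                 // outside the range, sent
  c.handle_backoff(BackoffEvent::UNBLOCK, 100, 200);  // tid 1 is resent
}
```

The key property is the one the PR description calls out: nothing blocked has to be buffered on the OSD side, because the client owns the retransmit.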

@liewegas liewegas force-pushed the liewegas:wip-backoff branch from a092378 to b973b90 Dec 6, 2016

@gregsfortytwo

Member

gregsfortytwo commented Dec 7, 2016

Didn't look at most of this, but I think we need docs describing the invariants and rules before this goes any farther — I'm sure something can be divined based on the code but we really need specific, explainable rules for how this interacts with stuff like flushing (at the user API level) and throttles (for the internal developer) or it'll be completely unsupportable.

@liewegas

Member Author

liewegas commented Dec 9, 2016

@byo please test/review

@tchaikov tchaikov self-assigned this Dec 9, 2016

@byo
Contributor

byo left a comment

@liewegas: Started a review and test of this code; most of it is well beyond my current understanding of the Ceph codebase, so I only did a rather brief reading there.

I had to do one patch to unbreak make_debs.sh:

diff --git a/src/test/mon/test_mon_workloadgen.cc b/src/test/mon/test_mon_workloadgen.cc
index c1b26ef..1930689 100644
--- a/src/test/mon/test_mon_workloadgen.cc
+++ b/src/test/mon/test_mon_workloadgen.cc
@@ -32,6 +32,7 @@
 
 #include "osd/osd_types.h"
 #include "osd/OSD.h"
+#include "osd/Session.h"
 #include "osdc/Objecter.h"
 #include "mon/MonClient.h"
 #include "msg/Dispatcher.h"
@@ -900,7 +901,7 @@ class OSDStub : public TestStub
 
   bool ms_handle_reset(Connection *con) {
     dout(1) << __func__ << dendl;
-    OSD::Session *session = (OSD::Session *)con->get_priv();
+    Session *session = (Session *)con->get_priv();
     if (!session)
       return false;
     session->put();

I will deploy this in my test environment now, let's see how it works.


Ordinarily the OSD will simply queue any requests it can't immeidately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consuemd by

@byo

byo Dec 12, 2016

Contributor

Typo "consuemd" => "consumed"

Ordinarily the OSD will simply queue any requests it can't immeidately
process in memory until such time as it can. This can become
problematic because the OSD limits the total amount of RAM consuemd by
incoming messages: if the threshold is reached, new messages will not

@byo

byo Dec 12, 2016

Contributor

Shouldn't we also mention the limit on the number of simultaneously processed messages (independent of their size)?
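
(For context: the OSD throttles incoming client messages both by total size and by count. A ceph.conf sketch of the two knobs; the option names are the usual ones, but the values shown here are examples, not necessarily the defaults.)

```ini
[osd]
# cap on the total bytes of in-flight client messages held by the OSD
osd_client_message_size_cap = 524288000
# cap on the number of in-flight client messages, regardless of their size
osd_client_message_cap = 256
```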

/*
* Ceph - scalable distributed file system
*
* Copyright (C) 2004-2006 Sage Weil <sage@newdream.net>

@byo

byo Dec 12, 2016

Contributor

I guess this should be updated ;)

Either way, the request ultimately targets a PG. The client sends the
request to the primary for the assocated PG.

Each request is assigned a unique tid.

@byo

byo Dec 12, 2016

Contributor

Does this have to increase with each request? It looks like the code does require this: https://github.com/ceph/ceph/pull/12342/files#diff-dfb9ddca0a3ee32b266623e8fa489626R2408
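
(For reference, a sketch of the usual scheme -- illustrative, not the actual Objecter code: tids come from a single per-client counter that is bumped for every op, so they are strictly increasing, which is what the linked check can rely on.)

```cpp
#include <atomic>
#include <cstdint>

// Illustrative per-client tid allocation: one monotonically increasing
// counter, bumped for every op, so a later op always gets a larger tid.
class TidAllocator {
  std::atomic<uint64_t> last_tid{0};
public:
  uint64_t next() { return ++last_tid; }
};
```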

@byo

Contributor

byo commented Dec 13, 2016

@liewegas: I'm getting a client crash when running rados bench after applying this patch. Currently trying to find a test scenario to easily reproduce it.

@byo

Contributor

byo commented Dec 14, 2016

The crash I'm getting is pretty awkward. What I'm doing is: take 3 OSDs down, all from different racks (different failure domains) so that there are blocked objects, then start rados bench -p rbd -t 1200 1200 write, wait for it to initialize, and then, while it's still running, start bringing OSDs back up.

Most of the time I'm getting stack traces like:

(gdb) bt
#0 0x00007fffef5ca38a in Objecter::_check_op_pool_dne (this=this@entry=0x5555558c24b0,
op=op@entry=0x555555918240, sl=...)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1512
#1 0x00007fffef5d049e in Objecter::_scan_requests (this=this@entry=0x5555558c24b0,
s=s@entry=0x5555558e6d10, force_resend=force_resend@entry=false, cluster_full=cluster_full@entry=false,
pool_full_map=pool_full_map@entry=0x7fffe0d5de50, need_resend=..., need_resend_linger=...,
need_resend_command=..., sul=...)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1091
#2 0x00007fffef5dcb81 in Objecter::handle_osd_map (this=this@entry=0x5555558c24b0,
m=m@entry=0x7fffdc0131b0)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1270
#3 0x00007fffef5de957 in Objecter::ms_dispatch (this=0x5555558c24b0, m=0x7fffdc0131b0)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:1010
#4 0x00007fffef81a8db in ms_deliver_dispatch (m=0x7fffdc0131b0, this=0x555555835900)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/Messenger.h:593
#5 DispatchQueue::entry (this=0x555555835a50)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/DispatchQueue.cc:197
#6 0x00007fffef6b971d in DispatchQueue::DispatchThread::entry (this=<optimized out>)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/msg/DispatchQueue.h:103
#7 0x00007fffe68db182 in start_thread (arg=0x7fffe0d60700) at pthread_create.c:312
#8 0x00007fffe5be400d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I've also seen this:

(gdb) bt
#0 0x00007fffe5b20c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fffe5b24028 in __GI_abort () at abort.c:89
#2 0x00007fffef62ec85 in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=2110,
func=0x7fffef8a9e30 <Objecter::tick()::__PRETTY_FUNCTION__> "void Objecter::tick()") at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/assert.cc:78
#3 0x00007fffef5d214d in Objecter::tick (this=0x5555558c2650) at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/osdc/Objecter.cc:2110
#4 0x00007fffef5a7831 in operator() (this=0x5555558cc6f0) at /usr/include/c++/4.8/functional:2471
#5 ceph::timer_detail::timer<ceph::time_detail::mono_clock>::timer_thread (this=0x5555558c2850)
at /tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/ceph_timer.h:137
#6 0x00007fffe647ca60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fffe68db184 in start_thread (arg=0x7fffe155d700) at pthread_create.c:312
#8 0x00007fffe5be437d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Worth noting: I was running rados bench from the mon node. This mon runs in a docker container (old style, lxc-based) on Ubuntu 14.04. I was using some older packages at first, but the same happened once I fully upgraded to the newest 14.04 packages.

I also tested it on a totally separate VM, starting from a clean, up-to-date 14.04 and installing the packages I compiled and used on the cluster, and guess what: it works flawlessly, I couldn't reproduce the crash. This VM will definitely have much higher network latency when talking to the cluster; maybe that's the reason?

On the other hand, I also tried to test this version using fio, but was unable to run it; it fails with:

/tmp/release/Ubuntu/WORKDIR/ceph-11.0.2-2399-gee35756/src/common/ceph_crypto.cc: 77: FAILED assert(crypto_context != __null)
I was running fio 2.1.11, didn't yet check the newest version.

@liewegas liewegas force-pushed the liewegas:wip-backoff branch 4 times, most recently from 4f5133c to 4b8b66d Dec 15, 2016

@liewegas liewegas force-pushed the liewegas:wip-backoff branch 3 times, most recently from 0a274cc to 0f468d3 Dec 22, 2016

@liewegas liewegas force-pushed the liewegas:wip-backoff branch from c354f07 to a2486f2 Jan 9, 2017

@liewegas liewegas force-pushed the liewegas:wip-backoff branch from 464cedb to a6b8cc6 Jan 18, 2017

@liewegas liewegas force-pushed the liewegas:wip-backoff branch 2 times, most recently from f41c4fc to 5912fa3 Jan 30, 2017

@liewegas liewegas changed the title from "DNM: osd,osdc: pg and osd-based backoff" to "osd,osdc: pg and osd-based backoff" Feb 2, 2017

@liewegas liewegas force-pushed the liewegas:wip-backoff branch from 5912fa3 to 3b40fcc Feb 2, 2017

@liewegas liewegas requested a review from jdurgin Feb 2, 2017

@liewegas

Member Author

liewegas commented Feb 2, 2017

http://pulpito.ceph.com/sage-2017-02-02_14:58:21-rados:thrash-wip-backoff---basic-smithi looking good! i made a small change and need to send it through one more time, but don't expect problems.

@liewegas liewegas force-pushed the liewegas:wip-backoff branch from 3b40fcc to 03be6e4 Feb 3, 2017

@liewegas

Member Author

liewegas commented Feb 6, 2017

http://pulpito.ceph.com/sage-2017-02-05_18:49:00-rados:thrash-wip-backoff---basic-smithi/ is pretty clean (failures related to the lab cluster). The exception is one assert from a live obc in on_flushed, which is either a new regression in master or related to backoff... :/

@liewegas

Member Author

liewegas commented Feb 6, 2017

Never mind.. that failure is an unrelated bug (already fixed). This is ready for final review and (hopefully) merge!

@byo

Contributor

byo commented Feb 6, 2017

Great to see this, we'll try to put it into our testing machinery soon

@liewegas

Member Author

liewegas commented Feb 6, 2017

}
spg_t pgid;
if (!osdmap->get_primary_shard(_pgid, &pgid)) {
// missing pool or acting set empty -- drop

@byo

byo Feb 9, 2017

Contributor

Is this a normal situation? If the backoff spans multiple PGs, this could break out of the loop in the middle, before ops for all of the PGs are enqueued.

@liewegas

liewegas Feb 9, 2017

Author Member

good point, fixing

// map hobject range to PG(s)
bool queued = false;
hobject_t pos = m->begin;
while (true) {

@byo

byo Feb 9, 2017

Contributor

This could be rephrased as a do...while loop.
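
A rough sketch of what these two review comments suggest, with toy stand-ins for the OSD types (PGs here are just fixed-size buckets of the object-hash space): iterate the backoff's range as a do...while, and skip a missing PG rather than breaking out, so ops for the remaining PGs still get enqueued.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>

// Toy model, not the actual OSD types: objects are identified by a hash,
// each PG covers a fixed-size bucket of the hash space, and looking up a
// PG's primary shard can fail if the pool has gone away.
constexpr uint64_t PG_BUCKET = 1000;

std::optional<uint64_t> get_primary_shard(uint64_t pgid) {
  if (pgid == 2) return std::nullopt;     // pretend this pool/PG is missing
  return pgid;
}

// Map a backoff's object range to PGs: a do...while over the range that
// skips (rather than breaks on) a missing PG.
void queue_backoff_range(uint64_t begin, uint64_t end) {
  uint64_t pos = begin;
  do {
    uint64_t pgid = pos / PG_BUCKET;
    if (auto shard = get_primary_shard(pgid))
      std::cout << "enqueue backoff for pg " << *shard << "\n";
    // else: missing pool or empty acting set -- skip this PG, keep going
    pos = (pgid + 1) * PG_BUCKET;         // advance to the next PG boundary
  } while (pos < end);
}

int main() {
  queue_backoff_range(500, 4200);         // spans PGs 0..4; PG 2 is "missing"
}
```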

@byo

Contributor

byo commented Feb 9, 2017

@liewegas I did test this in our infrastructure; so far it looks really good, works as expected, and shows no issues under load. I briefly checked the code and left a few minor comments (a few old ones also still apply).
Hope this ends up in master soon. Is it going to be part of luminous?

@liewegas

Member Author

liewegas commented Feb 9, 2017

@byo thanks, updated! This will go into master shortly and will definitely be part of luminous.

@liewegas liewegas force-pushed the liewegas:wip-backoff branch from 03be6e4 to ec5f85c Feb 9, 2017

@jdurgin
Member

jdurgin left a comment

You got all the corner cases for resetting backoffs that I could think of. RefCountedObject priv object ownership is confusingly asymmetric, but not worth changing now.

liewegas added some commits Oct 17, 2016

osd: add MOSDBackoff message type
Signed-off-by: Sage Weil <sage@redhat.com>
osd: move Session class out of OSD
Signed-off-by: Sage Weil <sage@redhat.com>
common/hobject: use hash field for hash<>
No reason to waste CPU recalculating a hash value!

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: fix typo
Signed-off-by: Sage Weil <sage@redhat.com>
include/ceph_features: RADOS_BACKOFF feature bit
Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: attach OSDSession to Connection
This lets us avoid an rbtree lookup.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: add pg helpers for managing backoffs
Link Backoff into Session, PG.  Tear them down on session
reset.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: split backoffs on PG split
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-backoff branch from ec5f85c to 9de7ed7 Feb 10, 2017

liewegas added some commits Feb 10, 2017

osd: dispatch OSDBackoff messages into PG workqueue
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimaryLogPG: PG-wide backoffs
Issue at top of do_request.  Release on activation or peering
interval change.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: respect backoffs
Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimaryLogPG: backoffs on individual objects
Do these midway down do_op.  Reorder the scrub waitlist after the
degraded and unreadable waitlists.

Signed-off-by: Sage Weil <sage@redhat.com>
qa/suites/rados/thrash: vary backoff settings
Signed-off-by: Sage Weil <sage@redhat.com>
doc/dev/rados-client-protocol: describe some of the rados protocol, including backoffs

Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-backoff branch from 9de7ed7 to 46ade14 Feb 10, 2017

@liewegas liewegas merged commit d4db8e0 into ceph:master Feb 11, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodifed Submodules: submodules for project are unmodified
default: Build finished.

@liewegas liewegas deleted the liewegas:wip-backoff branch Feb 11, 2017
